You have this nice application running in your (cloud or on premises) environment and then a big scare hits. Suddenly you need to remove or mask different streams of data depending on all sorts of conditions your legal department is
Until your applications can do that natively, you might resort to a content filter that sits as a proxy between you and the application (technically it is a reverse proxy, but that's fine print).
- Needs to be a content filter, not a URL blocker
- Needs to provide practical functionality out of the box, but needs to be extensible (configuration over code)
- Filter based on mime-type and URL as standard, but extensible to use anything in the request or reply to decide what to filter
- Configurable FilterChain: a filter decides what to filter (with the mime-type as the minimum condition) and hands the actual filter operation to a chain of subfilters that do the stream manipulation
- Configurable subfilters. E.g. a filter that can remove JSON nodes from JSON data should read the qualifier from a configuration, so the same filter class can be reused for different filter purposes
- CSS isn't on the radar yet, but contributions would be happily accepted
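To make the filter and subfilter split more concrete, here is a minimal sketch in plain Java. It is not the project's actual API: the names `FilterChainSketch`, `Filter` and `mask`, the URL prefix condition and the treatment of the body as an already collected string are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; none of these names come from the project itself. */
final class FilterChainSketch {

    /** A filter: matching conditions plus an ordered chain of subfilters. */
    static final class Filter {
        private final String mimeType;
        private final String urlPrefix;
        private final List<UnaryOperator<String>> subfilters;

        Filter(String mimeType, String urlPrefix, List<UnaryOperator<String>> subfilters) {
            this.mimeType = mimeType;
            this.urlPrefix = urlPrefix;
            this.subfilters = subfilters;
        }

        /** The mime-type is the minimum condition; the URL prefix is an extra one. */
        boolean applies(String responseMimeType, String url) {
            return responseMimeType.startsWith(mimeType) && url.startsWith(urlPrefix);
        }

        /** Passes the body through untouched unless the filter applies. */
        String apply(String responseMimeType, String url, String body) {
            if (!applies(responseMimeType, url)) {
                return body;
            }
            String result = body;
            for (UnaryOperator<String> subfilter : subfilters) {
                result = subfilter.apply(result);
            }
            return result;
        }
    }

    /** Subfilter factory: the word to mask is a parameter, as it would be if read from configuration. */
    static UnaryOperator<String> mask(String word) {
        return body -> body.replace(word, "*".repeat(word.length()));
    }

    public static void main(String[] args) {
        Filter htmlFilter = new Filter("text/html", "/app/",
                List.of(mask("secret"), mask("internal")));
        // Matching mime-type and URL: both subfilters run in order.
        System.out.println(htmlFilter.apply("text/html; charset=utf-8", "/app/page", "a secret internal note"));
        // Different mime-type: the body is passed through unchanged.
        System.out.println(htmlFilter.apply("application/json", "/app/data", "a secret"));
    }
}
```

The same `mask` factory yields two differently configured subfilters, which is the reuse the requirement list asks for.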
There are always a few lessons to be had. Here are some from this project:
- HTTP is a chunked beast. When you send larger amounts of content, the probability approaches 1 that your server uses chunked transfer encoding (until HTTP/2 relieves us of it). A hard choice needs to be made: either process each chunk as a stream (think SAX) or collect the chunks to be able to process a DOM. To be fully flexible I opted for a DOM/Object based approach, but you are free to create whatever you deem necessary
- Jsoup is a reliable HTML parser. It supports CSS selectors that make addressing HTML elements a breeze. Solves one of the hardest problems: targeting
- Targeting JSON data is much harder than it needs to be, the very moment arrays appear in your JSON structure. There is RFC 6901 JSON Pointer, but it targets exactly one element, while a typical use case would be: "from the list (array) of discussion posts, pick the list of comments and those who have an eMail, mask them". So I implemented 2 variations: a simple path style address /discussion/posts/comments/email, which automatically traverses arrays, and an XPath based approach where I convert JSON to a strict XML syntax and back. More detail here, examples in a future post
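The path style address can be illustrated with a small sketch. It is not the project's implementation: it assumes the JSON has already been parsed into nested `Map` and `List` objects by whatever JSON library you use, and the names `JsonPathMaskSketch`, `mask` and `sample` are made up for this example. The key idea is that an array does not consume a path segment, so every element is visited with the same remaining path.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a path address that traverses arrays transparently. */
final class JsonPathMaskSketch {

    /** Replaces every value reachable via a path such as /discussion/posts/comments/email. */
    static void mask(Object root, String path, String replacement) {
        String[] segments = path.replaceFirst("^/", "").split("/");
        maskAt(root, segments, 0, replacement);
    }

    @SuppressWarnings("unchecked")
    private static void maskAt(Object node, String[] segments, int index, String replacement) {
        if (node instanceof List) {
            // Arrays do not consume a segment: apply the same segment to each element.
            for (Object item : (List<Object>) node) {
                maskAt(item, segments, index, replacement);
            }
        } else if (node instanceof Map) {
            Map<String, Object> map = (Map<String, Object>) node;
            String key = segments[index];
            if (!map.containsKey(key)) {
                return; // e.g. a comment without an email: nothing to mask
            }
            if (index == segments.length - 1) {
                map.put(key, replacement);
            } else {
                maskAt(map.get(key), segments, index + 1, replacement);
            }
        }
        // Scalars in the middle of a path are ignored.
    }

    /** Builds a small discussion tree: one post with two comments, only one with an email. */
    static Map<String, Object> sample() {
        Map<String, Object> withEmail = new LinkedHashMap<>();
        withEmail.put("author", "Ann");
        withEmail.put("email", "ann@example.com");
        Map<String, Object> withoutEmail = new LinkedHashMap<>();
        withoutEmail.put("author", "Bob");
        Map<String, Object> post = new LinkedHashMap<>();
        post.put("title", "Hello");
        post.put("comments", List.of(withEmail, withoutEmail));
        Map<String, Object> discussion = new LinkedHashMap<>();
        discussion.put("posts", List.of(post));
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("discussion", discussion);
        return root;
    }

    public static void main(String[] args) {
        Map<String, Object> root = sample();
        mask(root, "/discussion/posts/comments/email", "***");
        System.out.println(root);
    }
}
```

An RFC 6901 pointer would need an explicit index for each array (for example /discussion/posts/0/comments/0/email) and would therefore reach only one comment per pointer.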
- Better documentation
- Code cleanup
- Deploy to Heroku button
- More filters
Go check it out and let me know what you think! (Yeah - documentation needs some work).
Caveat (a.k.a. disclaimer): this is a prototype and work in progress, YMMV!