Filtering the Web using WebFilter

This document describes the WebFilter (formerly known as NoShit) extension to Cern's httpd web server which allows you to filter out annoying parts of web pages that you visit often.

Why to use WebFilter
How it works
How to download and install it
How to configure it
The copyright
Other implementations of the same idea
The future
What about feedback

Why to use WebFilter

You have probably noticed how many popular web sites that offer cool stuff sooner or later inevitably turn to advertising. They are very welcome to do that, of course, except if they try to place their shit on my computer screen. Instead of placing the ads on a separate page and linking to it as "A word from our sponsors" or "Advertisings", the ads are usually gifs that I'm forced to download because they appear in the middle of the information. However, I don't recall having rented out any of my time, bandwidth, screen real estate or brain capacity to anyone; so I decided to do something about these ads and filter them out of the web: that's what WebFilter does.

Of course, you can filter other things out of the web as well, not just ads. (Examples include: removing annoying big graphics or indecent language.) Not only can you remove stuff with WebFilter, you can change your perspective of the web in any way you like. (Examples include: adding annoying big graphics or indecent language.)

It is inevitable that internet usage will soon be billed by the byte; it is equally inevitable that WebFilter will then become the ultimate killer app. Animated gifs also help a lot, of course. :-)

(Back to the Table of Contents)

How WebFilter works

WebFilter is a patch to Cern's httpd web server. This server can act as a proxy, which means that your web browser (e.g. Mosaic, Netscape or Lynx), when asked to bring up a certain web page, doesn't contact the remote web server directly, but queries the proxy server (which runs on your own computer, or nearby) instead; this proxy then turns around, fetches the page from the remote web server and forwards it back to the browser. This is commonly used in order to hop over a security firewall or to implement caching of web pages, but it can be used to filter the web as well.

The WebFilter extension to httpd allows you to provide the proxy server with a list of URL templates and corresponding filter scripts. Whenever the proxy is asked to fetch a web page whose address matches one or more of the URL templates, it pipes the page through the corresponding filter scripts and presents the result to the browser. The advantage of this approach is that it works transparently with every browser, and that you have extreme flexibility in what you can do to a page, since the filter script can be any program whatsoever (in most cases, though, a sed, awk or perl script will do).
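
To give an idea of what such a filter script can look like, here is a minimal sketch in sh and sed. Everything specific in it is an assumption made for illustration: the function name strip_ad_images and the idea that the offending pictures live under a path containing /ads/. It only catches tags that fit on one line, and it passes everything else, including the MIME header described in the Filter section, through unchanged.

```shell
# Hypothetical filter: delete <img ...> tags whose attributes mention /ads/.
# Reads the page on standard input, writes the result to standard output.
strip_ad_images() {
  sed 's#<[Ii][Mm][Gg][^>]*/ads/[^>]*>##g'
}

# Try it on one line of HTML:
printf '%s\n' '<p>Hi <IMG SRC="/ads/banner.gif"> there <img src="/pics/logo.gif"></p>' |
  strip_ad_images
```

Only the first image is removed; the logo survives because its tag contains no /ads/.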

The idea is to run your own personalized WebFilter proxy server, tell your browser about it, and off you go. The proxy does the filtering and the browser doesn't even know about it. That's why this approach works with every browser.

The disadvantage is that you need to write filter scripts for all your favorite web pages that you want to change. Two reasons why this isn't as bad as it sounds:

Many sites with ads consist of a large collection of very similar, more or less static pages; think of Lycos or HotWired. Only one or two templates will take care of all of their pages, once and for all (well, at least until they change their layout...)
Many ad-filtering tasks can be done without even writing a filter script: simply by redirecting classes of URLs of offensive gifs to a small local picture.

I have created a filter script library with filters for many of the most annoying sites. People can grab filters from there or contribute their own. The library is ready to be read upon start-up by WebFilter. Unfortunately, it is outdated now. Any takers?

WebFilter should run on any platform that Cern httpd runs on, i.e. at least every Unix dialect. If you run something different, simply upgrade to Linux or FreeBSD. Sad but true: the web browser does not have to run on Unix, as long as it has a network connection to a Unix machine running the WebFilter proxy. That Unix machine would then have to be specified as proxy host to the browser. This is explained in detail later.

Because the filter scripts that are applied to a page depend solely on the URL used to access that page and since httpd is able to remap URLs, you can invent your own specialized form of URL notation, for example: http://somewhere.or.other/dir/file.html will be presented without the ads, http://somewhere.or.other/dir/file.html|shit will be presented with ads included, http://somewhere.or.other/dir/file.html|html will remove all Netscape extensions from the file, or http://somewhere.or.other/dir/file.html|noimage presents the page without the images.

These specialized URLs can only be used by people who use your WebFilter proxy, so they aren't suitable for being put on the web, but they are nice to fetch a given page in a certain format; you can also put them into your hotlist.

WebFilter can also act as a regular web server; the filtering support is then useful to decompress local documents on the fly or to add a common trailer to all local HTML pages that you serve to the outside. To make writing the configuration file more flexible and convenient, Cern's simple URL templates can optionally be replaced by fully featured extended regular expressions as used by GNU grep -E.

How to download and install WebFilter

These are the instructions for using WebFilter as a personal filtering proxy server. I assume you have an account on a Unix machine from which you usually browse the web and you want to run the proxy on that same host.

(It is also possible to use WebFilter as a site-wide or even network-wide filtering proxy server or as a server for local documents to the outside world, or all at once; however, you are on your own there. Start with the httpd documentation and read the description of the WebFilter commands below and then use your imagination.)

Cd to some temporary directory
```
mkdir /tmp/webfilter; cd /tmp/webfilter
```
and fetch the Cern httpd source from ftp://ftp.w3.org/pub/httpd/w3c-httpd-3.0A.tar.gz and the WebFilter patches from http://visar.csustan.edu:8000/noshit/WebFilter_0.5.patch.gz.
Now unpack the httpd archive:
```
zcat w3c-httpd-3.0A.tar.gz | tar xvf -
```
and apply the WebFilter patches:
```
zcat WebFilter_0.5.patch.gz | patch -s -p0
```
You shouldn't get any error messages here. If you do, you've probably forgotten the "-p" option for patch, or Netscape has already uncompressed the patch, in which case you need to say:
```
cat WebFilter_0.5.patch | patch -s -p0
```
In case you currently use Socks to jump over a network security firewall, you'll have to compile WebFilter with socks support, which is explained in WWW/README-SOCKS. If you don't know what I'm talking about, relax.
Type
```
make
```
in order to start the compilation. If you get any errors during compilation, consult the README file in the directory WWW. If you suspect that the errors are due to the WebFilter patch, first try to build httpd from a fresh copy of the sources without applying the WebFilter patches.
The patched httpd is Daemon/MACH/httpd3.0A+WebFilter_0.5 where MACH is your machine type. I move it to my directory of executables and call it simply webfilter:
```
mv Daemon/linux/httpd3.0A+WebFilter_0.5 ~/bin/webfilter
```
WebFilter needs a directory where it finds its configuration file and stores error and log files. I use ~/.webfilter:
```
mkdir ~/.webfilter
```
Four sample config files are provided in server-root/config. You want to use webfilter.conf:
```
cp server-root/config/webfilter.conf ~/.webfilter/conf
```
You'll probably want to change a couple of things in that file (at the very least the value of the directive ServerRoot, which should be the absolute path of your directory ~/.webfilter). It's not difficult because the file comes with lots of comments.
If you want to use my (rather outdated) library of filterscripts that remove ads from many popular sites, copy it in place:
```
cp server-root/config/library.txt ~/.webfilter
```
You can always fetch the latest version of the filterscript library from http://math-www.uni-paderborn.de/~axel/NoShit/library.txt.
If you start WebFilter as root, then it will switch to user nobody, so make sure that the user nobody can read and write ~/.webfilter.

You can now start WebFilter with the line

webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf

To automatically start up your personal WebFilter proxy at login time, put the following into your ~/.login (I'm assuming that you use a csh-like login shell):

# Kill old webfilter, if any (~/.webfilter/httpd-pid contains the current
# server process's id):
if ( -r ~/.webfilter/httpd-pid ) then
  kill -9 `cat ~/.webfilter/httpd-pid`
endif
# Start new one (the -r option specifies the config files to read in; you
# need to give absolute pathnames here, tilde notation is fine).
webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf
# Tell browsers about it. 7450 is the number specified with the
# Port directive in ~/.webfilter/conf. Not all browsers use the
# value of the variable http_proxy though. We'll have to tell
# them separately.
setenv http_proxy "http://localhost:7450/"

Of course, we want to kill webfilter when we log out, so the following goes into ~/.logout:

# Kill proxy webfilter process:
if ( -r ~/.webfilter/httpd-pid ) then
  kill -9 `cat ~/.webfilter/httpd-pid`
  rm -f ~/.webfilter/httpd-pid
endif
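
If your login shell is a Bourne-type shell (sh, ksh, bash) rather than csh, the same start and stop logic might look roughly as follows in ~/.profile. This is only a sketch: the function names and the variable WEBFILTER_DIR are made up here, while the pid file name, the port 7450 and the webfilter options are the ones used above.

```shell
# Hypothetical sh counterpart of the csh snippets above.
# WEBFILTER_DIR is an invented name for the server root directory.
WEBFILTER_DIR=${WEBFILTER_DIR:-$HOME/.webfilter}

# Kill the running proxy, if any, and remove its pid file.
stop_webfilter() {
  if [ -r "$WEBFILTER_DIR/httpd-pid" ]; then
    kill -9 "$(cat "$WEBFILTER_DIR/httpd-pid")" 2>/dev/null || true
    rm -f "$WEBFILTER_DIR/httpd-pid"
  fi
}

# Replace any old proxy by a new one and tell browsers about it.
start_webfilter() {
  stop_webfilter
  webfilter -r "$WEBFILTER_DIR/library.txt" -r "$WEBFILTER_DIR/conf"
  http_proxy="http://localhost:7450/"
  export http_proxy
}

# In ~/.profile you would then call, for example:
#   start_webfilter
#   trap stop_webfilter 0     # stop the proxy when the login shell exits
```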

Another possibility is to start the proxy only when you start your web browser, which can be accomplished with an alias in your ~/.cshrc file, for example if you use Mosaic:

alias Mosaic '(webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf; \
               \Mosaic; kill -9 `cat ~/.webfilter/httpd-pid`)'

Some browsers ignore the value of the environment variable http_proxy, and we have to tell them separately about the proxy.

Mosaic on Unix
Go to File->Proxy List->Add and fill in http for Scheme, localhost for Proxy Address and 7450 for Proxy Port. Then hit Commit and Save.
Netscape 1.2
Go to Options->Preferences->Proxies and enter localhost as the http proxy and 7450 as the port.
Netscape 2.0
Go to Options->Network Preferences->Proxies, choose Manual configuration, click on view and enter localhost as the http proxy and 7450 as the port.
WinWeb
Go to Options->Proxy Server... and enter localhost:7450 after http:.

Please let me know about the procedure for other browsers that don't use the http_proxy standard.

It is conceivable that certain sites will disallow access from WebFilter proxies. If this happens to you, just change the text in WWW/Daemon/Implementation/Version.make to
```
VD = 3.0A
```
and recompile with make. WebFilter will then look like an ordinary Cern httpd proxy to the outside world.

How to configure WebFilter

The WebFilter extensions to Cern's httpd consist of three new configuration directives to be used in the configuration file ~/.webfilter/conf. They are explained here. For the other directives, consult the httpd online documentation. You need to be familiar with httpd's configuration before proceeding.

TestingFilters

The TestingFilters directive takes one argument, "On" or "Off" (without the quotes). It defaults to "Off". If "On", then webfilter will reload its configuration file prior to serving each request and, moreover, won't honor requests of the form "if-modified-since" but will send all documents unconditionally.

This is useful for interactively changing the configuration file, especially Filter directives: You can change something in the config file or filter scripts, save it and then reload a page from your browser. The effects will be visible immediately -- no need to send a signal to the proxy server or to flush the client's cache. If you operate like that, you shouldn't be too surprised by an occasional proxy crash though; it happens if you write a new config file to disk while the server is reading it.

Be aware, however, that if TestingFilters was "Off" when the server process was started and you switch it on, the server won't notice it until you force it to reload the config file with

  kill -1 `cat ~/.webfilter/httpd-pid`

When writing or changing filter scripts, you will want to check the proxy's error logs in ~/.webfilter for hints in case something unexpected happens.

In ordinary use, it is imperative to switch off the directive TestingFilters, because it slows down the proxy's operation considerably.

RegExp

All the rules in the WebFilter config file ~/.webfilter/conf that appear between the lines

RegExp On

and

RegExp Off

use fully featured extended regular expressions as URL templates. For the syntax of extended regular expressions refer to the GNU grep man page. Regular expressions in the WebFilter config file are implicitly anchored at the start and end of the string, meaning that you don't have to start them with ^ or end them with $. Note that the characters \|[](){}.?+*^$ have special meanings in extended regular expressions; if you want to use them literally, escape them with a preceding backslash.
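
The effect of the implicit anchoring and of escaping can be tried out with grep -E on the command line. The following is a sketch under assumptions: the helper name matches_whole is made up, and the explicit ^( )$ merely imitates what WebFilter does by itself; grep may also differ from WebFilter's regular expression library in corner cases.

```shell
# Hypothetical helper: succeed if URL $2 matches extended regular expression $1
# from start to end. grep does not anchor by itself, so ^( )$ is added here.
matches_whole() {
  printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

matches_whole 'http://www\.weird\.com/.*' 'http://www.weird.com/x.html' && echo "whole URL matches"
matches_whole 'http://www\.weird\.com'    'http://www.weird.com/x.html' || echo "a matching prefix is not enough"
matches_whole 'http://www.weird.com/.*'   'http://wwwXweird.com/x.html' && echo "an unescaped . matches any character"
```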

If RegExp is on, then a special notation is used for replacement patterns (i.e. for the second arguments of Map, Redirect, Exec and Pass): you can refer back to the i-th parenthesized subexpression of the matching pattern with the notation \i where i is a digit. \0 refers to the full matching string. All non-backslashes and all backslash-escaped non-digits stand for themselves in a replacement pattern.

As an example, the following rule switches the first two path components of URLs belonging to the host www.weird.com and then redirects them to new.weird.com (assuming that RegExp has been switched on before):

Redirect http://www\.weird\.com/([^/]*)/([^/]*)(/.*)?  http://new.weird.com/\2/\1\3
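
To see what this rule does to a given URL, the same pattern and replacement can be tried out in sed, whose -E option also uses extended regular expressions and the \1, \2, \3 notation. This is a sketch for experimenting only: the function name rewrite_weird is made up, and the ^ and $ are written out because sed, unlike WebFilter, does not anchor implicitly.

```shell
# Hypothetical helper: imitate the Redirect rule above with sed.
# \1 and \2 are the first two path components, \3 is the (optional) rest.
rewrite_weird() {
  printf '%s\n' "$1" |
    sed -E 's#^http://www\.weird\.com/([^/]*)/([^/]*)(/.*)?$#http://new.weird.com/\2/\1\3#'
}

rewrite_weird 'http://www.weird.com/a/b/c.html'   # prints http://new.weird.com/b/a/c.html
```

URLs that don't match the pattern come out unchanged.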

You shouldn't use regular expression matching unless you really need it.

Filter

The Filter directive specifies which filter scripts to apply to which URLs. It is the main workhorse of WebFilter. It takes two arguments, a URL template and a filter script.

The URL template specifies those URLs that should be piped through the filter script. It can either be an extended regular expression (if RegExp has been switched on before) or contain one or more '*'-characters, each of which is a placeholder for an arbitrary sequence of characters.

If a given URL matches the template of a Filter directive, and then later is mapped to a different URL using the Map directive, and matches another Filter line's template further down, then the document will be piped through both filter scripts; first through the first, then through the second.

The second argument to Filter, the filter script, is a string that will be passed as argument to

/bin/sh -c

This argument can contain spaces and backslash-escaped newlines; it ends when the first unescaped newline is encountered. Escaped newlines become literal newlines in the string passed to the shell -- be aware that sometimes the shell actually needs an escaped newline itself, which you can get by preceding the newline with two backslashes in the WebFilter config file. Every backslash that's not followed by a newline is copied literally to the command string.

Since you can't always be sure what the value of the PATH environment variable and the current directory was when webfilter was started, it is best to use absolute pathnames for all programs (you can't use "~" for your home directory either, since some brain dead non-GNU shells don't know about it. "$HOME" is ok, though.)

Because the second argument is processed by sh -c, it can contain pipes, command sequences, file redirections and references to environment variables. If you don't need all this and just want to start a single program with a couple of arguments, it's best to precede this program's name with exec to speed up matters.

There's one problem with filterscripts that need an '-enclosed argument, and this argument itself needs to contain an '. This situation appears for example if you specify a filter script as a perl argument like so

Filter http://some.host.com/*    exec perl -e 'several;perl;commands;'

since ' is frequently used in perl commands. As a consequence of the shell's command line parsing, the only way to specify an ' as part of the perl commands is to write it as '\''.
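
The same quoting trick can be tried out in any shell, independently of perl. In this sketch (the function name quotes_to_double is made up) the sed program is meant to be s/'/"/g, and the ' inside it has to be written as '\'':

```shell
# Hypothetical filter: replace every ' in the input by ".
# '\'' closes the quoted string, adds one escaped ', and reopens the string,
# so sed receives the program s/'/"/g.
quotes_to_double() {
  sed -e 's/'\''/"/g'
}

printf '%s\n' "it's" | quotes_to_double   # prints it"s
```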

In addition to the document's HTML text, your filter script will probably see its MIME header on standard input, because it's part of the HTTP protocol, version 1.0 and later (if either your browser or the server does not support HTTP/1.0, which is unlikely, no MIME header is sent). MIME headers typically look like this:

     HTTP/1.0 200 Document follows
     MIME-Version: 1.0
     Server: CERN/3.0
     Date: Wednesday, 30-Aug-95 20:49:59 GMT
     Content-Type: text/html
     Content-Length: 716
     Last-Modified: Saturday, 26-Aug-95 18:38:30 GMT

Every line is ended by a carriage return + line feed (Ctrl-M Ctrl-J), and the whole header is ended by an empty line. Usually, your script can ignore this header, but sometimes you want to look at its first line: if anything but a 200 appears there, then you're not getting the document you requested, but an error message or an authorization request of some kind. In this case, you probably won't want to do any filtering. Another case is if you want to prevent certain servers from placing cookies on your computer. Cookies are sent in the MIME header in a line starting with "Set-Cookie:", so it's trivial to filter them out. Note however that malignant sites can still set cookies using JavaScript; to prevent this, make your cookies file (somewhere in your browser's directory) non-writeable.
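
As a minimal sketch of how a script can treat the header separately from the document, the following sh function (the name drop_cookies is made up here) uses awk to look at the status line and, only if the status is 200, drops the Set-Cookie lines from the header. The body is passed on unchanged. The sketch is meant for text documents and assumes that the input starts with a header as shown above.

```shell
# Hypothetical filter: remove Set-Cookie lines from the MIME header of
# responses with status 200; everything else is copied unchanged.
drop_cookies() {
  awk '
    NR == 1 { filtering = ($2 == "200"); in_header = 1 }   # status line
    in_header && /^\r?$/ { in_header = 0 }                  # empty line ends the header
    filtering && in_header && tolower($0) ~ /^set-cookie:/ { next }
    { print }
  '
}

# Try it on a small made-up response:
printf 'HTTP/1.0 200 Document follows\r\nSet-Cookie: id=42\r\nContent-Type: text/html\r\n\r\n<html>hello</html>\r\n' |
  drop_cookies
```

A line starting with Set-Cookie: inside the document itself is left alone, because the script stops looking once it has seen the empty line.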

To find out what precisely your script sees on standard input, you can start out with filtering the document through the script

tee outfile

and examine outfile afterwards.

Make sure that your patterns are not too broad: you definitely don't want to pipe .gif files through your filter scripts!

If you want to block access to a particular set of URLs altogether, use Fail or Redirect and not Filter. This is much faster. It is especially suitable for shutting up the recently established junk companies that place their ad banners on other people's sites and use cookies to track virtually everybody's browsing habits. A single Fail or Redirect pattern will take care of them. Experience shows that most ad filtering can already be accomplished with a couple of good Redirect commands: for example, redirect all gifs from Yahoo to a small local gif saying "deleted". Use Filter only if you have to: it's more flexible, but much more complicated to use and slower than the alternatives.

Examples of filtering schemes

Here I explain how you can have several filtering schemes implemented on your proxy, and the client chooses which one to use. For examples of actually useful filter scripts, check out the filter script library.

Make sure to read the httpd documentation about the directives Pass and Map and the address mapping algorithm in general if you want to understand the following examples.

The first is the good old simple minded fascist approach: censor the shit, no questions asked. Don't let the client see the uncensored version (unless they are very smart, that is).

Filter  http://host.with.ads.1/*.html          exec ~/.webfilter/script/for/host/1
Filter  http://host.with.ads.2/some/dir/*.html exec ~/.webfilter/script/for/host/2
Pass http://*

This will let everything go through unfiltered except those specified URLs on host.with.ads.1 and host.with.ads.2; they will be filtered through the specified scripts.

Here's a more sophisticated example: we allow clients to see the unfiltered version if they specify the URL in one of the following formats:

  http://host.with.ads.1/some/page.html|shit
  http://host.with.ads.1|shit/some/page.html

The first format is good if only that single page should be unfiltered; with the second format, all partial references on page.html will also appear unfiltered if selected.

RegExp   On
Pass http://([^/]*)\|shit/(.*)        http://\1/\2
Pass http://(.*)\|shit                http://\1
RegExp   Off
Filter  http://host.with.ads.1/*.html          exec ~/.webfilter/script/for/host/1
Filter  http://host.with.ads.2/some/dir/*.html exec ~/.webfilter/script/for/host/2
Pass http://*

In the final example, we implement two filtering schemes, which can even be combined: adding |noshit to a URL will filter out ads, adding |nofilth removes indecent language, adding |noshit|nofilth does both and the unaltered URL gives the document unaltered. Again, we allow for the extension to be added after the hostname or after the document.


# First, move the extensions to the back:
RegExp on
Map http://([^/]*)\|noshit\|nofilth/(.*)  http://\1/\2|noshit|nofilth
Map http://([^/]*)\|nofilth/(.*)          http://\1/\2|nofilth
Map http://([^/]*)\|noshit/(.*)           http://\1/\2|noshit
RegExp off

# Now the nofilth filtering:
Filter http://host.number.1/*|nofilth          exec ~/.noshit/nofilth/host/1
Filter http://host.number.2/some/dir/*|nofilth exec ~/.noshit/nofilth/host/2

# Remove |nofilth extension:
Map *|nofilth *

# Now the noshit filtering:
Filter http://host.number.1/*|noshit           exec ~/.noshit/script/for/host/1
Filter http://host.number.2/some/dir/*|noshit  exec ~/.noshit/script/for/host/2

# Remove |noshit extension:
Map *|noshit *

# Pass everything through:
Pass http://*

Note that you can use these extended URLs in your hotlist or type them directly into your browser, but you can't put them as links on HTML pages that might be read by browsers not using your proxy.

The Copyright

WebFilter is released under the GNU General Public License, our beloved little copyleft virus that infects everything it touches. Essentially this means that you can do with the WebFilter patches whatever you want unless you try to restrict someone else's right to do whatever they want.

WebFilter includes a slightly modified version of Tom Lord's regular expression library rx, which is also covered by the GPL.

Other Implementations of the same Idea

WebFilter (then called NoShit) was first announced to the world on September 28, 1995. Since then, several programs implementing variations of the idea of filtering WWW proxies have appeared.

There was at some point a commercial filtering program around; it was implemented as a Netscape plug-in for Windows 95 and Windows NT machines and was called Internet Fast Forward. The company, PrivNet, was bought by PGP.com and apparently Internet Fast Forward died in the process.

Two implementations of the WebFilter idea for Unix systems have been presented at WWW conferences; they both appear to be smaller, cleaner and more powerful than WebFilter. I have not tried them though, and they don't seem to be actively supported anymore. The first is OreO from the OSF and the second is V6 from Inria.

The Internet Junkbuster is a lean and mean proxy that is specifically designed to block advertising banners (specified by URL regular expression matching) and cookies. Less flexible than WebFilter, but much smaller, faster and easier to use. It gets the job done remarkably well. Runs on Unix, Windows NT and Windows 95 and is free, including the source code. I use Junkbuster myself these days. Here's the blockfile I use to get rid of advertising banners.

A similar free Linux and Windows based product is AdBuster.

Adkiller is a similar free product which runs on Macs, Unix and Windows.

WebWasher, written by Siemens, is a high quality ad filtering personal proxy, free for personal use. It removes ad banners based on size, gets rid of pop-up windows and stops animated graphics. It only works on Windows.

Proxomitron is a very flexible Windows based personal proxy which can alter HTML in many ways, for example removing ads, killing pop-up windows, changing backgrounds, killing background sounds etc.

Taz is a free patch for the popular squid proxy which offers the same functionality as junkbuster, but is faster. Squid and Taz run on Unix systems, but the proxy can be used from any browser.

Abiprox is a perl based personal filtering proxy; it is extremely powerful and customizable, but not so easy to use. A more mature continuation of that same theme is FilterProxy.

Muffin is a free filtering proxy for the web written in Java; runs on all platforms. Similar to junkbuster, but more flexible, portable and powerful. It supports several "filters", one of which can delete images based on their width/height ratio (banner ads) and another one allows modifying the incoming HTML stream using a simple language, allowing for stripping other ads. Highly recommended.

ByProxy is a more general solution: it can act as a proxy for every IP service (most important are email and WWW) and modifies the in- and outgoing traffic based on completely general plug-ins. It's free, written in Java and should hence be widely portable. There is a plug-in for removing ads from webpages, but I don't know how well it works.

There are also several commercial ad-filtering products around, but don't waste your money. I won't give them any free advertising here :-)

None of these solutions can remove the annoying top-level advertising windows coming from various "free-internet" or "money-for-surfing" companies. You can use the wonderful Windows shareware utility NookMe for this purpose.

The WAB project in Zürich uses a filtering proxy to prepare HTML pages for blind users.

Several video cassette recorders by RCS can skip recorded commercials during play-back. After recording a program, the recorder goes back over the tape and looks for gaps of the proper length between fades-to-black; these are then marked as commercials. During playback, it automatically fast-forwards through these commercials. It's said to be about 90% accurate. Video Magazine reviewed a unit and found it worked well.

TiVo and ReplayTV are two Linux based devices which record TV in real time on a hard drive, allowing for delayed playback, during which the viewer can skip ads using a "skip 30 seconds" or "fast forward" button. They still cannot remove ads automatically, though.

Please check out the excellent collections of junk filtering software links at JunkBusters and by Francis Irving.

The Future

looks dark. Prepare for mpeg movies on the web containing commercials, sound ads played while downloading a document, java animations with an ad in the middle, and lots of acrobat pdf files full of shit. Simplistic filters like WebFilter are powerless against all of these. I hope the AI people will get their act together in time.

The only idea I have right now is to promote a little widely recognized No-Ads gif that all ad-free sites could put on their home pages. Maybe you can design one?

I have already received two suggestions for a No-Ads logo to be put on ad-free web sites. Have a look and leave your comments.

What about feedback

I have set up some space on the web where you can leave your suggestions and comments regarding WebFilter or read other people's remarks and respond to them. Check it out!

There was an article about WebFilter (when it was still called NoShit) on WebWeek and another one on Suck.

Axel Boldt <axel@uni-paderborn.de>

Last changed: 30-Jul-2000