WebFilter (formerly known as NoShit) is an extension to CERN's httpd web server which allows you to filter out annoying parts of web pages that you visit often. The prime example of such annoyances is advertising, and removing ads is what WebFilter does.
Of course, you can filter other things out of the web as well, not just ads. (Examples include: removing annoying big graphics or indecent language.) Not only can you remove stuff with WebFilter, you can change your perspective of the web in any way you like. (Examples include: adding annoying big graphics or indecent language.)
It is inevitable that internet usage will soon be billed by the byte; it is equally inevitable that WebFilter will then become the ultimate killer app. Animated gifs also help a lot, of course. :-)
WebFilter is a patch to CERN's httpd web server. This server can act as a proxy, which means that your web browser (e.g. Mosaic, Netscape or Lynx), when asked to bring up a certain web page, doesn't contact the remote web server directly, but queries the proxy server (which runs on your own computer, or nearby) instead; the proxy then turns around, fetches the page from the remote web server and forwards it back to the browser. This setup is commonly used to hop over a security firewall or to implement caching of web pages, but it can be used to filter the web as well.
The WebFilter extension to httpd allows you to provide the proxy server with a list of URL templates and corresponding filter scripts. Whenever the proxy is asked to fetch a web page whose address matches one or more of the URL templates, it pipes the page through the corresponding filter scripts and presents the result to the browser. The advantage of this approach is that it works transparently with every browser, and that there is extreme flexibility in what you can do to a page, since a filter script can be any program whatsoever (in most cases, however, a sed, awk or perl script will do).
The idea is to run your own personalized WebFilter proxy server, tell your browser about it, and off you go. The proxy does the filtering and the browser doesn't even know about it. That's why this approach works with every browser.
The disadvantage is that you need to write filter scripts for all your favorite web pages that you want to change. Two reasons why this isn't as bad as it sounds: most filter scripts are simple sed, awk or perl one-liners, and there is a library of contributed filter scripts for WebFilter. Unfortunately, that library is outdated now. Any takers?
WebFilter should run on any platform that CERN httpd runs on, i.e. at least every Unix dialect. If you run something different, simply upgrade to Linux or FreeBSD; sad but true. The web browser, however, does not have to run on Unix, as long as it has a network connection to a Unix machine running the WebFilter proxy. That Unix machine would then have to be specified as the proxy host to the browser. This is explained in detail later.
Because the filter scripts that are applied to a page depend solely on the URL used to access that page, and since httpd is able to remap URLs, you can invent your own specialized form of URL notation. For example:

http://somewhere.or.other/dir/file.html
    will be presented without the ads,
http://somewhere.or.other/dir/file.html|shit
    will be presented with ads included,
http://somewhere.or.other/dir/file.html|html
    will have all Netscape extensions removed from the file, and
http://somewhere.or.other/dir/file.html|noimage
    will be presented without the images.
These specialized URLs can only be used by people who use your WebFilter proxy, so they aren't suitable for being put on the web, but they are nice to fetch a given page in a certain format; you can also put them into your hotlist.
WebFilter can also act as a regular web server; the filtering support is then useful to decompress local documents on the fly or to add a common trailer to all local HTML pages that you serve to the outside. To make writing the configuration file more flexible and convenient, CERN's simple URL templates can optionally be replaced by fully featured extended regular expressions as used by GNU grep -E.
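As a sketch of the trailer use case, a single Filter directive (explained below) suffices; the host name and trailer path here are made up, and the filter argument is handed to /bin/sh -c, so a command sequence works:

```
# Append a common trailer to every local HTML page served
# (host name and trailer path are hypothetical):
Filter http://your.host.edu/*.html cat; cat $HOME/trailer.html
```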
This section describes how to set up WebFilter as a personal filtering proxy server. I assume you have an account on a Unix machine from which you usually browse the web and you want to run the proxy on that same host.

(It is also possible to use WebFilter as a site-wide or even network-wide filtering proxy server, or as a server for local documents to the outside world, or all at once; however, you are on your own there. Start with the httpd documentation, read the description of the WebFilter commands below, and then use your imagination.)
Make a temporary directory:

mkdir /tmp/webfilter; cd /tmp/webfilter

and fetch the CERN httpd source from ftp://ftp.w3.org/pub/httpd/w3c-httpd-3.0A.tar.gz and the WebFilter patches from http://visar.csustan.edu:8000/noshit/WebFilter_0.5.patch.gz.

Unpack the httpd archive:

zcat w3c-httpd-3.0A.tar.gz | tar xvf -

and apply the WebFilter patches:

zcat WebFilter_0.5.patch.gz | patch -sp

You shouldn't get any error messages here. If you do, you've probably forgotten the "-p" option for patch, or Netscape has already uncompressed the patch, in which case you need to say:

cat WebFilter_0.5.patch | patch -sp
If you need Socks to jump over a network security firewall, you'll have to compile WebFilter with socks support, which is explained in WWW/README-SOCKS. If you don't know what I'm talking about, relax.
Now type

make

in order to start the compilation. If you get any errors during compilation, consult the README file in the directory WWW. If you suspect that the errors are due to the WebFilter patch, first try building an unpatched httpd, i.e. try to compile httpd without applying the WebFilter patches.
The resulting httpd binary is Daemon/MACH/httpd3.0A+WebFilter_0.5, where MACH is your machine type. I move it to my directory of executables and call it simply webfilter:

mv Daemon/linux/httpd3.0A+WebFilter_0.5 ~/bin/webfilter
WebFilter needs a directory where it finds its configuration file and stores error and log files. I use ~/.webfilter:

mkdir ~/.webfilter

Four sample config files are provided in server-root/config. You want to use webfilter.conf:

cp server-root/config/webfilter.conf ~/.webfilter/conf

You'll probably want to change a couple of things in that file (at the very least the value of the directive ServerRoot, which should be the absolute path of your directory ~/.webfilter); it's not difficult because the file comes with lots of comments.

Also copy the filter script library:

cp server-root/config/library.txt ~/.webfilter

You can always fetch the latest version of the filterscript library from http://math-www.uni-paderborn.de/~axel/NoShit/library.txt.
If you start WebFilter as root, it will switch to the user nobody, so make sure that nobody can read and write ~/.webfilter.
You can now start WebFilter with the line:

webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf

To automatically start up your personal WebFilter proxy at login time, put the following into your ~/.login (I'm assuming that you use a csh-like login shell):

# Kill old webfilter, if any (~/.webfilter/httpd-pid contains the current
# server process's id):
if ( -r ~/.webfilter/httpd-pid ) then
  kill -9 `cat ~/.webfilter/httpd-pid`
endif
# Start new one (the -r option specifies the config files to read in; you
# need to give absolute pathnames here; tilde notation is fine).
webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf
# Tell browsers about it. 7450 is the number specified with the
# Port directive in ~/.webfilter/conf. Not all browsers use the
# value of the variable http_proxy though. We'll have to tell
# them separately.
setenv http_proxy "http://localhost:7450/"

Of course, we want to kill webfilter when we log out, so the following goes into ~/.logout:

# Kill proxy webfilter process:
if ( -r ~/.webfilter/httpd-pid ) then
  kill -9 `cat ~/.webfilter/httpd-pid`
  rm -f ~/.webfilter/httpd-pid
endif

Another possibility is to start the proxy only when you start your web browser, which can be accomplished with an alias in your ~/.cshrc file, for example if you use Mosaic:

alias Mosaic '(webfilter -r ~/.webfilter/library.txt -r ~/.webfilter/conf; \
\Mosaic; kill -9 `cat ~/.webfilter/httpd-pid`)'
Not all browsers use the value of the environment variable http_proxy, so we have to tell them separately about the proxy. In a browser with a proxy preferences dialog (such as Netscape or Mosaic), specify localhost as the http proxy and 7450 as the port; in a browser configured through a proxy file, put localhost:7450 after http:. Browsers such as Lynx honor the http_proxy standard and will pick the proxy up from the environment variable set in ~/.login.
Some web servers apparently refuse to talk to WebFilter proxies. If this happens to you, just change the version string in WWW/Daemon/Implementation/Version.make to

VD = 3.0A

and recompile with make. WebFilter will then look like an ordinary CERN httpd proxy to the outside world.
The WebFilter extensions to CERN's httpd consist of three new configuration directives to be used in the configuration file ~/.webfilter/conf. They are explained here. For the other directives, consult the httpd online documentation. You need to be familiar with httpd's configuration before proceeding.
The TestingFilters directive takes one argument, "On" or "Off" (without the quotes). It defaults to "Off". If "On", then webfilter will reload its configuration file prior to serving each request and, moreover, won't honor requests of the form "if-modified-since" but will send all documents unconditionally. This is useful for interactively changing the configuration file, especially Filter directives: you can change something in the config file or filter scripts, save it, and then reload a page from your browser. The effects will be visible immediately -- no need to send a signal to the proxy server or to flush the client's cache. If you operate like that, you shouldn't be too surprised by an occasional proxy crash though; it happens if you write a new config file to disk while the server is reading it.

Be aware, however, that if TestingFilters was "Off" when the server process was started and you switch it on, the server won't notice until you force it to reload the config file with

kill -1 `cat ~/.webfilter/httpd-pid`

When writing or changing filter scripts, you will want to check the proxy's error logs in ~/.webfilter for hints in case something unexpected happens.
In ordinary use, it is imperative to switch off the directive
TestingFilters
, because it slows down the proxy's
operation considerably.
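During filter development, the relevant fragment of ~/.webfilter/conf might look like this (the ServerRoot path is a placeholder for your own absolute path; Port 7450 matches the number used in the startup scripts above):

```
ServerRoot /home/you/.webfilter
Port 7450
TestingFilters On
```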
All URL templates in the WebFilter config file ~/.webfilter/conf that appear between the lines

RegExp On

and

RegExp Off

use fully featured extended regular expressions as URL templates. For the syntax of extended regular expressions, refer to the GNU grep man page. Regular expressions in the WebFilter config file are implicitly anchored at the start and end of the string, meaning that you don't have to start them with ^ or end them with $. Note that the characters \|[](){}.?+*^$ have special meanings in extended regular expressions; if you want to use them literally, escape them with a preceding backslash.
If RegExp is on, then a special notation is used for replacement patterns (i.e. for the second arguments of Map, Redirect, Exec and Pass): you can refer back to the i-th parenthesized subexpression of the matching pattern with the notation \i, where i is a digit. \0 refers to the full matching string. All non-backslashes and all backslash-escaped non-digits stand for themselves in a replacement pattern.
As an example, the following rule swaps the first two path components of URLs belonging to the host www.weird.com and then redirects them to new.weird.com (assuming that RegExp has been switched on before):

Redirect http://www\.weird\.com/([^/]*)/([^/]*)(/.*)? http://new.weird.com/\2/\1\3

You shouldn't use regular expression matching unless you really need it.
The Filter directive specifies which filter scripts to apply to which URLs. It is the main workhorse of WebFilter. It takes two arguments, a URL template and a filter script.

The URL template specifies those URLs that should be piped through the filter script. It can either be an extended regular expression (if RegExp has been switched on before) or contain one or more '*' characters, each of which is a placeholder for an arbitrary sequence of characters.
If a given URL matches the template of a Filter
directive, and then later is mapped to a different URL using the
Map
directive, and matches another Filter
line's template further down, then the document will be piped through
both filter scripts; first through the first, then through the
second.
The second argument to Filter, the filter script, is a string that will be passed as argument to

/bin/sh -c

This argument can contain spaces and backslash-escaped newlines; it ends when the first unescaped newline is encountered. Escaped newlines become literal newlines in the string passed to the shell -- be aware that sometimes the shell actually needs an escaped newline itself, which you can get by preceding the newline with two backslashes in the WebFilter config file. Every backslash that's not followed by a newline is copied literally to the command string.
Since you can't always be sure what the value of the PATH environment variable and the current directory were when webfilter was started, it is best to use absolute pathnames for all programs. (You can't use "~" for your home directory either, since some brain-dead non-GNU shells don't know about it; "$HOME" is ok, though.)
Because the second argument is processed by sh -c, it can contain pipes, command sequences, file redirections and references to environment variables. If you don't need all this and just want to start a single program with a couple of arguments, it's best to precede the program's name with exec to speed up matters.
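Putting both rules together, a sketch of such a Filter line (the host and script name are made up for illustration):

```
# Absolute path via $HOME, single program started with exec:
Filter http://news.example.com/*.html exec $HOME/.webfilter/scripts/strip-ads
```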
There's one problem with filter scripts that need an '-enclosed argument when this argument itself needs to contain an '. This situation appears, for example, if you specify a filter script as a perl argument like so:

Filter http://some.host.com/* exec perl -e 'several;perl;commands;'

since ' is frequently used in perl commands. As a consequence of the shell's command line parsing, the only way to specify an ' as part of the perl commands is to write it as '\''.

In addition to the document's HTML text, your filter script will probably see its MIME header on standard input, because it's part of the http protocol version 1.0 and later (if either your browser or the server does not support HTTP/1.0, which is unlikely, no MIME header is sent). MIME headers typically look like this:
HTTP/1.0 200 Document follows
MIME-Version: 1.0
Server: CERN/3.0
Date: Wednesday, 30-Aug-95 20:49:59 GMT
Content-Type: text/html
Content-Length: 716
Last-Modified: Saturday, 26-Aug-95 18:38:30 GMT

Every line is ended by a carriage return + line feed (Ctrl-M Ctrl-J), and the whole header is ended by an empty line. Usually, your script can ignore this header, but sometimes you want to look at its first line: if anything but a 200 appears there, then you're not getting the document you requested, but an error message or an authorization request of some kind. In that case, you probably won't want to do any filtering. Another case is if you want to prevent certain servers from placing cookies on your computer. Cookies are sent in the MIME header in a line starting with "Set-Cookie:", so it's trivial to filter them out. Note however that malicious sites can still set cookies using javascript; to prevent this, make your cookies file (somewhere in your browser's directory) non-writeable.
To find out what precisely your script sees on standard input, you can start out by filtering the document through the script

tee outfile

and examine outfile afterwards.
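To make this concrete, here is a minimal sketch of such a filter script; the name strip_imgs and the idea of deleting IMG tags are my own illustration, not part of WebFilter. It copies the MIME header through untouched and then removes IMG tags from the HTML body:

```shell
# strip_imgs: a sketch of a WebFilter filter, written as a shell function
# so it can be piped into. It copies the MIME header (everything up to the
# first empty line) through unchanged, then deletes <img ...> tags from
# the HTML body.
strip_imgs() {
  awk '
    BEGIN { in_header = 1 }
    in_header {
      print
      # Header lines end with CR+LF, so awk sees a trailing \r; the
      # header itself ends at an empty line.
      if ($0 == "" || $0 == "\r") in_header = 0
      next
    }
    { gsub(/<[Ii][Mm][Gg][^>]*>/, ""); print }
  '
}
```

In practice you would save the awk invocation as a standalone script file and name its absolute path in a Filter directive.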
Make sure that your patterns are not too broad: you definitely don't
want to pipe .gif
files through your filter scripts!
If you want to block access to a particular set of URLs altogether,
use Fail
or Redirect
and not Filter
.
This is much faster. It is especially suitable for shutting up the
recently established junk companies that place their ad banners on other
people's sites and use cookies to track virtually everybody's
browsing habits. A single Fail
or Redirect
pattern will take care of them.
Experience shows that most ad filtering can already be accomplished
with a couple of good Redirect
commands: for example, redirect all gifs from
Yahoo to a small local gif saying "deleted". Use Filter
only if you have to: it's more flexible, but
much more complicated to use and slower
than the alternatives.
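A sketch of this advice, with made-up host names (the local "deleted" gif is assumed to be served by the proxy itself):

```
# Block a junk-banner host outright:
Fail http://ads.junkco.com/*
# Swap known banner gifs for a small local placeholder:
Redirect http://www.yahoo.com/ads/* http://localhost:7450/deleted.gif
```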
Make sure to read the httpd documentation about the
directives Pass
and Map
and the address
mapping algorithm in general if you want to
understand the following examples.
The first example is the good old simple-minded fascist approach: censor the shit, no questions asked. Don't let the client see the uncensored version (unless they are very smart, that is).

Filter http://host.with.ads.1/*.html exec ~/.webfilter/script/for/host/1
Filter http://host.with.ads.2/some/dir/*.html exec ~/.webfilter/script/for/host/2
Pass http://*

This will let everything go through unfiltered except the specified URLs on host.with.ads.1 and host.with.ads.2; they will be filtered through the specified scripts.
Here's a more sophisticated example: we allow clients to see the unfiltered version if they specify the URL in one of the following formats:

http://host.with.ads.1/some/page.html|shit
http://host.with.ads.1|shit/some/page.html

The first format is good if only that single page should be unfiltered; with the second format, all partial references on page.html will also appear unfiltered if selected.

RegExp On
Pass http://([^/]*)\|shit/(.*) http://\1/\2
Pass http://(.*)\|shit http://\1
RegExp Off
Filter http://host.with.ads.1/*.html exec ~/.webfilter/script/for/host/1
Filter http://host.with.ads.2/some/dir/*.html exec ~/.webfilter/script/for/host/2
Pass http://*

In the final example, we implement two filtering schemes, which can even be combined: adding |noshit to a URL will filter out ads, adding |nofilth removes indecent language, adding |noshit|nofilth does both, and the unaltered URL gives the document unaltered. Again, we allow for the extension to be added after the hostname or after the document.

# First, move the extensions to the back:
RegExp on
Map http://([^/]*)\|noshit\|nofilth/(.*) http://\1/\2|noshit|nofilth
Map http://([^/]*)\|nofilth/(.*) http://\1/\2|nofilth
Map http://([^/]*)\|noshit/(.*) http://\1/\2|noshit
RegExp off
# Now the nofilth filtering:
Filter http://host.number.1/*|nofilth exec ~/.noshit/nofilth/host/1
Filter http://host.number.2/some/dir/*|nofilth exec ~/.noshit/nofilth/host/2
# Remove |nofilth extension:
Map *|nofilth *
# Now the noshit filtering:
Filter http://host.number.1/*|noshit exec ~/.noshit/script/for/host/1
Filter http://host.number.2/some/dir/*|noshit exec ~/.noshit/script/for/host/2
# Remove |noshit extension:
Map *|noshit *
# Pass everything through:
Pass http://*

Note that you can use these extended URLs in your hotlist or type them directly into your browser, but you can't put them as links on HTML pages that might be read by browsers not using your proxy.
WebFilter is released under the GNU General Public License, our beloved little copyleft virus that infects everything it touches. Essentially this means that you can do with the WebFilter patches whatever you want unless you try to restrict someone else's right to do whatever they want.

WebFilter includes a slightly modified version of Tom Lord's regular expression library rx, which is also covered by the GPL.
There was at some point a commercial filtering program around; it was implemented as a netscape plug-in for Windows 95 and Windows NT machines and was called Internet Fast Forward. The company, PrivNet, was bought by PGP.com and apparently Internet Fast Forward died in the process.
Two implementations of the WebFilter idea for Unix systems have been presented at WWW conferences; they both appear to be smaller, cleaner and more powerful than WebFilter. I have not tried them though, and they don't seem to be actively supported anymore. The first is OreO from the OSF and the second is V6 from Inria.
The Internet Junkbuster is a lean and mean proxy that is specifically designed to block advertising banners (specified by URL regular expression matching) and cookies. Less flexible than WebFilter, but much smaller, faster and easier to use. It gets the job done remarkably well. Runs on Unix, Windows NT and Windows 95 and is free, including the source code. I use Junkbuster myself these days. Here's the blockfile I use to get rid of advertising banners.
A similar free Linux and Windows based product is AdBuster.
Adkiller is a similar free product which runs on Macs, Unix and Windows.
WebWasher, written by Siemens, is a high quality ad filtering personal proxy, free for personal use. It removes ad banners based on size, gets rid of pop-up windows and stops animated graphics. It only works on Windows.
Proxomitron is a very flexible Windows based personal proxy which can alter HTML in many ways, for example removing ads, killing pop-up windows, changing backgrounds, killing background sounds, etc.
Taz is a free patch for the popular squid proxy which offers the same functionality as junkbuster, but is faster. Squid and Taz run on Unix systems, but the proxy can be used from any browser.
Abiprox is a perl based personal filtering proxy; it is extremely powerful and customizable, but not so easy to use. A more mature continuation of that same theme is FilterProxy.
Muffin is a free filtering proxy for the web written in Java; runs on all platforms. Similar to junkbuster, but more flexible, portable and powerful. It supports several "filters", one of which can delete images based on their width/height ratio (banner ads) and another one allows modifying the incoming HTML stream using a simple language, allowing for stripping other ads. Highly recommended.
ByProxy is a more general solution: it can act as a proxy for every IP service (most important are email and WWW) and modifies the in- and outgoing traffic based on completely general plug-ins. It's free, written in Java and should hence be widely portable. There is a plug-in for removing ads from webpages, but I don't know how well it works.
There are also several commercial ad-filtering products around, but don't waste your money. I won't give them any free advertising here :-)
None of these solutions can remove the annoying top-level advertising windows coming from various "free-internet" or "money-for-surfing" companies. You can use the wonderful Windows shareware utility NookMe for this purpose.
The WAB project in Zürich uses a filtering proxy to prepare HTML pages for blind users.
Several video cassette recorders by RCS can skip recorded commercials during play-back. After recording a program, it goes back over the tape and looks for gaps of the proper length between fades-to-black; these are then marked as commercials. During playback, it automatically fast-forwards through these commercials. It's said to be about 90% accurate. Video Magazine reviewed a unit and found it worked well.
TiVo and ReplayTV are two Linux based devices which record TV in real time on a hard drive, allowing for delayed playback, during which the viewer can skip ads using a "skip 30 seconds" or "fast forward" button. They still cannot remove ads automatically, though.
Please check out the excellent collections of junk filtering software links at JunkBusters and by Francis Irving.
Filtering tools like WebFilter are powerless against all of these. I hope the AI people will get their act together in time. The only idea I have right now is to promote a little widely recognized No-Ads gif that all ad-free sites could put on their home pages. Maybe you can design one?
I have already received two suggestions for a No-Ads logo to be put on ad-free web sites. Have a look and leave your comments.
You can leave your comments about WebFilter or read other people's remarks and respond to them. Check it out!

There was an article about WebFilter (when it was still called NoShit) on WebWeek, and another one on Suck.