Spam, or unsolicited commercial e-mail (UCE) as it is sometimes called, is the scourge of the Internet. Spam has increased relentlessly over the last ten years and now accounts for over half of all e-mail traffic. One in six consumers has acted on spam e-mails, so there is a strong business case for keeping spam out of your users’ inboxes. There is a variety of anti-spam solutions, ranging from outsourcing your spam filtering entirely to taking no action at all. However, if you have your own e-mail server, you can add spam filtering very easily.
SpamAssassin is a very popular open source anti-spam tool. It won a Linux New Media Award-2006 as the “Best Linux-based Anti-spam Solution”, and is considered by many to be the best free, open source, anti-spam tool, and better than many commercial products. In fact, several commercial products and services are based on SpamAssassin or previous versions of it.
Why filter e-mail
If you don’t receive any spam, there may be no need to filter spam. However, once one spam message has been received, it is invariably followed by many more. Spammers can sometimes detect if a spam e-mail is viewed, using techniques such as Web bugs, which are tiny images in HTML e-mails that are fetched from web servers, and then know that an e-mail address is valid and vulnerable. If spam is filtered, the initial e-mail may never get seen, and consequently the spammer may not then target the e-mail address with further spam.
Despite legal efforts against spam, it is actually on the increase. In Europe and the US, recent legislation against spam (Directive 2002/58/EC and bill number S.877, the CAN-SPAM Act, respectively) has had little effect in either region.
The main reason for this is that spam is a very good business model. It is very cheap to send spam, as little as one thousandth of a cent per e-mail, and it takes a very low hit rate before a profit is made. The spammer needs to turn just one spam in a hundred thousand or so into a sale to make a profit. As a result, there are many spammers and spam is used to promote a wide range of goods. Spamming costs are also kept negligible by malware that hijacks innocent users’ computers to send spam on the spammer’s behalf.
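The arithmetic behind these figures can be checked with a short sketch. The cost per e-mail and the one-in-a-hundred-thousand hit rate are the figures quoted above; the profit per sale of 20 dollars is purely a hypothetical value chosen for illustration.

```python
from fractions import Fraction

# Cost figure quoted in the text: one thousandth of a cent per e-mail.
COST_PER_EMAIL_CENTS = Fraction(1, 1000)


def campaign_cost_cents(emails_sent: int) -> Fraction:
    """Total sending cost, in cents, for a spam run of the given size."""
    return emails_sent * COST_PER_EMAIL_CENTS


def break_even_emails(profit_per_sale_cents: int) -> Fraction:
    """Number of e-mails that a single sale pays for at the quoted cost."""
    return profit_per_sale_cents / COST_PER_EMAIL_CENTS


# Sending 100,000 e-mails costs 100 cents, that is, one dollar.
print(campaign_cost_cents(100_000))
# A hypothetical profit of 20 dollars (2,000 cents) per sale pays for
# 2,000,000 e-mails.
print(break_even_emails(2_000))
```

In other words, at one sale per hundred thousand e-mails, any product that earns the spammer more than about a dollar per sale covers the cost of sending.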
In contrast, the costs of spam to the recipient are remarkably high. Estimates have varied, from 10 cents per spam received, through 1,000 dollars per employee per year, up to a total cost of 140 billion dollars globally in 2007 alone. This cost is mainly labor—distracting people from their work by clogging their inboxes and forcing them to deal with many extra e-mails. Spam interferes with day-to-day work and can include material that is offensive to most people. Companies have a duty to protect their employees from such content. Spam filtering is a very cheap way of minimizing the costs and protecting the workforce.
Spam is a moving target
Spam isn’t static. It changes on a day-to-day basis, as spammers add new methods to their arsenal and anti-spammers develop countermeasures. Due to this, the anti-spam tools that work best are those that are updated frequently. It’s a similar predicament to antivirus software—virus definitions need to be updated regularly or new viruses won’t be detected.
SpamAssassin is regularly updated. In addition to new releases of the software, there is a vigorous community creating, critiquing, and testing new anti-spam rules. These rules can be downloaded automatically for up-to-date protection against spam.
Let’s discuss some of the measures used by SpamAssassin to fight spam:
- Open relays: These are e-mail servers that allow spammers to send e-mails even though they are not connected to the owner of the server in any way. To counter this, the anti-spam community has developed blocklists, also known as blacklists, which can be used by anti-spam software to detect spam. Any e-mail that has passed through a server on a blocklist is treated more suspiciously than one that has not. SpamAssassin uses a number of blocklists to test e-mails.
- Keyword filters: These are useful tools against spam. Spammers tend to repeat the same words and phrases again and again. Rules to detect these phrases are used extensively by SpamAssassin. These make up the bulk of the tests, and the user community rules mentioned previously are normally of this form. They allow specific words, phrases, or sequences of letters, numbers, and punctuation to be detected.
- Blacklists and whitelists: These are used to list known senders of spam and sources of good e-mail respectively. E-mails from an address on a blacklist are probably spam and are treated accordingly, while e-mails from addresses on a whitelist will be less likely to be treated as spam. SpamAssassin allows the user to enter blacklists and whitelists manually, and also builds up an automatic whitelist and blacklist based on the e-mails that it processes.
- Statistical filters: These are automated systems that give the probability that an e-mail is spam. This filtration is based on what the filter has seen previously as both spam and non-spam. They generally work by finding words that are present in one type of e-mail but not the other, and using this knowledge to determine which type a new e-mail is. SpamAssassin has a statistical filter called the Bayesian filter that can be very effective in improving detection rates.
- Content databases: These are mass e-mail detection systems. Many e-mail servers submit details of the e-mails they receive to central servers. If the same e-mail is sent to thousands of recipients, it is probably spam. To avoid sending confidential e-mails to the central server, content databases use a technique called hashing, which also lowers the amount of data sent to the server. SpamAssassin can integrate with several content databases, notably Vipul’s Razor (http://razor.sourceforge.net), Pyzor (http://sourceforge.net/apps/trac/pyzor/), and the Distributed Checksum Clearinghouse, that is, DCC (http://www.rhyolite.com/dcc/).
- URL blocklists: These are similar to open relay blocklists, but list the websites used by spammers. Nearly all spam e-mails contain a web address. A database of these addresses is built so that spam e-mails can be quickly detected. This is a very efficient and effective tool against spam. By default, SpamAssassin uses Spam URI Realtime BlockLists (SURBLs), without any further configuration required.
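To make the idea of a statistical filter more concrete, here is a minimal sketch of a word-based Bayesian classifier. It is not SpamAssassin's implementation: the class name, the add-one smoothing, the assumption of equal prior probabilities, and the four training messages are all invented for illustration.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> set:
    """Lower-case the text and return the set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class TinyBayes:
    """Toy statistical filter; NOT how SpamAssassin implements its filter."""

    def __init__(self) -> None:
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text: str, label: str) -> None:
        """Record which words appeared in a message of the given type."""
        self.counts[label].update(tokens(text))
        self.messages[label] += 1

    def spam_probability(self, text: str) -> float:
        """Estimate the probability of spam, assuming equal priors."""
        log_odds = 0.0
        for word in tokens(text):
            # Add-one smoothing keeps unseen words from giving zero.
            p_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
bayes.learn("cheap pills buy now", "spam")
bayes.learn("buy cheap watches now", "spam")
bayes.learn("meeting agenda for monday", "ham")
bayes.learn("lunch on monday", "ham")

print(bayes.spam_probability("buy cheap pills") > 0.5)        # True
print(bayes.spam_probability("monday meeting agenda") < 0.5)  # True
```

Words seen mostly in spam push the probability up, words seen mostly in good e-mail push it down, and words the filter has never seen leave it unchanged. This is why such a filter improves as it is shown more examples of both types.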
Spam filtering options
Spam can be filtered on the server or the client. The two approaches are explained next. In the first scenario, spam is filtered on the client.
- Mail is processed by the MTA.
- The e-mail is then placed in the appropriate user’s inbox.
- The e-mail client reads all new e-mail from the inbox.
- The e-mail client then passes the e-mail to the filter.
- When the filter returns the results, the client can display the valid e-mail and either discard spam or file it in a separate folder.
In this approach, the spam filtering is always done by the client and is always done when new e-mail is processed. This often happens while the user is present, so the user may either experience a delay before e-mail is visible, or there may be a period where spam is present in the inbox before the client software can filter it from view. The amount of spam filtering that can be performed on the client may be limited. In particular, network tests such as open relay blocklists or SURBLs might be too time consuming or complex to perform on the user’s PC. As spam is a moving target, updating many client PCs can become a difficult administrative task.
In the second scenario, the spam filtering is performed on the e-mail server.
- Incoming e-mail is received by the MTA.
- It is then passed on to the spam filter.
- The results are then sent back to the MTA.
- Depending on the results, the MTA places the e-mail in the appropriate user’s inbox (4a), or in a separate folder for spam (4b).
- The e-mail client accesses e-mails in the user’s inbox and it can also access the spam folder if required.
This approach has several advantages:
- The spam filtering is done when the e-mail is received, which may be any time of the day. The user is less likely to be inconvenienced by delays.
- The server can specialize in spam filtering. It may use external services such as open relay blocklists, online content databases, and SURBLs.
- Configuration is centralized, which will ease setup (for example, firewalls may need to be configured to use online spam tests) and also maintenance (updating of rules or software).
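Blocklists of the kind mentioned above are usually queried over DNS: the octets of the IP address are reversed, the name of the blocklist's zone is appended, and the resulting name is looked up. If the name resolves, the address is listed. The following is a minimal sketch of that convention; the zone dnsbl.example.org is a placeholder rather than a real blocklist, and 192.0.2.99 is an address reserved for documentation.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a blocklist."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blocklist zone has an entry for the address.

    Needs working DNS; a failed lookup is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


# Both values below are placeholders for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# 99.2.0.192.dnsbl.example.org
```

Each such test costs a DNS round trip, and the firewall must allow the server to make the query, which is part of the reason these checks are easier to run on one central server than on every client.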
On the other hand, the disadvantages include:
- A single point of failure now exists. However, with care, the e-mail system can be configured to work around a failed spam filtering service. If the service is not available, e-mail will still be delivered but spam will not be filtered.
- All spam must be processed by one service. If this service is not scalable, large volumes of e-mail may affect mail delivery times, resulting in poor or intermittent filtering, or possibly even the loss of e-mail service.
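The "deliver anyway" behavior described for a failed filter can be sketched in a few lines. This is only an illustration of the idea, not how any particular MTA implements it; the function names and the stand-in filters are hypothetical.

```python
def deliver(message: str, spam_filter) -> str:
    """Return the folder a message should be delivered to.

    `spam_filter` is any callable that returns True for spam. If it
    raises an exception, the message is delivered unfiltered.
    """
    try:
        is_spam = spam_filter(message)
    except Exception:
        # Filter unavailable: keep mail flowing and accept unfiltered spam.
        return "inbox"
    return "spam" if is_spam else "inbox"


def keyword_filter(message: str) -> bool:
    """Stand-in filter used only for this illustration."""
    return "free money" in message.lower()


def broken_filter(message: str) -> bool:
    """Simulates a spam filtering service that is down."""
    raise ConnectionError("spam filter not reachable")


print(deliver("Claim your FREE MONEY today", keyword_filter))  # spam
print(deliver("Minutes of Monday's meeting", keyword_filter))  # inbox
print(deliver("Claim your FREE MONEY today", broken_filter))   # inbox
```

The last call shows the trade-off: when the filter is down, spam reaches the inbox, but no legitimate e-mail is lost or delayed.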
Introduction to SpamAssassin
Spam filtering actually involves two phases—detecting the spam and then doing something with it. SpamAssassin is a spam detector and it modifies the e-mail it processes by putting in headers to mark whether it is spam. It is up to the MTA or the mail delivery agent in the e-mail system to react to the headers that SpamAssassin creates in an e-mail, to filter it out. However, it’s possible that another part of the e-mail system could perform this task.
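As an illustration of the second phase, the following sketch reads the headers of an e-mail that has already been processed and chooses a folder for it. SpamAssassin adds an X-Spam-Flag: YES header to e-mails it considers spam, along with an X-Spam-Status header; the sample message, the rule name EXAMPLE_RULE, and the folder names are made up for this example.

```python
from email import message_from_string

# A made-up message as it might look after being marked as spam.
RAW_MESSAGE = """\
From: sender@example.com
To: user@example.org
Subject: Example
X-Spam-Flag: YES
X-Spam-Status: Yes, score=12.3 required=5.0 tests=EXAMPLE_RULE

Body text.
"""


def target_folder(raw: str) -> str:
    """Pick a folder from the X-Spam-Flag header of a processed e-mail."""
    message = message_from_string(raw)
    flag = message.get("X-Spam-Flag", "")
    return "spam" if flag.strip().upper() == "YES" else "inbox"


print(target_folder(RAW_MESSAGE))  # spam
```

In a real installation, this decision is normally made by the MTA or the mail delivery agent rather than by a separate script.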
The previous figure gives a schematic representation of SpamAssassin. At the heart of SpamAssassin is its Rules Engine that determines which rules are called. Rules determine which of the various tests are used, including the Bayesian Filter, the network tests, and the auto-whitelists.
SpamAssassin uses various databases to do its work, and these are shown too. The rules and scores are text files. Default rules and scores are included in the SpamAssassin distribution and, as we will see, both system administrators and users can add rules or change the scores of existing rules by adding them to files in specific locations. The Bayesian filter (which is a major part of SpamAssassin, and will be covered later) uses a database of statistical data based on previous spam and non-spam e-mails. The Auto-Blacklist/Whitelist also creates its own database.
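The way rules and scores work together can be sketched as follows. Each rule that matches contributes its score, and the total is compared with a threshold, which is 5.0 in SpamAssassin's default configuration. The three rules, their names, and their scores are invented for this example and are not real SpamAssassin rules.

```python
import re

# Hypothetical rules: (name, pattern, score). Not real SpamAssassin rules.
RULES = [
    ("DEMO_FREE_MONEY", re.compile(r"free\s+money", re.I), 2.5),
    ("DEMO_CLICK_HERE", re.compile(r"click\s+here", re.I), 1.5),
    ("DEMO_MANY_BANGS", re.compile(r"!{3,}"), 1.2),
]

# SpamAssassin's default threshold is 5.0; administrators can change it.
REQUIRED_SCORE = 5.0


def score_message(text: str):
    """Return (total score, names of the rules that matched)."""
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


total, matched = score_message("FREE   MONEY!!! Click here now")
print(matched)
print(total >= REQUIRED_SCORE)
```

Because no single rule reaches the threshold on its own, an e-mail has to trigger several rules before it is treated as spam, and changing the score of one rule changes how much weight that test carries.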
Downloading and installing SpamAssassin
SpamAssassin is slightly different from most of the software that is used in this book. It is written in a language called Perl, which has its own distribution method called CPAN (Comprehensive Perl Archive Network). CPAN is a large website of Perl software (normally, Perl modules), and the term CPAN is also the name of the software used to download those modules and install them. Though SpamAssassin is provided as a package by many Linux distributions, we strongly recommend that you install it from source rather than use a package. This way, you will get the latest version of SpamAssassin rather than the one that was current when your Linux distributor created its release.
Most Perl users will build Perl modules using CPAN and experience no difficulties. CPAN can automatically locate and install any dependencies (other components that are required to make the desired component work properly). From a Perl point of view, using CPAN to install Perl modules is like using the rpm or apt-get commands in Linux. The basics are very simple and, once a system is configured, it generally works every time.
However, learning and configuring a new way of installing software may put off some people. A SpamAssassin release is distributed in source form, but administrators of Red Hat Package Manager (RPM) based systems can easily convert the latest SpamAssassin release into rpm format and then the regular rpm command can be used to install the package. The Debian repository is updated fairly quickly when SpamAssassin is updated and the regular apt-get commands can be used to install SpamAssassin. We strongly advise you to install via apt-get, CPAN, or using the rpmbuild command as described next, in preference to using an RPM provided by a distributor.
As SpamAssassin is a Perl module, it appears on CPAN first. In fact, it is only released when it arrives at CPAN. Users of CPAN can download the latest version of SpamAssassin literally minutes after it has been released.
Support is also easier to obtain if SpamAssassin is built from source. Some distributors make unusual decisions when creating their RPM of SpamAssassin or may modify certain default values. These changes make obtaining support more difficult.
RPMs also take time to be delivered. Distributors need time to build and test new versions of software before they release them, and most software packages are not updated as quickly as SpamAssassin. So, Linux distributions may not provide the latest software, and what is provided can be several versions out of date.
The prerequisites for installing SpamAssassin 3.2.5 using CPAN are as follows:
- Perl version 5.6.1 or later: Most modern Linux distributions will include this as a part of the base package.
- Several Perl modules: The current version of SpamAssassin needs the Digest::SHA1, HTML::Parser, and the Net::DNS modules. CPAN will install these if you configure it to follow dependencies, but there are many additional Perl modules that are optional and should be installed to get the best spam detection. CPAN will issue warnings with the module names, which will enable you to identify and install them.
- C compiler: This may not be installed by default and may have to be added using the rpm command. The compiler used will normally be called gcc.
- Internet connection: CPAN will attempt to download the modules using HTTP or FTP, so the network should be configured to allow this.
If you’ve used CPAN before, you can skip to the next section, Installing SpamAssassin Using CPAN.
If a proxy server is required for Internet traffic, CPAN (and other Perl modules and scripts) will use the HTTP_proxy environment variable. If the proxy requires a username and password, these also need to be specified using environment variables. As CPAN is normally run as root, these commands should be entered as root, substituting your own proxy address, username, and password for the placeholder values shown:
# export HTTP_proxy=http://proxy.example.com:8080
# export HTTP_proxy_user=username
# export HTTP_proxy_pass=password
Next, enter this command:
# perl -MCPAN -e shell
If the output is similar to the following, the CPAN module is already installed and configured, and you can skip to the next section, Installing SpamAssassin Using CPAN.
cpan shell -- CPAN exploration and modules installation (v1.7601)
ReadLine support enabled
If the output prompts for manual configuration, as shown next, the CPAN module is installed but not configured.
Are you ready for manual configuration? [yes]
During configuration, the CPAN Perl module prompts for answers to around 30 questions. For most of the questions, selecting the default value is the best response. This initial configuration must be completed before the CPAN Perl module can be used. The questions are mainly about the location of various utilities, and the defaults can be chosen by pressing Enter. The only question for which we should change the default is the one about building prerequisite modules. If we configure CPAN to follow dependencies, it will install the required modules without prompting.
Policy on building prerequisites (follow, ask or ignore)? [ask] follow
Once CPAN is configured, exit the shell by typing exit and pressing Enter. We are now ready to use CPAN to install SpamAssassin.