Computer Science

Spam: A Discussion

by Russ Smith

A growing problem for email users over the past several years has been the increasing quantity of unsolicited and unwanted mail, usually but not necessarily featuring advertising, referred to generally and hereafter in this document as "spam." Compounding the problem is the difficulty (to say the least) of finding server-wide solutions that do not filter (and perhaps delete) user mail without user consent. Therefore, spam filtering is generally considered the province of the user. This document describes some ways users of CSX (the primary Computer Science servers at Oklahoma State University) might alleviate the spam problem for themselves. A quick rundown of the technical steps included in this document can be found in this quickref.

Types of spam filtering

There are a few general approaches to spam filtering that are worth knowing a little about before discussing specific packages that implement them. The first is "heuristic filtering," where a set of static rules is applied one by one against a message and a determination is made of whether the message is probably best classified as spam. The upside of heuristic filtering is that it is largely automatic -- typically no user involvement is required. The downside is that it is prone to a substantial number of false negatives (where actual spam is mischaracterized as legitimate email) and some number of false positives (where legitimate email is mischaracterized as spam) as well. For this reason, the "hands-off" characteristic of the heuristic approach can be misleading, as dealing with false positives and negatives has to be done manually and can be time-consuming. On the other hand, doing so is usually straightforward and can generally be done via any email client.
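
As a rough illustration of the heuristic idea, the sketch below scores a message against a handful of static rules and compares the total to a cutoff. The rule names, patterns, point values, and threshold are all invented for this example and are not taken from any real filter; real rule sets are far larger.

```python
import re

# Hypothetical rules: (name, pattern, points). Invented for illustration only.
RULES = [
    ("shouting_subject", re.compile(r"^Subject: [A-Z0-9 !$]{8,}$", re.MULTILINE), 1.5),
    ("free_money", re.compile(r"free money", re.IGNORECASE), 2.5),
    ("many_exclamations", re.compile(r"!{3,}"), 1.0),
    ("act_now", re.compile(r"\bact now\b", re.IGNORECASE), 1.0),
]

THRESHOLD = 3.0  # assumed cutoff; a higher value trades false positives for false negatives


def score_message(text):
    """Apply every rule in turn; return (total points, names of rules that matched)."""
    hits = [(name, points) for name, pattern, points in RULES if pattern.search(text)]
    return sum(points for _, points in hits), [name for name, _ in hits]


def is_probably_spam(text):
    """Classify a message as spam when its total score reaches the threshold."""
    total, _ = score_message(text)
    return total >= THRESHOLD
```

A spam message that happens to match none of the rules scores zero and passes as legitimate, which is exactly the false-negative weakness described above.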

Another approach is the more recent so-called "Bayesian" approach, wherein a piece of software is trained over time to recognize spam by "showing" it pieces of spam and non-spam and informing it which is which. (The details of this approach were introduced and improved respectively by Paul Graham in his papers A Plan for Spam and Better Bayesian Filtering.) The upside is that this has been shown to generate far fewer false negatives and next to no false positives (although as spammers try to find ways to defeat the Bayesian approach, this may hold less true -- time will tell). The downside is that the process of training the system may be laborious and often requires interaction directly on the server (in this case, CSX), which complicates matters for users of POP3 and IMAP.
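
To make the training idea concrete, here is a deliberately simplified sketch in the spirit of A Plan for Spam. It is not Graham's exact algorithm and not how any filter on CSX is implemented. The class and function names are hypothetical. The 0.4 value for unseen tokens, the 0.01/0.99 clamps, and the use of the 15 most telling tokens follow Graham's paper; counting per message rather than per occurrence is a simplification made here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return the set of word-like tokens it contains."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TinyBayes:
    """Toy Bayesian filter: counts how many messages of each kind contain a token."""

    def __init__(self):
        self.good, self.bad = Counter(), Counter()
        self.ngood = self.nbad = 0

    def train(self, text, is_spam):
        """'Show' the filter one message and tell it which kind it is."""
        if is_spam:
            self.bad.update(tokenize(text))
            self.nbad += 1
        else:
            self.good.update(tokenize(text))
            self.ngood += 1

    def token_prob(self, token):
        """Estimate the probability that a message containing this token is spam."""
        g, b = self.good[token], self.bad[token]
        if g + b == 0:
            return 0.4  # never-seen token: treated as mildly innocent
        good_rate = g / max(self.ngood, 1)
        bad_rate = b / max(self.nbad, 1)
        return min(0.99, max(0.01, bad_rate / (good_rate + bad_rate)))

    def spam_prob(self, text, top=15):
        """Combine the probabilities of the tokens farthest from a neutral 0.5."""
        probs = sorted((self.token_prob(t) for t in tokenize(text)),
                       key=lambda p: abs(p - 0.5), reverse=True)[:top]
        spam = ham = 1.0
        for p in probs:
            spam *= p
            ham *= 1.0 - p
        return spam / (spam + ham)
```

Every call to train shifts the per-token estimates, which is why correcting the filter's mistakes improves its later decisions.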

Further approaches, used both independently of and in addition to the above, include "whitelisting," wherein a list of "good" email addresses is exempted from the filtering process; "blacklisting," wherein mail from a list of "bad" email addresses is always refused outright; and "challenge-response," where an unknown sender might receive an automated challenge telling them to email back to a specific address, at which time they will be whitelisted automatically.
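
The way address lists combine with a content filter can be sketched as follows. The addresses, the outcome labels, and the triage function are hypothetical, and checking the whitelist before the blacklist is an arbitrary choice made for this example.

```python
from email.utils import parseaddr

# Hypothetical lists; in practice these would be loaded from a user-maintained file.
WHITELIST = {"advisor@example.edu"}
BLACKLIST = {"offers@bulk-mailer.example"}


def triage(from_header, body, content_filter):
    """Decide where a message goes: 'inbox', 'refuse', or 'spam-folder'.

    content_filter is any callable that takes the body and returns True for spam,
    for instance a heuristic or Bayesian classifier.
    """
    address = parseaddr(from_header)[1].lower()
    if address in WHITELIST:
        return "inbox"      # exempt from filtering entirely
    if address in BLACKLIST:
        return "refuse"     # never delivered, regardless of content
    return "spam-folder" if content_filter(body) else "inbox"
```

A challenge-response scheme would add a further outcome for unknown senders, holding the message until the sender replies and is added to the whitelist.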

IMAP and POP

We primarily recommend three software approaches for CSX mail. Before getting into the two pieces of software installed on CSX, a discussion of IMAP and POP3 clients is in order. It is becoming more common for such email clients -- for example, Thunderbird and Mail.app for OS X -- to include client-side filtering, often combining several of the approaches previously mentioned (typically at least Bayesian and whitelisting). For users of IMAP and POP3, this can be an ideal approach, as the interface for training is often simple and intuitive and does not require logging into CSX whenever training is needed. It has been pointed out, however, that users on very slow links may not find this approach desirable, as spam is not filtered until it is downloaded. For many people, though, this approach may be excellent.

SpamAssassin

In cases where client-side filtering is not desired or usable, there are two pieces of software available (and up-to-date as of this writing) on CSX. The first is SpamAssassin, a Perl program that historically has used the heuristic approach, although recently it has also begun to provide a Bayesian training system. This document will primarily concern itself with the "hands-off" use of SpamAssassin heuristics, although the user is welcome to read more about the Bayesian aspect. The file /pub/htdocs/spamassassin.ex is an example of a file that procmail can use to implement SpamAssassin filtering; generally a user could simply run cp /pub/htdocs/spamassassin.ex .procmailrc in his or her home directory and begin seeing results (note that this overwrites any existing .procmailrc). The effect is that all mail presumed by SpamAssassin to be spam is filed in a folder named Spam (under the directory ~/mail) instead of the user's inbox.

Why a Spam folder?

This leads to an important digression: why not just delete mail assumed to be spam? The answer appears earlier in this document: the likelihood of false positives is never zero. If the user deletes presumed spam automatically, false positives will be lost without ever being seen, which is typically considered unacceptable. Consequently, this document recommends instead filing presumed spam in a separate folder and reviewing this folder frequently before deleting its contents.
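
For users who prefer to review the folder from a shell on the server, a short script can list senders and subjects so that false positives stand out before anything is deleted. This is only a sketch: it assumes the Spam folder is stored as a single mbox-format file, which may not match your mail setup, and the function name is hypothetical.

```python
import mailbox


def summarize_folder(path):
    """Return a (sender, subject) pair for each message in an mbox-format file."""
    box = mailbox.mbox(path, create=False)  # raise an error rather than create a missing folder
    try:
        return [(msg.get("From", "?"), msg.get("Subject", "(no subject)"))
                for msg in box]
    finally:
        box.close()
```

Calling summarize_folder on the path of the Spam folder (for instance, the file Spam under ~/mail, written out in full) and printing each pair gives a quick list to skim.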

Spamprobe

Spamprobe, a Bayesian filter, is also available on CSX. Unfortunately, it is somewhat harder to give a "plug and play" solution as was previously done with SpamAssassin. Generally, there are a few stages to implementing Spamprobe:

  1. Collect spam messages for several days, ideally refiling them into a folder (hereafter called Spam).
  2. Run spamprobe for the first time to prime it with a large number (the larger the better) of known spam and known non-spam messages. Let's assume that a folder named Saved contains non-spam messages, and that your mail folders are under the directory mail (but don't assume this is correct in practice); you would do something like this:
    spamprobe -c good mail/Saved
    spamprobe -c spam mail/Spam
    Note that the first command you run will create the initial Spamprobe database.
  3. Include in your .procmailrc code such as that in /pub/htdocs/spamprobe.ex. This instructs spamprobe to act similarly to SpamAssassin in that it adds a header or headers to the email, then refiles the message if it believes it is spam; additionally, it further trains its database based on the message contents. If you are already using SpamAssassin, you can use both with something like the contents of /pub/htdocs/spamboth.ex.
  4. If a message is incorrectly characterized -- that is, if it's a false positive or false negative -- you will need to manually train spamprobe on it to improve its behavior. Save the message as a file (hereafter called File), and do something like spamprobe train-spam File if it's a false negative; if it's a false positive (far less common), use train-good instead.
  5. Run spamprobe cleanup periodically; doing so at least daily is vital to keep disk usage down. (Failing to do so can cause disk overuse, which may result in deletion of spamprobe databases and require starting training over.) One way to automate this would be to run crontab -e and add a line like this:
    0 1 * * * /usr/bin/spamprobe cleanup
    which runs spamprobe cleanup every morning at 1am.
Otherwise, this approach is much like the SpamAssassin approach in that supposed spam is refiled in Spam (under ~/mail) for you to review before deleting.
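
The manual correction in step 4 can be wrapped in a small helper so that the right subcommand is chosen consistently. This is a sketch only: the function names are hypothetical, and it assumes spamprobe is on your PATH and that the message has already been saved to a file.

```python
import subprocess


def retrain_command(message_file, is_spam):
    """Build the spamprobe command line for correcting one misfiled message."""
    # train-spam fixes a false negative; train-good fixes a false positive.
    subcommand = "train-spam" if is_spam else "train-good"
    return ["spamprobe", subcommand, message_file]


def retrain(message_file, is_spam):
    """Run the correction; raises CalledProcessError if spamprobe exits with an error."""
    subprocess.run(retrain_command(message_file, is_spam), check=True)
```

For example, retrain("File", is_spam=True) would run spamprobe train-spam File for a piece of spam that slipped into the inbox.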

Concluding remarks

Questions about these or other approaches are always welcome at the usual system manager address. Please be patient with us, as this is a difficult problem that dedicated individuals struggle with daily. Please point out any errors in this document or in the example files, and we will correct them diligently. Hopefully this document will help you in your struggle against spam.

Other references