Systems Administration Toolkit: Spam and virus filtering for e-mail


Level: Intermediate

Martin Brown (mc@mcslp.com), Freelance Writer, Consultant

22 Jan 2008

Look beyond tools like SpamAssassin and Amavis to see how you can extend them and provide additional filtering facilities to lower the amount of spam hitting the e-mail boxes of your users. Most companies use spam and virus filtering services on their UNIX® platforms, but there are some methods that you can use that help improve your filtering scores and might even eliminate spam reaching inboxes.

About this series

The typical UNIX® administrator has a key range of utilities, tricks, and systems he or she uses regularly to aid in the process of administration. There are key utilities, command-line chains, and scripts that are used to simplify different processes. Some of these tools come with the operating system, but a majority of the tricks come through years of experience and a desire to ease the system administrator's life. The focus of this series is on getting the most from the available tools across a range of different UNIX environments, including methods of simplifying administration in a heterogeneous environment.

Spam and virus filtering fundamentals

There are many different tools and systems available for filtering and removing spam e-mail at the UNIX server level. Tools like SpamAssassin and more comprehensive agents, such as Amavis (which includes interfaces to SpamAssassin, various virus scanners, and other spam tools such as Razor), use a variety of methods to identify and capture spam.

These methods include, but are certainly not limited to:

  • Direct matches: This involves looking for specific text in the message, either in the body, the message headers, the subject, or even the e-mail addresses. Certain spam e-mails, and the programs that produce them, use the same templates, fake headers, and sometimes even the same (wrong) codes and structure.
  • Pattern matches: This involves looking for patterns of text, such as the four-letter codes used for share listings, or the patterns used to describe the pills and drugs often sold using spam.
  • Fingerprint matches: This involves looking for more complex combinations of words and phrases. It's very common to have a number of e-mails all containing the same basic information but with minor changes, such as name, age, or date. By creating a set fingerprint for the basic structure of the e-mail, you can identify it as spam.
  • Bayesian techniques: A Bayesian filter uses how often each word has appeared in previously identified spam and in normal e-mail to calculate a probability that a new e-mail is spam. The principle is very simple, but it's also very effective, as a lot of spam contains the same words and sometimes the same repetition of words.
  • DNS blacklists: These are published lists of hosts known to send or forward spam.
  • Whitelisting and blacklisting: The theory is very simple. Adding an e-mail address to the whitelist indicates that it's an address that you trust, while the blacklist contains e-mail addresses you don't trust.

It is almost impossible to rely on a single one of the techniques above to identify spam perfectly, but you can increase the quality of the spam filter by using combinations.

To demonstrate how a single piece of information is not enough, consider an e-mail from a friend. You might have added their address to your whitelist, but what if the spam sender has spoofed (faked) the e-mail address of your friend? If you just used the blacklist and whitelist solution, the spam would most likely get through. But if the spam contained an advert for some drugs, then it's highly likely that one of the matching mechanisms or the Bayesian filter would identify the e-mail as spam.

To work effectively, most solutions use some kind of scoring mechanism. In general, the higher the score, the more likely the e-mail is to be spam. By giving different scores to different parts of the spam filtering methods, you can ultimately provide an effective score. For example, let's say that being in the whitelist gives a score of -10, but matching the Bayesian filter gives a score of 15. Further matches against the pattern and text matching add another 5 points, and the "score" for the e-mail is now 10 points (-10+15+5). If you set a "spam" score of 7, then the e-mail is treated as spam and deleted or quarantined, accordingly.
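The arithmetic in this example can be sketched in a few lines of Perl. This is only an illustration of the scoring idea, not how SpamAssassin or Amavis is implemented; the test names, the scores, and the threshold are the hypothetical values used in the paragraph above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-test scores, taken from the worked example in the text
my %score_for = (
    whitelist     => -10,
    bayes         => 15,
    pattern_match => 5,
);

# Hypothetical threshold: a total at or above this is treated as spam
my $spam_threshold = 7;

# Add up the scores of every test that matched the message
sub total_score
{
    my (@matched) = @_;
    my $total = 0;
    $total += $score_for{$_} foreach (@matched);
    return $total;
}

# Return 1 if the combined score reaches the spam threshold, 0 otherwise
sub is_spam
{
    my (@matched) = @_;
    return total_score(@matched) >= $spam_threshold ? 1 : 0;
}

my @matched = qw(whitelist bayes pattern_match);
printf("Score: %d, spam: %s\n", total_score(@matched),
       is_spam(@matched) ? 'yes' : 'no');
# Prints "Score: 10, spam: yes"
```

Note that a whitelisted sender alone scores -10 and passes, which is why the whitelist entry does not save a message that also trips the Bayesian and pattern tests.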

To help describe how you can improve the process, Figure 1 shows a typical spam filtering solution where the e-mail is filtered as it comes in, and the e-mail is delivered into individual mailboxes.


Figure 1. Typical spam filtering solution architecture

Setting up the initial system is only part of the solution, though. For long-term planning, you need to think about how you are going to stay on top of the spam e-mail problem, because spammers are finding new tricks all the time and, even with a spam filtering solution in place, you are unlikely to reach a solution that's 100 percent effective.

Let's start by making sure that any spam that wasn't caught can be trapped and reported.

Setting up a report mailbox

The Bayesian spam filtering technique works by comparing "spam" words with "normal" words. The problem is deciding which words count as normal and which count as spam, because the answer varies for different people.

For example, if you work for a pharmacy, then you are likely to get a lot of e-mails containing the names of drugs. Unfortunately, drugs are often sold through spam e-mail, and genuine messages you might be expecting can get caught by the system. Fortunately, Bayesian systems base their decisions on past experience, so the more you teach the Bayesian filter about what is and isn't spam, the better the filter gets at identifying the e-mails correctly.
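To make the idea of learning from past experience concrete, here is a deliberately simplified sketch of a word-based Bayesian classifier. It is not SpamAssassin's algorithm: it assumes equal prior odds for spam and ham, treats words as independent, and uses add-one smoothing. All function names and the training messages are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Word counts learned so far, plus the number of messages of each kind
my %count    = (spam => {}, ham => {});
my %messages = (spam => 0,  ham => 0);

# Split text into lowercase words
sub words
{
    my ($text) = @_;
    return grep { length } split(/[^a-z0-9]+/, lc($text));
}

# Teach the filter that $text is 'spam' or 'ham'
sub learn
{
    my ($kind, $text) = @_;
    $messages{$kind}++;
    my %seen = map { $_ => 1 } words($text);
    $count{$kind}->{$_}++ foreach (keys %seen);
}

# Estimate the probability (0 to 1) that $text is spam
sub spam_probability
{
    my ($text) = @_;
    my ($log_spam, $log_ham) = (0, 0);
    my %seen = map { $_ => 1 } words($text);
    foreach my $word (keys %seen)
    {
        # Add-one smoothing so unseen words never give a zero probability
        my $p_spam = (($count{spam}->{$word} || 0) + 1) / ($messages{spam} + 2);
        my $p_ham  = (($count{ham}->{$word}  || 0) + 1) / ($messages{ham}  + 2);
        $log_spam += log($p_spam);
        $log_ham  += log($p_ham);
    }
    return 1 / (1 + exp($log_ham - $log_spam));
}

# Train on a few made-up messages, as a pharmacy might see them
learn('spam', 'cheap pills online');
learn('spam', 'cheap pills today');
learn('ham',  'prescription pills order for the pharmacy');
learn('ham',  'pharmacy order confirmed');

printf("%-16s %.2f\n", $_, spam_probability($_))
    foreach ('cheap pills', 'pharmacy order');
```

Because "pills" appears in both kinds of training message, it counts only weakly toward spam, while "pharmacy" and "order" pull the second message firmly toward ham. That is the effect the pharmacy example relies on: the more genuine drug-related mail the filter is taught, the less those words count against a message.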

All systems that support learning should have some form of script or application that accepts the message text. SpamAssassin, for example, provides the sa-learn script, which you can tell to identify the message as spam or ham, accordingly. You might also have other solutions that benefit from reporting. The Razor spam filter provides a server-based service where you can report spam so that other users can benefit from your identifying the e-mail as spam.

For this to work, you need to set up a mail folder or system where users can send misclassified e-mail, so that the e-mail can be scanned and the Bayesian filter "taught" whether the e-mail is spam (a spam e-mail that the system thought was genuine) or ham (a genuine e-mail that the system thought was spam).

In general, it is easier to set up a single mailbox that you can use to hold the spam and another to hold the ham. In a system that supports user-by-user learning, you can set up the folders in each user's mailbox. In the examples presented in this article, let's assume that you are using SpamAssassin and an IMAP-based mail system so that you can read and parse the contents and have SpamAssassin learn the details, but the principles could easily be applied to other spam filtering environments.

Listing 1 shows a simple Perl script that accesses a global mailbox, downloads each message, and then reports the content to SpamAssassin and Razor.


Listing 1. A script for reporting and learning spam
                
#!/usr/bin/perl

use strict;
use warnings;

use Mail::IMAPClient;

my $SpamFolder = 'INBOX';
my $Server     = 'imap.mcslp.pri';
my $User       = 'spam';
my $Password   = 'ilovespam';
my $TmpDir     = '/tmp/spamreport';

# Open the connection to the mail server

my $SPAMIMAP = Mail::IMAPClient->new(Server   => $Server,
                                     User     => $User,
                                     Password => $Password);
if (!defined($SPAMIMAP))
{
    # There is nothing to process without a connection, so stop here
    die "Error: $@\n";
}

# Select the Spam Folder

$SPAMIMAP->select($SpamFolder)
    or die "Error: cannot select $SpamFolder: $@\n";

# Get a list of Message IDs

my @MIDs = $SPAMIMAP->messages();

# Exit if there aren't any messages to process

if (scalar(@MIDs) == 0)
{
    exit(0);
}

# Create a temporary directory, accessible only by its owner,
# to hold our message text

if (!-d $TmpDir)
{
    mkdir($TmpDir, 0700) or die "Error: cannot create $TmpDir: $!\n";
}

# Process each ID

foreach my $MID (@MIDs)
{
    # Get the message text, and write the text out to a text file

    my $path    = "$TmpDir/$MID";
    my $msgtext = $SPAMIMAP->message_string($MID);
    next unless defined($msgtext);

    open(my $fh, '>', $path) or die "Error: cannot write $path: $!\n";
    print $fh $msgtext;
    close($fh);

    # Run the SpamAssassin learn script on the file

    system("sa-learn --spam < $path");

    # Run the Razor reporter on the message content

    system("razor-report < $path");

    # Delete the original message

    $SPAMIMAP->delete_message($MID);

    # Delete the temporary file

    unlink($path);
}

# Empty the trash and disconnect

$SPAMIMAP->expunge();
$SPAMIMAP->disconnect();

You can use a modified version of the script to report ham in the same way, by allowing users to copy messages that were incorrectly identified as spam into a ham folder. You should be more careful with this folder, as it contains genuine e-mail, and any user who can write e-mails to the folder can also potentially read them. With a global mailbox for this purpose, it is possible for different users to read each other's ham mail.
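One way to keep the spam and ham variants in a single script is to look up the commands by report type. The sketch below only builds the command strings and prints them, so you can check them before running anything. It assumes that sa-learn --ham and razor-revoke are installed, and that razor-revoke accepts a message on standard input in the same way as razor-report; the report_commands function and the /tmp/hamreport path are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Commands to run for each kind of report (assumed to be on the PATH)
my %commands = (
    spam => [ 'sa-learn --spam', 'razor-report' ],
    ham  => [ 'sa-learn --ham',  'razor-revoke' ],
);

# Return the shell commands that feed the message file to each tool
sub report_commands
{
    my ($kind, $path) = @_;
    die "Unknown report type: $kind\n" unless exists($commands{$kind});
    return map { "$_ < $path" } @{$commands{$kind}};
}

# Show what would be run for a ham message without running anything
print "$_\n" foreach (report_commands('ham', '/tmp/hamreport/1'));
```

In a reporting script, each returned string could be passed to system() in place of the two fixed commands.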

The system helps to improve the spam filtering by improving the knowledge of the system about what is and isn't spam, and it can be seen here in an updated version of the spam solution in Figure 2.


Figure 2. Using auto-reporting and learning mechanisms

A further improvement is to better identify the senders of the e-mail and to help filter the e-mail before it even gets checked for the usual spam contents.

Updating whitelists and blacklists

The fundamentals of the whitelists and blacklists used in spam filtering are very simple. The whitelist contains e-mail addresses you trust, while the blacklist contains e-mail addresses you don't trust. The relative weighting of the information (that is, the size of the score according to whether the address appears in the whitelist or the blacklist) is up to you.

There are some limitations with the whitelists and blacklists:

  • Blacklists can become huge, for the simple reason that spammers often use a multitude of different addresses as the source for their spam. The larger your blacklists, the longer it takes to check each e-mail and, ultimately, this might become a barrier to using the blacklists at all. In particular, be careful of using the auto-blacklisting service built into spam solutions, as it adds every e-mail address identified as a source of spam to the blacklists.
  • Auto-blacklisting can also cause a problem if you get valid e-mails that are identified as spam. In other words, a genuine and trusted e-mail address gets identified as a spam e-mail address, which can skew your results.
  • The same rule applies to auto-whitelisting. When missed spam gets through, its sender address is added to your whitelist, even though the e-mail is technically spam.

There is no simple way round these limitations, but you can work to improve the quality of your whitelists and blacklists by using auto-processing of your e-mails and updating of your lists.

For example, in a given environment, you can generally auto-populate your whitelist with:

  • All the e-mail addresses in your address book or global address book
  • All the e-mail addresses for the users of your system
  • All the e-mail addresses of known good sources, such as clients and suppliers

Furthermore, if you have access to the e-mail mailboxes of users (for example, because of the Spam and Ham processing demonstrated earlier in this article), then you can process the e-mail that has reached their e-mail accounts as the source for your whitelist. If you choose this method, make sure that you only choose the e-mail that has been identified as valid; try the filtering solution (see the Using standard filtering tools section).
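As a rough sketch of that idea, the following Perl functions pull the sender address from each raw message and turn the unique addresses into whitelist_from lines for a SpamAssassin configuration file. The function names are hypothetical and the address match is deliberately simple. It assumes you have already fetched the text of messages known to be valid, for example with message_string as in Listing 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the address out of the From: header of a raw message.
# This is a simplified match and will not handle every legal address.
sub sender_address
{
    my ($message) = @_;

    # Only look at the header block, which ends at the first blank line
    my ($headers) = split(/\r?\n\r?\n/, $message, 2);
    return unless defined($headers);

    if ($headers =~ m/^From:.*?([\w.+-]+@[\w.-]+\.[a-z]+)/mi)
    {
        return lc($1);
    }
    return;
}

# Build sorted, de-duplicated whitelist_from lines from a list of messages
sub whitelist_lines
{
    my (@messages) = @_;
    my %seen;
    foreach my $message (@messages)
    {
        my $address = sender_address($message);
        $seen{$address} = 1 if (defined($address));
    }
    return map { "whitelist_from $_" } sort keys %seen;
}

# Two made-up messages from the same sender yield a single entry
print "$_\n" foreach (whitelist_lines(
    "From: Alice Example <alice\@example.com>\nSubject: Order\n\nHello",
    "From: ALICE\@example.com\nSubject: Re: Order\n\nThanks",
));
# Prints "whitelist_from alice@example.com"
```

Because From: headers are easy to forge, a list generated this way is only as good as the check that each source message really was valid.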

Using whitelists and blacklists adds another thread to the spam solution (see Figure 3).


Figure 3. Updating white and blacklists automatically

Finally, the "last mile" of mail delivery can be used to further filter offending e-mails.

Using standard filtering tools

Using a spam filtering solution catches a significant amount of spam, but few solutions reach 100 percent reliability. Part of the problem is that spammers are getting very clever at getting their spam through the filters.

Fortunately for you, they also have a habit of using a wide range of e-mail addresses, both for the sender and the recipient, that might not actually match your address or the address of someone you know. White and blacklists can be an effective way of trying to alleviate the spam problem, but the sheer range of addresses in use by most spammers means that spam will often get past the filters to the user's mailbox.

As a last defense for the filtering and removal of spam, you can take advantage of the many server-side or client-side filtering mechanisms and file e-mail into e-mail folders, or file any e-mail that you don't explicitly recognize into a "quarantine" folder. Users can then manually select the messages (and, if necessary, update their filters) and use the Spam and Ham folders you've already created to help improve the quality of the spam filtering solution at the front end.

There are three ways of doing this:

  • File everything you recognize into folders, and leave your inbox as the quarantine folder of e-mails you do not recognize and which need manual filtering.
  • Ignore anything you recognize (for example, don't filter it), but move unrecognized e-mail to a quarantine folder.
  • Filter everything you recognize into dedicated folders and anything you don't recognize into a quarantine folder.
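The third approach can be sketched as a simple lookup: known senders map to a dedicated folder, and everything else goes to quarantine. This is only an illustration of the decision itself; the addresses, folder names, and the target_folder function are hypothetical, and a real deployment would express the same rule in whatever server-side or client-side filter language your mail system provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping of known senders to the folders their mail belongs in
my %folder_for = (
    'alice@example.com'  => 'Clients',
    'orders@example.org' => 'Suppliers',
);

# Decide where a message goes; unrecognized senders are quarantined
sub target_folder
{
    my ($sender) = @_;
    my $folder = $folder_for{lc($sender)};
    return defined($folder) ? $folder : 'Quarantine';
}

printf("%-22s -> %s\n", $_, target_folder($_))
    foreach ('Alice@example.com', 'stranger@example.net');
```

For the first approach, the fallback would be the inbox rather than a quarantine folder; for the second, recognized senders would return the inbox instead of a dedicated folder.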

Using a filtering system on your server or client also means that you can take advantage of simpler rules to dispose of spam that gets through the spam filters. Some spam makes it through by tricking spam filters into thinking it is genuine or has already been scanned and given a low mark. You can use your filters to get rid of this, as it is often more simply identifiable.

Also, some unwanted e-mail is not spam at all. It can be impossible to get removed from some mailing lists, even after contacting the companies involved, and even when it is a genuine mailing list that you simply no longer want to receive. Occasionally, a user also mistakenly adds your address to their address book, and you end up getting somebody else's mail.

Irrespective of the method you use, it adds another layer (and, more importantly, another filter) to your mail infrastructure, giving you a final spam filtering solution like the one shown in Figure 4.


Figure 4. Mailbox filtering in our spam filtering architecture

With all of this filtering in place, it is easy to forget that your changes should be measurable.

Obtaining statistics and generating reports

When filtering e-mail for spam and viruses, it's very easy to forget that you need a metric to monitor how effective your solutions are at removing the spam. Recording and measuring can also be an effective way of spotting specific trends and, in some cases, can suggest completely different ways of examining the e-mail as it comes in to make the process more effective.

If you are using a tool like Amavis, then the information about how each e-mail has been treated can be extracted by parsing the contents of the log file. Listing 2 shows a single line from the Amavis log file.


Listing 2. Line from the Amavis log file
                
Nov 26 11:33:45 constable.example.com /usr/bin/amavisd[2257]: (02257-04)Blocked SPAM, 
[83.237.69.122] [64.18.7.11] <jqyay@quintiles.com> -> 
<null@gendarme.example.com>, quarantine: quarantine@gendarme.example.com, 
Message-ID: <1d9b01c83020$3150e150$c0a8008f@Ned>, mail_id: YDOXKqndoiPU, 
Hits: 69.428, 11621 ms

The "Blocked SPAM" is the useful fragment of the log output, as it tells you both what the e-mail was identified as and what happened to it. The first word tells you whether the e-mail was blocked or passed. The second describes the type, including spam, infected (virus), bad header, banned, or clean. Listing 3 shows a Perl script that extracts this information and summarizes it.


Listing 3. Perl script that extracts log output
                
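#!/usr/bin/perl

use strict;
use warnings;

# Counts keyed by processing mode (Passed/Blocked), then by message type
my $stats = {};

while(<STDIN>)
{
    chomp;

    # Skip anything that is not an amavisd log entry
    next unless (m{/usr/bin/amavisd\[\d+\]: \(\d+-\d{2}\)});

    # Capture the mode and type in one match; the type can contain
    # a hyphen, as in BAD-HEADER
    if (my ($proc_mode,$proc_type) = (m/(Passed|Blocked) ([-A-Z]+)/))
    {
        $stats->{$proc_mode}->{$proc_type}++;
    }
}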

foreach my $mode (sort keys %{$stats})
{
    my $modetotal = 0;
    print "$mode\n";
    foreach my $type (sort keys %{$stats->{$mode}})
    {
        printf("\t%-20s %7d\n",$type,$stats->{$mode}->{$type});
        $modetotal += $stats->{$mode}->{$type};
    }
    printf("\t%-20s %7d\n",'Total',$modetotal);
}

To run the script, pipe the contents of the log file through it:

$ cat amavis.log | perl parse_amavis.pl
Blocked
    BANNED                 32793
    CLEAN                      1
    INFECTED                 766
    SPAM                   85499
    Total                 119059
Passed
    BAD-HEADER              1415
    CLEAN                  70588
    SPAM                     356
    Total                  72359

This unfortunately shows that more than 62 percent of the e-mail received was blocked, and that spam accounted for most of the blocked messages.
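The percentages can be checked directly from the counts in the sample output above. The short sketch below hard-codes those counts purely for illustration; in practice the same calculation could be added to the end of the script in Listing 3, using the $stats structure it builds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Counts copied from the sample output above
my $stats = {
    Blocked => { BANNED => 32793, CLEAN => 1, INFECTED => 766, SPAM => 85499 },
    Passed  => { 'BAD-HEADER' => 1415, CLEAN => 70588, SPAM => 356 },
};

# Sum every count to get the number of messages processed
my $grand_total = 0;
foreach my $mode (keys %{$stats})
{
    $grand_total += $_ foreach (values %{$stats->{$mode}});
}

# Return a count as a percentage of all messages
sub percent
{
    my ($count) = @_;
    return 0 unless $grand_total;
    return 100 * $count / $grand_total;
}

my $blocked_total = 0;
$blocked_total += $_ foreach (values %{$stats->{Blocked}});

printf("Blocked (all types): %.1f%%\n", percent($blocked_total));
printf("Blocked as spam:     %.1f%%\n", percent($stats->{Blocked}->{SPAM}));
# Prints 62.2% and 44.7% respectively
```

Of the 191,418 messages in the log, 62.2 percent were blocked for any reason and 44.7 percent were blocked specifically as spam; the rest of the blocked mail was banned or infected.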

With a little additional work, you can extract the information from the log and write that data into a database. Listing 4 shows the skeleton of a script that parses the Amavis log information in more detail.


Listing 4. Skeleton of a script that parses the Amavis log
                
#!/usr/bin/perl

use strict;
use warnings;

use Time::ParseDate;

while(<STDIN>)
{
    chomp;

    next unless (m{/usr/bin/amavisd\[\d+\]: \(\d+-\d{2}\)});
    next if (m{(mcfilter|slpfliter)});
    if (m/(Passed|Blocked) [-A-Z]+/)
    {
        my ($datetime,$host,$proc_mode,$proc_type,
            $sender,$recip,$hits,$msgid,$mailid);

        my @blocks = split(/\s+/);

        # Extract the date

        $datetime = parsedate(sprintf("%s %s %s",@blocks[0..2]));

        # Extract the host

        $host = $blocks[3];

        # Extract processing information

        ($proc_mode,$proc_type) = (m/(Passed|Blocked) ([-A-Z]+)/);

        # Extract the sender/recipient information

        ($sender,$recip) = (m/<(.*?)> -> <(.*?)>/);

        # Extract the message identifiers

        ($msgid)  = (m/Message-ID: <(.*?)>/);
        ($mailid) = (m/mail_id: ([^,\s]+)/);

        # Extract the spam score; a missing score, or one logged
        # as '-', is treated as zero

        ($hits) = (m/Hits: ([-0-9.]+),/);
        $hits = 0 if (!defined($hits) || $hits eq '-');

        # Now write the information into a database...

    }
}

Writing that information into a database table is left as an exercise for the reader. If you decide to use this script, you should record the date, e-mail addresses, spam score, processing mode and type, and other information. The more you record, the more information you can extract from the database later.
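As a starting point for that exercise, the sketch below shows one possible shape for the database step. Everything in it is an assumption rather than part of the original script: the maillog table, its column names, and the three helper functions are hypothetical, and store_record expects a database handle that has already been opened with the Perl DBI module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column order for the hypothetical "maillog" table
my @columns = qw(logtime host proc_mode proc_type sender recip hits);

# Build an INSERT statement with one placeholder per column
sub insert_sql
{
    return sprintf("INSERT INTO maillog (%s) VALUES (%s)",
                   join(', ', @columns),
                   join(', ', ('?') x scalar(@columns)));
}

# Turn a parsed record (a hash reference) into a list of bind values
sub bind_values
{
    my ($record) = @_;
    return map { $record->{$_} } @columns;
}

# Store one record using an already-connected DBI database handle
sub store_record
{
    my ($dbh, $record) = @_;
    my $sth = $dbh->prepare_cached(insert_sql());
    return $sth->execute(bind_values($record));
}

# Show the statement that would be prepared
print insert_sql(), "\n";
```

Inside the loop in Listing 4, the extracted variables could be collected into a hash reference and passed to store_record. Using placeholders rather than interpolating values into the SQL matters here, because sender addresses in the log are supplied by whoever sent the message.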

As an example of what you can achieve once you have this information in a database, Figure 5 shows a graph generated from a database of parsed logs from Amavis.


Figure 5. Spam statistics as a graph

You can see here much more clearly the marked difference between clean passed mail (green) and blocked mail (red).

Coherence

Finally, bear in mind that even employing all these solutions in combination with an existing spam filtering solution might not resolve the problem entirely.

You can improve the situation even further if you take a coherent approach to the system. For example, you've looked at a number of solutions in this article, but keep in mind that:

  • Any auto spam or ham reporting solution should also update the stats, accordingly.
  • Any auto spam or ham reporting solution should update the whitelists and blacklists, if necessary.
  • If using server-side filtering and those rules include full e-mail addresses, use the server-side filters to update the whitelists and blacklists.
  • Consider using a database or a front end to allow users to update the whitelists and blacklists.

Ultimately, you want to make sure that your e-mail system and the spam filtering solution have the right information and the right quality of information to enable them to filter the spam effectively.

Summary

Spam filtering solutions are a necessary evil in today's e-mail climate. It is virtually impossible to avoid the spam and, even if you never publish your e-mail address, the chances are you will get spam.

As you've seen in this article, most spam solutions use a variety of different techniques to filter the spam on its way into your e-mail system, but you can improve the quality of the filtering by working with your users and the spam filters. Providing a method for reporting missed spam, automatically updating blacklists and whitelists, and using a secondary filter system can all help to reduce the amount of spam that reaches your inbox. Using the techniques in this article either removes the responsibility from you to manually update the systems, or empowers you to help improve the overall spam filtering solution.



Resources

Get products and technologies
  • IBM trial software: Build your next development project with software for download directly from developerWorks.

  • Amavisd: This tool is a mail filtering solution that interfaces with tools such as SpamAssassin, Razor, and numerous virus scanners.

  • SpamAssassin: This tool is a highly configurable spam scanner that uses a variety of techniques, including basic matching and Bayesian scanning to give e-mails a score indicating their likelihood of being spam.

  • nmap: This tool scans network hosts and ports and provides you information about potential unauthorized hosts and services.


About the author

Martin Brown has been a professional writer for more than seven years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms—Perl, Python, Java™, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows®, Solaris, Linux, BeOS, Mac OS X and more—as well as Web programming, systems management, and integration. He is a Subject Matter Expert (SME) for Microsoft® and regular contributor to ServerWatch.com, LinuxToday.com, and IBM developerWorks. He is also a regular blogger at Computerworld, The Apple Blog, and other sites. You can contact him through his Web site.






IBM, AIX, and Redbooks are registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others.