Level: Intermediate
Martin Brown (mc@mcslp.com), Freelance Writer, Consultant
22 Jan 2008
Look beyond tools like SpamAssassin and Amavis to see how you can
extend them and provide additional filtering facilities to lower the amount of spam
hitting the e-mail boxes of your users. Most companies use spam and virus filtering
services on their UNIX® platforms, but there are some methods that you can use
that help improve your filtering scores and might even eliminate spam reaching
inboxes.
About this series
The typical UNIX® administrator has a key range of utilities, tricks, and
systems he or she uses regularly to aid in the process of administration. There
are key utilities, command-line chains, and scripts that are used to simplify
different processes. Some of these tools come with the operating system, but a
majority of the tricks come through years of experience and a desire to ease the
system administrator's life. The focus of this series is on getting the most from
the available tools across a range of different UNIX environments, including
methods of simplifying administration in a heterogeneous environment.
Spam and virus filtering fundamentals
There are many different tools and systems available for the filtering and
removal of spam e-mail at the UNIX server level. Tools like SpamAssassin and more
detailed agents, such as Amavis (which includes interfaces to SpamAssassin,
various virus scanners, and other spam tools such as Razor), use a variety of
different methods to identify and capture spam.
These methods include, but are certainly not limited to:
- Direct matches: This involves looking for specific text in the
message, either in the body, the message headers, subject, or even e-mail
addresses. Certain spam e-mails and programs that produce them use the same
templates, fake headers, and sometimes even the same (wrong) codes and
structure.
- Pattern matches: This involves looking for patterns of text,
such as the four-letter codes used for share listings, or the various patterns
used to describe the various pills and drugs often sold using spam.
- Fingerprint matches: This involves looking for more complex
combinations of words and phrases. It's very common to have a number of e-mails
all containing the same basic information but with minor changes, such as name,
age, or date. By creating a set fingerprint for the basic structure of the
e-mail, you can identify it as spam.
- Bayesian techniques: A Bayesian filter uses Bayes' theorem to weigh the
words in an e-mail that it has previously seen in spam against the words it has
seen in normal e-mail, thus giving the e-mail a probability value indicating how
likely it is that the e-mail is spam. The principle is very simple, but it's also
very effective, as a lot of spam contains the same words and sometimes the same
repetition of words.
- DNS blacklists: These are published lists of hosts known to send
or forward spam.
- Whitelisting and blacklisting: The theory is very simple. Adding
an e-mail address to the whitelist indicates that it's an address that you trust,
while the blacklist contains e-mail addresses you don't trust.
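As a rough illustration of the Bayesian item in the list above, the following sketch combines per-word probabilities into a single spam probability. It is a simplified naive Bayes calculation with equal priors, not the algorithm SpamAssassin actually uses; the word counts, the totals, and the function names are all made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical training data: the number of spam and ham messages in
# which each word has been seen. A real filter builds these counts
# from the mail that users report.
my %spam_count = (pills => 40, cheap => 30, meeting => 2);
my %ham_count  = (pills => 1,  cheap => 5,  meeting => 45);
my ($spam_total, $ham_total) = (50, 50);

# Probability that a message containing $word is spam. Add-one
# smoothing means that a word never seen before comes out neutral (0.5).
sub word_spam_probability
{
    my ($word) = @_;
    my $in_spam = (($spam_count{$word} || 0) + 1) / ($spam_total + 2);
    my $in_ham  = (($ham_count{$word}  || 0) + 1) / ($ham_total + 2);
    return $in_spam / ($in_spam + $in_ham);
}

# Combine the per-word probabilities for a whole message
sub message_spam_probability
{
    my ($text) = @_;
    my ($spam, $ham) = (1, 1);
    foreach my $word (split(/\W+/, lc($text)))
    {
        next if ($word eq '');
        my $p = word_spam_probability($word);
        $spam *= $p;
        $ham  *= (1 - $p);
    }
    return $spam / ($spam + $ham);
}

printf("%.3f\n", message_spam_probability('Cheap pills'));
printf("%.3f\n", message_spam_probability('Meeting at noon'));
```

With these sample counts, the first message scores close to 1 and the second close to 0. Teaching the filter simply means updating the counts.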
It is almost impossible to rely on a single one of the techniques above to
identify spam perfectly, but you can increase the quality of the spam filter by
using combinations.
To demonstrate how a single piece of information is not enough, consider an
e-mail from a friend. You might have added their address to your whitelist, but
what if the spam sender has spoofed (faked) the e-mail address of your friend? If you
just used the black and whitelist solution, the spam would most likely get
through. But, if the spam contained an advert for some drugs, then it's highly
likely that one of the matching mechanisms or Bayesian filters would identify the
e-mail as spam.
To work effectively, most solutions use some kind of scoring mechanism. In
general, the higher the score, the more likely the e-mail is to be spam. By giving
different scores to different parts of the spam filtering methods, you can
ultimately provide an effective score. For example, let's say that being in the
whitelist gives a score of -10, but matching the Bayesian filter gives a score of
15. Further matches against the pattern and text matching add another 5 points,
and the "score" for the e-mail is now 10 points (-10+15+5). If you set a "spam"
score of 7, then the e-mail is treated as spam and deleted or quarantined,
accordingly.
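The arithmetic in this example can be sketched in a few lines of Perl. The rule names and the helper functions are illustrative only; the weights and the threshold of 7 are the ones used in the paragraph above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Illustrative weights taken from the example in the text
my %rule_score = (
    whitelist       => -10,
    bayes           => 15,
    pattern_matches => 5,
);
my $spam_threshold = 7;

# Add up the scores of every rule that matched the message;
# rules without a configured weight count as zero
sub total_score
{
    my (@matched_rules) = @_;
    return sum(0, map { $rule_score{$_} // 0 } @matched_rules);
}

# A message is treated as spam once it reaches the threshold
sub is_spam
{
    my (@matched_rules) = @_;
    return total_score(@matched_rules) >= $spam_threshold ? 1 : 0;
}

my @matched = qw(whitelist bayes pattern_matches);
printf("score=%d spam=%s\n",
       total_score(@matched),
       is_spam(@matched) ? 'yes' : 'no');
# prints "score=10 spam=yes"
```

A whitelisted message that matches nothing else scores -10 and is delivered, which is why a spoofed sender address alone is not enough to get spam through.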
To help describe how you can improve the process, Figure 1
shows a typical spam filtering solution where the e-mail is filtered as it comes
in, and the e-mail is delivered into individual mailboxes.
Figure 1. Typical spam filtering solution architecture
Setting up the initial system is only part of the solution though. For long-term
planning, you need to be thinking about how you are going to stay on top of the
spam e-mail problem, because the spammers are finding new solutions and tricks all
the time and, even with a spam filtering solution in place, you are unlikely to
reach a solution that's 100 percent effective.
Let's start by making sure that any spam that wasn't caught can be trapped and
reported.
Setting up a report mailbox
The Bayesian spam filtering technique works by comparing "spam" words with
"normal" words. The problem is what do you class as normal and spam words? The
results vary for different people.
For example, if you work for a pharmacy, then you are likely to get a lot of
e-mails containing the names of drugs. Unfortunately, drugs are often sold through
spam e-mail, and genuine messages you might be expecting can get caught by the
system. Fortunately, Bayesian systems base their decisions on past experience, so
the more you teach the Bayesian filter about what is and isn't spam, the better
the filter gets at identifying the e-mails correctly.
All systems that support learning should have some form of script or application
that accepts the message text. SpamAssassin, for example, provides the sa-learn
script, which you can tell to identify the message as spam or ham, accordingly.
You might also have other solutions that benefit from reporting. The Razor spam
filter provides a server-based service where you can report spam so that other
users can benefit from your identifying the e-mail as spam.
For this to work, you need to set up a mail folder or system where users can send
their e-mail so that the e-mail can be scanned and the Bayesian filter taught
whether the e-mail is spam (a spam e-mail that the system thought was genuine) or
ham (a genuine e-mail that the system thought was spam).
In general, it is easier to set up a single mailbox that you can use to hold the
spam and another to hold the ham. In a system that supports user-by-user learning,
you can set up the folders in each mailbox for users. In the examples presented in
this article, let's assume that you are using SpamAssassin and an IMAP-based mail
system so that you can read and parse the contents and have SpamAssassin learn the
details, but the principles could easily be applied to other spam filtering
environments.
Listing 1 shows a simple Perl script that accesses a global
mailbox, downloads each message, and then reports the content to SpamAssassin and
Razor.
Listing 1. A script for reporting and learning spam
#!/usr/bin/perl
use strict;
use warnings;
use Mail::IMAPClient;

my $SpamFolder = "INBOX";
my $Server     = 'imap.mcslp.pri';
my $User       = 'spam';
my $Password   = 'ilovespam';
my $TempDir    = '/tmp/spamreport';

# Open the connection to the mail server
my $SPAMIMAP = Mail::IMAPClient->new(Server   => $Server,
                                     User     => $User,
                                     Password => $Password);
if (!defined($SPAMIMAP))
{
    # Stop here; nothing below can work without a connection
    die "Error: $@\n";
}

# Select the Spam Folder
$SPAMIMAP->select($SpamFolder);

# Get a list of Message IDs
my @MIDs = $SPAMIMAP->messages();

# Exit if there aren't any messages to process
if (scalar(@MIDs) == 0)
{
    exit(0);
}

# Create a temporary directory, readable only by the owner,
# to hold our message text
if (!-d $TempDir)
{
    mkdir($TempDir, 0700) or die "Cannot create $TempDir: $!\n";
}

# Process each ID
foreach my $MID (@MIDs)
{
    # Get the message text, and write the text out to
    # a text file
    my $path = "$TempDir/$MID";
    my $msgtext = $SPAMIMAP->message_string($MID);
    open(my $file, '>', $path) or die "Cannot write $path: $!\n";
    print $file $msgtext;
    close($file);

    # Run the SpamAssassin learn script on the file
    system("cat $path|sa-learn --spam");

    # Run the Razor reporter on the message content
    system("cat $path|razor-report");

    # Delete the original message
    $SPAMIMAP->delete_message($MID);

    # Delete the temporary file
    unlink($path);
}

# Empty the trash and disconnect
$SPAMIMAP->expunge();
$SPAMIMAP->disconnect();
You can use a modified version of the script to report ham in the same way, by
allowing users to copy messages that were incorrectly identified as spam into a
ham folder. You should be more careful with this folder, as it
theoretically contains genuine e-mail, and any user that can write e-mails to the
folder can also potentially read them. With a global mailbox for this purpose, it
is possible for different users to read each other's ham mail.
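One way to make the reporting script handle both cases is to pick the sa-learn option from the folder being processed. The sketch below shows only that part; the folder names and the learn_command helper are hypothetical, while --spam and --ham are the standard sa-learn options. Note that the razor-report step must not be run on ham, because it would report genuine e-mail as spam.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map each report folder (hypothetical names) to the sa-learn option
# that should be used for the messages found in it
my %learn_option = (
    Spam => '--spam',
    Ham  => '--ham',
);

# Build the sa-learn command for a message file as a list, so that it
# can be handed to system() without going through the shell
sub learn_command
{
    my ($folder, $path) = @_;
    my $option = $learn_option{$folder}
        or die "No learning option for folder '$folder'\n";
    return ('sa-learn', $option, $path);
}

print join(' ', learn_command('Ham', '/tmp/hamreport/42')), "\n";

# In the reporting loop, the learning step would then become:
#   system(learn_command($folder, $path)) == 0
#       or warn "sa-learn failed for $path\n";
```

Passing the command as a list also avoids the extra cat process used in the original listing, since sa-learn accepts message files as arguments.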
The system helps to improve the spam filtering by improving the knowledge of the
system about what is and isn't spam, and it can be seen here in an updated version
of the spam solution in Figure 2.
Figure 2. Using auto-reporting and learning mechanisms
A further improvement is to better identify the senders of the e-mail and to help
filter the e-mail before it even gets checked for the usual spam contents.
Updating whitelists and blacklists
The fundamentals of the whitelists and blacklists used in spam filtering are very
simple. The whitelist contains e-mail addresses you trust, while the blacklist
contains e-mail addresses you don't trust. The relative weighting of the
information (that is, the size of the score according to whether
the e-mail address appears in the whitelist or the blacklist) is up to you.
There are some limitations with the whitelists and blacklists:
- Blacklists can become huge, for the simple reason that spammers often use a
multitude of different addresses as the source for their spam. The larger your
blacklists, the longer it takes to parse your spam and, ultimately, this might
become a barrier to using the blacklists at all. In particular, be careful of
using the auto-blacklisting services built into spam solutions, as they add all
e-mail addresses identified as sending spam to the blacklists.
- Auto-blacklisting can also cause a problem if you get valid e-mails that
are identified as spam. In other words, a genuine and trusted e-mail address
gets identified as a spam e-mail address, which can skew your results.
- The same problem applies to auto-whitelisting: missed spam that gets through
means the sender's e-mail address is added to your whitelist, even though the
message was spam.
There is no simple way round these limitations, but you can work to improve the
quality of your whitelists and blacklists by using auto-processing of your e-mails
and updating of your lists.
For example, in a given environment, you can generally auto-populate your
whitelist with:
- All the e-mail addresses in your address book or global address book
- All the e-mail addresses for the users of your system
- All the e-mail addresses of known, good sources, such as clients and
suppliers
Furthermore, if you have access to the e-mail mailboxes of users (for example,
because of the Spam and Ham processing demonstrated earlier in this article), then
you can process the e-mail that has reached their e-mail accounts as the source
for your whitelist. If you choose this method, make sure that you only choose
e-mail that has been identified as valid; one way to do this is with the filtering
solution (see the "Using standard filtering tools" section).
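Auto-populating the whitelist mostly comes down to collecting addresses from these sources and writing them out in the format your filter expects. The following sketch assumes SpamAssassin, where whitelist_from is the configuration directive for trusted senders; the whitelist_lines helper and the sample addresses are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a list of addresses into SpamAssassin whitelist_from lines,
# normalizing case and dropping duplicates and anything that does
# not look like an e-mail address
sub whitelist_lines
{
    my (@addresses) = @_;
    my %seen;
    my @lines;
    foreach my $address (@addresses)
    {
        my $clean = lc($address);
        $clean =~ s/^\s+|\s+$//g;
        next unless ($clean =~ m/^[^\s\@]+\@[^\s\@]+$/);
        next if ($seen{$clean}++);
        push(@lines, "whitelist_from $clean");
    }
    return @lines;
}

# Hypothetical addresses gathered from an address book and user list
my @sources = ('Alice@Example.com', 'alice@example.com ',
               'not-an-address', 'bob@example.com');
print "$_\n" foreach (whitelist_lines(@sources));
```

The output can be written to a file that is included from the SpamAssassin configuration and regenerated whenever the address sources change.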
Using whitelists and blacklists adds another thread to the spam solution (see
Figure 3).
Figure 3. Updating white and blacklists automatically
Finally, the "last mile" of mail delivery can be used to further filter offending
e-mails.
Using standard filtering tools
Using a spam filtering solution catches a significant amount of spam, but many
fail to reach 100 percent reliability. Part of the problem is that spammers are
getting very clever at allowing their spam to make it through the filters.
Fortunately for you, they also have a habit of using a wide range of e-mail
addresses, both for the sender and the recipient, that might not actually match
your address or the address of someone you know. White and blacklists can be an
effective way of trying to alleviate the spam problem, but the sheer range of
addresses in use by most spammers means that spam will often get past the filters
to the user's mailbox.
As a last defense for the filtering and removal of spam, you can take advantage
of the many server-side or client-side filtering mechanisms and file e-mail into
e-mail folders, or file any e-mail that you don't explicitly recognize into a
"quarantine" folder. Users can then manually select the messages (and, if
necessary, update their filters) and use the Spam and Ham folders you've already
created to help improve the quality of the spam filtering solution at the front
end.
There are three ways of doing this:
- File everything you recognize into folders, and leave your inbox as the
quarantine folder of e-mails you do not recognize and which need manual
filtering.
- Ignore anything you recognize (that is, don't filter it), but move
unrecognized e-mail to a quarantine folder.
- Filter everything you recognize into dedicated folders and anything you don't
recognize into a quarantine folder.
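The third approach can be sketched as a simple lookup: known senders and domains map to dedicated folders, and everything else goes to quarantine. The rules, the folder names, and the choose_folder function below are hypothetical; in practice this logic usually lives in the filtering rules of your mail server or mail client.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rules: a full sender address or a sender domain
# mapped to the folder that its mail should be filed into
my %folder_for = (
    'boss@example.com' => 'Work',
    'example.org'      => 'Suppliers',
);

# Pick a folder for a sender: an exact address match wins, then the
# domain, and anything unrecognized goes to the quarantine folder
sub choose_folder
{
    my ($sender) = @_;
    my $address = lc($sender);
    my ($domain) = ($address =~ m/\@([^\@]+)$/);
    return $folder_for{$address} if (exists $folder_for{$address});
    return $folder_for{$domain}
        if (defined($domain) && exists $folder_for{$domain});
    return 'Quarantine';
}

foreach my $sender ('Boss@Example.com', 'sales@example.org',
                    'jqyay@quintiles.com')
{
    printf("%-22s -> %s\n", $sender, choose_folder($sender));
}
```

For the first approach, the fallback would be the inbox itself rather than a separate quarantine folder.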
Using a filtering system on your server or client also means that you can take
advantage of simpler rules to dispose of spam that gets through the spam filters.
Some spam makes it through by tricking spam filters into thinking it is genuine or
has already been scanned and given a low mark. You can use your filters to get rid
of this, as it is often more simply identifiable.
Also, some unwanted e-mail is not spam at all. It can be impossible to get removed
from some mailing lists, even after contacting the companies involved, and even if
it was a genuine mailing list that you simply no longer want to receive. Occasionally,
too, a user mistakenly adds your address to their address book and you
end up essentially getting somebody else's mail.
Irrespective of the method you use, it adds another layer (and, more importantly,
another filter) to your mail infrastructure, giving you a final spam filtering
solution like the one shown in Figure 4.
Figure 4. Mailbox filtering in our spam filtering architecture
With all of this filtering in place, it is easy to forget that your changes
should be measurable.
Obtaining statistics and generating reports
When filtering e-mail for spam and viruses, it's very easy to forget that a metric
for monitoring how effective your solutions are at removing spam is a good idea.
Recording and measuring can also be an effective way of determining whether there
are any specific trends and, in some cases, can be used to develop completely
different ways of examining the e-mail as it comes in to make the process more
effective.
If you are using a tool like Amavis, then the information about how each e-mail
has been treated can be extracted by parsing the contents of the log file.
Listing 2 shows a single line from the Amavis log file.
Listing 2. Line from the Amavis log file
Nov 26 11:33:45 constable.example.com /usr/bin/amavisd[2257]: (02257-04)Blocked SPAM,
[83.237.69.122] [64.18.7.11] <jqyay@quintiles.com> ->
<null@gendarme.example.com>, quarantine: quarantine@gendarme.example.com,
Message-ID: <1d9b01c83020$3150e150$c0a8008f@Ned>, mail_id: YDOXKqndoiPU,
Hits: 69.428, 11621 ms
The "Blocked SPAM" is the useful fragment of the log output, as it tells you both
what the e-mail was identified as and what happened to it. The first word tells
you whether the e-mail was blocked or passed. The second describes the type,
including spam, infected (virus), bad header, banned, or clean.
Listing 3 shows a Perl script that extracts this information
and summarizes it.
Listing 3. Perl script that extracts log output
#!/usr/bin/perl
my $stats = {};
while(<STDIN>)
{
    chomp;
    next unless (m{/usr/bin/amavisd\[\d+\]: \(\d+-\d{2}\)});
    if (m/(Passed|Blocked) [A-Z]+/)
    {
        my ($proc_mode,$proc_type) = (m/(Passed|Blocked) ([-A-Z]+)/);
        $stats->{$proc_mode}->{$proc_type}++;
    }
}
foreach my $mode (sort keys %{$stats})
{
    my $modetotal = 0;
    print "$mode\n";
    foreach my $type (sort keys %{$stats->{$mode}})
    {
        printf("\t%-20s %7d\n",$type,$stats->{$mode}->{$type});
        $modetotal += $stats->{$mode}->{$type};
    }
    printf("\t%-20s %7d\n",'Total',$modetotal);
}
To run the script, pipe the contents of the log file through it:
$ cat amavis.log | perl parse_amavis.pl
Blocked
        BANNED                 32793
        CLEAN                      1
        INFECTED                 766
        SPAM                   85499
        Total                 119059
Passed
        BAD-HEADER              1415
        CLEAN                  70588
        SPAM                     356
        Total                  72359
Unfortunately, this shows that more than 62 percent of the e-mail received was
blocked, most of it because it was identified as spam.
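The proportions behind a statement like this can be worked out directly from the per-category counts. The sketch below hard-codes the counts from the sample output; the mode_total and percent helpers are illustrative names and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Counts copied from the sample output
my %stats = (
    Blocked => { BANNED => 32793, CLEAN => 1, INFECTED => 766,
                 SPAM => 85499 },
    Passed  => { 'BAD-HEADER' => 1415, CLEAN => 70588, SPAM => 356 },
);

# Total number of messages handled in a given mode
sub mode_total
{
    my ($mode) = @_;
    return sum(0, values %{$stats{$mode}});
}

# Percentage of $part in $whole, guarding against an empty log
sub percent
{
    my ($part, $whole) = @_;
    return $whole ? 100 * $part / $whole : 0;
}

my $received = mode_total('Blocked') + mode_total('Passed');
printf("Received:        %d\n", $received);
printf("Blocked overall: %.1f%%\n",
       percent(mode_total('Blocked'), $received));
printf("Blocked as spam: %.1f%%\n",
       percent($stats{Blocked}{SPAM}, $received));
printf("Spam passed:     %.1f%%\n",
       percent($stats{Passed}{SPAM}, $received));
```

Separating the overall blocked share from the share blocked as spam shows how much of the filtering is done by the banned and infected checks rather than by the spam score.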
With a little additional work, you can extract the information from the log and
write that data into a database. Listing 4 shows the
skeleton of a script that parses the Amavis log information in more detail.
Listing 4. Skeleton of a script that parses the Amavis log
#!/usr/bin/perl
use Time::ParseDate;
while(<STDIN>)
{
    chomp;
    next unless (m{/usr/bin/amavisd\[\d+\]: \(\d+-\d{2}\)});
    next if (m{(mcfilter|slpfliter)});
    if (m/(Passed|Blocked) [A-Z]+/)
    {
        my ($datetime,$host,$proc_mode,$proc_type,
            $sender,$recip,$hits,$msgid,$mailid);
        my @blocks = split(/\s+/);
        # Extract the date
        $datetime = parsedate(sprintf("%s %s %s",@blocks[0..2]));
        # Extract the host
        $host = $blocks[3];
        # Extract processing information
        ($proc_mode,$proc_type) = (m/(Passed|Blocked) ([-A-Z]+)/);
        # Extract the sender/recipient information
        ($sender,$recip) = (m/<(.*?)> -> <(.*?)>/);
        # Extract the spam score; a score logged as a bare '-'
        # (no score recorded) is treated as zero
        ($hits) = (m/Hits: ([-0-9.]+),/);
        $hits = 0 if ($hits eq '-');
        # Now write the information into a database...
    }
}
Writing that information into a database table is left as an exercise for the
reader. If you decide to use this script, you should record the date, e-mail
addresses, spam score, processing mode and type, and other information. The more
you record, the more information you can extract from the database later.
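As a starting point, it can help to collect the fields for one log line into a single record before deciding how to store it. The sketch below does that for the line shown in Listing 2, using only regular expressions that match that sample; the parse_amavis_line function and the field names are assumptions, and other Amavis versions may log in a different format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one Amavis log line into a hash reference of fields,
# or return nothing if the line is not a Passed/Blocked entry
sub parse_amavis_line
{
    my ($line) = @_;
    my ($mode, $type) = ($line =~ m/(Passed|Blocked) ([-A-Z]+)/);
    return unless (defined($mode));
    my ($sender, $recipient) = ($line =~ m/<(.*?)> -> <(.*?)>/);
    my ($hits) = ($line =~ m/Hits: ([-0-9.]+),/);
    # A missing score, or one logged as a bare '-', is stored as zero
    $hits = 0 if (!defined($hits) || $hits eq '-');
    my ($mail_id) = ($line =~ m/mail_id: ([^,\s]+)/);
    return {
        mode      => $mode,
        type      => $type,
        sender    => $sender,
        recipient => $recipient,
        hits      => $hits + 0,
        mail_id   => $mail_id,
    };
}

# The line from Listing 2, joined back into a single line
my $sample = 'Nov 26 11:33:45 constable.example.com '
    . '/usr/bin/amavisd[2257]: (02257-04)Blocked SPAM, '
    . '[83.237.69.122] [64.18.7.11] <jqyay@quintiles.com> -> '
    . '<null@gendarme.example.com>, '
    . 'quarantine: quarantine@gendarme.example.com, '
    . 'Message-ID: <1d9b01c83020$3150e150$c0a8008f@Ned>, '
    . 'mail_id: YDOXKqndoiPU, Hits: 69.428, 11621 ms';

my $record = parse_amavis_line($sample);
foreach my $field (sort keys %{$record})
{
    printf("%-10s %s\n", $field, $record->{$field});
}
```

Each key in the record maps naturally to a column, so the values can be bound to an INSERT statement through whichever database module you use.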
As an example of what you can achieve once you have this information in a
database, Figure 5 shows a graph generated from a database of
parsed logs from Amavis.
Figure 5. Spam statistics as a graph
You can see here much more clearly the marked difference between clean passed
mail (green) and blocked mail (red).
Coherence
Finally, bear in mind that even employing all these solutions in combination with
an existing spam filtering solution might not resolve the problem entirely.
You can improve the situation even further if you take a coherent approach to the
system. For example, you've looked at a number of solutions in this article, but
keep in mind that:
- Any auto spam or ham reporting solution should also update the stats, accordingly.
- Any auto spam or ham reporting solution should update the whitelists and
blacklists, if necessary.
- If using server-side filtering and those rules include full e-mail addresses,
use the server-side filters to update the whitelists and blacklists.
- Consider using a database or a front end to allow users to
update the whitelists and blacklists.
Ultimately, you want to make sure that your e-mail system and the spam filtering
solution have the right information and the right quality of information to enable
them to filter the spam effectively.
Summary
Spam filtering solutions are a necessary evil in today's e-mail climate. It is
virtually impossible to avoid the spam and, even if you never publish your e-mail
address, the chances are you will get spam.
As you've seen in this article, most spam solutions use a variety of different
techniques to filter the spam on its way into your e-mail system, but you can
improve the quality of the filtering by working with your users and the spam
filters. Providing a method for reporting missed spam, automatically updating
black and whitelists, and using a secondary filter system can all help to reduce
the amount of spam that reaches your inbox. Using the techniques in this article
either removes the responsibility from you to manually update the systems, or
empowers you to help improve the overall spam filtering solution.
About the author
Martin Brown has been a professional writer for more than seven years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms (Perl, Python, Java™, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows®, Solaris, Linux, BeOS, Mac OS X and more) as well as Web programming, systems management, and integration. He is a Subject Matter Expert (SME) for Microsoft® and regular contributor to ServerWatch.com, LinuxToday.com, and IBM developerWorks. He is also a regular blogger at Computerworld, The Apple Blog, and other sites. You can contact him through his Web site.