A Last Go At Spam Filtering Before Whitelisting

by Danny Sullivan on January 19, 2006

in Email

Yesterday was the last day I let my three mail services run without training
them to understand false matches. It was also the first time that SpamCop ran
with all previous training removed. Despite that, it still performed great.
False matches were up a bit with Gmail, while Yahoo Mail was simply appalling in
performance. The stats are below, along with a first look at "whitelisting" mail by
identifying false matches. Those new to the series should see my Email category
and read from the oldest up.

Let’s start with the stats. For more on what the figures mean, see this page.
These are for January 19, 10:30am UK time through January 20, 11:30am UK time.

Service          Yahoo    Gmail    SpamCop
Inbox            507      297      301
Spam             144      376      374
False Match      18       10       1
Total Mail       669      673      676
% Spam Caught    22%      56%      55%
% False Match    13%      3%       0.3%
Spam             162      386      375

Both Gmail and SpamCop have generally been catching spam in the 60 percent
range, so yesterday seemed a tough one for them. As for Yahoo, it continues to
catch far less spam. Worse, of the spam it grabs, it has a far higher false match
rate.
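
For anyone who wants to check the percentages, here is a minimal sketch in Python. The formulas are an assumption inferred from the numbers themselves (spam caught as spam divided by total mail, false match as false matches divided by spam); the page linked above remains the authority on what the figures mean. The class and method names are made up for illustration, and Total Mail is used as printed rather than recomputed.

```python
from dataclasses import dataclass


@dataclass
class DayStats:
    """One service's counts for the day, copied from the stats above."""
    inbox: int
    spam: int
    false_match: int
    total_mail: int

    def pct_spam_caught(self) -> float:
        # Assumed formula: spam held, as a share of all mail received.
        return 100 * self.spam / self.total_mail

    def pct_false_match(self) -> float:
        # Assumed formula: false matches relative to the spam count.
        return 100 * self.false_match / self.spam


STATS = {
    "Yahoo": DayStats(inbox=507, spam=144, false_match=18, total_mail=669),
    "Gmail": DayStats(inbox=297, spam=376, false_match=10, total_mail=673),
    "SpamCop": DayStats(inbox=301, spam=374, false_match=1, total_mail=676),
}

for name, s in STATS.items():
    print(f"{name}: {s.pct_spam_caught():.1f}% spam caught, "
          f"{s.pct_false_match():.1f}% false match")
```

Rounded to the precision used above, these work out to the same percentages that are listed for each service.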

What did Yahoo nab as spam yesterday that wasn’t?

  • Editorial newsletters from MediaPost, Search Engine Guide, ClickZ and our
    own SearchDay newsletter
  • Company newsletters from Handango and iTunes
  • Several work emails, replies from people I’d messaged
  • An email I’d sent to myself

For the first time, I’ve now used the "Not Spam" button that Yahoo Mail Beta
offers to identify the false matches. Several of the newsletters it caught
yesterday as spam arrive every day. When I check things tomorrow, we’ll see if
using Not Spam has helped train Yahoo not to catch them. Annoyingly, there’s no
way to see that the addresses have actually been added to any whitelist.

Gmail did far better than Yahoo on the false match front, but it still
grabbed a number of items.

  • Several press releases from various companies. Perhaps I should consider
    that a feature, rather than a bug. There was only one I was mildly interested
    in.
  • A Hitwise newsletter
  • Some work email replies
  • A message I sent to myself

Like Yahoo, Gmail has a Not Spam button you can use to indicate if something
was nabbed by mistake. I’ve used it on the false matches. Also like Yahoo,
there’s unfortunately no way to see exactly what addresses (if any) have been
added to a personal whitelist. In addition, I continue to find it annoying that I can’t
sort messages in my spam folder by subject, as you can with Yahoo and SpamCop.
Sorting makes it much easier to scan and spot things falsely held as spam,
especially since items in non-Latin languages get grouped together.

Over at SpamCop, removing all my previous filters didn’t cause the false match
rate to go up. SpamCop held only one item, a message I’d sent myself. I used the
"Release and Whitelist" feature to train SpamCop about this. I also love that by
going into Options, then SpamCop Tools, then Manage Your Personal Whitelist, I
can see that my address was indeed added to the whitelist.

Way back, I wrote that despite the spam catching at either SpamCop or Gmail,
both would let some spam through. Here are some stats from yesterday to ponder:

Gmail Inbox        297
MailWasher Spam    76
Real Mail          221
% Spam In Inbox    26%

What’s this showing? In the first chart, you could see that Gmail stopped a
lot of spam from getting into my inbox at all. That 297 figure for my inbox
represents how much mail was allowed through what’s effectively my first line of
defense, Gmail’s own spam filtering.

My second line of defense, as I’ve written, is Mailwasher. It has its own
spam filtering features, along with a blacklist I’ve built up over years. Using
it, I stopped another 76 items from hitting my Outlook mail application. In
other words, 26 percent of what Gmail thought was "clean" wasn’t. I’ve never
tested this with SpamCop, but in my experience, it probably lets about 20
percent through.

I also wanted to add a bit on why I got started using Mailwasher but still
want filtering on my server as well, as I explained in comments on Jeremy’s blog:

Ideally, I want my mailserver to keep the junk from ever showing up in my
inbox at all. That doesn’t happen, which is why I use Mailwasher. But until
about a year and a half ago, I had no broadband. Spam wasn’t just annoying.
With the amount I get, downloading the junk over 56K was time consuming.

That’s why I could never rely on something like Cloudmark (which I ran for
years) or Outlook 2003’s native spam filtering, at least as a first line of
defense. I needed to keep spam out before downloading. Filtering after
download was a last line of defense. If spam fought through SpamCop, then
through Mailwasher, finally Cloudmark or Outlook would help me, but only a few
items got that far. I’m quite the Mailwasher fan.

So some of my narrowband habits still remain. I could just pull everything
down and filter, but I still prefer to keep it out. And Mailwasher is a fast
way to review what’s made it past the mail server spam filters and delete
easily (plus quickly preview mail). I only wish Mailwasher could report back
to Gmail the spam that’s getting through, so Gmail would learn. But that seems
down to Gmail having an API for developers to use.

Ironically, there’s such an easy way for Gmail or SpamCop to cut down the spam
still getting past their filters. Just give me an option to filter out messages
predominantly using Asian or Cyrillic characters. That’s what’s getting through.
My assumption is that the spam filters they are using just don’t work well in
non-English or non-Latin languages.
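
To make the idea concrete, here is a rough sketch in Python of what such an option might check. It is not how Gmail or SpamCop actually work. The function names, the 50 percent threshold, and the choice of which scripts count as "Asian" (CJK ideographs, Japanese kana and Korean Hangul here) are all assumptions made for illustration.

```python
import unicodedata

# Assumed set of scripts to flag, matched against Unicode character names.
FLAGGED_NAME_PREFIXES = ("CYRILLIC", "CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def non_latin_share(text: str) -> float:
    """Return the fraction of letters in text that belong to a flagged script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    flagged = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(FLAGGED_NAME_PREFIXES)
    )
    return flagged / len(letters)


def looks_predominantly_non_latin(text: str, threshold: float = 0.5) -> bool:
    """True when more than `threshold` of the letters are in a flagged script."""
    return non_latin_share(text) > threshold


if __name__ == "__main__":
    for subject in ("Re: your newsletter question", "Привет, как дела?"):
        print(subject, "->", looks_predominantly_non_latin(subject))
```

Working on the share of letters rather than the mere presence of one character means a mostly English message that quotes a foreign word or name would not be held, which matters for anyone who gets legitimate mail in those languages.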

With SpamCop, I can kind of rig this by finding a unique character, the Asian
or Cyrillic equivalent of the letter "e" in English, the most popular letter
used. Unfortunately, the filtering only happens when you log into web mail.
With Gmail, I might be able to make that type of filter work despite doing POP
downloads. I’ll try later. But unfortunately, I can’t make it automatically move
items to the spam folder for possible review. I have to tag them (which means
they’re still in the inbox) or throw them in the trash (which may work, but it’s
another thing to review).

Finally, I leave you with this:

Look at the top. At first I thought these must be ads, but there’s no ad
coding. That example just leads here. They’ve been out since at least April 2005.
