A Last Go At Spam Filtering Before Whitelisting

Yesterday was the last day I let my three mail services run without training them to understand false matches. It was also the first time that SpamCop ran with all previous training removed. Despite that, it still performed well. False matches were up a bit with Gmail, while Yahoo Mail’s performance was simply appalling. The stats are below, along with a first look at “whitelisting” mail by identifying false matches. Those new to the series should see my Email category and read from the oldest up.

Let’s start with the stats. For more on what the figures mean, see this page. These are for January 19, 10:30am UK time through January 20, 11:30am UK time.

[Table: False Match | Total Mail | % Spam Caught | % False Match]

Both Gmail and SpamCop have generally been catching spam in the 60 percent range, so yesterday seemed a tough one for them. As for Yahoo, it continues to catch far less spam. Worse, of the spam it grabs, it has a far higher false match rate.

What did Yahoo nab as spam yesterday that wasn’t?

  • Editorial newsletters from MediaPost, Search Engine Guide, ClickZ and our own SearchDay newsletter
  • Company newsletters from Handango and iTunes
  • Several work emails, replies from people I’d messaged
  • An email I’d sent to myself

For the first time, I’ve now used the “Not Spam” button that Yahoo Mail Beta offers to identify the false matches. Several of the newsletters it caught yesterday as spam come each day. We’ll see if using Not Spam helps train Yahoo not to catch these when I check things tomorrow. Annoyingly, there’s no way to see that the addresses have actually been added to any whitelist.

Gmail did far better than Yahoo on the false match front, but it still grabbed a number of items.

  • Several press releases from various companies. Perhaps I should consider that a feature, rather than a bug. There was only one I was mildly interested in.
  • A Hitwise newsletter
  • Some work email replies
  • A message I sent to myself

Like Yahoo, Gmail has a Not Spam button you can use to indicate if something was nabbed by mistake. I’ve done that for the false matches. Also like Yahoo, there’s unfortunately no way to see exactly what addresses (if any) have been added to a personal whitelist. In addition, I continue to find it annoying that I can’t sort messages in my spam folder by subject, as you can with Yahoo and SpamCop. Sorting by subject makes it much easier to scan and spot things falsely held as spam, especially since items in non-Latin languages get grouped together.

Over at SpamCop, removing all my previous filters didn’t cause the false match rate to go up. SpamCop held only one item, a message I’d sent myself. I used the “Release and Whitelist” feature to train SpamCop about this. I also love that by going into Options, then SpamCop Tools, then Manage Your Personal Whitelist, I can see that my address was indeed added to the whitelist.

Way back, I wrote that despite the spam catching at either SpamCop or Gmail, both would let some spam through. Here are some stats from yesterday to ponder:

[Table: Gmail Inbox | MailWasher Spam | Real Mail | % Spam In Inbox]

What’s this showing? In the first chart, you could see that Gmail stopped a lot of spam from getting into my inbox at all. That 297 figure for my inbox represents how much mail was allowed through what’s effectively my first line of defense, Gmail’s own spam filtering.

My second line of defense, as I’ve written, is Mailwasher. It has its own spam filtering features, along with a blacklist I’ve built up over years. Using it, I stopped another 76 items from hitting my Outlook mail application. In other words, 26 percent of what Gmail thought was “clean” wasn’t. I’ve never tested this with SpamCop, but in my experience, it probably lets about 20 percent through.
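
As a quick check on that 26 percent figure, here is a minimal sketch of the arithmetic in Python. The function name `leak_rate` is only an illustrative label, and the `real_mail` value is derived on the assumption that everything Mailwasher stopped really was spam and nothing else in the inbox was.

```python
def leak_rate(inbox_total, caught_downstream):
    """Share of mail passed by the first filter that the second filter stopped."""
    return caught_downstream / inbox_total


gmail_inbox = 297      # items Gmail let through to the inbox
mailwasher_spam = 76   # items Mailwasher then stopped

# Assumption: whatever Mailwasher did not stop counts as real mail.
real_mail = gmail_inbox - mailwasher_spam

print(f"Spam in inbox: {leak_rate(gmail_inbox, mailwasher_spam):.0%}")  # 26%
print(f"Real mail (derived): {real_mail}")
```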

I also wanted to add a bit on why I got started using Mailwasher but still want filtering on my server as well, as I explained in comments on Jeremy’s blog:

Ideally, I want my mailserver to keep the junk from ever showing up in my inbox at all. That doesn’t happen, which is why I use Mailwasher. But until about a year and a half ago, I had no broadband. Spam wasn’t just annoying. With the amount I get, downloading the junk over 56K was time consuming.

That’s why I could never rely on something like Cloudmark (which I ran for years) or Outlook 2003’s native spam filtering, at least as a first line of defense. I needed to keep spam out before downloading. Filtering after download was a last line of defense. If a message fought its way through SpamCop, then through Mailwasher, Cloudmark or Outlook would finally help me, but only a few items got that far. I’m quite the Mailwasher fan.

So some of my narrowband habits still remain. I could just pull everything down and filter, but I still prefer to keep it out. And Mailwasher is a fast way to review what’s made it past the mail server spam filters and delete it easily (plus quickly preview mail). I only wish Mailwasher could report back to Gmail the spam that’s getting through, so Gmail would learn. But that seems down to Gmail having an API for developers to use.

Ironically, there’s such an easy way for Gmail or SpamCop to cut down the spam still getting past their filters. Just give me an option to filter out messages predominantly using Asian or Cyrillic characters. That’s what’s getting through. My assumption is that the spam filters they are using just don’t work well in non-English or non-Latin languages.

With SpamCop, I can kind of rig this by finding a unique character, the Asian or Cyrillic equivalent of the letter “e” in English, the most popular letter used. Unfortunately, the filtering only happens when you log into web mail. With Gmail, I might be able to make that type of filter work despite doing POP downloads. I’ll try later. But unfortunately, I can’t make it automatically move items to the spam folder for possible review. I have to tag them (which means they’re still in the inbox) or throw them in the trash (which may work, but it’s another thing to review).
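
For what it’s worth, the kind of option I’m wishing for, one that flags messages made up mostly of Asian or Cyrillic characters, could be roughly sketched in Python like this. It is not how Gmail or SpamCop work. The function names, the list of script prefixes and the 50 percent threshold are all assumptions of mine for illustration, and it only looks at plain text that has already been decoded.

```python
import unicodedata

# Assumed, non-exhaustive list of Unicode character-name prefixes to flag.
FLAGGED_PREFIXES = ("CYRILLIC", "CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def flagged_share(text):
    """Return the fraction of letters in text that belong to a flagged script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    flagged = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(FLAGGED_PREFIXES)
    )
    return flagged / len(letters)


def looks_non_latin(text, threshold=0.5):
    """True when more than `threshold` of the letters are in a flagged script."""
    return flagged_share(text) > threshold


if __name__ == "__main__":
    for subject in ("Meeting notes for Friday", "Купите сейчас"):
        print(subject, "->", looks_non_latin(subject))
```

Counting the share of letters, rather than hunting for one common character, avoids having to guess the local equivalent of “e” for each language, though a real filter would need to handle mixed-language mail from legitimate senders.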

Finally, I leave you with this:

Look at the top. At first I thought these must be ads, but there’s no ad coding. That example just leads here. They’ve been out since at least April 2005.