A Last Go At Spam Filtering Before Whitelisting

Yesterday was the last day I let my three mail services run without training them to understand false matches. It was also the first time that SpamCop ran with all previous training removed. Despite that, it still performed great. False matches were up a bit with Gmail, while Yahoo Mail was simply appalling in performance. The stats are below, along with a first look at "whitelisting" mail by identifying false matches. Those new to the series should see my Email category and read from the oldest up.

Let's start with the stats. For more on what the figures mean, see this page. These are for January 19, 10:30am UK time through January 20, 11:30am UK time.

Service          Yahoo    Gmail    SpamCop
Inbox            507      297      301
Spam             144      376      374
False Match      18       10       1
Total Mail       669      673      676
% Spam Caught    22%      56%      55%
% False Match    13%      3%       0.3%
Spam             162      386      375

Both Gmail and SpamCop have generally been catching spam in the 60 percent range, so yesterday seemed a tough one for them. As for Yahoo, it continues to catch far less spam. Worse, of the spam it grabs, it has a far higher false match rate.
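For anyone who wants to check the percentages, here is a minimal sketch in Python. The formulas are an assumption inferred from the numbers in the table above (spam as a share of total mail, and false matches as a share of spam), not taken from the page that explains the figures. The function names are made up for illustration.

```python
# Figures copied from the table above: (inbox, spam, false_match, total_mail)
stats = {
    "Yahoo": (507, 144, 18, 669),
    "Gmail": (297, 376, 10, 673),
    "SpamCop": (301, 374, 1, 676),
}


def pct_spam_caught(spam: int, total_mail: int) -> float:
    # Assumed formula: the spam count as a share of all mail received.
    return 100 * spam / total_mail


def pct_false_match(false_match: int, spam: int) -> float:
    # Assumed formula: false matches as a share of the spam count.
    return 100 * false_match / spam


for service, (inbox, spam, false_match, total_mail) in stats.items():
    caught = pct_spam_caught(spam, total_mail)
    false_rate = pct_false_match(false_match, spam)
    print(f"{service}: {caught:.1f}% spam caught, {false_rate:.1f}% false match")
```

Under those assumed formulas, each result lands within rounding of the percentage shown in the table.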

What did Yahoo nab as spam yesterday that wasn't?

  • Editorial newsletters from MediaPost, Search Engine Guide, ClickZ and our own SearchDay newsletter
  • Company newsletters from Handango and iTunes
  • Several work emails, replies from people I'd messaged
  • An email I'd sent to myself

For the first time, I've now used the "Not Spam" button that Yahoo Mail Beta offers to identify the false matches. Several of the newsletters it caught yesterday as spam come each day. We'll see if using Not Spam helps train Yahoo not to catch these when I check things tomorrow. Annoyingly, there's no way to see that the addresses have actually been added to any whitelist.

Gmail did far better than Yahoo on the false match front, but it still grabbed a number of items.

  • Several press releases from various companies. Perhaps I should consider that a feature, rather than a bug. There was only one I was mildly interested in.
  • A Hitwise newsletter
  • Some work email replies
  • A message I sent to myself

Like Yahoo, Gmail has a Not Spam button you can use to indicate if something was nabbed by mistake. I've done that for the false matches. Also like Yahoo, there's unfortunately no way to see exactly what addresses (if any) have been added to a personal whitelist. In addition, I continue to find it annoying that I can't sort messages in my spam folder by subject, as you can with Yahoo and SpamCop. Sorting by subject makes it much easier to scan and spot things falsely held as spam, especially since items in non-Latin languages get grouped together.

Over at SpamCop, removing all my previous filters didn't cause the false match rate to go up. SpamCop held only one item, a message I'd sent myself. I used the "Release and Whitelist" feature to train SpamCop about this. I also love that by going into Options, then SpamCop Tools, then Manage Your Personal Whitelist, I can see that my address was indeed added to the whitelist.

Way back, I wrote that despite the spam catching at either SpamCop or Gmail, both would let some spam through. Here are some stats from yesterday to ponder:

Gmail Inbox        297
MailWasher Spam    76
Real Mail          221
% Spam In Inbox    26%

What's this showing? In the first chart, you could see that Gmail stopped a lot of spam from getting into my inbox at all. That 297 figure for my inbox represents how much mail was allowed through what's effectively my first line of defense, Gmail's own spam filtering.

My second line of defense, as I've written, is Mailwasher. It has its own spam filtering features, along with a blacklist I've built up over years. Using it, I stopped another 76 items from hitting my Outlook mail application. In other words, 26 percent of what Gmail thought was "clean" wasn't. I've never tested this with SpamCop, but in my experience, it probably lets about 20 percent through.
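The arithmetic behind that 26 percent is simple enough to sketch in a few lines of Python, using only the two counts reported above. The variable names are mine.

```python
gmail_inbox = 297      # mail Gmail let through to the inbox
mailwasher_spam = 76   # items MailWasher then stopped as spam

real_mail = gmail_inbox - mailwasher_spam
pct_spam_in_inbox = 100 * mailwasher_spam / gmail_inbox

print(real_mail)                    # 221
print(f"{pct_spam_in_inbox:.0f}%")  # 26%
```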

I also wanted to add a bit on why I got started using Mailwasher but still want filtering on my server as well, as I explained in comments on Jeremy's blog:

Ideally, I want my mailserver to keep the junk from ever showing up in my inbox at all. That doesn't happen, which is why I use Mailwasher. But until about a year and a half ago, I had no broadband. Spam wasn't just annoying. With the amount I get, downloading the junk over 56K was time consuming.

That's why I could never rely on something like Cloudmark (which I ran for years) or Outlook 2003's native spam filtering, at least as a first line of defense. I needed to keep spam out before downloading. Filtering after download was a last line of defense. If spam fought through SpamCop, then through Mailwasher, Cloudmark or Outlook would finally help me, but only a few items got that far. I'm quite the Mailwasher fan.

So some of my narrowband habits still remain. I could just pull everything down and filter, but I still prefer to keep it out. And Mailwasher is a fast way to review what's made it past the mail server spam filters and delete it easily (plus quickly preview mail). I only wish Mailwasher could report back to Gmail the spam that's getting through, so Gmail would learn. But that seems down to Gmail having an API for developers to use.

Ironically, there's such an easy way for Gmail or SpamCop to reduce the spam still getting past their filters. Just give me an option to filter out messages predominantly using Asian or Cyrillic characters. That's what's getting through. My assumption is that the spam filters they are using just don't work well in non-English or non-Latin languages.
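To show what I mean by "predominantly," here is a rough sketch in Python of that kind of check. It is not how Gmail or SpamCop work. The list of script prefixes, the 50 percent threshold and the function names are all assumptions made for illustration. It leans on the official Unicode character names, which start with the script name for these scripts.

```python
import unicodedata

# Unicode-name prefixes treated as "Asian or Cyrillic" in this sketch (an assumption).
FLAGGED_PREFIXES = ("CYRILLIC", "CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def flagged_share(text: str) -> float:
    """Return the fraction of letters whose Unicode name starts with a flagged prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    flagged = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(FLAGGED_PREFIXES)
    )
    return flagged / len(letters)


def is_predominantly_flagged(text: str, threshold: float = 0.5) -> bool:
    # The threshold is a guess; a real filter would need tuning against real mail.
    return flagged_share(text) > threshold
```

A subject line written entirely in Cyrillic would score 1.0 here, while one in plain English would score 0.0. Anything over the threshold could be held for review rather than deleted.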

With SpamCop, I can kind of rig this by finding a unique character, the Asian or Cyrillic equivalent of the letter "e" in English, the most popular letter used. Unfortunately, the filtering only happens when you log into web mail. With Gmail, I might be able to make that type of filter work despite doing POP downloads. I'll try later. But unfortunately, I can't make it automatically move items to the spam folder for possible review. I have to tag them (which means they're still in the inbox) or throw them in the trash (which may work, but it's another thing to review).
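The "find the most common character" trick can be sketched like this in Python. The sample subjects are made up, and the helper names are hypothetical. The idea is only to pick one frequent non-ASCII letter, then use it in a plain "contains" rule of the sort a web mail filter form accepts.

```python
from collections import Counter

# Made-up sample subjects, for illustration only.
sample_subjects = ["Привет мир", "Добрый день"]


def most_common_non_ascii_letter(texts):
    """Return the most frequent non-ASCII letter across texts, or None if there is none."""
    counts = Counter(
        ch
        for text in texts
        for ch in text.lower()
        if ch.isalpha() and not ch.isascii()
    )
    return counts.most_common(1)[0][0] if counts else None


marker = most_common_non_ascii_letter(sample_subjects)


def contains_marker(subject: str) -> bool:
    # A plain "contains this character" rule, like one typed into a filter form.
    return marker is not None and marker in subject.lower()
```

A single character is a blunt tool. It will miss messages that happen not to use it, which is why I say this only kind of works.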

Finally, I leave you with this:

Look at the top. At first I thought these must be ads, but there's no ad coding. That example just leads here. They've been out since at least April 2005.

By Danny Sullivan on Jan. 19, 2006 | Permalink
See related posts in: Email
