Spam Filtering Test Continues — Stats & Preparing To Test Whitelists

by on January 18, 2006

in Email

First, a big hello to all of you who may have surfed in from Jeremy’s kind link about my test! Yesterday’s stats put Gmail into the lead with me. More on that in a moment, but I wanted to talk about what I’m testing next.

I said at the beginning of my stat gathering exercise that I’d built up a whitelist at SpamCop of about 210 addresses. I also had 3 addresses on my blacklist. As I noted, this might have been a factor with SpamCop doing better in the testing. I also had SpamCop cranked up to use more than its default set of DNS blacklists.

Today, I’ve wiped out my whitelist and blacklist at SpamCop. I actually killed off about half of the whitelist yesterday, then had to move on to other things (you can only kill 10 entries at a time, so it’s a tedious process). But even that may have already had an impact, since the SpamCop false match rate went up.

I also reset SpamCop to use only its own blacklist, rather than the other ones I’d enabled, as explained more here. You can find a full list of what SpamCop lets you use here. I also came across this list from one spam solutions company that shows what percentage of its clients use various blacklists. Here’s a nice PDF guide to how various DNS blacklists work and to some popular ones out there.
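As a rough illustration of what a DNS blacklist lookup involves: the sending server's IPv4 address is reversed octet by octet, the blacklist's zone is appended, and the resulting name is looked up in DNS. An answer means the address is listed. The Python sketch below only builds that query name and does no network lookup. The helper name `dnsbl_query_name` is made up for illustration, `bl.spamcop.net` is SpamCop's public blacklist zone, and 192.0.2.1 is a documentation-only address, not a real sender.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a blacklist client would look up for an IPv4 address.

    Hypothetical helper for illustration; it performs no DNS query itself.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Example with a documentation-only address and SpamCop's zone.
print(dnsbl_query_name("192.0.2.1", "bl.spamcop.net"))
```

Enabling extra blacklists, as I had before the reset, would amount to repeating the same lookup against additional zones.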

The change means that SpamCop is now running in as “pure” or “virgin” a mode as I can configure, similar to how Yahoo and Gmail have been running. In both those other places, I’ve not released any items marked as spam to help train up a whitelist, nor have I flagged any items as spam to create a blacklist.

I will add the caveat that since I’ve had Gmail running since it launched, there’s a chance I might have flagged the occasional item as “not spam” or “spam” over the past nearly two years and not be recalling this. If so, it would be very, very few items, fewer than 10 if even that. The reason is simply that I rarely went into Gmail before now, and I can’t recall ever doing this type of flagging on any of those occasions. One thing that would help is if Gmail let you view any whitelists or blacklists that were created. I don’t see any option like this. Yahoo does have a Blocked Address list but no corresponding whitelist viewing option.

I’m going to let all three services run for one more day without doing any whitelisting. Then I’ll test that going forward, to see what impact it has on the false match rate. I’m not sure I’ve got it in me to test how training up a blacklist impacts spam catching. That will take a lot of time with Yahoo, since so much is getting through. With Gmail, I’d have to spend a day prefiltering on the Gmail site before POP downloading, since my Mailwasher tool can’t automatically send to Google a list of what I’d identified as spam getting past its filters. More on this later, but suffice to say Mailwasher’s interested in building this in (you can report to SpamCop), but making it happen would require some type of Gmail API.

On to the stats. For previous posts on this topic, see my Email category. Below is the summary of mail received yesterday from around 10:30am Tuesday, Jan. 17 through the same time Wednesday, January 18, an entire day.

Gmail put in a great performance: no false matches at all and practically the same amount of spam caught as SpamCop. Why don’t the total mail figures match? One reason is that while I grab these stats within minutes of each other, those minutes can add an item or two. Another reason seems to be that some mail might be delayed in showing up. Still, even I was surprised that Yahoo had over 20 more items yesterday than SpamCop (667 versus 644). My other guess is that some of those items were sent to my Yahoo email address, rather than my forwarding email address.

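| Service       | Yahoo | Gmail | SpamCop |
|---------------|-------|-------|---------|
| Inbox         | 445   | 246   | 222     |
| Spam          | 210   | 410   | 415     |
| False Match   | 12    | 0     | 7       |
| Total Mail    | 667   | 656   | 644     |
| % Spam Caught | 31%   | 63%   | 64%     |
| % False Match | 6%    | 0.0%  | 2%      |
| Spam          | 222   | 410   | 422     |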

What do the figures on the chart mean? See this page.
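For anyone who wants to check the arithmetic, here is a minimal Python sketch that reproduces the derived rows from the three raw counts. The formulas are an assumption inferred from the numbers themselves, not taken from the linked page: total mail as inbox plus spam plus false matches, % spam caught as spam over total mail, % false match as false matches over spam, and the final "Spam" row as spam plus false matches. The names `COUNTS`, `percent`, and `summarize` are made up for this example.

```python
# Raw counts copied from the table above.
COUNTS = {
    "Yahoo":   {"inbox": 445, "spam": 210, "false_match": 12},
    "Gmail":   {"inbox": 246, "spam": 410, "false_match": 0},
    "SpamCop": {"inbox": 222, "spam": 415, "false_match": 7},
}


def percent(numerator, denominator):
    """Whole-number percentage using integer math, rounding halves up."""
    return (200 * numerator + denominator) // (2 * denominator)


def summarize(counts):
    """Derive the computed rows; the formulas are inferred, not official."""
    total = counts["inbox"] + counts["spam"] + counts["false_match"]
    return {
        "total_mail": total,
        "pct_spam_caught": percent(counts["spam"], total),
        "pct_false_match": percent(counts["false_match"], counts["spam"]),
        "spam_folder": counts["spam"] + counts["false_match"],
    }


for service, counts in COUNTS.items():
    print(service, summarize(counts))
```

Under these assumed formulas the results line up with every derived figure in the table, including Gmail's 63%, which is exactly 62.5% before rounding.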


1 Leo Zelevinsky January 18, 2006 at 5:55 pm

My impression of the gmail spam blocker whitelist/blacklist system is that the whitelist is simply your contact list.
So – when you get an email from someone in your contact list (which gmail automatically adds people to whenever you send them email), it is never going to be marked as spam.
When you click ‘Not spam’ I think gmail just adds the sender to the contact list.
So, you inadvertently probably have a considerable-size whitelist in gmail just through normal gmail use even if you never used the “not spam” button.
I’m not aware of any blacklist that it uses or lets you access.
I may not know what I’m talking about :)
