Visit and Assist Spam Control Research

Last week, John Graham-Cumming launched a new site. If you're familiar with "Hot or Not" you'll probably get the idea. As Graham-Cumming says:

The basic idea is to get humans (that means you) to read a small number of messages (some are ham; some are spam) and decide what they are. I'm doing this because there are currently two usable corpuses of spam and ham: the SpamAssassin Public Corpus (which was hand sorted) and the TREC 2005 Public Corpus (which was machine sorted) ... Once I've got enough human decisions (I'd love to get 10 per message; that means almost 1,000,000 human classifications) I'll make all the data public.

In other words, if you visit the site, you can vote on individual messages, saying whether you think each one is spam or legitimate. This voting will be very helpful to spam researchers, because an accurate "corpus" of spam and ham allows them to automatically test new spam control techniques. Graham-Cumming continues:

I'll highlight any emails where people disagree with the current classification published by Gordon Cormack ... I expect it'll throw up some interesting data ... for example, just how good are humans at sorting spam? Since we'll be able to look at where the corpus and the humans disagree we'll be able to spot machine errors and human errors.
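
To make that comparison concrete, here is a rough sketch of how the votes might be tallied. It is my own illustration, not Graham-Cumming's code: the function names, message IDs, and sample votes are all made up, and treating a tied vote as "no consensus" is an assumption on my part.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the votes, or None if the
    list is empty or the top two labels are tied."""
    top = Counter(votes).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # assumption: a tie means no human consensus
    return top[0][0]


def find_disagreements(corpus_labels, human_votes):
    """List the message IDs where the human majority differs from the
    label published in the corpus."""
    flagged = []
    for msg_id, label in corpus_labels.items():
        consensus = majority_label(human_votes.get(msg_id, []))
        if consensus is not None and consensus != label:
            flagged.append(msg_id)
    return flagged


# Hypothetical data: three messages with ten votes each.
corpus_labels = {"msg-001": "spam", "msg-002": "ham", "msg-003": "ham"}
human_votes = {
    "msg-001": ["spam"] * 9 + ["ham"],      # humans agree with the corpus
    "msg-002": ["spam"] * 7 + ["ham"] * 3,  # humans disagree with the corpus
    "msg-003": ["spam"] * 5 + ["ham"] * 5,  # tie, so not flagged
}

disagreements = find_disagreements(corpus_labels, human_votes)
print(disagreements)
```

A flagged message is only a candidate for review. As the quote says, the mistake could be the machine's or the humans'.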

... Richi Jennings, with thanks to John Graham-Cumming
