Spam Filtering | Making a stand

With 3dpixel.net taking on more and more businesses of late, we’ve been talking to customers about spam filtering. Our policy has always been to use spam filters passively - i.e. the filters simply mark an email as spam and leave it to the recipient to act on that with a rule; move to junk folder and mark as read, for example.

Businesses, and even more so individuals, are now taking a more forceful approach to spam, maybe because they simply receive so much of it these days. We’ve actually had people and businesses move away from us to companies like Fasthosts specifically because those companies delete spam email at the network level, despite the risk of false positives.

OK, perhaps our passive approach has no place in the torrent of spam we all receive today. So, 6 months ago we started a little trial on our own @3dpixel.net mailserver with two thresholds: the spam score at which an email is marked as spam, and the score at which it is deleted and the recipient never sees it. Because deleting mail risks false positives, we kept logs, and the scores needed to delete erred on the side of caution.
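To make the two-threshold idea concrete, here is a minimal sketch in Python. The threshold values and the function names are assumptions made up for illustration; they are not the scores we actually run, and the real work is done by the spam filter itself rather than a script like this.

```python
# Hypothetical thresholds, for illustration only.
MARK_THRESHOLD = 5.0     # at or above this score: tag as spam, still deliver
DELETE_THRESHOLD = 15.0  # at or above this score: delete, recipient never sees it


def classify(score: float) -> str:
    """Return 'deliver', 'mark' or 'delete' for a given spam score."""
    if score >= DELETE_THRESHOLD:
        return "delete"
    if score >= MARK_THRESHOLD:
        return "mark"
    return "deliver"


def handle(message_id: str, score: float, deleted_log: list) -> str:
    """Classify a message and record every deletion so false positives can be traced."""
    action = classify(score)
    if action == "delete":
        deleted_log.append((message_id, score))
    return action
```

Keeping the delete threshold well above the mark threshold is what "erring on the side of caution" looks like here: borderline mail is only tagged, and anything deleted leaves a log entry behind.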

With a combination of low-threshold blacklists, Bayesian content filtering, content rules in the ‘many, low score’ format (some modified from SARE, some made from scratch by ourselves) and tweaked threshold levels for deletion, we had a system that was reliably deleting most of the spam we were getting. I say ‘most’; it was 95%+, and anything that got through was still being marked as spam.
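The ‘many, low score’ idea is that no single rule can condemn an email; lots of small hits have to add up. A rough Python sketch of that follows. The patterns and scores are invented for the example and are not taken from SARE or from our own rulesets.

```python
import re

# Hypothetical rules: each one adds only a small amount to the total score.
RULES = [
    (re.compile(r"\bviagra\b", re.I), 0.8),
    (re.compile(r"\bclick here\b", re.I), 0.5),
    (re.compile(r"100% free", re.I), 0.6),
    (re.compile(r"\bunsubscribe\b", re.I), 0.2),
]


def rule_score(body: str) -> float:
    """Sum the scores of every rule whose pattern appears in the message body."""
    total = sum(score for pattern, score in RULES if pattern.search(body))
    return round(total, 2)
```

A message that trips one rule barely moves the total, while a message that trips most of them climbs towards the mark or delete thresholds.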

We ran like this for 5 months with many changes, refinements, additions and deletions, and about a month ago, confident in the filtering, we invited some of our customers to use the system. We modified the DNS records on their domains to relay email through the spam filtering system and then, via an smtproutes entry, back to their mailserver. The feedback we were getting was very positive. Again, 95%+ spam deletion, and any spam that was getting through was arriving via the secondary MX records (a typical spammer response to nolisting). We also felt that the Bayesian filtering scores were improving simply because the volume of email going through the filter was increasing and it was learning from it. We had of course fed the filter several thousand spam and non-spam emails during the testing phase (we get, er.. got a lot of spam) to teach it, but more didn’t hurt, especially when spam seems to vary so much from person to person.

So now we’re at the stage where we’ve converted entire hosting platforms to this system and completely shut down port 25 to everything apart from localhost and the spam system, to stop the secondary / tertiary MX spam. As we run individual Plesk hosting platforms at this time, the first thing we noticed was that the CPU and disk loads on the servers dropped through the floor because of the drop in email throughput. We knew that much of what the servers were doing was email related, but not ‘that’ much, and of course with 85%+ of all email being spam it was obviously taking its toll.

Here’s how we did it:
Load-balanced round-robin DNS servers with qmail and John Simpson’s patches (don’t ask me why I use qmail, I just like it).
SpamAssassin custom compiled to run sa-delete.
rcpthosts and smtproutes auto-synchronisation (coded by 3dpixel.net)
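
For anyone wondering what the rcpthosts and smtproutes synchronisation involves, the sketch below shows the general idea in Python. It is not our actual code: the function name, the domains and the backend hostnames are all made up, and it only builds the file contents without writing them anywhere. The file formats are the standard qmail ones, one domain per line in rcpthosts and domain:relay:port in smtproutes.

```python
def build_control_files(backends: dict[str, str], port: int = 25) -> tuple[str, str]:
    """Build qmail rcpthosts and smtproutes contents from a domain -> mailserver map.

    rcpthosts lists the domains the relay will accept mail for;
    smtproutes tells qmail which backend server to hand each domain's mail to.
    """
    domains = sorted(backends)
    rcpthosts = "".join(f"{domain}\n" for domain in domains)
    smtproutes = "".join(f"{domain}:{backends[domain]}:{port}\n" for domain in domains)
    return rcpthosts, smtproutes


# Hypothetical customer domains and the hosting servers behind them.
example_backends = {
    "example.org": "host2.example.net",
    "example.com": "host1.example.net",
}
rcpthosts_text, smtproutes_text = build_control_files(example_backends)
```

Generating both files from one list is the point of the exercise: a domain the relay accepts mail for but has no route for (or the other way round) is exactly the kind of mismatch that hand editing tends to produce.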

Positives:
Standalone spam filtering clusters with combined bayesian filtering.
Ease of management for spam systems rather than on each individual server.
No longer have to dance around Plesk’s implementation of qmail (some controversy here, as Plesk apparently broke the qmail licence agreement).

Negatives:
No per-user custom rules.

Still to do / wishlist:
validrcptto - smtproutes and qmail are configured per domain, not per email address, so any email bombs or dictionary attacks are still processed by the mail relay, upping the load. (Blacklists tend to reduce this effect, however.)
A per-domain report of spam deleted over a defined period, available to users.
jgreylisting - Done 08/04/08
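
As a rough illustration of what the validrcptto item on the wishlist would buy us, here is a small Python sketch of a per-address check. The addresses and the function name are hypothetical, and a real setup would do this inside the SMTP conversation rather than in a standalone script.

```python
# Hypothetical list of mailboxes that really exist behind the relay.
VALID_RECIPIENTS = {"alice@example.com", "bob@example.com"}


def accept_recipient(address: str, valid: set = VALID_RECIPIENTS) -> bool:
    """Accept mail only for known addresses, not for anything at a known domain."""
    return address.strip().lower() in valid
```

With only a per-domain check, every guess in a dictionary attack against example.com is accepted and scanned; with a per-address check, the guesses can be refused before any filtering work is done.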
