This is primarily a guide for administrators of cPanel hosting systems, though tech-savvy cPanel users with shell access will be able to use this technique as well.
Users of webmail systems like Gmail, Yahoo, etc. are accustomed to having a “Mark as Spam” button in the interface. Clicking the button tells the server that the selected message is spam, to prevent similar messages from showing up in the inbox again. So how can administrators of standard cPanel-based hosting systems provide similar functionality?
Users of desktop mail clients like Apple Mail, Thunderbird, Outlook, etc. also have a “Mark as spam” button, but clicking it only trains the desktop client, not the server. That’s only marginally helpful – it’s far better to prevent known spam from ever getting to the client to begin with. And if you move around between several computers, you have to train them all separately.
But if users configure their desktop mail clients to store junk mail on the server in a known location whenever that Junk button is clicked, then spamassassin’s training tool can be run against that server-side maildir, thus training the server-side Bayes database that’s used for junk filtering.
Of course, this takes a bit of willingness on the part of the end user, who must do a bit of mail client config work. But it’s generally easy, and the payoff for users with persistent spam problems is huge. Here is a configuration guide I made for Apple Mail users. I’ll add other mail clients to it in the future.
So there are two steps here:
- Have users with heavy spam problems store their junk mail on the server
- Train spamassassin against that junk folder
Step #2 can technically be done by any user with shell access, but it’s much better if the sysadmin does it for them, making the process non-technical and as transparent as possible. For a user of Apple Mail on a cPanel system, the default location for server-side junk will be:
/home/username/mail/example.com/email_acct/.Junk/cur
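If you manage many mailboxes, it can help to build that path from its parts instead of typing it out each time. The sketch below is only an illustration: the function name junk_dir is made up here, and it assumes the folder is called .Junk as in the Apple Mail case above (other clients may use a different folder name).

```shell
#!/bin/sh
# Build the server-side Junk maildir path for one mailbox.
# $1 = cPanel username, $2 = mail domain, $3 = mailbox name (local part).
# Assumption: the client stores junk in a folder named ".Junk".
junk_dir() {
    printf '/home/%s/mail/%s/%s/.Junk/cur\n' "$1" "$2" "$3"
}

# Prints the example path used in this article.
junk_dir username example.com email_acct
```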
The tool for spam training is sa-learn, and its job is to educate users’ Bayesian spam databases about what’s spam and what’s ham. It works like this:
sa-learn --spam --showdots /home/username/mail/example.com/email_acct/.Junk/cur
But what if the root user wants to do it on their behalf? Here’s the rub: It is not possible (or rather, not a good idea)* for root to train spamassassin against individual users’ spam boxes and have that training apply to all users on the server.
The command above will work fine for individual user accounts when run by those users, but if you run sa-learn as root, you will only be training the Bayesian database for user root. You will NOT be training the whole system that those messages are spam. When exim invokes spamassassin on messages belonging to other accounts, the training you’ve done as root will not apply.
The key to making this work is that root must run sa-learn commands as other users. This is accomplished with the ‘su’ command and its “- username” argument. The hyphen/dash makes su start a login shell for the invoked user, so the command runs with that user’s own environment rather than root’s. When working this way, you’ll want to use su’s “-c” option to specify a command to be passed in. So, putting it all together, for root to train user joe’s spam database against one of joe’s mail folders, root should run:
su -c "sa-learn --spam --showdots /path/to/maildir/full-of-spam" - joe
Try that, and watch Joe’s spam drop to zero overnight! Actually, to make spamassassin truly effective, you must train it with both spam and ham. So you’ll also want to identify a server-side mailbox belonging to the user that consists of mostly non-spam (their inbox is generally a good place to look, if the user is good at pruning junk – just pick directories with a high probability of being mostly spam or mostly ham). sa-learn is smart, and will skip messages it’s already seen. It will also retrain a message when it’s moved from the spam box to the ham box or vice versa.
su -c "sa-learn --ham /path/to/maildir/full-of-ham" - joe
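Since the spam and ham commands differ only in one flag, a small wrapper can generate both. This is a rough sketch rather than a finished tool: train_cmd is a name invented for this example, it only prints the command so you can inspect it before running anything as root, and it assumes the maildir path contains no spaces.

```shell
#!/bin/sh
# Print the su command root would run to train one user's Bayes database.
# $1 = account name, $2 = "spam" or "ham", $3 = maildir to learn from.
# Assumption: the maildir path contains no spaces or shell metacharacters.
train_cmd() {
    case "$2" in
        spam|ham) ;;
        *) echo "usage: train_cmd ACCOUNT spam|ham MAILDIR" >&2; return 1 ;;
    esac
    printf 'su -c "sa-learn --%s %s" - %s\n' "$2" "$3" "$1"
}

# Prints the two commands shown in this article (minus --showdots).
train_cmd joe spam /path/to/maildir/full-of-spam
train_cmd joe ham /path/to/maildir/full-of-ham
```

Once the printed commands look right, they can be pasted into a root shell or into a script.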
Ham training is really important – if you feed Bayes a lot more spam than ham, your system may start to generate false positives. So be sure that ham training is happening as well as spam training.
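One rough way to spot a lopsided training set before you feed it to sa-learn is to compare how many message files sit in each folder. The sketch below is an assumption-laden illustration: count_msgs and check_balance are made-up names, and the default ratio of 3 is an arbitrary threshold chosen for the example, not a SpamAssassin recommendation.

```shell
#!/bin/sh
# Count the message files in a maildir directory (0 if it does not exist).
count_msgs() {
    find "$1" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Compare a spam folder ($1) with a ham folder ($2).
# $3 = how many times more spam than ham is tolerated (default 3, arbitrary).
# Returns non-zero when the spam count exceeds ham count times that ratio.
check_balance() {
    spam=$(count_msgs "$1")
    ham=$(count_msgs "$2")
    ratio=${3:-3}
    if [ "$spam" -gt $((ham * ratio)) ]; then
        echo "unbalanced: $spam spam vs $ham ham"
        return 1
    fi
    echo "ok: $spam spam vs $ham ham"
}
```

Run it as check_balance followed by the spam directory and the ham directory. File counts only tell you what is in the folders, not what Bayes has already learned; running sa-learn --dump magic as the user shows the nspam and nham totals recorded in their database.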
Putting it all together, root can now create a shell script, to be run via cron, that simply lists spam- and ham-training commands for specific users, to be run as those users. There is a small twist though: To keep spamassassin effective, you must periodically expire old tokens in the database, which is done with:
su -c "sa-learn --force-expire" - account_name
You could technically do that on every run, but it’s CPU-intensive and only needs to be done around once a month. So you might want to put a list of expire commands in a separate script and schedule it in crontab to run once per month (or you could do some fancy footwork in the same script – whatever works for you).
One final note: The technique above only governs the spam score that gets written into message headers. It does nothing about actually filtering or deleting spam. For the scores to have any effect, the user must still configure their auto-delete or spambox thresholds in the cPanel SpamAssassin section. Now that message scoring is actually effective, their auto-delete or spambox systems will work effectively as well.
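Here is a minimal sketch of the single-script variant, with the monthly expire folded in. Everything specific in it is an assumption for illustration: the account joe, both maildir paths, the TRAINING_LIST format, and the RUN switch are invented for this example. By default the script only prints the commands it would run, so it is safe to try as a non-root user.

```shell
#!/bin/sh
# Hypothetical training list: one "account kind maildir" entry per line.
TRAINING_LIST='joe spam /home/joe/mail/example.com/joe/.Junk/cur
joe ham /home/joe/mail/example.com/joe/cur'

# Dry run by default: commands are echoed, not executed.
# Set RUN to an empty string (RUN= ./script) to really run them as root.
RUN=${RUN-echo}

# Train each listed maildir as the account that owns it.
train_all() {
    printf '%s\n' "$TRAINING_LIST" | while read -r account kind dir; do
        [ -n "$account" ] || continue
        $RUN su -c "sa-learn --$kind $dir" - "$account" </dev/null
    done
}

# Expire old Bayes tokens once for each distinct account in the list.
expire_all() {
    printf '%s\n' "$TRAINING_LIST" | cut -d' ' -f1 | sort -u |
    while read -r account; do
        [ -n "$account" ] || continue
        $RUN su -c "sa-learn --force-expire" - "$account" </dev/null
    done
}

# Expire only on the first day of the month; train on every run.
if [ "$(date +%d)" = "01" ]; then
    expire_all
fi
train_all
```

Scheduled nightly from root’s crontab, this trains on every run and does the expensive expire only on the first of the month. Keeping the list in one variable means adding a user is a matter of adding two lines.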
*Global spam training
Earlier I said that it was not a good idea to configure a server to train a global spam database by reading individual users’ reckoning of what’s spam and what’s ham. That’s because A) One man’s junk is another man’s treasure and B) Many users aren’t diligent about keeping spam and ham separate. Using those users’ mailboxes as a global guide could result in a polluted Bayes database that affects everyone.
Nevertheless, if you do want to take this approach, it is possible. Just edit /etc/mail/spamassassin/local.cf with these instructions, i.e. set a globally recognized path to a central Bayes db. Again, this is not recommended – caveat emptor.
Update: Six months after starting with the technique above, I’ve stopped. On a small-medium web host, and with each user training their own filters, there apparently just isn’t a large enough sample size to be really effective, and very large amounts of spam still get through. I’ve since disabled this script’s cron job and have added the Barracuda RBL. Barracuda has been helpful, but it isn’t a silver bullet either. Notes on that here.