I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot understand.
How do I filter the non-English messages to /dev/null using procmail and/or command line tools?
I use procmail to filter my email, so ideally any alternative tool should also be usable from a procmail recipe.
I'd prefer not to have to train my own language models.
One way is to use the Perl TextCat package from Gertjan van Noord.
The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.
Here is a simple procmail recipe to call the text_cat script:
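:0
* ^Subject.*Jobs.*Board
{
  # procmail feeds the message to text_cat on stdin; capture the detected language
  LANG_=`/usr/local/bin/text_cat`

  # Discard the mail unless text_cat reported exactly "english"
  :0
  * ! LANG_ ?? ^english$
  /dev/null

  # Everything else is filed in the jobs/ folder
  :0
  jobs/
}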
I've been running text_cat for a few years. In that time no non-English message has been classified as English, that is, there have been no false positives. I've not been rigorous about checking for false negatives (English messages classified as another language and discarded).
A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by SpamAssassin, which is also based on TextCat. SpamAssassin will unwrap any MIME transfer encodings, which the vanilla text_cat version above won't.
Here is an incompletely tested procmail recipe for filtering on the SpamAssassin X-Spam-Languages header:
:0
* ^Subject.*Jobs.*Board
{
  # File non-English mails in the foreign/ folder using the SpamAssassin header
  # Test for not "X-Spam-Languages: en"
  :0
  * ! ^X-Spam-Languages: en$
  foreign/

  # Save English language mails in the jobs/ folder
  :0
  jobs/
}
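Warning: SpamAssassin will occasionally provide multiple language categorisations, like so:
X-Spam-Languages: en da ro
The above recipe does not account for this: such a header fails the exact match on en, so the mail is filed in foreign/ even though English is among the guesses.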
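One way to cope with a multi-language header is to look for en as a separate entry anywhere in the header value, rather than requiring it to be the only entry. The following is an untested sketch. Treating a mail as English whenever en appears among the guesses is my assumption about the desired behaviour, and the folder names are simply carried over from the recipe above.

```procmail
:0
* ^Subject.*Jobs.*Board
{
  # Assumption: a mail counts as English if "en" is one of the
  # space-separated codes in X-Spam-Languages, in any position.
  # (.* )? allows other codes before "en"; ( |$) requires a space
  # or end of line after it, so codes merely containing "en" do not match.
  :0
  * ! ^X-Spam-Languages:(.* )?en( |$)
  foreign/

  # Mails with "en" among the detected languages
  :0
  jobs/
}
```

As with the exact-match version, a mail with no X-Spam-Languages header at all fails the test and ends up in foreign/, so it is probably worth reviewing that folder for a while before switching the destination to /dev/null.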
SpamAssassin Language Categorisation Configuration
Edit /etc/spamassassin/v310.pre and uncomment the following line:
loadplugin Mail::SpamAssassin::Plugin::TextCat
Configure the plugin in /etc/spamassassin/local.cf:
ok_languages en # I understand english
inactive_languages '' # Enable all languages
add_header all Languages _LANGUAGES_
# score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended
This recipe was incompletely tested with spamassassin version 3.4.2.
To adapt these recipes to keep a different language instead of English, substitute that language's name for english in the first recipe and its two-character language code for en in the second.