
Spanish profanity black-list


I've been tasked with implementing a blacklist-based profanity filter for a Rails app. I know there are a ton of issues with blacklist-based filtering, but the decision was made over my head. The challenge: I'm looking for a good list of Spanish profanity to feed into the filter. For English, we're building on a text file that exhaustively lists conjugations, plurals, etc., one per line. Does such a list exist in the public domain for Spanish?


Solution

  • Finding good lists and getting them tuned is difficult. It also sounds like you are doing a lot of manual work that can be automated (e.g. conjugation). I did a lot of this for my company's profanity filter, CleanSpeak. Much of it can be automated using part-of-speech (POS) tags for each word, and in many cases you can tag the words manually or find an existing POS source.

    You'll also need to consider the quality of the lists and the upkeep and management of the filter. A lot of people think it is simple and then realize that it is extremely difficult to prevent false positives.

    All that said, we found the majority of our lists for other languages difficult to come by online and ended up paying to have many of them built, or purchasing them from other companies. The lists we did find online ended up being nearly worthless once we had them translated. We also attempted to take our English blacklist and have it translated, which was a complete failure because most English profanities don't have equivalents in other languages. I would suggest purchasing lists or working with students at your local university to generate them. A number of our customers found that this approach worked reasonably well and was not overly expensive.

    I would also suggest that you take a look at some of the resources out there that define the best ways to manage User Generated Content. These will help guide you through any build vs. buy decisions.
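
To make the point about automating conjugation more concrete, here is a minimal Ruby sketch of the idea. It is not how CleanSpeak works; it only illustrates expanding a short list of POS-tagged lemmas into inflected forms and then matching whole words only, which avoids one common source of false positives (substring matches). Everything here is an assumption for illustration: the names `VERB_ENDINGS`, `expand`, `build_blacklist`, and `flagged_words` are hypothetical, the example words are mild placeholders, and the rules cover only regular present-tense verbs and simple plurals.

```ruby
require "set"

# Assumed, simplified rules: regular present-tense endings only.
# Irregular verbs, other tenses, and spelling changes (e.g. z -> ces,
# accent shifts in plurals) are NOT handled and would need a real
# morphological source.
VERB_ENDINGS = {
  "ar" => %w[o as a amos áis an],
  "er" => %w[o es e emos éis en],
  "ir" => %w[o es e imos ís en]
}.freeze

# Expand one lemma into its inflected forms based on its POS tag.
def expand(lemma, pos)
  case pos
  when :verb
    stem = lemma[0...-2]
    endings = VERB_ENDINGS.fetch(lemma[-2, 2], [])
    [lemma] + endings.map { |e| stem + e }
  when :noun
    # Vowel-final nouns take "s", consonant-final nouns take "es".
    plural = lemma.match?(/[aeiouáéíóú]\z/) ? "#{lemma}s" : "#{lemma}es"
    [lemma, plural]
  when :adjective
    if lemma.end_with?("o")
      stem = lemma[0...-1]
      %w[o a os as].map { |e| stem + e }
    else
      expand(lemma, :noun)
    end
  else
    [lemma]
  end
end

# Build the full blacklist from [lemma, pos] pairs.
def build_blacklist(tagged_lemmas)
  tagged_lemmas.flat_map { |lemma, pos| expand(lemma, pos) }.to_set
end

# Return blacklisted words found in the text, matching whole words only.
def flagged_words(text, blacklist)
  text.downcase.scan(/\p{L}+/).select { |word| blacklist.include?(word) }
end

# Placeholder entries standing in for a real, curated list.
BLACKLIST = build_blacklist([
  ["molestar", :verb],
  ["tonto", :adjective],
  ["animal", :noun]
])
```

With these placeholder entries, `flagged_words("Los tontos molestan a todos", BLACKLIST)` flags "tontos" and "molestan", while "Qué tontería" is not flagged because only whole words are compared. The expanded set could also be written out one form per line to match the format of the English list described in the question.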