Tags: python, regex, pyspark, apache-spark-sql, johnsnowlabs-spark-nlp

Regex in Spark NLP Normalizer is not working correctly


I'm using the Spark NLP pipeline to preprocess my data. Instead of only removing punctuation, the normalizer also removes umlauts.

My code:

    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, Normalizer

    documentAssembler = DocumentAssembler() \
        .setInputCol("column") \
        .setOutputCol("column_document") \
        .setCleanupMode("shrink_full")

    tokenizer = Tokenizer() \
        .setInputCols(["column_document"]) \
        .setOutputCol("column_token") \
        .setMinLength(2) \
        .setMaxLength(30)

    normalizer = Normalizer() \
        .setInputCols(["column_token"]) \
        .setOutputCol("column_normalized") \
        .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
        .setLowercase(True)

Example:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!

Output:

Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut

Expected Output:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut

Solution

  • The \w class is not Unicode-aware by default: Spark NLP hands the cleanup pattern to Java's regex engine, where \w matches only ASCII word characters, so ä, ö, and ü fall into [^\w -] and get stripped. Make the pattern Unicode-aware with the UNICODE_CHARACTER_CLASS option; the easiest way is the embedded flag expression (?U):

    "(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"
    

    More details from the java.util.regex.Pattern documentation on UNICODE_CHARACTER_CLASS:

    When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

    The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

    The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
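The effect is easy to reproduce outside Spark. Java's default ASCII-only \w behaves like Python's `re.ASCII` mode, while (?U) corresponds to Python 3's default Unicode-aware \w. A small sketch using plain Python `re` as a stand-in for the JVM engine that Spark NLP actually uses:

```python
import re

# The cleanup pattern from the question (without the (?U) fix)
pattern = r"[^\w -]|_|-(?!\w)|(?<!\w)-"
text = "Reutter-Müller!"

# ASCII-only \w, like Java's default: "ü" falls into [^\w -] and is stripped
print(re.sub(pattern, "", text, flags=re.ASCII))  # Reutter-Mller

# Unicode-aware \w, like Java with (?U): "ü" is a word character and survives
print(re.sub(pattern, "", text))  # Reutter-Müller
```

In both modes the inner hyphen survives, because the lookarounds `-(?!\w)` and `(?<!\w)-` only remove hyphens that are not surrounded by word characters; only the umlaut handling differs.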