regex, pyspark, tokenize

Why does my RegexTokenizer transformation in PySpark give me the opposite of the required pattern?


When I use RegexTokenizer from pyspark.ml.feature to tokenize the Sentence column of my DataFrame and find all the word characters, I get the opposite of what the Python re package returns for the same sentence and pattern. Here is the sample code:

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer
spark = SparkSession.builder \
        .master("local") \
        .appName("Word list") \
        .getOrCreate()

df = spark.createDataFrame(
    data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
    schema=["Sentence"],
)

regexTokenizer = RegexTokenizer(inputCol="Sentence", outputCol="letters", pattern="\\w")
df = regexTokenizer.transform(df)
df.first()['letters']

This gives the following output:

[' ', ', ', ' ', ' ', ' ', ' ', ' ', ', ', ' ', ' ', ' ', ' ', '...']

On the other hand, if I use the re module on the same sentence with the same pattern to match the letters:

import re
sentence = "Hi there, I have a question about RegexTokenizer, could you please help me..."
letters_list = re.findall("\\w", sentence)
print(letters_list)

I get the desired output, as expected from the regular expression pattern:

['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', 'h', 'a', 'v', 'e', 'a', 
'q', 'u', 'e', 's', 't', 'i', 'o', 'n', 'a', 'b', 'o', 'u', 't', 
'R', 'e', 'g', 'e', 'x', 'T', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 
'r', 'c', 'o', 'u', 'l', 'd', 'y', 'o', 'u', 'p', 'l', 'e', 'a', 
's', 'e', 'h', 'e', 'l', 'p', 'm', 'e']

I also found that using \W instead of \w in PySpark solves this problem. Why is there this difference? Or have I misunderstood the usage of the pattern argument in RegexTokenizer?


Solution

  • According to the RegexTokenizer documentation, it has a parameter called gaps. In one mode the regex matches the gaps between tokens (gaps=True, which is the default); in the other it matches the tokens themselves rather than the gaps (gaps=False).

    Try setting it explicitly to the value you need: in your case, gaps=False, as in the sketch below.
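
    A minimal sketch of your snippet with gaps=False. Note that RegexTokenizer also lowercases its output by default (toLowercase=True), so toLowercase=False is added here as well to match the casing that re.findall preserves:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import RegexTokenizer

    spark = SparkSession.builder.master("local").appName("Word list").getOrCreate()

    df = spark.createDataFrame(
        data=[["Hi there, I have a question about RegexTokenizer, Could you please help me..."]],
        schema=["Sentence"],
    )

    # gaps=False makes the pattern match the tokens themselves instead of the
    # separators between them; toLowercase=False keeps the original casing.
    regexTokenizer = RegexTokenizer(
        inputCol="Sentence",
        outputCol="letters",
        pattern="\\w",
        gaps=False,
        toLowercase=False,
    )

    print(regexTokenizer.transform(df).first()["letters"])
    # ['H', 'i', 't', 'h', 'e', 'r', 'e', 'I', ...]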