python string list text-segmentation

Converting a String to a List of Words?


I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

Then convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?


Solution

  • Try this:

    import re
    
    mystr = 'This is a string, with words!'
    wordList = re.sub(r"[^\w]", " ", mystr).split()
    

    How it works:

    From the docs:

    re.sub(pattern, repl, string, count=0, flags=0)
    

    Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

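    So in our case:

    The pattern [^\w] matches any single character that is not a word character.

    \w matches any word character: letters, digits and the underscore. For ASCII text this is the character set [a-zA-Z0-9_] (a to z, A to Z, 0 to 9 and underscore). In Python 3, \w in a str pattern also matches Unicode letters and digits unless the re.ASCII flag is set.

    So we match every non-word character and replace it with a space. Note that underscores count as word characters, so they are kept.

    Then we call split() with no arguments, which splits the string on runs of whitespace and returns the pieces as a list.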

    So 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().

    Let me know if anything is unclear.
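
To make the two steps in this answer concrete, here is a minimal sketch that wraps them in a helper. The function name `to_words` and the `ascii_only` parameter are made up for illustration and are not part of the original answer; `re.sub`, `re.ASCII` and `str.split` are standard library calls. It assumes Python 3 and `str` input.

```python
import re


def to_words(text, ascii_only=False):
    """Split text into words by replacing every non-word character with a space.

    `to_words` and `ascii_only` are hypothetical names used for this sketch.
    """
    # With re.ASCII, \w is limited to [a-zA-Z0-9_]. Without it, \w in a str
    # pattern also matches Unicode letters and digits.
    flags = re.ASCII if ascii_only else 0
    # Step 1: turn punctuation and other non-word characters into spaces.
    spaced = re.sub(r"[^\w]", " ", text, flags=flags)
    # Step 2: split() with no arguments splits on runs of whitespace,
    # so repeated spaces do not produce empty strings.
    return spaced.split()


print(to_words("This is a string, with words!"))
print(to_words("hello-world"))
print(to_words("caf\u00e9 au lait"))
print(to_words("caf\u00e9 au lait", ascii_only=True))
```

With the default flags, the accented letter in the third call is treated as a word character and stays inside its word. With `ascii_only=True` it is replaced by a space like any punctuation mark, so which setting is appropriate depends on the text being processed.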