I tried to set up autocorrect using pyspellchecker in Python. In general it works, but it currently also splits URLs, which is not desired. The code is as follows:
from spellchecker import SpellChecker
spell = SpellChecker()
words = spell.split_words("This is my URL https://test.com")
test = [spell.correction(word) for word in words]
This results in the following: ['This', 'is', 'my', 'URL', 'steps', 'test', 'com']
What do I have to change so that URLs are not autocorrected?
NLTK's TweetTokenizer correctly tokenizes URLs, hashtags, and emoticons.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s = "This is my URL https://test.com"
>>> tknzr.tokenize(s)
['This', 'is', 'my', 'URL', 'https://test.com']
NLTK comes with a variety of state-of-the-art word tokenizers. I suggest you use NLTK to split your string into tokens before passing them to the autocorrector, and filter out the tokens you want left alone. You could also use NLTK's part-of-speech utilities to decide which tokens should be autocorrected.