I tried to set up autocorrect using pyspellchecker in Python. In general it works, but it currently also splits URLs, which is not desired. The code is as follows:
from spellchecker import SpellChecker
spell = SpellChecker()
words = spell.split_words("This is my URL https://test.com")
test = [spell.correction(word) for word in words]
This results in the following: ['This', 'is', 'my', 'URL', 'steps', 'test', 'com']
What do I have to change so that URLs are not autocorrected?
NLTK's TweetTokenizer correctly tokenizes URLs, hashtags, and emoticons.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s = "This is my URL https://test.com"
>>> tknzr.tokenize(s)
['This', 'is', 'my', 'URL', 'https://test.com']
NLTK comes with a variety of state-of-the-art word tokenizers. I suggest you use NLTK to split your string into tokens before passing them to the autocorrector, and filter out the tokens you want left alone. You could also use NLTK's part-of-speech utilities to decide which tokens should be autocorrected.