Suppose I have a paragraph that mixes several languages, for example:
This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است.
I would like to calculate what percentage (%) of the words in this paragraph are English. How can I do that in Python?
This offline solution uses the pyenchant spellcheck module:
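# -*- coding: utf-8 -*-
import enchant

# Load the US English dictionary (works offline)
dictionary = enchant.Dict("en_US")
paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."
# Split on single spaces; each chunk is treated as one word
words = paragraph.split(" ")
en_count = 0.0
for word in words:
    # Count the word if the English dictionary accepts it
    if dictionary.check(word.strip()):
        en_count += 1
# Multiply by 100 to turn the fraction into a percentage
percent = 100 * en_count / len(words) if len(words) != 0 else 0
print(str(percent) + "% english words")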
Output:
31.25% english words
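One limitation of splitting on spaces is that punctuation stays attached to words, and text without spaces (such as the Chinese sentence) merges with its neighbour into a single token. The following is a minimal sketch of one way to handle that, not part of the original answer: it tokenizes with a Unicode-aware regular expression and takes the word check as a parameter. The names `english_percentage`, `SAMPLE_WORDS`, and `in_sample_words` are hypothetical, and the small word set is only a stand-in so the example runs without pyenchant installed.

```python
# -*- coding: utf-8 -*-
import re


def english_percentage(text, is_english):
    """Return the percentage of word tokens in `text` accepted by `is_english`."""
    # \w+ with re.UNICODE matches runs of letters/digits in any script,
    # so punctuation such as "." or the Chinese full stop is not part of a token.
    tokens = re.findall(r"\w+", text, re.UNICODE)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_english(token))
    return 100.0 * hits / len(tokens)


# Hypothetical stand-in for a real dictionary, for demonstration only.
SAMPLE_WORDS = {"this", "is", "paragraph", "in", "english"}


def in_sample_words(token):
    # Case-insensitive lookup in the tiny demonstration word set.
    return token.lower() in SAMPLE_WORDS


paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."
print("%.2f%% english words" % english_percentage(paragraph, in_sample_words))
```

With pyenchant installed, passing `enchant.Dict("en_US").check` as `is_english` should give a dictionary-backed result. Because the token count differs from the space-split version, the percentage will not necessarily match the figure above.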