python character word-count

How to calculate the percentage of English words in a paragraph using Python


Let's say I have a paragraph that mixes several languages, like:

This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است.

I would like to calculate what percentage (%) of the words in this paragraph are English. How can I do that in Python?


Solution

  • This offline solution uses the pyenchant spellcheck module:

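    # -*- coding: utf-8 -*-
    import string

    import enchant

    dictionary = enchant.Dict("en_US")

    paragraph = u"This is paragraph in English. 这是在英国段。Это пункт на английском языке. این بند در زبان انگلیسی است."

    # Split on whitespace, then trim ASCII punctuation such as a trailing "."
    words = [w.strip(string.punctuation) for w in paragraph.split()]
    # Drop empty tokens: dictionary.check() raises ValueError on an empty string
    words = [w for w in words if w]

    en_count = sum(1 for word in words if dictionary.check(word))

    # Multiply by 100 to report a percentage rather than a fraction
    percent = 100.0 * en_count / len(words) if words else 0
    print(str(percent) + "% english words")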
    

    Output:

    31.25% english words
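As a rough sketch of the same idea in reusable form, the counting can be separated from the dictionary lookup so it can be tried without pyenchant installed. The names `english_word_percentage`, `KNOWN_WORDS` and `in_known_words` are made up for this illustration, and the tiny word set is only a stand-in for a real dictionary. Note that this version tokenizes with a regular expression instead of splitting on spaces, so its token count can differ from the answer above.

```python
import re


def english_word_percentage(text, is_english):
    """Return the percentage (0-100) of tokens in text accepted by is_english."""
    # In Python 3, \w+ is Unicode-aware: punctuation is dropped and each
    # run of letters/digits (in any script) becomes one token.
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_english(token))
    return 100.0 * hits / len(tokens)


# Hypothetical stand-in for a real dictionary, for illustration only.
KNOWN_WORDS = {"this", "is", "paragraph", "in", "english"}


def in_known_words(word):
    return word.lower() in KNOWN_WORDS


sample = "This is paragraph in English. Это пункт на английском языке."
print(english_word_percentage(sample, in_known_words))  # 50.0
```

With pyenchant available, `enchant.Dict("en_US").check` could be passed as `is_english` in place of the stand-in. Tokens produced by `\w+` are never empty, so the empty-string error from `check()` does not arise here.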