
How to efficiently find small typos in source code files?


I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable, method and class names. I strongly prefer something that runs in a terminal.

The problem is that spell checkers like aspell or scspell report almost nothing but false positives (e.g. programming terms, camel-cased names), whereas I mainly want help finding simple typos such as scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.

What I have tried so far:

for f in **/*.py ; do echo "$f" ; aspell list < "$f" | sort | uniq -c ; done

but it flags things like assertEqual, MyTestCase, lifecycle


Solution

  • This solution of my own focuses on Python files, but in the end it also uncovered the same typos in HTML and JS files. The false positives still had to be sorted out manually, but that took only a few minutes, and it identified about 150 typos in comments, which could then also be found in the non-comment parts.

    Save this as an executable file, e.g. extractcomments:

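    #!/usr/bin/env python3
    """Print the text of every comment in a Python source file, one per line."""
    import argparse
    import tokenize


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(add_help=False)
        parser.add_argument('filename')
        args = parser.parse_args()

        # tokenize tells real comments apart from '#' characters inside strings.
        with open(args.filename, "r", encoding="utf-8") as sourcefile:
            for t in tokenize.generate_tokens(sourcefile.readline):
                if t.type == tokenize.COMMENT:
                    # Drop the leading '#' characters and surrounding whitespace.
                    print(t.string.lstrip("#").strip())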
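
Since the question also asks about strings, the same tokenizer approach could be extended to string literals. The following is only a sketch under that assumption: the function name `extract_comments_and_strings` is made up for illustration, and string tokens are returned with their quotes still attached.

```python
import io
import tokenize


def extract_comments_and_strings(source):
    """Yield (kind, text) for each comment and string token in Python source."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Same cleanup as for comments above: drop '#' and whitespace.
            yield "comment", tok.string.lstrip("#").strip()
        elif tok.type == tokenize.STRING:
            # The token text still includes the surrounding quotes.
            yield "string", tok.string


if __name__ == "__main__":
    sample = 'x = "resticted access"  # maintenane window\n'
    for kind, text in extract_comments_and_strings(sample):
        print(kind, text)
```

The quotes are harmless for aspell, which ignores punctuation when listing unknown words.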
    

    Collect all comments for further processing:

    for f in **/*.py ; do ~/extractcomments "$f" >> ~/comments.txt ; done
    

    Run the collected comments through one or more aspell dictionaries (chaining two aspell list calls keeps only the words unknown to both dictionaries), then collect everything flagged as a typo and count the occurrences:

    aspell --lang=en list < ~/comments.txt | aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt
    

    Produces something like:

    10 availabe
     8 assignement
     7 hardwird
    

    Strip the leading counts from the list and remove the false positives. Then copy it to a second file, correct.txt, and run aspell interactively on that copy to choose the desired replacement for each typo: aspell -c correct.txt

    Now paste the two files together, line by line, to get typo;correction pairs: paste -d";" typos.txt correct.txt > known_typos.csv
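
The pairing step silently misaligns if the two files end up with different line counts. As a rough sketch of a safer variant (the function name `pair_typos` is hypothetical, and it assumes one word per line), this strips any leading `uniq -c` counts and refuses to pair lists of unequal length:

```python
def pair_typos(count_lines, corrections):
    """Pair typo lines (with or without leading counts) with their fixes."""
    # A `uniq -c` line looks like '     10 availabe'; keep only the last field.
    typos = [line.split()[-1] for line in count_lines if line.strip()]
    fixes = [line.strip() for line in corrections if line.strip()]
    if len(typos) != len(fixes):
        raise ValueError(f"{len(typos)} typos but {len(fixes)} corrections")
    return [f"{typo};{fix}" for typo, fix in zip(typos, fixes)]


if __name__ == "__main__":
    rows = pair_typos(["     10 availabe", "      8 assignement"],
                      ["available", "assignment"])
    print("\n".join(rows))
```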

    Now we want to recursively replace those in our codebase:

    #!/bin/bash

    root_dir=$(git rev-parse --show-toplevel)

    # For each typo;fix pair, rewrite every tracked file containing the typo as a whole word.
    while IFS=";" read -r typo fix ; do
        git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
    done < "$root_dir/known_typos.csv"
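
For readers who would rather avoid the GNU-specific `sed -i` and `\b`, the whole-word replacement itself can be sketched in Python. This is an illustration only: `fix_text` is a made-up helper that works on a string, and reading and writing the files is left out.

```python
import re


def fix_text(text, typos):
    """Replace each whole-word typo in text with its fix."""
    for typo, fix in typos.items():
        pattern = r"\b" + re.escape(typo) + r"\b"
        # A function replacement keeps backslashes in `fix` literal.
        text = re.sub(pattern, lambda match, fix=fix: fix, text)
    return text


if __name__ == "__main__":
    known = {"availabe": "available", "hardwird": "hardwired"}
    print(fix_text("the availabe option is hardwird", known))
```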
    

    My bash skills are poor, so there is certainly room for improvement.

    Update: I found more typos in method names by running this:

    grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
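
The `tr "_" " "` step only splits snake_case names, so camel-cased names such as assertEqual or MyTestCase would still reach aspell as single words. A possible way to split both styles first is sketched below; `split_identifier` is a hypothetical helper, not part of any tool mentioned here.

```python
import re


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    # First cut on underscores, digits and any other non-letter characters.
    for chunk in re.split(r"[_\W\d]+", name):
        # Then cut on case changes, keeping acronyms such as 'HTTP' together.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [word.lower() for word in words]


if __name__ == "__main__":
    for name in ("assertEqual", "MyTestCase", "get_maintenane_window"):
        print(name, split_identifier(name))
```

Feeding the resulting words to `aspell list`, one per line, should leave mostly real typos such as maintenane.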
    

    Update 2: I managed to fix typos that sit inside underscored names or strings and therefore have no regular word boundaries around them, e.g. i_am_a_typpo3:

    #!/bin/bash

    root_dir=$(git rev-parse --show-toplevel)
    while IFS=";" read -r typo fix ; do
        echo "${typo}"
        find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
    done < "$root_dir/known_typos.csv"
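
The reason the lookarounds help is that `\b` treats underscores and digits as word characters, so `\btyppo\b` does not match inside i_am_a_typpo3. The same pattern can be tried out in Python; this is a small sketch, and `replace_typo` is a name chosen here for illustration.

```python
import re


def replace_typo(text, typo, fix):
    """Replace typo wherever it is not directly surrounded by ASCII letters."""
    pattern = r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])"
    # A function replacement keeps backslashes in `fix` literal.
    return re.sub(pattern, lambda match: fix, text)


if __name__ == "__main__":
    name = "i_am_a_typpo3"
    print(re.sub(r"\btyppo\b", "typo", name))   # \b version
    print(replace_typo(name, "typpo", "typo"))  # lookaround version
```

The trade-off is that the lookaround version also rewrites identifiers, so renamed functions or variables need to be changed consistently across the whole code base.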