How to efficiently find small typos in source code files?

Asked 17/3, 2017 at 10:3 Answered 13/4, 2019 at 4:12

Solved python spell-checking lint aspell

I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable/method/class names. I have a strong preference for something that runs in a terminal.

The problem is that spell checkers like aspell or scspell report almost nothing but false positives (e.g. programming terms and camel-cased names), whereas I would be happy with a tool that mainly finds simple typos such as scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.

What I have tried so far is:

for f in **/*.py ; do echo "$f" ; aspell list < "$f" | sort | uniq -c ; done

but it flags things like assertEqual, MyTestCase, lifecycle

Bogan answered 17/3, 2017 at 10:3

My own solution focuses on Python files, but the typos it found also turned up in the HTML and JS files. It still needed manual sorting out of false positives, but that took only a few minutes, and it identified about 150 typos in comments, which could then also be found in the non-comment parts of the code.

Save this as an executable file, e.g. extractcomments:

#!/usr/bin/env python3
import argparse
import io
import tokenize


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()

    with io.open(args.filename, "r", encoding="utf-8") as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                print(t.string.lstrip("#").strip())
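The question also asks about typos in strings. As a rough sketch only, the same tokenize approach could be extended to collect string literals next to comments. The function name extract_text and the include_strings flag are made up for this illustration, and f-strings may be tokenized differently on newer Python versions, so they may be missed:

```python
import io
import tokenize


def extract_text(source, include_strings=True):
    """Collect comment text (and optionally string literals) from Python source.

    Hypothetical helper, not part of the original script.
    """
    chunks = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading "#" and surrounding whitespace.
            chunks.append(tok.string.lstrip("#").strip())
        elif include_strings and tok.type == tokenize.STRING:
            # Keep the literal as written, quotes included; a spell
            # checker treats the quotes as punctuation.
            chunks.append(tok.string)
    return chunks


if __name__ == "__main__":
    sample = 'x = "resticted access"  # dpeloyment note\n'
    for chunk in extract_text(sample):
        print(chunk)
```

The output can be fed to aspell in the same way as the comments-only output.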

Collect all comments for further processing (the ** glob needs zsh, or bash with shopt -s globstar):

for f in **/*.py ; do ~/extractcomments "$f" >> ~/comments.txt ; done

Run aspell over the collected comments with one or more dictionaries (here English, then German, so only words unknown to both remain) and count how often each suspected typo occurs:

aspell --lang=en list < ~/comments.txt | aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt

Produces something like:

10 availabe
 8 assignement
 7 hardwird

Strip the leading counts from typos.txt and remove the false positives, then copy the result to a second file, correct.txt, and run aspell on that copy to choose the desired replacement for each typo: aspell -c correct.txt

Both files must keep the same entries in the same order. Now paste the two files together to get lines of the form typo;correction: paste -d";" typos.txt correct.txt > known_typos.csv
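The pairing step only works if both files line up entry by entry, and the typos.txt written by uniq -c still carries counts. As a rough sketch (the function names strip_count and pair_typos are invented for this illustration and are not part of any tool), the same typo;correction lines could be built in Python with a length check:

```python
def strip_count(line):
    """Remove a leading 'uniq -c' count such as '     10 availabe'."""
    parts = line.split()
    if len(parts) == 2 and parts[0].isdigit():
        return parts[1]
    return line.strip()


def pair_typos(typo_lines, fix_lines):
    """Return 'typo;fix' lines, failing loudly if the lists do not line up."""
    typos = [strip_count(line) for line in typo_lines if line.strip()]
    fixes = [line.strip() for line in fix_lines if line.strip()]
    if len(typos) != len(fixes):
        raise ValueError(
            "got %d typos but %d corrections" % (len(typos), len(fixes))
        )
    return ["%s;%s" % (typo, fix) for typo, fix in zip(typos, fixes)]


if __name__ == "__main__":
    # File names follow the answer; adjust the paths as needed.
    with open("typos.txt", encoding="utf-8") as typos_file:
        with open("correct.txt", encoding="utf-8") as fixes_file:
            pairs = pair_typos(typos_file, fixes_file)
    with open("known_typos.csv", "w", encoding="utf-8") as out:
        out.write("\n".join(pairs) + "\n")
```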

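Now we want to recursively replace those in our code base:

#!/bin/bash
# Assumes GNU xargs and GNU sed (for -r, --null and in-place -i).

root_dir=$(git rev-parse --show-toplevel)

# Each line of known_typos.csv has the form typo;fix
while IFS=";" read -r typo fix ; do
    # List tracked files containing the typo as a whole word, then fix them in place.
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"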

My bash skills are poor so there is certainly space for improvement.

Update: I could find more typos in method names by running this:

grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
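The tr "_" " " trick above handles snake_case names but not camel-cased ones such as assertEqual or MyTestCase from the question. A minimal sketch of splitting both styles into plain words before handing them to a spell checker could look like this; the helper names split_identifier and identifier_words and the regular expressions are assumptions made for this illustration:

```python
import re

# An all-caps run (e.g. "HTTP") or an optionally capitalised lowercase word.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")
# Names introduced by "def" or "class" at the start of a line.
_DEFINITION = re.compile(r"^[ \t]*(?:def|class)[ \t]+([A-Za-z_]\w*)", re.MULTILINE)


def split_identifier(name):
    """Split snake_case or camelCase into lowercase words."""
    return [word.lower() for word in _WORD.findall(name)]


def identifier_words(source):
    """Return the words of every function and class name in the source text."""
    words = []
    for name in _DEFINITION.findall(source):
        words.extend(split_identifier(name))
    return words


if __name__ == "__main__":
    sample = "def get_maintenane_window():\n    pass\n\nclass MyTestCase:\n    pass\n"
    # One word per line, ready to pipe into: aspell --lang=en list
    print("\n".join(identifier_words(sample)))
```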

Update 2: I managed to fix typos that sit inside underscored names or strings and so have no word boundaries as such, e.g. i_am_a_typpo3:

#!/bin/bash

root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    # Match the typo only when it is not directly preceded or followed by a letter.
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
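To show why the letter-only lookarounds catch more than \b does, here is a small Python sketch of the same substitution. It is an illustration under assumptions, not a replacement for the script: the function name replace_typo is made up, and the typo is escaped so it is treated as literal text:

```python
import re


def replace_typo(text, typo, fix):
    """Replace typo unless it is directly preceded or followed by a letter."""
    pattern = re.compile(
        r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])"
    )
    # A function replacement keeps backslashes in fix from being interpreted.
    return pattern.sub(lambda match: fix, text)


if __name__ == "__main__":
    line = "i_am_a_typpo3 = 1"
    # \b does not match here because "_" and "3" count as word characters.
    print(re.sub(r"\btyppo\b", "typo", line))
    # The lookarounds only look at letters, so the underscore and digit are fine.
    print(replace_typo(line, "typpo", "typo"))
```

Underscores and digits next to the typo no longer block the match, while a typo embedded in a longer run of letters is still left alone.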

Bogan answered 31/3, 2017 at 14:30

If you're using TypeScript, you could use the gulp plugin I created for spellchecking: https://www.npmjs.com/package/gulp-ts-spellcheck

Heartstrings answered 11/8, 2018 at 16:12

If you are developing in JavaScript or TypeScript, you can use this spell check plugin for ESLint:

https://www.npmjs.com/package/eslint-plugin-spellcheck

I found it to be very useful.

Another option is scspell:

https://github.com/myint/scspell

It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

Barnsley answered 13/4, 2019 at 4:12
