I'm trying to configure codespell for a code base that uses Jupyter notebooks a lot.
Codespell throws a lot of false positives on a notebook that contains images embedded using base64 encoding. It appears that the / and + characters are interpreted as word boundaries. With a long enough base64 string, that trips a bunch of rules like ue->use, due, etc.
How can I make codespell ignore those base64 encoded strings altogether?
I've looked into using ignore-regex, but as far as I can tell, that option operates only on already split words. I'd need ignore-regex to skip entire sections of text.
Minimal reproducible example:
CkNvZGV+ue/+ue+zcGVsbCB0aHJvd3MgYSBsb3Qgb2YgZmFsc2UgcG9zaXRpdmVzIG9uIGEgbm90ZWJvb2sgdGhhdCBjb250YWlucyBpbWF
Running codespell on this gets me:
$ codespell
./test.py:1: ue ==> use, due
./test.py:1: ue ==> use, due
One can use --ignore-regex to exclude sufficiently long strings that could be base64. This works for me:
codespell --ignore-regex='[A-Za-z0-9+/]{100,}'
To include it in .codespellrc or an equivalent config file, make sure not to enclose the regex in single or double quotes (I made this mistake at first):
ignore-regex = [A-Za-z0-9+/]{100,}
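If you want to sanity-check the pattern before putting it in your config, here is a minimal Python sketch that runs it against the sample line from the question. Note that WORD is only my rough stand-in for a spell checker's word splitting and words_checked is a hypothetical helper; neither is codespell's actual implementation.

```python
import re

# The pattern from the answer: runs of 100+ characters from the base64 alphabet.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")

# Assumption: a simplified word splitter, so that + and / act as boundaries.
WORD = re.compile(r"[A-Za-z']+")


def words_checked(line, ignore=None):
    """Return the words that would be spell-checked (hypothetical helper)."""
    if ignore is not None:
        # Blank out everything the ignore pattern matches before splitting.
        line = ignore.sub(" ", line)
    return WORD.findall(line)


# The sample line from the question.
sample = (
    "CkNvZGV+ue/+ue+zcGVsbCB0aHJvd3MgYSBsb3Qgb2YgZmFsc2UgcG9zaXRpdmVz"
    "IG9uIGEgbm90ZWJvb2sgdGhhdCBjb250YWlucyBpbWF"
)

before = words_checked(sample)
after = words_checked(sample, BASE64_RUN)

print("ue" in before)  # the fragment that triggers ue ==> use, due
print(after)           # what is left once the long run is ignored
```

Ordinary prose lines are much shorter than 100 uninterrupted base64 characters, so they pass through untouched. If your notebooks contain shorter encoded blobs, you may need to lower the threshold, at the cost of possibly skipping long identifiers.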