I'm reading lines from a file and trying to match them with a regex, but the match fails even though the pattern looks right. When I compare a line to the string literal it should equal, Python says they are not equal.
It looks like this is caused by a non-UTF-8 encoding on my file, but I'm not sure how to fix it because I don't know exactly which encoding is being used. This is a simplified version of the code I'm using to debug:
fp = open('tree.txt', 'r')
lines = [line.strip() for line in fp.readlines()]
fp.close()
for line in lines:
    print(f'|{line}| vs |{line}|')
    print(line == "[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT")
    print(line.encode('utf-8'))
    print("[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT".encode('utf-8'))
My output for the line I'm expecting looks like this:
|[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT| vs |[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT|
False
b'\x00[\x00I\x00N\x00F\x00O\x00]\x00 \x00i\x00o\x00.\x00j\x00i\x00t\x00p\x00a\x00c\x00k\x00:\x00m\x00o\x00d\x00u\x00l\x00e\x002\x00:\x00j\x00a\x00r\x00:\x002\x00.\x000\x00-\x00S\x00N\x00A\x00P\x00S\x00H\x00O\x00T\x00'
b'[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT'
I'm generating tree.txt by running mvn dependency:tree > tree.txt on Windows 11 from the VSCode terminal, if that's any clue as to which encoding is being used. Is there a way to convert line into a string with this utf-8 encoding, b'[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT', agnostic of its current encoding? I did try opening the file with fp = open('tree.txt', 'r', encoding='utf-8'), but that did not work.
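The pattern of nulls in the output says that this file is encoded in UTF-16, apparently big-endian. Open it with encoding='utf-16be' (or with encoding='utf-16' if the file starts with a byte order mark, in which case Python picks the byte order for you).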
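As a rough illustration of where the nulls come from, the sketch below encodes the expected line as UTF-16 and then misreads it one byte per character, the way a single-byte default codec would. The names EXPECTED, SEEN and misread_lines are made up for this example, and latin-1 stands in for whatever default codec the question's open() call actually used (an assumption). It also suggests a caveat: once the text has been split on newlines and stripped, both byte orders can leave the same leading-null pattern, so the printed bytes alone may not settle the endianness.

```python
EXPECTED = "[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT"
# The str behind the b'\x00[\x00I\x00N...' output in the question:
# a NUL before every character, plus one trailing NUL.
SEEN = "\x00" + "\x00".join(EXPECTED) + "\x00"


def misread_lines(raw: bytes) -> list[str]:
    """Decode one byte per character, then split and strip like the question's code."""
    return [line.strip() for line in raw.decode("latin-1").splitlines()]


# Two lines with Windows line endings, encoded without a BOM in each byte order.
two_lines = (EXPECTED + "\r\n") * 2
big_endian = two_lines.encode("utf-16-be")
little_endian = two_lines.encode("utf-16-le")

# Check whether the pattern from the question shows up for each byte order.
print(SEEN in misread_lines(big_endian))
print(SEEN in misread_lines(little_endian))

# Decoding with the matching codec recovers the plain string.
print(big_endian.decode("utf-16-be").splitlines()[0] == EXPECTED)
print(little_endian.decode("utf-16-le").splitlines()[0] == EXPECTED)
```

If one of the explicit codecs gives garbage on the real file, trying the other byte order is a cheap next step.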
You might also want to figure out why Maven is producing output in UTF-16.
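On where the UTF-16 comes from: in Windows PowerShell 5.1 the > redirection writes UTF-16LE with a byte order mark by default, so the shell rather than Maven may be responsible. That is an assumption here, since the question doesn't say which shell the VSCode terminal is running. For the "agnostic of its current encoding" part, a minimal sketch is to read the file as bytes, look for a BOM, and fall back to checking where the NUL bytes sit. The helper names guess_encoding, decode_lines and read_lines are hypothetical, and the heuristic only covers UTF-8 and UTF-16 (UTF-32 and legacy code pages are not handled).

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Guess a text encoding from a BOM, else from where the NUL bytes fall."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec consumes the BOM and picks the byte order
    # No BOM: in mostly-ASCII UTF-16 text, every other byte is NUL.
    sample = raw[:1024]
    even_nuls = sample[0::2].count(0)  # NULs at offsets 0, 2, 4, ...
    odd_nuls = sample[1::2].count(0)   # NULs at offsets 1, 3, 5, ...
    if even_nuls > odd_nuls:
        return "utf-16-be"
    if odd_nuls > even_nuls:
        return "utf-16-le"
    return "utf-8"  # assumption: anything without NULs is treated as UTF-8


def decode_lines(raw: bytes) -> list[str]:
    """Decode raw bytes with the guessed encoding and return stripped lines."""
    text = raw.decode(guess_encoding(raw))
    return [line.strip() for line in text.splitlines()]


def read_lines(path: str) -> list[str]:
    """Read a file in binary mode so no codec is applied before the guess."""
    with open(path, "rb") as fp:
        return decode_lines(fp.read())
```

With this, lines = read_lines('tree.txt') would replace the first three lines of the question's code, and the comparison against "[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT" should then succeed if the file really is UTF-8 or UTF-16.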