python utf-8

Why is a line read from a file not == to its hardcoded string despite being printed as the same thing?


I'm reading lines from a file and trying to match them with a regex, but the match fails even though the pattern looks right. When I compare the line to a hardcoded string of what it should be, Python says they are not equal.

It looks like this is caused by a non-UTF-8 encoding on my file, but I'm not sure how to fix it because I don't know exactly which encoding is being used. This is a simplified version of the code I'm using to debug:

expected = "[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT"

# No encoding given, so the platform default is used
with open('tree.txt', 'r') as fp:
    lines = [line.strip() for line in fp]

for line in lines:
    print(f'|{line}| vs |{expected}|')
    print(line == expected)
    print(line.encode('utf-8'))
    print(expected.encode('utf-8'))

When the loop reaches the line I'm expecting, the output looks like this:

|[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT| vs |[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT|
False
b'\x00[\x00I\x00N\x00F\x00O\x00]\x00 \x00i\x00o\x00.\x00j\x00i\x00t\x00p\x00a\x00c\x00k\x00:\x00m\x00o\x00d\x00u\x00l\x00e\x002\x00:\x00j\x00a\x00r\x00:\x002\x00.\x000\x00-\x00S\x00N\x00A\x00P\x00S\x00H\x00O\x00T\x00'
b'[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT'

I'm generating tree.txt by running mvn dependency:tree > tree.txt on Windows 11 from the VS Code terminal, if that's any clue to what kind of encoding is being used. Is there a way to convert line into a string with the UTF-8 encoding b'[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT', regardless of its current encoding? I did try opening the file with fp = open('tree.txt', 'r', encoding='utf-8'), but that did not work.


Solution

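  • The pattern of nulls in the output says that this file is encoded in UTF-16: every ASCII character is stored as two bytes, one of which is \x00. Because the file was split into lines on the raw \n byte before being decoded properly, this output looks the same for either byte order, so on its own it doesn't show whether the file is big-endian or little-endian. Open it with encoding='utf-16', which uses the byte-order mark at the start of the file to pick the right one. If the file has no byte-order mark, name the byte order explicitly with encoding='utf-16le' or encoding='utf-16be'.

    You might also want to figure out why the output ends up in UTF-16. A likely cause is the shell rather than Maven: in Windows PowerShell, the > redirection writes UTF-16 with a byte-order mark by default.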
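
As for reading the file without knowing its encoding in advance, here is a minimal sketch that checks for a byte-order mark before opening it. It assumes the file starts with a BOM; the helper names sniff_encoding and read_lines are hypothetical, and the demo writes a small stand-in for tree.txt to a temporary directory instead of touching the real file.

```python
import codecs
import os
import tempfile


def sniff_encoding(path, default="utf-8"):
    """Guess a text encoding from the file's byte-order mark (hypothetical helper)."""
    with open(path, "rb") as fp:
        head = fp.read(3)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # this codec reads the BOM to pick the byte order
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a BOM that should be dropped
    return default          # no BOM found: this is only a guess


def read_lines(path):
    """Return the stripped lines of a file, decoded according to its BOM."""
    with open(path, "r", encoding=sniff_encoding(path)) as fp:
        return [line.strip() for line in fp]


# Demo: write a stand-in for tree.txt as UTF-16 (Python adds the BOM), then read it back.
expected = "[INFO] io.jitpack:module2:jar:2.0-SNAPSHOT"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tree.txt")
    with open(demo_path, "w", encoding="utf-16") as fp:
        fp.write("[INFO] some other line\n" + expected + "\n")
    detected = sniff_encoding(demo_path)
    lines = read_lines(demo_path)

print(detected)
print(lines[1] == expected)
```

This only recognizes UTF-8 and UTF-16 byte-order marks. A file without a BOM falls through to the default, so in that case the encoding still has to be known or guessed some other way.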