windows ubuntu encoding unzip winrar

Wrong filenames after unzipping on Ubuntu


Problem

I have a zip-file that I would like to unzip on Ubuntu with the correct filenames (they contain æ,ø,å).

What I have tried:

1. Unrar in Windows 10 - WORKS!

Everything works as expected and filenames are correct.

2. Unzip in Ubuntu

unzip file.zip

The characters æ, ø and å are missing from the filenames; for example, 'æ' has been replaced with 'C'.

I attempt to detect the encoding of the zip-file, but it doesn't seem to tell me anything.

file file.zip
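
A note on why `file` has nothing to report: a zip archive does not record which code page its filenames use. The one hint it can carry is bit 11 of each entry's general purpose flag, which marks the name as UTF-8. The following is a minimal sketch that reads that flag with Python's `zipfile`. The function name `utf8_flag_report` is made up for illustration.

```python
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag


def utf8_flag_report(archive):
    """Map each entry name to True if it is flagged as UTF-8, else False.

    `archive` may be a path or a file-like object. Names are shown the way
    zipfile decoded them, so unflagged non-ASCII names may look wrong here.
    """
    with zipfile.ZipFile(archive) as z:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in z.infolist()
        }
```

Calling `utf8_flag_report("file.zip")` and seeing `False` for the affected entries would suggest the names were stored in some legacy code page that has to be guessed.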

3. Unzip with encoding in Ubuntu

I attempt to unpack the file using various encodings that are often used for æ,ø,å-containing texts.

unzip -O UTF-8 file.zip
unzip -O ISO-8859-1 file.zip
unzip -O windows-1257 file.zip

None work...

4. Unzip using 7zip in Ubuntu

It was suggested that 7zip might fix the problem, but it does not.

7z x file.zip

5. Unzip using 7zip and danish language setting in Ubuntu

It was suggested that I change the Ubuntu language settings and then try again.

saveLang=$LANG
export LANG=da_DK
7z x file.zip
export LANG=$saveLang

This also does not work.

6. Unzip using Python3 in Ubuntu - WORKS!

The unzip works correctly if I use Python3 for the purpose, but there must be an easier way?

import zipfile

with zipfile.ZipFile('file.zip', "r") as z:
  z.extractall("/home/xxxx/")

7. Next step

I am considering finding a list of "ALL" encodings, extracting the filenames with each one, and going through the results manually. Something along the lines of this:

# Try each candidate encoding listed in encodings.txt (one per line)
while IFS= read -r p; do
  echo "$p"
  unzip -j -O "$p" file.zip
done <encodings.txt
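
The same brute-force idea can be sketched in Python so that the candidate names are only listed and nothing is extracted. This relies on `zipfile` decoding names without the UTF-8 flag as CP437 by default, so encoding back to CP437 recovers the raw bytes. The helper names `redecode` and `candidate_names` are hypothetical, and the list of encodings to try is up to you.

```python
import zipfile

UTF8_NAME_FLAG = 0x800  # entries with this flag are already UTF-8


def redecode(name, encoding):
    """Undo zipfile's default CP437 decoding, then decode the raw bytes again."""
    return name.encode("cp437").decode(encoding, errors="replace")


def candidate_names(archive, encodings):
    """Return {encoding: [entry names]} without extracting anything."""
    result = {}
    with zipfile.ZipFile(archive) as z:
        infos = z.infolist()
    for enc in encodings:
        result[enc] = [
            info.filename
            if info.flag_bits & UTF8_NAME_FLAG
            else redecode(info.filename, enc)
            for info in infos
        ]
    return result
```

For example, `candidate_names("file.zip", ["cp437", "cp850", "cp1252"])` gives one list of names per encoding, which can then be compared by eye.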

Conclusion

Windows and Python3 seem to have some MAGIC under the hood that I cannot replicate. Do you guys have any suggestions as to what this "MAGIC" is?


Solution

  • The key piece of information you provided was that unrar on Windows was able to create the filenames correctly. Unless unrar is doing some encoding detection under the hood, there is a good chance that the encoding used in the zip file matches the default code page of your Windows setup.

    Running chcp on Windows shows that your code page is

    Active code page: 850
    

    It's then a simple matter of telling unzip that the encoding used in the zip file is CP850:

    unzip -O CP850 file.zip
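
To illustrate why the encodings tried in the question failed while CP850 works, here is a small sketch. It assumes the names were written on a CP850 system and checks which candidates turn the stored bytes back into æøå. The function name `matching_encodings` is made up for this example.

```python
def matching_encodings(raw, expected, candidates):
    """Return the candidate encodings under which `raw` decodes to `expected`."""
    matches = []
    for enc in candidates:
        try:
            if raw.decode(enc) == expected:
                matches.append(enc)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding at all.
            pass
    return matches


# Bytes for the three letters as a CP850 system would store them.
raw = "æøå".encode("cp850")

# Show what each encoding from the question makes of the same bytes.
for enc in ["utf-8", "iso-8859-1", "cp1257", "cp850"]:
    print(enc, repr(raw.decode(enc, errors="replace")))
```

Only the CP850 line prints the original letters. The other encodings either reject the bytes or map them to unrelated characters.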