I am trying to unzip a huge zip file split into several parts. I am on a MacBook and I am using:
>> unzip '*.zip' -d <unzip_path>
All works well, but during the unzipping process some of the files report:
illegal byte sequence
And they are not extracted.
I am well aware that this is caused by non-ASCII characters, such as accented letters (á),
in the names of some of the files inside some of the .zip file parts.
I would like to know how to solve this, and still be able to extract the problematic files.
Going through the different zip file parts and somehow replacing the file names is not an option, since there are so many files with illegal characters.
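Without seeing the zip file (is it publicly available?) I'm guessing at the issue, but in your case I suspect the problem is as follows: the file names inside the archive were stored in some encoding other than UTF-8, so their raw bytes do not form valid UTF-8 names.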
To unzip the files and get the file names right, the names need to be converted from whatever encoding was used in the zip file to UTF-8.
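As a small illustration of what that conversion means at the byte level, here is a sketch using iconv. It assumes, purely for the example, that the names were stored as ISO-8859-1, where "á" is the single byte 0xE1; the real archive may use a different encoding. The helper names latin1_to_utf8 and is_valid_utf8 are made up for this sketch.

```shell
# Assumption for illustration only: the archive stored names as ISO-8859-1.
# In that encoding "á" is the single byte 0xE1 (octal 341).

# Convert stdin from ISO-8859-1 to UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Succeed only if stdin is already well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The raw name bytes as stored in the archive are rejected as UTF-8 ...
if printf '\341.txt' | is_valid_utf8; then
    echo "raw name is valid UTF-8"
else
    echo "raw name is NOT valid UTF-8"
fi

# ... but after conversion the same name is fine.
printf '\341.txt' | latin1_to_utf8
echo
```

The raw byte 0xE1 on its own is the kind of input that leads to an "illegal byte sequence" error; after conversion it becomes the two-byte UTF-8 form of "á".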
Some newish versions of unzip have a -I option that will do this conversion for you. Below is the help text from unzip on my Ubuntu setup. Note the presence of the line with -I CHARSET.
$ unzip -h
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
  Default action is to extract files in list, except those in xlist, to exdir;
  file[.zip] may be a wildcard.  -Z => ZipInfo mode ("unzip -Z" for usage).
  -p  extract files to pipe, no messages     -l  list files (short format)
  -f  freshen existing files, create none    -t  test compressed archive data
  -u  update files, create if necessary      -z  display archive comment only
  -v  list verbosely/show version info       -T  timestamp archive to latest
  -x  exclude files that follow (in xlist)   -d  extract files into exdir
modifiers:
  -n  never overwrite existing files         -q  quiet mode (-qq => quieter)
  -o  overwrite files WITHOUT prompting      -a  auto-convert any text files
  -j  junk paths (do not make directories)   -aa treat ALL files as text
  -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields
  -C  match filenames case-insensitively     -L  make (some) names lowercase
  -X  restore UID/GID info                   -V  retain VMS version numbers
  -K  keep setuid/setgid/tacky permissions   -M  pipe through "more" pager
  -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives
  -I CHARSET  specify a character encoding for UNIX and other archives
See "unzip -hh" or unzip.txt for more help.  Examples:
  unzip data1 -x joe   => extract all files except joe from zipfile data1.zip
  unzip -p foo | more  => send contents of foo.zip via pipe into program more
  unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer
If you do have this option available, you just run it like this (replacing ISO-8859-7 with whatever encoding is used in the zip file):
$ unzip -I ISO-8859-7 some-file.zip
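Since not every unzip build has this option, it may help to check for it before running the extraction over all the parts. The following is only a sketch: the function names has_charset_option and extract_with_charset are made up here, the check simply looks for the "-I CHARSET" line in the help text shown above, and the '*.zip' wildcard is taken from the command in the question.

```shell
# Succeeds if the help text on stdin advertises the -I CHARSET option.
has_charset_option() {
    grep -q -e '-I CHARSET'
}

# Hypothetical helper: extract every .zip part in the current directory,
# converting file names from the given charset.
#   $1 = charset used inside the archive (for example ISO-8859-7)
#   $2 = destination directory
extract_with_charset() {
    if unzip -h 2>&1 | has_charset_option; then
        unzip -I "$1" '*.zip' -d "$2"
    else
        echo "this unzip build has no -I CHARSET option" >&2
        return 1
    fi
}
```

After defining the functions, a call such as extract_with_charset ISO-8859-7 ./extracted would run the extraction, where both the charset and the destination directory are placeholders to replace with your own values.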
If your unzip is too old, an alternative is 7z, which has a command-line option -scs that allows you to specify the charset used in the filenames.