Following this question, I'm attempting to load a 40 GB TAR file containing bz2-compressed JSON files into PostgreSQL in an efficient manner.
As per the answer mentioned above, I'm trying to split the process into stages and use external tools to create the following flow.
I'm currently getting an error when the stream reaches bzcat. This is the code I use to build the command line that executes the above:
pipeline = [filename[1:3] + " && ", # Change drive to H: so that tar can find the file without a drive name (it doesn't like absolute paths, apparently).
'"C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"', # Call to tar; the O flag writes the extracted files to stdout
" | C:\\Tools\\GnuWin32\\gnuwin32\\bin\\bzcat.exe", # Forward its output to bzcat
' | python "D:\\Cloud\\Dropbox\\Coding\\GitHub\\pyTwitter\\pyTwitter_filehandling.py"', # Extract Tweets
' | "C:\\Program Files\\PostgreSQL\\9.4\\bin\\psql.exe" -1f copy.sql ' + secret_login_d # Backslashes are doubled: an unescaped "\b" would become a backspace character
]
module_call = "".join(pipeline)
This results in the following command line:
H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx"
When I execute only the tar step, its output is written to the CMD prompt, which suggests that all is well. However, adding the bzcat step produces an error:
x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
Running -tvv gives me:
huff+mtf data integrity (CRC) error in data
I've tried to extract the same archive with 7-zip (GUI), and that still works. Any help on how to troubleshoot this would be greatly appreciated. I'm running Windows 8.1 with GnuWin32.
bsdtar.exe is translating newline bytes (LF) in the file data into the DOS CRLF sequence, which corrupts the bzip2 stream it writes to stdout.
GNU tar worked when using relative paths, but it does not handle absolute paths on Windows.
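To see why such a translation is fatal for compressed data, here is a minimal Python sketch that imitates it. It does not call bsdtar; the helper names (crlf_translate, decompresses_to) and the random sample payload are made up for illustration, and the byte replacement only approximates what a text-mode stdout does on Windows.

```python
import bz2
import random


def crlf_translate(data: bytes) -> bytes:
    """Imitate a text-mode stdout on Windows: every LF byte becomes CR LF."""
    return data.replace(b"\n", b"\r\n")


def decompresses_to(compressed: bytes, expected: bytes) -> bool:
    """Return True only if the stream decompresses cleanly to the expected bytes."""
    try:
        return bz2.decompress(compressed) == expected
    except (OSError, ValueError, EOFError):
        # bz2 reports corrupt or truncated input by raising an exception.
        return False


# Hypothetical sample data: 64 KiB of pseudo-random bytes with a fixed seed.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(1 << 16))

compressed = bz2.compress(payload)
damaged = crlf_translate(compressed)

# A compressed stream is binary, so it contains 0x0A bytes that are not line ends.
print("LF bytes in compressed stream:", compressed.count(b"\n"))
print("intact stream decompresses:", decompresses_to(compressed, payload))
print("translated stream decompresses:", decompresses_to(damaged, payload))
```

Each inserted CR byte shifts everything that follows it, so the decompressor's integrity check should fail in much the same way as the bzcat error above.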
Your best bet is to use 7-zip instead:
7z.exe x -so -ir!*.json.bz2 archive.tar | bzcat | ...
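If you launch the pipeline from Python anyway, one option is to chain the processes with subprocess instead of building a single cmd string. The sketch below is only an assumption about how that could look: run_pipeline is a hypothetical helper, and the 7z/bzcat command lists in the comment are taken from the line above and left truncated where it is. Passing argument lists avoids the quoting and backslash issues, and the pipes between the processes carry raw bytes.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands as a chain, feeding each stdout into the next stdin.

    Returns the final stage's stdout as bytes plus every exit code.
    """
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy so the upstream process notices if this stage exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    return output, [proc.returncode for proc in procs]


# Hypothetical usage with the tools from the answer (executables assumed to be on PATH):
# output, codes = run_pipeline([
#     ["7z.exe", "x", "-so", "-ir!*.json.bz2", "archive.tar"],
#     ["bzcat"],
# ])

# Self-contained demonstration with two Python child processes instead of 7z and bzcat.
producer = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(b'a\\nb\\x00c\\n')"]
consumer = [sys.executable, "-c",
            "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
demo_output, demo_codes = run_pipeline([producer, consumer])
print(demo_output, demo_codes)
```

For a 40 GB archive you would not collect the final output in memory like this; the last stage would be psql, which consumes the stream itself.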