compressiontargoogle-photos

Compressed Google Takeout Files - Divided .tgz file of 50 GB parts


I used Google Takeout to download all my uploaded Google Photos archive in the original quality. It divides the data in 50 GB (the biggest option) of compressed chunks. I chose .tgz file and I downloaded them using rclone in my Raspberry pi (running ubuntu 20.4).

There are more the 40 files that take 2.2 TB space as listed below:

ubuntu@ubuntu:/Takeout/compressed$ ls -lah
total 2.2T
drwxrwxr-x 2 ubuntu ubuntu 4.0K Mar 19 07:24 .
drwxrwxr-x 4 ubuntu ubuntu 4.0K Mar 22 21:05 ..
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:15 takeout-20210218T203743Z-001-049.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:20 takeout-20210218T203743Z-001.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:16 takeout-20210218T203743Z-002-047.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:28 takeout-20210218T203743Z-002.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:14 takeout-20210218T203743Z-003-041.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:28 takeout-20210218T203743Z-003.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:16 takeout-20210218T203743Z-004-051.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:37 takeout-20210218T203743Z-004.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:17 takeout-20210218T203743Z-005-053.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:39 takeout-20210218T203743Z-005.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:12 takeout-20210218T203743Z-006-037.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:47 takeout-20210218T203743Z-006.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:16 takeout-20210218T203743Z-007-045.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 03:56 takeout-20210218T203743Z-007.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:15 takeout-20210218T203743Z-008-039.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 04:04 takeout-20210218T203743Z-008.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 07:12 takeout-20210218T203743Z-009-043.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 04:32 takeout-20210218T203743Z-009.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 04:58 takeout-20210218T203743Z-010.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 05:17 takeout-20210218T203743Z-011.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 05:18 takeout-20210218T203743Z-012.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 05:25 takeout-20210218T203743Z-013.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 05:40 takeout-20210218T203743Z-014.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 06:19 takeout-20210218T203743Z-015.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 06:18 takeout-20210218T203743Z-016.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:39 takeout-20210218T203743Z-017.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-018.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-019.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-020.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-021.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:34 takeout-20210218T203743Z-022.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:38 takeout-20210218T203743Z-023.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-024.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-025.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 08:35 takeout-20210218T203743Z-026.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 09:14 takeout-20210218T203743Z-027.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 09:16 takeout-20210218T203743Z-028.tgz
-rw-rw-r-- 1 ubuntu ubuntu  51G Feb 19 09:15 takeout-20210218T203743Z-029.tgz
-rw-rw-r-- 1 ubuntu ubuntu  50G Feb 19 09:17 takeout-20210218T203743Z-030.tgz
-rw-rw-r-- 1 ubuntu ubuntu  50G Feb 19 12:00 takeout-20210218T203743Z-031.tgz
-rw-rw-r-- 1 ubuntu ubuntu  50G Feb 19 10:29 takeout-20210218T203743Z-032.tgz
-rw-rw-r-- 1 ubuntu ubuntu  50G Feb 19 09:43 takeout-20210218T203743Z-033.tgz
-rw-rw-r-- 1 ubuntu ubuntu  50G Feb 19 11:16 takeout-20210218T203743Z-034.tgz
-rw-rw-r-- 1 ubuntu ubuntu  11G Feb 19 12:10 takeout-20210218T203743Z-035.tgz

The parts are numbered from 1 to 35, but then there are 9 other files with an additional numbers.. I don't know what's the correct order here...

Then I tried extracting this multi-level parts of compressed data using tar.

I tried two approaches so far:

  1. cat ./compressed/takeout-20210218T203743Z-*.tgz | tar xzivf - 2> error.logs 1> output.logs
  2. tar -xzf compressed/* -C ./

Both have extracted only 1.8 TB data without any serious error (just three small files had timestamp that are in the future) --> 1.8T ./Takeout/

Is it possible that compressed files are bigger than their extract? It seems like I lost around 400 GBs while extracting. How can I cross-check the content from multi-part compressed archive and make sure that all data is extracted completely?

I assume there are some big files that are splitted into two different parts and tar cannot detect, therefore skip, them while extracting.

Can you please help me in resolution of this issue?

I now requested another .zip export, again parted in 50 GB chunks. I will try that one however it will take around 10 days to download it..


Solution

  • Yes, it's possible, and in fact certain in the case of photos for the compressed data to be slightly larger than the uncompressed data. Photos are already compressed.

    However, only very slightly larger. Typically 0.03% larger. Certainly not 20% larger.

    The file names you show suggest that there are duplicate files. You may be extracting the same files twice. If I assume that takeout-20210218T203743Z-001-049.tgz has the same contents as takeout-20210218T203743Z-001.tgz, and so on for the other eight such files, then 451 GB would be extracted twice. That correlates approximately to your 1.8 TB being extracted out of 2.2 TB.

    The way to check is to look at the contents of the .tgz files using tar tvfz file.tgz.