bash, merge, size, csplit

Merge and split, using csplit in bash


I'm merging three files (ls -l):

-rw-rw-r-- 1 kacper kacper 1839510 sie 13 14:27 A.jpg
-rw-rw-r-- 1 kacper kacper 2014809 sie 13 14:27 B.jpg
-rw-rw-r-- 1 kacper kacper 1277047 sie 13 14:27 C.pdf

into one file (merged) in bash using:

cat A.jpg >> merged 
echo $SEPARATOR >> merged 
cat B.jpg >> merged 
echo $SEPARATOR >> merged 
cat C.pdf >> merged

where:

SEPARATOR=PO56WLH82SN1ZS5QH5EU9FOZVLBRLHAGHO3D5KOUSPMS6KYSFAYN2DBL

Next I'm splitting the merged file into three parts using:

csplit --suppress-matched merged --prefix="PART_" '/'$SEPARATOR'/' {*}

this produces PART_00, PART_01, PART_02 (ls -l):

-rw-rw-r--  1 kacper kacper 1839398 sie 13 18:41 PART_00
-rw-rw-r--  1 kacper kacper 2014507 sie 13 18:41 PART_01
-rw-rw-r--  1 kacper kacper 1277047 sie 13 18:41 PART_02

PART_00 and PART_01 are JPG files and can be properly displayed. PART_02 is a PDF file and it can be opened and viewed. So, at first glance this looked to me like success.

The problem is that PART_00 (1839398 bytes) is slightly smaller than A.jpg (1839510 bytes). The same goes for PART_01 (2014507 bytes) and B.jpg (2014809 bytes); only PART_02 has the same size as C.pdf. After comparing the files byte by byte using cmp, each pair is identical up to the point where the shorter file ends.

Does anyone know why this is the case? Any advice would be greatly appreciated.


Solution

  • The last line of each file is not terminated by a newline character. So when you append the separator to the merged file, it lands on the same line as the file's final bytes. csplit then matches that whole line and drops it, which is why the last few bytes of the file are lost along with the separator. C.pdf is the last file and has no separator after it, so PART_02 keeps its full size (1277047 bytes).

    The --suppress-matched option makes csplit suppress the entire line in which the pattern is matched, not just the matched text.
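
A minimal sketch of the effect and one possible workaround, assuming GNU coreutils (csplit with --suppress-matched, truncate). The tiny a.bin, b.bin and c.bin files and the FIXED_ prefix are made up for illustration and stand in for A.jpg, B.jpg and C.pdf. The workaround puts the separator on a line of its own by writing a newline before it, then removes that one extra byte from every part except the last.

```shell
#!/usr/bin/env bash
# Demo on small stand-in files; all file names here are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

SEPARATOR=PO56WLH82SN1ZS5QH5EU9FOZVLBRLHAGHO3D5KOUSPMS6KYSFAYN2DBL

# Stand-ins for the real files: each one ends WITHOUT a trailing newline.
printf 'line1\ntailA' > a.bin
printf 'line1\ntailB' > b.bin
printf 'line1\ntailC' > c.bin

# 1) Reproduce the problem: the separator shares a line with "tailA"/"tailB",
#    so csplit drops those bytes together with the separator.
{ cat a.bin; echo "$SEPARATOR"; cat b.bin; echo "$SEPARATOR"; cat c.bin; } > merged
csplit --quiet --suppress-matched --prefix=PART_ merged "/$SEPARATOR/" '{*}'
wc -c a.bin PART_00 b.bin PART_01 c.bin PART_02

# 2) Workaround: write a newline BEFORE the separator so it sits on its own line.
{ cat a.bin; printf '\n%s\n' "$SEPARATOR"
  cat b.bin; printf '\n%s\n' "$SEPARATOR"
  cat c.bin; } > merged2
csplit --quiet --suppress-matched --prefix=FIXED_ merged2 "/^$SEPARATOR\$/" '{*}'

# Every part except the last now ends with the newline added above; cut it off.
truncate -s -1 FIXED_00 FIXED_01

# cmp -s is silent and exits 0 only when the files are identical.
cmp -s a.bin FIXED_00 && cmp -s b.bin FIXED_01 && cmp -s c.bin FIXED_02 \
  && echo "all parts match the originals"
```

Note that csplit is line-oriented, so this still relies on the separator never occurring inside the binary data. Recording each file's size at merge time and splitting by byte offsets would avoid that assumption.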