regex sed

complex string substitution


The aim of this question is to replace the /PageLabels code in a PDF file with a different one. We have to do this because of a bug in the program that prints the PDF (we can't change the program). Doing it by hand takes a lot of time (we produce 50 PDF files per hour).

To be pragmatic, the problem can be summarized as follows.

Old /PageLabels code: located in the original file, a.pdf.

We use grep to extract the incorrect /PageLabels code:

grep -aPo '/PageLabels\K[^"]*>>]>>' a.pdf

<</Nums[0<</S/r/St 1>>6<</S/r/St 7>>10<</S/r/St 11>>12<</S/r/St 13>>14<</P(1-)/S/D/St 1>>20<</P(2-)/S/D/St 1>>28<</P(3-)/S/D/St 1>>80<</P(4-)/S/D/St 1>>116<</P(A-)/S/D/St 1>>132<</P(B-)/S/D/St 1>>134<</P(C-)/S/D/St 1>>138<</P(D-)/S/D/St 1>>148<</P(E-)/S/D/St 1>>168<</P(F-)/S/D/St 1>>176<</P(G-)/S/D/St 1>>182<</P(Glossary-)/S/D/St 1>>194<</P(Comments-)/S/D/St 1>>]>>
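To make it clear what that grep invocation extracts, here is a minimal sketch run against a throwaway one-line text file instead of a real PDF. The sample content and the variable names are assumptions made up for illustration, and it assumes GNU grep built with PCRE support (`-P`).

```shell
#!/bin/bash
# Hypothetical sample standing in for a.pdf (not taken from a real file)
sample=$(mktemp)
printf '%s\n' '1 0 obj<</Type/Catalog/PageLabels<</Nums[0<</S/r/St 1>>6<</P(1-)/S/D/St 1>>]>>/Pages 2 0 R>>' > "$sample"

# -a  treat the (normally binary) file as text
# -P  use Perl-compatible regular expressions
# -o  print only the matched part
# \K  discard everything matched so far, so "/PageLabels" itself is not printed
extracted=$(grep -aPo '/PageLabels\K[^"]*>>]>>' "$sample")
printf '%s\n' "$extracted"

rm -f "$sample"
```

The match is greedy, so it runs up to the last `>>]>>` on the line; that is what we want here because the label list closes exactly once.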

New /PageLabels code: we want to replace the old /PageLabels code with the following. It is the output of another script, which re-evaluates the PDF and produces the correct /PageLabels code (tested and verified manually).

<</Nums[0<</S/r/St 1>>12<</P(1-)/S/D/St 1>>17<</P(2-)/S/D/St 1>>32<</P(3-)/S/D/St 1>>98<</P(4-)/S/D/St 1>>130<</P(A-)/S/D/St 1>>153<</P(B-)/S/D/St 1>>154<</P(C-)/S/D/St 1>>158<</P(D-)/S/D/St 1>>187<</P(E-)/S/D/St 1>>230<</P(F-)/S/D/St 1>>242<</P(G-)/S/D/St 1>>247<</P(Glossary-)/S/D/St 1>>259<</P(Comments-)/S/D/St 1>>]>>

The result should be saved in another file called b.pdf.

We don't know how to write this substitution using sed.

Any ideas would be greatly appreciated.


Solution

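  • Use the replace utility (distributed with MySQL and MariaDB) rather than sed. It matches plain strings instead of regular expressions, so the brackets in the /PageLabels code need no escaping: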

    #! /bin/bash
    old=$(grep -aPo '/PageLabels\K[^"]*>>]>>' a.pdf) ## Get Old /PageLabels code
    new=$(/tmp/get_correct_code.sh )  ## Get New /PageLabels code
    replace "$old" "$new" < a.pdf > b.pdf  ## Read a.pdf on stdin, write the patched copy to b.pdf
    

    From the man page:

    DESCRIPTION
           The replace utility program changes strings in place in files or on the standard input.
    
           Invoke replace in one of the following ways:
    
              shell> replace from to [from to] ... -- file_name [file_name] ...
              shell> replace from to [from to] ... < file_name
    

    UPDATE: If you prefer to use sed, you could try it this way:

    #! /bin/bash
    old=$(grep -aPo '/PageLabels\K[^"]*>>]>>' a.pdf) ## Get Old /PageLabels code
    new=$(/tmp/get_correct_code.sh )  ## Get New /PageLabels code
    
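    # sed treats $old as a regular expression, so escape the characters that are
    # special in a basic regex ([ \ . * ^ $) plus the @ used as delimiter below
    eold=$(printf '%s\n' "$old" | sed 's/[[\.*^$@]/\\&/g')
    # In the replacement, \ and & are special, as is the @ delimiter
    enew=$(printf '%s\n' "$new" | sed 's/[\&@]/\\&/g')
    
    # Then do the replacement; LC_ALL=C makes sed treat the PDF as raw bytes
    LC_ALL=C sed "s@$eold@$enew@g" a.pdf > b.pdf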
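The escaping step is the part that usually goes wrong, so here is a minimal sketch of the sed approach on a throwaway text file. The shortened `old` and `new` strings and the file contents are assumptions for illustration only, not real PDF data. The pattern is escaped for a basic regular expression and the replacement is escaped separately, since `&` and `\` are special there.

```shell
#!/bin/bash
# Hypothetical, shortened label strings (not the real ones from a.pdf)
old='<</Nums[0<</S/r/St 1>>6<</P(1-)/S/D/St 1>>]>>'
new='<</Nums[0<</S/r/St 1>>12<</P(1-)/S/D/St 1>>]>>'

# Throwaway input file standing in for a.pdf
demo_in=$(mktemp)
printf '/PageLabels%s/Pages 2 0 R\n' "$old" > "$demo_in"

# Escape the basic-regex metacharacters [ \ . * ^ $ and the @ delimiter in the pattern
eold=$(printf '%s\n' "$old" | sed 's/[[\.*^$@]/\\&/g')
# Escape \ and & (special in a replacement) and the @ delimiter
enew=$(printf '%s\n' "$new" | sed 's/[\&@]/\\&/g')

# Substitute using @ as the delimiter, because the strings are full of slashes
result=$(LC_ALL=C sed "s@$eold@$enew@g" "$demo_in")
printf '%s\n' "$result"

rm -f "$demo_in"
```

A closing `]` needs no escaping once the opening `[` has been escaped, because outside a bracket expression it is an ordinary character.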