regex shell sed

Extracting strings from a file with sed and regular expressions


I would like to ask about extracting specific strings from a file using sed and regular expressions.

Below is an example of the input text file (testfile.txt):

# This file contains a short description of the columns in the
# meta-analysis summary file, named '/some/output/directory/result.txt'

# (Skipping some comment lines...)

# Input for this meta-analysis was stored in the files:
# --> Input File 1 : /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
# --> Input File 2 : /some/input/directory/cohort2/subdir1/chrAll-out.txt
# --> Input File 3 : /some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
# --> Input File 4 : /some/input/directory/cohort3/resfile.txt
# --> Input File 5 : /some/input/directory/cohort4/regenie_res_chrAll.txt

From this file, I would like to extract the list of input file names, so the result should look like:

/some/input/directory/cohort1/dataset1_chrAll.regenie.txt
/some/input/directory/cohort2/subdir1/chrAll-out.txt
/some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
/some/input/directory/cohort3/resfile.txt
/some/input/directory/cohort4/regenie_res_chrAll.txt

Below is what I tried:

Try 1

This is the initial command that I used.

cat testfile.txt | sed -e 's/\/some\/input\/directory\/([A-z0-9\/\.\-]*)\.txt/$1/g'

Result:

sed: -e expression #1, char 55: Invalid range end

Try 2

After some searching, I tried escaping the parentheses with backslashes.

cat testfile.txt | sed -e 's/\/some\/input\/directory\/\([A-z0-9\/\.\-]*\).txt/$1/g'

Result:

sed: -e expression #1, char 56: Invalid range end

So it did not solve the problem.

Try 3

I also tried escaping brackets.

cat testfile.txt | sed -e 's/\/some\/input\/directory\/\(\[A-z0-9\/\.\-\]\*\)\.txt/$1/g'

Result:

# This file contains a short description of the columns in the
# meta-analysis summary file, named '/some/output/directory/result.txt'

# (Skipping some comment lines...)

# Input for this meta-analysis was stored in the files:
# --> Input File 1 : /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
# --> Input File 2 : /some/input/directory/cohort2/subdir1/chrAll-out.txt
# --> Input File 3 : /some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
# --> Input File 4 : /some/input/directory/cohort3/resfile.txt
# --> Input File 5 : /some/input/directory/cohort4/regenie_res_chrAll.txt

This did not raise an error, but this was not what I wanted.

Try 4

Lastly, I tried adding the -r option without escaping the parentheses or brackets.

cat testfile.txt | sed -re 's/\/some\/input\/directory\/([A-z0-9\/\.\-]*)\.txt/$1/g'

Result:

sed: -e expression #1, char 55: Invalid range end

It showed the same error message as the first try.

I would like to ask what the problems in my command lines are and whether there is any possible solution for this.

Thank you.


Solution

  • What I would do:

    $ grep -oP -- '--> .* \K(?:/[\w.-]+)+' file
    /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
    /some/input/directory/cohort2/subdir1/chrAll-out.txt
    /some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
    /some/input/directory/cohort3/resfile.txt
    /some/input/directory/cohort4/regenie_res_chrAll.txt
    

    The regular expression matches as follows:

    Node       Explanation
    -->        '--> '
    .*         any character except \n (0 or more times (matching the most amount possible))
    ' '        a space
    \K         resets the start of the match (what is Kept), as a shorter alternative to a look-behind assertion (see "look arounds" and "Support of \K in regex")
    (?:        group, but do not capture (1 or more times (matching the most amount possible)):
    /          '/'
    [\w.-]+    any character of: word characters (a-z, A-Z, 0-9, _), '.', '-' (1 or more times (matching the most amount possible))
    )+         end of grouping
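
  • If you prefer to stay with sed, here is a minimal sketch, not a definitive fix. A few things in the original attempts are worth pointing out. sed writes a back-reference as \1, not $1. Without -E (or -r in GNU sed), groups are written \( \); with it they are plain ( ). Escaping the brackets, as in Try 3, makes them literal characters, so nothing matches. Inside a bracket expression a backslash is itself a literal character, so \. and \- are unnecessary. The "Invalid range end" message probably comes from the A-z range, whose meaning depends on the locale; this is a guess, since I cannot reproduce your environment. The sketch uses [[:alnum:]] instead. The function names strip_prefix and capture_path are hypothetical, and the sample file is just a few lines copied from the question.

```shell
#!/bin/sh
# Sample input: a few lines copied from the question's testfile.txt.
cat > testfile.txt <<'EOF'
# This file contains a short description of the columns in the
# meta-analysis summary file, named '/some/output/directory/result.txt'

# Input for this meta-analysis was stored in the files:
# --> Input File 1 : /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
# --> Input File 2 : /some/input/directory/cohort2/subdir1/chrAll-out.txt
# --> Input File 4 : /some/input/directory/cohort3/resfile.txt
EOF

# Variant 1 (hypothetical helper): delete the fixed prefix.
# -n suppresses the default output; the p flag prints only the lines
# on which the substitution actually happened.
strip_prefix() {
    sed -n 's|^# --> Input File [0-9]* : ||p' "$1"
}

# Variant 2 (hypothetical helper): keep the question's idea of capturing the path.
# -E enables extended regexes, so ( ) groups need no backslashes,
# and the captured text is referenced as \1.
# A comma is used as the delimiter so the slashes need no escaping.
capture_path() {
    sed -nE 's,^.*(/some/input/directory/[[:alnum:]_/.-]*\.txt)$,\1,p' "$1"
}

strip_prefix testfile.txt
capture_path testfile.txt
```

    Both variants print only the path part of the "Input File" lines and skip everything else, including the comment line that mentions /some/output/directory/result.txt. Variant 1 is simpler when the prefix is fixed; variant 2 is closer to the original attempts and only accepts paths under /some/input/directory/ that end in .txt.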