I would like to ask about extracting specific strings from a file using sed and regular expressions.
Below is an example of the input text file (testfile.txt):
# This file contains a short description of the columns in the
# meta-analysis summary file, named '/some/output/directory/result.txt'
# (Skipping some comment lines...)
# Input for this meta-analysis was stored in the files:
# --> Input File 1 : /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
# --> Input File 2 : /some/input/directory/cohort2/subdir1/chrAll-out.txt
# --> Input File 3 : /some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
# --> Input File 4 : /some/input/directory/cohort3/resfile.txt
# --> Input File 5 : /some/input/directory/cohort4/regenie_res_chrAll.txt
From this file, I would like to extract the list of input file names, so the result should look like:
/some/input/directory/cohort1/dataset1_chrAll.regenie.txt
/some/input/directory/cohort2/subdir1/chrAll-out.txt
/some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
/some/input/directory/cohort3/resfile.txt
/some/input/directory/cohort4/regenie_res_chrAll.txt
Below is what I tried:
This is the initial command that I used.
cat testfile.txt | sed -e 's/\/some\/input\/directory\/([A-z0-9\/\.\-]*)\.txt/$1/g'
Result:
sed: -e expression #1, char 55: Invalid range end
After some searching, I tried escaping the parentheses with backslashes.
cat testfile.txt | sed -e 's/\/some\/input\/directory\/\([A-z0-9\/\.\-]*\).txt/$1/g'
Result:
sed: -e expression #1, char 56: Invalid range end
So it did not solve the problem.
I also tried escaping brackets.
cat testfile.txt | sed -e 's/\/some\/input\/directory\/\(\[A-z0-9\/\.\-\]\*\)\.txt/$1/g'
Result:
# This file contains a short description of the columns in the
# meta-analysis summary file, named '/some/output/directory/result.txt'
# (Skipping some comment lines...)
# Input for this meta-analysis was stored in the files:
# --> Input File 1 : /some/input/directory/cohort1/dataset1_chrAll.regenie.txt
# --> Input File 2 : /some/input/directory/cohort2/subdir1/chrAll-out.txt
# --> Input File 3 : /some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
# --> Input File 4 : /some/input/directory/cohort3/resfile.txt
# --> Input File 5 : /some/input/directory/cohort4/regenie_res_chrAll.txt
This did not raise an error, but this was not what I wanted.
Lastly, I tried adding the -r option while not escaping the parentheses or brackets.
cat testfile.txt | sed -re 's/\/some\/input\/directory\/([A-z0-9\/\.\-]*)\.txt/$1/g'
Result:
sed: -e expression #1, char 55: Invalid range end
It showed the same error message as the first try.
I would like to ask what the problems in my command lines are and whether there is any possible solution for this.
Thank you.
What I would do:
$ grep -oP -- '--> .* \K(?:/[\w.-]+)+' file
/some/input/directory/cohort1/dataset1_chrAll.regenie.txt
/some/input/directory/cohort2/subdir1/chrAll-out.txt
/some/input/directory/cohort2/subdir2/chrAll-out_ver2.txt
/some/input/directory/cohort3/resfile.txt
/some/input/directory/cohort4/regenie_res_chrAll.txt
Node | Explanation |
---|---|
--> | '--> ' |
.* | any character except \n (0 or more times (matching the most amount possible)) |
(space) | ' ' |
\K | resets the start of the match (what is Kept) as a shorter alternative to using a look-behind assertion (see: look arounds and Support of \K in regex) |
(?: | group, but do not capture (1 or more times (matching the most amount possible)): |
/ | '/' |
[\w.-]+ | any character of: word characters (a-z, A-Z, 0-9, _), '.', '-' (1 or more times (matching the most amount possible)) |
)+ | end of grouping |