
Linux shell: How to extract semicolon-separated substrings from a file using sed regex


I have a file with the following content:

10|subpath ; xxx ; xxx ; xxx ; xxx ; substring ; xxx ; ......
12|subpath ; xxx ; xxx ; xxx ; xxx ; substring ; xxx ; ......
18|subpath ; xxx ; xxx ; xxx ; xxx ;  ; xxx ; ......

I want to use sed to extract:

  1. the content before the 1st semicolon
  2. the 'substring' between the 5th and 6th semicolons (it can be any string, or empty, and there may be spaces between it and the semicolons on either side).

After extraction, I want output like this:

10|subpath;substring_fs
12|subpath;substring_fs
18|subpath;fs

If 'substring' is empty, put just 'fs' there. If 'substring' is non-empty, append '_fs' to it.


Solution

  • sed -E '
        # strip whitespace around every semicolon for simplicity
        s/[[:space:]]*;[[:space:]]*/;/g
    
        # extract relevant fields
        s/^([^;]*)(;[^;]*){4}(;[^;]*).*/\1\3/
    
        # append _ only if the extracted field is non-empty (line does not end in ;)
        s/[^;]$/&_/
    
        # append fs
        s/$/fs/
    ' file
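
As a quick sanity check, the same four substitutions can be run against sample lines shaped like the ones in the question. This is only a sketch: it assumes a sed that supports -E (GNU sed and the BSD/macOS sed both do), and the function name extract_fields is a hypothetical wrapper introduced here for illustration, not part of the original answer.

```shell
#!/bin/sh
# extract_fields (hypothetical name): reads lines on stdin and applies
# the same four sed substitutions as the solution above.
extract_fields() {
    sed -E '
        s/[[:space:]]*;[[:space:]]*/;/g
        s/^([^;]*)(;[^;]*){4}(;[^;]*).*/\1\3/
        s/[^;]$/&_/
        s/$/fs/
    '
}

# Sample input modelled on the question: one line with a non-empty
# sixth field and one line where that field is blank.
printf '%s\n' \
    '10|subpath ; xxx ; xxx ; xxx ; xxx ; substring ; xxx ; ......' \
    '18|subpath ; xxx ; xxx ; xxx ; xxx ;  ; xxx ; ......' |
    extract_fields
# prints:
# 10|subpath;substring_fs
# 18|subpath;fs
```

One thing to keep in mind: a line with fewer than six semicolon-separated fields does not match the second substitution, so it passes through whole and still gets '_fs' (or 'fs') appended. If such lines can occur in the file, they may need to be filtered out first.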