I have a large file and a list of specific strings. The output should contain neither the lines that match one of those strings nor the line immediately following each match. Two consecutive matches are impossible because of the structure of the file I want to filter. For example:
Specific lines:
'ggg'
'sss'
Input:
'ggg'
'123'
'rrr'
'321'
'sss'
'666'
Output:
'rrr'
'321'
A simple grep -v -A 1 does not work.
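Assumptions: a line matches only if the whole line is identical to one of the specific strings (quotes included); if matches occur on consecutive lines, the first non-matching line after them is skipped as well.
General approach (an awk script):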
Sample input file:
$ cat input
'ggg' # match/ignore and
'123' # ignore
'rrr'
'321'
'sss' # match/ignore and
'666' # ignore
'aaa' 'ggg' 'xxx'
'12345'
'xxx' # match/ignore and
'xxx' # match/ignore and
98352 # ignore
'xyz'
hello world
Sample set of lines to match on (and ignore):
$ cat lines
'ggg' # will not match on the line: 'aaa' 'ggg' 'xxx'
'sss'
rrr # will not match on 'rrr' because of the missing quotes
'xxx' # will match on consecutive lines and skip the next non-matching line
NOTE: the comments are for explanation only and do not exist in the actual files
One awk idea:
awk '
#### 1st file:
FNR==NR { a[$0]; next } # save line as index in array a[]
#### 2nd file:
$0 in a { skip=1; next } # if line is an index in array then set the "skip" flag and ignore this line
skip { skip=0; next } # if flag is set then clear flag and ignore this line
1 # otherwise print current line
' lines input
######
# or as a one-liner
awk 'FNR==NR {a[$0];next} $0 in a {skip=1;next} skip {skip=0;next} 1' lines input
This generates:
'rrr'
'321'
'aaa' 'ggg' 'xxx'
'12345'
'xyz'
hello world
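To reproduce this result end to end, here is a minimal sketch that recreates the two sample files (without the explanatory comments) in a scratch directory and runs the one-liner. The tmpdir and result variable names are only illustrative and not part of the original answer.

```shell
#!/bin/sh
# Recreate the sample files, minus the comments, in a scratch directory.
tmpdir=$(mktemp -d)

cat > "$tmpdir/lines" <<'EOF'
'ggg'
'sss'
rrr
'xxx'
EOF

cat > "$tmpdir/input" <<'EOF'
'ggg'
'123'
'rrr'
'321'
'sss'
'666'
'aaa' 'ggg' 'xxx'
'12345'
'xxx'
'xxx'
98352
'xyz'
hello world
EOF

# Same logic as the one-liner: the first file fills a[], the second file is filtered.
result=$(awk 'FNR==NR {a[$0];next} $0 in a {skip=1;next} skip {skip=0;next} 1' \
    "$tmpdir/lines" "$tmpdir/input")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the strings are stored as array indices and tested with $0 in a, only whole-line, character-for-character matches count, which is why 'rrr' and 'aaa' 'ggg' 'xxx' survive.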
NOTE: if assumptions are wrong and/or this does not work for OP's actual files then we'll need the question updated with a more representative set of data
OP has added a comment stating consecutive line matches cannot occur. This allows us to simplify the code a bit:
awk '
FNR==NR { a[$0]; next } # 1st file: save line as index in array a[]
$0 in a { getline; next } # 2nd file: if line is an index in array then get next line (and ignore) then skip to next input line otherwise ...
1 # print current line
' lines input
######
# or as a one-liner
awk 'FNR==NR {a[$0];next} $0 in a {getline;next} 1' lines input
If we remove one of the 'xxx' lines from the input file this will generate:
'rrr'
'321'
'aaa' 'ggg' 'xxx'
'12345'
'xyz'
hello world
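If the number of lines to drop after each match might change, a countdown counter is one possible alternative to the flag and getline versions. This is a sketch only, not part of the original answer: the function name filter_after_match and the variables n and c are hypothetical, and the data is the small example from the question.

```shell
#!/bin/sh
# Hypothetical helper: filter_after_match LINES_FILE INPUT_FILE N
# Drops every line of INPUT_FILE that equals a line of LINES_FILE,
# plus the N lines that follow it.
filter_after_match() {
    awk -v n="$3" '
    FNR==NR { a[$0]; next }   # 1st file: save each line as an index of a[]
    $0 in a { c = n + 1 }     # match: skip this line and the n lines after it
    c > 0   { c--; next }     # still counting down: drop the current line
    1                         # otherwise print the current line
    ' "$1" "$2"
}

# Small example from the question.
tmpdir=$(mktemp -d)
printf '%s\n' "'ggg'" "'sss'" > "$tmpdir/lines"
printf '%s\n' "'ggg'" "'123'" "'rrr'" "'321'" "'sss'" "'666'" > "$tmpdir/input"

result2=$(filter_after_match "$tmpdir/lines" "$tmpdir/input" 1)
printf '%s\n' "$result2"
rm -rf "$tmpdir"
```

With n set to 1 this should print only 'rrr' and '321' for the question's example. A match that occurs during the countdown resets the counter, so consecutive matches are handled as well.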