awkunix-text-processing

Multiple input files - loop through one and check if string contained in second file - output paragraph


I am trying to filter a text file based on a second file. The first file contains paragraphs like:

$ cat paragraphs.txt
# ::id 1
# ::snt what is an example of a 2-step garage album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (g / garage)
            :mod (s / step-01
                  :quant 2)))

# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

The second file contains a list of strings like this:

$ cat list.txt
# ::snt what is an example of a abwe album
# ::snt what is an example of a acid techno album

I now want to filter the first file and keep only the paragraphs whose snt line is contained in the second file. For the short example above, the output file would look like this (paragraphs separated by an empty line):

$ cat filtered.txt
# ::id 2
# ::snt what is an example of a abwe album
(e / exemplify-01
      :arg0 (a / amr-unknown)
      :arg1 (a2 / album
            :mod (p / person
                  :name (n / name
                        :op1 "abwe"))))

So I tried to loop through the second file and used awk to print the matching paragraphs, but apparently the check does not work (all paragraphs are printed), and the paragraphs appear multiple times in the resulting file. Also, the loop does not terminate... I tried this command:

while read line; do awk -v x=$line -v RS= '/x/' paragraphs.txt ; done < list.txt >> filtered.txt

I also tried this plain awk script:

awk -v RS='\n\n' -v FS='\n' -v ORS='\n\n' 'NR==FNR{a[$1];next}{for(i in a)if(index($0,i)) print}' list.txt paragraphs.txt > filtered.txt

But it only picks up the first line of the list.txt file.

Therefore, I need your help... :-)


UPDATE 1: from comments made by OP:


UPDATE 2: after trying the solutions on the files described in the first update (timings from the 4th run):

fastest command:

awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
time: 8,71s user 0,35s system 99% cpu 9,114 total

second fastest command:

awk 'NR == FNR { a[$0]; next }/^$/ { if (snt in a) print rec; rec = snt = ""; next }/^# ::snt / { snt = $0 }{ rec = rec $0 "\n" }' list.txt paragraphs.txt
time: 14,17s user 0,35s system 99% cpu 14,648 total

third fastest command:

awk 'FNR==NR { if (NF) a[$0]; next }/^$/    { if (keep_para) print para; keep_para=0; para=sep=""}$0 in a { keep_para=1 }{ para=para $0 sep; sep=ORS }END{ if (keep_para) print para }' list.txt paragraphs.txt
time: 15,33s user 0,35s system 99% cpu 15,745 total

Solution

  • Using any awk:

    $ awk -F'\n' 'NR==FNR{list[$0]; next} $2 in list' list.txt RS= ORS='\n\n' paragraphs.txt
    # ::id 2
    # ::snt what is an example of a abwe album
    (e / exemplify-01
          :arg0 (a / amr-unknown)
          :arg1 (a2 / album
                :mod (p / person
                      :name (n / name
                            :op1 "abwe"))))
    

    I'm setting RS and ORS for the 2nd file only, as that's the one we want to read and print in paragraph mode. I'm setting FS for all input files, though, which also makes reading the first file a bit more efficient, as awk then won't waste time splitting each line into fields.

    The main problem with your awk script is that you were setting RS and ORS for all input files instead of only for the second one. Also note that RS='\n\n' requires a version of awk that supports multi-char RS, while RS='' will work in any awk; see https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line.
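
As a rough illustration of that point, the sketch below counts how many records awk sees in list.txt when RS is set with -v (which applies to every file) compared with RS given as an operand after the file name (which applies only to later files). The temporary directory and the variable names all_files and later_only are assumptions made for the demonstration.

```shell
# Sketch: the position of RS= on the command line decides which files it affects.
# list.txt here is a throwaway copy of the two-line example from the question.
tmp=$(mktemp -d)
printf '%s\n' \
    '# ::snt what is an example of a abwe album' \
    '# ::snt what is an example of a acid techno album' > "$tmp/list.txt"

# -v RS= is applied before any input is read, so list.txt is read in
# paragraph mode; it has no blank lines, so the whole file is one record.
all_files=$(awk -v RS= 'END { print NR }' "$tmp/list.txt")

# RS= as an operand after list.txt takes effect only once list.txt has
# been read, so list.txt is still read one line per record.
later_only=$(awk 'END { print NR }' "$tmp/list.txt" RS=)

echo "records in list.txt with -v RS=:           $all_files"
echo "records in list.txt with RS= after the file: $later_only"
rm -rf "$tmp"
```

With the two-line list.txt this reports 1 record in the first case and 2 in the second. When the whole list is a single record, a[$1] stores only its first line.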

    Regarding the while read line; script in your question, see "Why is using a shell loop to process text considered bad practice?" for the issues with doing that. As for '/x/', see "Example of testing the contents of a shell variable as a regexp" in "How do I use shell variables in an awk script?".
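
To make the '/x/' remark concrete, here is a minimal sketch of three ways awk can treat a value passed in with -v. The sample input and the variable names (x, literal_x, as_regexp, as_string) are made up for the illustration and do not come from the question.

```shell
# Sketch: /x/ versus a dynamic regexp versus a plain substring test.
x='a.c'                                   # value passed from the shell; note the quoting below
input=$(printf '%s\n' 'abc' 'a.c' 'xyz')  # three sample lines

# /x/ is a regexp constant: it matches the letter "x" and ignores the variable.
literal_x=$(printf '%s\n' "$input" | awk -v x="$x" '/x/')

# $0 ~ x uses the variable's contents as a regexp, so "." matches any character.
as_regexp=$(printf '%s\n' "$input" | awk -v x="$x" '$0 ~ x')

# index($0, x) looks for the variable's contents as a literal substring.
as_string=$(printf '%s\n' "$input" | awk -v x="$x" 'index($0, x)')

printf '/x/ matched:\n%s\n' "$literal_x"
printf '$0 ~ x matched:\n%s\n' "$as_regexp"
printf 'index($0, x) matched:\n%s\n' "$as_string"
```

For matching whole snt lines that may contain regexp metacharacters, the substring or array-lookup style is likely the safer choice.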