Tags: bash, terminal, grep, cut

How to match patterns from one file against a specific column in another file using grep?


I have file1, which contains a single string per line. I want to check whether each string exists in the second column of file2. file2 contains two strings per line, separated by a single space, but some lines have leading spaces before the first column.

I want to use only grep and/or cut to perform the match and output matching lines from file2 to newFile.txt, ensuring whole word matching (-w).

I've tried

grep -wF -f file1 file2 > newFile.txt 

but because of the file size the command never seems to finish.

I've also tried

grep -wF -f <(cut -d ' ' -f 2 file2) | grep -wF -f - file2 > newFile.txt 

This only works for some lines in file2, because some lines have several leading spaces before the two strings, even though the strings themselves are separated by a single space.
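The effect of leading spaces on cut can be reproduced in isolation. The following is a minimal sketch that uses one indented sample line as assumed input; the variable names are illustrative only. It compares what `cut -d ' ' -f 2` and awk's default field splitting each treat as the second field.

```shell
# With -d ' ', cut treats every single space as a delimiter, so each leading
# space produces an empty field ahead of the real data.
line='     b bbb'

# Field 2 is the empty string between the first and second leading spaces.
cut_field=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# awk's default field splitting skips leading blanks, so $2 is the second word.
awk_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_field" "$awk_field"
```

Here cut reports an empty second field while awk reports bbb, which is why the cut-based pipeline only works for lines without leading spaces.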

File1:

 aaa
 bbb
 ccc

File2:

 a aaa (should match) 
     b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
 c cc (should not match) 

Question: How can I efficiently extract and match whole words in the second column of file2 while handling inconsistent leading spaces? I prefer using grep and/or cut, but I'm open to small modifications.


Solution

  • Asking for help to do this efficiently with grep and cut is like asking for help constructing a garden fence with a kitchen fork and a paperclip. They're simply not the right tools for the job and so they cannot be used efficiently for this, nor can they be used robustly (or portably) without adding yet more tools to the mix to help them out. An awk-only solution, by contrast, would be trivial, efficient, and portable, e.g. the following will work using any POSIX awk:

    $ awk 'NR == FNR{ tgts[$1]; next } $2 in tgts' file1 file2
     a aaa (should match)
         b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
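To try the one-liner end to end, here is a minimal sketch that recreates data modelled on the question in a scratch directory and writes the matches to newFile.txt, as the question asks. The scratch directory and the shortened sample lines (annotations dropped) are assumptions for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data modelled on the question (annotations dropped for brevity).
printf '%s\n' 'aaa' 'bbb' 'ccc' > file1
printf '%s\n' ' a aaa' '     b bbb' ' c cc' > file2

# First pass (NR == FNR) stores each file1 string as an array key;
# second pass prints file2 lines whose second field is one of those keys.
awk 'NR == FNR { tgts[$1]; next } $2 in tgts' file1 file2 > newFile.txt

cat newFile.txt
```

Each file2 line costs one array lookup instead of a test against every pattern, so the run time should grow roughly linearly with the combined size of the two files. Because the whole second field is compared, cc does not match ccc, which gives the whole-word behaviour that -w was meant to provide.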
    

    Original answer, written before I noticed that the OP said "I want to check if each string exists in the second column of file2. file2 contains two single space-separated strings per line" (I had assumed they wanted to match any "word" in file2):

    $ cat tst.awk
    NR == FNR {
        tgts[$1]
        next
    }
    {
        split($0, words, /[^[:alnum:]_]+/)
        for ( i in words ) {
            if ( words[i] in tgts ) {
                print
                next
            }
        }
    }
    

    $ awk -f tst.awk file1 file2
     a aaa (should match)
         b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
    

    If you consider more characters than just alphanumerics and _ to be part of a "word", change [^[:alnum:]_] to include them, e.g. if a "word" can contain . and - then change it to [^[:alnum:]_.-].
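As a rough illustration of that change, the sketch below runs the word-splitting variant with . and - added to the bracket expression. The file contents and the output name matches.txt are hypothetical, chosen only to show a target containing . and another containing -.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: targets that contain '.' and '-'.
printf '%s\n' 'foo.bar' 'x-y' > file1
printf '%s\n' 'id1 see foo.bar here' 'id2 only foo and bar' 'id3 (x-y)' > file2

# Same logic as tst.awk, but '.' and '-' now count as word characters,
# so "foo.bar" and "x-y" each survive the split as a single word.
awk '
NR == FNR { tgts[$1]; next }
{
    split($0, words, /[^[:alnum:]_.-]+/)
    for ( i in words ) {
        if ( words[i] in tgts ) { print; next }
    }
}
' file1 file2 > matches.txt

cat matches.txt
```

With this data the id1 and id3 lines are printed. The id2 line is not, because foo and bar on their own are not targets. Keeping - as the last character inside the brackets ensures it is read literally rather than as a range.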