bash sed grep comm

How can I append values to each line in file1 after using part of that line to index into file2 and lookup the value?


I basically have the following 2 files:

$ cat file1.txt
AB,12 34 56,2.4,256,,
CD,23 45 67,10.8,257,,
EF,34 56 78,0.6,258,,
GH,45 67 89,58.3,259,,
...
$ cat file2.txt
AB,12 34 56,2.4,36
XY,56 99 11,3.6,15
ZQ,12 36 89,5.9,0
EF,34 56 78,0.6,99
GH,45 67 89,58.3,79
...

And for every line in file1.txt, I'd like to use the first 3 fields as an index in file2.txt, grab the corresponding last field, and place it into file1.txt like so:

$ cat newfile.txt
AB,12 34 56,2.4,256,36,
CD,23 45 67,10.8,257,,
EF,34 56 78,0.6,258,99,
GH,45 67 89,58.3,259,79,

There is no guarantee that each line in file1 will appear in file2, and vice versa, and for such cases empty fields shown above in newfile.txt are fine.

In my first attempt, I read each line of file1 in a while read loop and grepped for the matching line in file2. It worked, but it was far too slow: file1 and file2 have hundreds of thousands of lines each.

Is there any way I can use sed to use the first 3 fields of each line from file1 as an index into file2, lookup the value I need, and append it to that line in file1? And do so without reading file1 line by line?

Any help is appreciated.


Solution

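  • Using join and sed (for some pre- and post-processing), and assuming the | character doesn't appear in either file. Note that the lines of newfile.txt come out in sorted order rather than in file1.txt's original order.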

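    # Turn the 3rd comma into '|' so fields 1-3 become a single join key,
    # then sort on that key alone, which is the order join expects.
    # The final sed drops the empty field before the looked-up value, restores
    # the key's comma, and makes sure every line ends with a comma.
    join -a1 -t'|' \
        <(sed 's/,/|/3' file1.txt | sort -t'|' -k1,1) \
        <(sed 's/,/|/3' file2.txt | sort -t'|' -k1,1) |
        sed 's/,|//; s/|/,/; s/[^,]$/&,/' > newfile.txt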
    

    (Tested with the input given in the question)
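
    To see why the final sed expression works, it can help to look at what each stage produces. The following is only an illustrative sketch, not part of the tested answer: it runs the same commands on two lines from each sample file, which are already in sorted order, and uses a scratch directory from mktemp (an assumption made purely for the demo).

    ```shell
    # Scratch directory for the demo (an assumption, not part of the answer).
    tmp=$(mktemp -d)

    # Stage 1: the 3rd comma becomes '|', so fields 1-3 form a single join key.
    printf '%s\n' 'AB,12 34 56,2.4,256,,' 'CD,23 45 67,10.8,257,,' |
        sed 's/,/|/3' > "$tmp/f1"
    printf '%s\n' 'AB,12 34 56,2.4,36' 'XY,56 99 11,3.6,15' |
        sed 's/,/|/3' > "$tmp/f2"

    # Stage 2: join on the key; -a1 keeps file1 lines that have no match.
    joined=$(join -a1 -t'|' "$tmp/f1" "$tmp/f2")
    printf '%s\n' "$joined"
    # AB,12 34 56,2.4|256,,|36
    # CD,23 45 67,10.8|257,,

    # Stage 3: 's/,|//' removes the ',|' in front of an appended value,
    # 's/|/,/' restores the comma inside the key, and the last expression
    # adds a trailing comma to lines that no longer end with one.
    final=$(printf '%s\n' "$joined" | sed 's/,|//; s/|/,/; s/[^,]$/&,/')
    printf '%s\n' "$final"
    # AB,12 34 56,2.4,256,36,
    # CD,23 45 67,10.8,257,,

    rm -rf "$tmp"
    ```

    The unmatched CD line never contains ',|', so only the key's comma is restored and its two trailing empty fields are left as they were.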

    It could also be done in plain bash (version 4.0 or later) using associative arrays, but I doubt it would be efficient on files of this size. For example:

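    #!/bin/bash
    
    # Maps "field1,field2,field3," to the rest of the line (the value to append).
    declare -A tail
    
    # Load every line of file2.txt into the map.
    while IFS= read -r line; do
        if [[ $line =~ ([^,]*,){3} ]]; then
            tail[${BASH_REMATCH[0]}]=${line#"${BASH_REMATCH[0]}"}
        fi
    done < file2.txt
    
    # For each line of file1.txt with a known key, replace the final comma
    # with "value,"; all other lines are copied unchanged.
    while IFS= read -r line; do
        if [[ $line =~ ([^,]*,){3} ]] && [[ -n ${tail[${BASH_REMATCH[0]}]} ]]; then
            printf '%s%s\n' "${line%?}" "${tail[${BASH_REMATCH[0]}]},"
        else
            printf '%s\n' "$line"
        fi
    done < file1.txt > newfile.txt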
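
    The same hash-lookup idea can probably be expressed more compactly in awk. This is a swapped-in alternative, not something from the answer above, and a minimal sketch only: the array name val and the variable key are illustrative, it assumes the looked-up value always belongs in field 5 of file1.txt, and the sample files are recreated in a scratch directory so the snippet runs on its own.

    ```shell
    # Recreate the sample data from the question in a scratch directory.
    cd "$(mktemp -d)" || exit 1
    printf '%s\n' 'AB,12 34 56,2.4,256,,' 'CD,23 45 67,10.8,257,,' \
        'EF,34 56 78,0.6,258,,' 'GH,45 67 89,58.3,259,,' > file1.txt
    printf '%s\n' 'AB,12 34 56,2.4,36' 'XY,56 99 11,3.6,15' 'ZQ,12 36 89,5.9,0' \
        'EF,34 56 78,0.6,99' 'GH,45 67 89,58.3,79' > file2.txt

    awk -F, -v OFS=, '
        # First file on the command line (file2.txt): store the last field
        # under a key built from fields 1-3.
        NR == FNR { val[$1 FS $2 FS $3] = $NF; next }
        # Second file (file1.txt): fill in field 5 when the key is known,
        # then print the line, modified or not.
        { key = $1 FS $2 FS $3; if (key in val) $5 = val[key]; print }
    ' file2.txt file1.txt > newfile.txt

    cat newfile.txt
    # AB,12 34 56,2.4,256,36,
    # CD,23 45 67,10.8,257,,
    # EF,34 56 78,0.6,258,99,
    # GH,45 67 89,58.3,259,79,
    ```

    Unlike the join pipeline, this needs no sorting and keeps file1.txt's original line order, at the cost of holding one entry per file2.txt line in memory.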