bashawkpastecut

Efficient way to join the nth column of all files in a directory?


A for loop is way to slow. The files have 500k lines. I want to join specifically the 4th column of all files. Appending column after column to the right.

The columns in each file are separated by tab.

col1 col2 col3 col4 col5
a 0 0 -1 0.001
b 1 0  2 0.004
c 2 0 3 0

col1 col2 col3 col4 col5
c 2 0 -9 0.004
s 1 0  5 0.002
d 3 0 3 0.4

col1 col2 col3 col4 col5
r 2 1 0 0.4
j 1 1 1 0.2
r 3 1 2 0.1

I want:

file1 file2 file3
-1 -9 0
2 5 1
3 3 2

I tried first converting to .csv:

for file in $(ls) do awk '{$1=$1}1' OFS=',' ${file} > ${file}.csv done

And then doing this:

eval paste -d, $(printf "<(cut -d, -f4 %s) " *.csv)

But I get this error: paste: /dev/fd/19: Too many open files

I have to join 400 files of 500k lines each.


Solution

  • Your OS doesn't allow you to paste that many files in one go. You'll have to break them up into smaller batches. Here's how to simply do one at a time.

    for file in *.csv; do
        if [ -e tempfile ]; then
            paste -d, tempfile <(cut -d, -f4 "$file") >tempfile2
            mv tempfile2 tempfile
        else
            cut -d, -f4 "$file" >tempfile
        fi
    done
    mv tempfile result.csv
    

    As an aside, don't use ls in scripts. You want simply

    awk '{$1=$1}1' OFS=',' * > ${file}.csv
    

    ... but there is no reason to separately convert each file into CSV. You could fold both operations into one;

    rm tempfile
    for file in *; do
        case $file in tempfile | tempfile2 | result.csv) continue;; esac
        if [ -e tempfile ]; then
            paste -d, tempfile <(awk '{print $4}' "$file") >tempfile2
            mv tempfile2 tempfile
        else
            awk '{ print $4 }' "$file" >tempfile
        fi
    done
    mv tempfile result.csv