A for loop is way too slow. The files have 500k lines each. I want to join specifically the 4th column of all files, appending column after column to the right.
The columns in each file are separated by tab.
col1 col2 col3 col4 col5
a 0 0 -1 0.001
b 1 0 2 0.004
c 2 0 3 0
col1 col2 col3 col4 col5
c 2 0 -9 0.004
s 1 0 5 0.002
d 3 0 3 0.4
col1 col2 col3 col4 col5
r 2 1 0 0.4
j 1 1 1 0.2
r 3 1 2 0.1
I want:
file1 file2 file3
-1 -9 0
2 5 1
3 3 2
I tried first converting to .csv:
for file in $(ls); do awk '{$1=$1}1' OFS=',' "${file}" > "${file}.csv"; done
And then doing this:
eval paste -d, $(printf "<(cut -d, -f4 %s) " *.csv)
But I get this error: paste: /dev/fd/19: Too many open files
I have to join 400 files of 500k lines each.
Your OS doesn't allow paste to open
that many files in one go: each <(...) process substitution consumes a file descriptor, and you ran into the per-process limit. You'll have to break the files up into smaller batches. Here's how to simply do one at a time.
for file in *.csv; do
if [ -e tempfile ]; then
paste -d, tempfile <(cut -d, -f4 "$file") >tempfile2
mv tempfile2 tempfile
else
cut -d, -f4 "$file" >tempfile
fi
done
mv tempfile result.csv
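Alternatively, the one-shot paste from the question may work as-is if the shell's open-file limit can be raised above the number of files; whether that's allowed depends on the hard limit your OS sets. A quick check (a sketch, not guaranteed to succeed on every system):

```shell
# The "Too many open files" error is the per-process file-descriptor limit;
# each <(...) process substitution holds one descriptor open.
ulimit -n                                  # show the current soft limit
# If the hard limit allows, raise it above 400 files plus some overhead:
ulimit -n 1024 2>/dev/null || echo "cannot raise limit"
ulimit -n                                  # confirm the new limit
```

If the limit cannot be raised, fall back to the batching loop above.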
As an aside, don't use ls
in scripts. You want simply
for file in *; do awk '{$1=$1}1' OFS=',' "$file" > "$file.csv"; done
... but there is no reason to separately convert each file to CSV. You can fold both operations into one:
rm -f tempfile
for file in *; do
case $file in tempfile | tempfile2 | result.csv) continue;; esac
if [ -e tempfile ]; then
paste -d, tempfile <(awk '{print $4}' "$file") >tempfile2
mv tempfile2 tempfile
else
awk '{print $4}' "$file" >tempfile
fi
done
mv tempfile result.csv