I have a couple of large files (~1 GB each) with this structure:
fooA iug9wa
fooA lauie
fooA nwgoieb
fooB wilgb
fooB rqgebepu
fooB ifbqeiu
...
fooN ibfiygb
fooN yvsiy
fooN aeviu
I would like to replace, in the shell, each fooX (which can contain letters, numbers, "." and "_") with a sequential number from 1 to N. All the fooX names are listed in foo.list.
I've used:
nfoos=$(wc -l < foo.list)
for i in $(seq 1 $nfoos)
do
  currentfoo=$(sed "${i}q;d" foo.list)
  sed -i "s/${currentfoo}/$i/g" file1
  sed -i "s/${currentfoo}/$i/g" file2
  sed -i "s/${currentfoo}/$i/g" filen
done
However, with large files it's been taking forever.
Since each consecutive fooX always appears later in the files than foo(X-1), I thought of making sed search only the part of fileX after the last match of foo(X-1), so that with each foo there is less left to search.
I've been trying to use labels and some multiline approaches, but the syntax keeps beating me here.
Does anyone know how to make it work? (It doesn't necessarily have to use sed, but it would be great if it worked in basic shell in Bash.)
Appreciate any help. And if you do, please explain each function/option/variable used so that I can figure out where I had been messing up.
You can use awk.
The first part of the awk command below fills the array a, mapping each name in foo.list to its line number; the second part replaces the first field of each data line with that number.
awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list file1
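Since the question asks for each piece to be explained, here is the same command spread over several lines with comments. This is only a sketch: the scratch directory, the sample contents and the variable names tmp and result are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory
tmp=$(mktemp -d)
printf '%s\n' fooA fooB fooN > "$tmp/foo.list"
printf '%s\n' 'fooA iug9wa' 'fooA lauie' 'fooB wilgb' 'fooN aeviu' > "$tmp/file1"

result=$(awk '
    # NR counts lines across all input files, FNR restarts at 1 for each file,
    # so NR == FNR is true only while the first file (foo.list) is being read.
    NR == FNR { a[$1] = NR; next }    # store name -> line number, skip to next line
    # Data file: if field 1 is a known name, replace it with its number and print.
    $1 in a   { $1 = a[$1]; print }
' "$tmp/foo.list" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each file is read once, instead of once per name as in the sed loop. Note that assigning to $1 makes awk rebuild the line with single spaces between fields, and that only lines whose first field is listed in foo.list are printed.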
If the output looks right, you can loop over your files:
for f in file1 file2 filen; do
awk 'NR==FNR { a[$1]=NR; next} $1 in a{$1=a[$1]; print}' foo.list "${f}" > "${f}.tmp" &&
mv "${f}.tmp" "${f}"
done
The && makes sure the new file only replaces the original file when awk exited successfully.
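One thing to be aware of: the command above prints only lines whose first word is in foo.list, so any other line is dropped from the rewritten file. If that could happen with your data, a possible variant is to replace the name where it is known and then print every line. The function name renumber and the sample data below are hypothetical, just to show the idea.

```shell
# renumber LIST FILE...: hypothetical helper, rewrites each FILE in place
renumber() {
    list=$1
    shift
    for f in "$@"; do
        awk '
            NR == FNR { a[$1] = NR; next }   # first file: name -> line number
            $1 in a   { $1 = a[$1] }         # known name: replace it
            { print }                        # print every line, changed or not
        ' "$list" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Small demonstration on made-up data
tmp=$(mktemp -d)
printf '%s\n' fooA fooB > "$tmp/foo.list"
printf '%s\n' 'fooA x1' 'other y2' 'fooB z3' > "$tmp/file1"
renumber "$tmp/foo.list" "$tmp/file1"
out=$(cat "$tmp/file1")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Here the line starting with "other" is kept as it is, while fooA and fooB become 1 and 2.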