
Deduplicating a text file: keeping the last occurrence in one output file and moving the others to another output file


I have a file with duplicate records (duplicates are identified by the first column). I want to keep only the last occurrence of each duplicated record in one file and move all the other duplicates to another file.

File : input

foo j
bar bn
bar b
bar bn
bar bn
bar bn
kkk hh
fjk ff
foo jj
xxx tt
kkk hh

I have used the following awk statement to keep the last occurrence:

awk '{line=$0; x[$1]=line;} END{ for (key in x) print x[key];}' input > output

File : output

foo jj
xxx tt
fjk ff
kkk hh
bar bn

How can I move the repeating records to another file, leaving only the last occurrence? For example, foo j should be moved to another file, say d_output, while foo jj stays in the output file.


Solution

  • Another option you could try, keeping the order by reading the input file twice:

    awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=dups file file
    

    output:

    bar bn
    fjk ff
    foo jj
    xxx tt
    kkk hh
    

    Duplicates:

    $ cat dups
    foo j
    bar bn
    bar b
    bar bn
    bar bn
    kkk hh
    
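    To make the one-liner easier to follow, here is an expanded, commented version of the same two-pass logic, run against the sample input from the question (written out as a standalone script for readability):

    ```shell
    # Recreate the sample input from the question.
    cat > input <<'EOF'
    foo j
    bar bn
    bar b
    bar bn
    bar bn
    bar bn
    kkk hh
    fjk ff
    foo jj
    xxx tt
    kkk hh
    EOF

    awk '
        # First pass (NR==FNR is only true while reading the first copy
        # of the file): remember the overall line number of the LAST
        # occurrence of each key in column 1.
        NR == FNR { A[$1] = NR; next }

        # Second pass (FNR restarts from 1): if this line is not the
        # last occurrence of its key, write it to the duplicates file
        # and skip to the next line.
        A[$1] != FNR { print > f; next }

        # Otherwise this IS the last occurrence: the bare "1" pattern
        # prints the line to standard output.
        1
    ' f=dups input input > output
    ```

    Because the decision is made per input line, the original order of the surviving lines is preserved, unlike the `for (key in x)` loop in the question, which iterates over array keys in an unspecified order.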

@Sudo_O @WilliamPursell @user2018441. Sudo_O, thank you for the performance test. I tried to reproduce it on my system, but tac is not available there, so I tested with Kent's version and mine; I could not reproduce those differences on my system.

Update: I tested with Sudo_O's version using cat instead of tac. Note that on a system that does have tac, there was a difference of 0.2 seconds between tac and cat when outputting to /dev/null (see the bottom of this post).

    I got:

    Sudo_O
    $ time cat <(seq 1 1000000) | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'
    
    real    0m1.491s
    user    0m1.307s
    sys     0m0.415s
    
    kent
    $ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <(seq 1 1000000) > /dev/null
    
    real    0m1.238s
    user    0m1.421s
    sys     0m0.038s
    
    scrutinizer
    $ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <(seq 1 1000000) <(seq 1 1000000) > /dev/null
    
    real    0m1.422s
    user    0m1.778s
    sys     0m0.078s
    

    --

When using a file instead of the seq process substitution, I got:

    Sudo_O
    $ time cat <infile | awk 'a[$1]++{print $0 > "/dev/null";next}{print $0 > "/dev/null"}'
    
    real    0m1.519s
    user    0m1.148s
    sys     0m0.372s
    
    
    kent
    $ time awk '$1 in a{print a[$1]>"/dev/null"}{a[$1]=$0}END{for(x in a)print a[x]}' <infile > /dev/null
    
    real    0m1.267s
    user    0m1.227s
    sys     0m0.037s
    
    scrutinizer
    $ time awk 'NR==FNR{A[$1]=NR; next} A[$1]!=FNR{print>f; next}1' f=/dev/null <infile <infile > /dev/null
    
    real    0m0.737s
    user    0m0.707s
    sys     0m0.025s
    

This is probably due to caching effects, which would also be present for larger files. Creating the infile took:

    $ time seq 1 1000000 > infile
    
    real    0m0.224s
    user    0m0.213s
    sys     0m0.010s
    

    Tested on a different system:

    $ time cat <(seq 1 1000000) > /dev/null
    
    real    0m0.764s
    user    0m0.719s
    sys     0m0.031s
    $ time tac <(seq 1 1000000) > /dev/null
    
    real    0m1.011s
    user    0m0.820s
    sys     0m0.082s