bash awk

How to print the unique field names in the first line of many files?


The first line of each file contains field names, and the same name may appear in more than one file. I want to print each unique field name only once. Here's what I tried:

In a Bash script, files_and_folders.sh, I entered this:

#!/bin/bash
for file in **/*.TXT ; do
   awk 'NR == 1 { for (i=1; i<=NF; i++) if (!seen[$i]) seen[$i] = 1} END { for (idx in seen) printf ("%s\n",idx) }' "${file}"
done

The script ran successfully, but the output contains duplicates:

AB_CODE
ACFT_CODE
AC_TYPE
ADD_INFO
AKA
ALT
ALT
ALT
ALT
ALT
ALT
ALT
ALT1_DESC
ALT2_DESC
ALT3_DESC

How can I modify the AWK program (in the Bash script) to eliminate the duplicates?


Solution

  • Don't run a loop in bash that starts a new awk process for each file: the associative array seen is re-initialized on every invocation, and each invocation's END block prints that one file's field names on its own, so duplicates appear across files.

    Instead, process all the files in a single awk invocation (note that ** recursion requires shopt -s globstar in bash):

    awk 'FNR == 1 {              # first line of each file (FNR resets per file)
       for (i=1; i<=NF; ++i) {
          uniques[$i]            # referencing a key is enough to create it
       }
    }
    END {
       for (i in uniques)        # iteration order is unspecified
          print i
    }' **/*.TXT
    
    AC_TYPE
    AKA
    ALT
    ADD_INFO
    AB_CODE
    ALT1_DESC
    ALT2_DESC
    ALT3_DESC
    ACFT_CODE
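Because for (i in uniques) iterates in an unspecified order, you can pipe the awk output through sort for deterministic results. Alternatively, here is a sketch that avoids awk entirely, assuming GNU head (which supports -q to suppress the per-file "==>" headers); the helper name unique_fields is hypothetical:

```shell
#!/bin/bash
shopt -s globstar   # enable ** recursive globbing

# Hypothetical helper: print the unique header fields of every *.TXT under "$1".
# head -qn 1 emits only line 1 of each file with no file-name headers,
# tr splits whitespace-separated fields onto separate lines,
# and sort -u deduplicates (and sorts) the result.
unique_fields() {
   head -qn 1 "$1"/**/*.TXT | tr -s ' \t' '\n' | sort -u
}
```

This assumes fields are separated by spaces or tabs, matching awk's default field splitting.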