python bash grep xargs unix-head

List unique headers recursively across files matching a pattern


I want the unique headers for a bunch of csv files whose names contain ABC or XYZ.

Within a single directory, I can sort of get what I need with:

head -n 1 *.csv > first.txt
cat -A first.txt | tr ',' '\n' | sort | uniq

Of course, this isn't recursive and it includes all csv files, not just the ones I want.

If I do the following, I get the recursive search, but also a bunch of junk:

find . -type f -name "ABC*.csv" -o -name "XYZ*.csv" | xargs head -n 1 | tr ',' '\n' | sort | uniq

I'm on Windows 10 with MinGW64. I suppose I could use Python, but I feel so close to having it!


Solution

  • When head is given multiple files (which is what xargs does), it prints a header with each file's name before that file's output.

    Using find's -exec action, you can obtain the desired result, provided you force the precedence by grouping the tests as \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) so that -exec applies to both patterns. uniq is not required here either, since sort -u removes duplicates on its own. As a side note, you'd better enclose literal strings in single quotes.

    find . -type f \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) -exec head -n 1 {} \; | tr ',' '\n' | sort -u
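
To see why the grouping matters, here is a minimal sketch that runs both forms against a scratch directory. The file names and contents are made up for illustration, and it assumes mktemp -d is available (it is in GNU coreutils).

```shell
# Scratch directory with hypothetical sample files (names are illustrative only).
demo=$(mktemp -d)
printf 'id,name\n1,a\n'  > "$demo/ABC_1.csv"
printf 'id,price\n2,9\n' > "$demo/XYZ_1.csv"
printf 'skip,me\n'       > "$demo/other.csv"

# Without grouping, -exec binds only to the second -name test,
# so ABC_1.csv matches the first branch but is never passed to head.
ungrouped=$(find "$demo" -type f -name 'ABC*.csv' -o -name 'XYZ*.csv' -exec head -n 1 {} \;)

# With \( ... \), files matching either pattern reach -exec.
grouped=$(find "$demo" -type f \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) -exec head -n 1 {} \; | sort)

printf 'ungrouped:\n%s\n' "$ungrouped"
printf 'grouped:\n%s\n' "$grouped"

rm -rf "$demo"
```

On this sample data, the ungrouped form prints only the XYZ header, while the grouped form prints both headers.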
    

    If your files have DOS line endings, the above command will not work, though, because the last field of each header keeps a trailing carriage return. In that case, you should delete the carriage returns using tr or sed:

    find . -type f \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) -exec head -n 1 {} \; | tr -d '\r' | tr ',' '\n' | sort -u
    # or
    find . -type f \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) -exec head -n 1 {} \; | sed 's/\r//; s/,/\n/g' | sort -u
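
The effect of the carriage returns can be checked with a small sketch along these lines. The sample files are hypothetical, and headers is just a helper function defined here for the demonstration, not part of any tool.

```shell
# Scratch directory with hypothetical CRLF-terminated sample files.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'id,name\r\n1,a\r\n' > "$demo/ABC_1.csv"
printf 'name,id\r\nb,2\r\n' > "$demo/sub/XYZ_1.csv"

# Helper (defined for this demo only): first line of every matching file, recursively.
headers() {
  find "$demo" -type f \( -name 'ABC*.csv' -o -name 'XYZ*.csv' \) -exec head -n 1 {} \;
}

# Without stripping \r, "name" and "name\r" (and "id" and "id\r") count as different fields.
# The tr -d ' ' only removes padding that some wc implementations add.
raw_count=$(headers | tr ',' '\n' | sort -u | wc -l | tr -d ' ')

# With \r removed first, duplicates collapse as expected.
clean=$(headers | tr -d '\r' | tr ',' '\n' | sort -u)

echo "unique fields without stripping CR: $raw_count"
printf 'unique fields after stripping CR:\n%s\n' "$clean"

rm -rf "$demo"
```

On this sample data, the first pipeline reports four distinct fields even though there are only two column names; after tr -d '\r' the list shrinks to id and name.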