bash awk sed xargs

Bash: How to extract parent directory of 3 files at a time


I have file names like this:

/foo/bar/bazz/JMA01023D_E07/JMA01023D_E07_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01023D_E08/JMA01023D_E08_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01023D_E09/JMA01023D_E09_EKDL230054768-1A_22HFKNLT3_L4_1.fq.gz
/foo/bar/bazz/JMA01022D_E06/JMA01022D_E06_EKDL230054767-1A_22HF2WLT3_L7_1.fq.gz
/foo/bar/bazz/JMA01001D_A01/JMA01001D_A01_EKDL230054750-1A_222T7MLT4_L1_1.fq.gz
/foo/bar/bazz/JMA01001D_A02/JMA01001D_A02_EKDL230054750-1A_222T7MLT4_L1_1.fq.gz

Every three consecutive files (sorted alphabetically by full path) form a triplet. I would like to get the parent folder names, three files at a time.

So the desired output would be:

JMA01001D_A01 JMA01001D_A02 JMA01022D_E06
JMA01023D_E07 JMA01023D_E08 JMA01023D_E09

I tried something like this:

find "$@" -iname '*_1.fq.gz' | sort | xargs -I % -n3 sh -c echo % | sed -r 's/ *[^ ]*\/([^ ]+)\/([^ ]+)/\1 /g\'

And ideally, I would like to support spaces, so something with find -print0, sort -z and xargs -0 would be ideal. But I just can't seem to get it to work.

Could someone please help me untangle my brain? It doesn't have to use sed, something with dirname/basename or awk would be fine as well...


Solution

  • You can use awk to get the folder name and pipe that into xargs -n 3 to get the output with 3 items per line:

    ... | awk -F'/' '{print $(NF-1)}' | xargs -n 3
    

    So if I place your input in /tmp/foo and run the following:

    sort /tmp/foo | awk -F'/' '{print $(NF-1)}' | xargs -n 3
    

    The output is

    JMA01001D_A01 JMA01001D_A02 JMA01022D_E06
    JMA01023D_E07 JMA01023D_E08 JMA01023D_E09
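
The pipeline above splits on newlines and whitespace, so it would break on paths containing spaces. For the find -print0 / sort -z / xargs -0 variant mentioned in the question, something along these lines might work. It is only a sketch: it assumes GNU find, sort and xargs (for -print0, -z, -0 and -r), and the function name parent_triplets is made up for illustration.

```shell
# Hypothetical helper (the name is not from the question): print the parent
# directory name of every *_1.fq.gz file below the given paths, three per line.
# Assumes GNU find/sort/xargs for the NUL-delimited options.
parent_triplets() {
    find "$@" -iname '*_1.fq.gz' -print0 |
        sort -z |
        xargs -0 -r -n 3 sh -c '
            out=
            for p in "$@"; do
                d=${p%/*}                      # drop the file name
                out="$out${out:+ }${d##*/}"    # keep only the last directory component
            done
            printf "%s\n" "$out"
        ' sh
}
```

Called as parent_triplets /foo/bar/bazz, this should print one line per group of three files. The names are still joined with single spaces, so a directory name that itself contains a space would be ambiguous in the output; changing the separator in `${out:+ }` (for example to a tab) is one way around that.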