Tags: linux, bash, storage, disk, diskspace

Find subfolders using a lot of disk space


I'd love to write a bash script that helps me find opportunities for reducing disk usage.

The script would accept 2 arguments: a parent folder (which for me will usually be /apps/), and a threshold (such as "200M").

My current approach is not ideal (doesn't use a threshold and shows a lot of redundant output).

Currently I run cd /apps/ && du -aBM 2>/dev/null | sort -nr | head -n 15 and see output like:

8975M   .
1448M   ./delta
1387M   ./alpha
1350M   ./alpha/releases
1144M   ./bravo/releases
1144M   ./bravo
1137M   ./charlie
1117M   ./delta/releases
902M    ./alpha/releases/202210091311
871M    ./charlie/releases
796M    ./echo
794M    ./echo/releases
791M    ./alpha/releases/202210091311/node_modules
703M    ./scoreboard
684M    ./scoreboard/node_modules

I'd like the output to omit lines like:

8975M   .
1448M   ./delta
1387M   ./alpha
1350M   ./alpha/releases
1144M   ./bravo
1137M   ./charlie
902M    ./alpha/releases/202210091311
796M    ./echo
703M    ./scoreboard

because they wasted my attention: the output above already included subfolders of those folders that were themselves above the threshold I care about (200M).

These are the more interesting lines:

1144M   ./bravo/releases
1117M   ./delta/releases
871M    ./charlie/releases
794M    ./echo/releases
791M    ./alpha/releases/202210091311/node_modules
684M    ./scoreboard/node_modules

I don't think the du -aBM 2>/dev/null | sort -nr approach is the right starting place to achieve my actual goal, though.

In reality, the folders in my most recent example might not even be the deepest ones that qualify, i.e. the lowest-level subfolders that still meet the 200 MB threshold.

For example, maybe /echo/subfolder1 and /echo/subfolder2 are each 300M.

I have a cloud server with limited disk space and don't want to pay for more.


Solution

  • The directory hierarchy is a tree whose nodes are directories and files. The value of a file node is its size; the value of a directory node is the sum of the values of all its children. I believe the problem is to select the nodes whose value is larger than some threshold and whose children are all smaller than that threshold.

    Selecting the 15 largest of these is already covered by sort -nr | head -n 15, as in the question.


    Conveniently, du -a automatically provides a postordering of a depth-first search of the tree.

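    So a solution to the problem is to walk du's output in order: print each path that exceeds the threshold and has not been marked, then mark all of its ancestors so that they are skipped later.

    This works because in a postordering every descendant is listed before its ancestors, so an unmarked path that exceeds the threshold cannot have a child that also exceeds it.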
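
    The postordering claim is easy to check on a throwaway tree. The sketch below is only an illustration: postorder_demo is a hypothetical helper name, and it assumes mktemp -d and a du that supports -a are available (true for GNU coreutils).

```shell
# Hypothetical demo: build a tiny temporary tree, list it with du -a, clean up.
postorder_demo() {
    local demo_dir
    demo_dir=$(mktemp -d) || return 1
    mkdir -p "$demo_dir/a/b"
    printf 'x' > "$demo_dir/a/b/file"
    # Print only the path column; the sizes depend on the filesystem.
    (cd "$demo_dir" && du -a . | cut -f2-)
    rm -rf "$demo_dir"
}

postorder_demo
# ./a/b/file
# ./a/b
# ./a
# .
```

    Every entry appears before its parent, which is the property the marking approach relies on.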

    Assuming that paths neither start with whitespace nor contain newlines:

    du -aBM /apps/ 2>/dev/null |
    awk -F/ -v OFS=/ -v min=200 '
        $1+0 > min {
            raw = $0
            sub(/^[0-9]+M[ \t]+/,"") # strip size
            sub(/\/+$/,"") # strip trailing slashes
    
            if ($0 in seen) next
    
            print raw
    
            # mark ancestors as seen
            s = $1
            for (f=1; f<NF; s = s OFS $(++f))
                seen[s]
        }
    ' |
    sort -rn | head -n 15
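
    To see what the filter selects without touching a real disk, the same awk program can be fed canned lines. This is a sketch under stated assumptions: deepest_over and sample_du_lines are hypothetical helper names, and the sizes are made up (loosely based on the question), listed children-first the way du -a would list them.

```shell
# Same awk program as above, wrapped in a hypothetical helper that reads
# du-style lines ("<size>M<TAB><path>") from stdin.
deepest_over() {
    awk -F/ -v OFS=/ -v min="$1" '
        $1+0 > min {
            raw = $0
            sub(/^[0-9]+M[ \t]+/, "")   # strip size
            sub(/\/+$/, "")             # strip trailing slashes
            if ($0 in seen) next        # a descendant was already printed
            print raw
            s = $1                      # mark every ancestor as seen
            for (f = 1; f < NF; s = s OFS $(++f))
                seen[s]
        }
    '
}

# Made-up sizes in postorder (children before parents).
sample_du_lines() {
    printf '%s\t%s\n' \
        684M ./scoreboard/node_modules \
        703M ./scoreboard \
        310M ./echo/subfolder1 \
        300M ./echo/subfolder2 \
        150M ./echo/tmp \
        760M ./echo \
        1463M .
}

sample_du_lines | deepest_over 200
# 684M	./scoreboard/node_modules
# 310M	./echo/subfolder1
# 300M	./echo/subfolder2
```

    ./scoreboard, ./echo and . are dropped because a descendant of each was already printed, and ./echo/tmp is dropped because it is below the threshold.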
    

    The du -S answer usefully solves a slightly different, related problem, in which the value of a directory node is the sum of the sizes of only those children that are files, and only directory nodes are selected.

    You can also turn it into a function.

    Define a bash function like:

    du_top() {
        # https://stackoverflow.com/a/78716652/470749
        # Usage: du_top PATH MIN_SIZE NUM_RESULTS
        local path=$1
        local min_size=$2       # threshold in MiB, digits only (e.g. 200)
        local num_results=$3    # how many lines to show
        du -aBM "$path" 2>/dev/null |
        awk -F/ -v OFS=/ -v min="$min_size" '
            $1+0 > min {
                raw = $0
                sub(/^[0-9]+M[ \t]+/,"") # strip size
                sub(/\/+$/,"") # strip trailing slashes

                # skip paths that have a descendant already printed
                if ($0 in seen) next

                print raw

                # mark ancestors as seen
                s = $1
                for (f=1; f<NF; s = s OFS $(++f))
                    seen[s]
            }
        ' |
        sort -rn | head -n "$num_results"
    }
    

    After sourcing the file that defines it, run it like du_top /apps/ 200 15.
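
    The question asks for a threshold written like "200M", whereas the function expects a bare number of MiB. One possible way to bridge that is a small converter. This is only a sketch: threshold_to_mib is a hypothetical helper, it understands only the M and G suffixes, and it does no input validation.

```shell
# Hypothetical helper: turn "200M", "2G", or a bare number into whole MiB.
threshold_to_mib() {
    local value=$1
    case $value in
        *G) echo $(( ${value%G} * 1024 )) ;;   # GiB to MiB
        *M) echo "${value%M}" ;;               # already MiB, drop the suffix
        *)  echo "$value" ;;                   # assume a bare MiB count
    esac
}

threshold_to_mib 200M   # prints 200
threshold_to_mib 2G     # prints 2048
```

    With du_top defined as above, a call could then look like du_top /apps/ "$(threshold_to_mib 200M)" 15.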