We have a software package that performs tasks by assigning a job number to each batch of files. Batches can contain any number of files. The files are then stored in a directory structure similar to this:
/asc/array1/.storage/10/10297/10297-Low-res.m4a
...
/asc/array1/.storage/3/3814/3814-preview.jpg
The filename is generated automatically. The directory under .storage is the thousands part of the file number (the file number divided by 1000), so file 10297 lands under 10/ and file 3814 under 3/.
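For example, a file's directory can be reconstructed from its number with integer division (a quick sketch based on the layout above; the file number here is illustrative):

n=10297
printf '%s\n' "/asc/array1/.storage/$(( n / 1000 ))/$n/"
# prints /asc/array1/.storage/10/10297/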
There is also a database which associates the job number and the file number with the client in question. Running a SQL query, I can list out the job number, client and the full path to the files. Example:
213 sample-data /asc/array1/.storage/10/10297/10297-Low-res.m4a
...
214 client-abc /asc/array1/.storage/3/3814/3814-preview.jpg
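The report itself could be generated with something along these lines (the table and column names are hypothetical, and the client invocation assumes a MySQL-style database; only the three-column output format matters):

# Hypothetical schema: files joined to jobs and clients by id.
mysql -N -B mydb -e '
  SELECT j.job_num, c.client_name, f.full_path
  FROM files f
  JOIN jobs j ON j.id = f.job_id
  JOIN clients c ON c.id = j.client_id;
' > /tmp/pm_report.txt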
My task is to calculate the total storage being used per client. So I wrote a quick and dirty bash script that iterates over every row, runs du on the file, and adds the size to an associative array. I then plan to echo this out or produce a CSV file for ingest into PowerBI or some other tool. Is this the best way to handle this? Here is a copy of the script as it stands:
#!/bin/bash
declare -A clientArr
# Report columns: 1 == Job Num, 2 == Client, 3 == Path
while read -r _ client path; do
    if [ -f "$path" ]; then
        size=$(du -s "$path" | awk '{ print $1 }')  # size in 1 KiB blocks
        clientArr[$client]=$(( ${clientArr[$client]:-0} + size ))
    fi
done < /tmp/pm_report.txt

for key in "${!clientArr[@]}"; do
    echo "$key,${clientArr[$key]}"
done
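Note that du -s reports sizes in 1 KiB blocks by default, so the totals above are KiB. If exact byte counts are preferred, GNU du can report those with a one-line tweak (assuming GNU du):

size=$(du -s --block-size=1 "$path" | awk '{ print $1 }')  # bytes, not KiB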
Assuming GNU du (required for the --files0-from option), the following has no shell loops, calls du once, and iterates over the pm_report file twice:
file=/tmp/pm_report.txt
# Emit the path column NUL-delimited, size every file with a single du
# call, then join du's output back to the report by path (field
# splitting in both passes assumes paths contain no whitespace).
awk '{printf "%s\0", $3}' "$file" \
  | du -s --files0-from=- 2>/dev/null \
  | awk '
      # First input (du output): $1 = size in KiB, $2 = path.
      NR == FNR {du[$2] = $1; next}
      # Second input (the report): $2 = client, $3 = path.
      {client_du[$2] += du[$3]}
      END {
        OFS = "\t"
        for (client in client_du) print client, client_du[client]
      }
    ' - "$file"