So I have quite a few files whose cert counts I want to check by date, but the dates in the files sometimes overlap. For example, a file could look like this:
| issue_date | cert_id | serial |
|---|---|---|
| 2023-05-14 06:00:00 | 123 | abca234 |
| 2023-05-14 23:59:00 | 456 | qascmal |
| 2023-05-15 00:00:00 | 789 | acmaal |
| 2023-05-16 12:10:00 | 1231 | acmkla |
In the directory I have a bunch of files that may have two or more dates in the name, so there is no way of knowing from the names alone how many certs belong to each date. For example: 2023-05-15-2023-05-18.csv.gmz, 2023-05-18-2023-05-19-2023.csv.gmz, etc. Is there a quick and easy way to count the certs in these files based on the issue date above? Obviously issue_date isn't unique, but serial number/cert_id is, so I could use those in conjunction. What I'm trying to do is programmatically check counts for a given date, but I'm not sure querying these files will be efficient enough; maybe it's best to use an API at the source these files come from. But I thought I would ask.
So, the expected result: if I typed in 2023-05-14 as the given date I would get two entries, for the 15th I would get one entry, and for the 16th one as well.
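For the single-date lookup described above, a minimal sketch along these lines could work, assuming comma-separated columns issue_date,cert_id,serial (as in the sample) with a header row in each file, deduping on serial:

```bash
#!/bin/bash
# Count distinct serials issued on one given date. Scanning every
# archive, rather than trusting the date ranges in the file names,
# sidesteps the overlap problem entirely.
target="2023-05-14"   # the given date

for f in ./*.gz; do
  zcat "$f" | awk -F',' -v d="$target" \
    'NR > 1 && substr($1, 1, 10) == d {print $3}'   # skip header, match the date prefix
done | sort -u | wc -l
```

Run against the sample rows above, this would print 2 for 2023-05-14 and 1 each for the 15th and 16th.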
IMPORTANT UPDATE: I forgot to mention that certs at the end of one file sometimes carry over into the beginning of the next, so I will have to account for duplicates. Right now I have this:

```bash
zgrep -c -F '2023-05-11' *2023-05-11*.gz | awk -F: '{n+=$2} END {print n}'
```

but it doesn't account for that.
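A dedup-aware variant of that one-liner (under the same column assumptions as above) prints the serial field instead of counting matched lines, then collapses duplicates before counting:

```bash
# -h drops the "filename:" prefix so the serial column stays intact
zgrep -h -F '2023-05-11' *2023-05-11*.gz | awk -F',' '{print $3}' | sort -u | wc -l
```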
Coming back to this, I created a bash script that solves the overlap problem:
```bash
#!/bin/bash

# Record the start time
start_time=$(date +%s)

# Loop through all the compressed archives in the current directory and
# extract the serial column into a temporary file. Serial is the 3rd
# comma-separated column in the layout shown above; adjust -F and $3 if
# the real files differ.
for file in ./*.gz; do
  zcat "$file" | awk -F',' 'NR > 1 {print $3}' >> extracted_ids.txt   # NR > 1 skips the header row
done

# Sort and de-duplicate the IDs
sort -u extracted_ids.txt > unique_ids.txt

# Count the number of distinct unique IDs
num_unique_ids=$(wc -l < unique_ids.txt)

# Record the end time and calculate the time taken
end_time=$(date +%s)
duration=$((end_time - start_time))

# Display the result
echo "Number of distinct unique IDs: $num_unique_ids"
echo "Time taken: $duration seconds"

# Clean up temporary files
rm extracted_ids.txt unique_ids.txt
```
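If the goal is per-date counts rather than one grand total, the whole thing can collapse into a single awk pass that dedupes in memory, with no temporary files. A sketch under the same layout assumptions:

```bash
#!/bin/bash
# Print one line per issue date with its count of distinct serials.
# Assumes comma-separated columns: issue_date,cert_id,serial.
zcat ./*.gz | awk -F',' '
  /^issue_date/ { next }              # skip the header row of each file
  {
    day = substr($1, 1, 10)           # "2023-05-14 06:00:00" -> "2023-05-14"
    if (!($3 in seen)) {              # count each serial exactly once
      seen[$3] = 1
      count[day]++
    }
  }
  END { for (d in count) print d, count[d] }
' | sort
```

Because a carried-over serial keeps its original issue_date, attributing it to its first occurrence keeps the per-date totals correct.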