Tags: linux, csv, count, wc

Count number of occurrences in a Linux directory based on a date value


So I have quite a few files that I want to check the file count of based on date, but the thing is, dates sometimes overlap across files. For example, a file could look like this:

    issue_date           cert_id  serial
    2023-05-14 06:00:00  123      abca234
    2023-05-14 23:59:00  456      qascmal
    2023-05-15 00:00:00  789      acmaal
    2023-05-16 12:10:00  1231     acmkla

In the directory I have a bunch of files that may have two or more dates in the name, so there is no way of knowing how many certs belong to each date. For example: 2023-05-15-2023-05-18.csv.gmz, 2023-05-18-2023-05-19-2023.csv.gmz, etc. Is there a quick and easy way to count entries in these files based on the issue date above? Obviously issue_date isn't unique, but serial number/cert_id is, so I could use that in conjunction. What I'm trying to do is programmatically check counts for a given date, but I suspect querying these files directly may be too inefficient, and maybe it's better to use an API at the source these files come from. But I thought I would ask.

So if I typed in 2023-05-14 as the given date, the solution would return two entries; for the 15th it would return one entry, and for the 16th one as well.

IMPORTANT UPDATE: I forgot that certs at the end of one file sometimes carry over into the beginning of the next, so I will have to account for duplicates. Right now I have this, but it doesn't account for that fact:

    zgrep -c -F '2023-05-11' *2023-05-11*.gz | awk -F: '{n+=$2} END {print n}'
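
I suppose a duplicate-aware version would need to decompress the matching files and dedupe on the serial before counting. A minimal sketch of that idea, assuming the issue date starts the first comma-separated field and the serial sits in the fourth (adjust the column index to your actual layout):

    # Count distinct serials for one date across every file whose name contains it
    d='2023-05-11'
    zcat ./*"$d"*.gz | awk -F',' -v d="$d" 'index($1, d) == 1 {print $4}' | sort -u | wc -l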


Solution

  • Coming back to this to solve the overlap: I created a bash script that solves the problem by deduplicating on the serial before counting.

    
    #!/bin/bash
    
    # Record the start time
    start_time=$(date +%s)
    
    # Start from a clean temporary file, since the loop below appends to it
    : > extracted_ids.txt
    
    # Loop through all the compressed files in the current directory
    for file in ./*.gz; do
        # Extract the serial-number column from the CSV and append it to the temporary file
        # (adjust $4 if the serial sits in a different column in your files)
        zcat "$file" | awk -F',' '{print $4}' >> extracted_ids.txt
    done
    
    # Sort the extracted IDs, keeping only unique values
    sort -u extracted_ids.txt > unique_ids.txt
    
    # Count the number of distinct unique IDs
    num_unique_ids=$(wc -l < unique_ids.txt)
    
    # Record the end time
    end_time=$(date +%s)
    
    # Calculate the time taken
    duration=$((end_time - start_time))
    
    # Display the result
    echo "Number of distinct unique IDs: $num_unique_ids"
    echo "Time taken: $duration seconds"
    
    # Clean up temporary files
    rm extracted_ids.txt unique_ids.txt
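
  • A follow-up thought: since the goal was a count per issue date, the same dedup idea can produce a per-date breakdown in one pass, with no temporary files. This is a sketch under the same assumptions as above (issue_date is the first comma-separated column, with the date before the space, and the serial is the fourth); lines not starting with a digit, such as headers, are skipped.

    # One-pass per-date counts: dedupe on serial, then tally by the date part of issue_date
    zcat ./*.gz | awk -F',' '
        $1 ~ /^[0-9]/ && !seen[$4]++ {   # skip headers; count each serial only once
            split($1, a, " ")            # "2023-05-14 06:00:00" -> date, time
            count[a[1]]++
        }
        END { for (d in count) print d, count[d] }
    ' | sort

    Against the sample rows above, this would print 2023-05-14 2, 2023-05-15 1, and 2023-05-16 1. Note that a serial appearing in two files is counted only toward the first date it is seen with.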