Tags: python · pandas · filter · group-by · count

How to efficiently filter specific values from a dataset and group those filtered values based on their counts for a plot


I have a large dataset (5M+ records) of NYC crimes. I want to select 5 crime types from the column 'OFNS_DESC' and group them by the number of occurrences per month so that I can plot a time series, in a more efficient way than my current approach.

(screenshot of sample dataset omitted)

I currently have the code below. The issue is that it takes quite a while to run because each filter scans the entire dataset, and the grouping does the same. I want to make this process more efficient.

# convert ARREST_DATE format to YYYY/MM
arrest_ym = df["ARREST_DATE"].dt.to_period('M')

# filter out crime types
dngr_drug = df[df["OFNS_DESC"] == "DANGEROUS DRUGS"]
aslt3 = df[df["OFNS_DESC"] == "ASSAULT 3 & RELATED OFFENSES"]
pt_lrcy = df[df["OFNS_DESC"] == "PETIT LARCENY"]
flny_aslt = df[df["OFNS_DESC"] == "FELONY ASSAULT"]
dngr_wpns = df[df["OFNS_DESC"] == "DANGEROUS WEAPONS"]

# group and count 
dngr_drug_gc = dngr_drug.groupby(arrest_ym)["OFNS_DESC"].count()
aslt3_gc = aslt3.groupby(arrest_ym)["OFNS_DESC"].count()
pt_lrcy_gc = pt_lrcy.groupby(arrest_ym)["OFNS_DESC"].count()
flny_aslt_gc = flny_aslt.groupby(arrest_ym)["OFNS_DESC"].count()
dngr_wpns_gc = dngr_wpns.groupby(arrest_ym)["OFNS_DESC"].count()

Below is the code for my plot; I expect it will need to change along with the code above.

# multi-line plot
x_labels = dngr_drug_gc.index.astype(str)

plt.figure(figsize = (20, 10))
plt.plot(dngr_drug_gc.index.astype(str), dngr_drug_gc.values, 'b-', label = "Dangerous Drugs")
plt.plot(aslt3_gc.index.astype(str), aslt3_gc.values, 'g-', label = "Assault 3 & Related Offenses")
plt.plot(pt_lrcy_gc.index.astype(str), pt_lrcy_gc.values, 'r-', label = "Petit Larceny")
plt.plot(flny_aslt_gc.index.astype(str), flny_aslt_gc.values, 'c-', label = "Felony Assault")
plt.plot(dngr_wpns_gc.index.astype(str), dngr_wpns_gc.values, 'm-', label = "Dangerous Weapons")

plt.xticks(x_labels[::6], rotation=45)

plt.xlabel("Month")
plt.ylabel("Crime Count")
plt.title("Monthly Crime Frequency")
plt.legend(loc = "best")

My results are fine, but I just want to optimise the code. One approach I saw was the code below, but I don't know how to proceed from there.

crime_select = ["DANGEROUS DRUGS", "ASSAULT 3 & RELATED OFFENSES", "PETIT LARCENY", "FELONY ASSAULT", "DANGEROUS WEAPONS"]
filtered_crime = df[df["OFNS_DESC"].isin(crime_select)]

Please bear in mind that I am absolutely new to coding so any explanations would be very helpful. Thanks.

(screenshot of result plot omitted)


Solution

  • You are on the right path if you are already thinking about optimising your code. I must, however, point out that writing good-quality code comes at the cost of spending a lot of time learning your tools, in this case the pandas library. This video is how I was introduced to the topic, and personally I believe it helped me a lot.

    If I understand correctly, you want to: filter specific crime types, group them by month and add up occurrences, and finally plot the monthly evolution of each crime type.
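    Those three steps can also be chained into a single expression. The sketch below uses a small synthetic frame (made-up dates and offense labels, not the real NYPD data) and `pd.Grouper`, which groups directly on the datetime column so no helper column is needed:

    ```python
    import pandas as pd

    # toy stand-in for the arrests table (synthetic data, not the NYPD dataset)
    df = pd.DataFrame({
        "ARREST_DATE": pd.to_datetime(
            ["01/05/2020", "01/20/2020", "02/11/2020", "02/15/2020", "02/28/2020"],
            format="%m/%d/%Y"),
        "OFNS_DESC": ["PETIT LARCENY", "FELONY ASSAULT", "PETIT LARCENY",
                      "PETIT LARCENY", "FELONY ASSAULT"],
    })

    # filter -> group by calendar month -> count, all in one chain
    # (freq="M" groups by month end; newer pandas versions prefer "ME")
    monthly = (df[df["OFNS_DESC"].isin(["PETIT LARCENY", "FELONY ASSAULT"])]
               .groupby([pd.Grouper(key="ARREST_DATE", freq="M"), "OFNS_DESC"])
               .size()
               .unstack(fill_value=0))
    print(monthly)
    ```

    The result has one row per month and one column per crime type, which is exactly the shape the plotting step wants.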

    Running your code three times back to back, I got execution times of 4.4346, 3.6758 and 3.9400 s -> mean 4.0168 s (not counting the time taken to load the dataset; measured with time.perf_counter()). The data were taken from the NYPD database (please include your data source when posting questions).

    crime_counts is what we call a pivot table: it handles in one step what you did separately for each crime type, while also saving the results in an analysis-friendly pd.DataFrame.

    import time

    import matplotlib.pyplot as plt
    import pandas as pd

    t1 = time.perf_counter()
    # changing string-based date to datetime object
    df["ARREST_DATE"] = pd.to_datetime(df["ARREST_DATE"], format='%m/%d/%Y')
    # create pd.Series object of data on a monthly frequency [length = df length]
    df["ARREST_MONTH"] = df["ARREST_DATE"].dt.to_period('M')  # no one's stopping you from adding new columns

    # Filter the specific crime types
    crime_select = ["DANGEROUS DRUGS", "ASSAULT 3 & RELATED OFFENSES", "PETIT LARCENY", "FELONY ASSAULT", "DANGEROUS WEAPONS"]
    filtered = df.loc[df["OFNS_DESC"].isin(crime_select), ["ARREST_MONTH", "OFNS_DESC"]]
    
    crime_counts = (filtered
                    .groupby(["ARREST_MONTH", "OFNS_DESC"])
                    .size()
                    .unstack(fill_value=0))  # Converts grouped data into a DataFrame
    
    # Plot results
    crime_counts.plot(figsize=(12,6), title="Monthly Crime Evolution")
    plt.xlabel("Arrest Month")
    plt.ylabel("Number of Arrests")
    plt.legend(title="Crime Type")
    plt.grid(True)
    
    t2 = time.perf_counter()
    print(f"Time taken to complete operations: {t2 - t1:0.4f} s")
    
    plt.show()
    
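    As a side note, the same pivot table can be built with `pd.crosstab`, which counts co-occurrences of two columns directly. A minimal sketch on synthetic rows (a stand-in for the filtered subset above, not the real data):

    ```python
    import pandas as pd

    # synthetic example rows shaped like the filtered NYPD subset
    filtered = pd.DataFrame({
        "ARREST_MONTH": pd.PeriodIndex(
            ["2020-01", "2020-01", "2020-02", "2020-02", "2020-02"], freq="M"),
        "OFNS_DESC": ["DANGEROUS DRUGS", "PETIT LARCENY", "DANGEROUS DRUGS",
                      "DANGEROUS DRUGS", "PETIT LARCENY"],
    })

    # rows = months, columns = crime types, cells = counts
    crime_counts = pd.crosstab(filtered["ARREST_MONTH"], filtered["OFNS_DESC"])
    print(crime_counts)
    ```

    Both approaches produce the same month-by-crime-type table; groupby + unstack and crosstab are interchangeable here, so pick whichever reads better to you.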

    The code above completed three runs in 2.5432, 2.6067 and 2.4947 s -> mean 2.5482 s, i.e. roughly a 36.6% reduction in execution time.

    Note: did you include the dataset loading time in your execution time measurements? I found that keeping df loaded and re-running only the calculations yields about 3.35 s for your code and 1.85 s for mine.
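    If you want to shave off more time, one further option (a sketch, not part of the timings above) is converting OFNS_DESC to the pandas `category` dtype: it stores each distinct string once plus integer codes, which tends to speed up repeated comparisons and groupbys on highly repetitive columns. Toy data again:

    ```python
    import pandas as pd

    # synthetic frame; in practice this would be the full 5M-row arrests table
    df = pd.DataFrame({"OFNS_DESC": ["PETIT LARCENY", "FELONY ASSAULT"] * 3})

    # category dtype: distinct strings stored once, rows become integer codes
    df["OFNS_DESC"] = df["OFNS_DESC"].astype("category")

    # filtering works exactly as before, just on the cheaper representation
    mask = df["OFNS_DESC"].isin(["PETIT LARCENY"])
    print(mask.sum())
    ```

    The rest of the pipeline (isin, groupby, unstack) is unchanged; only the column's in-memory representation differs.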