I have a large dataset (5M+ records) of NYC crimes. I want to select 5 crime types from the 'OFNS_DESC' column and count their occurrences per month so that I can plot a time series, in a more efficient way than I do now.
I currently have the code below. The issue is that it takes quite a while to run because it scans the entire dataset once for each crime type when filtering, and then again for each grouping. I want to make this process more efficient.
# convert ARREST_DATE format to YYYY/MM
arrest_ym = df["ARREST_DATE"].dt.to_period('M')
# filter out crime types
dngr_drug = df[df["OFNS_DESC"] == "DANGEROUS DRUGS"]
aslt3 = df[df["OFNS_DESC"] == "ASSAULT 3 & RELATED OFFENSES"]
pt_lrcy = df[df["OFNS_DESC"] == "PETIT LARCENY"]
flny_aslt = df[df["OFNS_DESC"] == "FELONY ASSAULT"]
dngr_wpns = df[df["OFNS_DESC"] == "DANGEROUS WEAPONS"]
# group and count
dngr_drug_gc = dngr_drug.groupby(arrest_ym)["OFNS_DESC"].count()
aslt3_gc = aslt3.groupby(arrest_ym)["OFNS_DESC"].count()
pt_lrcy_gc = pt_lrcy.groupby(arrest_ym)["OFNS_DESC"].count()
flny_aslt_gc = flny_aslt.groupby(arrest_ym)["OFNS_DESC"].count()
dngr_wpns_gc = dngr_wpns.groupby(arrest_ym)["OFNS_DESC"].count()
Below is the code for my plot. I assume it will need to change along with the code above.
# multi-line plot
x_labels = dngr_drug_gc.index.astype(str)
plt.figure(figsize = (20, 10))
plt.plot(dngr_drug_gc.index.astype(str), dngr_drug_gc.values, 'b-', label = "Dangerous Drugs")
plt.plot(aslt3_gc.index.astype(str), aslt3_gc.values, 'g-', label = "Assault 3 & Related Offenses")
plt.plot(pt_lrcy_gc.index.astype(str), pt_lrcy_gc.values, 'r-', label = "Petit Larceny")
plt.plot(flny_aslt_gc.index.astype(str), flny_aslt_gc.values, 'c-', label = "Felony Assault")
plt.plot(dngr_wpns_gc.index.astype(str), dngr_wpns_gc.values, 'm-', label = "Dangerous Weapons")
plt.xticks(x_labels[::6], rotation=45)
plt.xlabel("Month")
plt.ylabel("Crime Count")
plt.title("Monthly Crime Frequency")
plt.legend(loc = "best")
My results are fine; I just want to optimise the code. One approach I saw was the code below, but I don't know how to proceed from there.
crime_select = ["DANGEROUS DRUGS", "ASSAULT 3 & RELATED OFFENSES", "PETIT LARCENY", "FELONY ASSAULT", "DANGEROUS WEAPONS"]
filtered_crime = df[df["OFNS_DESC"].isin(crime_select)]
Please bear in mind that I am absolutely new to coding so any explanations would be very helpful. Thanks.
You are on a good path if you are already thinking about optimising your code. I must point out, however, that writing good-quality code comes at the cost of spending a lot of time learning your tools, in this case the pandas library. This video is how I was introduced to the topic, and I personally believe it helped me a lot.
If I understand correctly you want to: filter specific crime types, group them by month and add up occurrences, and finally plot monthly crime evolution for each type.
Running your code three times back to back, I got execution times of 4.4346, 3.6758 and 3.9400 s -> mean 4.0168 s (not counting the time taken to load the dataset; measured with time.perf_counter()). The data were taken from the NYPD database (please include your data source when posting questions).
crime_counts below is what we call a pivot table: it handles in one step what you did separately for each crime type, and it stores the result in an analysis-friendly pd.DataFrame.
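To see what that pivot shape looks like on a small scale, here is a toy sketch; the column names mirror yours, but the data are made up:

```python
import pandas as pd

# Made-up miniature of the arrest data: two months, two crime types
toy = pd.DataFrame({
    "ARREST_MONTH": ["2020-01", "2020-01", "2020-02"],
    "OFNS_DESC": ["PETIT LARCENY", "FELONY ASSAULT", "PETIT LARCENY"],
})

# Count rows per (month, crime type), then pivot the crime types into
# columns; fill_value=0 puts a 0 where a type had no arrests that month
counts = (toy
          .groupby(["ARREST_MONTH", "OFNS_DESC"])
          .size()
          .unstack(fill_value=0))
print(counts)  # one row per month, one column per crime type
```

Each crime type becomes its own column, which is exactly the layout DataFrame.plot wants for a multi-line plot.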
import time
import pandas as pd
import matplotlib.pyplot as plt

t1 = time.perf_counter()
# changing string based date to datetime object
df["ARREST_DATE"] = pd.to_datetime(df["ARREST_DATE"], format='%m/%d/%Y')
# create pd.Series object of data on a monthly frequency [length = df length]
df["ARREST_MONTH"] = df["ARREST_DATE"].dt.to_period('M') # no one's stopping you from adding new columns
# Filter the specific crime types
crime_select = ["DANGEROUS DRUGS", "ASSAULT 3 & RELATED OFFENSES", "PETIT LARCENY", "FELONY ASSAULT", "DANGEROUS WEAPONS"]
filtered = df.loc[df["OFNS_DESC"].isin(crime_select), ["ARREST_MONTH", "OFNS_DESC"]]
crime_counts = (filtered
.groupby(["ARREST_MONTH", "OFNS_DESC"])
.size()
.unstack(fill_value=0)) # Converts grouped data into a DataFrame
# Plot results
crime_counts.plot(figsize=(12,6), title="Monthly Crime Evolution")
plt.xlabel("Arrest Month")
plt.ylabel("Number of Arrests")
plt.legend(title="Crime Type")
plt.grid(True)
t2 = time.perf_counter()
print(f"Time taken to complete operations: {t2 - t1:0.4f} s")
plt.show()
The above code completed three runs in 2.5432, 2.6067 and 2.4947 s -> mean 2.5482 s. That amounts to a ~36.6% reduction in run time (roughly 1.6x faster).
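As a side note, pandas also ships pd.crosstab, which builds the same month-by-type count table in a single call. A minimal sketch with made-up data standing in for the filtered arrests:

```python
import pandas as pd

# Made-up miniature of the filtered arrest data
filtered = pd.DataFrame({
    "ARREST_MONTH": ["2020-01", "2020-01", "2020-02"],
    "OFNS_DESC": ["PETIT LARCENY", "FELONY ASSAULT", "PETIT LARCENY"],
})

# crosstab counts co-occurrences: rows = months, columns = crime types
crime_counts = pd.crosstab(filtered["ARREST_MONTH"], filtered["OFNS_DESC"])
print(crime_counts)
```

The result has the same shape as the groupby/unstack version, so the `.plot(...)` call works on it unchanged; which one is faster on 5M rows is worth benchmarking rather than assuming.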
Note: did you include the dataset loading time in your execution time measurements? I found that keeping df loaded and timing only the calculations yields about 3.35 s for your code and 1.85 s for mine.