I wrote a piece of code that mostly does what I want: it goes through a data vector {x, y}, ordered by increasing x, and bins the y-data into user-defined x-intervals so I can do statistics on the dataset. But it takes extremely long to run, and there seems to be an issue.
Example data is in the form of:
rawdata: Excel sheet, {x, y} with ~160000 total points, equidistant in x (up to 1e-5 error) and ordered in x from smallest to largest.
bins: Excel sheet, {bin index, bin minimum, bin center, bin maximum} with 450 total bins.
My attempt below:
# import bins and data values
import pandas as pd

rawdata = pd.read_excel(r"data.xlsx", header=None)  # no header row: columns are 0 (x) and 1 (y)
bins = pd.read_excel(r"bins.xlsx", header=None)     # columns 0-3: index, min, center, max
# add dataframe to data for plotting/operations
dfrawdata = pd.DataFrame(rawdata)  # dfrawdata[0][i] is the i-th x value, dfrawdata[1][i] the i-th y value
dfbins = pd.DataFrame(bins)
# initialize empty array (list of empty lists) to store the results of the loop
# (each pass bins the data for one bin)
emptyarray = [[] for _ in range(len(dfbins))]
# loop through the x-y data and put the y-value into the list at index *bindex*
# if it is i) larger than 0 and ii) inside that bin
bindex = 0  # bin index
dindex = 0  # data index
while bindex <= 2:  # testing on the first 3 bins only
    while dindex <= len(dfrawdata) - 1:
        if dfbins.loc[bindex][1] < dfrawdata.loc[dindex][0] < dfbins.loc[bindex][3]:
            if dfrawdata.loc[dindex][1] > 0:
                emptyarray[bindex].append(dfrawdata.loc[dindex][1])
        dindex = dindex + 1
    dindex = 0
    bindex = bindex + 1
It works perfectly fine when I limit dindex to a range of up to 100, but when I increase the range of dindex beyond 100, the computing time blows up.
Is there any way to do this more elegantly and faster?
You're on the right track with your binning idea, but the performance bottleneck comes from the nested loops. You're performing an O(n*m) operation (where n = number of bins and m = number of data points), and on top of that, element-wise access via .loc inside a Python loop carries a large per-access overhead in pandas. That combination is why the runtime blows up on a dataset like yours (~160,000 data points, 450 bins).
Here's an easy solution using pandas:
import pandas as pd

# Load data (assuming the sheets have no header row)
df = pd.read_excel("data.xlsx", header=None, names=["x", "y"])
bins = pd.read_excel("bins.xlsx", header=None, names=["bin_index", "bin_min", "bin_center", "bin_max"])

# Build bin edges; this assumes the bins are contiguous
# (each bin_max equals the next bin's bin_min)
edges = bins["bin_min"].tolist() + [bins["bin_max"].iloc[-1]]

# Filter y > 0
df = df[df["y"] > 0]

# Assign each row an integer bin index (labels=False returns integers)
df["bin"] = pd.cut(df["x"], bins=edges, labels=False, include_lowest=True)

# Group y-values by bin
binned_y = df.groupby("bin")["y"].apply(list)
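Since your goal is to do statistics on the binned data, you don't even need to materialise the lists; you can aggregate directly on the grouped values. For example, continuing from the df above:

# Per-bin summary statistics computed straight from the groupby
stats = df.groupby("bin")["y"].agg(["count", "mean", "std"])
print(stats.head())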
I personally tend to use as few external modules as possible for my projects, but for operations of this kind pandas and NumPy are a must.
Anyway, this solution is fully vectorised, exploiting pandas' C-backed implementation, and it scales well to larger datasets.
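If you'd rather stay closer to NumPy: since your x-values are already sorted, np.searchsorted can locate every bin's slice boundaries in a single vectorised call, because each bin then corresponds to a contiguous slice of the array. A minimal sketch, assuming half-open bins [bin_min, bin_max) (slightly different edge handling than pd.cut above) and reusing the edges list from the pandas snippet:

import numpy as np
import pandas as pd

# Plain arrays; x must be sorted ascending (it is, per the question)
data = pd.read_excel("data.xlsx", header=None, names=["x", "y"])
x = data["x"].to_numpy()
y = data["y"].to_numpy()

# Drop non-positive y-values; x remains sorted after boolean masking
keep = y > 0
x, y = x[keep], y[keep]

# searchsorted returns, for each edge, the index where it would be
# inserted into x; consecutive edges therefore delimit each bin's slice
pos = np.searchsorted(x, np.asarray(edges))
binned_y = [y[pos[i]:pos[i + 1]] for i in range(len(pos) - 1)]

This avoids the pd.cut pass entirely and runs in O(m log n) for the edge lookups, which is essentially instant at your data size.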