python, loops, visualization, data-analysis

How to use a for loop to build an array in Python?


I wrote a piece of code that mostly does what I want: it goes through a data vector {x, y}, ordered in increasing x, and bins the y-data into user-defined x-intervals so I can do statistics on the dataset. But it takes very long to run and there seems to be an issue.

Example data is in the form of:

rawdata: Excel sheet, {x, y} with ~160000 total points, equidistant in x (up to 1e-5 error) and ordered in x from smallest to largest.

bins: Excel sheet, {bin index, bin minimum, bin center, bin maximum} with 450 total bins.
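
In case it helps with reproducing the problem, toy files in roughly this shape could be generated as below (the sizes and value ranges are placeholders, not the real data, and writing .xlsx needs openpyxl):

import numpy as np
import pandas as pd

# toy files roughly matching the description above (sizes reduced for illustration)
rng = np.random.default_rng(0)
n_points, n_bins = 1600, 45                    # placeholders, not the real 160000 / 450
x = np.linspace(0.0, 1.0, n_points)            # equidistant in x, sorted from smallest to largest
y = rng.normal(loc=1.0, scale=0.3, size=n_points)

edges = np.linspace(0.0, 1.0, n_bins + 1)
pd.DataFrame({"x": x, "y": y}).to_excel("data.xlsx", index=False)
pd.DataFrame({
    "bin index": np.arange(n_bins),
    "bin minimum": edges[:-1],
    "bin center": (edges[:-1] + edges[1:]) / 2.0,
    "bin maximum": edges[1:],
}).to_excel("bins.xlsx", index=False)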

My attempt below:

import pandas as pd

# import bins and data values
rawdata = pd.read_excel(r"data.xlsx")
bins = pd.read_excel(r"bins.xlsx")

# wrap in DataFrames for plotting/operations
dfrawdata = pd.DataFrame(rawdata)   # dfrawdata.loc[i][0] is the x value, dfrawdata.loc[i][1] is the y value
dfbins = pd.DataFrame(bins)

# initialize empty array (list of empty lists) to store results of loop (each loop resulting in binning of the data)
emptyarray = [[] for _ in range(len(dfbins))]

# loop over the bins and over the x-y data, appending the y-value to the list at
# index *bindex* in the empty array if it is i) larger than 0 and ii) inside that bin

bindex = 0    # bin index
dindex = 0    # data index
while bindex <= 2:
    while dindex <= len(dfrawdata) - 1:
        if dfbins.loc[bindex][1] < dfrawdata.loc[dindex][0] < dfbins.loc[bindex][3] and dfrawdata.loc[dindex][1] > 0:
            emptyarray[bindex].append(dfrawdata.loc[dindex][1])
        dindex = dindex + 1
    dindex = 0
    bindex = bindex + 1

It works perfectly fine when I limit dindex to a range of up to 100, but when I increase the range of dindex beyond 100, the computing time blows up.

Is there any way to do this more elegantly and faster?


Solution

  • You're on the right track with your binning idea, but the performance bottleneck comes from the nested loops: you're performing an O(n*m) operation (where n = number of bins and m = number of data points), and every element access goes through pandas' .loc, which is slow when called once per element in a Python loop. For a dataset like yours (~160,000 data points, 450 bins) that adds up to tens of millions of slow lookups.

    Here's an easy solution using Pandas:

    import pandas as pd
    
    # Load data
    df = pd.read_excel("data.xlsx", header=None, names=["x", "y"])
    bins = pd.read_excel("bins.xlsx", header=None, names=["bin_index", "bin_min", "bin_center", "bin_max"])
    
    # Build bin edges
    edges = bins["bin_min"].tolist() + [bins["bin_max"].iloc[-1]]
    
    # Filter y > 0
    df = df[df["y"] > 0]
    
    # Assign bin index
    df["bin"] = pd.cut(df["x"], bins=edges, labels=False, include_lowest=True)
    
    # Group y-values by bin
    binned_y = df.groupby("bin")["y"].apply(list)
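
    Since the point of the binning is to do statistics per bin, a short follow-up sketch (reusing df and the "bin" column computed above) could aggregate directly instead of collecting lists:

    # per-bin statistics straight from the grouped y-values
    stats = df.groupby("bin")["y"].agg(["count", "mean", "std", "min", "max"])
    print(stats.head())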
    

    I personally tend to use as few external modules as possible for my projects, but for operations of this kind Pandas and NumPy are a must.

    Anyway, this solution is fully vectorised and exploits the efficiency of Pandas' C-backed implementation, so it scales well to larger datasets.
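
    If you want to stay closer to plain NumPy, here is a rough equivalent sketch. It assumes the same placeholder file layout as above and contiguous bins, and it computes per-bin counts and means directly with np.searchsorted and np.bincount rather than collecting lists:

    import numpy as np
    import pandas as pd

    # same placeholder file layout as above
    df = pd.read_excel("data.xlsx", header=None, names=["x", "y"])
    bins = pd.read_excel("bins.xlsx", header=None, names=["bin_index", "bin_min", "bin_center", "bin_max"])

    x = df["x"].to_numpy()
    y = df["y"].to_numpy()

    # keep only positive y-values, matching the intent of the original loop
    keep = y > 0
    x, y = x[keep], y[keep]

    # bin edges: all lower edges plus the final upper edge (assumes contiguous bins)
    edges = np.append(bins["bin_min"].to_numpy(), bins["bin_max"].iloc[-1])

    # for each x, the index of the bin it falls into
    idx = np.searchsorted(edges, x, side="right") - 1

    # drop points that fall outside the binning range
    inside = (idx >= 0) & (idx < len(bins))
    idx, y = idx[inside], y[inside]

    # per-bin count and mean via bincount; empty bins come out as NaN
    counts = np.bincount(idx, minlength=len(bins))
    sums = np.bincount(idx, weights=y, minlength=len(bins))
    means = np.divide(sums, counts, out=np.full(len(bins), np.nan), where=counts > 0)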