Tags: python, python-3.x, pandas, dataframe

Python pandas DataFrame challenge: how do I avoid iterrows() for this scenario?


Context

I've often heard it said that "you should avoid iterating through a DataFrame", "using iterrows is bad/slow/etc.", or "there is nothing you can do with iterrows that you can't do with apply/applymap/map". I'm trying to shed the habit of using iterrows, but I've found a scenario I haven't yet been able to solve with apply/applymap/map. I'm hoping some df guru can help me out.

Overview of logic/scenario

I have a {set} of unique values that correspond to a column in my dataframe. The dataframe is transactional: I need to compare each element of my set to the <JobNumber> column of my dataset; where the values match, append the value of <RunNumber> to a list. After one complete pass of the dataset (collecting all matching values), store that list as a value in a dictionary, with the set element as the key. Repeat for all elements of the {set}.
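In plain Python terms, the target structure is a mapping from each job number to the list of its run numbers. A minimal sketch with made-up values (the tuples below are illustrative, not taken from the real dataset):

```python
# Toy (job, run) pairs - hypothetical values for illustration only.
rows = [(1667, 2166), (1667, 2165), (1710, 3602)]

result = {}
for job, run in rows:
    # Collect every run seen for a given job into one list.
    result.setdefault(job, []).append(run)

# result == {1667: [2166, 2165], 1710: [3602]}
```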

Main Challenges

  1. I am not applying a function per element: I am comparing a {set} element to a dataframe column, which may match an arbitrary number of times, and for each match appending the value from a different column to a list (not a df)
  2. After one complete pass of the df, I need to store that compiled list in a dictionary (again, not a df)

All the examples I've seen of apply/applymap/map operate on every element of a series or dataframe - not, for example, on just 2 columns out of a potential 5, or comparing values in columns 1 and 4 and then appending column 10 to a structure outside the dataframe.

Current Solution (looking to improve on)

# dict_val is the transactional DataFrame; job_list may contain duplicates
# Convert to set - make unique
job_set = set(job_list)

results_dict = dict()
# loop through the set; for each job number, collect all associated runs:
for job in job_set:
    temp_list = list()
    for idx, val in dict_val.iterrows():
        if val['JobNumber'] == job:
            temp_list.append(val['RunNumber'])

    # store in the new dict - {job number : [list of runs]}
    results_dict[job] = temp_list

Sample Data

JobNumber,RunNumber,Brief Created Date
10000,9,8/03/17
1667,2166,5/05/18
1667,2165,5/05/18
1667,2153,8/04/18
1710,3602,24/06/18
1710,3600,22/06/18
1710,3594,18/06/19
1710,3589,11/06/19
1710,3492,5/03/18
1710,3456,27/01/18
2265,1436,3/06/18
2265,1429,6/05/19
2265,1418,8/04/19
3708,459,9/04/19
3708,109,4/06/19
3708,402,26/03/19
3938,401,19/03/19
3938,400,12/03/19
3938,399,5/03/19
3938,391,6/01/19
3938,0,5/03/19
4529,2117,24/06/19
4529,1736,29/03/19
4529,143,30/03/19
4529,1158,9/03/19
4669,374,17/05/19
4813,30,5/01/19
5651,62,1/06/18
5651,61,6/04/19
5651,60,16/03/19
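For anyone wanting to reproduce this, the sample above can be loaded straight into a DataFrame (a sketch using the first few rows only; the full sample loads the same way, and `dict_val` matches the variable name used in the code):

```python
import io
import pandas as pd

# A few rows from the sample data above, inlined as CSV text.
csv_data = """JobNumber,RunNumber,Brief Created Date
10000,9,8/03/17
1667,2166,5/05/18
1667,2165,5/05/18
1667,2153,8/04/18
"""

dict_val = pd.read_csv(io.StringIO(csv_data))
job_set = set(dict_val['JobNumber'])  # {10000, 1667}
```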

Please let me know if you need more details. My challenge is simply to complete this using the 'preferred' methods of apply, applymap, or map. My main intention is to implement best practices for optimised execution times.

Like I mentioned: all the use cases I've seen for these methods centre on individual elements of a dataframe/series - I need a way to compare values across a whole dataset, bring the result into a new data structure, then rinse and repeat.

Thanks in advance


Solution

  • Does this work?

    results_dict = dict()
    for job in job_set:
        filt = (dict_val['JobNumber'] == job)  # boolean mask: True for rows matching this job number
        new_df = dict_val[filt]                # sub-DataFrame containing only the matching rows
        temp_list = new_df['RunNumber'].tolist()
        results_dict[job] = temp_list
    

    I couldn't run the code against your data, so there may be a typo, but I hope you get the logic.
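For completeness, here is a fully vectorised sketch (my addition, not part of the original answer) that removes the Python-level loop over `job_set` entirely: `groupby` makes a single pass over the DataFrame and collects the runs for every job at once.

```python
import pandas as pd

# Assumes dict_val has 'JobNumber' and 'RunNumber' columns, as in the question.
# Small stand-in DataFrame for demonstration:
dict_val = pd.DataFrame({
    'JobNumber': [1667, 1667, 1710],
    'RunNumber': [2166, 2165, 3602],
})

# Group rows by job, collect each group's runs into a list, then convert to a dict.
results_dict = dict_val.groupby('JobNumber')['RunNumber'].apply(list).to_dict()
# results_dict == {1667: [2166, 2165], 1710: [3602]}
```

One pass of `groupby` replaces the per-job boolean mask, which scans the whole DataFrame once per unique job number.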