how to pandas fast nested for loop for "non numeric" columns? because this for loop is way to slow:
for i in range(len(df1[column_A]):
for j in range(len(df2[column_A]):
if df1[column_A][i] == df2[column_A][j]:
df1[column_B][i] = df2[column_B][j]
else:
pass
so any other way to do it by pandas itself or other libraries?
UPDATE:
and main goal is:
input:
df1:
name rpm power
0 John 1500 high+
1 Mary 1400 high-
2 Sally 300 low-
3 Doe 700 medium-
4 July 1000 medium+
df2:
name age
0 Peter 77
1 Sally 44
2 Micky 22
3 Sally 34
4 July 50
5 Bob 20
required output is:
but i want it df2:
name age rpm power
0 Peter 77 0 NA
1 Sally 44 300 low-
2 Micky 22 0 NA
3 Sally 34 300 low-
4 July 50 1000 medium+
5 Bob 20 0 NA
i also add question in official pandas github: https://github.com/pandas-dev/pandas/issues/59824
The nested loop you provided result in O(n^2) complexity, making it slow for larger datasets . Looping over the same range for both i and j, which is unnecessary.
Instead you can use pd.merge
import pandas as pd
# Merge file1 and file2 on column A
merged_df = pd.merge(file1, file2, on='column_A') # assuming it is a pandas dataframe
# Update file1_column_B with matched values from file2
file1_column_B = merged_df['column_B_y']
pd.merge()
function merges two DataFrames ( file1 and file2 ) based on a common column ( column_A ).Also by default, pd.merge()
performs an inner merge, which means only rows with matching values in column_A are included in the resulting DataFrame.
Time complexity of pd.merge
is O(n + m) in best case scenario (concluding it) where n is the number of rows in the left DataFrame (file1) and m is the number of rows in the right DataFrame (file2). However, in the worst-case scenario (e.g., when there are many duplicate values in the merge column), the time complexity can be O(n × m).
You can also use numPy argsort
+ searchsorted
import numpy as np
sorted = np.argsort(file2['column_A'])
match = np.searchsorted(file2['column_A'][sorted], file1['column_A'])
file1_column_B = file2['column_B'][sorted[match]]
Time Complexity of above sort: argsort: O(n log n) searchsorted: O(m log n) Total: O(n log n + m log n)