pythonpandasmerge

pandas.merge result is larger than the two separate dataframes


I have to dataframes a, b with datetimeindices and want to merge them, so that all indices are included and where holes occur, nan-values will be.

this worked in the past:

df = pd.merge(train_t1s.iloc[:lols],       sym_train_t1.iloc[:lols],      how='outer', sort=True, left_index=True, right_index=True)

but suddenly It creates a dataframes which ist a lot longer than a & b together (e.g. a's length is 2000 and b's length is 2000 too, but df's length will be 316000000)

a and b each own only one column. this column consists also of datetimevalues in the same format (so maybe this could be the cause)

a:

                                  col_a
2019-07-22 18:18:00 2019-07-23 15:54:00
2019-07-22 18:19:00 2019-07-23 15:54:00
2019-07-22 18:20:00 2019-07-23 15:54:00
2019-07-22 18:21:00 2019-07-23 15:54:00
2019-07-22 18:22:00 2019-07-23 15:54:00
2019-07-22 18:23:00 2019-07-23 15:54:00
2019-07-22 18:24:00 2019-07-23 15:54:00
2019-07-22 18:25:00 2019-07-23 15:54:00
2019-07-22 18:26:00 2019-07-23 15:54:00
2019-07-22 18:27:00 2019-07-23 15:54:00
2019-07-22 18:28:00 2019-07-23 15:54:00
2019-07-22 18:29:00 2019-07-23 15:54:00
2019-07-22 18:30:00 2019-07-23 15:54:00
2019-07-22 18:31:00 2019-07-23 15:54:00
2019-07-22 18:32:00 2019-07-23 15:54:00
2019-07-22 18:33:00 2019-07-23 15:54:00
2019-07-22 18:34:00 2019-07-23 15:54:00
2019-07-22 18:35:00 2019-07-23 15:54:00
2019-07-22 18:36:00 2019-07-23 15:54:00
2019-07-22 18:37:00 2019-07-23 15:54:00
2019-07-22 18:38:00 2019-07-23 15:54:00
2019-07-22 18:39:00 2019-07-23 15:54:00
2019-07-22 18:40:00 2019-07-23 15:54:00
2019-07-22 18:41:00 2019-07-23 15:54:00
2019-07-22 18:42:00 2019-07-23 15:54:00
2019-07-22 18:43:00 2019-07-23 15:54:00
2019-07-22 18:44:00 2019-07-23 15:54:00

b

                                  col_b
2019-07-22 17:40:00 2019-07-23 15:54:00
2019-07-22 17:41:00 2019-07-23 15:54:00
2019-07-22 17:42:00 2019-07-23 15:54:00
2019-07-22 17:43:00 2019-07-23 15:54:00
2019-07-22 17:44:00 2019-07-23 15:54:00
2019-07-22 17:45:00 2019-07-23 15:54:00
2019-07-22 17:46:00 2019-07-23 15:54:00
2019-07-22 17:47:00 2019-07-23 15:54:00
2019-07-22 17:48:00 2019-07-23 15:54:00
2019-07-22 17:49:00 2019-07-23 15:54:00
2019-07-22 17:50:00 2019-07-23 15:54:00
2019-07-22 17:51:00 2019-07-23 15:54:00
2019-07-22 17:52:00 2019-07-23 15:54:00
2019-07-22 17:53:00 2019-07-23 15:54:00
2019-07-22 17:54:00 2019-07-23 15:54:00
2019-07-22 17:55:00 2019-07-23 15:54:00
2019-07-22 17:56:00 2019-07-23 15:54:00
2019-07-22 17:57:00 2019-07-23 15:54:00
2019-07-22 17:58:00 2019-07-23 15:54:00
2019-07-22 17:59:00 2019-07-23 15:54:00
2019-07-22 18:00:00 2019-07-23 15:54:00
2019-07-22 18:01:00 2019-07-23 15:54:00
2019-07-22 18:02:00 2019-07-23 15:54:00
2019-07-22 18:03:00 2019-07-23 15:54:00
2019-07-22 18:04:00 2019-07-23 15:54:00
2019-07-22 18:05:00 2019-07-23 15:54:00
2019-07-22 18:06:00 2019-07-23 15:54:00

Here is a sample of "what I think" my data would look like:



import pandas as pd 
periods_A = 102; periods_B = 57;
ix1= pd.DatetimeIndex(pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T")) 
df1 = pd.DataFrame({"A":["2019-07-21 09:25:00"]*len(ix1)}, index=ix1)  
ix2=pd.date_range(start="2019-07-20 22:25:00", periods=periods_B, freq="T") 
df2 = pd.DataFrame({"B":["2019-07-20 22:25:00"]*len(ix2)}, index=ix2)  
result = pd.merge(df1,df2,how='outer', sort=True, left_index=True, right_index=True)  
print(len(df1), len(df2), len(result))

Am I missing something out here?

Why pd.concat doesnt work:

import pandas as pd
periods_A = 20; periods_B =20;
ix1= pd.DatetimeIndex(pd.date_range(start="2019-07-21 09:25:00", periods=periods_A, freq="T"))
df1 = pd.DataFrame({"A":["2019-07-21 09:25:00"]*len(ix1)}, index=ix1)

ix2=pd.date_range(start="2019-07-21 09:23:00", periods=periods_B, freq="T")
df2 = pd.DataFrame({"B":["2019-07-21 22:29:00"]*len(ix2)}, index=ix2)

result1 = pd.merge(df1,df2,how='outer', sort=True, left_index=True, right_index=True)

print(len(df1), len(df2), len(result1))
#print(df1, df2)
print(result1)

result2 = pd.concat([df1, df2])
print(len(df1), len(df2), len(result2))
print(result2.sort_index())
print(result1.equals(result2.sort_index()))

# output:
# pd.merge:
20 20 22
                                       A                    B
2019-07-21 09:23:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:24:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:25:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:26:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:27:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:28:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:29:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:30:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:31:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:32:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:33:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:34:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:35:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:36:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:37:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:38:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:39:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:40:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:41:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:42:00  2019-07-21 09:25:00  2019-07-21 22:29:00
2019-07-21 09:43:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:44:00  2019-07-21 09:25:00                  NaN
20 20 40


#pd.concat
                                       A                    B
2019-07-21 09:23:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:24:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:25:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:25:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:26:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:26:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:27:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:27:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:28:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:28:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:29:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:29:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:30:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:30:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:31:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:31:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:32:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:32:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:33:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:33:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:34:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:34:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:35:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:35:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:36:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:36:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:37:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:37:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:38:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:38:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:39:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:39:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:40:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:40:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:41:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:41:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:42:00                  NaN  2019-07-21 22:29:00
2019-07-21 09:42:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:43:00  2019-07-21 09:25:00                  NaN
2019-07-21 09:44:00  2019-07-21 09:25:00                  NaN
False


Solution

  • import pandas as pd
    
    # read files, convert to datetime and set as index
    df1 = pd.read_csv('a_.csv', parse_dates=[0, 1], index_col=[0])
    df2 = pd.read_csv('b_.csv', parse_dates=[0, 1], index_col=[0])
    
    print(len(df1), len(df2))
    [out]:
    3519 3519
    
    # check the number of unique values in the index
    df1.index.nunique()
    df2.index.nunique()
    
    [out]:
    2956
    2956
    

    Review the duplicate values

    # see only the duplicated index and associate column data
    dup1 = df1[df1.index.duplicated()]
    dup2 = df2[df2.index.duplicated()]
    
    # check if the column values are unique
    dup1.nunique()
    dup2.nunique()
    
    [out]:
    b_col    1
    dtype: int64
    a_col    1
    dtype: int64
    

    Remove duplicates and join

    df1 = df1[~df1.index.duplicated(keep='first')]
    df2 = df2[~df2.index.duplicated(keep='first')]
    
    # join the dataframe
    
    result = df1.join(df2, how='outer', sort=True)
    
    print(len(df1), len(df2), len(result))
    print(result.shape)
    [out]:
    2956 2956 2956
    (2956, 2)