import pandas as pd
df_1 = pd.DataFrame({'col1': ['a', 'a', 'a']})
df_2 = pd.DataFrame({'col1': ['b', 'b', 'b']})
df_2.index = [4,5,6]
df_1['col2'] = df_2.col1
I expect a simple copy in the above example, but 'col2' in df_1 is all NAs. I find it strange. What is the rational for this choice? Similar example works differently in R.
The rationale is to keep the data aligned. The same way columns are aligned when you perform an operation.
This enables to perform complex operations without having to worry about the order of the rows/columns like you would with numpy:
A = pd.DataFrame({'blue': [1, 2], 'green': [3, 4]}, index=['circle', 'square'])
B = pd.DataFrame({'green': [10, 20], 'blue': [30, 40]}, index=['square', 'circle'])
A+B
blue green
circle 41 23
square 32 14
Or to handle missing data:
df = pd.DataFrame({'col1': ['a1', 'a2', 'a3']})
df['col2'] = pd.Series(['b1', 'b3'], index=[0, 2])
print(df)
col1 col2
0 a1 b1
1 a2 NaN
2 a3 b3
Or to perform broadcasting:
idx1 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
names=['letter', 'number'])
df = pd.DataFrame({'col1': [1, 2, 3, 4]}, index=idx1)
idx2 = pd.Index([1, 2], name='number')
df['col2'] = df['col1'] * pd.Series([1, 10], index=idx2)
print(df)
col1 col2
letter number
a 1 1 1
2 2 20
b 1 3 3
2 4 40
If you want to avoid this you can convert to array:
import pandas as pd
df_1 = pd.DataFrame({'col1': ['a1', 'a2', 'a3']})
df_2 = pd.DataFrame({'col1': ['b1', 'b2', 'b3']})
df_2.index = [2,3,1]
df_1['col2'] = df_2['col1']
df_1['col3'] = df_2['col1'].values # or df_2['col1'].to_numpy()
Output:
col1 col2 col3
0 a1 NaN b1
1 a2 b3 b2
2 a3 b1 b3