I'm seeing unexpected behaviour when using .loc on a pandas DataFrame: the return type seems to depend on the length of the DataFrame. Some guidance on what is going on and how to avoid inconsistent outputs would be appreciated.
I'm working with Python 3.11 and pandas 2.2.2. Depending on the DataFrame length, .loc returns either a one-item Series or a float64 scalar. Below is a reproduction with a dummy example I made up.
First, a longer DataFrame with a MultiIndex, where .loc returns a Series:
import pandas as pd

df = pd.DataFrame.from_dict(
{'ix1': ['asd', 'asd', 'asd', 'qwe', 'qwe', 'qwe', 'qwe', 'asd', 'qwe', 'asd', 'asd', 'qwe', 'asd', 'asd', 'asd', 'asd', 'qwe', 'qwe', 'qwe', 'qwe', 'asd', 'qwe', 'qwe', 'asd', 'qwe', 'qwe', 'qwe', 'asd', 'asd', 'asd', 'bar', 'qwe', 'qwe', 'asd', 'qwe', 'asd'],
'ix2': ['sdf', 'bar', 'rty', 'fgh', 'cvb', 'cvb', 'vbn', 'bnm', 'jkl', 'ewq', 'uio', 'uio', 'wer', 'dsa', 'vbn', 'cxz', 'sdf', 'iuo', 'bar', 'bvc', 'fgh', 'rty', 'gfd', 'cvb', 'wer', 'bnm', 'ewq', 'tre', 'uyt', 'jhg', 'foo', 'dsa', 'mnb', 'jkl', 'iuy', 'lkj'],
'value': [float(i) for i in range(1, 37)]})
>>> df[['ix1', 'ix2', 'value']].set_index(['ix1', 'ix2']).loc[('bar', 'foo'), 'value']
# <input>:1: PerformanceWarning: indexing past lexsort depth may impact performance.
# ix1 ix2
# bar foo 31.00
# Name: value, dtype: float64
This raises a PerformanceWarning because the index is not sorted, and returns a Series holding the float64 value (undesired).
A shorter DataFrame with the same structure, the same index columns, and the same line of code returns a float64 instead:
df = pd.DataFrame.from_dict(
{'ix1': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'bar'],
'ix2': ['tyu', 'fgh', 'vbn', 'jkl', 'foo', 'asd', 'qwe', 'foo'],
'value': [float(i) for i in range(1, 9)]})
>>> df[['ix1', 'ix2', 'value']].set_index(['ix1', 'ix2']).loc[('bar', 'foo'), 'value']
# np.float64(8.0)
This simply returns a float64 (as desired).
This causes trouble further down in the code, because I need plain floating-point numbers for some calculations, and I can't find a way to get consistent outputs.
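Your second example doesn't have duplicated combinations of ix1/ix2, which prevents the issue. In your first example the pair ('qwe', 'cvb') appears twice, so the MultiIndex is not unique and .loc returns a Series even for a key that matches only one row. The length of the DataFrame is not what matters.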
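To check this on your own data, you can look at the index's is_unique flag. Here is a minimal sketch with two small made-up frames (the names unique_df and dup_df and their contents are only for illustration, not taken from the question):

```python
import pandas as pd

cols = ["ix1", "ix2"]

# Hypothetical frame where every (ix1, ix2) pair occurs once.
unique_df = pd.DataFrame(
    {"ix1": ["foo", "foo", "bar"], "ix2": ["tyu", "fgh", "foo"], "value": [1.0, 2.0, 3.0]}
)
# Hypothetical frame where ('qwe', 'cvb') occurs twice; ('bar', 'foo') is still unique.
dup_df = pd.DataFrame(
    {"ix1": ["qwe", "qwe", "bar"], "ix2": ["cvb", "cvb", "foo"], "value": [1.0, 2.0, 3.0]}
)

s_unique = unique_df.set_index(cols)["value"]
s_dup = dup_df.set_index(cols)["value"]

# The uniqueness of the whole index decides the return type of .loc.
print(s_unique.index.is_unique)
print(s_dup.index.is_unique)

# Same key, same matching row, different return types.
# The second lookup may also emit a PerformanceWarning because the index is unsorted.
r_unique = s_unique.loc[("bar", "foo")]
r_dup = s_dup.loc[("bar", "foo")]
print(type(r_unique).__name__, type(r_dup).__name__)
```

If is_unique is False, expect a Series back from a full-key lookup, even when the key you ask for only matches a single row.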
If you always want a float I'd use:
cols = ['ix1', 'ix2']
df.drop_duplicates(cols).set_index(cols)['value'].loc[('bar', 'foo')]
Output: 31.0
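Note that drop_duplicates keeps the first row of each duplicated pair by default, so the other rows are silently discarded. If you would rather be told when a key matches more than one row, one possible alternative is to call .item() on the result, which exists on both a pandas Series and a NumPy scalar. This is only a sketch: lookup_value is a hypothetical helper name and the data is made up.

```python
import pandas as pd

cols = ["ix1", "ix2"]

# Hypothetical data: ('qwe', 'cvb') is duplicated, ('bar', 'foo') is not.
df = pd.DataFrame(
    {
        "ix1": ["qwe", "qwe", "bar"],
        "ix2": ["cvb", "cvb", "foo"],
        "value": [1.0, 2.0, 3.0],
    }
)


def lookup_value(frame, key):
    """Return the single value stored under `key` as a Python float.

    Hypothetical helper: raises ValueError if `key` matches more than one row.
    """
    # Sorting the index should avoid the "indexing past lexsort depth" warning.
    result = frame.set_index(cols).sort_index()["value"].loc[key]
    # `result` is a Series (non-unique index) or a NumPy scalar (unique index).
    # Series.item() raises ValueError unless the Series holds exactly one element.
    return float(result.item())


print(lookup_value(df, ("bar", "foo")))
```

Looking up ('qwe', 'cvb') with this helper raises a ValueError instead of quietly picking one of the two rows, which may or may not be what you want for your calculations.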