I have the following code:
import pandas as pd
series_source = pd.Series([1, 2, 3, 4], dtype=int)
normal_index = pd.Series([True, False, True, True], dtype=bool)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)
# Both indexes give back the values [1, 3, 4] (labels 0, 2, and 3)
# no warnings are raised!
assert (series_source[normal_index] == series_source[big_index]).all()
df_source = pd.DataFrame(
    [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ]
)
# no warning - works as expected: grabs rows 0, 2, and 3
df_normal_result = df_source[normal_index]
# UserWarning: Boolean Series key will be reindexed to match DataFrame index.
# (but still runs)
df_big_result = df_source[big_index]
# passes - they are equivalent
assert df_normal_result.equals(df_big_result)
print("Complete")
Why is it that indexing series_source with big_index doesn't raise a warning, even though big_index has more values than the source? What is pandas doing under the hood to perform the Series indexing?
(Contrast this with indexing df_source, where an explicit warning is raised that big_index needs to be reindexed for the operation to work.)
The indexing docs claim that:
Using a boolean vector to index a Series works exactly as in a NumPy ndarray
However, if I do
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array([True, False, True, True, False])
c = np.array([True, False, True, True, False, True, True])
# returns an ndarray of [1, 3, 4] as expected
print(a[b])
# raises IndexError: boolean index did not match indexed array along axis 0;
# size of axis is 5 but size of corresponding boolean axis is 7
print(a[c])
So this behavior does not seem to match NumPy, as the docs claim it should. What's going on?
(My versions are pandas==2.2.2 and numpy==2.0.0.)
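Because the boolean Series used as the key is first aligned to the index of the DataFrame/Series being indexed. Labels in the key that do not exist in the indexed object (here 4 and 5) are dropped during that alignment, so the extra values never matter.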
In short, pandas is effectively doing:
tmp = big_index.reindex(df_source.index)
df_big_result = df_source[tmp]
Example for a Series:
pd.Series([0,1,2])[pd.Series([True, True, False], index=[1,2,0])]
# 1 1
# 2 2
# dtype: int64
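The alignment step can also be checked against the data from the question. This is only a sketch of the observable behavior using the public reindex method, not the internal code path pandas takes; the variable name aligned is introduced here for illustration.

```python
import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
normal_index = pd.Series([True, False, True, True], dtype=bool)
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)

# Align the oversized mask to the source's index; labels 4 and 5 are dropped.
aligned = big_index.reindex(series_source.index)

# After alignment the mask matches the four-element one label for label.
print(aligned.tolist() == normal_index.tolist())

# Indexing with the explicitly aligned mask selects the same rows
# as indexing with the oversized mask directly.
print(series_source[aligned].equals(series_source[big_index]))
```

Because the comparison is done by label, the position of each boolean inside big_index is irrelevant; only the label attached to it counts.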
You can actually observe this yourself if you change the indices of the indexing Series:
big_index2 = pd.Series([False, False, True, True, True, True],
                       index=[4, 5, 0, 1, 2, 3], dtype=bool)
df_source[big_index2]
Output:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
We have 4 rows in the output, despite the first two values being False. After reindexing, the boolean values are [True, True, True, True].
You should get a warning in this case:
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
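Note that if alignment cannot be done (a label of the indexed object is missing from the boolean Series), or if the key is a plain list of the wrong length, an error is raised, as in NumPy: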
pd.Series([0,1,2])[pd.Series([True, True, False], index=[1,2,3])]
# IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
pd.Series([0,1,2])[[True, False, False, True]]
# IndexError: Boolean index has wrong length: 4 instead of 3
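If NumPy-style strictness is what you want, one option is to strip the labels from the mask before indexing. The sketch below assumes a pandas version that exposes pd.errors.IndexingError (1.5 or later); the names source, oversized, unalignable_error and length_error are made up for the example.

```python
import pandas as pd

source = pd.Series([0, 1, 2])

# A boolean Series that lacks label 0 cannot be aligned to the source.
try:
    source[pd.Series([True, True, False], index=[1, 2, 3])]
    unalignable_error = None
except pd.errors.IndexingError as exc:
    unalignable_error = exc

# A NumPy mask carries no labels, so there is nothing to align and
# only its length is compared with the length of the source.
oversized = pd.Series([True, False, False, True]).to_numpy()
try:
    source[oversized]
    length_error = None
except IndexError as exc:
    length_error = exc

print(type(unalignable_error).__name__)
print(type(length_error).__name__)
```

So the "works exactly as in a NumPy ndarray" behavior shows up when the key is an unlabeled array or list; a boolean Series key goes through label alignment first.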
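The warning appears only for DataFrames because the DataFrame code path explicitly checks for a boolean Series key whose index differs from the frame's index, while the Series code path aligns the key without warning: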
# internal for DataFrame.__getitem__
def __getitem__(self, key):
    # ...
    if isinstance(key, Series) and not key.index.equals(self.index):
        warnings.warn(
            "Boolean Series key will be reindexed to match DataFrame index.",
            UserWarning,
            stacklevel=find_stack_level(),
        )

# internal for Series.__getitem__
if com.is_bool_indexer(key):
    key = check_bool_indexer(self.index, key)
    key = np.asarray(key, dtype=bool)
    return self._get_rows_with_mask(key)
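The difference between the two code paths can be observed from the outside by recording warnings. This is a sketch assuming pandas 2.2.x as in the question; reindex_warnings is a hypothetical helper written for this example, not a pandas function, and the two-column frame is a shortened stand-in for df_source.

```python
import warnings

import pandas as pd

series_source = pd.Series([1, 2, 3, 4], dtype=int)
df_source = pd.DataFrame({"a": [1, 5, 9, 13], "b": [2, 6, 10, 14]})
big_index = pd.Series([True, False, True, True, False, True], dtype=bool)


def reindex_warnings(indexing_call):
    """Run indexing_call and return the 'will be reindexed' warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        indexing_call()
    return [str(w.message) for w in caught if "will be reindexed" in str(w.message)]


# Series path: the key is aligned without any warning.
series_msgs = reindex_warnings(lambda: series_source[big_index])

# DataFrame path: the index-mismatch check fires before alignment.
frame_msgs = reindex_warnings(lambda: df_source[big_index])

# Aligning the mask yourself gives the frame a key with an equal index.
explicit_msgs = reindex_warnings(
    lambda: df_source[big_index.reindex(df_source.index)]
)

print(series_msgs, frame_msgs, explicit_msgs)
```

Reindexing the mask explicitly before indexing the DataFrame is one way to keep the behavior while making the alignment visible in your own code.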