I have always been under impression that Pandas uses hashing when indexing the rows in a dataframe such that the operations like df.loc[some_label]
is O(1)
.
However, I just realized today that this is not the case, at least for multi-indexed dataframe. As pointed out in the document, "Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning)". Some articles I found seem to suggest, for multiindex dataframe, Pandas is using binary-search based indexing if you have called sort_index()
on the dataframe; otherwise, it just linearly scans the rows.
My question are
sort_index()
has been called, and linear scan otherwise, like in the case of multi-indexed dataframe?Does a single-indexed DataFrame use hash-based indexing?
No, Pandas does not use hash-based indexing for single-indexed DataFrames. Instead, it relies on array-based lookups or binary search when the index is sorted. If the index is unsorted, Pandas performs a linear scan, which is less efficient.
If the DataFrame is sorted using sort_index()
, Pandas can leverage a binary search to achieve faster lookups. Without sorting, lookups default to a linear scan.
Hash-based indexing is more challenging for multi-indexes due to the hierarchical nature of the index. Instead, Pandas relies on binary search (for sorted indexes) or linear scan (for unsorted indexes) because these methods handle hierarchical indexing efficiently. Hash-based indexing would introduce additional overhead and complexity when working with multiple levels.