Tags: pandas, python-3.7, hdf5, pytables, hdfstore

Pandas HDFStore: append fails when min_itemsize is set to the maximum of the string column


I'm detecting the maximum length of every string column across multiple DataFrames, then attempting to build an HDFStore:

import pandas as pd

# Detect max string length for each column across all DataFrames
max_lens = {}
for df_path in paths:
    df = pd.read_pickle(df_path)
    for col in df.columns:
        ser = df[col]
        if ser.dtype == 'object' and isinstance(
            ser.loc[ser.first_valid_index()], str
        ):
            max_lens[col] = max(
                ser.dropna().map(len).max(), max_lens.setdefault(col, 0)
            )
print('Setting min itemsizes:', max_lens)

hdf_path.unlink()  # Delete the file for a clean retry
store = pd.HDFStore(hdf_path, complevel=9)
for df_path in paths:
    df = pd.read_pickle(df_path)
    store.append(hdf_key, df, min_itemsize=max_lens, data_columns=True)
store.close()

The detected maximum string lengths are as follows:

     max_lens = {'hashtags': 139,
                 'id': 19,
                 'source': 157,
                 'text': 233,
                 'urls': 2352,
                 'user_mentions_user_ids': 199,
                 'in_reply_to_screen_name': 17,
                 'in_reply_to_status_id': 19,
                 'in_reply_to_user_id': 19,
                 'media': 286,
                 'place': 56,
                 'quoted_status_id': 19,
                 'user_id': 19}

Yet still I'm getting this error:

ValueError: Trying to store a string with len [220] in [hashtags] column but
this column has a limit of [194]!
Consider using min_itemsize to preset the sizes on these columns

This is odd, because the detected maximum length of hashtags is 139.


Solution

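  • HDFStore writes strings as fixed-width byte strings, encoded as UTF-8 by default, so min_itemsize is a size in bytes, not in characters. Any non-ASCII character takes more than one byte, which is why len() on the raw strings under-counts. Encode the strings as UTF-8 first, then take the maximum length: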

    a_pandas_string_series.str.encode('utf-8').str.len().max()
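
To see why the two counts differ, here is a minimal pure-Python sketch. The helper names `utf8_len` and `update_max_lens` and the sample values are hypothetical, made up for illustration; they are not part of pandas or of the question's data.

```python
def utf8_len(value):
    """Length of a string in bytes once encoded as UTF-8."""
    return len(value.encode('utf-8'))


def update_max_lens(max_lens, column, values):
    """Raise max_lens[column] to the longest UTF-8 byte length found in values."""
    # Skip non-string entries (e.g. NaN) instead of failing on them
    longest = max((utf8_len(v) for v in values if isinstance(v, str)), default=0)
    max_lens[column] = max(longest, max_lens.get(column, 0))
    return max_lens


sample = ['#pandas', '#日本語']           # hypothetical hashtag values
char_max = max(len(v) for v in sample)    # counts characters
byte_max = update_max_lens({}, 'hashtags', sample)['hashtags']  # counts bytes
print(char_max, byte_max)  # 7 10
```

The second string has only 4 characters but needs 10 bytes, so a limit derived from len() would be too small for it. In the question's loop, something like update_max_lens(max_lens, col, ser.dropna()) could probably stand in for the ser.dropna().map(len).max() expression, assuming the column values are plain Python strings.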