Tags: python, pyarrow, apache-arrow

Efficiently storing data that is not yet purely-columnar into the Arrow format


I previously asked a question similar to this one (in that it uses the same example to illustrate the problem): Efficiently storing data that is not-quite-columnar-yet into a DuckDB database.

However, DuckDB is not Apache Arrow. In particular, one cannot insert data into an Apache Arrow table or array using SQL queries, except indirectly.

Arrow arrays are directly constructed much like Numpy arrays. My best guess, based on what I understand from the documentation so far, is that I likely want to use Arrow Tensors in some way. I am far from certain, hence this question.

I have some partially-columnar data like this:

"hello", "2024 JAN", "2024 FEB"
"a", 0, 1

If it were purely-columnar, it would look like:

"hello", "year", "month", "value"
"a", 2024, "JAN", 0
"a", 2024, "FEB", 1

Suppose the data is in the form of a numpy array, like this:

import numpy as np

data = np.array(
    [["hello", "2024 JAN", "2024 FEB"], ["a", "0", "1"]], dtype="<U"
)

array([['hello', '2024 JAN', '2024 FEB'],
       ['a', '0', '1']], dtype='<U8')

How could I go about efficiently storing data into the Apache Arrow format, in a purely columnar format?

The naive/easy way would be to brute-force transform the data into a columnar format in Python, before storing it as an Apache Arrow array.

Here I mean specifically:

import re

data_header = data[0]
data_proper = data[1:]

date_pattern = re.compile(r"(?P<year>[\d]+) (?P<month>JAN|FEB)")
common_labels: list[str] = []
known_years: set[int] = set()
known_months: set[str] = set()
header_to_date: dict[str, tuple[int, str]] = dict()
for header in data_header:
    if matches := date_pattern.match(header):
        year, month = int(matches["year"]), str(matches["month"])
        known_years.add(year)
        known_months.add(month)
        header_to_date[header] = (year, month)
    else:
        common_labels.append(header)

# hello, year, month, value
new_rows_per_old_row = len(known_years) * len(known_months)
new_headers = ["year", "month", "value"]
purely_columnar = np.empty(
    (
        1 + data_proper.shape[0] * new_rows_per_old_row,
        len(common_labels) + len(new_headers),
    ),
    dtype=np.object_,
)
purely_columnar[0] = common_labels + new_headers
for rx, row in enumerate(data_proper):
    common_data = []
    ym_data = []
    for header, element in zip(data_header, row):
        if header in common_labels:
            common_data.append(element)
        else:
            year, month = header_to_date[header]
            ym_data.append([year, month, element])

    for yx, year_month_value in enumerate(ym_data):
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, : len(common_labels)
        ] = common_data
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, len(common_labels) :
        ] = year_month_value

print(f"{purely_columnar=}")
purely_columnar=
array([[np.str_('hello'), 'year', 'month', 'value'],
       [np.str_('a'), 2024, 'JAN', np.str_('0')],
       [np.str_('a'), 2024, 'FEB', np.str_('1')]], dtype=object)

Now it is easy enough to create an Arrow array from this re-organized data:

import pyarrow as pa

column_types = [pa.string(), pa.int64(), pa.string(), pa.string()]
pa.table(
    [
        pa.array(purely_columnar[1:, cx], type=column_types[cx])
        for cx in range(purely_columnar.shape[1])
    ],
    names=purely_columnar[0],
)
pyarrow.Table
hello: string
year: int64
month: string
value: string
----
hello: [["a","a"]]
year: [[2024,2024]]
month: [["JAN","FEB"]]
value: [["0","1"]]

But is there anything else I can do to store the data in purely columnar form in Apache Arrow format, apart from first brute-forcing the data into the purely columnar format?


Solution

  • I gave an SQL-based answer to your DuckDB question. If that works for you, then you can use the arrow() method from DuckDB's Python API to get a pyarrow.Table.

    tbl = con.execute("<query>").arrow()
    

    If using only Python, the better way to do this may be to use pandas with the pyarrow engine, since pyarrow itself doesn't provide higher-level analytical APIs (although there are some compute functions that can do low-level manipulations).

    In [1]: _df = pd.read_csv("pivoted-data.csv", engine="pyarrow")
    
    In [2]: _df
    Out[2]: 
      hello  2024 JAN  2024 FEB
    0     a         0         1
    1     b         1         0
    
    In [3]: _df.melt(id_vars=["hello"], value_vars=_df.columns[1:])
    Out[3]: 
      hello  variable  value
    0     a  2024 JAN      0
    1     b  2024 JAN      1
    2     a  2024 FEB      1
    3     b  2024 FEB      0
    
    In [4]: _df = _df.melt(id_vars=["hello"], value_vars=_df.columns[1:])
    
    In [5]:  _df.variable.str.split(" ", expand=True)
    Out[5]: 
          0    1
    0  2024  JAN
    1  2024  JAN
    2  2024  FEB
    3  2024  FEB
    
    In [6]: pd.concat([
        ...:   _df[["hello", "value"]],
        ...:   _df.variable.str.split(" ", expand=True).rename(columns={0: "year", 1: "month"})
        ...: ], axis=1)
    Out[6]: 
      hello  value  year month
    0     a      0  2024   JAN
    1     b      1  2024   JAN
    2     a      1  2024   FEB
    3     b      0  2024   FEB
    
    In [7]: res = pd.concat([
        ...:   _df[["hello", "value"]],
        ...:   _df.variable.str.split(" ", expand=True).rename(columns={0: "year", 1: "month"})
        ...: ], axis=1)
    
    In [8]: pa.Table.from_pandas(res)
    Out[8]: 
    pyarrow.Table
    hello: string
    value: int64
    year: string
    month: string
    ----
    hello: [["a","b","a","b"]]
    value: [[0,1,1,0]]
    year: [["2024","2024","2024","2024"]]
    month: [["JAN","JAN","FEB","FEB"]]
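    For completeness, here is a pure-pyarrow sketch that avoids the intermediate NumPy object array from the question: reuse the same regex-based header parsing, but accumulate one plain Python list per output column in a single pass over the rows, and hand those lists straight to pyarrow. (This is an assumption-laden sketch, not an official pyarrow recipe; the column names and types mirror the question's example.)

    ```python
    import re
    import pyarrow as pa

    # Same shape of input as in the question, with one extra row.
    data_header = ["hello", "2024 JAN", "2024 FEB"]
    data_rows = [["a", "0", "1"], ["b", "1", "0"]]

    date_pattern = re.compile(r"(?P<year>\d+) (?P<month>JAN|FEB)")

    # One Python list per output column; filled in a single pass,
    # with no intermediate 2-D object array.
    hello_col, year_col, month_col, value_col = [], [], [], []
    for row in data_rows:
        label = row[0]
        for header, element in zip(data_header[1:], row[1:]):
            m = date_pattern.match(header)
            hello_col.append(label)
            year_col.append(int(m["year"]))
            month_col.append(m["month"])
            value_col.append(element)

    tbl = pa.table(
        {
            "hello": pa.array(hello_col, type=pa.string()),
            "year": pa.array(year_col, type=pa.int64()),
            "month": pa.array(month_col, type=pa.string()),
            "value": pa.array(value_col, type=pa.string()),
        }
    )
    ```

    This still materializes the melted data once in Python lists, so it is not asymptotically better than the brute-force approach; it just skips the object-dtype detour and lets pyarrow do the typed conversion directly.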