I previously asked a question similar to this one (in that it uses the same example to illustrate the problem): Efficiently storing data that is not-quite-columnar-yet into a DuckDB database.
However, DuckDB is not Apache Arrow. In particular, one cannot insert data into an Apache Arrow table or array using SQL queries, except indirectly.
Arrow arrays are constructed directly, much like NumPy arrays. My best guess, based on what I understand from the documentation so far, is that I likely want to use Arrow Tensors in some way. I am far from certain, hence this question.
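For concreteness, direct construction looks something like this (a minimal sketch of my understanding; pa.array infers a type from a Python sequence much as np.array infers a dtype, or takes an explicit type):
import pyarrow as pa

# Type inferred from the values, much like np.array:
names = pa.array(["a", "b", "c"])              # -> Arrow string array
# Explicit type, analogous to passing dtype= in NumPy:
values = pa.array([0, 1, 2], type=pa.int64())  # -> Arrow int64 array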
I have some partially-columnar data like this:
"hello", "2024 JAN", "2024 FEB"
"a", 0, 1
If it were purely-columnar, it would look like:
"hello", "year", "month", "value"
"a", 2024, "JAN", 0
"a", 2024, "FEB", 1
Suppose the data is in the form of a NumPy array, like this:
import numpy as np
data = np.array(
    [["hello", "2024 JAN", "2024 FEB"], ["a", "0", "1"]], dtype="<U"
)
array([['hello', '2024 JAN', '2024 FEB'],
       ['a', '0', '1']], dtype='<U8')
How could I go about efficiently storing this data in the Apache Arrow format, in a purely columnar layout?
The naive/easy way would be to brute-force transform the data into a columnar format in Python before storing it as an Apache Arrow array. Here I mean specifically:
import re

data_header = data[0]
data_proper = data[1:]

date_pattern = re.compile(r"(?P<year>[\d]+) (?P<month>JAN|FEB)")

common_labels: list[str] = []
known_years: set[int] = set()
known_months: set[str] = set()
header_to_date: dict[str, tuple[int, str]] = dict()
for header in data_header:
    if matches := date_pattern.match(header):
        year, month = int(matches["year"]), str(matches["month"])
        known_years.add(year)
        known_months.add(month)
        header_to_date[header] = (year, month)
    else:
        common_labels.append(header)

# Each old row expands into one new row per (year, month) pair:
# hello, year, month, value
new_rows_per_old_row = len(known_years) * len(known_months)
new_headers = ["year", "month", "value"]
purely_columnar = np.empty(
    (
        1 + data_proper.shape[0] * new_rows_per_old_row,
        len(common_labels) + len(new_headers),
    ),
    dtype=np.object_,
)
purely_columnar[0] = common_labels + new_headers
for rx, row in enumerate(data_proper):
    common_data = []
    ym_data = []
    for header, element in zip(data_header, row):
        if header in common_labels:
            common_data.append(element)
        else:
            year, month = header_to_date[header]
            ym_data.append([year, month, element])
    for yx, year_month_value in enumerate(ym_data):
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, : len(common_labels)
        ] = common_data
        purely_columnar[
            1 + rx * new_rows_per_old_row + yx, len(common_labels) :
        ] = year_month_value

print(f"{purely_columnar=}")
purely_columnar=array([[np.str_('hello'), 'year', 'month', 'value'],
       [np.str_('a'), 2024, 'JAN', np.str_('0')],
       [np.str_('a'), 2024, 'FEB', np.str_('1')]], dtype=object)
Now it is easy enough to create an Arrow table from this re-organized data:
import pyarrow as pa

column_types = [pa.string(), pa.int64(), pa.string(), pa.string()]
pa.table(
    [
        pa.array(purely_columnar[1:, cx], type=column_types[cx])
        for cx in range(purely_columnar.shape[1])
    ],
    names=purely_columnar[0],
)
pyarrow.Table
hello: string
year: int64
month: string
value: string
----
hello: [["a","a"]]
year: [[2024,2024]]
month: [["JAN","FEB"]]
value: [["0","1"]]
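(Equivalently, pa.table also accepts a mapping from column names to arrays, which avoids passing names separately; a minor variation on the above:)
pa.table(
    {
        str(name): pa.array(purely_columnar[1:, cx], type=column_types[cx])
        for cx, name in enumerate(purely_columnar[0])
    }
)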
But is there anything else I can do to store the data in purely columnar form in the Apache Arrow format, apart from first brute-forcing it into a purely columnar layout?
I gave an SQL-based answer to your DuckDB question. If that works for you, then you can use the arrow() function from DuckDB's Python API to get a pyarrow.Table:
tbl = con.execute("<query>").arrow()
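For illustration, a minimal sketch of that route (assuming the sample data lives in pivoted-data.csv and a DuckDB version with the simplified UNPIVOT syntax; the query mirrors the idea from that answer rather than quoting it exactly):
import duckdb

con = duckdb.connect()
tbl = con.execute("""
    WITH unpivoted AS (
        -- turn each '2024 JAN'-style column into (header, value) rows
        UNPIVOT (SELECT * FROM read_csv_auto('pivoted-data.csv'))
        ON COLUMNS(* EXCLUDE (hello))
        INTO NAME header VALUE value
    )
    SELECT
        hello,
        CAST(str_split(header, ' ')[1] AS INTEGER) AS year,  -- DuckDB lists are 1-indexed
        str_split(header, ' ')[2] AS month,
        value
    FROM unpivoted
""").arrow()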
If using only Python, maybe the better way to do this would be to use pandas with the pyarrow engine, since I don't think pyarrow provides higher-level analytical APIs (although there are some compute functions that can do low-level manipulations). The session below assumes import pandas as pd and import pyarrow as pa.
In [1]: _df = pd.read_csv("pivoted-data.csv", engine="pyarrow")
In [2]: _df
Out[2]:
  hello  2024 JAN  2024 FEB
0     a         0         1
1     b         1         0
In [3]: _df.melt(id_vars=["hello"], value_vars=_df.columns[1:])
Out[3]:
  hello  variable  value
0     a  2024 JAN      0
1     b  2024 JAN      1
2     a  2024 FEB      1
3     b  2024 FEB      0
In [4]: _df = _df.melt(id_vars=["hello"], value_vars=_df.columns[1:])
In [5]: _df.variable.str.split(" ", expand=True)
Out[5]:
      0    1
0  2024  JAN
1  2024  JAN
2  2024  FEB
3  2024  FEB
In [6]: pd.concat([
   ...:     _df[["hello", "value"]],
   ...:     _df.variable.str.split(" ", expand=True).rename(columns={0: "year", 1: "month"})
   ...: ], axis=1)
Out[6]:
  hello  value  year month
0     a      0  2024   JAN
1     b      1  2024   JAN
2     a      1  2024   FEB
3     b      0  2024   FEB
In [7]: res = pd.concat([
   ...:     _df[["hello", "value"]],
   ...:     _df.variable.str.split(" ", expand=True).rename(columns={0: "year", 1: "month"})
   ...: ], axis=1)
In [8]: pa.Table.from_pandas(res)
Out[8]:
pyarrow.Table
hello: string
value: int64
year: string
month: string
----
hello: [["a","b","a","b"]]
value: [[0,1,1,0]]
year: [["2024","2024","2024","2024"]]
month: [["JAN","JAN","FEB","FEB"]]
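One caveat: year comes out as a string here, whereas your hand-built table has it as int64. If you want the types to match, a cast before converting should do it (preserve_index=False is optional; it just skips storing the pandas index):
res["year"] = res["year"].astype("int64")  # extra step, not in the session above
tbl = pa.Table.from_pandas(res, preserve_index=False)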