I have multiple json columns which I concat to an array of json columns. The DataFarme looks like this
┌─────────────────────────────────┐
│ json_concat │
│ --- │
│ list[str] │
╞═════════════════════════════════╡
│ ["{"integer_col":52,"string_co… │
│ ["{"integer_col":93,"string_co… │
│ ["{"integer_col":15,"string_co… │
│ ["{"integer_col":72,"string_co… │
│ ["{"integer_col":61,"string_co… │
│ ["{"integer_col":21,"string_co… │
│ ["{"integer_col":83,"string_co… │
│ ["{"integer_col":87,"string_co… │
│ ["{"integer_col":75,"string_co… │
│ ["{"integer_col":75,"string_co… │
└─────────────────────────────────┘
Here is the output of polars glimpse
Rows: 10
Columns: 1
$ json_concat <list[str]> ['{"integer_col":52,"string_col":"v"}', '{"float_col":86.61761457749351,"bool_col":true}', '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'], ['{"integer_col":93,"string_col":"l"}', '{"float_col":60.11150117432088,"bool_col":false}', '{"datetime_col":"2021-01-02 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":15,"string_col":"y"}', '{"float_col":70.80725777960456,"bool_col":false}', '{"datetime_col":"2021-01-03 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":72,"string_col":"q"}', '{"float_col":2.0584494295802447,"bool_col":true}', '{"datetime_col":"2021-01-04 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":61,"string_col":"j"}', '{"float_col":96.99098521619943,"bool_col":true}', '{"datetime_col":"2021-01-05 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":21,"string_col":"p"}', '{"float_col":83.24426408004217,"bool_col":true}', '{"datetime_col":"2021-01-06 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":83,"string_col":"o"}', '{"float_col":21.233911067827616,"bool_col":true}', '{"datetime_col":"2021-01-07 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":87,"string_col":"o"}', '{"float_col":18.182496720710063,"bool_col":true}', '{"datetime_col":"2021-01-08 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":75,"string_col":"s"}', '{"float_col":18.34045098534338,"bool_col":true}', '{"datetime_col":"2021-01-09 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":75,"string_col":"l"}', '{"float_col":30.42422429595377,"bool_col":true}', '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}']
I want to write the json column to a table called testing
. I tried both pd.DataFrame.to_sql()
as well as pl.DataFrame.write_database()
both failing with a similary error
The essential part is this sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
File "/usr/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
[SQL: INSERT INTO some_schema.testing (json_concat) VALUES (%(json_concat)s)]
[parameters: [{'json_concat': array(['{"integer_col":52,"string_col":"v"}',
'{"float_col":86.61761457749351,"bool_col":true}',
'{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'],
dtype=object)},
# ... abbreviated
dtype=object)}, {'json_concat': array(['{"integer_col":75,"string_col":"l"}',
'{"float_col":30.42422429595377,"bool_col":true}',
'{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}'],
dtype=object)}]]
(exemplary for pandas)
df_pandas.to_sql(
"testing",
con=engines.engine,
schema=schema,
index=False,
if_exists="append",
dtype=DTYPE,
)
How do I need to prepare the concated json column for it to be json serializable?
from typing import Any
import numpy as np
import pandas as pd
import polars as pl
from myengines import engines
from sqlalchemy import dialects, text
schema = "some_schema"
# Seed for reproducibility
np.random.seed(42)
n = 10
# Generate random data
integer_col = np.random.randint(1, 100, n)
float_col = np.random.random(n) * 100
string_col = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), n)
bool_col = np.random.choice([True, False], n)
datetime_col = pd.date_range(start="2021-01-01", periods=n, freq="D")
categorical_col = np.random.choice(["Category1", "Category2", "Category3"], n)
# Creating the DataFrame
df = pl.DataFrame(
{
"integer_col": integer_col,
"float_col": float_col,
"string_col": string_col,
"bool_col": bool_col,
"datetime_col": datetime_col,
"categorical_col": categorical_col,
}
)
df = df.select(
pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
pl.struct(pl.col("datetime_col", "categorical_col"))
.struct.json_encode()
.alias("json3"),
).select(pl.concat_list(pl.col(["json1", "json2", "json3"])).alias("json_concat"))
DTYPE: dict[str, Any] = {"json_concat": dialects.postgresql.JSONB}
There is unfortunately no polars function to serialize a list to a JSON array. Here's how you can do it manually:
df = df.select(
pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
pl.struct(pl.col("datetime_col", "categorical_col")).struct.json_encode().alias("json3"),
).select(
pl.format("[{}]", pl.concat_list(pl.col(["json1", "json2", "json3"])).list.join(",")).alias("json_concat"),
)
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres", echo=True)
df.write_database(
"testing",
connection=engine,
if_table_exists="append",
)
Also, in expressions, strings are read as column names, so pl.col
isn't required. Here's the cleaned-up code:
df = df.select(
pl.struct("integer_col", "string_col").struct.json_encode().alias("json1"),
pl.struct("float_col", "bool_col").struct.json_encode().alias("json2"),
pl.struct("datetime_col", "categorical_col").struct.json_encode().alias("json3"),
).select(
pl.format("[{}]", pl.concat_list("json1", "json2", "json3").list.join(",")).alias("json_concat"),
)
Alternatively, the concatenation can be written in one format expression:
df = df.select(
pl.format(
"[{}, {}, {}]",
pl.struct("integer_col", "string_col").struct.json_encode(),
pl.struct("float_col", "bool_col").struct.json_encode(),
pl.struct("datetime_col", "categorical_col").struct.json_encode(),
).alias("json_concat"),
)