Tags: python, json, pandas, sqlalchemy, python-polars

Pandas / Polars: Write list of JSONs to Database fails with `ndarray is not JSON serializable`


I have multiple JSON columns which I concatenate into a single list column of JSON strings. The DataFrame looks like this:

┌─────────────────────────────────┐
│ json_concat                     │
│ ---                             │
│ list[str]                       │
╞═════════════════════════════════╡
│ ["{"integer_col":52,"string_co… │
│ ["{"integer_col":93,"string_co… │
│ ["{"integer_col":15,"string_co… │
│ ["{"integer_col":72,"string_co… │
│ ["{"integer_col":61,"string_co… │
│ ["{"integer_col":21,"string_co… │
│ ["{"integer_col":83,"string_co… │
│ ["{"integer_col":87,"string_co… │
│ ["{"integer_col":75,"string_co… │
│ ["{"integer_col":75,"string_co… │
└─────────────────────────────────┘

Here is the output of `df.glimpse()`:

Rows: 10
Columns: 1
$ json_concat <list[str]> ['{"integer_col":52,"string_col":"v"}', '{"float_col":86.61761457749351,"bool_col":true}', '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'], ['{"integer_col":93,"string_col":"l"}', '{"float_col":60.11150117432088,"bool_col":false}', '{"datetime_col":"2021-01-02 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":15,"string_col":"y"}', '{"float_col":70.80725777960456,"bool_col":false}', '{"datetime_col":"2021-01-03 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":72,"string_col":"q"}', '{"float_col":2.0584494295802447,"bool_col":true}', '{"datetime_col":"2021-01-04 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":61,"string_col":"j"}', '{"float_col":96.99098521619943,"bool_col":true}', '{"datetime_col":"2021-01-05 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":21,"string_col":"p"}', '{"float_col":83.24426408004217,"bool_col":true}', '{"datetime_col":"2021-01-06 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":83,"string_col":"o"}', '{"float_col":21.233911067827616,"bool_col":true}', '{"datetime_col":"2021-01-07 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":87,"string_col":"o"}', '{"float_col":18.182496720710063,"bool_col":true}', '{"datetime_col":"2021-01-08 00:00:00","categorical_col":"Category2"}'], ['{"integer_col":75,"string_col":"s"}', '{"float_col":18.34045098534338,"bool_col":true}', '{"datetime_col":"2021-01-09 00:00:00","categorical_col":"Category1"}'], ['{"integer_col":75,"string_col":"l"}', '{"float_col":30.42422429595377,"bool_col":true}', '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}']

I want to write the JSON column to a table called `testing`. I tried both `pd.DataFrame.to_sql()` and `pl.DataFrame.write_database()`; both fail with a similar error.

Error

The essential part is this: `sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable`

File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
sqlalchemy.exc.StatementError: (builtins.TypeError) Object of type ndarray is not JSON serializable
[SQL: INSERT INTO some_schema.testing (json_concat) VALUES (%(json_concat)s)]
[parameters: [{'json_concat': array(['{"integer_col":52,"string_col":"v"}',
       '{"float_col":86.61761457749351,"bool_col":true}',
       '{"datetime_col":"2021-01-01 00:00:00","categorical_col":"Category3"}'],
      dtype=object)}, 
      # ... abbreviated
      dtype=object)}, {'json_concat': array(['{"integer_col":75,"string_col":"l"}',
       '{"float_col":30.42422429595377,"bool_col":true}',
       '{"datetime_col":"2021-01-10 00:00:00","categorical_col":"Category2"}'],
      dtype=object)}]]
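
The failure is reproducible without a database. Crossing into pandas turns each list[str] cell into a numpy object array (the parameters dump in the traceback shows exactly that), and the stdlib json encoder, which SQLAlchemy's JSON types use by default, has no handler for ndarray. A minimal sketch, assuming the df built in the MRE below:

import json

# Each list[str] cell becomes a numpy object array on conversion.
cell = df.to_pandas()["json_concat"].iloc[0]
print(type(cell))  # <class 'numpy.ndarray'>

# json.dumps has no encoder for ndarray, hence the StatementError.
try:
    json.dumps({"json_concat": cell})
except TypeError as exc:
    print(exc)  # Object of type ndarray is not JSON serializable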

Code that produces the Error

(shown here for pandas; the Polars call fails the same way)

df_pandas.to_sql(
    "testing",
    con=engines.engine,
    schema=schema,
    index=False,
    if_exists="append",
    dtype=DTYPE,
)

Question

How do I need to prepare the concatenated JSON column for it to be JSON serializable?

MRE (Create Example Data)

from typing import Any
import numpy as np
import pandas as pd
import polars as pl
from myengines import engines
from sqlalchemy import dialects

schema = "some_schema"
# Seed for reproducibility
np.random.seed(42)

n = 10

# Generate random data
integer_col = np.random.randint(1, 100, n)
float_col = np.random.random(n) * 100
string_col = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), n)
bool_col = np.random.choice([True, False], n)
datetime_col = pd.date_range(start="2021-01-01", periods=n, freq="D")
categorical_col = np.random.choice(["Category1", "Category2", "Category3"], n)

# Creating the DataFrame
df = pl.DataFrame(
    {
        "integer_col": integer_col,
        "float_col": float_col,
        "string_col": string_col,
        "bool_col": bool_col,
        "datetime_col": datetime_col,
        "categorical_col": categorical_col,
    }
)



df = df.select(
    pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
    pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
    pl.struct(pl.col("datetime_col", "categorical_col"))
    .struct.json_encode()
    .alias("json3"),
).select(pl.concat_list(pl.col(["json1", "json2", "json3"])).alias("json_concat"))


DTYPE: dict[str, Any] = {"json_concat": dialects.postgresql.JSONB}
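
For the to_sql call shown earlier, the pandas frame is obtained from the Polars one (a bridging line the original snippets leave implicit):

# Converting to pandas is where each list[str] cell becomes a
# numpy object array.
df_pandas = df.to_pandas()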

Solution

  • There is unfortunately no polars function to serialize a list to a JSON array. Here's how you can do it manually:

    df = df.select(
        pl.struct(pl.col("integer_col", "string_col")).struct.json_encode().alias("json1"),
        pl.struct(pl.col("float_col", "bool_col")).struct.json_encode().alias("json2"),
        pl.struct(pl.col("datetime_col", "categorical_col")).struct.json_encode().alias("json3"),
    ).select(
        pl.format("[{}]", pl.concat_list(pl.col(["json1", "json2", "json3"])).list.join(",")).alias("json_concat"),
    )
    
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres", echo=True)
    df.write_database(
        "testing",
        connection=engine,
        if_table_exists="append",
    )
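
    As a sanity check (not part of the original answer), every row of the new column should parse back into a JSON array of three objects:

    import json

    # Each row is now a single string holding a JSON array.
    parsed = [json.loads(row) for row in df["json_concat"]]
    assert all(len(p) == 3 for p in parsed)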
    

    Also, in expressions, strings are read as column names, so pl.col isn't required. Here's the cleaned-up code:

    df = df.select(
        pl.struct("integer_col", "string_col").struct.json_encode().alias("json1"),
        pl.struct("float_col", "bool_col").struct.json_encode().alias("json2"),
        pl.struct("datetime_col", "categorical_col").struct.json_encode().alias("json3"),
    ).select(
        pl.format("[{}]", pl.concat_list("json1", "json2", "json3").list.join(",")).alias("json_concat"),
    )
    

    Alternatively, the concatenation can be written in one format expression:

    df = df.select(
        pl.format(
            "[{}, {}, {}]",
            pl.struct("integer_col", "string_col").struct.json_encode(),
            pl.struct("float_col", "bool_col").struct.json_encode(),
            pl.struct("datetime_col", "categorical_col").struct.json_encode(),
        ).alias("json_concat"),
    )
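
    If the column should land as JSONB objects rather than text (the DTYPE in the MRE points that way), one option is to hand to_sql plain Python lists of dicts, which SQLAlchemy's JSONB type can serialize. This is a sketch under that assumption, not part of the original answer:

    import json

    # Start again from the MRE frame whose json_concat is still list[str].
    df_pandas = df.to_pandas()
    # Parse every JSON string so each cell is a plain list of dicts;
    # the default json serializer inside the JSONB type handles those.
    df_pandas["json_concat"] = df_pandas["json_concat"].map(
        lambda arr: [json.loads(s) for s in arr]
    )
    df_pandas.to_sql(
        "testing",
        con=engine,
        schema=schema,
        index=False,
        if_exists="append",
        dtype=DTYPE,
    )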