python json serialization orjson

Serializing a complex Python class to JSON


In my project I analyze the questions of a given exam. Let's say each exam has 10 questions.

For each question I compute several statistics and store them in the constructor of class QuestionData (defined in file question_data.py). Each QuestionData object has a pandas DataFrame, some dicts, some float attributes and a numpy array.

Next, the exam analysis is done using class ExamData, which also has some simple attributes, some dicts and a dict holding all the QuestionData objects.

Eventually, what I need to do is to return the ExamData object as JSON so it can be sent back as a response.


I'm working with conda and Python 3.12.4. I thought it sensible to start by serializing a single QuestionData object. I tried the __dict__ trick explained here, but it failed with:

AttributeError: 'weakref.ReferenceType' object has no attribute '__dict__'. Did you mean: '__dir__'?

Then I tried installing orjson with conda install orjson, but conda fails with an SSL error:

>conda install orjson
Collecting package metadata (current_repodata.json): failed

CondaSSLError: OpenSSL appears to be unavailable on this machine. OpenSSL is required to
download and install packages.

Exception: HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/main/win-64/current_repodata.json (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))

This happened after I let conda update openssl from 3.0.14-h827c3e9_0 --> 3.0.15-h827c3e9_0, which the installation required.


  1. Is there any way of serializing such complex objects without writing my own serializer?
  2. If so, which package is recommended? Am I missing something with orjson?
  3. If a manual serializer is the only solution, how do I write it?

I have plenty of experience with various programming languages, with OOP and with JSON, but I'm new to Python, so please go easy on me.


code:

question_data.py:

import pandas as pd
import numpy as np
import scipy.stats as sps
import string

class QuestionData:
    def __init__(self, data, item: str):
        options_list = ...
        #df for answer analysis
        self._options_data = pd.DataFrame(index = options_list) 
        #percent chosen column
        self._options_data["pct"] = ...
        #mean ability for chosen answer
        self._options_data["theta_mean"] = ...
        #ability sd for chosen answer
        self._options_data["theta_sd"] = ...
        #corr of chosen answer with ability
        self._options_data["theta_corr"] = ...
        
        #item delta
        self._delta = ...
        
        #biserial of key with theta
        self._key_biserial = ...
        #initial IRT params. To be done later
        self._IRT_params = {"a": 1, "b": 0, "c": 0}
        self._IRT_info = {"theta_MI": 0, "info_theta_MI": 0}
        
        #response times vector
        self._response_time = data._response_times[str(item)].to_numpy()

exam_data.py:

from question_data import QuestionData
from datetime import datetime
from dateutil import relativedelta

class ExamData:
    _quantile_list = [5, 25, 50, 75, 95]
    _date_format = '%d/%m/%Y'
    def __init__(self, data):
        fromDate = datetime.strptime(data._details["fromDate"], self._date_format)
        toDate = datetime.strptime(data._details["toDate"], self._date_format)
        delta = relativedelta.relativedelta(toDate, fromDate)
        self._report_duration = {"years": delta.years, "months": delta.months, "days": delta.days}
        self._exposure_num = ...
        self._total_times = data._response_times.sum(axis = 1)
        self._time_quantiles = dict(zip(self._quantile_list,
                                         [self._total_times.quantile(q/100) for q in self._quantile_list]))
        self._q_list = ...
        self._q_data = dict(zip(self._q_list, 
                                     [QuestionData(data, q) for q in self._q_list]))

Examples of the output I want:

QuestionData:

{
    "_options_data": {"pct": {...}, "theta_mean": {...}, ...}, //<pandas df serialization>
    "_delta": 10,
    "_IRT_info": {"theta_MI": 0, "info_theta_MI": 0},
    "_response_time": [25.5, 41.6, 30.9, ...],

    ...
}

ExamData:

{
    "_report_duration": {"years": 0, "months": 0, "days": 17}, 
    "_exposure_num": 150,
    "_time_quantiles": {"5": 117.89, "25": 167.15, "50": 224.1, ...},
    "_total_times": {"id1": 120.3, "id2": 149.9, ...}, //<pandas series serialization> 
    "_q_data": {"Q1": <QuestionData Object>, "Q2": <QuestionData Object>, ...},
    ...
}

Solution

  • In the end, the simplest solution was to write my own serializer, a simple extension of the one in this post.

    import json
    import numpy as np
    import pandas as pd
    from question_data import QuestionData
    from exam_data import ExamData
    
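    # JSON encoder that also handles numpy and pandas objects
    class CustomTypeEncoder(json.JSONEncoder):
        def default(self, obj):
            # numpy scalars (np.float64, np.int64, ...) -> native Python scalars
            if isinstance(obj, np.generic):
                return obj.item()
            # arrays and Series -> plain lists (a Series loses its index here)
            if isinstance(obj, (np.ndarray, pd.Series)):
                return obj.tolist()
            # DataFrame -> {row label: {column: value}}
            if isinstance(obj, pd.DataFrame):
                return obj.T.to_dict()
            # our own classes -> their attribute dict, which json then encodes recursively
            if isinstance(obj, (QuestionData, ExamData)):
                return obj.__dict__
            # fallback for any other object exposing a pandas-style to_json()
            if hasattr(obj, 'to_json'):
                return obj.to_json(orient='records')
            # anything else: let the base class raise TypeError
            return super().default(obj)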
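
A note on why returning obj.__dict__ is enough for the nested case: json only calls default() for objects it cannot serialize itself, and whatever default() returns is encoded again, so nested objects pass through the same hook. Below is a minimal stdlib-only sketch of that mechanism. The classes Inner, Outer and SketchEncoder are hypothetical stand-ins (not the real QuestionData/ExamData), and a set plays the role of the non-serializable numpy/pandas attribute.

```python
import json


class Inner:
    """Hypothetical stand-in for QuestionData."""
    def __init__(self, tags):
        self._tags = set(tags)  # a set is not JSON-serializable by default
        self._delta = 10


class Outer:
    """Hypothetical stand-in for ExamData: holds a dict of Inner objects."""
    def __init__(self):
        self._exposure_num = 150
        self._q_data = {"Q1": Inner(["a"])}


class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects json cannot handle natively.
        if isinstance(obj, set):
            return sorted(obj)
        if isinstance(obj, (Inner, Outer)):
            # The returned dict is encoded again, so the Inner objects
            # (and their sets) reach this method in turn.
            return vars(obj)
        # Unknown types: the base class raises TypeError.
        return super().default(obj)


json_str = json.dumps(Outer(), cls=SketchEncoder)
decoded = json.loads(json_str)
```

Here decoded contains the Outer attributes with the Inner object expanded under "_q_data"; an object of a type the encoder does not know still raises TypeError.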
    

    Then, when needed, CustomTypeEncoder is used as follows:

    import json
    from question_data import QuestionData
    from exam_data import ExamData
    # CustomTypeEncoder (defined above) must also be in scope here
    
    
    data = ...
    ed = ExamData(data)
    q1d = ed._q_data["q1"]  # QuestionData object
    
    json_str1 = json.dumps(ed, cls=CustomTypeEncoder)   # this works perfectly
    json_str2 = json.dumps(q1d, cls=CustomTypeEncoder)  # this too
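
One caveat with the encoder above: tolist() turns a pandas Series into a plain list, so the index labels are lost, and obj.T.to_dict() keys a DataFrame by row label first. The output wanted in the question has "_total_times" keyed by id and "_options_data" keyed by column. A possible variant that keeps the labels is sketched below. It assumes the index labels are strings (JSON object keys cannot be numpy integers), and LabelKeepingEncoder and the toy data are hypothetical names and values used only for illustration.

```python
import json

import numpy as np
import pandas as pd


class LabelKeepingEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.generic):
            return obj.item()       # numpy scalar -> Python scalar
        if isinstance(obj, np.ndarray):
            return obj.tolist()     # array -> list
        if isinstance(obj, pd.Series):
            return obj.to_dict()    # {index label: value}
        if isinstance(obj, pd.DataFrame):
            return obj.to_dict()    # {column: {index label: value}}
        return super().default(obj)


# Toy data shaped like the attributes in the question (values are made up).
total_times = pd.Series([120.3, 149.9], index=["id1", "id2"])
options = pd.DataFrame({"pct": [0.25, 0.75]}, index=["A", "B"])
payload = {
    "_total_times": total_times,
    "_options_data": options,
    "_response_time": np.array([25.5, 41.6]),
}

json_str = json.dumps(payload, cls=LabelKeepingEncoder)
decoded = json.loads(json_str)
```

To use this with the real classes, the QuestionData/ExamData branch from CustomTypeEncoder would have to be added back in the same way.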