Tags: python, oracle-database, python-oracledb

CLOB attribute of an Oracle record type takes 0.5 seconds per update in Python (both with the oracledb and cx_Oracle libraries)


I'm trying to update the CLOB attribute of a record type created with the script below. I have 40,000 CLOB items, which I load into 40,000 DB_TYPE_OBJECT instances via Python code. Setting each record object's CLOB attribute (the CLOB_CONTENT field) takes about 0.5 seconds, and this is just for creating the DB_TYPE_OBJECT, before any insert into the database. This is really strange, since it should be no different from setting any other attribute of the object. (The CLOBs are small, so size is not the issue.)

Note: this happens only for the CLOB attribute (40,000 items take about 3 to 4 hours). When I try without the CLOB (just REC_ID), it finishes within a few seconds.

FYI: the same code and record types run much faster in my personal workspace (tested against an Oracle DB Docker image). The issue shows up at my workplace, where the database is hosted on AWS. (I don't believe it's a network issue, since similar Java code runs fast.)

Python 3.12, oracledb -> 19.3

create TYPE CLOB_REC_OBJECT as OBJECT(
   REC_ID NUMBER,
   CLOB_CONTENT CLOB
);
import os
import time

import numpy as np
import oracledb
import pandas as pd


dsn = "localhost:1521/ORCLCDB"
username = "SYS"
password = "mypassword1"

try:
    instant_client_dir = os.path.join(os.environ.get("HOME"), "Documents", "OracleDB/instantclient_23_3")
    oracledb.init_oracle_client(lib_dir=instant_client_dir)

    num_objects = 40000
    data = {
        'long_string_column': [f"This is a very long string example for row {i}. It can contain various characters and be quite lengthy to simulate real-world data scenarios." * 5 for i in range(num_objects)],
        'numeric_column': np.random.randint(1000, 100000, size=num_objects)
    }

    df = pd.DataFrame(data)

    conn = oracledb.connect(user=username, password=password, dsn=dsn, mode=oracledb.SYSDBA)
    print("Connected to Oracle Database (CDB root) in thick mode successfully!")

    cursor = conn.cursor()
    my_objects_list = []
    rec_type = conn.gettype("CLOB_REC_OBJECT")

    rec_list = [rec_type.newobject() for _ in range(num_objects)]

    start_time = time.time()

    for row in df.itertuples():
        new_obj = rec_list[row.Index]
        setattr(new_obj, 'REC_ID', int(row.numeric_column))
        setattr(new_obj, 'CLOB_CONTENT', row.long_string_column)
        my_objects_list.append(new_obj)

    end_time = time.time()

    print(f"Time taken to create objects and set attributes with setattr(): {end_time - start_time:.4f} seconds")
    print(f"Number of objects in the list: {len(my_objects_list)}")

except oracledb.Error as e:
    error_obj, = e.args
    print(f"Error connecting to Oracle Database: {error_obj.message}")
finally:
    if 'cursor' in locals() and cursor:
        cursor.close()
    if 'conn' in locals() and conn:
        conn.close()

I have provided the latest code I tried. I would like to know why updating a CLOB attribute with Python and oracledb takes so much time, and what checks I can perform to find the root cause and fix this issue. Even with 40,000 items, it should finish in around 16 seconds, just as it does on my personal system.

Also, is there an alternative, faster way to do this? (My first preference is still to find the root cause and fix it.)


Solution

  • This is a common issue when migrating databases to cloud providers (Azure, AWS, etc.). The network latency from on-prem to cloud is steep - sometimes on the order of 50ms or more - due to geographical distribution (light can only travel so fast; coast to coast over fiber optics probably requires 20-30ms at a minimum), multiple layers of firewalls and isolated switched networks, and little gremlins like deep packet inspection. This is far higher latency than on most properly built LANs in a traditional on-prem network, where latencies closer to 1ms or even faster are typical.

    The LOB datatypes (CLOB, BLOB) require special processing by the client driver - no matter how many rows you are able to fetch per FETCH call to the database, if there is a LOB present the fetch pulls back just a locator (not the data value itself). The driver must then loop through all the rows and for each row must issue LOB read requests using the locator for that row, as many times as needed to read in the entire LOB (which is why LOB sizes are basically unlimited). But even for small LOBs, that still requires an internal LOB read call from the client to the database for every row of data. That's where that high network latency is going to hurt - as it is then incurred for every row, instead of only once per FETCH call. It's faster when you run a local database because you've removed the network latency.
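    As a first check for this root cause, measure the round-trip time from your Python host to each database. The helper below is an illustrative sketch; pass it a cursor from your own oracledb connection.

```python
import time

def avg_round_trip_ms(cursor, trips=50):
    """Average milliseconds for a single-row query, i.e. one network round trip."""
    cursor.execute("select 1 from dual")  # warm-up so parsing isn't timed
    start = time.perf_counter()
    for _ in range(trips):
        cursor.execute("select 1 from dual")
        cursor.fetchall()
    return (time.perf_counter() - start) / trips * 1000.0
```

    If this reports tens of milliseconds against the AWS-hosted database but around a millisecond against the local Docker instance, per-row LOB calls multiplied by that latency go a long way toward explaining the slowdown.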

    The solution is to avoid LOB datatypes whenever possible if you're going to be dealing with a high-latency network. Here are some approaches:

    1. If the values it holds are always less than 4KB, then either change the datatype to a varchar2(4000), or if you can't do that, then use SQL (directly or embedded in a view) to cast the LOB: SELECT CAST(mylob AS varchar2(4000)) mylob, col2, col3, ....

    2. If some values are >4KB but most are not, then you can do two separate pulls: cast to varchar2 where the length is <= 4000, and skip the cast (in the second pull) where the length is > 4000. That reduces the number of rows where expensive LOB processing is needed.

    3. If there are too many values > 4000 but the data you really want is in the first 4000, consider truncating the value as you pull it. SELECT CAST(SUBSTR(mylob,1,4000) AS varchar2(4000)) mylob, col2, col3, ....

    4. If even that isn't helpful because most values are > 4000 and you need the whole thing, then look for programming design ways of reducing the traffic (convert your logic to incremental pulls rather than full extracts, etc...)
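    Approaches 1-3 above can be sketched in SQL; the table and column names (MY_TABLE, CLOB_CONTENT, REC_ID) are hypothetical. Note that DBMS_LOB.GETLENGTH counts characters while the VARCHAR2 limit is typically bytes, so lower the threshold for multibyte character sets.

```sql
-- Pull 1: rows whose CLOB fits in a VARCHAR2(4000) arrive inline with the
-- fetched rows, so no per-row LOB read calls are needed.
select rec_id,
       cast(clob_content as varchar2(4000)) as clob_content
  from my_table
 where dbms_lob.getlength(clob_content) <= 4000;

-- Pull 2: only the remaining rows pay the per-row LOB round-trip cost.
select rec_id, clob_content
  from my_table
 where dbms_lob.getlength(clob_content) > 4000;

-- Or, if the first 4000 characters are enough, truncate in a single pull:
select rec_id,
       cast(substr(clob_content, 1, 4000) as varchar2(4000)) as clob_content
  from my_table;
```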

    If you are updating the CLOB rather than selecting from it, you face the same problem. The workarounds however are more limited, as you would pretty much need to change the datatype on the table to a varchar2(4000).
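    If you control the write path, one way to reduce the per-row cost (a sketch, assuming a hypothetical table MY_TABLE(REC_ID NUMBER, CLOB_CONTENT CLOB)) is to bind Python strings to the CLOB column directly and batch the statements with executemany(), so the network latency is paid once per batch rather than once per row:

```python
# Batching helper: each chunk becomes a single executemany() call, i.e. one
# batched round trip to the database instead of one per row.
def batches(rows, size=1000):
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# With a live oracledb connection this would look like (not executed here):
#
#   cursor.setinputsizes(oracledb.DB_TYPE_CLOB, None)  # :1 is the CLOB bind
#   for chunk in batches(bind_rows):
#       cursor.executemany(
#           "update my_table set clob_content = :1 where rec_id = :2",
#           chunk,
#       )
#   conn.commit()
```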

    Lastly, if the data in this field is structured data (multiple meaningful elements), then use proper normalization rules (Oracle is after all still a relational database, even if more modern models like nested objects, XML, JSON, etc. are supported) to break those elements out into distinct, properly-typed scalar columns and child tables (if necessary) rather than use a CLOB, which is intended for unstructured data that has no programmatic meaning.
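    As a sketch of that normalization, a CLOB that actually holds, say, a name, a status, and free-form notes (hypothetical elements) would become:

```sql
-- Hypothetical replacement for the CLOB_CONTENT field: each meaningful
-- element gets its own properly typed scalar column.
create table clob_rec (
    rec_id  number primary key,
    name    varchar2(200),
    status  varchar2(30),
    notes   varchar2(4000)
);
```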