I'm working on an asynchronous FastAPI project that fetches large datasets from an API. Currently, I process the JSON response using a list comprehension and NumPy to extract device IDs and names. For example:
import numpy as np

# `client` is an async HTTP client instance created elsewhere.
response = await client.get(tenant_url, headers=headers, params=params)
response.raise_for_status()
data = response.json()
# Extract each device's ID and name into a 2-D array.
numpy_2d_arrays = np.array([[device["id"]["id"], device["name"]] for device in data["data"]])
While this works for my current small datasets, I anticipate that some API responses could be as large as 100 MB, depending on the customer size. These large responses will be occasional, and my main concern is processing them efficiently within our limited 12 GB cloud memory environment.
I’m considering converting the raw data into a Polars DataFrame before extracting the device ID and name, in hopes of avoiding Python-level loops and improving performance. However, I haven't run any benchmarks comparing my current approach with a DataFrame-based one.
My main questions are:
1. Will converting to a Polars DataFrame offer significant performance improvements over NumPy (or even Pandas) for simply extracting a couple of fields?
2. Are there other recommended strategies or best practices for handling these occasional large responses in an asynchronous FastAPI setup with memory constraints?

I appreciate any advice or performance tips from those with experience processing large datasets in similar environments!
Here's how you can get your data into a Polars DataFrame without any looping.
You can convert from Polars to NumPy as well if you need to; a sketch of that conversion follows the output below.
import polars as pl

# `data` is the parsed JSON from the question, i.e. data = response.json()
(
    pl.DataFrame(data["data"])
    .select(pl.col("id").struct["id"], "name")
)
shape: (3, 2)
┌─────────────────────────────────┬──────────────────┐
│ id ┆ name │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════════════════════════╪══════════════════╡
│ 7c4145c0-e533-11ef-9681-df1aaf… ┆ C4DEE264E540-002 │
│ be7b4b90-b36a-11ed-9188-71dcc7… ┆ C4DEE264E540 │
│ fbcded60-de0e-11ef-9681-df1aaf… ┆ C4DEE264E540-001 │
└─────────────────────────────────┴──────────────────┘
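For the NumPy round trip, Polars' DataFrame.to_numpy() handles it. Here's a minimal, self-contained sketch; the payload below only mimics the nesting of the response in the question, and the ID values are made-up placeholders:

import polars as pl

# Hypothetical payload with the same nesting as the API response in the question.
data = {
    "data": [
        {"id": {"id": "id-001"}, "name": "C4DEE264E540-002"},
        {"id": {"id": "id-002"}, "name": "C4DEE264E540"},
    ]
}

df = pl.DataFrame(data["data"]).select(pl.col("id").struct["id"], "name")

# String columns come back as an object-dtype array of shape (n_rows, 2),
# matching the layout produced by the original list comprehension.
numpy_2d_array = df.to_numpy()

Note that to_numpy() copies the string columns out of the DataFrame, so for the occasional large response you'll briefly hold both the DataFrame and the NumPy array in memory.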