I often need to fetch large quantities of data from Microsoft SQL servers to be manipulated with Polars in Rust, and per enterprise security policy I am more or less forced to use ODBC for these connections. The ODBC requirement prevents me from using mature and featureful libraries like ConnectorX. I am able to connect and efficiently read query results into Arrow RecordBatch objects using arrow_odbc, but I have not been able to convert these RecordBatch objects into Polars DataFrames.
Because the actual data components of a RecordBatch and a Series have the same underlying representation, I thought it would be possible to create a DataFrame from a RecordBatch zero-copy.
However, on the line

columns.push(Series::from_arrow(&schema.fields().get(i).unwrap().name(), *column)?);

I get the error:

mismatched types
expected struct `std::boxed::Box<(dyn polars::export::polars_arrow::array::Array + 'static)>`
   found struct `Arc<dyn arrow::array::Array>`

I was under the impression that an Arc<dyn Array> is an ArrayRef. Is the real problem that I have an Arc<dyn arrow::array::Array> while Series::from_arrow() expects a Polars Box<dyn Array>? If so, how do I resolve that?
My full code is below for reference.
use arrow_odbc::{odbc_api::{Environment, ConnectionOptions}, OdbcReaderBuilder};
use arrow::record_batch::RecordBatch;
use polars::prelude::*;
use anyhow::Result;

const CONNECTION_STRING: &str = "...";

pub fn test() -> Result<()> {
    let odbc_environment = Environment::new()?;
    let connection = odbc_environment.connect_with_connection_string(
        CONNECTION_STRING,
        ConnectionOptions::default(),
    )?;

    let cursor = connection.execute("SELECT * FROM Backcast_Power_Plant_Map", ())?.unwrap();

    let arrow_record_batches = OdbcReaderBuilder::new().build(cursor)?;

    fn record_batch_to_dataframe(batch: &RecordBatch) -> Result<DataFrame, PolarsError> {
        let schema = batch.schema();
        let mut columns = Vec::with_capacity(batch.num_columns());
        for (i, column) in batch.columns().iter().enumerate() {
            // ERROR: mismatched types (see above)
            columns.push(Series::from_arrow(&schema.fields().get(i).unwrap().name(), *column)?);
        }
        Ok(DataFrame::from_iter(columns))
    }

    for batch in arrow_record_batches {
        dbg!(record_batch_to_dataframe(&batch?));
    }

    Ok(())
}
It appears polars and arrow-odbc use different Arrow crates: polars uses polars-arrow, while arrow-odbc uses arrow. The former's array type is Box<dyn polars_arrow::array::Array>, while the latter's is ArrayRef, an alias for Arc<dyn arrow::array::Array>.

Luckily for us, there exists a compatibility layer in the polars-arrow crate. You can convert between the two types (and more) via From impls:
use anyhow::Result;
use arrow::record_batch::RecordBatch;
use arrow_odbc::{
    odbc_api::{ConnectionOptions, Environment},
    OdbcReaderBuilder,
};
use polars::prelude::*;

const CONNECTION_STRING: &str = "...";

pub fn test() -> Result<()> {
    let odbc_environment = Environment::new()?;
    let connection = odbc_environment
        .connect_with_connection_string(CONNECTION_STRING, ConnectionOptions::default())?;

    let cursor = connection
        .execute("SELECT * FROM Backcast_Power_Plant_Map", ())?
        .unwrap();

    let arrow_record_batches = OdbcReaderBuilder::new().build(cursor)?;

    fn record_batch_to_dataframe(batch: &RecordBatch) -> Result<DataFrame, PolarsError> {
        let schema = batch.schema();
        let mut columns = Vec::with_capacity(batch.num_columns());
        for (i, column) in batch.columns().iter().enumerate() {
            // Rewrap the arrow-rs array as a polars-arrow array via the
            // `From` impl provided by polars-arrow's `arrow_rs` feature.
            let arrow = Box::<dyn polars_arrow::array::Array>::from(&**column);
            columns.push(Series::from_arrow(
                &schema.fields().get(i).unwrap().name(),
                arrow,
            )?);
        }
        Ok(DataFrame::from_iter(columns))
    }

    for batch in arrow_record_batches {
        dbg!(record_batch_to_dataframe(&batch?));
    }

    Ok(())
}
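For reference, here is the conversion step in isolation, as a minimal sketch (the to_polars helper name is mine; it assumes polars-arrow's arrow_rs feature is enabled):

use arrow::array::ArrayRef; // alias for Arc<dyn arrow::array::Array>

// With the `arrow_rs` feature, polars-arrow provides a `From` impl that
// rewraps a `&dyn arrow::array::Array` as a `Box<dyn polars_arrow::array::Array>`.
fn to_polars(column: &ArrayRef) -> Box<dyn polars_arrow::array::Array> {
    Box::<dyn polars_arrow::array::Array>::from(column.as_ref())
}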
Note that this requires adding polars-arrow as a dependency with its arrow_rs feature enabled.
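In Cargo.toml, that looks something like the following (the version number is illustrative; pin it to match your polars version):

[dependencies]
polars-arrow = { version = "0.41", features = ["arrow_rs"] }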
From what I can tell, this does not copy the actual data.
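If you want a single DataFrame rather than one per batch, a minimal sketch (using the record_batch_to_dataframe helper above, in place of the dbg! loop) could vstack the batches as they arrive:

// Sketch: accumulate every record batch into one DataFrame.
// `vstack_mut` appends the rows of `batch_df` to the accumulator.
let mut df: Option<DataFrame> = None;
for batch in arrow_record_batches {
    let batch_df = record_batch_to_dataframe(&batch?)?;
    match df.as_mut() {
        Some(acc) => {
            acc.vstack_mut(&batch_df)?;
        }
        None => df = Some(batch_df),
    }
}

Since vstacking keeps the per-batch chunks around, you may want to rechunk the result before heavy downstream work.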