I have a spaCy function, nlp(<string>).vector, that I need to apply to a string column in a DataFrame. The call takes about 13 milliseconds on average and returns an ndarray of 300 Float64 values, which I need to expand into their own columns. This is the sketchy way I've done it:
import spacy
import polars as pl
nlp = spacy.load('en_core_web_lg')
full = pl.LazyFrame([["apple", "banana", "orange"]], schema=['keyword'])
VECTOR_FIELD_NAMES = ['dim_' + str(x) for x in range(300)]
full = full.with_columns(
    pl.col('keyword').map_elements(
        lambda x: tuple(nlp(x).vector), return_dtype=pl.List(pl.Float64)
    ).list.to_struct(fields=VECTOR_FIELD_NAMES).struct.unnest()
)
full.collect()
This takes 11.5s to complete, which is more than 100 times slower than doing the computation outside of Polars. The query plan reveals this:
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
WITH_COLUMNS:
[col("keyword").map_list().list.to_struct().struct.field_by_name(dim_0)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_1)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_2)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_3)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_4)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_5)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_6)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_7)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_8)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_9)(),
col("keyword").map_list().list.to_struct().struct.field_by_name(dim_10)(),
...
It carries on like this for all 300 dims. I believe it might be computing nlp(<keyword>)
for every cell of the output. Why might this be? How do I restructure my statements to avoid this?
It's due to how expression expansion works. The expression-level unnest expands into multiple expressions, one for each field:

pl.col("x").struct.unnest()

would turn into

pl.col("x").struct.field("a")
pl.col("x").struct.field("b")
pl.col("x").struct.field("c")
Normally you don't notice this because Polars caches expressions via common subexpression elimination (CSE), but UDFs are not eligible for caching.
def udf(x):
    print("Hello")
    return x

df = pl.DataFrame({"x": [[1, 2, 3], [4, 5, 6]]})

df.with_columns(
    pl.col.x.map_elements(udf, return_dtype=pl.List(pl.Int64))
    .list.to_struct(fields=['a', 'b', 'c'])
    .struct.unnest()
)
It calls the UDF once per row for each expanded field: 2 rows × 3 fields = 6 calls.
Hello
Hello
Hello
Hello
Hello
Hello
You can use the frame-level unnest method instead.
df.with_columns(
    pl.col.x.map_elements(udf, return_dtype=pl.List(pl.Int64))
    .list.to_struct(fields=['a', 'b', 'c'])
    .alias('y')
).unnest('y')
Hello
Hello