Tags: python, arrays, snowflake-cloud-data-platform, dimensionality-reduction

Snowflake ARRAY column as input to Snowpark modeling.decomposition


I have a Snowflake table with an ARRAY column containing custom embeddings (with array size > 1000). These arrays are sparse, and I would like to reduce their dimensionality with SVD (or one of the Snowpark ml.modeling.decomposition methods). A toy example of the dataframe would be:

df = session.sql("""
    select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
    union
    select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
    """)
df.show()
# DOC_ID  | DOC_VEC
# doc1 | [   0.1,   0.3,   0.5,   0.7 ]
# doc2 | [   0.2,   0.4,   0.6,   0.8 ]

However, when I try to fit this dataframe

from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols = 'doc_vec', output_cols='out_svd')
print(tsvd)
out = tsvd.fit(df)

I get

 File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
    args = {"X": df[input_cols]}
                 ~~^^^^^^^^^^^^
  File "pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]

<...snip...>

KeyError: "None of [Index(['doc_vec'], dtype='object')] are in the [columns]"

Based on the information in the tutorial text_embedding_as_snowpark_python_udf, I suspect the Snowpark array needs to be converted to an np.ndarray before being fed to the underlying sklearn.decomposition.TruncatedSVD.
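
For illustration, a minimal local sketch of that idea might look like this (it bypasses Snowpark ML entirely: it pulls the table into memory with to_pandas() and runs plain sklearn; the json.loads step assumes the ARRAY column arrives in pandas as a JSON string, and n_components=2 is arbitrary):

import json
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Pull the table locally; Snowpark typically returns ARRAY columns as JSON strings in pandas
pdf = df.to_pandas()
# Parse each vector and stack them into a dense (n_docs, n_dims) ndarray
X = np.vstack([np.asarray(json.loads(v), dtype=float) for v in pdf["DOC_VEC"]])
# n_components is arbitrary here; it must be smaller than the vector length
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)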

Can someone point me to an example of using Snowflake arrays as inputs to the Snowpark models, please?


Solution

  • The problem right now is that Snowflake ML doesn't support sparse matrices or ARRAY-type input columns yet (but it will).

    A teammate wrote this sample code: the first part shows what will be supported in the future, and the second part shows a workaround that works today (a sketch for applying it to an existing ARRAY column follows the code):

    from snowflake.ml.modeling.decomposition import TruncatedSVD
    from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
    from snowflake.snowpark import Session, functions as F, types as T
    
    session = Session.builder.configs(SnowflakeLoginOptions()).getOrCreate()
    
    # This cannot work right now because Snowflake ML doesn't accept ARRAY-type input columns so far... We'll support it in the future!
    t = session.range(5).with_column(
        "doc_vec",
        F.array_construct(
            F.lit(0.1),
            F.lit(0.2),
            F.lit(0.3),
        ),
    ).with_column("doc_vec", F.col("doc_vec").cast(T.ArrayType(T.FloatType())))
    tsvd = TruncatedSVD(input_cols="DOC_VEC", output_cols="DOC_VEC")
    
    # Workaround that works today: create a dataframe with one numeric column per feature
    t = session.create_dataframe([[0.1, 0.2, 0.3] for _ in range(5)], schema=["A", "B", "C"])
    tsvd = TruncatedSVD(input_cols=["A", "B", "C"], output_cols=["OUTPUT"])
    t.show()
    
    tsvd.fit(t)
    # show the results
    tsvd.transform(t).show()
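
  • Until ARRAY inputs are supported, one possible bridge for an existing ARRAY column (a sketch, not an official API: it assumes every DOC_VEC has the same, known length and simply expands the array into one FLOAT column per element so the workaround above applies; the V0, V1, ... names are arbitrary):

    # Expand the question's DOC_VEC array into per-element FLOAT columns
    vec_len = 4  # length of the toy vectors; >1000 for the real table
    flat = df.select(
        F.col("DOC_ID"),
        *[F.col("DOC_VEC")[i].cast(T.FloatType()).alias(f"V{i}") for i in range(vec_len)],
    )

    # Same pattern as the workaround above, now driven by the flattened columns
    tsvd = TruncatedSVD(input_cols=[f"V{i}" for i in range(vec_len)], output_cols=["OUTPUT"])
    tsvd.fit(flat)
    tsvd.transform(flat).show()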