
Kedro IPython: how to access the Spark session

I am able to load a Spark dataset in a kedro ipython session:

    from pathlib import Path

    import pyspark.sql
    from kedro.framework.session import KedroSession

    # Create a Kedro session for the project in the current directory
    project_root = Path.cwd()
    session = KedroSession.create(project_path=project_root)
    context = session.load_context()
    catalog = context.catalog

    # Load a Spark-backed catalog entry and check its type
    test = catalog.load("mydata@spark")
    isinstance(test, pyspark.sql.DataFrame)  # True

So a Spark session is evidently defined. The question is how to access that session object. If I run spark = SparkSession.builder.getOrCreate(), I cannot confirm that this is the session managed by Kedro. For example, spark.conf.get('spark.driver.maxResultSize') throws a java.util.NoSuchElementException, although maxResultSize is defined in my project's spark.yml.

How can I access the Kedro-managed Spark session?


  • If you run kedro ipython (or use the extension), catalog should already be available as a global variable, so you don't need to create it yourself.

    I have a feeling this will work:

    df = catalog.load('my_data')
    isinstance(df, pyspark.sql.DataFrame)  # should be True
    spark = df.sparkSession  # the session that created this DataFrame
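
To check whether the session obtained this way carries the settings from spark.yml, one option is to read the keys with a default value, which avoids the java.util.NoSuchElementException raised for unset keys. This is only a sketch: read_spark_conf is a hypothetical helper name, not part of Kedro or PySpark, and it assumes only that the object passed in exposes conf.get(key, default), as a pyspark.sql.SparkSession does.

```python
def read_spark_conf(spark, keys, missing="<not set>"):
    """Return {key: value} for the given Spark config keys.

    `spark` is assumed to expose `conf.get(key, default)`, like a
    pyspark.sql.SparkSession. Supplying a default means an unset key
    yields `missing` instead of raising an exception.
    """
    return {key: spark.conf.get(key, missing) for key in keys}


# Hypothetical usage inside `kedro ipython`, where `catalog` already exists:
#   df = catalog.load("mydata@spark")
#   spark = df.sparkSession
#   read_spark_conf(spark, ["spark.driver.maxResultSize"])
```

If a key still comes back as unset, that may indicate the session was created without the spark.yml settings applied, which would be worth checking in the project's session setup.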