pyspark kedro

Kedro IPython: how to access the Spark session


I am able to load a Spark dataset in a kedro ipython session:

    import os
    from pathlib import Path

    import pyspark.sql
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.extras.datasets.spark import SparkDataSet

    # Point Kedro at the project root and load the project settings
    os.chdir('/myproject')
    project_root = Path.cwd()
    bootstrap_project(project_root)

    # Create a session and get the data catalog from the project context
    session = KedroSession.create()
    context = session.load_context()
    catalog = context.catalog

    # Load a Spark-backed catalog entry and confirm it is a Spark DataFrame
    test = catalog.load("mydata@spark")
    test.show(2)
    isinstance(test, pyspark.sql.DataFrame) # True

So a Spark session is correctly defined. The question is how to access this session object. If I run spark = SparkSession.builder.getOrCreate(), I cannot confirm that this is the session managed by Kedro: for example, spark.conf.get('spark.driver.maxResultSize') throws a java.util.NoSuchElementException, although maxResultSize is defined in my project's spark.yml.

How do I access the right Kedro-managed Spark session?


Solution

  • If you run kedro ipython (or use the extension), catalog should already be available as a global variable, so you don't need to create it yourself.

    I have a feeling this will work:

    import pyspark.sql

    df = catalog.load('my_data')
    isinstance(df, pyspark.sql.DataFrame)  # should be True for a Spark dataset
    spark = df.sparkSession  # the session that created this DataFrame
    ...
    

    https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sparkSession.html
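
To check whether the session you get back carries the settings from spark.yml, something like the following might help. It is only a sketch: spark_from_dataframe and conf_or_default are hypothetical helper names, not part of Kedro or PySpark. It assumes PySpark 3.3 or later, where DataFrame.sparkSession is a public property, and it relies on spark.conf.get accepting a default value so that an unset key returns the default instead of raising java.util.NoSuchElementException.

```python
from typing import Optional


def spark_from_dataframe(df):
    """Return the SparkSession that created `df`.

    Assumes `df` is a pyspark.sql.DataFrame and PySpark >= 3.3,
    where `DataFrame.sparkSession` is a public property.
    """
    return df.sparkSession


def conf_or_default(spark, key: str, default: Optional[str] = None) -> Optional[str]:
    """Read a runtime conf value from `spark`.

    Passing a default to `spark.conf.get` makes an unset key return
    that default instead of raising an exception.
    """
    return spark.conf.get(key, default)


# Hypothetical usage inside `kedro ipython`, where `catalog` is already defined:
#
#   df = catalog.load("mydata@spark")
#   spark = spark_from_dataframe(df)
#   conf_or_default(spark, "spark.driver.maxResultSize")
```

If conf_or_default returns None for a key that is set in spark.yml, the session behind the DataFrame was probably not built from that file, which would point at how the project initialises Spark rather than at how the session is retrieved.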