I am able to load a Spark dataset in a Kedro IPython session, started with either `ipython --ext kedro.extras.extensions.ipython` or `kedro ipython`:
```python
from pathlib import Path
import os

import pyspark.sql
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Bootstrap the Kedro project from its root directory
os.chdir('/myproject')
project_root = Path.cwd()
bootstrap_project(project_root)

session = KedroSession.create()
context = session.load_context()
catalog = context.catalog

test = catalog.load("mydata@spark")
test.show(2)
isinstance(test, pyspark.sql.DataFrame)  # True
```
So there is a Spark session correctly defined. The question is: how do I access this session object?
If I run `spark = SparkSession.builder.getOrCreate()`, I cannot confirm that this is indeed the session managed by Kedro. For example, `spark.conf.get('spark.driver.maxResultSize')` throws a `java.util.NoSuchElementException`, although `maxResultSize` is defined in my project's `spark.yml`.
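In one runnable snippet (passing a default to `conf.get` avoids the exception and shows directly whether the value from `spark.yml` reached this session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Given the NoSuchElementException above, this should print 'not set'
# even though spark.yml defines the key.
print(spark.conf.get('spark.driver.maxResultSize', 'not set'))
```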
How do I access the Kedro-managed Spark session?
If you do `kedro ipython` (or use the extension), you should already have `catalog` available as a global variable, so you don't need to create it yourself.
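(For context, recent Kedro versions should inject a few more globals besides `catalog`, so none of the bootstrapping above should be needed:)

```python
# Globals pre-populated by `kedro ipython` / the IPython extension
# in recent Kedro versions -- no manual bootstrapping required:
catalog    # the project's DataCatalog
context    # the KedroContext
session    # the KedroSession
pipelines  # dict of the project's registered pipelines
```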
I have a feeling this will work:
```python
df = catalog.load('my_data')
isinstance(df, pyspark.sql.DataFrame)  # True, so it is a Spark DataFrame
spark = df.sparkSession  # the session that was used to load it
...
```
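If that works, one way to confirm it really is the Kedro-configured session is to check for the setting from your `spark.yml` (a sketch, reusing `mydata@spark` from the question):

```python
import pyspark.sql

df = catalog.load('mydata@spark')  # `catalog` is a kedro ipython global
assert isinstance(df, pyspark.sql.DataFrame)

spark = df.sparkSession

# On the Kedro-configured session this should return the value from
# spark.yml instead of raising java.util.NoSuchElementException.
print(spark.conf.get('spark.driver.maxResultSize'))
```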