I have a workspace with several projects, laid out as shown below.
Code in main_script.py, which lives in the main_scripts subfolder, needs to call a function defined in config_reader.py, which lives in the user_functions folder.
testing_framework is my current working directory, with pyspark_training as the root project.
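Reconstructed from the paths described, the layout is roughly:

pyspark_training/
└── testing_framework/        <- current working directory
    ├── main_scripts/
    │   └── main_script.py
    └── user_functions/
        └── config_reader.py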
My main_script.py and config_reader.py files are shown in the working version below.
I tried creating a dev.env file in the main pyspark_training folder, and I also tried modifying the settings for pyspark_training, but I am not sure whether that is correct.
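For reference, such a dev.env typically just points PYTHONPATH at the folder that contains user_functions; the path below is an assumption, and the file only takes effect if the IDE is configured to load it (e.g. python.envFile in VS Code):

PYTHONPATH=/path/to/pyspark_training/testing_framework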
But I am still getting: ModuleNotFoundError: No module named 'user_functions'
Can anyone help me solve this? I have gone through a bunch of Stack Overflow topics covering the issue, but to no avail; I am still getting the same error.
This is the working version of the problem. The problem was with the cwd, which was mentioned in the comments by furas: Python resolves imports through sys.path, which by default contains the running script's own folder (main_scripts) rather than its parent, so user_functions was never searched. I am pasting the working version for future reference.
Short summary:
Want to access:
testing_framework > user_functions > config_reader.py
from:
testing_framework > main_scripts > main_script.py
(Note: the folder structure is given in the question above.)
main_script.py
import os
import sys
from pyspark.sql import SparkSession

BASE = os.path.dirname(os.path.abspath(__file__))  # folder that contains this script (main_scripts)
parent_folder = os.path.abspath(os.path.join(BASE, ".."))  # its parent folder (testing_framework)
sys.path.append(parent_folder)  # put the parent on sys.path so user_functions becomes importable

from user_functions import config_reader

spark = SparkSession.builder.appName('validation').master("local").getOrCreate()

# placeholder values; point these at your actual config location and file name
config_folder = "configs"
config_file = "dev_config.csv"

configs = config_reader.read_config(spark, config_folder, config_file)
print(configs)
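With the parent folder on sys.path, from user_functions import config_reader resolves no matter where the script is launched from; only the relative config_folder path still depends on the cwd.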
config_reader.py
import os
from pyspark.sql import SparkSession

def read_config(spark: SparkSession, config_folder, config_file):
    full_path = os.path.join(config_folder, config_file)
    if not config_file.endswith('.csv'):
        # guard: the original fell through to an undefined df_config for non-CSV files
        raise ValueError(f"Unsupported config file type: {config_file}")
    df_config = spark.read.format('csv') \
        .option('header', True) \
        .load(full_path)
    return df_config.collect()
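read_config returns df_config.collect(), i.e. a list of pyspark.sql.Row objects. For a hypothetical configs/dev_config.csv containing:

key,value
env,dev

print(configs) would output something like [Row(key='env', value='dev')].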