python, snowflake-cloud-data-platform, tableau-api

Sharing data across virtual environments in Python


I am trying to consolidate data from several different sources into a Tableau .hyper file using Pantab. One of my sources is Snowflake. The problem is that Pantab and the Snowflake connector require non-overlapping versions of PyArrow. By using virtual environments, I can write scripts for each of them separately, but that doesn't get me what I need: the Pandas DataFrame produced from Snowflake (Script #1 in Environment #1) has to serve as input to the Tableau step (Script #2 in Environment #2).

I can always have Script #1 export an Excel file or CSV and then import that, but I'd rather not have those extra files unless there's no other way. There is a lot of transformation that happens before the .hyper file is created, so it's not something that lends itself to just being done in Tableau.

I'm currently doing the work in Tableau Data Prep, but it takes over an hour out of my day to babysit it.


Solution

  • As messy as it may seem, writing and reading intermediate files is much safer at this initial stage. The files let you check the data after each stage and catch errors that would be much harder to track down if everything just flowed through and garbage arrived at the end of the processing pipeline.
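
    For example, a minimal sketch of the intermediate-file approach, assuming both environments have pandas and can read and write Parquet with their respective PyArrow builds (the connection details, query, file names, and table name are placeholders; CSV works the same way if Parquet is not an option):

    # stage_1.py -- runs in the Snowflake environment
    import snowflake.connector

    con = snowflake.connector.connect(account="...", user="...", password="...")
    df = con.cursor().execute("SELECT ...").fetch_pandas_all()
    # ... transformations that belong to this stage ...
    df.to_parquet("snowflake_extract.parquet", index=False)

    # stage_2.py -- runs in the Pantab environment
    import pandas as pd
    import pantab

    df = pd.read_parquet("snowflake_extract.parquet")
    # ... combine with the other sources, remaining transformations ...
    pantab.frame_to_hyper(df, "output.hyper", table="Extract")

    Parquet keeps the column types intact across the hand-off, and the intermediate file doubles as the checkpoint you can inspect when something looks wrong downstream.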

    I'm not otherwise familiar with the tools you are using. Can they read from standard input and write to standard output? If so, you can pipe the output of stage one into the input of stage two (a sketch of the two scripts is at the end of this answer):

    stage_one | stage_two
    

    though since you are using Python virtual environments, use each environment's own command to run its stage. Using pipenv:

    pipenv run stage_one | pipenv run stage_two
    

    or using virtualenv:

    C:\path\to\venv1\python.exe C:\path\to\stage_1.py | C:\path\to\venv2\python.exe C:\path\to\stage_2.py
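
    Either way, the two scripts must agree on a serialization format for the stream. A minimal sketch, assuming the data survives a round trip through CSV on stdout/stdin (the DataFrame contents and the .hyper table name are placeholders):

    # stage_1.py -- writes the DataFrame as CSV to standard output
    import sys
    import pandas as pd

    df = pd.DataFrame({"placeholder": [1, 2, 3]})  # in practice: the Snowflake query result plus transformations
    df.to_csv(sys.stdout, index=False)

    # stage_2.py -- reads the CSV from standard input and builds the .hyper file
    import sys
    import pandas as pd
    import pantab

    df = pd.read_csv(sys.stdin)
    pantab.frame_to_hyper(df, "output.hyper", table="Extract")

    CSV needs nothing beyond pandas in either environment, but it flattens dtypes, so anything that has to stay a date or a decimal will need to be re-parsed in stage two.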