github-actions python-poetry dvc

Configure poetry with dvc in GitHub Actions


I'm using dvc for data versioning on Dagshub. I need to run a script that triggers a Great Expectations checkpoint to generate data docs, which should afterwards be deployed to Netlify. I managed to set this up locally with poetry. The main problem is that poetry creates a virtual environment on the ubuntu image in the GHA pipeline, and that environment can't access the data I previously pulled with dvc. Is there a workaround for this?

[Image: project structure]

My pyproject.toml:

[tool.poetry]
name = "air-pollution"
version = "0.1.0"
description = ""
authors = ["Jana Jankovic"]
readme = "README.md"
packages = [
    { include = "src/data/*.py" },
    { include = "src/models/*.py" },
]

[tool.poetry.scripts]
fetch_air_data = "src.data.fetch_air_data:main"
fetch_weather_data = "src.data.fetch_weather_data:main"
preprocess_air_data = "src.data.preprocess_air_data:main"
preprocess_weather_data = "src.data.preprocess_weather_data:main"
merge_processed_data = "src.data.merge_processed_data:main"
predict_model = "src.models.predict_model:main"
server = "src.serve.server:main"
data_stability = "src.data.data_stability:main"
data_validation = "src.data.data_validation:main"
split_train_test = "src.data.split_train_test:main"
update_reference = "src.data.update_reference:main"

[tool.poetry.dependencies]
python = "3.10.5"
numpy = "1.23.5"
pandas = "^1.5.3"
flask = "^2.2.3"
scikit-learn = "^1.2.1"
pytest = "^7.2.1"
great-expectations = "^0.16.1"
evidently = "^0.2.7"
mlflow = "^2.2.2"
flask-cors = "^3.0.10"
requests = "^2.28.2"
dvc = "^2.51.0"
dvc-s3 = "^2.21.0"


[tool.poetry.group.dev.dependencies]
black = "^23.1.0"
jupyter = "^1.0.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

My workflow.yaml:

name: Workflow

on:
  push:
    branches:
      - main

  schedule:
    - cron: "* 1 * * *"

jobs: 
  great_expectations_validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v3
        with:
          python-version: "3.10.5"
          token: ${{ secrets.TKN }}
      
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install

      - name: Setup Dagshub
        run: |
          poetry run dvc remote modify origin --local auth basic
          poetry run dvc remote modify origin --local user ${{ secrets.DAGSHUB_USERNAME }}
          poetry run dvc remote modify origin --local password ${{ secrets.DAGSHUB_TOKEN }}
      
      - name: Pull data
        run: |
          poetry run dvc pull
  
      - name: Run Checkpoint
        run: |
          poetry run data_validation

The error I get is here:

[Image: pipeline error message]

I tried copying the data folder into the poetry env, but that didn't work.


Solution

  • With help from a professional, I figured out the following:

    1. Poetry creates a virtual env by default. This can be disabled by installing Poetry through an action, so pip install poetry is no longer needed:
      - name: Install and configure Poetry
        uses: snok/install-poetry@v1
        with:
          version: 1.3.2
          virtualenvs-create: false
    
    2. Similarly, dvc should be installed with an action:
      - uses: iterative/setup-dvc@v1
    
    3. The pipeline kept failing because of a wrong path to the data folder in the Great Expectations config. This was a school project, and I was following a tutorial that automatically generated the great_expectations.yml file. Because I ran the setup on Windows, the generated path used backslashes \. The GitHub Actions pipeline runs on Ubuntu (Linux), where a path written with backslashes can't be found. So the final solution is to replace the backslashes \ with forward slashes / in the path at great_expectations.yml > my_datasource > data_connectors > default_inferred_data_connector_name > base_directory.
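For reference, here is a rough sketch of how the workflow from the question might look with the two actions swapped in. It is an assumption, not the exact file I ended up with: the step order, calling dvc without the poetry run prefix, and dropping the token input from setup-python are all guesses that may need adjusting for your project.

```yaml
# Sketch only: combines the question's workflow with the actions named above.
name: Workflow

on:
  push:
    branches:
      - main

jobs:
  great_expectations_validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v3
        with:
          python-version: "3.10.5"

      # Replaces "pip install poetry"; no virtual env is created,
      # so packages land in the runner's Python environment.
      - name: Install and configure Poetry
        uses: snok/install-poetry@v1
        with:
          version: 1.3.2
          virtualenvs-create: false

      - name: Install dependencies
        run: poetry install

      # Installs the dvc CLI on the runner.
      - uses: iterative/setup-dvc@v1

      # Assumption: dvc is called directly, without "poetry run".
      - name: Setup Dagshub
        run: |
          dvc remote modify origin --local auth basic
          dvc remote modify origin --local user ${{ secrets.DAGSHUB_USERNAME }}
          dvc remote modify origin --local password ${{ secrets.DAGSHUB_TOKEN }}

      - name: Pull data
        run: dvc pull

      - name: Run Checkpoint
        run: poetry run data_validation
```

Keep in mind that the workflow changes alone did not fix the error. The base_directory value in great_expectations.yml also has to use forward slashes, for example ../data instead of ..\data (a hypothetical path, yours will differ).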