I'm using DVC for data versioning on DagsHub. I need to run a script that triggers a Great Expectations checkpoint to generate Data Docs, which should afterwards be deployed to Netlify. I managed to set this up locally with Poetry. The main problem is that Poetry creates a virtual environment on the Ubuntu image in my GitHub Actions pipeline, and the script run from that environment can't access the data I previously pulled with DVC. Is there a workaround for this?
My pyproject.toml:
[tool.poetry]
name = "air-pollution"
version = "0.1.0"
description = ""
authors = ["Jana Jankovic"]
readme = "README.md"
packages = [
{ include = "src/data/*.py" },
{ include = "src/models/*.py" },
]
[tool.poetry.scripts]
fetch_air_data = "src.data.fetch_air_data:main"
fetch_weather_data = "src.data.fetch_weather_data:main"
preprocess_air_data = "src.data.preprocess_air_data:main"
preprocess_weather_data = "src.data.preprocess_weather_data:main"
merge_processed_data = "src.data.merge_processed_data:main"
predict_model = "src.models.predict_model:main"
server = "src.serve.server:main"
data_stability = "src.data.data_stability:main"
data_validation = "src.data.data_validation:main"
split_train_test = "src.data.split_train_test:main"
update_reference = "src.data.update_reference:main"
[tool.poetry.dependencies]
python = "3.10.5"
numpy = "1.23.5"
pandas = "^1.5.3"
flask = "^2.2.3"
scikit-learn = "^1.2.1"
pytest = "^7.2.1"
great-expectations = "^0.16.1"
evidently = "^0.2.7"
mlflow = "^2.2.2"
flask-cors = "^3.0.10"
requests = "^2.28.2"
dvc = "^2.51.0"
dvc-s3 = "^2.21.0"
[tool.poetry.group.dev.dependencies]
black = "^23.1.0"
jupyter = "^1.0.0"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
My workflow.yaml:
name: Workflow
on:
  push:
    branches:
      - main
  schedule:
    - cron: "* 1 * * *"
jobs:
  great_expectations_validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v3
        with:
          python-version: "3.10.5"
          token: ${{ secrets.TKN }}
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Setup Dagshub
        run: |
          poetry run dvc remote modify origin --local auth basic
          poetry run dvc remote modify origin --local user ${{ secrets.DAGSHUB_USERNAME }}
          poetry run dvc remote modify origin --local password ${{ secrets.DAGSHUB_TOKEN }}
      - name: Pull data
        run: |
          poetry run dvc pull
      - name: Run Checkpoint
        run: |
          poetry run data_validation
The error I get is here:
I tried copying the data folder into the Poetry environment, but that didn't work.
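A virtual environment changes which interpreter and packages are used, not which files are visible, so it can help to check whether the failure is really a path problem. Below is a minimal diagnostic sketch that could be run as an extra step (for example with `poetry run python`). The function name `describe_environment` and the default folder name `data` are assumptions made for illustration; adjust them to the actual project layout.

```python
import os
import sys
from pathlib import Path


def describe_environment(data_dir: str = "data") -> dict:
    """Collect facts that help tell a path problem from an environment problem.

    `data_dir` is a hypothetical default; pass the folder DVC pulls into.
    """
    path = Path(data_dir)
    exists = path.is_dir()
    return {
        # Which Python is running, and whether it lives in a virtualenv.
        "interpreter": sys.executable,
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Relative paths are resolved against the current working directory.
        "cwd": os.getcwd(),
        "data_dir": str(path.resolve()),
        "data_dir_exists": exists,
        # Number of regular files found under the data folder (0 if missing).
        "file_count": sum(1 for p in path.rglob("*") if p.is_file()) if exists else 0,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `data_dir_exists` is true and `file_count` is non-zero inside the Poetry environment, the data is reachable and the problem more likely lies in how the consuming tool resolves its configured paths.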
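With help from a professional I figured out the following:
pip install poetry
is not needed anymore. Poetry can be installed with a dedicated action, with virtual environment creation turned off:
- name: Install and configure Poetry
  uses: snok/install-poetry@v1
  with:
    version: 1.3.2
    virtualenvs-create: false
DVC can be set up with its own action as well:
- uses: iterative/setup-dvc@v1
The last thing to check is the data path configured in great_expectations.yml > my_datasource > data_connectors > default_inferred_data_connector_name > base_directory.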
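To see where a given `base_directory` value actually points on the runner, a small helper like the one below may be useful. It is only a sketch under one assumption: that a relative `base_directory` is interpreted relative to the folder containing great_expectations.yml. The function name `resolve_base_directory` is hypothetical and not part of Great Expectations.

```python
from pathlib import Path


def resolve_base_directory(context_root: str, base_directory: str) -> Path:
    """Return the absolute path a base_directory value would point to.

    Assumption: relative values are taken relative to `context_root`,
    the folder that holds great_expectations.yml. Absolute values are
    returned unchanged (apart from normalization).
    """
    candidate = Path(base_directory)
    if not candidate.is_absolute():
        candidate = Path(context_root) / candidate
    return candidate.resolve()


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    resolved = resolve_base_directory("great_expectations", "../data")
    print(f"{resolved} (exists: {resolved.is_dir()})")
```

Printing the resolved path in the workflow, next to the output of `dvc pull`, makes it easier to spot a mismatch between where the data lands and where the data connector looks.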