
Kedro - Getting the path to an item in the data catalog


I'm training an NLP model using spaCy. I have all the preprocessing steps written as a pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:

import subprocess
import sys


def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
) -> None:
    # Run spaCy with the current interpreter. Each argument is its own list
    # item, so no shell is needed and paths containing spaces stay intact.
    cmd = [
        sys.executable, "-m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]

    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")

But I have no idea how to retrieve the file path of the items in my data catalog. Is there a way of passing this information to my nodes when creating the pipeline?


Solution

  • This is probably not the most elegant solution, but it works for me, so I'll use it until I find a better one. My custom DataSet implementation returns the file path together with the loaded object. I doubt this would generalize to other datasets, such as SQL queries, but since I know I'm dealing with a file here, it works fine. Here is my implementation:

    from kedro.io import AbstractDataSet
    from spacy.tokens import DocBin
    from dataclasses import dataclass
    from typing import Union
    from pathlib import Path
    
    
    @dataclass
    class DocBinModel:
        filepath: Path
        docbin: DocBin
    
    
    class SpacyDocBinDataSet(AbstractDataSet):
        """Loads and saves a spaCy DocBin, keeping track of the file it came from."""
    
        def __init__(self, filepath, save_args=None, load_args=None):
            # Normalise to Path so DocBinModel.filepath matches its annotation.
            self._filepath = Path(filepath)
            self._save_args = save_args or {}
            self._load_args = load_args or {}
    
        def _describe(self):
            return dict(
                filepath=str(self._filepath),
                save_args=self._save_args,
                load_args=self._load_args,
            )
    
        def _load(self):
            # Return the path together with the DocBin so a node can use either.
            docbin = DocBin().from_disk(self._filepath)
            return DocBinModel(self._filepath, docbin)
    
        def _save(self, data: Union[DocBin, DocBinModel]):
            # Accept either a bare DocBin or the wrapper returned by _load.
            if isinstance(data, DocBinModel):
                data = data.docbin
            data.to_disk(self._filepath)
    
        def _exists(self):
            return self._filepath.exists()
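
    For completeness, here is a minimal sketch of how a node could use the path carried by the loaded object to build the training command. It is not tested against a real Kedro project. `LoadedFile` is a hypothetical stand-in for `DocBinModel` so the snippet runs without spaCy installed, and `build_spacy_train_command` is a name I made up for illustration. I'm also assuming that the config path and output directory reach the node as plain strings (for example through parameters), which is not something the dataset above provides.

```python
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class LoadedFile:
    """Hypothetical stand-in for DocBinModel; only .filepath is used here."""
    filepath: Path


def build_spacy_train_command(
    config_filepath: str,
    train: LoadedFile,
    dev: LoadedFile,
    output_dir: str,
) -> List[str]:
    """Build the `spacy train` argument list from objects that carry a path."""
    return [
        sys.executable, "-m", "spacy",
        "train", str(config_filepath),
        "--output", str(output_dir),
        "--paths.train", str(train.filepath),
        "--paths.dev", str(dev.filepath),
    ]


if __name__ == "__main__":
    cmd = build_spacy_train_command(
        "config.cfg",
        LoadedFile(Path("train.spacy")),
        LoadedFile(Path("dev.spacy")),
        "output",
    )
    # Print the command instead of running it; pass cmd to subprocess.run to train.
    print(cmd)
```

    The resulting list can be handed to subprocess.run directly, without shell=True, inside the training node.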