I'm training an NLP model using spaCy. I have the preprocessing steps all written as a Kedro pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:
import subprocess

def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
):
    cmd = [
        "python", "-m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError("spaCy training failed")
But I have no idea how to retrieve the file path information from the items in my data catalog. Is there a way of passing this information to my nodes when creating the pipeline?
This is probably not the most elegant solution, but it works for me, so I'll use it until I find a better one. The idea is to return the file path together with the loaded object from my DataSet implementation. I doubt this generalizes to other dataset types, SQL queries for example, but since I know I'm dealing with a file here, it works fine. Here is my implementation:
from dataclasses import dataclass
from pathlib import Path
from typing import Union

from kedro.io import AbstractDataSet
from spacy.tokens import DocBin


@dataclass
class DocBinModel:
    """Wraps a DocBin together with the path it was loaded from."""
    filepath: Path
    docbin: DocBin


class SpacyDocBinDataSet(AbstractDataSet):
    def __init__(self, filepath, save_args=None, load_args=None):
        self._filepath = filepath
        self._save_args = save_args or {}
        self._load_args = load_args or {}

    def _describe(self):
        return dict(
            filepath=self._filepath,
            save_args=self._save_args,
            load_args=self._load_args,
        )

    def _load(self):
        # Load the serialized DocBin and return it together with its file path,
        # so downstream nodes can hand the path to the spaCy CLI.
        with open(self._filepath, "rb") as f:
            docbin = DocBin().from_bytes(f.read())
        return DocBinModel(self._filepath, docbin)

    def _save(self, data: Union[DocBin, DocBinModel]):
        # Accept either a plain DocBin or the wrapped DocBinModel.
        if isinstance(data, DocBinModel):
            data = data.docbin
        data.to_disk(self._filepath)

    def _exists(self):
        return Path(self._filepath).exists()
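For completeness, here is a rough sketch of how a training node could consume these datasets and pass the file paths on to the CLI wrapper from the question. The catalog entry names (train_docbin, dev_docbin) and the parameter keys are made up for illustration; adjust them to your own catalog and parameters. I pass the config path and output directory as parameters here for simplicity, but the same wrapper trick could be applied to the config file's dataset as well.

# Sketch only: dataset and parameter names below are assumptions.
# DocBinModel and train_spacy_nlp_model are the ones defined above.
from kedro.pipeline import Pipeline, node


def train_nlp_model(
    train_data: DocBinModel,
    dev_data: DocBinModel,
    config_filepath: str,
    output_dir: str,
):
    # The DocBinModel wrapper exposes .filepath, which is exactly what the
    # spaCy CLI needs.
    train_spacy_nlp_model(
        config_filepath=config_filepath,
        train_filepath=str(train_data.filepath),
        dev_filepath=str(dev_data.filepath),
        output_dir=output_dir,
    )


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=train_nlp_model,
                inputs=[
                    "train_docbin",            # SpacyDocBinDataSet entry in the catalog
                    "dev_docbin",              # SpacyDocBinDataSet entry in the catalog
                    "params:spacy_config_path",
                    "params:spacy_output_dir",
                ],
                outputs=None,
                name="train_spacy_model",
            ),
        ]
    )

With this, the node itself never needs to know where Kedro stores the data; the path travels with the loaded object.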