I'm trying to upload training job artifacts to S3 in a non-compressed manner.
I am familiar with the output_path one can provide to a SageMaker Estimator; everything saved under /opt/ml/output is then uploaded, compressed, to that S3 output location.
I want to be able to access a specific artifact without having to decompress the output every time. Is there a clean way to go about it? If not, is there a workaround? The artifacts I'm interested in are small metadata files (.txt or .csv), while the rest of the artifacts can be ~1 GB, so downloading and decompressing everything is quite excessive.
Any help would be appreciated.
You can pass the parameter disable_output_compression=True when creating your Estimator (details in docs here). All of your outputs will then be saved uncompressed under the S3 output location.
Example:
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",
    role=sagemaker.get_execution_role(),
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type="ml.c4.xlarge",
    # Upload the job outputs to S3 as individual files instead of a tar.gz archive
    disable_output_compression=True,
)
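
Once the outputs are stored uncompressed, you can fetch just the small metadata files. Below is a rough sketch using boto3. It assumes you already know the S3 prefix that your job wrote to; the URI in the usage line is a made-up placeholder. The helper names (split_s3_uri, select_small_files, download_metadata) are my own and not part of the SageMaker SDK. Rather than guessing the exact folder layout, it lists everything under the prefix and keeps only the .txt/.csv keys.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def select_small_files(keys, suffixes=(".txt", ".csv")):
    """Keep only the keys that end with one of the given suffixes."""
    return [key for key in keys if key.endswith(tuple(suffixes))]


def download_metadata(output_s3_uri, local_dir="."):
    """List all objects under output_s3_uri and download only .txt/.csv files."""
    import os
    import boto3  # imported here so the helpers above work without AWS access

    bucket, prefix = split_s3_uri(output_s3_uri)
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    for key in select_small_files(keys):
        # Files are saved under their base name; same-named files would overwrite each other
        target = os.path.join(local_dir, os.path.basename(key))
        s3.download_file(bucket, key, target)
        print(f"downloaded s3://{bucket}/{key} -> {target}")


# Hypothetical usage (replace with your own bucket, prefix and job name):
# download_metadata("s3://my-bucket/my-prefix/my-training-job/output/")
```

Only the listing call touches the large artifacts (by name, not by content), so the ~1 GB files are never downloaded.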