I built a machine learning model:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
which I can save to the filestore by:
import pickle
filename = "/dbfs/FileStore/lr_model.pkl"
with open(filename, 'wb') as f:
    pickle.dump(lr, f)
Ideally, I wanted to save the model directly to a workspace or a repo, so I tried:
import os
filename = "/Users/user/lr_model.pkl"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'wb') as f:
    pickle.dump(lr, f)
but it does not work: the file never shows up in the workspace.
The only alternative I have now is to transfer the model from the filestore to the workspace or a repo. How do I go about that?
When you store a file in DBFS (/FileStore/...), it lives in your own account (the data plane), while notebooks, etc. live in the Databricks account (the control plane). By design, you can't import non-code objects into a workspace. But Repos now has support for arbitrary files, although only in one direction: you can access files in Repos from your cluster running in the data plane, but you can't write into Repos (at least not now). You can:
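For example, a file that has already been committed into the repo can be read from a notebook running on the cluster. A minimal sketch, assuming the hypothetical repo path below and that your Databricks Runtime exposes repos under /Workspace/Repos (this depends on runtime version and workspace settings):

import pickle

# Hypothetical path to a repo checkout -- replace with your own user and repo.
repo_path = "/Workspace/Repos/user@example.com/my-repo/lr_model.pkl"

# Reading a file that is already committed in the repo works from the data plane.
with open(repo_path, "rb") as f:
    lr_from_repo = pickle.load(f)

# The reverse direction -- writing into Repos from the cluster -- is not
# supported (per the note above), so the equivalent of this would fail:
# with open(repo_path, "wb") as f:
#     pickle.dump(lr, f)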
But really, you should use MLflow, which is built into Azure Databricks. It will log the model file, hyper-parameters, and other run information for you, and you can then work with the model via APIs, command-line tools, etc., for example to move it between the Staging and Production stages using the Model Registry, deploy it to Azure ML, and so on.
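For example, a minimal sketch of that MLflow workflow; the model name "lr_model" and the Staging stage used below are placeholders, and X_train / y_train are the same training data as in the question:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression

with mlflow.start_run():
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    # Track whatever is useful alongside the model itself.
    mlflow.log_param("fit_intercept", lr.fit_intercept)
    # registered_model_name also creates/updates an entry in the Model Registry.
    mlflow.sklearn.log_model(lr, "model", registered_model_name="lr_model")

# Later, load the model by name and stage from the registry instead of juggling
# pickle files (this assumes a version has been transitioned to Staging):
loaded_lr = mlflow.sklearn.load_model("models:/lr_model/Staging")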