Tags: git, dataset, google-colaboratory, dvc

Error with DVC on Google Colab - dvc.scm.CloneError: Failed to clone repo


I'm having a problem running "dvc pull" on Google Colab. I have two repositories (let's call them A and B): repository A holds my machine learning code and repository B holds my dataset.

I've successfully pushed my dataset to repository B with DVC (using gdrive as my remote storage) and I also managed to successfully run "dvc import" (as well as "dvc pull/update") on my local project of repository A.

The problem appears when I run my project on Colab. This is what I did:

  1. Created a new notebook on colab
  2. Successfully git-cloned my machine learning project (repository A)
  3. Ran "!pip install dvc"
  4. Ran "!dvc pull -v" (This is what causes the error)

At step 4, I got the following error. This is the full stack trace; note that I changed the repo URL in it for confidentiality reasons.

2022-03-08 08:53:31,863 DEBUG: Adding '/content/<my_project_A>/.dvc/config.local' to gitignore file.
2022-03-08 08:53:31,866 DEBUG: Adding '/content/<my_project_A>/.dvc/tmp' to gitignore file.
2022-03-08 08:53:31,866 DEBUG: Adding '/content/<my_project_A>/.dvc/cache' to gitignore file.
2022-03-08 08:53:31,916 DEBUG: Creating external repo https://gitlab.com/<my-dataset-repo-B>.git@3a3f2019efabff8ec71429da39b86688d1c98e75
2022-03-08 08:53:31,916 DEBUG: erepo: git clone 'https://gitlab.com/<my-dataset-repo-B>.git' to a temporary dir
Everything is up to date.
2022-03-08 08:53:32,154 ERROR: failed to pull data from the cloud - Failed to clone repo 'https://gitlab.com/<my-dataset-repo-B>.git' to '/tmp/tmp2x6y9z0edvc-clone'
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/scmrepo/git/backend/gitpython.py", line 185, in clone
    tmp_repo = clone_from()
  File "/usr/local/lib/python3.7/dist-packages/git/repo/base.py", line 1148, in clone_from
    return cls._clone(git, url, to_path, GitCmdObjectDB, progress, multi_options, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/git/repo/base.py", line 1079, in _clone
    finalize_process, decode_streams=False)
  File "/usr/local/lib/python3.7/dist-packages/git/cmd.py", line 176, in handle_process_output
    return finalizer(process)
  File "/usr/local/lib/python3.7/dist-packages/git/util.py", line 386, in finalize_process
    proc.wait(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/git/cmd.py", line 502, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git clone -v --no-single-branch --progress https://gitlab.com/<my-dataset-repo-B>.git /tmp/tmp2x6y9z0edvc-clone

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/dvc/scm.py", line 104, in clone
    return Git.clone(url, to_path, progress=pbar.update_git, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/scmrepo/git/__init__.py", line 121, in clone
    backend.clone(url, to_path, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/scmrepo/git/backend/gitpython.py", line 190, in clone
    raise CloneError(url, to_path) from exc
scmrepo.exceptions.CloneError: Failed to clone repo 'https://gitlab.com/<my-dataset-repo-B>.git' to '/tmp/tmp2x6y9z0edvc-clone'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/dvc/command/data_sync.py", line 41, in run
    glob=self.args.glob,
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/pull.py", line 38, in pull
    run_cache=run_cache,
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/fetch.py", line 50, in fetch
    revs=revs,
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/__init__.py", line 437, in used_objs
    with_deps=with_deps,
  File "/usr/local/lib/python3.7/dist-packages/dvc/repo/index.py", line 190, in used_objs
    filter_info=filter_info,
  File "/usr/local/lib/python3.7/dist-packages/dvc/stage/__init__.py", line 660, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
  File "/usr/local/lib/python3.7/dist-packages/dvc/output.py", line 918, in get_used_objs
    return self.get_used_external(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/output.py", line 973, in get_used_external
    return dep.get_used_objs(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/dependency/repo.py", line 94, in get_used_objs
    used, _ = self._get_used_and_obj(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/dependency/repo.py", line 108, in _get_used_and_obj
    locked=locked, cache_dir=local_odb.cache_dir
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.7/dist-packages/dvc/external_repo.py", line 35, in external_repo
    path = _cached_clone(url, rev, for_write=for_write)
  File "/usr/local/lib/python3.7/dist-packages/dvc/external_repo.py", line 155, in _cached_clone
    clone_path, shallow = _clone_default_branch(url, rev, for_write=for_write)
  File "/usr/local/lib/python3.7/dist-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/lib/python3.7/dist-packages/funcy/flow.py", line 274, in wrap_with
    return call()
  File "/usr/local/lib/python3.7/dist-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dvc/external_repo.py", line 220, in _clone_default_branch
    git = clone(url, clone_path)
  File "/usr/local/lib/python3.7/dist-packages/dvc/scm.py", line 106, in clone
    raise CloneError(str(exc))
dvc.scm.CloneError: Failed to clone repo 'https://gitlab.com/<my-dataset-repo-B>.git' to '/tmp/tmp2x6y9z0edvc-clone'
------------------------------------------------------------
2022-03-08 08:53:32,161 DEBUG: Analytics is enabled.
2022-03-08 08:53:32,192 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp4x5js0dk']'
2022-03-08 08:53:32,193 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp6x11s0dk']'

For reference, this is how I cloned my Git repository (repo A):

!git config --global user.name "Zharfan"
!git config --global user.email "zharfan@myemail.com"
!git clone https://<MyTokenName>:<MyToken>@link-to-my-repo-A.git

Does anyone know why? Any help would be greatly appreciated. Thank you in advance!


Solution

  • To summarize the discussion in the comments thread:

    Most likely this happens because DVC cannot access the private repo on GitLab. (The error message is obscure and should be fixed.)

    In the same way, you would not be able to run this in Colab:

    !git clone https://gitlab.com/org/<private-repo>
    

    It also returns a pretty obscure error:

    Cloning into '<private-repo>'...
    fatal: could not read Username for 'https://gitlab.com': No such device or address
    

    (I think this is related to how the TTY is set up in Colab: Git has no terminal on which to prompt for credentials.)

    The best way to solve this is to use SSH authentication for the private repository.
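
As a rough sketch of the SSH route (not taken from the thread): `dvc import` recorded the HTTPS URL of repo B, so one option is to have Git rewrite `https://gitlab.com/` URLs to SSH through its `url.<base>.insteadOf` setting, and then check access non-interactively with `git ls-remote` before running `dvc pull`. This assumes that an SSH key registered with GitLab is already present in the Colab runtime, that gitlab.com is a known host there, and that DVC clones through the git CLI, as the GitPython frames in the traceback suggest. The function names below are made up for illustration.

```python
import os
import subprocess


def ssh_rewrite_command(host="gitlab.com"):
    """Build the `git config` call that maps https://<host>/ to git@<host>:."""
    return [
        "git", "config", "--global",
        f"url.git@{host}:.insteadOf",
        f"https://{host}/",
    ]


def ls_remote_command(url):
    """Build a read-only command that lists the remote's branches."""
    return ["git", "ls-remote", "--heads", url]


def noninteractive_env():
    """Copy the environment and tell Git not to prompt on the terminal."""
    env = dict(os.environ)
    env["GIT_TERMINAL_PROMPT"] = "0"
    return env


def enable_ssh_rewrite(host="gitlab.com"):
    """Write the URL rewrite rule into the user's global Git config."""
    subprocess.run(ssh_rewrite_command(host), check=True)


def check_access(url, timeout=60):
    """Return (ok, stderr) for a non-interactive `git ls-remote` on url."""
    result = subprocess.run(
        ls_remote_command(url),
        env=noninteractive_env(),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr.strip()
```

In a notebook cell, one would call `enable_ssh_rewrite()` once and then `check_access("https://gitlab.com/org/<private-repo>.git")`. If the first element of the result is `True`, the clone that `!dvc pull` performs for repo B should succeed as well; otherwise the second element carries Git's own error message.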