Tags: bash, shell, odbc, databricks, odbc-sql-server-driver

Running Bash Script to Download SQL Server ODBC Driver in Databricks Fails


I have a bash script that looks something like this,

curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get -q -y install msodbcsql17
python -m pip install --upgrade pip
pip install twine keyring artifacts-keyring
pip install -r requirements.txt

I am basically just trying to install the SQL Server ODBC driver and then run some Python commands.

I am trying to run this on a Databricks cluster.

When I do,

%sh
bash <path-to-bash-script.sh>

Or

%sh
sh <path-to-bash-script.sh>

I get an error when trying to download the driver,

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   983  100   983    0     0  12287      0 --:--:-- --:--:-- --:--:-- 12287
Warning: apt-key output should not be parsed (stdout is not a terminal)
gpg: invalid option "-
"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    79  100    79    0     0    975      0 --:--:-- --:--:-- --:--:--   975
E: Invalid operation update
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package msodbcsql17

Note: I am creating this file locally as part of a project, and then a CI/CD pipeline copies the file into a Databricks workspace.

However, when I take the commands from this file and just run them directly within a cell using %sh, they run without an issue.

What exactly is the problem here?


Solution

  • The reason behind this is not entirely clear; however, my best guesses are as follows,

    1. It has something to do with hidden characters in the file that is created locally. For instance, Windows might be saving the file with carriage-return line endings (CRLF) instead of plain newlines, and this could be breaking the execution of the script. The error output is consistent with this: the stray quote after gpg: invalid option "- and the E: Invalid operation update message both look like a trailing carriage return being attached to the last argument on each line. A quick way to check for and strip these characters is sketched just after this list.
    2. It has something to do with file permissions (upon checking the permissions on the file, however, this does not seem to be the case).
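
    As a sanity check, something along these lines can be run from a notebook cell to look for and strip Windows line endings (a rough sketch; <path-to-bash-script.sh> is the same placeholder as above, and /tmp/clean_script.sh is just an arbitrary scratch path):

    %sh
    # Show non-printing characters; CRLF line endings appear as "^M" at the end of each line
    cat -A <path-to-bash-script.sh>

    # Write a copy with the carriage returns removed and run that instead
    tr -d '\r' < <path-to-bash-script.sh > /tmp/clean_script.sh
    bash /tmp/clean_script.sh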

    How I was able to resolve this issue was by simply creating the file inside of the Databricks workspace using dbutils. For example,

    dbutils.fs.put("dbfs:/scripts/install_dependencies.sh","""
    #!/bin/bash
    curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
    curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
    apt-get update
    ACCEPT_EULA=Y apt-get -q -y install msodbcsql17""", True)
    

    This runs without an issue and it seems to be the recommended way to create any init scripts that you want to run on your clusters.
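
    For example, to run the script manually from a notebook, it can be called through the DBFS FUSE mount (a sketch, assuming the standard /dbfs mount is available on the cluster):

    %sh
    # Files written to dbfs:/scripts/ are exposed on the driver under /dbfs/scripts/
    bash /dbfs/scripts/install_dependencies.sh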

    The downside is that you can't easily version control these scripts, and they will need to be overwritten each time a change is required.
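
    One way to soften this is to keep the script in the repository and have the CI/CD pipeline push it to DBFS on every deploy, overwriting the copy used by the cluster (a rough sketch, assuming the Databricks CLI is installed and configured in the pipeline; the paths are illustrative):

    # Overwrite the DBFS copy with the version from source control
    databricks fs cp --overwrite scripts/install_dependencies.sh dbfs:/scripts/install_dependencies.sh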