azure-devops azure-vm-scale-set azure-devops-pipelines

Azure DevOps Pipelines - Scale Set Agents: Installing Docker


We've recently reconfigured our build process to run entirely in containers, and we're now looking to migrate from on-premises build agents to agents in an Azure Scale Set.

We want to avoid having to maintain our own VM images for the Azure Scale Set, and have opted to use the default Ubuntu 18.04 LTS image which is available in Azure.

This image does not include Docker, so we've configured the Azure Scale Set to use a cloud-config script which will install Docker when the VM first boots:

#cloud-config

apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88

packages:
  - docker-ce
  - docker-ce-cli

groups:
  - docker

This seems to work well, but sometimes the build jobs fail:

Starting: Initialize containers
/usr/bin/docker version --format '{{.Server.APIVersion}}'
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
'
##[error]Exit code 1 returned from process: file name '/usr/bin/docker', arguments 'version --format '{{.Server.APIVersion}}''.
Finishing: Initialize containers


It looks like either the cloud-init script failed, or the Azure DevOps agent started on the VM before the cloud-init script completed.

So far, I've seen the following scenarios:

Does anyone have a similar setup? Does it work properly? If not, what are alternative ways to deploy Docker to the VMs before the VM runs a container job?


Solution

  • When you configure an Azure DevOps agent pool to use an Azure Scale Set to provision build machines, the Microsoft.Azure.DevOps.Pipelines.Agent/TeamServicesAgentLinux extension is automatically added to your scale set.

    This extension is responsible for installing the Azure DevOps agent on your VMs and adding it to your agent pool.

    The extension runs when the VM boots, at about the same time as the cloud-init script. This creates a race condition: the agent can come online and pick up a job before cloud-init has finished installing Docker.

    To work around this, add a bootcmd section to your cloud-config script which forces the walinuxagent service (which launches the Azure DevOps extension) to start after cloud-init has finished, like this:

    #cloud-config
    
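    bootcmd:
      # Order walinuxagent (which launches the Azure DevOps extension) after cloud-init's final stage
      - mkdir -p /etc/systemd/system/walinuxagent.service.d
      - echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/walinuxagent.service.d/override.conf
      # Override cloud-final.service with a copy that has After=multi-user.target removed
      - sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
      # Make systemd pick up both changes
      - systemctl daemon-reload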
    
    apt:
      sources:
        docker.list:
          source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
          keyid: 9DC858229FC7DD38854AE2D88D81803C0EBFCD88
    
    packages:
      - docker-ce
      - docker-ce-cli
    
    groups:
      - docker
    

    This lets you create an Azure DevOps scale set agent pool that uses the standard Ubuntu 18.04 image and installs Docker on top of it.

    See https://github.com/microsoft/azure-pipelines-agent/issues/2866 and https://github.com/Azure/WALinuxAgent/issues/1938#issuecomment-657293920 for more background.
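
    To check on a running VM that the override took effect, something like the following sketch may help. The function names are made up for illustration, and it assumes systemd's systemctl, a cloud-init version that supports status --wait, and a user who is allowed to talk to the Docker daemon (root or a member of the docker group).

    ```shell
    #!/bin/sh
    # Sketch only: helper names are hypothetical, not part of any Azure tooling.

    # Succeeds if walinuxagent.service is ordered after cloud-final.service.
    agent_waits_for_cloud_init() {
      systemctl show -p After walinuxagent.service | grep -q 'cloud-final\.service'
    }

    # Succeeds if cloud-init finished without error and the Docker daemon answers.
    # The docker command is the same one the pipeline runs in "Initialize containers".
    docker_ready() {
      cloud-init status --wait >/dev/null &&
        docker version --format '{{.Server.APIVersion}}' >/dev/null
    }
    ```

    Running agent_waits_for_cloud_init && docker_ready over SSH on a freshly provisioned instance should exit with status 0 if the ordering and the Docker install both worked.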

    While you're at it, you may also want to mount /agent on the resource disk of your VM, which typically has better performance than the OS disk. You can add this to your cloud-init script to do so:

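    disk_setup:
      # ephemeral0 is cloud-init's alias for the Azure resource (temporary) disk
      ephemeral0:
        table_type: gpt
        # Two partitions: 66% of the disk, and 33% with partition type 82 (Linux swap)
        layout: [66, [33,82]]
        overwrite: true
    
    fs_setup:
      # Format the first partition as ext4
      - device: ephemeral0.1
        filesystem: ext4
    
    mounts:
      # Mount the first partition at /agent
      - ["ephemeral0.1", "/agent"]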
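
    After the VM has booted, one way to see which device backs /agent is findmnt from util-linux. A minimal sketch, with a hypothetical function name:

    ```shell
    #!/bin/sh
    # Sketch only: agent_mount_source is a made-up helper name.

    # Prints the device mounted at /agent.
    # Returns non-zero if /agent is not a mount point.
    agent_mount_source() {
      findmnt -n -o SOURCE /agent
    }
    ```

    Comparing the printed device with the output of lsblk shows whether it is a partition of the resource disk rather than the OS disk.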