rust  amazon-ec2  arm64  ubuntu-20.04  aws-graviton

Why does this Rust package take 2x as long to build for aarch64 as it does for x86-64?


Context: At work, I am compiling a set of packages for Intel64 and ARM64, and bundling them up into Linux packages (.rpm, .deb, .apk). I'm building out a whole pipeline to enable this, which will feed into our Artifactory installation.

We're building out self-hosted, native ARM64 runners for our GitHub Enterprise Server system, to pair alongside our existing self-hosted Intel64 runners. Our runners are built atop Amazon EC2 instances. Both are compute-optimized.

CPU Arch       Instance type  vCPUs  Memory  uname -m
amd64/x86_64   c5.2xlarge     8      16 GB   x86_64
arm64/aarch64  c6g.2xlarge    8      16 GB   aarch64
arm64/aarch64  c7g.2xlarge    8      16 GB   aarch64

Hosts are EKS clusters running EKS-optimized Amazon Linux 2.

Our "runners" are K8S/EKS pods (~Docker containers) that die/re-spawn after each individual workflow. The Docker image is a multi-platform image — same software, same configuration, multiple CPUs. The container OS is Ubuntu "Focal Fossa" 20.04 LTS.

Using our self-hosted GitHub Actions runners, I wrote the GHA workflow to download the source of a Rust project and compile it — once on the Intel runner, and once on the ARM64 runner. I run uname -m and output the result as the first step in the workflow, and I see what I'm expecting. I also run file against the compiled binary, and I also see what I'm expecting.
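The architecture check described above could be scripted roughly as follows. This is a minimal sketch: the helper names file_arch_label and binary_matches_host are hypothetical, and it assumes the usual ELF descriptions printed by file ("x86-64" and "ARM aarch64").

```shell
# Map a `uname -m` value to the architecture label that `file` prints
# for ELF binaries. Returns non-zero for architectures not handled here.
file_arch_label() {
  case "$1" in
    x86_64)  echo "x86-64" ;;
    aarch64) echo "ARM aarch64" ;;
    *)       return 1 ;;
  esac
}

# Succeed only when the binary at $1 was built for the host architecture.
binary_matches_host() {
  local label
  label="$(file_arch_label "$(uname -m)")" || return 1
  file -b "$1" | grep -qF "${label}"
}
```

For example, binary_matches_host target/release/lychee || exit 1 would fail the workflow step if the binary was built for the wrong CPU.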

(I'm working very hard to have an apples-to-apples comparison here.)

I'm building https://github.com/lycheeverse/lychee as a test project of the pipeline. I've not (yet) tested other compilations, but this felt complex enough to put the new ARM64 runners through their paces.

Here is the build script (${ARCH} is either x86_64 or aarch64, as appropriate):

sudo apt-get -y update
sudo apt-get -y install --no-install-recommends \
  build-essential \
  ca-certificates \
  curl \
  file \
  git \
  gpg \
  gpg-agent \
  gzip \
  libssl-dev \
  openssh-client \
  pkg-config \
  software-properties-common \
  tar \
  wget \
  ;

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "${HOME}/.cargo/env"

# shellcheck disable=2154
wget --header "Authorization: Bearer ${GITHUB_TOKEN}" \
    "https://github.com/lycheeverse/lychee/archive/refs/tags/v0.15.1.tar.gz"
tar zxvf "v0.15.1.tar.gz"

### Start measuring time

cd "lychee-0.15.1/" || true
cargo fetch --target="${ARCH}-unknown-linux-gnu" --locked
cargo install cargo-auditable --locked

# The `mold` linker is pre-installed.
mold -run cargo auditable build --timings --frozen --release

sudo install -Dm755 target/release/lychee -t "/usr/local/bin/"

### Stop measuring time
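
The "Start/Stop measuring time" markers above could be implemented with a small wrapper like the one below. This is only a sketch of one way to do it: time_step is a hypothetical helper, and it measures wall-clock time in whole seconds using date.

```shell
# Run a command, print its wall-clock duration (whole seconds) to stderr,
# and return the command's own exit status.
time_step() {
  local label start end status
  label="$1"
  shift
  start="$(date +%s)"
  status=0
  "$@" || status=$?
  end="$(date +%s)"
  echo "${label}: $(( end - start ))s" >&2
  return "${status}"
}
```

A step would then be wrapped as, for example, time_step build mold -run cargo auditable build --timings --frozen --release.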

The Intel runner is one generation older than the Graviton/ARM64 runner. Same vCPUs, same amount of memory available. In the measured amount of time (script, above), I'm seeing these results (average of 5 builds):

I was expecting parity between the two CPU architectures, or maybe a slight edge for ARM64 seeing that the Graviton instance is one generation newer. I also know that languages like Haskell are still working on bringing things to ARM64, and I wonder if the same is true for Rust.

For Rustaceans: are there parts of the Rust build pipeline that are not yet optimized for ARM64 on glibc-based Linuxes?

Next, I'm going to try building a significant project in Go, just to try another language that I know is optimized for ARM64, and attempt to rule out issues with the Graviton processors. I'm also going to set up another representative Rust project to see if I get different results.


Update (same day): I performed the same test on a Go project (OpenTofu). It has over 300,000 lines of code, and also pulls in several external dependencies that have to be downloaded and compiled.

Here, arm64 was a 36% improvement over the 2-generation-old Intel instance. So I don't think my Rust issue is related to Amazon lying about Graviton price-performance. I think it has to do with something about Rust or lychee specifically.


Solution

  • To improve performance of Rust on Graviton, you should enable the Large System Extensions (LSE) via RUSTFLAGS before building your project. LSE was introduced in the Armv8.1-A architecture and improves overall system throughput. Graviton2 and later, based on the Neoverse CPU line, all include the LSE feature.

    Enable LSE with the following line of code before your cargo build --release command:

    export RUSTFLAGS="-Ctarget-feature=+lse"
    

    After the sudo apt-get install step, your final script should look like this:

    # Install Rust
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
    source "${HOME}/.cargo/env"
    
    # shellcheck disable=2154
    wget --header "Authorization: Bearer ${GITHUB_TOKEN}" \
        "https://github.com/lycheeverse/lychee/archive/refs/tags/v0.15.1.tar.gz"
    tar zxvf "v0.15.1.tar.gz"
    
    ### Start measuring time
    
    cd "lychee-0.15.1/" || true
    cargo fetch --target="${ARCH}-unknown-linux-gnu" --locked
    cargo install cargo-auditable --locked
    
    # Set RUSTFLAGS to enable LSE
    export RUSTFLAGS="-Ctarget-feature=+lse"
    
    # The `mold` linker is pre-installed.
    mold -run cargo auditable build --timings --frozen --release
    
    sudo install -Dm755 target/release/lychee -t "/usr/local/bin/"
    
    ### Stop measuring time
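
Because the same script runs on both the Intel and ARM64 runners, it may be worth setting the flag only where it applies. The sketch below is one possible approach, not part of the original answer: lse_rustflags is a hypothetical helper, and it assumes that Linux on AArch64 reports LSE support as the "atomics" entry on the Features line of /proc/cpuinfo.

```shell
# Print the RUSTFLAGS value to use for a given `uname -m` value.
# $1: architecture; $2: cpuinfo-style file to inspect (default: /proc/cpuinfo).
# Prints nothing unless the host is aarch64 and advertises LSE atomics.
lse_rustflags() {
  local arch cpuinfo
  arch="$1"
  cpuinfo="${2:-/proc/cpuinfo}"
  if [ "${arch}" = "aarch64" ] && grep -qw 'atomics' "${cpuinfo}" 2>/dev/null; then
    echo "-Ctarget-feature=+lse"
  fi
}

# Export RUSTFLAGS only when there is something to set.
flags="$(lse_rustflags "$(uname -m)")"
if [ -n "${flags}" ]; then
  export RUSTFLAGS="${flags}"
fi
```

On the x86_64 runner this leaves RUSTFLAGS untouched, so the Intel build is unaffected by an AArch64-only target feature.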