Debugging a Rust Upstream CI Mystery

2023-06-10 15:17:12

Recently I have been working on a pull request for Rust. After it was approved, the bot (bors) automatically started the CI jobs to run the full set of tests. One of the jobs, dist-various-1, failed, and I fixed the failure in my PR. To make sure that the job would now pass, I manually ran dist-various-{1,2}, following the instructions in the official guide. This time both jobs failed with this error:

cp: cannot stat 'obj/build/metrics.json': No such file or directory
##[error]Process completed with exit code 1.

I was pretty sure that I had never touched anything related to metrics.json or to metrics in general. Actually, I didn't even know what metrics meant in this context. But at least my fix for the original problem seemed to work, because this was a new error.

So, what is metrics.json? Where did that error come from? Searching for metrics.json in the codebase quickly showed that the error came from upload-artifacts.sh, the script that uploads the build artifacts to S3:

#!/bin/bash
# Upload all the artifacts to our S3 bucket. All the files inside ${upload_dir}
# will be uploaded to the deploy bucket and eventually signed and released in
# static.rust-lang.org.
# ...
# Build metrics generated by x.py.
cp "${build_dir}/metrics.json" "${upload_dir}/metrics-${CI_JOB_NAME}.json"
# ...

Luckily, this script answered my questions: metrics.json contains build metrics, which are collected and displayed on static.rust-lang.org. The error occurred because metrics.json was somehow missing.

Why was it missing, then? Continuing the search for metrics.json showed that this function generates the file:

impl BuildMetrics {
    // ...
    pub(crate) fn persist(&self, build: &Build) {
        // ...
        let dest = build.out.join("metrics.json");
        // ...
        t!(std::fs::create_dir_all(dest.parent().unwrap()));
        let mut file = BufWriter::new(t!(File::create(&dest)));
        t!(serde_json::to_writer(&mut file, &json));
    }
    // ...
}

And it is called like this:

impl Build {
    // ...
    /// Executes the entire build, as configured by the flags and configuration.
    pub fn build(&mut self) {
        // ...
        #[cfg(feature = "build-metrics")]
        self.metrics.persist(self);
    }
    // ...
}

Note that the call is compiled only if the build-metrics feature is enabled. So probably the feature wasn't enabled in the CI job? Searching for build-metrics and excluding all the conditional compilation attributes led me to these lines in bootstrap.py:

class RustBuild(object):
    # ...
    def build_bootstrap(self, color, warnings, verbose_count):
        # ...
        if self.get_toml("metrics", "build"):
            args.append("--features")
            args.append("build-metrics")
        # ...
    # ...

This told me that the feature is enabled if metrics is set to true in the [build] section of config.toml. It also hinted that searching for the regex /build.metrics/ instead might be useful. Among the search results, this if-branch from src/ci/run.sh caught my attention:

if ! isCI || isCiBranch auto || isCiBranch beta || isCiBranch try || isCiBranch try-perf; then
    RUST_CONFIGURE_ARGS="$RUST_CONFIGURE_ARGS --set build.print-step-timings --enable-verbose-tests"
    RUST_CONFIGURE_ARGS="$RUST_CONFIGURE_ARGS --set build.metrics"
    HAS_METRICS=1
fi

Clearly, --set build.metrics is only specified if

the build is not run in CI, or
the branch being built is refs/heads/auto, or
the branch being built is refs/heads/beta, or
the branch being built is refs/heads/try, or
the branch being built is refs/heads/try-perf

Unfortunately, the failing job didn't satisfy any of these conditions, because its environment included:

2023-06-05T02:37:02.9545370Z CI=true
2023-06-05T02:37:02.9552191Z GITHUB_REF=refs/pull/111626/merge

However, the job triggered by bors had an environment that included:

2023-06-02T22:35:24.2979000Z CI=true
2023-06-02T22:35:24.2990509Z GITHUB_REF=refs/heads/auto
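
To double-check that reading of the condition, it can be modeled in a few lines of Python and fed both environments. This is only a sketch: the function name metrics_enabled is made up for illustration, and it assumes that isCI amounts to checking CI=true and that isCiBranch NAME amounts to checking GITHUB_REF=refs/heads/NAME, as described in the list above.

```python
# Branches for which run.sh passes --set build.metrics when running in CI.
METRICS_BRANCHES = ("auto", "beta", "try", "try-perf")


def metrics_enabled(env):
    """Model of the run.sh condition (assumed semantics, not the real script)."""
    # Assumption: isCI is equivalent to CI=true.
    is_ci = env.get("CI") == "true"
    # Assumption: isCiBranch NAME is equivalent to GITHUB_REF=refs/heads/NAME.
    ref = env.get("GITHUB_REF", "")
    on_metrics_branch = any(ref == f"refs/heads/{b}" for b in METRICS_BRANCHES)
    return (not is_ci) or on_metrics_branch


# The two environments quoted from the CI logs.
manual_job = {"CI": "true", "GITHUB_REF": "refs/pull/111626/merge"}
bors_job = {"CI": "true", "GITHUB_REF": "refs/heads/auto"}

print(metrics_enabled(manual_job))  # False
print(metrics_enabled(bors_job))    # True
```

Under those assumptions, only the bors job asks the build to write metrics.json.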

Back to the original question: why did the job that I manually launched fail? Because its environment never requested that the build produce metrics.json. But wait: every time a PR is opened or pushed to, several default jobs run. Why didn't these jobs fail? They shouldn't take the if-branch either, because they were not launched by bors. I then checked the CI result of one of these jobs, x86_64-gnu-llvm-14, and found that the step that uploads build artifacts to S3 was not executed. Looking at the definition of that step, I saw the following if expression at the end:

      - name: upload artifacts to S3
        run: src/ci/scripts/upload-artifacts.sh
        env:
          AWS_ACCESS_KEY_ID: "${{ env.ARTIFACTS_AWS_ACCESS_KEY_ID }}"
          AWS_SECRET_ACCESS_KEY: "${{ secrets[format('AWS_SECRET_ACCESS_KEY_{0}', env.ARTIFACTS_AWS_ACCESS_KEY_ID)] }}"
        if: "success() && !env.SKIP_JOB && (github.event_name == 'push' || env.DEPLOY == '1' || env.DEPLOY_ALT == '1')"

In other words, the upload step is only executed when

all the previous steps have succeeded, and
SKIP_JOB is not defined in the environment, and
either
- the triggering event is push, or
- DEPLOY is set to 1, or
- DEPLOY_ALT is set to 1
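
The same expression can be written as a small Python predicate, which makes it easy to try different combinations. This is a simplified sketch: should_upload and its parameters are hypothetical names, and it treats an unset or empty SKIP_JOB as "not defined", which is only an approximation of how GitHub Actions evaluates !env.SKIP_JOB.

```python
def should_upload(previous_steps_ok, env, event_name):
    """Model of the step's if expression (a sketch, not GitHub's evaluator)."""
    # Assumption: an unset or empty SKIP_JOB counts as "not defined".
    skip_job = bool(env.get("SKIP_JOB"))
    wants_deploy = (
        event_name == "push"
        or env.get("DEPLOY") == "1"
        or env.get("DEPLOY_ALT") == "1"
    )
    return previous_steps_ok and not skip_job and wants_deploy


# A pull_request job without any DEPLOY variable skips the upload step.
print(should_upload(True, {}, "pull_request"))               # False
# Setting DEPLOY=1 is enough to run it, even for a pull_request event.
print(should_upload(True, {"DEPLOY": "1"}, "pull_request"))  # True
# A push event runs it regardless of the DEPLOY variables.
print(should_upload(True, {}, "push"))                       # True
```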

In my case, only the last three OR-ed conditions made a difference. Comparing the environment of x86_64-gnu-llvm-14 with that of dist-various-1 showed that the latter had DEPLOY set to 1, set by an earlier step in the CI pipeline:

# Builders starting with `dist-` are dist builders, but if they also end with
# `-alt` they are alternate dist builders.
if [[ "${CI_JOB_NAME}" = dist-* ]]; then
    if [[ "${CI_JOB_NAME}" = *-alt ]]; then
        echo "alternate dist builder detected, setting DEPLOY_ALT=1"
        ciCommandSetEnv DEPLOY_ALT 1
    else
        echo "normal dist builder detected, setting DEPLOY=1"
        ciCommandSetEnv DEPLOY 1
    fi
fi

In addition, the jobs I launched had GITHUB_EVENT_NAME=pull_request, while the jobs launched by bors had GITHUB_EVENT_NAME=push. Therefore, in my case, the job x86_64-gnu-llvm-14 didn't complain about metrics.json because its environment caused it to skip the upload step entirely. That's why the default jobs didn't fail.
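
Putting the pieces together, the whole mystery fits in one small Python model: a job trips over the missing file exactly when it runs the upload step without having been asked to produce metrics. The sketch below is illustrative only. The function names are made up, and it assumes CI=true, that all earlier steps succeed, that SKIP_JOB is unset, and that x86_64-gnu-llvm-14 ran on the same pull request ref as my manual job.

```python
from fnmatch import fnmatchcase

# Branches on which run.sh enables build.metrics when running in CI.
METRICS_REFS = {f"refs/heads/{b}" for b in ("auto", "beta", "try", "try-perf")}


def deploy_env(job_name):
    """Mirror the dist-builder detection: which DEPLOY variable gets set."""
    if fnmatchcase(job_name, "dist-*"):
        if fnmatchcase(job_name, "*-alt"):
            return {"DEPLOY_ALT": "1"}
        return {"DEPLOY": "1"}
    return {}


def hits_missing_metrics(job_name, event_name, github_ref):
    """True if the upload step runs although metrics.json was never written."""
    env = deploy_env(job_name)
    uploads = event_name == "push" or bool(env)
    has_metrics = github_ref in METRICS_REFS  # assumes CI=true
    return uploads and not has_metrics


pr_ref = "refs/pull/111626/merge"
print(hits_missing_metrics("dist-various-1", "pull_request", pr_ref))        # True
print(hits_missing_metrics("x86_64-gnu-llvm-14", "pull_request", pr_ref))    # False
print(hits_missing_metrics("dist-various-1", "push", "refs/heads/auto"))     # False
```

Only the first combination, a dist builder started manually on a pull request, uploads without metrics, which matches what I saw.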
