2023-06-10 15:17:12
Recently I have been working on a pull request for Rust. After it was approved,
the CI jobs were automatically started by the bot (bors), to run a full set of tests.
One of the jobs, dist-various-1, failed, and I fixed it in my PR.
In order to make sure that the job can now pass, I manually ran
dist-various-{1,2}, following the instructions from the official guide.
This time both jobs failed with this error:
cp: cannot stat 'obj/build/metrics.json': No such file or directory
##[error]Process completed with exit code 1.
I was pretty sure that I never touched anything related to metrics.json or
to metrics in general. Actually, I didn't even know what metrics meant in this context.
But at least my fix for the original problem seemed to work because this was a new error.
So, what is metrics.json? Where did that error come from?
Searching for metrics.json in the codebase quickly showed that the error came from
the script upload-artifacts.sh, which uploads the build artifacts to S3:
#!/bin/bash
# Upload all the artifacts to our S3 bucket. All the files inside ${upload_dir}
# will be uploaded to the deploy bucket and eventually signed and released in
# static.rust-lang.org.
# ...
# Build metrics generated by x.py.
cp "${build_dir}/metrics.json" "${upload_dir}/metrics-${CI_JOB_NAME}.json"
# ...
Luckily, this script answered my questions: metrics.json contains build metrics, which are collected
and displayed on static.rust-lang.org. The error occurred because metrics.json was
somehow missing.
Why was it missing then? Continuing the search for metrics.json
showed that this function is what generates that file:
impl BuildMetrics {
    // ...
    pub(crate) fn persist(&self, build: &Build) {
        // ...
        let dest = build.out.join("metrics.json");
        // ...
        t!(std::fs::create_dir_all(dest.parent().unwrap()));
        let mut file = BufWriter::new(t!(File::create(&dest)));
        t!(serde_json::to_writer(&mut file, &json));
    }
    // ...
}
And it's called like:
impl Build {
    // ...
    /// Executes the entire build, as configured by the flags and configuration.
    pub fn build(&mut self) {
        // ...
        #[cfg(feature = "build-metrics")]
        self.metrics.persist(self);
    }
    // ...
}
Note that the call is compiled only if the build-metrics feature is enabled.
So probably the feature wasn't enabled in the CI job?
Searching for build-metrics and excluding all the conditional compilation attributes
led me to these lines in bootstrap.py:
class RustBuild(object):
    # ...
    def build_bootstrap(self, color, warnings, verbose_count):
        # ...
        if self.get_toml("metrics", "build"):
            args.append("--features")
            args.append("build-metrics")
        # ...
    # ...
This told me that the feature will be enabled if metrics is set to true
in the [build] section of config.toml.
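The check above amounts to a config lookup that decides whether an extra Cargo feature is passed when building bootstrap. Here is a minimal Python sketch of that decision, assuming the parsed config.toml is available as a nested dict with a boolean value. The function name bootstrap_feature_args is hypothetical; the real script reads the file through its own get_toml helper.

```python
def bootstrap_feature_args(config):
    """Return the extra cargo arguments for building bootstrap.

    `config` is a hypothetical stand-in for the parsed config.toml,
    e.g. {"build": {"metrics": True}}.
    """
    args = []
    # Mirrors `if self.get_toml("metrics", "build")` from the snippet above.
    if config.get("build", {}).get("metrics"):
        args.append("--features")
        args.append("build-metrics")
    return args


print(bootstrap_feature_args({"build": {"metrics": True}}))  # ['--features', 'build-metrics']
print(bootstrap_feature_args({"build": {}}))                 # []
```

Without the [build] metrics setting, no feature flag is passed, so the persist call is never compiled in.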
It also hinted that searching instead for regex /build.metrics/
might be useful. Within the search result,
this if-branch from src/ci/run.sh caught my attention:
if ! isCI || isCiBranch auto || isCiBranch beta || isCiBranch try || isCiBranch try-perf; then
    RUST_CONFIGURE_ARGS="$RUST_CONFIGURE_ARGS --set build.print-step-timings --enable-verbose-tests"
    RUST_CONFIGURE_ARGS="$RUST_CONFIGURE_ARGS --set build.metrics"
    HAS_METRICS=1
fi
Clearly --set build.metrics is only specified if the script is not running in CI,
or if the CI branch is one of
refs/heads/auto, or
refs/heads/beta, or
refs/heads/try, or
refs/heads/try-perf.
Unfortunately the failing job didn't satisfy any of these conditions because its environment included:
2023-06-05T02:37:02.9545370Z CI=true
2023-06-05T02:37:02.9552191Z GITHUB_REF=refs/pull/111626/merge
However, the job triggered by bors had an environment that included:
2023-06-02T22:35:24.2979000Z CI=true
2023-06-02T22:35:24.2990509Z GITHUB_REF=refs/heads/auto
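To make the comparison concrete, the condition from run.sh can be modelled in a few lines of Python. This is only a sketch under two assumptions about helpers not shown here: that isCI is true when CI=true, and that isCiBranch NAME is true when GITHUB_REF equals refs/heads/NAME. The function name metrics_requested is made up for illustration.

```python
CI_BRANCHES_WITH_METRICS = ("auto", "beta", "try", "try-perf")


def metrics_requested(env):
    """Model the `if` condition from src/ci/run.sh quoted earlier.

    Assumption: isCI means CI=true, and `isCiBranch NAME` means
    GITHUB_REF == refs/heads/NAME.
    """
    is_ci = env.get("CI") == "true"
    on_listed_branch = any(
        env.get("GITHUB_REF") == f"refs/heads/{name}"
        for name in CI_BRANCHES_WITH_METRICS
    )
    return (not is_ci) or on_listed_branch


# The two environments taken from the job logs above.
manual_job = {"CI": "true", "GITHUB_REF": "refs/pull/111626/merge"}
bors_job = {"CI": "true", "GITHUB_REF": "refs/heads/auto"}

print(metrics_requested(manual_job))  # False
print(metrics_requested(bors_job))    # True
```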
Back to the beginning: why did the job that I manually launched fail? Because its environment didn't request the build to
produce metrics.json at all. But wait, every time a PR is opened or pushed to, several default jobs run.
Why didn't these jobs fail? They wouldn't take the if-branch either, because they were not launched by bors.
I then checked the CI result of one of these jobs, x86_64-gnu-llvm-14, and found that the step of
uploading build artifacts to S3 was not executed. Looking at the definition of that step,
I saw the following if expression at the end:
- name: upload artifacts to S3
  run: src/ci/scripts/upload-artifacts.sh
  env:
    AWS_ACCESS_KEY_ID: "${{ env.ARTIFACTS_AWS_ACCESS_KEY_ID }}"
    AWS_SECRET_ACCESS_KEY: "${{ secrets[format('AWS_SECRET_ACCESS_KEY_{0}', env.ARTIFACTS_AWS_ACCESS_KEY_ID)] }}"
  if: "success() && !env.SKIP_JOB && (github.event_name == 'push' || env.DEPLOY == '1' || env.DEPLOY_ALT == '1')"
That is, the uploading step is only executed when
all previous steps have succeeded, and
SKIP_JOB is not defined in the environment, and
the triggering event is push, or
DEPLOY is set to 1, or
DEPLOY_ALT is set to 1.
In my case, only the last three OR-ed conditions would make a difference. Comparing the environment of x86_64-gnu-llvm-14
to that of dist-various-1 showed that the latter had DEPLOY
set to 1, due to a previous step in the CI pipeline:
# Builders starting with `dist-` are dist builders, but if they also end with
# `-alt` they are alternate dist builders.
if [[ "${CI_JOB_NAME}" = dist-* ]]; then
    if [[ "${CI_JOB_NAME}" = *-alt ]]; then
        echo "alternate dist builder detected, setting DEPLOY_ALT=1"
        ciCommandSetEnv DEPLOY_ALT 1
    else
        echo "normal dist builder detected, setting DEPLOY=1"
        ciCommandSetEnv DEPLOY 1
    fi
fi
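The same name-based classification can be sketched in Python with glob-style matching. The function name deploy_vars is hypothetical, and the job name ending in -alt is used only as an illustration of the pattern.

```python
from fnmatch import fnmatchcase


def deploy_vars(job_name):
    """Return the deploy variables the step above would set for a job name."""
    if not fnmatchcase(job_name, "dist-*"):
        return {}  # not a dist builder: neither variable is set
    if fnmatchcase(job_name, "*-alt"):
        return {"DEPLOY_ALT": "1"}  # alternate dist builder
    return {"DEPLOY": "1"}  # normal dist builder


print(deploy_vars("dist-various-1"))      # {'DEPLOY': '1'}
print(deploy_vars("x86_64-gnu-llvm-14"))  # {}
print(deploy_vars("dist-example-alt"))    # {'DEPLOY_ALT': '1'}
```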
In addition, the jobs launched by myself had GITHUB_EVENT_NAME=pull_request,
while the jobs launched by bors had GITHUB_EVENT_NAME=push.
Therefore in my case, the job x86_64-gnu-llvm-14 didn't complain about metrics.json because
it skipped the uploading step due to its environment settings. That's why the default jobs didn't fail.
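Putting the pieces together, the three kinds of job can be compared side by side. The following Python sketch models the if expression of the upload step and the resulting outcome. The function names and the outcome strings are made up for illustration, and has_metrics stands for whether run.sh passed --set build.metrics for that job.

```python
def upload_step_runs(env, event_name, previous_steps_ok=True):
    """Model the `if:` expression of the upload step."""
    return (
        previous_steps_ok
        and not env.get("SKIP_JOB")
        and (
            event_name == "push"
            or env.get("DEPLOY") == "1"
            or env.get("DEPLOY_ALT") == "1"
        )
    )


def outcome(has_metrics, env, event_name):
    """Describe what happens to a job at the upload step."""
    if not upload_step_runs(env, event_name):
        return "upload skipped"
    return "upload ok" if has_metrics else "cp fails: metrics.json missing"


# dist-various-1 launched by bors: metrics requested, DEPLOY=1, push event.
print(outcome(True, {"DEPLOY": "1"}, "push"))           # upload ok
# dist-various-1 launched manually: no metrics, DEPLOY=1, pull_request event.
print(outcome(False, {"DEPLOY": "1"}, "pull_request"))  # cp fails: metrics.json missing
# x86_64-gnu-llvm-14 on a PR: no metrics, no DEPLOY, pull_request event.
print(outcome(False, {}, "pull_request"))               # upload skipped
```

Only the manually launched dist job combines a running upload step with a missing metrics.json, which matches the failure described at the start.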