rfxn

Three Bash Projects, Eight Distros, Ninety Seconds

Ryan MacDonald · 10 min read

A stray mapfile -d in a shared library. It parses on the developer's Debian 12 box, passes lint, passes review, and ships. Six hours later a customer on CentOS 6 is staring at mapfile: -d: invalid option from bash 4.1. That is the class of bug we want to fail on commit, not in production. Building a matrix CI that catches it in under two minutes, on hardware we already own, took three components and one afternoon of TLS plumbing.

We ship three Bash projects: APF, BFD, and maldet. Each deploys across a matrix spanning CentOS 6 (2011-era coreutils in /bin), modern Rocky and Ubuntu, and whatever long-tail distros our users are still on. For years the answer was “run it in a VM and see what breaks,” which scales to one OS, poorly to three, and not at all to eight. Cloud CI charges per minute and our workload is dominated by cold builds. We wanted the full matrix to finish in the time it takes to make coffee, on hardware we already own.

Stack at a Glance

  • BATS runner (TAP output, bash functions as tests)
  • batsman shared harness, git submodule at tests/infra/
  • Docker over TCP with mutual TLS on port 2376
  • BuildKit enabled daemon-side so the layer cache works
  • Parallel fanout across eight OS containers, TAP to stdout
  • Fallback lane to local Docker when the build host is down

Why BATS

When the code under test is bash, the tempting answer is a higher-level runner in Python or Go. We tried it. The integration layer (spawn, capture, diff) grew to outweigh the tests, and it lied about the runtime: a subprocess launched from pytest has a different inherited environment than cron or systemd, and our regressions live exactly in that seam.

BATS (Bash Automated Testing System) is a shell-native test runner. Each test is a bash function. The run helper captures stdout, stderr, and exit code; assertions hang off $status and $output. Output conforms to the Test Anything Protocol, so it pipes straight into any CI dashboard that understands TAP.

```bash
#!/usr/bin/env bats
load '/usr/local/lib/bats/bats-support/load'
load '/usr/local/lib/bats/bats-assert/load'
source /opt/tests/helpers/assert-scan.bash

SAMPLES_DIR="/opt/tests/samples"
TEST_SCAN_DIR="/tmp/lmd-test-scan"

setup() {
    source /opt/tests/helpers/reset-lmd.sh
    mkdir -p "$TEST_SCAN_DIR"
}

teardown() {
    rm -rf "$TEST_SCAN_DIR"
}

@test "MD5 scan detects known test sample (EICAR)" {
    cp "$SAMPLES_DIR/eicar.com" "$TEST_SCAN_DIR/"
    run maldet -a "$TEST_SCAN_DIR"
    assert_scan_completed
    assert_output --partial "malware hits 1"
}
```

That is the whole test. It invokes the real maldet binary the way cron would, in the container's actual environment, and asserts on the output an operator would see. No mocks. No fixtures that drift. When a test fails, bash -x the failing bit and read the trace.

Each project keeps a helpers/ directory of domain-specific assert_* functions (scan completion, quarantine presence, firewall chain state). Every test sources bats-support and bats-assert from the container image. The tests read like prose and measure what production actually does.
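
A minimal sketch of what one of those helpers might look like, assuming maldet prints a completion banner containing "scan complete" (the banner string, variable defaults, and the helper's internals here are illustrative, not maldet's actual output contract):

```bash
#!/usr/bin/env bash
# Hypothetical shape of helpers/assert-scan.bash. In BATS, `run` populates
# $status and $output before any assert_* function is called.

assert_scan_completed() {
    # A scan that exits non-zero never completed; report the code.
    if [ "${status:-1}" -ne 0 ]; then
        echo "scan exited non-zero: ${status:-unset}" >&2
        return 1
    fi
    # A zero exit without the banner is also a failure worth catching.
    case "$output" in
        *"scan complete"*) return 0 ;;
        *) echo "no scan-completion banner in output" >&2
           return 1 ;;
    esac
}
```

The payoff is at the call site: the test reads assert_scan_completed, and the failure message tells you which half of the contract broke.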

Why Containers, Not VMs

APF, BFD, and maldet each need to work on CentOS 6, CentOS 7, Rocky 8, Rocky 9, Ubuntu 20.04, Ubuntu 24.04, Debian 12, and (project-dependent) a legacy ubuntu1204 or a Rocky 10 preview. Three projects × eight images, rebuilt on every library change, is a container workload, not a VM workload.

Containers give us three things a VM farm does not:

  • Cold start in seconds. A fresh Rocky 9 container is running bash in under a second; a VM takes a minute even with KSM tricks.
  • Layer cache. A deterministic base layer (bash, coreutils, iptables, ClamAV, bats-*) caches once. Test-only changes are a no-op rebuild; source changes are a COPY. VMs provision from scratch.
  • Declarative images. Each Dockerfile.<os> is in version control. A new Rocky minor that breaks something is a diff on a twenty-line file, not a VM template rebuild.

CentOS 6 is the interesting one. Long past EOL, but our user base still includes hosts that refuse to leave it. Frozen base images live on Docker Hub and in community mirrors; they run under modern Docker with vsyscall=emulate on the host kernel. Our image installs bash 4.1 and coreutils from the Vault repo. When a dev used mapfile -d (bash 4.4+) in a shared library, the CentOS 6 container caught it before a user did. That one container justifies the whole stack.
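
Whether a host is actually booted with that flag is a one-line check; a sketch (the check is standard Linux, the hint text is ours):

```bash
# CentOS 6 era binaries rely on the legacy vsyscall page, which modern
# kernels provide only when booted with vsyscall=emulate on the kernel
# command line (GRUB_CMDLINE_LINUX in /etc/default/grub).
if grep -qw 'vsyscall=emulate' /proc/cmdline 2>/dev/null; then
    vsyscall_state="present"
else
    vsyscall_state="missing (add vsyscall=emulate to the kernel cmdline, then reboot)"
fi
echo "vsyscall=emulate: $vsyscall_state"
```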

The batsman Submodule

Three projects, three harnesses, three copies of the same runner, drifting independently. First cut, painful. Every fix landed three times. Every new distro was added three times. Runner updates in one project took weeks to reach the others.

The fix is batsman, a shared harness repo. It owns the Docker runner, the Makefile.tests include fragment, and a base set of Dockerfiles per OS target. Consumers pull it in as a git submodule at tests/infra/.

```bash
# In each consumer repo
$ git submodule add https://github.com/rfxn/batsman.git tests/infra
$ ls tests/infra/
  dockerfiles/    include/     lib/     scripts/

# Project-level Makefile (include the shared fragment)
$ cat tests/Makefile
BATSMAN_PROJECT   := lmd
BATSMAN_OS_MODERN := debian12 rocky9 ubuntu2404
BATSMAN_OS_LEGACY := centos7 rocky8 ubuntu2004
BATSMAN_OS_DEEP   := centos6
export BATSMAN_TEST_TIMEOUT ?= 180
include infra/include/Makefile.tests
```

Submodule semantics give us what we need. Each consumer pins a specific batsman commit. Upgrades are intentional: run git submodule update --remote tests/infra, verify the suite passes, commit the new pointer. No silent drift.

BATSMAN_PROJECT is the project's identity. The runner keys off it for the install script, the default container path (/opt/tests by convention), and the container name prefix. Everything else (OS list, per-test timeouts, extra Docker flags) is a Make variable consumers override only when needed.

Consumers own the Dockerfiles that install their own code. batsman ships base images with bash, coreutils, and bats preinstalled; each project's tests/dockerfiles/Dockerfile.<os> extends the base (maldet adds ClamAV and YARA; APF adds iptables and ipset). The shared layer caches once; the project layer rebuilds only when project files change.
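
In sketch form, a project layer might look like this (the base image name, package set, and COPY paths are illustrative; the real batsman layout may differ):

```dockerfile
# tests/dockerfiles/Dockerfile.rocky9 -- hypothetical project layer.
# The FROM target is an assumed name for batsman's base image, which
# already carries bash, coreutils, and the bats-* libraries.
FROM batsman-base:rocky9

# Project-specific dependencies live in this layer only; it rebuilds when
# this file changes, not when tests or source change.
RUN dnf install -y clamav clamav-update && dnf clean all

# Helpers and committed fixtures bake into the image; the source tree
# itself is bind-mounted at run time, not copied.
COPY tests/helpers/ /opt/tests/helpers/
COPY tests/samples/ /opt/tests/samples/
```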

Docker over TCP with Mutual TLS

Test host and dev host are different machines. We develop on freedom (the workstation) and test on anvil (a dedicated LAN build box with no other job). Running the matrix on freedom freezes the workstation for two minutes per push. Running it on anvil means shipping source over and collecting results somehow.

The common answer is SSH: pipe docker commands through ssh or rsync the source and trigger a remote make. Both work. Both add per-command connection overhead, serialize anything that could stream, and fight with the Docker client's assumption that it is talking to a daemon directly. BuildKit streaming in particular falls on its face over SSH pipes.

The better answer is the one Docker was designed for: expose the daemon over TCP with mutual TLS, and point the client at it.

| Property | SSH tunnel | Docker over TCP + mTLS |
| --- | --- | --- |
| per-command handshake | yes (SSH connect each time) | no (persistent TLS) |
| BuildKit streaming | broken / degraded | native |
| parallel docker run | serialized by tunnel | actually parallel on daemon |
| source at rest on build host | yes (rsync copy) | no (bind-mount only) |
| auth model | SSH keys | client cert signed by CA |
[Figure: BATS over Docker TCP topology. freedom (dev host) runs make -C tests test through the docker client, with certs in ~/.docker/tls and DOCKER_HOST pointing at tcp port 2376 over mutual TLS. anvil (build host) runs dockerd on :2376 with DOCKER_BUILDKIT=1 and the layer cache (base + project layers), and fans out one docker run per OS container (centos6, centos7, rocky8, rocky9, debian12, ubuntu2004, ubuntu2404), each with bash + coreutils + bats, bind-mounting tests/ and streaming TAP to stdout, teed to /tmp/test-*.log. One client, one mTLS boundary, one daemon, N parallel containers.]

anvil's daemon listens on tcp://0.0.0.0:2376. Server certs at /etc/docker/tls/, freedom's client cert and matching CA at ~/.docker/tls/. Both sides authenticate on every connection; the daemon rejects any client cert not signed by the CA. Running the matrix is a one-liner:

```bash
# Run the full matrix on anvil, stream results back to /tmp
DOCKER_HOST=tcp://192.168.2.189:2376 \
DOCKER_TLS_VERIFY=1 \
DOCKER_CERT_PATH=~/.docker/tls \
  make -C tests test 2>&1 | tee /tmp/test-lmd-matrix.log | tail -30

# Or use a named context (one-time setup, no env vars after)
$ docker context create anvil \
    --docker host=tcp://192.168.2.189:2376,ca=~/.docker/tls/ca.pem,\
cert=~/.docker/tls/cert.pem,key=~/.docker/tls/key.pem
$ docker --context anvil ps
$ DOCKER_CONTEXT=anvil make -C tests test   # or: docker context use anvil
```
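
The daemon half of this setup is a handful of keys in anvil's /etc/docker/daemon.json; a sketch that mirrors the cert paths used above (the exact file layout on anvil is an assumption):

```json
{
  "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2376"],
  "tlsverify": true,
  "tlscacert": "/etc/docker/tls/ca.pem",
  "tlscert": "/etc/docker/tls/server-cert.pem",
  "tlskey": "/etc/docker/tls/server-key.pem"
}
```

One distro-specific wrinkle: on systemd hosts the stock unit file passes its own -H flag to dockerd, which conflicts with "hosts" in daemon.json; a small systemd drop-in that clears the flag is usually needed.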

Named Docker contexts make the everyday case one word (--context anvil) instead of three env vars. The important properties are all practical. The client sees a local-feeling Docker daemon. BuildKit streams work natively. Parallel docker run calls for different OS containers actually run in parallel on the daemon side. There is no SSH connect handshake on every command, and no copy of the source code on anvil at rest. Source gets bind-mounted into each container for the duration of its run and disappears with the container.
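
The fanout itself is nothing exotic; shaped roughly like this (image names, paths, and the function are illustrative, not batsman's actual runner):

```bash
# Hypothetical core of the matrix fanout: one backgrounded `docker run`
# per OS image, each streaming TAP through tee to a per-OS log, then a
# single wait. The daemon schedules the containers in parallel.
run_matrix() {
    local os
    for os in centos6 centos7 rocky8 rocky9 debian12 ubuntu2004 ubuntu2404; do
        docker run --rm -v "$PWD:/opt/tests:ro" "lmd-test:$os" \
            bats --tap /opt/tests 2>&1 | tee "/tmp/test-lmd-$os.log" &
    done
    wait   # block until every container has reported
}
```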

The mTLS part is not fancy: a CA, a server cert for anvil, a client cert for freedom, ten-year validity, key directories at chmod 700. This is not a zero-trust stack. It is a pragmatic one-admin, two-host setup that keeps the daemon off the public internet and gates access on possession of a specific client cert.
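
Spelled out, that one-time ceremony is a handful of openssl commands; a sketch using this article's hostnames (key sizes, subject names, and file names are our choices):

```bash
# One-time mTLS plumbing (sketch). Ten-year validity per the setup above.
# CA
openssl genrsa -out ca-key.pem 2048
openssl req -new -x509 -days 3650 -key ca-key.pem -out ca.pem \
    -subj "/CN=lan-docker-ca"

# Server cert for anvil; the SAN must cover the IP clients actually dial
openssl genrsa -out server-key.pem 2048
openssl req -new -key server-key.pem -out server.csr -subj "/CN=anvil"
printf 'subjectAltName = IP:192.168.2.189,IP:127.0.0.1\n' > server-ext.cnf
openssl x509 -req -days 3650 -in server.csr -CA ca.pem -CAkey ca-key.pem \
    -CAcreateserial -out server-cert.pem -extfile server-ext.cnf

# Client cert for freedom; extendedKeyUsage gates it to client auth
openssl genrsa -out key.pem 2048
openssl req -new -key key.pem -out client.csr -subj "/CN=freedom"
printf 'extendedKeyUsage = clientAuth\n' > client-ext.cnf
openssl x509 -req -days 3650 -in client.csr -CA ca.pem -CAkey ca-key.pem \
    -CAcreateserial -out cert.pem -extfile client-ext.cnf

# Private keys never leave their host; directories stay chmod 700
chmod 600 ./*-key.pem key.pem
```

ca.pem plus server-cert.pem/server-key.pem land in /etc/docker/tls/ on anvil; ca.pem plus cert.pem/key.pem land in ~/.docker/tls/ on freedom.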

The BuildKit Gotcha


Without DOCKER_BUILDKIT=1 in the build host's shell environment, the daemon falls back to the legacy builder. BuildKit owns the content-addressable layer cache for TCP-delivered build contexts. The legacy builder does not. Every build rebuilds every layer. A warm matrix run that should take ninety seconds takes fifteen minutes instead. See the Docker Engine release notes for the history of BuildKit becoming the default.

We hit this three times before we believed it. The symptom is boring: anvil feels slow for no reason. The fix is one line in anvil's ~/.bashrc:

```bash
# On anvil (the build host), exported for all shells
# Docker's legacy builder ignores layer cache for TCP-delivered build
# contexts; BuildKit is required for the cache to work at all.
export DOCKER_BUILDKIT=1
```

The test runner also prints a reminder to check echo $DOCKER_BUILDKIT on the daemon host if a warm-cache build exceeds five minutes. Config drift on the daemon side is invisible from the client; latency is the only signature, and latency is easy to blame on the network.

Timings

The numbers we actually hit, measured across recent runs:


| Configuration | Target | Warm | Cold |
| --- | --- | --- | --- |
| local (freedom) | single OS (Debian 12) | ~70s | +60s image build |
| anvil over TCP | single OS (Debian 12) | ~45s | +30s image build |
| anvil over TCP | full matrix (8 OSes) | ~110s | +4-6 min image builds |
| anvil, no BuildKit | full matrix (8 OSes) | ~15 min | ~15 min |

anvil beats local on single-OS runs because it has no other workload competing for CPU. The full-matrix number is the one that matters: eight distros, warm-cache run completes before you switch windows.

“Local iteration for single files. Anvil for the full matrix. Fallback for anvil outages. Three lanes, same command.”

Local runs still earn their keep. Iterating on a single .bats file against Debian 12 means no network hop; once it passes on the most forgiving OS, kick the full anvil run and go read Hacker News.

That matrix run is the payoff for everything else. When we rewrote maldet's scan engine and benchmarked a 43x parallel-scan speedup, proving it across every supported distro before merge took ninety seconds, not an afternoon of manual VM runs.

Fallback Discipline

anvil is a single point of failure. When the LAN is flaky or anvil is rebooting for kernel updates, the fallback is freedom's own daemon, listening with the same mTLS setup on 127.0.0.1:2376:

```bash
# Fallback: run on freedom's local Docker daemon
DOCKER_HOST=tcp://127.0.0.1:2376 \
DOCKER_TLS_VERIFY=1 \
DOCKER_CERT_PATH=~/.docker/tls \
  make -C tests test 2>&1 | tee /tmp/test-lmd-debian12.log | tail -30

# Or via a named context
$ docker --context freedom-tcp ps
$ DOCKER_CONTEXT=freedom-tcp make -C tests test
```

Fallback is also the right lane for tests that need host-local bind-mounted data: a production signature snapshot, a captured pcap, a corpus too large to sync. Those get tagged “freedom-only” in the Makefile and skipped on anvil. The rule is simple: if the test needs data outside the git tree, it runs on freedom; committed fixtures under tests/samples/ run anywhere.
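
One way the tag could be enforced at test level; a sketch (BATSMAN_ALLOW_HOST_DATA and the helper are assumptions, not batsman's real mechanism):

```bash
# Hypothetical guard helper for host-local suites. The harness would
# export BATSMAN_ALLOW_HOST_DATA=1 only when running on freedom's daemon.
host_data_available() {
    [ "${BATSMAN_ALLOW_HOST_DATA:-0}" = "1" ]
}

# Usage inside a .bats file:
#   @test "scan against production signature snapshot" {
#       host_data_available || skip "freedom-only: needs host-local data"
#       run maldet -a /data/snapshots/latest
#       assert_scan_completed
#   }
```

BATS reports skipped tests in the TAP stream, so a matrix run on anvil still shows the freedom-only tests as skipped rather than silently absent.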

What We Test, What We Skip

Container-based testing covers runtime behavior. It does not cover every install-time path. Here is the honest split.

Caught reliably

  • Shell portability: bash version skew, pre- vs post-usr-merge coreutils, absent binaries on minimal images.
  • Regression coverage: every bug we fix gets a regression @test in the nearest BATS file. Fails loudly if the bug returns.
  • CLI surface: every argument, every config variable, every documented exit code.

Still needs a live host

  • Cron delivery: containers are short-lived and cron is not running. We test the scripts cron.daily invokes, not cron itself firing them.
  • Init integration: systemd units and SysV scripts only run if the image includes the init system. Most minimal images do not.
  • Install-time paths under an existing package-manager install (RPM/DEB upgrade): the fresh-install symlinks get tested; upgrade-path behavior only gets covered on a live host.

We are fine with that gap, and honest about its shape. The container suite catches every bug class that lint misses and runs fast enough to be a pre-commit hook. Install-path coverage (cron firing, init wrappers, package-manager upgrade paths) runs on a small set of live VMs on the release path, not on every commit. This is not a production CI replacement for every team: it is single-admin, two-host, and the mTLS setup presumes you are comfortable signing your own certs. If those assumptions hold, the rest is mechanical.

Conclusion

The bill for this CI stack: one LAN build host we already owned, one-time effort on ten Dockerfiles and a shared Makefile fragment, an afternoon on TLS certs. No monthly invoice, no per-minute billing, no queue depth, no provider outages.

The payoff is every commit getting exercised against the distro matrix our users actually run. A bash quirk that only bites on CentOS 6 fails the CentOS 6 test. A mapfile -d in a shared library fails the moment it hits the legacy container. That feedback loop is the difference between shipping confidently and shipping cautiously. For the companion piece on why the matrix matters in the first place (the coreutils-location split, command prefix discipline, and other portability landmines), see portable bash across the pre-usr-merge boundary.

APF, BFD, maldet, and batsman are all open source under GPLv2.

The Minimum Stack

If you are shipping bash (or any shell-adjacent code) to a distro matrix, the pattern reproduces from scratch in an afternoon:

  1. Pick bats-core as the runner; add bats-assert and bats-support for readable failures.
  2. Template a Dockerfile.<os> per target; keep base layer (bash, coreutils, bats) separate from the project layer so the cache helps.
  3. On a spare LAN box, enable Docker over TCP with mutual TLS: CA, one server cert, one client cert, lock key perms, done.
  4. Export DOCKER_BUILDKIT=1 in the daemon host's shell so the layer cache works for TCP-delivered contexts.
  5. Wire a named Docker context on the client, point make test at it, keep a local-Docker context as fallback.

No monthly bill. No queue depth. No provider outages. Just the matrix, running on hardware you already own.
