Zero Downtime Deployment Pipeline | Case Study

The Problem

Manual SSH-based Magento deployments are inherently risky. They require developer presence, create maintenance windows that mean real downtime, have no automatic rollback capability, and are inconsistently executed — especially under pressure when something goes wrong mid-deploy. For an agency managing production deployments across multiple client stores, every deployment event carries meaningful risk.

The consequences of a bad deployment are disproportionate to the size of the change. A one-line config fix deployed badly causes the same user-facing outage as a major release gone wrong.

The Solution

A fully automated zero-downtime deployment pipeline built on GitHub Actions and Ansible. The pipeline handles every stage from pre-deploy validation through to post-deploy health verification, with automatic rollback if health checks fail after cutover.

Stage 1 — Pre-Deploy Validation

Before any code changes reach the production web tier:

Composer validation: Dependencies resolve cleanly against the target environment’s PHP version and extension matrix
Static content deploy: SCD completes successfully for all configured themes and locales; output verified for completeness
Database migration dry-run: Pending schema changes checked for destructive operations; migration list logged to deployment record
Application health: Pre-deploy health endpoint verified on the incoming build directory before traffic is shifted

Failures at this stage abort the deployment with no change to production. The failure is logged, an alert dispatched, and the deployment marked failed in the audit log.

Stage 2 — Atomic Cutover

The new application build is deployed to a parallel release directory on the server (standard symlink-based release management). Cutover is a single symlink update — the window during which production points at an indeterminate state is measured in milliseconds.

PHP-FPM reload (not restart): Connection pool maintained through the cutover. Long-running requests complete against the previous build. New requests begin serving from the new build immediately.

Varnish cache clear happens after cutover, not before. Clearing before cutover causes unnecessary cache misses against old code; clearing after ensures new responses populate the cache from the new build.

Stage 3 — Post-Deploy Verification

Immediately after cutover:

Health endpoint verified against the new live build
New Relic transaction error rate checked for spike against the pre-deploy baseline
Varnish hit rate verified — a sharp drop signals a misconfiguration in the new build’s cache headers
Response time P95 checked for regression

If any check fails: Automatic rollback — symlink reverted, PHP-FPM reloaded against the previous release, Varnish cleared, alert dispatched. Time from failure detection to restored production: under 30 seconds.

Deployment Record

Every deployment generates an audit record: who triggered it, what commit, which stores, pre/post health check results, and outcome. Rollbacks are recorded against the failed deployment with the triggering health check attached.

Why This Matters

Rollback capability changes how teams think about releasing. When a bad deployment can be automatically reversed in 30 seconds before users notice, the calculation around releasing changes. Teams ship more frequently — smaller changesets, more predictable diffs, lower per-deployment risk.

The operational pattern before this pipeline: deployments scheduled for low-traffic windows, engineers on standby, manual rollback procedure. The pattern after: deployments run in CI like any other job, automatic verification, no standby required.

What We Learned

The most important investment in zero-downtime deployments is the pre-deploy validation stage. Most failed deployments are caused by things that could have been detected before cutover — missing environment variables, failed SCD, problematic migrations. Catching these before cutover means the rollback cost is zero.

The automatic rollback has triggered. Catching a bad deployment automatically, in under 30 seconds, before a client noticed anything was wrong, demonstrates the value more clearly than any metric.

#The Problem

#The Solution

#Stage 1 — Pre-Deploy Validation

#Stage 2 — Atomic Cutover

#Stage 3 — Post-Deploy Verification

#Deployment Record

#Why This Matters

#What We Learned