High-Availability Magento Hosting at SMB Cost | Case Study

The Problem

Enterprise Magento hosting — Adobe Commerce Cloud, Nexcess, Acquia — is priced for enterprise budgets. SMB Magento merchants with stores turning over £500k–£5m don’t have those budgets, but they have the same uptime and performance requirements. Single-server shared hosting is accessible but brittle: a hardware failure or traffic spike causes downtime with no automatic recovery.

The gap between “what SMB merchants can afford” and “what production Magento actually needs” was a real constraint. The goal was to close it.

The Architecture

The platform is a multi-service infrastructure designed for high availability from the ground up. It is deployed on commodity cloud infrastructure (primarily AWS), with cost optimisation built into each architecture decision.

Web Tier

Multiple EC2 application nodes behind an Application Load Balancer. Auto Scaling Group configured to trigger on CPU utilisation and request queue depth — the cluster scales out automatically under load and scales in during quiet periods to control cost.
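The scale-out and scale-in behaviour can be sketched as a target-tracking calculation. This is a minimal illustration, not the production scaling policy: it considers CPU only (the real trigger also uses request queue depth), and the 60% target and the 2 to 8 node bounds are assumed values chosen for the example.

```python
import math


def desired_capacity(current_nodes, avg_cpu_percent,
                     target_cpu_percent=60.0, min_nodes=2, max_nodes=8):
    """Return the node count that would bring average CPU back to the target.

    The target and the min/max bounds are illustrative assumptions.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Size the cluster in proportion to how far load is from the target.
    raw = math.ceil(current_nodes * avg_cpu_percent / target_cpu_percent)
    # Clamp so the cluster never drops below the HA floor or above the cost cap.
    return max(min_nodes, min(max_nodes, raw))


print(desired_capacity(3, 90.0))   # 5: scale out under load
print(desired_capacity(4, 15.0))   # 2: scale in, but keep the minimum
```

Keeping the minimum at two nodes or more is what preserves availability during quiet periods; the maximum caps spend during a spike.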

Session storage is on Redis (not in-process), so requests can be handled by any node without sticky routing. A node can be terminated without any request failing — the ALB health check removes it from rotation and the scaling group replaces it.
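Why external session storage removes the need for sticky routing can be shown with a small simulation. Everything here is hypothetical scaffolding: `SharedSessionStore` is an in-memory stand-in for Redis, and `pick_node` is a simplified round-robin over healthy nodes rather than the ALB's actual algorithm.

```python
class SharedSessionStore:
    """In-memory stand-in for Redis: one store shared by every node."""

    def __init__(self):
        self._carts = {}

    def get_cart(self, session_id):
        return list(self._carts.get(session_id, []))

    def save_cart(self, session_id, cart):
        self._carts[session_id] = list(cart)


class AppNode:
    """An application node that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.healthy = True

    def add_to_cart(self, session_id, sku):
        cart = self.store.get_cart(session_id)
        cart.append(sku)
        self.store.save_cart(session_id, cart)
        return cart


def pick_node(nodes, request_number):
    """Round-robin over nodes that currently pass the health check."""
    healthy = [node for node in nodes if node.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes in rotation")
    return healthy[request_number % len(healthy)]


store = SharedSessionStore()
nodes = [AppNode("web-a", store), AppNode("web-b", store)]

pick_node(nodes, 0).add_to_cart("sess-1", "SKU-1")
nodes[0].healthy = False  # web-a is terminated and drops out of rotation
cart = pick_node(nodes, 1).add_to_cart("sess-1", "SKU-2")
print(cart)  # the cart built on web-a is still visible on web-b
```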

Database Tier

MySQL primary with read replicas. Magento’s read-heavy queries (product lists, category pages, search) route to replicas. Write operations go to the primary. Point-in-time recovery is enabled via automated backups and retained transaction logs.
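The read/write split follows a simple rule, sketched below. This is an illustration of the routing logic only, not Magento's own connection handling; the `QueryRouter` class, the host names, and the choice to keep reads inside a transaction on the primary are assumptions made for the example.

```python
import itertools


class QueryRouter:
    """Send plain reads to replicas in rotation; everything else to the primary."""

    READ_VERBS = {"SELECT", "SHOW"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, in_transaction=False):
        stripped = sql.strip()
        verb = stripped.split(None, 1)[0].upper() if stripped else ""
        is_read = verb in self.READ_VERBS
        # Reads inside a transaction stay on the primary so they see
        # the transaction's own uncommitted writes.
        if is_read and not in_transaction and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = QueryRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM catalog_product_entity"))
print(router.route("UPDATE cataloginventory_stock_item SET qty = qty - 1"))
```

Replication is asynchronous, so a read issued immediately after a write may not see it on a replica. Flows that need read-your-writes behaviour should stay on the primary.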

Automated failover configured: if the primary becomes unavailable, a replica is promoted automatically. RTO for database failover: under 2 minutes. Manual intervention not required.

Cache Tier

Redis for both session storage and Magento backend cache (config, block HTML, full page), run as a replicated primary/replica set. Redis Sentinel handles automatic failover: if the primary fails, Sentinel promotes a replica and the cache tier continues operating without manual intervention.

Varnish sits in front of the web tier as the full-page cache layer. Cache hit rates on production stores consistently above 90% — the vast majority of requests served directly from Varnish memory without touching PHP or MySQL. The performance delta between a Varnish hit (~5ms) and an uncached Magento page response (~500ms) is the core performance story for any well-configured Magento store.
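The effect of hit rate on average response time is a weighted average of the two figures quoted above. A minimal sketch, using the approximate 5 ms and 500 ms values from the text as defaults:

```python
def expected_latency_ms(hit_rate, hit_ms=5.0, miss_ms=500.0):
    """Average response time for a given Varnish hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


for rate in (0.80, 0.90, 0.95):
    print(f"hit rate {rate:.0%}: {expected_latency_ms(rate):.1f} ms average")
```

At a 90% hit rate the average is about 54.5 ms, and at 95% it is about 29.8 ms. Moving from 90% to 95% also halves the share of requests that reach PHP and MySQL, which is why a drop in hit rate is worth alerting on.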

Search

Elasticsearch cluster with replica shards. A single node failure degrades performance but doesn’t take search offline: the cluster continues serving from the remaining shard copies. If needed, a full reindex from the MySQL source data can be triggered manually on the surviving nodes.
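The reason replica shards keep search online can be shown with a toy placement model. This is a simplified round-robin layout written for illustration, not Elasticsearch's actual shard allocator; the node names and shard counts are assumptions.

```python
def place_shards(num_shards, nodes, replicas=1):
    """Place each shard's primary and replicas on distinct nodes (round-robin)."""
    copies = replicas + 1
    if copies > len(nodes):
        raise ValueError("need at least one node per shard copy")
    return {
        shard: [nodes[(shard + offset) % len(nodes)] for offset in range(copies)]
        for shard in range(num_shards)
    }


def survives_node_failure(placement, failed_node):
    """True if every shard still has a copy on some other node."""
    return all(
        any(node != failed_node for node in shard_nodes)
        for shard_nodes in placement.values()
    )


nodes = ["es-1", "es-2", "es-3"]
with_replicas = place_shards(3, nodes, replicas=1)
without_replicas = place_shards(3, nodes, replicas=0)

print(all(survives_node_failure(with_replicas, n) for n in nodes))
print(survives_node_failure(without_replicas, "es-1"))
```

With one replica per shard, any single node can fail and every shard remains searchable. With no replicas, losing a node loses the shards it held.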

Message Queue

RabbitMQ cluster for Magento’s asynchronous operations — bulk order processing, email dispatch, inventory update events, and any custom async flows. Messages are persisted to disk and replicated across cluster nodes. Node failure doesn’t lose messages — they continue processing from the queue once the cluster recovers.

Observability

Centralised ELK stack for log aggregation: Nginx access logs, PHP-FPM logs, Magento exception logs, deployment events. New Relic APM for application performance monitoring. Infrastructure-level dashboards (CPU, memory, disk I/O, network) distinct from application dashboards (transaction times, error rates, Varnish hit rate).

Alerts configured for: error rate spike, P95 response time regression, Varnish hit rate drop, disk utilisation threshold, and any service entering a degraded state.
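A few of these alert conditions can be sketched as plain threshold checks. The thresholds below are hypothetical placeholders, not the values used in production, and the sketch compares against fixed limits rather than a rolling baseline, which a real regression or spike alert would more likely use.

```python
import math

# Hypothetical thresholds, chosen only to make the example concrete.
THRESHOLDS = {
    "p95_ms": 800.0,
    "error_rate": 0.02,
    "varnish_hit_rate": 0.85,
    "disk_used": 0.80,
}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(response_times_ms, error_count, request_count,
                    varnish_hit_rate, disk_used, thresholds=THRESHOLDS):
    """Return the names of the alert conditions that are currently firing."""
    alerts = []
    if percentile(response_times_ms, 95) > thresholds["p95_ms"]:
        alerts.append("p95_response_time")
    if request_count and error_count / request_count > thresholds["error_rate"]:
        alerts.append("error_rate")
    if varnish_hit_rate < thresholds["varnish_hit_rate"]:
        alerts.append("varnish_hit_rate")
    if disk_used > thresholds["disk_used"]:
        alerts.append("disk_utilisation")
    return alerts


# 18 fast requests and 2 slow ones: the slow tail trips the P95 check.
samples = [100.0] * 18 + [900.0, 950.0]
print(evaluate_alerts(samples, error_count=1, request_count=100,
                      varnish_hit_rate=0.92, disk_used=0.50))
```

Alerting on P95 rather than the mean matters here: the mean of these samples is well under the threshold, yet one request in ten is slow.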

Cost Model

The architecture achieves enterprise-grade reliability at SMB-accessible cost through three levers:

Reserved instance pricing: 1-year or 3-year commitments on the baseline capacity, significantly reducing EC2 costs versus on-demand
Right-sizing: Services sized to actual load with headroom, not theoretical peaks. RDS instance size, Redis node size, and Elasticsearch cluster size all tuned to real traffic patterns with autoscaling for spikes
Selective managed services: RDS for MySQL (managed backup, failover, patching justifies the premium); self-hosted Redis, Elasticsearch, and RabbitMQ on EC2 (the overhead is low, the cost saving is real)
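The reserved-instance lever can be expressed as simple arithmetic: baseline nodes run all month at the discounted rate, and burst capacity is billed on-demand by the hour. The rates and discount in this sketch are placeholders, not real AWS prices or figures from this deployment.

```python
HOURS_PER_MONTH = 730  # common approximation for a month of instance hours


def monthly_compute_cost(baseline_nodes, burst_node_hours,
                         on_demand_hourly, reserved_discount):
    """Blend reserved baseline capacity with on-demand burst capacity.

    reserved_discount is the fraction saved versus on-demand (0.0 to < 1.0).
    All inputs are caller-supplied; none are real prices.
    """
    if not 0.0 <= reserved_discount < 1.0:
        raise ValueError("reserved_discount must be in [0, 1)")
    reserved_hourly = on_demand_hourly * (1.0 - reserved_discount)
    baseline_cost = baseline_nodes * HOURS_PER_MONTH * reserved_hourly
    burst_cost = burst_node_hours * on_demand_hourly
    return baseline_cost + burst_cost


# Placeholder inputs: 2 baseline nodes, 100 burst node-hours.
blended = monthly_compute_cost(2, 100, on_demand_hourly=0.10, reserved_discount=0.5)
all_on_demand = monthly_compute_cost(2, 100, on_demand_hourly=0.10, reserved_discount=0.0)
print(f"blended: {blended:.2f}, all on-demand: {all_on_demand:.2f}")
```

The saving applies only to the baseline, which is why right-sizing comes first: committing to more reserved capacity than the store needs locks in cost rather than reducing it.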

Impact

Merchants who previously ran on single-server cPanel hosting now operate on infrastructure that can lose any single component without a service interruption. The 99.99% uptime figure (roughly 53 minutes of downtime per year) represents a qualitative change in reliability for stores that previously planned around maintenance windows and accepted occasional outages.
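The downtime budget implied by an availability percentage is straightforward to compute. A minimal sketch, assuming a 365-day year:

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365):
    """Minutes of permitted downtime per year at a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    minutes_per_year = days_per_year * 24 * 60
    return (100.0 - availability_percent) / 100.0 * minutes_per_year


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} minutes per year")
```

At 99.99% the budget is about 52.6 minutes per year, compared with roughly 8.8 hours at 99.9%.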

The autoscaling capacity handles promotional traffic spikes (Black Friday, seasonal campaigns) that would previously have required advance capacity planning or caused partial outages under unexpected load.
