Platform Recovery
Brought in to stabilise a failing e-commerce platform in the run-up to the peak trading period. Diagnosed the root causes and led the recovery.
Context
The problem
The e-commerce platform had been experiencing intermittent outages for weeks. With Black Friday two weeks away, the site couldn't handle even moderate traffic spikes. The development team had tried various fixes but couldn't identify the root cause. Customer complaints were mounting, and stakeholders were considering pulling the plug on the peak trading campaign.
Constraints
What we were working with
- No major architectural changes before the peak period
- Limited visibility into production systems
- A team exhausted and demoralised from weeks of firefighting
- Existing functionality had to be maintained: no feature regressions
- Budget for infrastructure changes needed executive approval
Approach
How I tackled it
1. Established immediate visibility: added APM tooling and log aggregation (a minimal instrumentation sketch follows this list)
2. Identified the primary bottleneck through systematic load testing
3. Implemented quick wins: database query optimisation, caching layers, connection pooling
4. Set up proper alerting so issues were detected before customers noticed
5. Created a runbook for known issues so anyone could respond
6. Ran controlled load tests to verify fixes under realistic conditions
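To give a concrete sense of step 1, here is a minimal sketch of what that bootstrap can look like, assuming a Node.js service instrumented with OpenTelemetry and exporting to an OTLP collector. The case study doesn't name the specific APM product, and the service name and collector endpoint below are placeholders.

```ts
// tracing.ts – imported once, before the rest of the application starts.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'checkout-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder collector endpoint
  }),
  // Auto-instruments HTTP, Express, pg, Redis and other common libraries, which is
  // what surfaces per-request latency breakdowns without touching application code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans on shutdown so the final traces aren't lost.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

The appeal of auto-instrumentation in a crisis is exactly this: it adds visibility without changing application code, which matters when any change is risky two weeks before peak.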
Architecture
Technical design
The platform was a standard three-tier architecture (CDN, application servers, database) but lacked the instrumentation to identify problems. The focus was on adding observability without changing the core architecture.
- Added application performance monitoring (APM) across all services
- Implemented database query analysis to identify slow queries
- Added Redis caching layer for expensive database operations
- Configured proper connection pooling between app and database (the caching and pooling changes are sketched after this list)
- Set up synthetic monitoring to detect issues proactively
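Illustratively, the caching and pooling bullets translate into a small data-access layer along these lines. This is a sketch, assuming a Node.js app using node-postgres and ioredis; the table, pool sizes and TTL are placeholders rather than the values used on the engagement.

```ts
import { Pool } from 'pg';
import Redis from 'ioredis';

type Product = {
  id: string;
  name: string;
  price_cents: number;
  stock: number;
};

// Bounded connection pool: opening a fresh database connection per request is a
// classic source of the kind of bottleneck described here. Sizes are placeholders.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                        // cap concurrent connections per app instance
  idleTimeoutMillis: 30_000,      // recycle idle connections
  connectionTimeoutMillis: 2_000, // fail fast rather than queueing forever
});

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Cache-aside read for an expensive product lookup: serve from Redis when possible,
// otherwise query Postgres and repopulate the cache with a short TTL.
export async function getProduct(productId: string): Promise<Product | null> {
  const cacheKey = `product:${productId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached) as Product;

  const { rows } = await pool.query<Product>(
    'SELECT id, name, price_cents, stock FROM products WHERE id = $1',
    [productId],
  );
  const product = rows[0] ?? null;

  if (product) {
    await redis.set(cacheKey, JSON.stringify(product), 'EX', 60); // 60-second TTL
  }
  return product;
}
```

Cache-aside keeps the database as the source of truth: a short TTL bounds staleness on product pages, while the capped pool stops a traffic spike from exhausting database connections.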
Technical Artifact
The platform architecture after stabilisation, showing the added observability layer and the caching that enabled 10x traffic handling.
What you're looking at:
- Redis caching reduced database load by 60%
- APM + PagerDuty for proactive issue detection (a probe sketch follows this list)
- Connection pooling fixed the primary bottleneck
- p95 latency dropped 80% with query optimisations
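To illustrate the APM + PagerDuty point, a synthetic probe can be as small as the sketch below: hit the key customer journeys on a schedule and raise a PagerDuty event when a check fails or blows its latency budget. The URLs, budget and routing key are placeholders, and in practice this often lives in the monitoring vendor's synthetics rather than a hand-rolled script.

```ts
// Hypothetical synthetic checks against key customer journeys.
const CHECKS = [
  { name: 'home', url: 'https://shop.example.com/' },
  { name: 'product-page', url: 'https://shop.example.com/products/best-seller' },
  { name: 'checkout-health', url: 'https://shop.example.com/api/checkout/health' },
];

const LATENCY_BUDGET_MS = 2_000;

// PagerDuty Events API v2: a 'trigger' event opens an incident, deduplicated by dedup_key.
async function page(checkName: string, summary: string): Promise<void> {
  await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_ROUTING_KEY,
      event_action: 'trigger',
      dedup_key: `synthetic-${checkName}`,
      payload: { summary, source: 'synthetic-monitor', severity: 'critical' },
    }),
  });
}

async function runChecks(): Promise<void> {
  for (const check of CHECKS) {
    const started = Date.now();
    try {
      const res = await fetch(check.url);
      const elapsedMs = Date.now() - started;
      if (!res.ok) {
        await page(check.name, `${check.name} returned HTTP ${res.status}`);
      } else if (elapsedMs > LATENCY_BUDGET_MS) {
        await page(check.name, `${check.name} took ${elapsedMs}ms (budget ${LATENCY_BUDGET_MS}ms)`);
      }
    } catch (err) {
      await page(check.name, `${check.name} failed: ${(err as Error).message}`);
    }
  }
}

// Run every minute; a real deployment would use a proper scheduler.
setInterval(runChecks, 60_000);
```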
Delivery
What I shipped
- Emergency stabilisation: database query optimisations that reduced p95 latency by 80%
- Caching implementation: reduced database load by 60% on product pages
- Monitoring dashboard: real-time visibility into system health for the team
- Alert configuration: PagerDuty integration for critical metrics
- Runbook documentation: step-by-step guides for common issues
- Load testing framework: repeatable tests to verify system capacity (a minimal sketch follows)
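The load testing piece reduces to repeatable scripts with explicit pass/fail budgets. A minimal version, assuming autocannon pointed at a staging environment (the actual tool and thresholds used on the engagement aren't stated here), looks something like this:

```ts
import autocannon from 'autocannon';

// Repeatable capacity check: drive realistic concurrency at a staging copy of the
// platform and fail the run if latency or error budgets are breached.
// The URL, connection count and budgets are illustrative placeholders.
async function runCapacityCheck(): Promise<void> {
  const result = await autocannon({
    url: 'https://staging.shop.example.com/products/best-seller',
    connections: 200, // concurrent connections, a rough stand-in for concurrent shoppers
    duration: 120,    // seconds
  });

  const p99 = result.latency.p99;
  const totalRequests = result.requests.total || 1;
  const errorRate = (result.errors + result.non2xx) / totalRequests;

  console.log(`p99 latency: ${p99}ms, error rate: ${(errorRate * 100).toFixed(2)}%`);

  // Fail the run (e.g. in CI) if the system can't hold the budget under load.
  if (p99 > 1_000 || errorRate > 0.01) {
    console.error('Capacity check failed: latency or error budget exceeded');
    process.exitCode = 1;
  }
}

runCapacityCheck();
```

Keeping the budgets in the script is what makes the test repeatable: the same command can be run after every fix to confirm capacity hasn't regressed.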
Outcome
The result
The platform handled 10x normal traffic on Black Friday with zero downtime. The team went from reactive firefighting to proactive monitoring. Post-peak, we used the visibility tools to identify and fix 12 additional issues that would have caused problems at scale. The business ran a successful peak campaign and the team rebuilt confidence in the platform.
Lessons
What I learned
1. You can't fix what you can't see: observability is the first priority in any crisis
2. Most performance problems are a few bad queries or missing indexes
3. A calm, systematic approach is more effective than heroic all-nighters
4. Document everything during a crisis; you'll need it for the post-mortem
5. Team morale matters: small wins build momentum for bigger fixes
Facing a similar challenge?
I'd be happy to discuss your situation and explore whether I can help.