Platform Recovery
Brought in to stabilise a failing e-commerce platform in the run-up to the peak trading period. Diagnosed the root causes and led the recovery.
Context
The problem
The e-commerce platform had been experiencing intermittent outages for weeks. With Black Friday two weeks away, the site couldn't handle even moderate traffic spikes. The development team had tried various fixes but couldn't identify the root cause. Customer complaints were mounting, and stakeholders were considering pulling the plug on the peak trading campaign.
Constraints
What we were working with
- No major architectural changes before the peak period
- Limited visibility into production systems
- A team exhausted and demoralised from weeks of firefighting
- Existing functionality had to be maintained: no feature regressions
- Budget for infrastructure changes needed executive approval
Approach
How I tackled it
1. Established immediate visibility: added APM tooling and log aggregation (a minimal instrumentation sketch follows this list)
2. Identified the primary bottleneck through systematic load testing
3. Implemented quick wins: database query optimisation, caching layers, connection pooling
4. Set up proper alerting so issues were detected before customers noticed
5. Created a runbook for known issues so anyone could respond
6. Ran controlled load tests to verify fixes under realistic conditions
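To give a concrete sense of step 1, here is a minimal sketch of what that bootstrap can look like, assuming a Node.js service instrumented with OpenTelemetry and exporting to an OTLP collector. The case study doesn't name the specific APM product, and the service name and collector endpoint below are placeholders.

```ts
// tracing.ts – imported once, before the rest of the application starts.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'checkout-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder collector endpoint
  }),
  // Auto-instruments HTTP, Express, pg, Redis and other common libraries, which is
  // what surfaces per-request latency breakdowns without touching application code.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans on shutdown so the final traces aren't lost.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

The appeal of auto-instrumentation in a crisis is exactly this: it adds visibility without changing application code, which matters when any change is risky two weeks before peak.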
Architecture
Technical design
The platform was a standard three-tier architecture (CDN, application servers, database) but lacked the instrumentation to identify problems. The focus was on adding observability without changing the core architecture.
- Added application performance monitoring (APM) across all services
- Implemented database query analysis to identify slow queries
- Added Redis caching layer for expensive database operations
- Configured proper connection pooling between app and database (the caching and pooling changes are sketched after this list)
- Set up synthetic monitoring to detect issues proactively
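Illustratively, the caching and pooling bullets translate into a small data-access layer along these lines. This is a sketch, assuming a Node.js app using node-postgres and ioredis; the table, pool sizes and TTL are placeholders rather than the values used on the engagement.

```ts
import { Pool } from 'pg';
import Redis from 'ioredis';

type Product = {
  id: string;
  name: string;
  price_cents: number;
  stock: number;
};

// Bounded connection pool: opening a fresh database connection per request is a
// classic source of the kind of bottleneck described here. Sizes are placeholders.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                        // cap concurrent connections per app instance
  idleTimeoutMillis: 30_000,      // recycle idle connections
  connectionTimeoutMillis: 2_000, // fail fast rather than queueing forever
});

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Cache-aside read for an expensive product lookup: serve from Redis when possible,
// otherwise query Postgres and repopulate the cache with a short TTL.
export async function getProduct(productId: string): Promise<Product | null> {
  const cacheKey = `product:${productId}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached) as Product;

  const { rows } = await pool.query<Product>(
    'SELECT id, name, price_cents, stock FROM products WHERE id = $1',
    [productId],
  );
  const product = rows[0] ?? null;

  if (product) {
    await redis.set(cacheKey, JSON.stringify(product), 'EX', 60); // 60-second TTL
  }
  return product;
}
```

Cache-aside keeps the database as the source of truth: a short TTL bounds staleness on product pages, while the capped pool stops a traffic spike from exhausting database connections.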
Technical Artifact
The platform architecture after stabilisation, showing the added observability layer and the caching that enabled 10x traffic handling.
What you're looking at:
- Redis caching reduced database load by 60%
- APM + PagerDuty for proactive issue detection (a probe sketch follows this list)
- Connection pooling fixed the primary bottleneck
- p95 latency dropped 80% with query optimisations
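To illustrate the APM + PagerDuty point, a synthetic probe can be as small as the sketch below: hit the key customer journeys on a schedule and raise a PagerDuty event when a check fails or blows its latency budget. The URLs, budget and routing key are placeholders, and in practice this often lives in the monitoring vendor's synthetics rather than a hand-rolled script.

```ts
// Hypothetical synthetic checks against key customer journeys.
const CHECKS = [
  { name: 'home', url: 'https://shop.example.com/' },
  { name: 'product-page', url: 'https://shop.example.com/products/best-seller' },
  { name: 'checkout-health', url: 'https://shop.example.com/api/checkout/health' },
];

const LATENCY_BUDGET_MS = 2_000;

// PagerDuty Events API v2: a 'trigger' event opens an incident, deduplicated by dedup_key.
async function page(checkName: string, summary: string): Promise<void> {
  await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_ROUTING_KEY,
      event_action: 'trigger',
      dedup_key: `synthetic-${checkName}`,
      payload: { summary, source: 'synthetic-monitor', severity: 'critical' },
    }),
  });
}

async function runChecks(): Promise<void> {
  for (const check of CHECKS) {
    const started = Date.now();
    try {
      const res = await fetch(check.url);
      const elapsedMs = Date.now() - started;
      if (!res.ok) {
        await page(check.name, `${check.name} returned HTTP ${res.status}`);
      } else if (elapsedMs > LATENCY_BUDGET_MS) {
        await page(check.name, `${check.name} took ${elapsedMs}ms (budget ${LATENCY_BUDGET_MS}ms)`);
      }
    } catch (err) {
      await page(check.name, `${check.name} failed: ${(err as Error).message}`);
    }
  }
}

// Run every minute; a real deployment would use a proper scheduler.
setInterval(runChecks, 60_000);
```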
Delivery
What I shipped
- Emergency stabilisation: database query optimisations that reduced p95 latency by 80%
- Caching implementation: reduced database load by 60% on product pages
- Monitoring dashboard: real-time visibility into system health for the team
- Alert configuration: PagerDuty integration for critical metrics
- Runbook documentation: step-by-step guides for common issues
- Load testing framework: repeatable tests to verify system capacity (a minimal sketch follows)
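The load testing piece reduces to repeatable scripts with explicit pass/fail budgets. A minimal version, assuming autocannon pointed at a staging environment (the actual tool and thresholds used on the engagement aren't stated here), looks something like this:

```ts
import autocannon from 'autocannon';

// Repeatable capacity check: drive realistic concurrency at a staging copy of the
// platform and fail the run if latency or error budgets are breached.
// The URL, connection count and budgets are illustrative placeholders.
async function runCapacityCheck(): Promise<void> {
  const result = await autocannon({
    url: 'https://staging.shop.example.com/products/best-seller',
    connections: 200, // concurrent connections, a rough stand-in for concurrent shoppers
    duration: 120,    // seconds
  });

  const p99 = result.latency.p99;
  const totalRequests = result.requests.total || 1;
  const errorRate = (result.errors + result.non2xx) / totalRequests;

  console.log(`p99 latency: ${p99}ms, error rate: ${(errorRate * 100).toFixed(2)}%`);

  // Fail the run (e.g. in CI) if the system can't hold the budget under load.
  if (p99 > 1_000 || errorRate > 0.01) {
    console.error('Capacity check failed: latency or error budget exceeded');
    process.exitCode = 1;
  }
}

runCapacityCheck();
```

Keeping the budgets in the script is what makes the test repeatable: the same command can be run after every fix to confirm capacity hasn't regressed.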
Outcome
The result
The platform handled 10x normal traffic on Black Friday with zero downtime. The team went from reactive firefighting to proactive monitoring. Post-peak, we used the visibility tools to identify and fix 12 additional issues that would have caused problems at scale. The business ran a successful peak campaign and the team rebuilt confidence in the platform.
Lessons
What I learned
1. You can't fix what you can't see: observability is the first priority in any crisis
2. Most performance problems are a few bad queries or missing indexes
3. A calm, systematic approach is more effective than heroic all-nighters
4. Document everything during a crisis; you'll need it for the post-mortem
5. Team morale matters: small wins build momentum for bigger fixes
Facing a similar challenge?
I'd be happy to discuss your situation and explore whether I can help.