Degradation in checkout performance.

Incident Report for Shopflo

Postmortem

Root Cause Analysis (RCA)

Incident Summary

  • Date and Time: 2025-03-20, between 12:40 PM and 1:10 PM
  • Duration: 30 minutes
  • Impact:

    • Discount, Shipping, and Rewards services experienced service degradation due to an unprecedented spike in traffic.
    • The database became a bottleneck, causing delays and increased response times for dependent services.
    • The Checkout Aggregator timed out calls to these degraded services, preventing prolonged waiting times.
    • Circuit breakers effectively kicked in, ensuring that the Checkout service continued to function smoothly, albeit with degraded features.
    • The database was scaled to manage the load, and the auto-scaling policy was reviewed and adjusted to prevent delays in future load spikes.
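The report does not include the aggregator's actual circuit-breaker implementation; the following is a minimal Python sketch of the pattern described above, where the breaker opens after repeated downstream failures and the caller falls back to degraded behavior (names such as `CircuitBreaker`, `max_failures`, and `reset_after` are illustrative assumptions, not Shopflo's code):

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, short-circuits calls while open, and allows a probe call
    (half-open) after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the downstream call entirely unless the
        # reset window has elapsed (half-open probe).
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0  # success closes the circuit fully
        return result
```

In this sketch, a checkout flow would call something like `breaker.call(fetch_discounts, fallback=lambda: [])`, so that when the Discount service degrades, checkout continues with an empty discount list instead of blocking on a slow dependency.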

Timeline

  • 12:40 PM:

    • Unprecedented traffic spike hit the system, resulting in an increased load on Discount, Shipping, and Rewards services.
    • The database experienced a sudden surge in queries, causing latency and response time degradation.
    • Auto-scaling did not trigger immediately, leading to a delay in scaling up resources.
  • 12:45 PM:

    • The Checkout Aggregator timed out calls to downstream services (Discount, Shipping, and Rewards) after exceeding predefined thresholds.
    • Circuit breakers isolated these services, allowing the Checkout service to continue operating with limited functionality.
  • 12:50 PM:

    • System degradation worsened as the database remained under strain, affecting overall service response times.
  • 1:00 PM:

    • Auto-scaling eventually triggered but was delayed due to conservative threshold settings.
    • Database and application layer began to recover as new instances were spun up.
  • 1:10 PM:

    • Traffic stabilized, and system performance gradually returned to normal.

Root Cause

  • Database Saturation Due to Traffic Spike:

    • An unforeseen surge in traffic led to a massive increase in database queries, overwhelming available capacity.
    • Database connections approached their configured limits and query execution times increased, resulting in service degradation for Discount, Shipping, and Rewards services.

Contributing Factors

  1. Delayed Auto-Scaling:

     • Auto-scaling policies had conservative thresholds, causing a delayed reaction to the surge in load.
     • The system did not scale up resources fast enough to handle the increased demand.
  2. Lack of Predictive Scaling for Database Load:

     • Auto-scaling was focused primarily on the application layer, with insufficient predictive scaling for database load.
  3. Circuit Breaker and Timeout Settings:

     • While circuit breakers and timeouts protected the core Checkout service, they masked the root cause, allowing the issue to persist until scaling occurred.

Impact Analysis

  • Service Degradation:

    • Discount, Shipping, and Rewards services experienced slower response times and partial unavailability.
  • Checkout Service Continuity:

    • The Checkout service remained functional due to circuit breakers, but with degraded features (e.g., missing or delayed discounts, rewards, and shipping updates).
  • Increased Latency and Potential Errors:

    • Higher latency and intermittent timeouts affected user experience during the peak period.

Resolution

  • Database Scaling:

    • Scaled the database to handle the increased query volume.
  • Auto-Scaling Policy Review:

    • Updated auto-scaling thresholds to trigger scaling earlier during high-load scenarios.
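The report does not state the exact policy values; as a hedged illustration of what "trigger scaling earlier" means, here is a step-scaling sketch in Python where lowering the scale-out threshold (e.g. from 80% to 60% CPU) makes the policy react sooner to a spike (the function name and parameters are hypothetical):

```python
def desired_instances(current, cpu_pct, scale_out_at=60, step=2, max_instances=20):
    """Step-scaling sketch: add `step` instances whenever average CPU
    crosses `scale_out_at`. A lower `scale_out_at` triggers scale-out
    earlier in a traffic spike, at the cost of running more headroom."""
    if cpu_pct >= scale_out_at:
        return min(current + step, max_instances)
    return current
```

With the conservative threshold the fleet would have stayed at its current size until load was already critical; the earlier trigger adds capacity while latency is still acceptable.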

Corrective Actions

  1. Proactive Auto-Scaling for Database Layer

     • Introduce predictive scaling based on query volume and connection saturation to scale database resources dynamically.
  2. Fine-Tune Auto-Scaling Thresholds

     • Reduce latency thresholds and trigger auto-scaling earlier when load spikes are detected.
     • Implement pre-warming for known peak traffic periods to mitigate delays.
  3. Optimize Query Performance and Caching

     • Review and optimize query patterns for Discount, Shipping, and Rewards services.
     • Add caching for frequently accessed data to reduce database load during traffic surges.
  4. Enhance Circuit Breaker Monitoring

     • Fine-tune circuit breaker thresholds to detect degradation faster and ensure service stability.
  5. Load Testing and Stress Simulations

     • Conduct periodic load tests to simulate traffic spikes and validate the effectiveness of auto-scaling and database handling mechanisms.
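To make the caching item above concrete, here is a minimal TTL-cache sketch in Python (not Shopflo's implementation; the decorator name and TTL value are assumptions) showing how repeated reads of a hot key can be served from memory instead of hitting the database during a surge:

```python
import time


def ttl_cache(ttl_seconds):
    """Tiny TTL cache decorator: serves a stored value for up to
    `ttl_seconds`, shielding the database from repeated reads of
    hot keys (e.g. active discount codes) during a traffic surge."""
    def decorator(fn):
        store = {}  # key -> (expires_at, value)

        def wrapper(key):
            entry = store.get(key)
            if entry is not None and entry[0] > time.monotonic():
                return entry[1]  # cache hit: no database round trip
            value = fn(key)  # cache miss or expired: fetch and store
            store[key] = (time.monotonic() + ttl_seconds, value)
            return value
        return wrapper
    return decorator
```

A short TTL keeps the data reasonably fresh while absorbing the bulk of repeated reads; in production, an external cache (e.g. Redis) would typically play this role so instances share hits.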

Preventive Measures

  • Traffic Pattern Analysis and Forecasting

    • Use historical traffic patterns and predictive models to anticipate traffic spikes and pre-scale resources.
  • Database Connection Pooling Enhancements

    • Optimize connection pooling to prevent saturation during load spikes.
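As a sketch of the pooling enhancement above (illustrative only; the class and parameters are assumptions, and real services would use their driver's built-in pool), a bounded pool caps the number of open database connections so a load spike queues callers instead of saturating the database:

```python
import queue


class ConnectionPool:
    """Bounded pool sketch: at most `size` connections exist; when all
    are checked out, callers block up to `timeout` seconds and then
    fail fast, rather than opening ever more connections and
    saturating the database."""

    def __init__(self, factory, size=10, timeout=5.0):
        self.timeout = timeout
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # pre-create all connections

    def acquire(self):
        # Raises queue.Empty if no connection frees up in time.
        return self._pool.get(timeout=self.timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Failing fast at the pool boundary turns database saturation into a bounded, observable error at the service layer instead of unbounded query latency for every caller.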

Conclusion

The incident was caused by an unexpected traffic surge that led to database saturation and service degradation for downstream services. Although circuit breakers ensured that the Checkout service remained functional, the delayed auto-scaling exacerbated the situation. Immediate scaling and auto-scaling policy reviews were performed to mitigate future occurrences.

Posted Mar 20, 2025 - 19:36 IST

Resolved

Hi Team,

We observed a degradation in the checkout experience between 12:40 PM and 1:10 PM on March 20, 2025. Our team is actively investigating the issue, and we are prioritizing the Root Cause Analysis (RCA).

We will share our findings soon and provide updates on any necessary actions.

Posted Mar 20, 2025 - 13:21 IST
This incident affected: Rewards Service and Discount Service.