Degradation in checkout performance.

Incident Report for Shopflo

Postmortem

Root Cause Analysis (RCA)

Incident Summary

  • Date and Time: 2025-03-19, between 3:25 PM and 3:50 PM
  • Duration: 25 minutes
  • Impact:

    • The requestor sent requests to the service and retried if the service did not respond within 10 seconds.
    • Instead of waiting for a response, the requestor closed the connection after 10 seconds and issued a new request.
    • The service, however, continued processing the original requests even after the requestor had disconnected, causing unnecessary resource consumption and increased load on the system. (A simplified sketch of this request/retry pattern follows below.)
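For illustration only, the sketch below reconstructs the requestor’s timeout-and-retry behaviour in Go under the assumptions above; the endpoint, names, and retry count are hypothetical and not Shopflo’s actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The requestor gives up on any response that takes longer than 10 seconds.
	client := &http.Client{Timeout: 10 * time.Second}

	for attempt := 1; attempt <= 3; attempt++ {
		resp, err := client.Get("https://checkout.example.com/api/order")
		if err != nil {
			// On timeout the client closes the connection and immediately
			// retries, while the service keeps processing the original request.
			fmt.Println("attempt", attempt, "failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("attempt", attempt, "status:", resp.StatusCode)
		break
	}
}
```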

Timeline

  • 3:25 PM: Requestor started sending requests and disconnecting after 10 seconds if the service didn’t respond.
  • 3:27 PM: The service continued processing requests whose connections had already been terminated and whose results were no longer needed.
  • 3:30 PM: Multiple retries from the requestor led to an accumulation of requests being processed unnecessarily.
  • 3:35 PM: System resources began experiencing strain due to the increased number of open connections and pending requests.
  • 3:50 PM: The issue was identified and mitigated.

Root Cause

  • Unnecessary Request Processing After Disconnection: The service did not terminate or cancel the processing of requests after the requestor disconnected. As a result:

    • The service continued processing stale or irrelevant requests.
    • The accumulated processing of disconnected requests led to increased resource consumption and degraded system performance. (An illustrative sketch of this handler pattern follows below.)
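A minimal sketch of the failure mode, assuming a Go net/http service: the handler below never consults the request context, so its work runs to completion even after the client disconnects. The route and processOrder are hypothetical stand-ins, not the actual checkout code.

```go
package main

import (
	"net/http"
	"time"
)

// processOrder stands in for the real checkout work (downstream calls, etc.).
func processOrder() string {
	time.Sleep(30 * time.Second) // long-running work
	return "order confirmed"
}

// handler mirrors the bug pattern: the request context (r.Context()) is never
// consulted, so the work runs to completion even after the client disconnects
// at the 10-second mark and the response has nowhere to go.
func handler(w http.ResponseWriter, r *http.Request) {
	result := processOrder()
	w.Write([]byte(result))
}

func main() {
	http.HandleFunc("/api/order", handler)
	http.ListenAndServe(":8080", nil)
}
```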

Contributing Factors

  1. Lack of Connection Termination Handling
    • The service did not detect that the client had closed the connection, so it never stopped processing the request.
  2. Requestor’s Aggressive Retry Pattern
    • Because the requestor retried immediately after disconnecting, multiple copies of the same request were processed simultaneously, further increasing system load.
  3. Absence of a Request Cancellation Mechanism
    • No mechanism was in place to abort or cancel in-progress requests when the connection was closed.

Impact Analysis

  • Increased Resource Usage:

    • The service was processing unnecessary requests that had no valid response destination.
  • Potential Service Degradation:

    • As requests accumulated, available resources were exhausted, risking performance degradation.
  • Risk of Latency and Timeout:

    • Legitimate requests faced higher latency due to contention with stale, lingering requests.

Resolution

  • Temporary Mitigation:

    • Manual intervention was used to clean up lingering connections and reduce load.
    • Traffic was stabilized by rate-limiting requestor retries to avoid compounding the problem. (A minimal rate-limiting sketch follows below.)
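A minimal sketch of the retry rate-limiting used as mitigation, assuming a token-bucket limiter (golang.org/x/time/rate) in front of the requestor’s retries; the limit values are illustrative, not the ones applied during the incident.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most 5 retry attempts per second, with a burst of 1.
	limiter := rate.NewLimiter(rate.Limit(5), 1)
	ctx := context.Background()

	for attempt := 1; attempt <= 10; attempt++ {
		// Wait blocks until the token bucket permits the next retry,
		// smoothing out bursts of aggressive retries.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("limiter error:", err)
			return
		}
		fmt.Println("retry", attempt, "permitted")
	}
}
```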

Corrective Actions

  1. Implement Request Cancellation on Connection Termination
    • Abort request processing if the client disconnects mid-request.
    • Configure appropriate listeners or interceptors to detect connection termination and cancel ongoing requests. (A combined sketch of actions 1-3 follows this list.)
  2. Introduce a Connection Timeout at the Service Level
    • Set a server-side timeout that matches or is slightly lower than the requestor’s 10-second retry interval.
    • Ensure that long-running requests are terminated to free up resources.
  3. Retry Backoff for the Requestor
    • Introduce exponential backoff with jitter to prevent aggressive retries.
    • Add logic to prevent immediate retries after disconnection.
  4. Connection Pool Optimization
    • Fine-tune connection pool size and idle connection timeout settings so that connections are promptly cleaned up.
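A combined sketch of corrective actions 1-3, assuming a Go net/http service; the handler names, routes, and timeout values are illustrative assumptions. The handler honours client disconnects via the request context, the server caps processing time below the requestor’s 10-second window, and the requestor retries with exponential backoff plus jitter.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// processOrder is a stand-in for the real checkout work; it stops as soon as
// the request context is cancelled (client disconnect or server-side timeout).
func processOrder(ctx context.Context) (string, error) {
	select {
	case <-time.After(5 * time.Second): // simulated downstream work
		return "order confirmed", nil
	case <-ctx.Done():
		return "", ctx.Err() // cancelled: stop work and free resources
	}
}

func orderHandler(w http.ResponseWriter, r *http.Request) {
	// Action 2: server-side deadline just below the client's 10-second timeout.
	ctx, cancel := context.WithTimeout(r.Context(), 9*time.Second)
	defer cancel()

	// Action 1: r.Context() is cancelled automatically when the client
	// disconnects, so the work is abandoned instead of lingering.
	result, err := processOrder(ctx)
	if err != nil {
		http.Error(w, "request cancelled or timed out", http.StatusGatewayTimeout)
		return
	}
	fmt.Fprintln(w, result)
}

// retryWithBackoff sketches action 3 on the requestor side: exponential
// backoff with jitter instead of immediate retries after a failure.
func retryWithBackoff(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		backoff := time.Duration(1<<attempt) * time.Second        // 1s, 2s, 4s, ...
		jitter := time.Duration(rand.Int63n(int64(time.Second))) // up to 1s of jitter
		time.Sleep(backoff + jitter)
	}
	return nil, lastErr
}

func main() {
	http.HandleFunc("/api/order", orderHandler)

	srv := &http.Server{
		Addr:        ":8080",
		ReadTimeout: 5 * time.Second,
		// Action 4 (connection pool tuning) would live on the requestor's
		// http.Transport (e.g. MaxIdleConns, IdleConnTimeout) rather than here.
	}
	srv.ListenAndServe()
}
```

The 9-second server-side deadline is an assumption chosen to sit just under the requestor’s 10-second retry window, so the service gives up and releases resources before the client disconnects and retries.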

Preventive Measures

  • Monitoring and Alerts

    • Set up alerts for excessive connection terminations and high request counts.
    • Track and log aborted/terminated requests for analysis. (A minimal tracking sketch follows after this list.)
  • Load and Failover Testing

    • Conduct simulations where requestors retry aggressively to observe system behavior.
    • Validate that connection termination effectively aborts ongoing requests.
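A minimal sketch of the tracking idea, assuming a Go net/http middleware: it counts requests whose context was cancelled (typically a client disconnect) so an alert can be raised when the count spikes. The counter here is an in-process atomic value; a real setup would export it to the metrics and alerting system.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"sync/atomic"
)

// cancelledRequests counts requests abandoned because the client disconnected.
var cancelledRequests atomic.Int64

// trackCancellations wraps a handler and records requests whose context was
// cancelled before a response could be delivered.
func trackCancellations(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if errors.Is(r.Context().Err(), context.Canceled) {
			n := cancelledRequests.Add(1)
			log.Printf("client disconnected before response; total aborted=%d", n)
		}
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/order", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // placeholder handler
	})
	log.Fatal(http.ListenAndServe(":8080", trackCancellations(mux)))
}
```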

Conclusion

The incident was caused by the service continuing to process requests even after the requestor disconnected. This resulted in resource saturation due to unnecessary processing. Immediate mitigation steps were taken, and corrective actions have been outlined to prevent a recurrence.

Posted Mar 20, 2025 - 19:34 IST

Resolved

Hi Team, the checkout is now back to normal. The team is investigating the cause, and we will share the RCA soon.
Posted Mar 19, 2025 - 16:14 IST

Investigating

Hi Team,

We've noticed a decline in Shopflo checkout performance. The team is actively investigating this as a priority.
Posted Mar 19, 2025 - 15:50 IST
This incident affected: Checkout Service.