Incident Summary
On Monday 16 March 2026, Visualcare experienced a major service disruption affecting API availability.
The incident was triggered by inefficient database query behaviour within a background worker process responsible for form-related data processing. This resulted in a surge of long-running queries that exhausted available database connections.
As database resources became constrained, API requests were unable to complete, leading to application worker saturation and instability across API nodes. This caused significant slowdowns and incomplete requests in both the Visualcare Web Application and Worker Mobile App.
While initial mitigation actions restored partial functionality, the underlying database pressure persisted, resulting in repeated instability until a controlled recovery was completed.
Service access was fully restored at ~13:11 AEST, initially at reduced speed, and the system returned to normal behaviour at ~14:22 AEST as database pressure subsided.
Impact
Customer Impact
  • Intermittent failures and timeouts when accessing the platform
  • Slow response times and request timeouts
  • Periods where the platform was unavailable
Duration
  • Start: ~09:56 AEST
  • Resolved: ~13:11 AEST
  • Total duration: ~3 hours 15 minutes
What Happened
A background worker responsible for processing form data executed queries that scaled poorly under certain data conditions, resulting in significantly longer execution times than expected.
[Chart: Top SQL]
As these queries accumulated:
  • Database connections became heavily utilised
  • API requests began queuing while waiting for available connections
  • Application workers became saturated handling blocked requests
  • API nodes became unstable under sustained load
This, combined with elevated system load at the time, accelerated database resource exhaustion, resulting in request backlogs, application worker saturation, and progressive service degradation.
As API nodes became increasingly unstable, overall platform performance deteriorated significantly. Requests were delayed or failed as application processes remained blocked waiting for database access.
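The cascade above can be illustrated with a small, purely hypothetical sketch: a fixed-size "connection pool" modelled as a counting semaphore, where a handful of slow queries hold every connection and a fast request is forced to queue behind them. This is for illustration only and does not reflect our actual pool sizes or query times.

```python
import threading
import time

# Illustrative only: a 5-slot "connection pool" modelled as a semaphore.
POOL_SIZE = 5
pool = threading.Semaphore(POOL_SIZE)

wait_times = []
lock = threading.Lock()

def run_query(duration):
    start = time.monotonic()
    with pool:                      # block until a connection is free
        waited = time.monotonic() - start
        with lock:
            wait_times.append(waited)
        time.sleep(duration)        # hold the connection for the query

# Five slow worker queries occupy the whole pool...
slow = [threading.Thread(target=run_query, args=(0.5,)) for _ in range(5)]
# ...so a fast API request must wait for one of them to finish.
fast = threading.Thread(target=run_query, args=(0.01,))

for t in slow:
    t.start()
time.sleep(0.1)   # let the slow queries grab every connection first
fast.start()
for t in slow + [fast]:
    t.join()

print(f"fast query waited ~{wait_times[-1]:.2f}s for a connection")
```

Even though the fast request needs the database for only 10 ms, it spends most of its lifetime waiting for a connection; at scale, this waiting is what saturated the application workers.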
[Chart: DB load]
Initial recovery actions (traffic redistribution and application restarts) provided only temporary relief, as they did not address the underlying database load:
  • The worker process continued generating high database activity
  • Resource contention rapidly recurred after each recovery attempt
This resulted in a repeating cycle of degradation and partial recovery, significantly extending the duration of the incident.
[Chart: CPU utilisation]
Detection
The issue was initially identified through customer reports, followed by internal validation of:
  • API responsiveness degradation
  • Elevated database connection usage
  • Application instability
Gap Identified
At the time of the incident, there were limited proactive alerts for:
  • Database connection saturation
  • Long-running query thresholds
Resolution
Service was restored through:
  • Terminating long-running queries across our database shards
  • Controlled recovery of application processes across API nodes
  • Careful management of traffic during recovery to prevent recurrence
  • Allowing the database load to return to normal operating levels
Once database pressure was reduced and application services stabilised, normal service resumed.
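The first recovery step can be sketched as follows. This is a hedged illustration, assuming a PostgreSQL-style `pg_stat_activity` view; the threshold, column names, and helper function are hypothetical, not our actual tooling.

```python
# Hypothetical cutoff for this example; the real value would depend on
# normal query profiles for the affected shards.
LONG_RUNNING_SECONDS = 60

# In PostgreSQL, the input rows could come from something like:
#   SELECT pid, EXTRACT(EPOCH FROM now() - query_start) AS seconds
#   FROM pg_stat_activity WHERE state = 'active';
# with each offending pid then passed to pg_terminate_backend(pid).

def pids_to_terminate(active_queries, threshold=LONG_RUNNING_SECONDS):
    """Return the backend pids whose queries exceed the time threshold."""
    return [pid for pid, seconds in active_queries if seconds > threshold]

# Example: two runaway worker queries and one healthy API query.
sample = [(101, 340.0), (102, 12.5), (103, 95.0)]
print(pids_to_terminate(sample))  # → [101, 103]
```

Terminating only queries over a clear duration threshold frees connections for the API while leaving healthy traffic untouched.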
Root Cause
Inefficient database query behaviour in a background worker process led to sustained resource consumption, exhausting database connections and causing cascading failure across API services.
What We’re Improving
We are implementing several improvements to prevent recurrence and strengthen system resilience:
  1. Workload Protection
  • Introduce safeguards to prevent excessive resource consumption from any single workload
  • Improve isolation of database usage across different request types
  2. Query Optimisation & Limits
  • Optimise form-related query patterns
  • Enforce execution time limits on database queries
  3. Query Controls
  • Detect and manage long-running database activity more proactively
  4. Observability & Alerting
  • Add alerts for:
    • Database connection utilisation
    • Query execution duration
Closing Statement
We recognise the impact this incident had and take full responsibility for the disruption.
This event has led to clear improvements in how we:
  • Protect shared system resources
  • Detect abnormal behaviour earlier
  • Maintain stability under load
Work on these changes is already underway to ensure a more resilient and reliable platform moving forward.