Incident Summary
On Monday 16 March 2026, Visualcare experienced a major service disruption affecting API availability.
The incident was triggered by inefficient database query behaviour within a background worker process responsible for form-related data processing. This resulted in a surge of long-running queries that exhausted available database connections.
As database resources became constrained, API requests were unable to complete, leading to application worker saturation and instability across API nodes. This caused significant slowdowns and incomplete requests in both the Visualcare Web Application and Worker Mobile App.
While initial mitigation actions restored partial functionality, the underlying database pressure persisted, resulting in repeated instability until a controlled recovery was completed.
Service access was fully restored at ~13:11 AEST, initially at reduced speed, and the system returned to normal behaviour at ~14:22 AEST as database pressure subsided.
Impact
Customer Impact
  • Intermittent failures and timeouts when accessing the platform
  • Slow response times and request timeouts
  • Periods where the platform was unavailable
Duration
  • Start: ~09:56 AEST
  • Resolved: ~13:11 AEST
  • Total duration: ~3 hours 15 minutes
What Happened
A background worker responsible for processing form data executed queries that scaled poorly under certain data conditions, resulting in significantly longer execution times than expected.
[Chart: Top SQL]
As these queries accumulated:
  • Database connections became heavily utilised
  • API requests began queuing while waiting for available connections
  • Application workers became saturated handling blocked requests
  • API nodes became unstable under sustained load
This, combined with elevated system load at the time, accelerated database resource exhaustion, resulting in request backlogs, application worker saturation, and progressive service degradation.
As API nodes became increasingly unstable, overall platform performance deteriorated significantly. Requests were delayed or failed as application processes remained blocked waiting for database access.
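The cascade above can be illustrated with a small, purely hypothetical sketch: a fixed-size "connection pool" modelled as a counting semaphore, where a handful of slow queries hold every connection and a fast request is forced to queue behind them. This is for illustration only and does not reflect our actual pool sizes or query times.

```python
import threading
import time

# Illustrative only: a 5-slot "connection pool" modelled as a semaphore.
POOL_SIZE = 5
pool = threading.Semaphore(POOL_SIZE)

wait_times = []
lock = threading.Lock()

def run_query(duration):
    start = time.monotonic()
    with pool:                      # block until a connection is free
        waited = time.monotonic() - start
        with lock:
            wait_times.append(waited)
        time.sleep(duration)        # hold the connection for the query

# Five slow worker queries occupy the whole pool...
slow = [threading.Thread(target=run_query, args=(0.5,)) for _ in range(5)]
# ...so a fast API request must wait for one of them to finish.
fast = threading.Thread(target=run_query, args=(0.01,))

for t in slow:
    t.start()
time.sleep(0.1)   # let the slow queries grab every connection first
fast.start()
for t in slow + [fast]:
    t.join()

print(f"fast query waited ~{wait_times[-1]:.2f}s for a connection")
```

Even though the fast request needs the database for only 10 ms, it spends most of its lifetime waiting for a connection; at scale, this waiting is what saturated the application workers.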
[Chart: DB load]
Initial recovery actions (traffic redistribution and application restarts) provided only temporary relief, as they did not address the underlying database load:
  • The worker process continued generating high database activity
  • Resource contention rapidly recurred after each recovery attempt
This resulted in a repeating cycle of degradation and partial recovery, significantly extending the duration of the incident.
[Chart: CPU utilisation]
Detection
The issue was initially identified through customer reports, followed by internal validation of:
  • API responsiveness degradation
  • Elevated database connection usage
  • Application instability
Gap Identified
At the time of the incident, there were limited proactive alerts for:
  • Database connection saturation
  • Long-running query thresholds
Resolution
Service was restored through:
  • Terminating long-running queries across our database shards
  • Controlled recovery of application processes across API nodes
  • Careful management of traffic during recovery to prevent recurrence
  • Allowing the database load to return to normal operating levels
Once database pressure was reduced and application services stabilised, normal service resumed.
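The first recovery step can be sketched as follows. This is a hedged illustration, assuming a PostgreSQL-style `pg_stat_activity` view; the threshold, column names, and helper function are hypothetical, not our actual tooling.

```python
# Hypothetical cutoff for this example; the real value would depend on
# normal query profiles for the affected shards.
LONG_RUNNING_SECONDS = 60

# In PostgreSQL, the input rows could come from something like:
#   SELECT pid, EXTRACT(EPOCH FROM now() - query_start) AS seconds
#   FROM pg_stat_activity WHERE state = 'active';
# with each offending pid then passed to pg_terminate_backend(pid).

def pids_to_terminate(active_queries, threshold=LONG_RUNNING_SECONDS):
    """Return the backend pids whose queries exceed the time threshold."""
    return [pid for pid, seconds in active_queries if seconds > threshold]

# Example: two runaway worker queries and one healthy API query.
sample = [(101, 340.0), (102, 12.5), (103, 95.0)]
print(pids_to_terminate(sample))  # → [101, 103]
```

Terminating only queries over a clear duration threshold frees connections for the API while leaving healthy traffic untouched.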
Root Cause
Inefficient database query behaviour in a background worker process led to sustained resource consumption, exhausting database connections and causing cascading failure across API services.
What We’re Improving
We are implementing several improvements to prevent recurrence and strengthen system resilience:
  1. Workload Protection
  • Introduce safeguards to prevent excessive resource consumption from any single workload
  • Improve isolation of database usage across different request types
  2. Query Optimisation & Limits
  • Optimise form-related query patterns
  • Enforce execution time limits on database queries
  3. Query Controls
  • Detect and manage long-running database activity more proactively
  4. Observability & Alerting
  • Add alerts for:
    • Database connection utilisation
    • Query execution duration
Closing Statement
We recognise the impact this incident had and take full responsibility for the disruption.
This event has led to clear improvements in how we:
  • Protect shared system resources
  • Detect abnormal behaviour earlier
  • Maintain stability under load
Work on these changes is already underway to ensure a more resilient and reliable platform moving forward.