Reducing Latencies and Downtime: Concurrent Refresh of Materialized View
optimization
database
postgresql
Context
Handling the vast amount of data required to compute salary information for NodeFlair Salaries was a challenge. To tackle this, we leveraged Materialized Views to speed up the process.
If you're curious about the details, check out my previous blog post on how this works.
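For a rough idea of the setup (the full walkthrough is in that post), a Scenic-backed materialized view boils down to a migration like the one below. The view name here is a made-up placeholder, not our actual schema:

```ruby
# The SELECT that defines the view lives in db/views/salary_insights_v01.sql;
# the migration just wires it up.
class CreateSalaryInsights < ActiveRecord::Migration[7.0]
  def change
    # materialized: true tells Scenic to create a PostgreSQL materialized view
    # instead of a plain view.
    create_view :salary_insights, materialized: true
  end
end
```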
Issue - It's Slow and Down
Over time, we've noticed the latencies for NodeFlair Salaries creeping up, leading to some serious downtime.
Our average uptime plummeted to just 84% (we count any page load taking longer than 5 seconds as downtime). Yikes!
And the average latency? A staggering 2155ms. Just think about the pain at p90 and beyond!
The Investigations
I've been digging into Amazon RDS Performance Insights and it's clear: our Average Active Sessions (AAS) regularly blow past the max vCPU limit of 2. We're running on a t4g.medium instance with only 2 vCPUs, so hitting that ceiling isn't too surprising.
What's driving this? It's mostly IO:DataFileRead and IPC:BufferIO.
Reading into Amazon RDS's documentation, here's the TL;DR of the potential causes:
- IO:DataFileRead - Your shared buffer pool might be too small; it's time to upsize it.
- IPC:BufferIO - Typically happens when multiple connections are waiting on the same page while an I/O operation is still pending.
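If you want to corroborate what Performance Insights shows, something like the following from a Rails console samples the same wait events straight from PostgreSQL. This is just a quick sanity-check sketch, not a monitoring setup:

```ruby
# Sample the wait events of currently active backends from pg_stat_activity.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT wait_event_type, wait_event, COUNT(*) AS sessions
  FROM pg_stat_activity
  WHERE state = 'active'
  GROUP BY wait_event_type, wait_event
  ORDER BY sessions DESC
SQL

# Print each wait event with how many sessions are stuck on it.
rows.each { |row| puts row }
```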
The Solution - Concurrent Refreshing of Materialized Views
The IPC:BufferIO wait event gave us a solid lead. It's rare to see multiple connections hitting the same page during I/O. After all, we don't change data THAT often.
Or maybe we do?
We're using the Scenic gem for our database views. By default, it uses a non-concurrent refresh, which locks the view against reads while it's being updated.
We've made the switch to a concurrent refresh, which keeps the view readable while it's being updated (sketched below). Just remember, this needs a unique index on the view - no biggie.
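Here's roughly what the change looks like with Scenic. The view, column, and job names are placeholders rather than our actual schema:

```ruby
# 1. A concurrent refresh needs a unique index on the materialized view,
#    so add one first (the :id column here is hypothetical).
class AddUniqueIndexToSalaryInsights < ActiveRecord::Migration[7.0]
  def change
    add_index :salary_insights, :id, unique: true
  end
end

# 2. Wherever the refresh runs (a background job in our case), pass
#    concurrently: true so reads keep being served from the old data
#    until the new version swaps in.
class RefreshSalaryInsightsJob
  include Sidekiq::Worker

  def perform
    Scenic.database.refresh_materialized_view(
      :salary_insights,
      concurrently: true,
      cascade: false
    )
  end
end
```

The unique index isn't optional, by the way: PostgreSQL refuses to run REFRESH MATERIALIZED VIEW CONCURRENTLY without one.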
Now It's Fast and Up!
The Upgrade - nodeflair.com/salaries: We slashed average latency by a staggering 64%, bringing it down from 2155ms to just 780ms. Downtime is now a thing of the past, dropping from 15.8% to a perfect 0%.
And thatās not all. Here are some additional wins:
- nodeflair.com: Average latency cut by 25%, from 1000ms to 750ms.
- nodeflair.com/jobs: Average latency improved by 20%, from 500ms to 400ms.
Solutions That Didn't Work or That I Didn't Pursue
Here are some solutions I either tried without much success or decided against due to cost or complexity. They remain fallback options if all else fails.
Caching
We hit a snag with Rails' built-in Fragment Caching: the real bottleneck wasn't in rendering the views; it was in the database.
What about Low-Level Caching?
Rails defaults to MemoryStore for caching, but that's not cutting it for us. With multiple EC2 servers, each with its own in-memory cache, one server can't see what another has cached. Plus, Elastic Beanstalk's auto-scaling and deployment cycles indirectly nuke cached data by creating new instances.
So, we switched gears and integrated Redis as our cache store, since we're already using Redis with Sidekiq. This allowed us to cache results and invalidate them only when the view gets successfully materialized. We also cached other common, infrequently changing DB queries, which cut down on database load and improved response times.
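Roughly, the shape of it looks like this. The cache store config is standard Rails (5.2+); the model, cache keys, and expiry below are purely illustrative:

```ruby
# config/environments/production.rb
# Reuse the Redis instance we already run for Sidekiq as the Rails cache store.
config.cache_store = :redis_cache_store, { url: ENV.fetch("REDIS_URL") }

# Low-level caching of an expensive, infrequently changing query.
def trending_salaries
  Rails.cache.fetch("salaries/trending", expires_in: 12.hours) do
    SalaryInsight.order(submissions_count: :desc).limit(50).to_a
  end
end

# Once the materialized view refresh succeeds, bust the related keys.
Rails.cache.delete_matched("salaries/*")
```

One note on the design: delete_matched on the Redis store scans keys to find matches, so it pays to keep the pattern narrow.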
Unfortunately, while it's an improvement, it didn't completely solve the issue.
Upgrading RDS Instance Type
- t4g.large - Costs 2x more but only doubles the RAM.
- t4g.xlarge - Provides 16GB of RAM and 4 vCPUs, but costs 4x more.
Partitioning
Partitioning is new territory for me, and there's a lot of uncertainty around how effective it would be.
Also, given our current data volume, it's unlikely we need it at this stage.