AWS Aurora PostgreSQL – LWLock:multixact_offset

On one of the PostgreSQL databases, active sessions were waiting on IO:SLRURead and LWLock:multixact_offset, causing application latency.
As per the PostgreSQL documentation, SLRURead means the session is waiting for a read of an SLRU page, and multixact_offset means it is waiting for I/O on a multixact offset buffer.
Diagnosing the issue took some time, so cutting to the chase:
1. Identified the SQLs and the relations (and their sizes) for which these wait events were high. These tables had ~1500 DML statements per 30 minutes (a sketch of seg_size.sql follows the output below).

\i seg_size.sql
Enter the relation name:xy_abc_queues
+---------------+---------+
|    relname    |  size   |
+---------------+---------+
| xy_abc_queues | 1658 MB |
+---------------+---------+
(1 row)
Time: 24.340 ms
\i seg_size.sql
Enter the relation name:xy_abc_queue_sla
+------------------+--------+
|     relname      |  size  |
+------------------+--------+
| xy_abc_queue_sla | 856 MB |
+------------------+--------+
(1 row)
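
The seg_size.sql script itself is not shown in the post; a minimal sketch, assuming it simply prompts for a relation name and prints the size reported by pg_relation_size(), could look like this:

-- seg_size.sql (sketch, not the original script)
\prompt 'Enter the relation name:' rel_name
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM   pg_class
WHERE  relname = :'rel_name';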

2. Reviewed last_autoanalyze and autovacuum_count. Both tables were being analyzed and vacuumed twice every 30 minutes (a sample query is sketched after the stats output below).
3. Installed the pgstattuple extension to obtain tuple-level stats.
4. Relation stats

SELECT * FROM pgstattuple('owner.xy_abc_queues');
+-[ RECORD 1 ]-------+------------+
| table_len          | 1738113024 |
| tuple_count        | 6880       |
| tuple_len          | 8947718    |
| tuple_percent      | 0.51       |
| dead_tuple_count   | 741053     |
| dead_tuple_len     | 1378505651 |
| dead_tuple_percent | 79.31      |
| free_space         | 337899240  |
| free_percent       | 19.44      |
+--------------------+------------+
Time: 219706.769 ms (03:39.707)
SELECT * FROM pgstattuple('owner.xy_abc_queue_sla');
+-[ RECORD 1 ]-------+-----------+
| table_len          | 897990656 |
| tuple_count        | 43623     |
| tuple_len          | 49133804  |
| tuple_percent      | 5.47      |
| dead_tuple_count   | 738004    |
| dead_tuple_len     | 604300048 |
| dead_tuple_percent | 67.29     |
| free_space         | 232187892 |
| free_percent       | 25.86     |
+--------------------+-----------+
Time: 475798.157 ms (07:55.798)
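
For reference, the autovacuum/autoanalyze activity from step 2 comes from pg_stat_user_tables, and pgstattuple (step 3) only needs the extension to be created. A sketch of the queries, using the table names from above:

-- step 2: when and how often autovacuum/autoanalyze ran
SELECT relname, last_autovacuum, last_autoanalyze,
       autovacuum_count, autoanalyze_count, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('xy_abc_queues', 'xy_abc_queue_sla');

-- step 3: create the extension (on Aurora/RDS this typically needs rds_superuser)
CREATE EXTENSION IF NOT EXISTS pgstattuple;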

5. Vacuum should get rid of these dead tuples so that the space can be reused, yet even though the tables were being auto-vacuumed, the dead-tuple percentage remained high.
6. One of the reasons autovacuum cannot free dead tuples is a long-running transaction.
7. Reviewed pg_stat_activity and noticed PIDs with transactions open for over a month.
8. Terminated those PIDs using pg_terminate_backend(pid) (a sketch of these queries follows the vacuum output below).
9. Executed vacuum manually.

vacuum verbose analyze owner.xy_abc_queue_sla;
vacuum verbose analyze owner.xy_abc_queues;
SELECT * FROM pgstattuple('owner.xy_abc_queues');
+-[ RECORD 1 ]-------+------------+
| table_len          | 1738113024 |
| tuple_count        | 3562       |
| tuple_len          | 2514281    |
| tuple_percent      | 0.14       |
| dead_tuple_count   | 5906       |
| dead_tuple_len     | 11542204   |
| dead_tuple_percent | 0.66       |
| free_space         | 1713851592 |
| free_percent       | 98.6       |
+--------------------+------------+
SELECT * FROM pgstattuple('owner.xy_abc_queue_sla');
+-[ RECORD 1 ]-------+-----------+
| table_len          | 897990656 |
| tuple_count        | 42127     |
| tuple_len          | 48074726  |
| tuple_percent      | 5.35      |
| dead_tuple_count   | 5059      |
| dead_tuple_len     | 3900324   |
| dead_tuple_percent | 0.43      |
| free_space         | 837034048 |
| free_percent       | 93.21     |
+--------------------+-----------+
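
For steps 7 and 8, the long-open transactions were spotted in pg_stat_activity and then terminated. A sketch of the kind of queries used (the one-day threshold is just an example):

-- step 7: sessions whose transaction has been open for more than a day
SELECT pid, usename, state, xact_start, query
FROM   pg_stat_activity
WHERE  xact_start < now() - interval '1 day'
ORDER  BY xact_start;

-- step 8: terminate an offending backend (substitute the real pid)
SELECT pg_terminate_backend(12345);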

After this, the IO:SLRURead and LWLock:multixact_offset wait events subsided and application latency returned to normal.
