One of the Production Aurora PostgreSQL instance running on db.r4.16xlarge instance (64 vCPU and 488 GB ) was reporting high CPU Utilization spiking upto 100%.
With such issues, one of the first thing is to look for the SQLs with high buffers shared hit. I have built a small tool called pgsnap which is something similar to AWR respostory in Oracle maintaining the SQL stat history. So, with pg_stat_statements and hist_pg_stat_statements(that’s what I call it), I was able to identify the SQL.
select col1, col2, col3, col4, col5 from cltn_errs redis0_ where redis0_.sbmn_id=123456;
Lets look at the execution plan
Gather (cost=1000.00..2070031.26 rows=2696 width=262) (actual time=17475.126..18771.216 rows=1 loops=1) Output: col1, col2, col3, col4, col5 Workers Planned: 7 Workers Launched: 0 Buffers: shared hit=3945515 -> Parallel Seq Scan on demo.cltn_errs redis0_ (cost=0.00..2068761.66 rows=385 width=262) (actual time=17474.807..18770.895 rows=1 loops=1) Output: col1, col2, col3, col4, col5 Filter: (redis0_.sbmn_id = '123456'::numeric) Rows Removed by Filter: 52390761 Buffers: shared hit=3945515 Planning time: 0.652 ms Execution time: 18771.384 ms
The problem was obvious!! Full table scan of table cltn_errs (~15Gb table). So, I restored the latest snapshot, created an index on “sbmn_id” column and the execution plan changed to
Index Scan using idx_sbmn_id on demo.cltn_errs redis0_ (cost=0.57..182.80 rows=3086 width=272) (actual time=0.031..0.032 rows=1 loops=1) Output: col1, col2, col3, col4, col5 Index Cond: (redis0_.sbmn_id = '123456'::numeric) Buffers: shared hit=5 Planning time: 0.573 ms Execution time: 0.085 ms
Wow!! After the index, Buffers: Shared hit and the total execution time has magnificent improvement, more than 100x. So, with this testing, I created the index on Prod and after this little change CPU Utilization graph had to say it all.