Jul 29

Memory Pressure Analysis Risk State

I have a Test database that is a 2-node RAC system. I am working towards upgrading the production database to a newer Oracle release in about a month’s time. This of course means that I have to upgrade Grid Infrastructure before the database upgrade. I have already upgraded GI on my standby cluster and on my Test database. The primary GI upgrade is scheduled for this evening.

Ever since I upgraded GI in Test a few weeks ago, I’ve been getting alerts from EM12c similar to the following:

Target type=Cluster
Target name=test-scan
Message=Server is under elevated memory pressure and services on all instances on this server will be stopped
Event reported time=Jul 29, 2015 1:05:13 PM CDT
Operating System=Linux
Event Type=Metric Alert
Event name=wlm_event:wlm_qosm_mpa_risk_state
Metric Group=QoS Events
Metric=Memory Pressure Analysis Risk State
Metric value=RED

Some of the alert details were removed for brevity.

So where is this coming from? What does it mean to me?

This error is coming from Oracle’s Quality of Service (QoS) in Grid Infrastructure. It relies on Cluster Health Monitor (CHM) information. More specifically, this alert is coming from Memory Guard. For some information on Memory Guard, see this PDF, specifically the end of the second page.

Memory Guard is trying to save me from myself, and as we will see, it is doing a poor job of it. The idea is that when the server has memory pressure, Memory Guard will take any services on that node out-of-service. Allowing more connections would consume even more memory and could make the situation worse. New connection requests must go to another node in the cluster running that service. This is exactly what the Message value in the alert is telling me.

According to this EM 12c document, section 4.3.2, Memory Pressure Analysis Risk State, the alert text is supposed to contain the server name. Yet the message text above does not tell me which server is having the problem. Luckily for me, it’s only a 2-node RAC cluster, so I don’t have too many to examine.

When I do look at the CPU utilization, everything is fine. Swap usage is practically zero on both nodes. Free memory is more than 25% on both nodes. Curious…why the alert in the first place?
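A quick way to repeat this sanity check on each node is to read /proc/meminfo directly. The following is a minimal sketch, not what Memory Guard or CHM actually computes: the function names are my own, and treating MemFree plus Buffers plus Cached as "available" memory is an assumption made only for comparison against the plain MemFree figure.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values.

    Most values are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip anything that is not "Name: value [unit]"
        fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Return free-memory and swap-usage percentages for one node.

    Assumption: free_plus_cache_pct counts Buffers and Cached as
    reclaimable, which is a rough approximation only.
    """
    total = fields["MemTotal"]
    free = fields["MemFree"]
    free_plus_cache = free + fields.get("Buffers", 0) + fields.get("Cached", 0)
    swap_total = fields.get("SwapTotal", 0)
    swap_used = swap_total - fields.get("SwapFree", 0)
    return {
        "free_pct": 100.0 * free / total,
        "free_plus_cache_pct": 100.0 * free_plus_cache / total,
        "swap_used_pct": 100.0 * swap_used / swap_total if swap_total else 0.0,
    }
```

On a Linux node, passing the text of /proc/meminfo to parse_meminfo and the result to memory_summary gives the three percentages. A large gap between free_pct and free_plus_cache_pct is a hint that any tool looking only at raw free memory may be overstating the pressure.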

Every time I get this alert, I get another email within a few minutes saying the condition has cleared. So the issue is short-lived. Yet the alerts keep coming.

It turns out, after some investigation, that Oracle made a change to Memory Guard in the new Grid Infrastructure release. In earlier versions, Memory Guard only looked after policy-managed databases. In the new release, Memory Guard started looking after admin-managed databases as well. My RAC databases are typically admin-managed, which is one reason why I’m seeing this now.

To add to the issue, this GI release apparently has a known bug, Bug 1582630, in which the amount of free memory is calculated incorrectly. Note 1929994.1 lists a workaround, and there is a patch as well. I applied the workaround and it resolved my problem. I’ll get the patch applied to Test before I proceed to production in the not-too-distant future.

Thankfully, I discovered this before my production GI upgrade later tonight. Otherwise I would have had upset end users who might have had trouble connecting to the database. This is just one more example of why I keep a good test platform: it lets me discover and resolve issues before the change is made in production.