Jul 29

Memory Pressure Analysis Risk State

I have a Test database that is a 2-node RAC system. I am working toward the goal of getting the production database to Oracle 12.1.0.2 in about a month. That of course means I have to upgrade Grid Infrastructure prior to the database upgrade. I have upgraded GI on my standby cluster and on my Test database as well. The primary GI upgrade is scheduled for this evening.

Ever since I upgraded GI in Test a few weeks ago, I’ve been getting alerts from EM12c similar to the following:

Host=host01
Target type=Cluster
Target name=test-scan
Categories=Business
Message=Server is under elevated memory pressure and services on all instances on this server will be stopped
Severity=Warning
Event reported time=Jul 29, 2015 1:05:13 PM CDT
Operating System=Linux
Platform=x86_64
Event Type=Metric Alert
Event name=wlm_event:wlm_qosm_mpa_risk_state
Metric Group=QoS Events
Metric=Memory Pressure Analysis Risk State
Metric value=RED

Some of the alert details were removed for brevity.

So where is this coming from? And what does it mean to me?

This error is coming from Oracle’s Quality of Service (QoS) in Grid Infrastructure. It relies on Cluster Health Monitor (CHM) information. More specifically, this alert is coming from Memory Guard. For some information on Memory Guard, see this PDF, specifically the end of the second page.

Memory Guard is trying to save me from myself, and as we will see, it is doing a poor job of it. The idea is that when the server has memory pressure, Memory Guard will take any services on that node out-of-service. Allowing more connections would consume even more memory and could make the situation worse. New connection requests must go to another node in the cluster running that service. This is exactly what the Message value in the alert is telling me.
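
If I want to verify what Memory Guard has done to my services, srvctl will show me. A minimal sketch; the database name orcl and service name hr_svc are hypothetical stand-ins for your own:

# Database and service names are hypothetical -- substitute your own.
# If Memory Guard stopped the service on a node, that instance will be
# missing from the "is running on instance(s)" output.
srvctl status service -d orcl -s hr_svc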

According to this EM 12c document, section 4.3.2, Memory Pressure Analysis Risk State, the alert text is supposed to contain the server name. Yet the message text above does not tell me which server is having the problem. Luckily for me, it’s only a 2-node RAC cluster, so I don’t have too many to examine.

Yet when I do look at the nodes, everything is fine. CPU utilization is normal, swap usage is practically zero on both nodes, and free memory is more than 25% on both. Curious…why the alert in the first place?
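
For the record, here is roughly how I check those numbers at the OS level, along with asking Cluster Health Monitor what it recorded, since CHM data is what QoS works from. The host name comes from the alert above:

# Check free memory and swap usage on the node.
free -m

# Dump the last five minutes of Cluster Health Monitor data for the
# node named in the alert; this is the same data Memory Guard bases
# its decision on.
oclumon dumpnodeview -n host01 -last "00:05:00"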

Every time I get this alert, I get another email within a few minutes saying the condition has cleared. So the issue is short-lived. Yet the alerts keep coming.

It turns out, after some investigation, that Oracle made a change to Memory Guard in Grid Infrastructure 12.1.0.2. In earlier versions, Memory Guard only looked after policy-managed databases. In GI 12.1.0.2, Memory Guard started looking after admin-managed databases as well. And my RAC databases are typically admin-managed, which is one reason why I’m seeing this now.
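
If you're not sure how a particular database is managed, srvctl will tell you. A quick check, with the database name again being a hypothetical placeholder:

# Database name is hypothetical -- substitute your own. The output
# includes a line stating whether the database is administrator
# managed or policy managed.
srvctl config database -d orcl | grep -i managed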

To further complicate the issue, GI 12.1.0.2 has a known bug, Bug 1582630, where the amount of free memory is calculated incorrectly. MOS Note 1929994.1 lists a workaround and there is a patch as well. I applied the workaround and it resolved my problem. I’ll get the patch applied to Test before I proceed to production in the not-too-distant future.

Thankfully, I discovered this before my production GI upgrade later tonight. Otherwise, I would have had upset end users experiencing issues connecting to the database. This is just one more example of why it pays to have a good test platform on which to discover and resolve issues before the change is made in production.

Jul 27

Remember RAC Instances in Perf Tools

I was using Lighty to examine performance of a new Oracle RAC database that went into service today. You can see something similar in Oracle Enterprise Manager 12c Cloud Control if you choose. Anyway, here is the performance graph of the RAC database in question.

[Image: Lighty_SatIO_01]

Common wisdom says that if the chart stays below the green line representing the CPU Limit, we don’t have major contention. But this is an Oracle RAC database. The chart above covers all instances (this is a 2-node RAC database) because I selected both instances in Lighty.

[Image: Lighty_SatIO_02]

But if I switch to one instance, I see a different story.

[Image: Lighty_SatIO_03]

Notice that during the same time period, I have user I/O saturation.

The moral of the story is that when looking at Oracle RAC databases with performance tuning tools, you sometimes have to look at an individual instance. In my example, the overall database (all instances) looked fine, but when I zoomed in to a specific instance, the same performance tool painted a different picture.
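
If you want to verify this without a GUI at all, the same per-instance story can be pulled straight from ASH. A minimal sketch, which requires the Diagnostics Pack license:

# Count the last hour of ASH samples per instance. A pile of WAITING
# samples on one inst_id is the same signal the single-instance
# chart showed above.
sqlplus -s / as sysdba <<'EOF'
select inst_id, session_state, count(*) as samples
from   gv$active_session_history
where  sample_time > sysdate - 1/24
group  by inst_id, session_state
order  by inst_id, samples desc;
EOF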

Jul 27

The Data Guardian

This morning, I was reading this article about employees’ lack of understanding of how critical it is to protect business data. The article got me thinking about a few different points.

As a DBA, I’ve always tried to be my company’s Data Guardian. To me, the data inside the databases I manage is a valuable corporate asset. After all, the database exists for one reason: to provide efficient and timely access to the data. That data has value; otherwise the company wouldn’t pay all the money it does to store it, to manage it, and to make it available to end users. So in that vein, I’ve always strived to be the one who says we can’t just let people have unfettered access to the data. We must have proper security controls in place in our databases. We must be vigilant with our security mechanisms by making sure the database is patched regularly and is on current versions, by implementing the concept of “least privilege”, etc.

Because being a DBA is my business and I’ve always been the Data Guardian, I’ve been involved in numerous conversations with people inside my company about protecting that valuable corporate asset, the data. Admittedly, most of these conversations are with people already in the company’s IT department. It is rare for me to have these types of conversations with people outside of IT. Yet as this article points out, 43% of respondents to a poll “admitted that they were either somewhat or completely ignorant about the value of corporate data.” I probably need to get out of the IT department once in a while and have these same conversations with non-IT people.

Then the article discussed something that I hadn’t really considered before, at least not consciously, even in my role as the company’s Data Guardian. The article talks about how employees “were more worried about losing their personal data, than leaking out business data which could be far more damaging.”

Like many people in the 21st century, I too have been a victim of having my private, personal data breached, data that was held by some other party. I was one of the individuals victimized by the Target data breach in 2013. My bank quickly issued a new card since the card I had swiped at Target was at risk. Target even offered the victims some form of identity protection. More recently, I know that my private data has been breached in the US OPM systems hack. Me and 4.2 million other individuals. Since this data breach is more recent and applies to me directly, I’ve been giving thought to data breaches in general and how they affect me.

But as this article points out, have you given thought to how much a data breach at your own company could affect you personally? I certainly gave thought to how the Target and OPM breaches affected me and my family. But how does a data breach affect me and my family if the databases I’m in charge of at my job get hacked? The databases in my control contain very little of my own Personally Identifiable Information (PII). But they do contain PII of our customers and clients, as I am very well aware. If their PII gets out, how does it affect me?

Well certainly, if I am found at fault, my company could fire me. I could lose my job. I think that’s the most obvious effect of a data breach. But what else? I know of a data breach at a company that I used to work for in the past. That breach occurred well over a decade after I left that company, so it had nothing to do with me. But I do personally know some of the individuals who were fired over the incident, people that I worked with and built relationships with. The company was sued over the data breach, and some of my ex-coworkers were named in the lawsuit. Even if it is ultimately decided that these individuals are not at fault, they still have the nightmare of defending themselves and their reputations. What other ways are there for me to be personally affected by my company’s data breaches?

Like many employees, I feel that I have a vested interest in how well the company performs. Many companies try to instill this mindset in their employees by offering stock options and bonuses tied to company performance. Can one metric of company performance be the number of data breaches exposing valuable corporate data to hackers? If nothing else, our customers may lose confidence in our ability to keep their PII safe. Lost confidence means that our customers may quickly become some other company’s new customers.

So even if I’m only thinking about my personal, self-serving viewpoint, it is incumbent upon me to be more vigilant as the corporate Data Guardian. The more the company succeeds, the better it will be for me personally in the long run. After reading this article, I’m not sure that others in the company have the same values. Maybe I need to do more to spread the word. Most likely, I need to do more as the Data Guardian than sit in my office only talking to other IT staff about data security.


Jul 23

Lighty In A Nutshell

As you may know from a previous post on this blog, or if you follow me on Twitter (BPeaslandDBA), I like Lighty for Oracle, a product by Orachrome. I love this product and how easy it is to obtain performance diagnostic information. I’m very familiar with Oracle’s Enterprise Manager and I use EM12c all the time when I get calls about poor database performance. But I am now hoping that Lighty becomes another tool in my arsenal.

I don’t expect that Lighty will replace Enterprise Manager. EM12c does so much more than Lighty, e.g. monitoring and alerting, and various database administration tasks. So I’ll be keeping EM12c, thank you very much. But I will be relying on Lighty more for the one part of my job that Lighty does really well: providing information to aid in performance tuning efforts. Lighty is only for performance tuning, and because it has that singular focus, it does the job very well.

I’ll try to show some highlights of why I think Lighty is a great tool to use. I’ll contrast with EM12c. For starters, in EM12c, I go to Performance –> Performance Home and I see a screen similar to the following.

[Image: Lighty01]

Right away I can see that I have CPU resource contention. The main screen in Lighty looks like the following.

[Image: Lighty02]

So we have very similar information. And that’s where the similarities end for me…right at the start. In EM12c, I am forced to make a decision: do I drill down into the CPU utilization, or into one of the wait classes like User I/O? Since the above shows CPU contention, in EM12c I click on the CPU portion of the chart to drill down. At this point in EM12c, I am only considering CPU usage in my analysis. As we’ll see later, Lighty doesn’t require me to make that choice if I don’t want to. Below is my chart from EM12c.

[Image: Lighty03]

The chart above is very common for me in my Oracle RAC database. It shows the CPU utilization for three instances of my clustered database. Which one is dominating the CPU utilization? Well, they all look fairly equal. Yet EM12c forces me to choose one instance for analysis. I can’t examine all 3 instances in one screen. A lot of my work is with Oracle RAC performance tuning, and EM12c’s limitations bother me here. I have to choose a wait class or CPU and then choose an instance before I can obtain information on the top SQL statements or top sessions. If I drill down far enough in EM12c, I can get a screen like the following.

[Image: Lighty04]

I’ve finally gotten somewhere! I can see the top SQL statements and the top sessions. Notice the highlighted part. This is detail for a 5-minute window. The five-minute window is represented by the shaded box in the CPU utilization chart below for the instance.

[Image: Lighty05]

In EM12c, I can drag the shaded box to a different time and the top SQL and top sessions information will change to match. But I can’t change this to a 10-minute interval or some other time period. I’m stuck at 5 minutes.

Now let’s contrast this with Lighty. We’ve already seen the performance chart, just like EM12c has. Below that performance chart in Lighty is a detail pane with the top SQL statements. I didn’t have to click on anything to get the top SQL statements.

[Image: Lighty06]

Right away, I see one difference between Lighty and EM12c. My top SQL statements are not restricted solely to CPU or a specific wait class. If I do want to restrict the information, I can choose from a menu box above the performance graph.

[Image: Lighty07]

I typically have All selected, but I can choose to see just CPU if I want. Lighty doesn’t limit me like EM12c does, unless I choose to subset the information.

Notice in the top SQL list that I am presented with a timeline of the events for each SQL. For example, we can see that the top SQL statement in the list is using lots of CPU. Other SQL statements in the list start and stop their CPU usage, shown in light green. User I/O is simultaneously shown in blue. If I expand a SQL statement, I can get a breakdown of that statement.

[Image: Lighty08]

Just by clicking on the plus sign next to the SQL statement, I can see that this statement has two different execution plans and that one of them is consuming 93.92% of the total CPU utilization. If I expand that plan, I can see where it is spending its time, broken down by wait event.

So let’s recap. I started Lighty and was immediately presented with my top SQL statements, and with a single click, I knew that the top one has a CPU-consuming execution plan. That’s super quick and easy in my opinion. EM12c makes it harder to get to this information, and EM12c will not show me when a SQL statement started and stopped consuming CPU the way Lighty’s graphs do.

If I click on a SQL statement, much like I would in EM12c, Lighty will show me stats on the specific SQL. If I click on the Usage tab, I can see all of the sessions that ran this statement. Notice that the chart shows a timeline of when that session was consuming that resource for this SQL statement. EM12c won’t show me that level of detail.

[Image: Lighty09]

With the above, I can easily see that it is multiple sessions executing the same statement at different times.

Remember that EM12c made me choose an instance for this Oracle RAC database? In Lighty, I don’t have to make that choice. But I can quite easily. Lighty is RAC-aware and detected the multiple instances. By default, I have all of the instances chosen. A simple pulldown menu lets me select one of them and the information on the screen automatically adjusts to just that instance.

[Image: Lighty10]

Remember how in EM12c the shaded window represented a 5-minute period of time? In Lighty, the SQL and session details are for the entire graph. If you want a 5-minute window, you can choose the 5-minute option above the graph. By default, a 60-minute window is chosen.

[Image: Lighty11]

And I can choose other options as well, including a Custom date/time range. In EM12c, the graph is for 1 hour and I can’t change it.

EM12c has its ASH Analytics for letting me look at performance in the past. But I’ve always had problems with it. With ASH Analytics (which lets me view the Active Session History), I can change that gray window to be something other than 5 minutes. But for me, EM12c simply never returns the data. Maybe something is wrong with my EM12c environment. But Lighty works out of the box with historical ASH data. Let’s say I got a call about a performance issue that occurred between 8:00 and 9:00 this morning. I simply choose a Custom time range.

[Image: Lighty12]

In EM12c, I’m forced to go to a different web page, Performance –> ASH Analytics.


So far, you’ve probably latched onto the idea that I think that Lighty is a quick and easy way to obtain much of the same information I can get from EM12c. And you’d be right! Here is another area that I like about Lighty. If I go to Window –> Multiple Database Monitoring, then I can see graphs of many different performance metrics for multiple databases. Here is an example.

[Image: Lighty13]

At a glance, I can monitor multiple databases on one screen. EM12c won’t let me do that. I can see that the database on the far right is pretty idle while the database in the middle is getting hit pretty hard. Scrolling down the page, I get charts for many metrics:

  • Parses
  • User Calls
  • Transactions
  • Reads and Writes
  • Disk latency
  • Disk I/O throughput
  • Global Cache Transfer rates
  • Redo generation rates
  • Network Traffic

In EM12c, I would have to visit too many pages to obtain the same information that Lighty gives me in one page.

All of my examples rely on the Diagnostics and Tuning Packs, which I have licensed. If you do not license those packs, Lighty provides similar functionality with Statspack and its own version of ASH, L-ASH (Lighty ASH).

All in all, I think Lighty is a great tool for the DBA who is responsible for performance tuning. The cost is very reasonable as well. The more I use the tool, the more I come to appreciate how quickly I can get information. I can turn things on or off with no more than two clicks and easily drill down to the root cause of my performance issues. I encourage all Oracle DBAs to download the trial and see if this product is right for them. Lighty can be found here: http://www.orachrome.com/en/index.html

Jul 16

Lighty for Oracle

Not that long ago, I was pointed in the direction of Lighty for Oracle by Orachrome. This is a wonderful performance tuning tool. I’m still on the 30-day trial and I hope to get my company to purchase it soon. As soon as I get time, I’ll try to post something that shows why I think this is a very cool tool!

Jul 16

Web Site Down and Twitter Feed

My web site was down for the last 4 days. My apologies. I had to spend a lot of time going back and forth between my ISP and my domain registrar. But it seems to be sorted out now.

So the good news, depending on your perspective, is that once my site came back up, I felt the need to spend time with it. I decided to do something that I’ve been meaning to get accomplished for quite some time. I added a widget to the right which shows my last 5 tweets. That widget probably only gets a smile out of me and not anyone who reads the blog. :)

Jul 01

Who is using my swap space?

Most of the time, I post entries in this blog to share knowledge with others. Sometimes, I post an entry so that in the future I can quickly find an answer to something I know I forgot. This is one of those times. If others find this useful, great!

I’ve been working on a Linux system that has had some memory-related issues for one of my Oracle databases. I ended up needing to increase the swap space as it was set too low. After swap was increased, I saw some swap space being used and I wanted to know if this was Oracle using swap or something else. So who is using my swap space on this Linux system? I can run the following command to find out.

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{print ""}' $file; done | sort -k 2 -n

The output is sorted in ascending order with the most egregious offenders at the bottom of the list.

To sort in reverse order and then stop the output from scrolling off the screen, use this version instead.

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | sort -k 2 -n -r | less
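
As a quick sanity check, I can also total the VmSwap figures across all processes; the sum should be in the same ballpark as the used swap reported by the free command:

# Total the VmSwap values (in kB) across every process. Errors from
# processes that exit mid-scan are discarded.
awk '/VmSwap/ {total += $2} END {print total " kB in use"}' /proc/*/status 2>/dev/null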

Jun 25

TNS-12519 w/o Max Processes Reached

I got a call from someone saying they were getting TNS-12519 errors in their application. Sure enough, those messages were in the listener.log file as well.

TNS-12519: TNS:no appropriate service handler found

For those who aren’t familiar with this error, it typically means one of two things: either the service name is not registered with the listener, or the database has reached its maximum number of processes. In the latter case, the Listener knows that the database can’t accept any new processes, so it takes the service name out of service, so to speak. A quick “lsnrctl status” shows me that the service name is registered correctly. So it must be the latter. I then issue the following query and confirm my suspicions.


SQL> select resource_name,current_utilization,max_utilization
 2 from v$resource_limit where resource_name='processes';
RESOURCE_NAME   CURRENT_UTILIZATION MAX_UTILIZATION
--------------- ------------------- ---------------
processes                       299             300

Sure enough, I’ve reached max processes, which is defined in my SPFILE to be 300. I increased the parameter to 600 and bounced the instance. We hit the error again with double the number of processes. I obviously need to dig into this further.

For some background, and for something that will be important later on, it is important to know that this database supports our automated testing efforts. We have a test harness that exercises our primary production application. This test suite launches the application, connects to the database, pushes a few buttons and selects a few menu items and then disconnects.

I examined the listener.log file to see where the connection requests were coming from. These requests were coming from a rogue application server, not our test suite. I knew something was amiss because we had not yet started the test suite, yet we were already getting the errors. We fixed the application server and I didn’t see the errors return. We then fired up the test suite and, some time later, the TNS-12519 errors returned. Hmmm…I thought I had found the culprit. But let’s check our process utilization.

SQL> select resource_name,current_utilization,max_utilization
 2 from v$resource_limit where resource_name='processes';
RESOURCE_NAME   CURRENT_UTILIZATION MAX_UTILIZATION
--------------- ------------------- ---------------
processes                        89             157

 

So I’m currently seeing 89 processes with a maximum utilization of 157. I’m nowhere near my new limit of 600. So what gives? It took me a while to figure out what the issue was. The service name is correctly registered and I’m nowhere near my limit. MOS Note 552765.1 talks about how the Listener arrives at the TNS-12519 error. Previously, I was seeing the most common cause: max PROCESSES had been reached. But not this time. So what gives?

After investigation, I found my answer in the listener’s log. But it wasn’t obvious like some big error message. The first occurrence of the TNS-12519 error was at 9:38 am. I grep’ed for “service_update” in the listener log, with something like the command below, and saw these entries.
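
# The listener.log location below is the 11.2 ADR default and is an
# assumption -- adjust the ADR base and listener name for your system.
grep "service_update" /u01/app/oracle/diag/tnslsnr/$(hostname -s)/listener/trace/listener.log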

25-JUN-2015 09:17:08 * service_update * orcl * 0
25-JUN-2015 09:17:26 * service_update * orcl * 0
25-JUN-2015 09:17:29 * service_update * orcl * 0
25-JUN-2015 09:17:44 * service_update * orcl * 0
25-JUN-2015 09:17:50 * service_update * orcl * 0
25-JUN-2015 09:17:53 * service_update * orcl * 0
25-JUN-2015 09:18:56 * service_update * orcl * 0
25-JUN-2015 09:18:59 * service_update * orcl * 0
25-JUN-2015 09:19:50 * service_update * orcl * 0
25-JUN-2015 09:20:11 * service_update * orcl * 0
25-JUN-2015 09:21:27 * service_update * orcl * 0
25-JUN-2015 09:22:09 * service_update * orcl * 0
25-JUN-2015 09:24:05 * service_update * orcl * 0
25-JUN-2015 09:27:53 * service_update * orcl * 0
25-JUN-2015 09:29:32 * service_update * orcl * 0
25-JUN-2015 09:34:07 * service_update * orcl * 0
25-JUN-2015 09:41:45 * service_update * orcl * 0

Notice that the service updates come quickly at 9:17 and 9:18, but the time between service updates grows longer and longer. Notice that there were 7 minutes 38 seconds between the final two service updates (9:34:07 to 9:41:45). Why is this important?

This is an Oracle 11.2.0.4 database. In 11.2 and earlier, PMON is responsible for cleaning up after processes and for passing information to the Listener. On database startup, PMON registers the services with the Listener. Another thing PMON does is tell the Listener the maximum number of processes that can be serviced. In this case, PMON tells the listener that it can have up to 600 processes. PMON does more, but for purposes of this discussion, that’s enough.

One important piece to know is that the Listener never knows how many processes are currently connected. It only knows how many connection requests it has helped broker. The Listener never knows if processes disconnect from the database. The service_update above is where PMON is telling the Listener how many processes are actually being used. So at 9:34:07, the PMON service update tells the Listener that there are 145 processes in use. The Listener is now up-to-date. When a new connection request comes in, the Listener increments this to 146 processes. Between the service updates, the Listener is totally unaware that 1 or more processes may have been terminated, normally or abnormally. It keeps incrementing its count of processes until the next service update when the Listener gets an accurate account of how many processes are spawned.

So we have that nearly eight-minute gap between service updates. Why did it take PMON so long to get back to the Listener? The clue for that is in the listener.log as well. I stripped out everything from the log prior to the 9:34 service_update and after the 9:41 service_update. From there, it was easy to grep for “(CONNECT_DATA=” in what remained and pipe to “wc -l” to get a count of lines, as in the sketch below.
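
A minimal sketch of that filtering and counting; the file name and timestamps are specific to my system:

# Keep only the lines between the 9:34:07 and 9:41:45 service updates,
# then count the connection requests in that window. grep -c is the
# same as grep | wc -l.
awk '/09:34:07/,/09:41:45/' listener.log | grep -c "(CONNECT_DATA="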

During this gap, I had well over 450 new connection requests! Yet most of those new connections terminated quickly, as evidenced by V$RESOURCE_LIMIT showing me a max utilization of only about 150. PMON was so busy cleaning up after the application exiting its database connections that it had a big lag before it updated the Listener. As far as the Listener was concerned, the roughly 150 current connections plus the 450 new connections meant that it had reached its limit of 600.

PMON can take up to 10 minutes to get back to the Listener with its next service update. Cleaning up after sessions exit the instance is a higher priority than service updates to the Listener. At the 10-minute mark, PMON makes the service update its top priority if one has not been done in that time interval.

Remember that this is a database to support automated testing. We have to live with this many connect/disconnect operations because we have an automated robot testing our primary production application in rapid-fire fashion. We don’t want to change how the application works, because it works very well when run by a single user. Our automated test suite executes the application differently than it was designed to be used. But we want to exercise the application as it’s written, to potentially expose bugs before the code changes hit production.

For now, I’ve increased the PROCESSES parameter to a value that we’ll never reach, all so that the Listener won’t hit the limit in its internal counter. The more PROCESSES, the more memory is needed in the SGA to support that higher number. But this database can handle it.
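
For reference, bumping the parameter looks something like this; the value shown is illustrative only, sized well above anything the Listener’s counter could reach between service updates:

# PROCESSES is not dynamic, so the change goes to the SPFILE and
# takes effect at the next instance restart. 3000 is illustrative.
sqlplus -s / as sysdba <<'EOF'
alter system set processes=3000 scope=spfile sid='*';
EOF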

If anyone knows of a way to get the service update to occur in a 5-minute window, please let me know. There aren’t any documented parameters to control this behavior, and I’ve been unable to find an undocumented one either.

Lastly, this issue is in one of my 11.2.0.4 databases. Oracle 12c changes the architecture a bit. The new Listener Registration (LREG) background process handles the work that PMON used to do. I don’t have a system to test with quite yet, but I bet LREG won’t have the same issue in 12c that PMON is exhibiting here in 11g, since LREG doesn’t have to handle the cleanup duties that PMON does. MOS Note 1592184.1 shows that LREG performs the service updates.
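
Once a system is on 12c, confirming that LREG is present and handling registration is as simple as looking for the background process:

# On 12c, each instance should have an ora_lreg_<SID> background
# process. The brackets keep grep from matching its own command line.
ps -ef | grep '[l]reg'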


Update: Since I wrote this article I did have a chance to upgrade the database to 12c with its new LREG process. With LREG handling the Listener’s service updates, we saw the issue disappear. Even during times of heavy session activity, specifically connecting and disconnecting, LREG made regular service updates that PMON would not have been able to perform as often.

Jun 18

ADDM on SearchOracle.com

Here is another article I wrote for SearchOracle.com in case you missed it. This one focuses on how to use ADDM to quickly diagnose a performance problem.

http://searchoracle.techtarget.com/tip/Oracle-ADDM-helps-resolve-SQL-performance-problems


Jun 18

Which Linux Partition?

I’m trying to upgrade a system’s Grid Infrastructure to the latest and greatest version. The OUI says that I need 7GB of free space and I’m a little short. This is on virtual machines, so we can grow the disk easily enough. Everything is installed in the root partition (this isn’t production), and my SysAdmin is asking me if he needs to grow /dev/sda1 or /dev/sda2. The disk device is /dev/sda, the first disk device on the system, and it has two partitions, named sda1 and sda2. Which partition holds the root file system? Today, I stumbled across the lsblk command, which shows me exactly how this disk is partitioned. The lsblk command is short for List Block Devices. Here is the output on my system.

lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda    8:0    0   42G 0 disk
├─sda1 8:1    0  400M 0 part /boot
└─sda2 8:2    0 31.6G 0 part /
sr0   11:0    1 1024M 0 rom

The output above clearly shows that sda1 is for the boot partition and sda2 is for the root partition.
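
Another quick cross-check is df, which maps a mount point back to the device holding it:

# Confirm which partition backs the root file system.
df -h /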
