Mar 02

GWFG in Oracle RAC

I have a deadlock being reported in my 3-node Oracle RAC database (version 11.2.0.4) as can be seen in the Alert Log. Being that this is an Oracle RAC database, resources are managed globally and the Lock Manager Daemon (LMD) gets involved. The message in the alert log pointed me to a LMD trace file which contained this Global Wait-For-Graph (GWFG).

*** 2015-02-27 04:16:33.183
Submitting asynchronized dump request [1c]. summary=[ges process stack dump (kjdglblkrdm1)].
Global blockers dump end:-----------------------------------
Global Wait-For-Graph(WFG) at ddTS[0.394d] :
BLOCKED 0x551b2c698 2 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[6B000-0004-0000022D] inst 4
BLOCKER 0x5513ed318 2 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[B6000-0006-00000099] inst 6
BLOCKED 0x5513ed318 4 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[B6000-0006-00000099] inst 6
BLOCKER 0x5513ef5b8 4 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[66000-0005-00000FDB] inst 5
BLOCKED 0x5513ef5b8 4 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[66000-0005-00000FDB] inst 5
BLOCKER 0x551b2c698 4 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[6B000-0004-0000022D] inst 4
* Cancel deadlock victim lockp 0x551b2c698

 

Note:  The actual SQL and a few other details may have been changed to protect the innocent.

So there I have 3 sessions involved in the deadlock. One each on instance id 4, 5, and 6.

 

I was looking at the trace file generated on instance id 4. Above the GWFG was this information:

 

user session for deadlock lock 0x551b2c698
 sid: 1727 ser: 539 audsid: 996549224 user: 13944/MP_SYS
 flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x40009) -/-/INC
 pid: 107 O/S info: user: oracle, term: UNKNOWN, ospid: 11229
 image: oracle@myhost04.naucom.com
 client details:
 O/S info: user: web-svc, term: web-server1, ospid: 4276:864
 machine: DOMAIN\web-server1 program: iis.exe
 client info: user: WEBSERVICE
 current SQL:
 INSERT INTO MP_SYS.T2( column_list) SELECT column_list FROM MP_SYS.T1 WHERE MP_SYS.T1.P_ID=:B1
DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK
 possible owner[107.11229] on resource TM-0011FFA3-00000000

 

So I can see information about the machine, program, and SQL statement being executed. The user session identifier in red above corresponds to the id value in the GWFG. Lets look at the first two lines of the GFWG again.

BLOCKED 0x551b2c698 2 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[6B000-0004-0000022D] inst 4
BLOCKER 0x5513ed318 2 wq 2 cvtops x1 TM 0x11ffa3.0x0(ext 0x0,0x0)[B6000-0006-00000099] inst 6

So the SQL statement and session details match this first line. This session is blocked on inst 4. The blocker is on inst6 and is identified as 0x5513ed318. Well who is this? There are no other details in this GWFG to help us with the BLOCKER.

To learn more about the blocker, I went to inst 6 and did this:

cd /u01/app/oracle/diag/rdbms/admin/orcl/orcl6/trace
grep 0x5513ed318 *

So I just did a grep on the session identifying value and was given a LMD trace file. Looking in that LMD trace file on the other instance gives me details about the blocking session.

user session for deadlock lock 0x5513ed318
 sid: 1206 ser: 2673 audsid: 996459926 user: 13944/MP_SYS
 flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
 flags2: (0x40009) -/-/INC
 pid: 182 O/S info: user: oracle, term: UNKNOWN, ospid: 7049
 image: oracle@myhost06.naucom.com
 client details:
 O/S info: user: web-svc, term: web-server2, ospid: 4276:864
 machine: DOMAIN\web-server2 program: iis.exe
 client info: user: WEBSERVICE
 current SQL:
 DELETE FROM MP_SYS.T1 WHERE P_ID = :B1
DUMP LOCAL BLOCKER: initiate state dump for DEADLOCK
 possible owner[182.7049] on resource TM-0011FFA3-00000000

I can see that the blocked session was issuing a INSERT statement on a table and the blocker was issuing a DELETE on the same table.

There is another session involved but at this point, it is elementary to get session details using the same steps above.

 

Hopefully, this blog entry has shown how to use the GWFG to diagnose the SQL statements and the object involved in the global deadlock. I know know the exact SQL statements involved in the deadlock, and by extension, the objects involved as well. The resolution to the problem is no different than deadlock resolution in single-instance databases.

 

For more information about Oracle RAC Global Enqueue Services (GES), read Chapter 2 of my Oracle RAC Performance Tuning book.

 

 

 

 

 

 

Feb 12

Oracle RAC Performance Tuning

I am proud to announce that my book is now available for purchase! Please order Oracle RAC Performance Tuning today!

http://www.amazon.com/Oracle-RAC-Performance-Tuning–Focus/dp/0986119415/ref=sr_1_2?ie=UTF8&qid=1423756491&sr=8-2&keywords=peasland

 

Writing this book was a great experience for me and I hope the readers get value out of it.

 

 

Jan 27

Sample Schemas on GitHub

Oracle’s sample schemas are an extra download. For Oracle 12c, the sample schemas are in file 7 of 8 of the downloads. So if you want the sample schemas, you have to go to OTN or Metalink for the patchset version and download the file.

Someone also posted the sample schemas on GitHub here: https://github.com/oracle/

To give full credit, this was not anything done by me. The blog that brought this to my attention is here: https://blogs.oracle.com/opal/entry/a_personal_victory_oracle_database

Jan 27

SQLT and Partitioning

I was made aware of this today: Note 1951018.1 SQLT Utility and Partitioning Tables

The person who pointed this out was afraid that they would be violating licensing if they used SQLT, which would create partitioned tables. If you read the note carefully, SQLT will only create partitioned tables if you have Partitioning installed and enabled. If you aren’t licensed, why is this installed? Installing an unlicensed option is a very easy way to have someone accidentally use it and get yourself caught in an audit.

Jan 23

SEC_CASE_SENSTIVE_LOGON in 12c

Oracle 12c deprecates the SEC_CASE_SENSITIVE_LOGON parameter. It is still there for backwards compatibility. But Oracle would like to see everyone use case sensitive logons in the future.

The one thing that did surprise me was when I recently upgraded a test db from 11.2.0.4 to 12.1.0.1 using the DBUA. At the end of the DBUA, I got this information in the Post Upgrade results:

s_c_s_l

 

 

 

 

Shamefully, the DBUA removed the parameter from my SPFILE without my consent. The parameter is still valid. Removing it may break some applications. I would rather see the DBUA remove obsolete parameters not deprecated ones.

 

 

Jan 22

Deprecation of non-CDB

The rumors started flying shortly after Oracle 12c was released with its new multi-tenant architecture that someday, there would be no Oracle databases that were not multi-tenant. Today, the Twitterverse has had a number of posts that reference this part of the Oracle documentation:

https://docs.oracle.com/database/121/UPGRD/deprecated.htm#BABDBCJI

 

So the non-CDB architecture is deprecated now. Better start learning Multi-tenant even if it means you only have one Pluggable Database in your Container Database.

 

 

 

Jan 22

SQL Server Outage

Just like many Oracle DBA’s, I have to support other database platforms as well, which in my case does include SQL Server. My shop has regularly scheduled maintenance windows every Wednesday evening. Last night, our SysAdmins were replacing some disk in our production SQL Server box. The disk wasn’t one that SQL Server uses so I just has the SysAdmins disable the services for SQL Server. They could then power down the server and be on their way to handling that disk. I got the call shortly before 11pm that there was a separate issue on the server and the server could not be powered up. Some other hardware part decided it was time to break on us.

To get the production database up and running, the SysAdmins created a Virtual Machine for us to use temporarily. Not sure where our SQL Server media was, I downloaded the correct version and Service Pack from my MSDN account.

One of the nice things about our architecture is that we use NFS disks just about everywhere. NFS disks give us a layer of abstraction. There are no HBA cards the servers need or SAN Fabric switches to cable to the box. Disks are not physically attached to a server. NFS isn’t the only option. One could use iSCSI or FCoIP as well.

With the new VM up and running, the Storage Admin mounted up the database’s disk units to the VM. Because SQL Server was down when the server hardware broke, the database files were intact and consistent.

With the VM ready and the disk units mounted. I installed SQL Server on the VM. I let it create the database on the C: drive where it was installing the software. After the install completed, I applied the correct Service Pack to get SQL Server up to the expected version level. So I now had an empty SQL Server instance at the correct version. To get our databases up and running and available again, all I had to do was to configure SQL Server to point to the correct masterdb database. A simple parameter change on SQL Server started (documented everywhere on the Internet…just Google for it…) and SQL Server came right up! We were back in business. Data access had been restored.

The only issue I encountered was that any SQL Server authenticated accounts had their passwords marked as expired. I’m unsure why and it was after 3am at this point. So I unlocked the accounts and changed the passwords to a known value. The SysAdmin then had to go to the applications and let them know the  new passwords and life was good.

Virtualization is cool and that level of abstraction on the storage side helped make this all possible. Other than being a very long night (I got to bed after 5:30am), it wasn’t too bad. We had everything restored and open for business before people starting arriving for the workday.

On the Oracle side, our production databases use Oracle RAC and Data Guard for high availability. I already have plans in the works for a SQL Server HA solution, but unfortunately this outage happened before implementation.

 

 

 

Jan 16

cursor: pin S wait on X

On my main production RAC database, I do see periods of slowness and the dominant wait event, system wide, is “cursor: pin S wait on X”. The event comes and goes, but I do see it from time to time. So I needed to get the bottom of this. Note, that this is not a RAC problem. This event can be easily seen on single-instance databases as well. When I do see this on multiple instances of my Oracle RAC database, it is because I have multiple sessions from the same application spread out among the instances, all doing the same thing, thus all having the same problem.

First, what is the wait event all about? Any of the “cursor:” waits are bottlenecks in the Shared Pool in the SQL Area. A long time ago, this portion of the Shared Pool was protected by latches. But as is the case with many areas of the Shared Pool, Oracle is now using mutexes. With the change in the protection mechanism, we now have new wait events.

In the case of this particular wait event, we have a cursor that is wanting a Shared pin but must wait for another session to release its eXclusive mutex. A cursor is trying to be parsed. But it can’t be parsed because another session is holding on to the same mutex.

There are three main causes to sessions waiting on this event.

  • High hard parses
  • A high number of versions of the SQL statement
  • Bugs

Unfortunately, there are a number of bugs related to this wait event. Most of the ones that I have seen are fixed in 11.2.0.4 or 12.1.0.1 so if you are lagging behind in versions, consider upgrading to one of the more recent Oracle versions.

So let’s see if we can walk through an example to determine the cause of the problem. To do that, I used the following query:

select s.inst_id as inst,
       s.sid as blocked_sid, 
       s.username as blocked_user,
       sa.sql_id as blocked_sql_id,
       trunc(s.p2/4294967296) as blocking_sid,
       b.username as blocking_user,
       b.sql_id as blocking_sql_id
from gv$session s
join gv$sqlarea sa
  on sa.hash_value = s.p1
join gv$session b
  on trunc(s.p2/4294967296)=b.sid
 and s.inst_id=b.inst_id
join gv$sqlarea sa2
  on b.sql_id=sa2.sql_id
where s.event='cursor: pin S wait on X';

 

Running this in one of my production RAC databases, I get the following output:

INST BLOCKED_SID BLOCKED_USER BLOCKED_SQL_ID BLOCKING_SID BLOCKING_USER BLOCKING_SQL_ID
---- ----------- ------------ -------------- ------------ ------------- ---------------
   4         723 USER12345    cn7m7t6y5h77g          1226 USER12345     cn7m7t6y5h77g 
   4         723 USER12345    cn7m7t6y5h77g          1226 USER12345     cn7m7t6y5h77g 
   4         723 USER12345    cn7m7t6y5h77g          1226 USER12345     cn7m7t6y5h77g 
   4         723 USER12345    cn7m7t6y5h77g          1226 USER12345     cn7m7t6y5h77g 
   4        1226 USER12345    cn7m7t6y5h77g          1796 USER12345     cn7m7t6y5h77g 
   4        1226 USER12345    cn7m7t6y5h77g          1796 USER12345     cn7m7t6y5h77g 
   4        1226 USER12345    cn7m7t6y5h77g          1796 USER12345     cn7m7t6y5h77g 
   4        1226 USER12345    cn7m7t6y5h77g          1796 USER12345     cn7m7t6y5h77g

The first thing to note is that the mutex is only within that instance for Oracle RAC databases. For single-instance databases, the query above will still work. For Oracle RAC, the output from this query will show which instance is having the problem.

In the example above, we have session 723 blocked by session 1226. Session 1226 is further blocked by session 1796. Notice that all three sessions are issuing the same query with SQL ID cn7m7t6y5h77g.

Now that we know the SQL ID, we can easily query V$SQL to determine the SQL statement involved in the problem. I used this query to obtain more information.

select sql_id,loaded_versions,executions,loads,invalidations,parse_calls
from gv$sql 
where inst_id=4 and sql_id='cn7m7t6y5h77g';

The output from querying V$SQL is as follows:

SQL_ID        LOADED_VERSIONS EXECUTIONS LOADS      INVALIDATIONS PARSE_CALLS
------------- --------------- ---------- ---------- ------------- -----------
cn7m7t6y5h77g               1        105        546           308        3513

We can now see that this query has only 1 version in the SQL Area. So right away, we’ve eliminated one of the potential problem areas. In a future blog post, I will discuss queries with high number of versions in the SQL Area. But that isn’t our problem today so we proceed.

It should be obvious from the above that there is a very high number of parse calls. The query has only been executed 105 times but has been parsed 3513 times. Yikes!  The high number if invalidation’s probably has something to do with this as well.

In this example, we now have a good idea of what the problem is. This is an application issue. The application is over-parsing the query. So we’ll send this back to development and dig into the application code. The usual reasons for over-parsing need to be examined.

If the number of versions were low and excessive parsing/invalidations/loads was not an issue, then I would suspect a bug and file a SR with Oracle Support.

Jan 12

Bad Standby

On Friday, I discovered that my standby database had not applied transactions for 8 days. This is very bad for me because I only keep 7 days worth of archived redo logs. So I had no redo to apply. I used the steps in Note 836986.1 to roll forward my standby with an RMAN incremental backup. The steps in the document worked very well for me. My primary generated about 700GB of redo during this 8 day period, but the RMAN incremental backup was about 285GB in total size. It took me about 24 hours to create the backup since RMAN had to read my entire 14TB database, block by block. Recovering the standby from this backup took about 8 hours. Not too bad.

 

After the restore, I did get some errors about my SRLs that needed clearing. My alert log showed:

 

SRL log 7 needs clearing because log has not been created
SRL log 8 needs clearing because log has not been created
SRL log 9 needs clearing because log has not been created
SRL log 10 needs clearing because log has not been created
SRL log 11 needs clearing because log has not been created

 

This was easily handled with:

SQL> alter database recover managed standby database cancel;
Database altered.
SQL> alter database clear logfile group 7;
Database altered.
SQL> alter database clear logfile group 8;
Database altered.
SQL> alter database clear logfile group 9;
Database altered.
SQL> alter database clear logfile group 10;
Database altered.
SQL> alter database clear logfile group 11;
Database altered.
SQL> alter database recover managed standby database disconnect from session;
Database altered.

 

Dec 31

2014 Blog Stats

As the end of the year is right upon us, I thought it was time to look at this year’s stats for my blog. I’m averaging over 30,000 hits per month and about 3,000 unique visitors each month. I’m pleased with the attention my poor little blog is receiving. :)

 

Here are the stats:

monthly_history

 

 

 

 

 

 

 

 

 

 

I haven’t updated my blog as much as I wanted to. I’ve been too busy on a side project which will be announced in early 2015! After that project is done, I’ll get back to more regular postings. Have a great new year!

Older posts «