Jul 19

ORA-4031 errors with Direct NFS

This is an old problem for me which I resolved quite a while ago, but I figured I’d finally write it up. When I started at my new company, the production database would crash about once a month with ORA-4031 memory errors. The DBA at the time figured the problem was a lack of bind variables in application code, which is normally the most likely culprit. The lack of bind variables was a problem, but as I soon found out, the crashes were not related to SQL statements at all. Rather, they were due to a bug in Direct NFS. Apparently, Direct NFS had a memory leak in it. The leak ate into the Shared Pool and, given enough time, the free memory in the Shared Pool became non-existent, ORA-4031 errors were raised, and the database would crash.

I worked with Oracle Support to confirm that this was a bug. The problem was filed as Bug 10237987 and affects versions 11.1.0 and 11.2.0. I ran into this bug on a 3-node RAC cluster running Oracle Enterprise Linux. I was never able to get an answer as to whether this bug occurs on other OS platforms or in non-RAC environments. The bug causes the KGNFS pools in the Shared Pool to grow significantly over time.
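While chasing this, the growth can be watched in V$SGASTAT. A query along the following lines shows how much of the Shared Pool the NFS-related allocations are consuming; note the ‘%KGNFS%’ pattern is an assumption on my part, as the exact component names can vary by version:

```sql
-- Track shared pool consumers; KGNFS-related allocations show up here.
-- The '%KGNFS%' LIKE pattern is an assumption; component names vary by version.
SELECT pool, name, bytes
FROM   v$sgastat
WHERE  pool = 'shared pool'
AND    (name LIKE '%KGNFS%' OR name = 'free memory')
ORDER BY bytes DESC;
```

Running this periodically and trending the output would show the KGNFS pools growing while free memory shrinks.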

I regret that I was never able to see this bug through to a conclusion with Oracle Support. Support wanted me to capture a trace when the database crashed due to ORA-4031 errors in the KGNFS pools. What they failed to realize was that the KGNFS pools caused a shortage of space in all the other pools in the Shared Pool. The KGNFS pools would squeeze out the other pools, and it was often when space was requested in one of those other pools that the ORA-4031 error was raised. And the only way to generate the trace files was to wait for the instance(s) to crash. I could not sit and wait for our instances to crash in the middle of the day just to capture a trace file for Oracle Support, especially since the trace file might not even implicate the KGNFS pools!

In the end, our workaround was to stop using Direct NFS in our configuration. Our testing showed no performance differences with or without Direct NFS. Since we quit using Direct NFS, we have not seen any ORA-4031 errors. I see the bug is still out there on Metalink, but with no fix as of yet.

Jul 18

ORA-12519 TNS: no appropriate service handler found

Every 10 minutes or so, I have Grid Control sending me an alert similar to the following:

Failed to connect to database instance: ORA-12519: TNS:no appropriate service handler found (DBD ERROR: OCIServerAttach).

The database should have been up and running, and in fact it was running just fine. Signing on to the database server and issuing “lsnrctl status” showed that all of the services were registered with the Listener for normal operations:

Services Summary…
Service “orcl.acme.com” has 1 instance(s).
Instance “orcl”, status READY, has 1 handler(s) for this service…
Service “orclXDB.acme.com” has 1 instance(s).
Instance “orcl”, status READY, has 1 handler(s) for this service…
Service “orcl_XPT.acme.com” has 1 instance(s).
Instance “orcl”, status READY, has 1 handler(s) for this service…
The command completed successfully

So why am I getting alerted every ten minutes or so? A quick connection to the database gave me the problem:

sqlplus /nolog

SQL*Plus: Release – Production on Mon Jul 18 13:47:16 2011

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

SQL> connect / as sysdba
ORA-00020: maximum number of processes (%s) exceeded

Oops! We have exceeded the process limit defined by the PROCESSES parameter. Why Oracle does not give us the ORA-00020 error instead of the ORA-12519 is beyond me.
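When the listener throws ORA-12519 like this, it is worth comparing current process usage against the limit. A quick check, assuming you can still get (or already have) a SYSDBA connection:

```sql
-- Compare current and high-water process/session usage against the limits.
SELECT resource_name, current_utilization, max_utilization, limit_value
FROM   v$resource_limit
WHERE  resource_name IN ('processes', 'sessions');
```

If CURRENT_UTILIZATION is at LIMIT_VALUE for ‘processes’, the listener has no handler to hand connections to, which is exactly when ORA-12519 shows up.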

Jul 18

DBCA Create Database bad REMOTE_LISTENER

I recently upgraded Oracle Cluster Ready Services to Oracle Grid Infrastructure on a testbed. I then attempted to run the DBCA to create an Oracle database on top of the Grid installation. After walking through the DBCA wizard, I would always receive messages similar to the following:

ORA-00119: Invalid specification for system parameter REMOTE_LISTENER
ORA-00132: Syntax error or unresolved network name LISTENER_sid
ORA-1078: Failure in processing system parameters.

Shouldn’t the DBCA take care of setting up the LISTENER_sid entries in my tnsnames.ora config file? After being perplexed and trying a few things, I remembered that the Listener is not configured in the database $ORACLE_HOME/network/admin directory as it is in most database installations. Beginning with Grid Infrastructure, the Listener runs out of the CRS_HOME. Adding the LISTENER_sid entry to the tnsnames.ora file in CRS_HOME/network/admin did the trick. I was then able to run the DBCA to create my database.
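For reference, the entry takes the usual tnsnames.ora form below; the host name, port, and alias suffix here are illustrative, so substitute your own node’s VIP and listener port:

```
LISTENER_ORCL1 =
  (ADDRESS = (PROTOCOL = TCP)(HOST = myhost-vip.acme.com)(PORT = 1521))
```

The instance’s LOCAL_LISTENER/REMOTE_LISTENER parameters resolve through this alias, which is why its absence produces the ORA-00132 unresolved network name error.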

I’m not sure if this same behavior exists in the newer DBCA. I’m creating a database to test a database upgrade against, which is why I am using the older DBCA version.

Jul 06

CRS 11.2.0

I am working on upgrading our existing Oracle Cluster Ready Services installation to 11gR2. Things are not going as smoothly as I had hoped, and I am learning a lot about the changes in 11gR2. This is not the minor upgrade the version number difference would suggest. There have been lots of changes in CRS 11gR2. The main changes are as follows (in no particular order):

  • Cluster Ready Services (CRS aka Clusterware) is now being called Grid Infrastructure, or GRID for short.
  • If you use ASM, it is no longer installed in a separate home. Oracle GRID includes Clusterware and ASM in the same home now.
  • Oracle GRID 11gR2 now includes a Single Client Access Name (SCAN) listener. To keep things simple, create a SCAN virtual IP address just like your usual VIPs and register it with DNS. The SCAN name should be the same as your cluster name, and it should resolve to 3 IP addresses, as Oracle GRID will start up to 3 SCAN listeners.
  • Oracle GRID 11gR2 now supports multi-casting. I had to apply patch 9974223 since my configuration used a secondary port for multi-casting. There is a multi-cast test tool which can help determine whether you are configured for multi-casting or not.
  • While you can restart CRS with “crsctl stop/start crs”, I was always used to “/etc/init.d/init.crs stop/start”. The /etc/init.d/init.crs script is no longer available. It has been replaced by /etc/init.d/init.ohasd instead.

These are just a few of the changes that I am finding along the way as I perform my upgrades and troubleshoot issues that arise.

This upgrade has proven to me that it is valuable to have a RAC testbed before working on these tasks in your production environments. The last place I worked at had only one RAC environment, and that was our production database. It was deemed too costly to set up another RAC environment for development/testing. My current employer was wise enough to set up a RAC testbed where I could completely destroy things and test, test, test before attempting anything in production. Things have changed in recent years: one can now use Virtual Machines to set up test environments much more cheaply than before, when we had to procure hardware just for testing.

That being said, I wish my current RAC testbed was in a VM environment. If it were in a VM, I could take a snapshot of the VM with CRS 11.1.0 running and if I ran into issues that were difficult to recover from I could revert back to the snapshot. As it stands now, if I run into problems with the upgrade and I really make a mess of things, I have to manually uninstall everything, reinstall CRS 11.1.0 and recreate a database before I can reattempt a CRS 11.2.0 upgrade. This takes time and a VM could save me lots of time here.

Jun 13

Explain Plan Cost Pain Point

Earlier today, I was working with a developer on a query that had poor performance. This query was large and complex, and initially it looked like a daunting effort to find out where the performance problem lies. With Explain Plan we can sometimes use the cost to help narrow down the performance pain point of a large, complex query.

Looking at an Explain Plan of this query, we can see its overall cost is pretty high.
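The cost figures discussed here come from Explain Plan output. One way to produce such output, including the COST column for each operation, is with DBMS_XPLAN; this is a sketch where the statement shown is a placeholder for whatever query you are analyzing:

```sql
-- Populate PLAN_TABLE for the statement under analysis.
EXPLAIN PLAN FOR
SELECT /* the problem query goes here */ * FROM dual;

-- Display the plan; each row shows the operation and its cost.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```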

When looking at the details, we can see that the FULL table scan (FTS) on the DETAIL_RECORD table has a high cost of 51018. Notice how the high cost of the FTS propagates up the plan. All operations above this FTS have a high cost because of the high cost of this single table access. Accessing the CIMS_POLICIES_TO_PROCESS table has a relatively low cost, but the HASH JOIN operation receives its high cost only because of the high cost to access the DETAIL_RECORD table.

The overall cost is only a little more than the cost to access this table. It is clear that the FTS on this table is the largest contributor to the pain point of the query being analyzed.

By looking at the Explain Plan costs in this manner, we were able to very quickly focus on the one area of a very complex query that was causing the most performance pain. Without the cost analysis done here, determining which portion of the query was causing problems would have been a lot of work.

Jun 07

KGXGN polling error (15)

When trying to start the second instance in a two-node RAC cluster, that instance will not start. If the instance on node1 is running, the instance on node2 will not start; if the instance on node2 is running, the instance on node1 will not start. The Alert Log shows the following:

Error: KGXGN polling error (15)
Errors in file /u01/app/oracle/diag/rdbms/bsp/bsp1/trace/bsp1_lmon_9151.trc:
ORA-29702: error occurred in Cluster Group Service operation
LMON (ospid: 9151): terminating the instance due to error 29702

Unfortunately, the LMON trace file only gives the same error messages so nothing to go on there.

This error is occurring because of a misconfiguration for the cluster-interconnect. If you look at the OCR to see the cluster interconnect, you can see the NIC device is eth4.1338:

[oracle@myhost bin]$ oifcfg getif -global
eth2 global public
eth4.1338 global cluster_interconnect

On one node, the device eth4 is correct. However, on the second node the device is eth5.1338, and the OCR is shared between the nodes. The OCR expects the device to be eth4.1338 on both nodes, but both servers need the cluster interconnect on the same network device. The servers’ network configuration was changed so that both nodes used the eth5.1338 device. Once the servers were configured identically, we redefined the OCR config:

[oracle@myhost bin]$ ./oifcfg setif -global eth5.1338/

Looking at the config, we can see that both eth4 and eth5 are still in OCR:

[oracle@myhost bin]$ ./oifcfg getif -global
eth2 global public
eth4.1338 global cluster_interconnect
eth5.1338 global cluster_interconnect

So we remove the eth4 device:

[oracle@myhost bin]$ ./oifcfg delif -global eth4.1338/

We now have the OCR reconfigured. We restarted CRS and both instances came up on both nodes!

This was one of those errors where the error messages really did not point to the root cause of the problem. Instead, I had to poke around the areas I felt were the most likely culprits until I rather blindly discovered the configuration differences.

Mar 30

Columns to Rows

I was doing a code review of a view that would be going into production. The view’s SELECT statement was defined as:

SELECT status.STATUS_ID, status.ITEM_ID, status.REQUEST_FLAG, status.ITEM_CODE, status.PLAN_CODE, status.STATUS_CODE
FROM (
  … -- five UNION ALL branches over the same table, one per STATUS_CODE_n column
) status;

As you can see, the view is doing a UNION ALL of the same table. Worse yet, the table is accessed via a Full Table Scan (FTS). We cannot get around the FTS, but we can perform this FTS only once instead of 5 times as is written in the SELECT statement! In this view, you can see where I want to stack STATUS_CODE_1, STATUS_CODE_2, … , STATUS_CODE_5 on top of each other. In other words, I want to turn those five columns into five rows. To do this, and to reduce the number of FTS operations from 5 down to 1, I used a nifty trick as can be seen below:

SELECT
  CASE WHEN row_type.rowno = 1 THEN 'CODE1'
       WHEN row_type.rowno = 2 THEN 'CODE2'
       WHEN row_type.rowno = 3 THEN 'CODE3'
       WHEN row_type.rowno = 4 THEN 'CODE4'
       WHEN row_type.rowno = 5 THEN 'CODE5'
  END code_name,
  CASE WHEN row_type.rowno = 1 THEN STATUS_CODE_1
       WHEN row_type.rowno = 2 THEN STATUS_CODE_2
       WHEN row_type.rowno = 3 THEN STATUS_CODE_3
       WHEN row_type.rowno = 4 THEN STATUS_CODE_4
       WHEN row_type.rowno = 5 THEN STATUS_CODE_5
  END code_val
FROM status_table,  -- table name illustrative; the view's actual table is not shown here
     (SELECT rownum rowno FROM dual
      CONNECT BY LEVEL <= 5) row_type;

In the SQL above, we introduced a subquery in the FROM clause which does nothing more than generate the numbers 1 through 5. I need the numbers 1 through 5 since I will be transforming 5 columns into 5 rows. Now that I have this number, I can use it in the CASE expressions. The first CASE expression says that if this is row 1, return “CODE1” to indicate which column supplied this row’s value. The second CASE expression says that if this is row 1, return the value from the first status code column. And so on for each of the five columns I want to return as rows.

By resorting to this little trick, we reduced the number of FTS operations on the same table from five to one. The Explain Plan cost dropped from 2066 to 415 in my specific example.

Mar 03

Fun With Database Restore

I was recently given a copy of an Oracle database that I needed to bring up in our environment. The only other requirement I had was to change the database name. Changing the database name means I must create the control files anew and restore using a backup control file. On examination of the database contents, I discovered the copy I was given had everything except for the online redo logs. Since I would be restoring using a backup controlfile, I could open with RESETLOGS and create the online redo logs even though they were missing, so no worries there.

To bring up the database, I copied the datafiles to our database server. I had been given a CREATE CONTROLFILE script; otherwise I would have had to generate one by hand (not an impossible feat). Since the database is being renamed, the script must use the SET DATABASE clause.
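The script itself is site-specific and not reproduced here, but as a rough sketch, a renaming CREATE CONTROLFILE looks something like the following, where every name, path, and size is hypothetical:

```sql
-- SET DATABASE (rather than REUSE DATABASE) renames the database.
-- RESETLOGS is required here since the online redo logs are missing.
CREATE CONTROLFILE SET DATABASE "NEWDB" RESETLOGS ARCHIVELOG
  MAXLOGFILES 16
  MAXDATAFILES 1024
LOGFILE
  GROUP 1 'E:\ORADATA\NEWDB\REDO01A.LOG' SIZE 50M,
  GROUP 2 'E:\ORADATA\NEWDB\REDO02A.LOG' SIZE 50M
DATAFILE
  'E:\ORADATA\NEWDB\SYSTEM01.DBF',
  'E:\ORADATA\NEWDB\UNDOTBS01.DBF',
  'E:\ORADATA\NEWDB\USERS01.DBF'
CHARACTER SET WE8MSWIN1252;
```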

At this point, I did the following:

1. sqlplus /nolog

2. connect / as sysdba

3. startup nomount

4. run the CREATE CONTROLFILE script.
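From there, the recovery phase boils down to the sequence below. This is a sketch; the archived redo log names are supplied interactively at each prompt:

```sql
-- Apply archived redo using the backup control file, then open RESETLOGS.
RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL;
-- ...supply each archived redo log when prompted, then reply CANCEL...
ALTER DATABASE OPEN RESETLOGS;
```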

The control files were created. Now to proceed with recovery. When recovering using the backup control file (until cancel), the following could be seen in the Alert Log:

Sun Feb 27 11:10:46 2011
alter database recover using backup controlfile until cancel
Media Recovery Start
WARNING! Recovering data file 1 from a fuzzy file. If not the current file
it might be an online backup taken without entering the begin backup command.

The WARNING message repeated for every file in the database. It is obvious that whoever created the copy of the database did not perform a clean shutdown, nor did they issue an ALTER DATABASE BEGIN BACKUP command first.

Luckily, I was also given the archived redo logs, so we should still be able to open the database. Without the archived redo logs, this database either could not be opened at all, or would be corrupt if it were.

The copy of the database had each file with a timestamp of 3:00 pm. When trying to apply recovery, I had to apply redo logs until the first log that completed after 3:00 pm. I manually entered each archived redo log when prompted in SQL*Plus in response to the RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL command. After I applied the last redo log that would be needed, I replied CANCEL to end media recovery.

Everything preceding this point has been an exercise in restoring a database and is not necessarily blog-worthy, as this sort of restore is old hat to many Oracle professionals. What follows was new to me.

Since the online redo logs were missing, OPEN RESETLOGS noticed the log files were absent and generated errors in the Alert Log.

Wed Mar 02 14:29:35 2011
alter database open resetlogs
Wed Mar 02 14:29:38 2011
RESETLOGS after incomplete recovery UNTIL CHANGE 3989345711
Resetting resetlogs activation ID 4158449604 (0xf7dce7c4)
Wed Mar 02 14:29:38 2011
Errors in file c:\oracle\product\10.2.0\rdbms\trace\olddb_ora_216.trc:
ORA-00313: open failed for members of log group 4 of thread 1
ORA-00312: online log 4 thread 1: ‘E:\ORADATA\olddb\REDO01A.LOG’
ORA-27041: unable to open file
OSD-04002: unable to open file
O/S-Error: (OS 2) The system cannot find the file specified.

I had never seen the above errors, but this is not a problem since opening with RESETLOGS recreates these files. Now comes the really troublesome part. The OPEN with RESETLOGS failed with the following error:

Wed Mar 02 14:30:05 2011
Errors in file c:\oracle\product\10.2.0\rdbms\trace\olddb_ora_216.trc:
ORA-30012: undo tablespace ‘UNDOTBS’ does not exist or of wrong type

Wed Mar 02 14:30:05 2011
Error 30012 happened during db open, shutting down database
USER: terminating instance due to error 30012

It turns out the parameter file specified undo_tablespace=undotbs, but this tablespace did not exist. Looking at the database copy, I do have a file called “undotbs01.dbf”, but what is the undo tablespace’s name? I guessed it was UNDOTBS1, so I changed the parameter file. I was still unable to open the database, so I deleted the database files, copied the files from the external media again, and repeated the CREATE CONTROLFILE and recovery steps. At this point, I was able to open with RESETLOGS successfully.

We cannot control how we receive things, especially when they come from outside our company. If it were up to me, the database copy would have been a cold backup or when the database was in backup mode. If it were up to me, I would have received a parameter file with the database copy. But sometimes, even when it is not the way we prefer, we can still achieve the end-goal which was to open up a copy of this database in our environment.

Feb 16

The Phone Is The PC

When the iPhone was initially introduced some years ago, a colleague of mine mentioned that he thought the device was going to replace desktops and laptops. At the time, I balked at his idea. No one wants to read a large spreadsheet on a tiny 4″ screen like my Droid X’s. No one wants to Swype their way through creating a 500-page Word document. My reply to this colleague was that the iPhone (and now Android) operating systems would start to become contenders to replace laptops and desktops when the phone could be plugged into a dock which then gives you a better keyboard, a mouse, and a larger display. It looks like the statement I made long ago is starting to come to fruition.

Now, this is one of the first attempts, and it most likely has a long way to go. But given the way that iOS and Android are being adopted by consumers, there will someday come a time when your computer walks around with you and you plug it into a dock of some sort to access a better interface or other computing components, e.g. an external hard drive for backups.

Microsoft’s latest foray into the mobile OS space seems to have fallen flat. If MS doesn’t do something soon, iOS and Android will be so ubiquitous that they will force MS Windows out of most consumers’ homes.

Feb 16

Fun with ER/Studio

I’ve been trying to work with Embarcadero’s ER/Studio Data Architect for a while now. My goal is to reverse-engineer an existing production application schema. The problem with this schema is that there is a major lack of defined FK constraints, so any reverse-engineered ERD is mostly just entities with no relationships. Part of my job will be to define those relationships and add them to the ERD at a later date. Keep in mind, I inherited the current application schema.

The big problem with ER/Studio and other modeling tools I have used is that they expect certain rules to be followed, rules which often get ignored in the real world. If I were starting from scratch, I would be able to enforce those rules. But ER/Studio caused me problems when trying to work with an existing, yet ill-designed, model. My particular conundrum was trying to map a relationship between two entities where the Foreign Key constraint does not reference a Primary Key in the parent table. After much trial and error, I finally figured out how to get around ER/Studio’s apparent limitation on defining such FK constraints. I documented my approach, complete with screen shots, here:

Adding FK Constraint in ER Studio (right click and choose Save As then open in Word)