Nov 14

High Space Usage From crfclust.bdb

I have a 2-node testbed running Oracle RAC 11.2.0.4 on OL6. Pretty much everything lives on the system disk; this is just a testbed, after all. The root partition had been filling up. I got an alert from EM about the disk space issue and went in and cleaned up some old log files, but even as I did, my brain was telling me that log file space utilization was not out of control and that there must be another underlying issue. Sure enough, three days later I got the alert that the disk was filling up again. I knew I needed to dig further; there had to be another file or two hogging the space. After some digging, I came to this directory in my Grid Infrastructure installation:

 

[oracle@host01 host01]$ pwd
/u01/app/crs11.2.0.4/crf/db/host01
[oracle@host01 host01]$ ls -l
total 10945448
-rw-r--r-- 1 root root 1773999 Jul 2 13:54 02-JUL-2014-13:54:50.txt
-rw-r--r-- 1 root root 1120665 Jul 2 14:00 02-JUL-2014-14:00:06.txt
-rw-r--r-- 1 root root 16953 Mar 25 2014 25-MAR-2014-19:51:58.txt
-rw-r----- 1 root root 280764416 Nov 13 16:15 crfalert.bdb
-rw-r----- 1 root root 9850126336 Nov 13 16:14 crfclust.bdb
-rw-r----- 1 root root 8192 Jul 2 13:59 crfconn.bdb
-rw-r----- 1 root root 352174080 Nov 13 16:15 crfcpu.bdb
-rw-r----- 1 root root 249356288 Nov 13 16:15 crfhosts.bdb
-rw-r----- 1 root root 265261056 Nov 13 16:14 crfloclts.bdb
-rw-r----- 1 root root 172232704 Nov 13 16:14 crfts.bdb
-rw-r----- 1 root root 24576 Jul 2 13:54 __db.001
-rw-r----- 1 root root 401408 Nov 13 16:15 __db.002
-rw-r----- 1 root root 2629632 Nov 13 16:15 __db.003
-rw-r----- 1 root root 2162688 Nov 13 16:15 __db.004
-rw-r----- 1 root root 1187840 Nov 13 16:15 __db.005
-rw-r----- 1 root root 57344 Nov 13 16:15 __db.006
-rw-r----- 1 root root 16777216 Nov 13 16:06 log.0000008765
-rw-r----- 1 root root 16777216 Nov 13 16:15 log.0000008766
-rw-r--r-- 1 root root 120000000 Jul 2 13:55 host01.ldb
-rw-r----- 1 root root 8192 Jul 2 13:54 repdhosts.bdb

 

The crfclust.bdb file is about 9.8GB. My system disk is only 30GB, so this one file is consuming roughly a third of the entire disk, and it keeps growing. To fix the problem, I performed these steps:

 

[oracle@host01 host01]$ /u01/app/crs11.2.0.4/bin/crsctl stop resource ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'host01'
CRS-2677: Stop of 'ora.crf' on 'host01' succeeded
[oracle@host01 host01]$ su
Password:
[root@host01 host01]# rm -rf *
[oracle@host01 host01]$ /u01/app/crs11.2.0.4/bin/crsctl start resource ora.crf -init
CRS-2672: Attempting to start 'ora.crf' on 'host01'
CRS-2676: Start of 'ora.crf' on 'host01' succeeded

 

Why did this work? These files are the Berkeley DB repository used by the Cluster Health Monitor (CHM). The crfclust.bdb file is only supposed to be about 1GB in size, with older data regularly purged, but the purge is not working here. By manually removing the files, I lose the historical performance data, which is acceptable to me at this point. On startup, CHM will recreate the files if they are missing.

 

After fixing the issue, I did find Metalink Note 1343105.1, which describes the problem. I haven't yet been able to find a specific bug number, but it is clear that a bug exists.

 

Nov 13

InMemory DUPLICATE Confusion in Oracle RAC

Most people are probably aware of the new Oracle 12.1.0.2 feature, the InMemory database option. When using this option on Oracle RAC, the DBA can specify the DUPLICATE clause to have an object duplicated in the InMemory column store of every instance. The clause is intended for Oracle’s Engineered Systems like Exadata. On non-Engineered systems, however, Oracle seems to allow the clause, but it doesn’t work as one might expect. To illustrate, follow this example, which was run on a two-node RAC database on my MacBook Pro with VirtualBox…definitely not an Engineered system.

 

First, a table is created and then is altered for INMEMORY DUPLICATE.

 

SQL> create table db_objs
 2 as select * From dba_objects;
Table created.
SQL> alter table db_objs inmemory duplicate;
Table altered.

 

Shouldn’t setting this clause raise an error since this is a non-Engineered system?

The table is verified to show that DUPLICATE is specified.

SQL> select inmemory,inmemory_duplicate 
 2 from user_tables where table_name='DB_OBJS';
INMEMORY INMEMORY_DUPL
-------- -------------
ENABLED  DUPLICATE

 

A simple “select *” from the table is issued on instance 1. We can then verify that the table is InMemory.

SQL> select inst_id,owner,segment_name,populate_status,inmemory_duplicate
 2 from gv$im_segments;
INST_ID    OWNER      SEGMENT_NA POPULATE_ INMEMORY_DUPL
---------- ---------- ---------- --------- -------------
         1 SCOTT      DB_OBJS    COMPLETED DUPLICATE

Notice that the results above show that the segment is only in instance 1. The same table is queried in instance 2, but querying GV$IM_SEGMENTS still shows only instance 1.

From instance 1:

SQL> select avg(object_id) from db_objs;
AVG(OBJECT_ID)
--------------
 11095.2049
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 1349857420
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     5 |    10   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE             |         |     1 |     5 |            |          |
|   2 |   TABLE ACCESS INMEMORY FULL| DB_OBJS | 21319 |   104K|    10   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

From instance 2:

 


SQL> select avg(object_id) from db_objs;
AVG(OBJECT_ID)
--------------
 11095.2049
Elapsed: 00:00:00.03
Execution Plan
----------------------------------------------------------
Plan hash value: 1349857420
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     1 |     5 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE             |         |     1 |     5 |            |          |
|   2 |   TABLE ACCESS INMEMORY FULL| DB_OBJS | 21319 |   104K|     4   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

So from either instance, the table was accessed INMEMORY. But we can see that only instance 1 has the segment InMemory.

 

All signs point to the DUPLICATE clause working on a non-Engineered system, which we know should not be the case. DBA_TABLES seems to indicate that DUPLICATE is in play here, and the execution plan concurs. But GV$IM_SEGMENTS disagrees and shows that DUPLICATE is not actually working on this system.
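If I wanted one more confirmation, the per-instance InMemory area itself could be checked. Here is a quick sketch (GV$INMEMORY_AREA is the standard 12.1.0.2 view; nothing else is assumed):

SQL> select inst_id, pool, alloc_bytes, used_bytes, populate_status
 2 from gv$inmemory_area
 3 order by inst_id, pool;

If DUPLICATE were truly honored, instance 2’s pools would show used space as well; only instance 1 reporting non-zero USED_BYTES would line up with what GV$IM_SEGMENTS already shows.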

 

 

 

Nov 04

ORA-00838

Ran into an interesting issue the other day that took me a while to fix. I was trying to start a non-production database and received this error:

ORA-00838: Specified value of MEMORY_TARGET is too small, needs to be at least 1428M
ORA-01078: failure in processing system parameters

For starters, the Oracle documentation for my version says that the minimum value for MEMORY_TARGET is 512MB but the above error is telling me that I need to be using a larger value. So the next logical step is to get help for this error message.

[oracle@host ~]$ oerr ora 838
00838, 00000, "Specified value of MEMORY_TARGET is too small, needs to be at least %sM"
// *Cause: The specified value of MEMORY_TARGET was less than the sum of the
// specified values for SGA_TARGET and PGA_AGGREGATE_TARGET.
// *Action: Set MEMORY_TARGET to at least the recommended value.

Well, that’s interesting. Here are my PFILE contents.

*.audit_file_dest='/u01/app/oracle/admin/orcl/adump'
*.audit_trail='DB_EXTENDED'
*.compatible='11.2.0.0.0'
*.control_files='/u01/app/oracle/oradata/orcl/data01/control01.ctl','/u01/app/oracle/oradata/orcl/data02/control02.ctl','/u01/app/oracle/oradata/orcl/data03/control03.ctl'
*.db_block_size=8192
*.db_domain='naucom.com'
*.db_files=1100
*.db_name='orcl'
*.db_recovery_file_dest_size=214748364800
*.db_recovery_file_dest='/archivelog'
*.diagnostic_dest='/u01/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=orclXDB)'
*.event=''
*.job_queue_processes=10
*.memory_target=1048576000
*.open_cursors=300
*.processes=1600
*.remote_login_passwordfile='EXCLUSIVE'
*.sec_case_sensitive_logon=FALSE
*.undo_tablespace='UNDOTBS4'

 

Notice that I am not specifying SGA_TARGET or PGA_AGGREGATE_TARGET, so the cause given in the error message was of no help here. After working the problem, I discovered that someone had changed my parameter file:

*.processes=1600

This is a small non-production database and my MEMORY_TARGET is only 1GB, so why is PROCESSES set to such a large value? When I returned it to a much smaller value of 600, the instance started with no problems. The number of processes drives the derived number of sessions, which in turn requires a larger SGA. The initial error message was correct, but it provided no clue as to why the problem existed.
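With PROCESSES=1600 and SESSIONS left unset, 11.2 derives SESSIONS from PROCESSES (the documented default is (1.5 * PROCESSES) + 22, so roughly 2422 sessions here), and the SGA needed to support that many sessions is what drove the 1428M minimum. A quick sketch to see the derived values once the instance is up:

SQL> select name, value, isdefault
 2 from v$parameter
 3 where name in ('processes','sessions','transactions');

With PROCESSES back down to 600, the derived SESSIONS drops to a size that fits comfortably inside the 1GB MEMORY_TARGET.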

 

 

Oct 28

ORA-24247: network access denied by access control list (ACL)

I have a regularly scheduled job on an Oracle RAC database that sends me an email alert when a certain condition is met. It runs every 30 minutes. The job has been failing on one of the nodes, but not the others, and it spits out these errors:

 

ORA-12012: error on auto execute of job "OWNER"."JOB_NAME"
ORA-24247: network access denied by access control list (ACL)
ORA-06512: at "SYS.UTL_TCP", line 17
ORA-06512: at "SYS.UTL_TCP", line 267
ORA-06512: at "SYS.UTL_SMTP", line 161
ORA-06512: at "SYS.UTL_SMTP", line 197
ORA-06512: at "SYS.UTL_MAIL", line 386
ORA-06512: at "SYS.UTL_MAIL", line 599
ORA-06512: at line 41

What is odd about this one is that the following works on all instances:

SQL> exec utl_mail.send(sender=>'me@acme.com', -
> recipients=>'me@acme.com', -
> subject=>'test from orcl1', -
> message=>'test from orcl1', -
> mime_type=>'text; charset=us-ascii');
PL/SQL procedure successfully completed.

So when I send the email manually on that instance, it works fine, but the job owner is getting the error. The fix is to create an ACL for the job owner and assign it privileges:

 

SQL> exec dbms_network_acl_admin.create_acl ( -
> acl=>'utl_mail_acl.xml', -
> description=>'ACL for using UTL_MAIL', -
> principal=>'OWNER', -
> is_grant=>TRUE, -
> privilege=>'connect', -
> start_date=>SYSTIMESTAMP, -
> end_date=>NULL);
PL/SQL procedure successfully completed.
SQL> exec dbms_network_acl_admin.assign_acl( -
> acl=>'utl_mail_acl.xml', -
> host=>'smtprelay.acme.com', -
> lower_port=>25, upper_port=>NULL);
PL/SQL procedure successfully completed.
SQL> commit;
Commit complete.

 

Now the job’s procedure works as intended.
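To double-check the ACL from the dictionary, something like this sketch works (same ACL name, principal, and host as above; both views are standard in 11g):

SQL> select acl, principal, privilege, is_grant
 2 from dba_network_acl_privileges
 3 where acl like '%utl_mail_acl.xml';
SQL> select acl, host, lower_port, upper_port
 2 from dba_network_acls
 3 where host = 'smtprelay.acme.com';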

Oct 23

Oct2014 CPU Crashes ArcGIS Desktop

Right after I applied the Oct2014 SPU to our development database, members of our IT staff started complaining that direct-connect connections from ArcCatalog and ArcMap would crash. The app wouldn’t even connect to the database. I tried various things…even upgrading the Oracle Client to 11.2.0.4 to match the database version (it was 11.2.0.1)…but nothing worked. I went so far as to enable both 10046 tracing and client-side SQL*Net tracing. The Listener log confirmed the client established a connection, and the 10046 trace showed the standard SQL statements issued to the Oracle database any time ArcCatalog makes a direct-connect connection. Except that at the end of the trace file was this last SQL statement:

 

PARSING IN CURSOR #140250835575144 len=279 dep=0 uid=9459 oct=3 lid=9459 tim=1413920974829489 hv=3533534632 ad='7963a438' sqlid='5hq4svb99uxd8'
SELECT r.owner, r.table_name, x.column_name, x.column_id, x.index_id, x.registration_id, x.minimum_id, x.config_keyword,x.xflags FROM SDE.table_registry r, SDE.sde_xml_columns x WHERE r.registration_id = x.registration_id AND (( r.table_name = 'GDB_ITEMS' AND r.owner = 'SDE'))
END OF STMT
PARSE #140250835575144:c=4999,e=5796,p=0,cr=147,cu=0,mis=1,r=0,dep=0,og=1,plh=1755489251,tim=1413920974829487
WAIT #140250835575144: nam='SQL*Net message to client' ela= 3 driver id=1413697536 #bytes=1 p3=0 obj#=297281 tim=1413920974829548

So the SQL was issued and parsed. Then, before the execute, the ‘SQL*Net message to client’ wait event occurred, and that’s the end of it. So I turned to the SQL*Net trace, which revealed the following:

 

DDE: Flood control is not active
Incident 1 created, dump file: c:\oracle\product\11.2.0\client_2\log\oradiag_bpeasland\diag\clients\user_bpeasland\host_525531546_80\incident\incdir_1\ora_26000_24088_i1.trc
oci-24550 [3221225477] [Unhandled exception: Code=c0000005 Flags=0
] [] [] [] [] [] [] [] [] [] []

 

Well, the OCI-24550 error wasn’t very informative. I was trying to do some more digging when a colleague found an ESRI document that describes this exact behavior; ESRI is tracking it as Bug # 82555. Here is that document:

http://support.esri.com/en/knowledgebase/techarticles/detail/43293?utm_source=esri&utm_medium=email&utm_term=73110&utm_content=article&utm_campaign=oracle_patch_2014

 

ESRI says to avoid the patch, but I’d rather not wait for ESRI and Oracle to quit pointing fingers at each other, and it has also been my experience that ESRI bugs are not fixed expeditiously. The workaround of granting the SELECT_CATALOG_ROLE role has worked quite well for me. I hope this helps others who hit the same problem.
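For reference, the workaround boils down to a single grant. This is just a sketch; the &arcgis_user substitution variable is a placeholder for whichever schema your ArcGIS direct-connect sessions authenticate as:

SQL> grant select_catalog_role to &arcgis_user;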

 

Oct 01

Zero Data Loss Recovery Appliance

Oracle’s standby databases have been around for a long time now. The primary ships redo to the standby to keep them in sync. It seems to be a natural fit that Oracle has now extended this concept to a backup and recovery appliance. The idea is that you take one backup of your database at the start. That’s it…one backup. No more full or incremental backups. The Oracle database sends redo to the appliance which then applies the redo to the backup on the device. The backup on the appliance is always kept up-to-date.

When I attended Open World last year, I heard about this device. But even then, Oracle was quick to say that the appliance was not yet generally available. This year, the device is available and was discussed at the conference this week.

More information can be found here: http://www.oracle.com/us/corporate/features/zero-data-loss-recovery-appliance/index.html

 

Sep 19

Error 1033 received logging on to the standby

Upgraded production to 11.2.0.4 a few nights ago. The primary is a 3-node RAC and the standby is a 2-node RAC. I noticed that one of the threads was not transmitting redo to the standby, and saw this repeatedly in the alert log:

 

Error 1033 received logging on to the standby

 

Turns out this was a problem of my own making. In $ORACLE_HOME/dbs, I had the following:

 

-rw-rw---- 1 oracle oinstall 1544 Sep 18 01:44 hc_ncpp5.dat
-rw-r--r-- 1 oracle oinstall 55 Sep 18 01:38 initncpp5.ora
lrwxrwxrwx 1 oracle oinstall 40 Sep 18 01:38 orapwnp5 -> /u01/app/oracle/admin/ncpp/dbs/orapwncpp
lrwxrwxrwx 1 oracle oinstall 45 Sep 18 01:38 spfilencpp5.ora -> /u01/app/oracle/admin/ncpp/dbs/spfilencpp.ora

Since the primary is RAC, I put the password file and spfile on shared storage and then created softlinks to them in $ORACLE_HOME/dbs. The password file softlink was a typo (orapwnp5 instead of orapwncpp5). That’s what I get for staying up until 3am, while sick, to upgrade a production database. The fix was as simple as:

mv orapwnp5 orapwncpp5

That fixed everything for me!
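To verify that every thread was shipping redo again, a quick check from the primary like this sketch will do (DEST_ID 2 is an assumption; use whichever LOG_ARCHIVE_DEST_n points at the standby):

SQL> select inst_id, dest_id, status, error
 2 from gv$archive_dest
 3 where dest_id = 2;

With the corrected password file link in place, STATUS should show VALID and the ERROR column should be empty on every instance.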

 

 

Sep 16

Good Time for DBAs?

Is this a good time to be a DBA? My biased opinion is that any time is a good time to be a DBA. The US Bureau of Labor Statistics released an outlook indicating that DBA positions are expected to increase 15% between 2012 and 2022.

Now comes this article that says about 50% of DBAs are expected to leave the market in the next 10 years.

Demand is rising!

Sep 05

Oracle 12.2.0.1 coming in 2016

Oracle will be releasing Oracle 12cR2 in the first half of 2016. See Metalink Note 742060.1 for the current release schedule.

The Oracle 12.1.0.3 patchset is not on the list, but there is a chance it will be out before 12cR2. We’ll have to wait and see, I guess.

Now for the burning question…do I upgrade to 12.1.0.2 or maybe 12.1.0.3 or hold off until 12.2.0.1?

Sep 02

Importance of Testing

I am working on upgrading all of our production databases to the Oracle 11.2.0.4 version. My company’s most important database serves a custom, in-house developed application, so we have the luxury (and sometimes the curse) of having complete control over the application source code. If I discover a version-specific issue with a third-party application, I file a trouble ticket with the vendor to get the issue fixed. But for our own application, I often have to diagnose the problem and determine how to get the issue fixed.

Since I have been at this company, I have upgraded from 11.1.0.7 to 11.2.0.2 and then to 11.2.0.3 and now to 11.2.0.4. The two previous upgrades went just fine. No problems. So I have been very surprised that the upgrade from 11.2.0.3 to 11.2.0.4 has been problematic for our application.

EVEN WHEN YOU THINK THE UPGRADE IS "minor", PERFORM ADEQUATE TESTING!!!!

I never expected to find issues with this simple patchset upgrade. I’m not skipping versions, and 11.2.0.4 shouldn’t introduce too many problems. I blogged about my first issue here:

http://www.peasland.net/2014/08/19/sticky-upgrade-problem/

The next problem is a query similar to the following in our application code:

SELECT DISTINCT columnA
FROM our_table
ORDER BY columnB;

The above query now raises an ORA-01791 error in Oracle 11.2.0.4, but it ran just fine in previous versions. When DISTINCT is used and the ORDER BY clause references a column not in the SELECT list, ORA-01791 is raised. Oracle says the fact that this used to work was itself a bug; that bug is fixed in 11.2.0.4, so the query now raises an exception.

When I was first made aware of this issue, my initial thought was: why are we ordering by a column that isn’t in the SELECT clause? The end user won’t know the data is ordered because they can’t see the ordering column. Then I found out that this routine is only used for internal processing, and machines work just fine without ordering the data. So the simple fix on our end was to remove the ORDER BY clause. As soon as the code change gets into production, I can proceed with my database upgrade.
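To make the change concrete, here is the before and after (same placeholder names as the query above):

-- Raises ORA-01791 on 11.2.0.4: columnB is not in the SELECT list
SELECT DISTINCT columnA
FROM our_table
ORDER BY columnB;

-- The fix for this internal routine: drop the ORDER BY entirely
SELECT DISTINCT columnA
FROM our_table;

If the ordering had actually mattered, the alternatives would be to add the sort column to the SELECT list (accepting the change in the DISTINCT results) or to rewrite the query with GROUP BY columnA ORDER BY MIN(columnB).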

It is so important that I’ll say it again:

EVEN WHEN YOU THINK THE UPGRADE IS "minor", PERFORM ADEQUATE TESTING!!!!

At this company, we follow a strict process for changes. The change is made in development first; after a period of time, it is made in the Test environment; and after another period of time, if there are no issues, it can proceed to production. We also have a custom test application that exercises key components of our application, so even if our testers are not hitting that portion of the app, our automated test suite will.

Without adequate testing, the two issues we encountered would most likely not have been noticed until the change was in production. Then the DBA would have been blamed, even though both issues were application code problems. Test, test, and test again.
