Recreate Bad RAC Node

Just recently, I was trying to apply the latest and greatest Patch Set Update (PSU) to a 2-node Oracle RAC system. Everything went smoothly on the first node, but I ran into problems when trying to apply the PSU to the second node. The problem wasn’t with OPatch or the PSU; rather, I could not even bring down Grid Infrastructure (GI) successfully. And to make matters worse, it would not come back up either.

I tracked my issue down to the Grid Inter Process Communication Daemon (gipcd). When issuing ‘crsctl stop crs’, I received a message stating that gipcd could not be successfully terminated. When starting GI, the startup got as far as trying to start gipcd and then quit. I found many helpful articles on My Oracle Support (MOS) and through Google searches. Many of those documents seemed to be right on track with my issue, but I could not get GI back up and running. Rebooting the node did not help either. The remainder of this article can help even if your issue is not with gipcd; that was just the sticking point for me.
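
For context, this is roughly what I was fighting with on host02 at the time. The force flag and the -init resource listing are extra things worth trying rather than steps from my original attempt, and I have omitted the output since it will vary:

[root@host02]# crsctl stop crs
[root@host02]# crsctl stop crs -f           # force the stop if the clean shutdown hangs
[root@host02]# crsctl stat res -t -init     # shows the state of ora.gipcd and the rest of the lower stack
[root@host02]# crsctl start crs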

So at this juncture, I had a decision to make. I could file a Service Request (SR) on MOS, or I could “rebuild” that node in the cluster. I knew that if I filed an SR, I’d be lucky to have the node operational any time in the next week. I did not want to wait that long, and if this were a production system, I could not have waited that long. So I decided to rebuild the node. This blog post details the steps I took. At a high level, this is what is involved:

  1. Remove the node from the cluster.
  2. Clean up any GI and RDBMS remnants on that node.
  3. Add the node back to the cluster.
  4. Add the instance and service for the new node.
  5. Start up the instance.

In case it matters, this system is Oracle 12.1.0.2 (both GI and RDBMS) running on Oracle Linux 7. In my example, host01 is the “good” node and host02 is the “bad” node. The database name is “orcl”. Where possible, the command prompt indicates which node I am running the command from.

First, I’ll remove the bad node from the cluster.

I start by updating the inventory on the good node so that the RDBMS home no longer lists the bad node.

[oracle@host01]$ cd $RDBMS_HOME/oui/bin
[oracle@host01]$ ./runInstaller -updateNodeList ORACLE_HOME=$RDBMS_HOME "CLUSTER_NODES={host01}" LOCAL_NODE=host01

Then I do the same for the GI home in the inventory.

[oracle@host01]$ cd $GRID_HOME/oui/bin
[oracle@host01]$ ./runInstaller -updateNodeList ORACLE_HOME=$GRID_HOME "CLUSTER_NODES={host01}" CRS=TRUE -silent
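
If you want to double-check that both homes now list only host01, you can peek at the central inventory. This is purely an optional sanity check; the inventory location is read from /etc/oraInst.loc rather than hard-coded:

[oracle@host01]$ INV=$(grep ^inventory_loc /etc/oraInst.loc | cut -d= -f2)
[oracle@host01]$ grep host02 $INV/ContentsXML/inventory.xml     # should return nothing once both homes are updated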

Now I’ll remove that node from the cluster registry.

[root@host01]# crsctl delete node -n host02
CRS-4661: Node host02 successfully deleted.
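
A quick, optional way to confirm the cluster no longer knows about host02 is olsnodes (output omitted):

[root@host01]# olsnodes -n -s -t     # lists the remaining nodes with their number, status, and pinned state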

Remove the VIP.

[root@host01]# srvctl config vip -node host02
VIP exists: network number 1, hosting node host02
VIP Name: host02-vip
VIP IPv4 Address: 192.168.1.101
VIP IPv6 Address: 
VIP is enabled.
VIP is individually enabled on nodes: 
VIP is individually disabled on nodes: 
[root@host01]# srvctl stop vip -vip host02-vip -force
[root@host01]# srvctl remove vip -vip host02-vip
Please confirm that you intend to remove the VIPs host02-vip (y/[n]) y
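
Re-running the config command at this point should report that no VIP exists for host02 (the exact message varies by version):

[root@host01]# srvctl config vip -node host02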

Then remove the instance.

[root@host01]# srvctl remove instance -db orcl -instance orcl2
Remove instance from the database orcl? (y/[n]) y
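
Optionally, confirm that orcl2 is no longer registered against the database (output omitted):

[root@host01]# srvctl config database -db orcl     # the instance list should now show only orcl1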

At this point, the bad node is no longer part of the cluster, from the good node’s perspective.

Next, I’ll move to the bad node and remove the software and clean up some config files.

[oracle@host02]$ rm -rf /u01/app/oracle/product/12.1.0.2/
[root@host02 ~]# rm -rf /u01/grid/crs12.1.0.2/*
[root@host02 ~]# rm /var/tmp/.oracle/*
[oracle@host02]$ rm -rf /tmp/*
[root@host02]# rm /etc/oracle/ocr*
[root@host02]# rm /etc/oracle/olr*
[root@host02]# rm -rf /pkg/oracle/app/oraInventory
[root@host02]# rm -rf /etc/oracle/scls_scr
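
One extra sanity check that was not part of my original steps, but which can save confusion later: make sure no leftover clusterware or database processes are still alive on host02 before wiping the homes:

[root@host02]# ps -ef | grep -E 'pmon|d\.bin' | grep -v grep     # should return nothing on a clean node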

I took the easy way out and just used ‘rm’ to remove the RDBMS and Grid home software. Things are all cleaned up now. The good node thinks it’s part of a single-node cluster, and the bad node doesn’t even know about the cluster. Next, I’ll add that node back to the cluster, using the addnode utility on host01.
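
Because I am passing -ignoreSysPrereqs and -ignorePrereq below, it can be worth running cluvfy first as an optional pre-check of host02’s readiness. This is just the typical invocation; review any failures before deciding to ignore them:

[oracle@host01]$ cluvfy stage -pre nodeadd -n host02 -verbose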

[oracle@host01]$ cd $GRID_HOME/addnode
[oracle@host01]$ ./addnode.sh -ignoreSysPrereqs -ignorePrereq -silent "CLUSTER_NEW_NODES={host02}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={host02-vip}"

This will clone the GI home from host01 to host02. At the end, I am prompted to run root.sh on host02. Running this script will connect GI to the OCR and Voting disks and bring up the clusterware stack. However, I do need to run one more cleanup routine as root on host02 before I can proceed.

[root@host02]# cd $GRID_HOME/crs/install
[root@host02]# ./rootcrs.sh -verbose -deconfig -force

It is possible that I could have run the above earlier when cleaning up the node, but this is where I ended up executing it. Now I run the root.sh script as requested.

[root@host02]# cd $GRID_HOME
[root@host02]# ./root.sh

At this point, host02 is now part of the cluster and GI is up and running. I verify with “crs_stat -t” and “olsnodes -n”. I also check the VIP.

[root@host02]# srvctl status vip -vip host02-vip
VIP host02-vip is enabled
VIP host02-vip is running on node: host02
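
For completeness, the other verification commands look like this (output omitted). Note that crs_stat -t is deprecated in 12c; crsctl stat res -t shows the same information:

[root@host02]# olsnodes -n -s
[root@host02]# crsctl check crs      # quick health check of the local stack
[root@host02]# crsctl stat res -t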

Now, back on host01, it’s time to clone the RDBMS software.

[oracle@host01]$ cd $RDBMS_HOME/addnode
[oracle@host01]$ ./addnode.sh "CLUSTER_NEW_NODES={host02}"

This will start the OUI. Walk through the wizard to complete the clone process.
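
If you would rather skip the wizard, the database addnode.sh should accept a silent invocation similar to the GI one; treat this as a sketch and verify it against your version first. Either way, you will be prompted to run root.sh from the new RDBMS home on host02 when the copy finishes:

[oracle@host01]$ ./addnode.sh -silent "CLUSTER_NEW_NODES={host02}"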

Now I’ll add the instance back on that node.

[oracle@host01]$ srvctl add instance -db orcl -instance orcl2 -node host02

If everything has gone well, the instance will start right up.

[oracle@host01]$ srvctl start instance -db orcl -instance orcl2
[oracle@host01]$ srvctl status database -d orcl
Instance orcl1 is running on node host01
Instance orcl2 is running on node host02

SQL> select inst_id,status from gv$instance;

   INST_ID STATUS
---------- ------------
         1 OPEN
         2 OPEN

Awesome! All that remains is to reconfigure and start any necessary services. I have one.

[oracle@host01]$ srvctl modify service -db orcl -service hr_svc -modifyconfig -preferred "orcl1,orcl2"
[oracle@host01]$ srvctl start service -db orcl -service hr_svc -node host02
[oracle@host01]$ srvctl status service -db orcl
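
To confirm the preferred instances took effect, the service configuration can be reviewed (optional, output omitted):

[oracle@host01]$ srvctl config service -db orcl -service hr_svc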


That’s it. I now have everything operational.

Hopefully this blog post has shown how easy it is to take a “bad” node out of the cluster and add it back in. This entire process took me about 2 hours to complete. Much faster than any resolution I’ve ever obtained from MOS.

I never did get to the root cause of my original issue. Taking the node out of the cluster and adding it back in got me back up and running. Note that this process would not have worked if the root cause had been hardware- or OS-related.

And the best part for me in all of this? Because host01 already had the PSU applied to both the GI and RDBMS homes, cloning those homes to host02 meant I did not have to run OPatch on host02; that host picked up the PSU as part of the clone. All I needed to do to complete the patching was run datapatch against the database.
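
For reference, checking the cloned homes and finishing the patching looks something like this. The lspatches checks are optional, and datapatch is run once, from any node, with the environment pointing at the open database:

[oracle@host02]$ $RDBMS_HOME/OPatch/opatch lspatches     # confirm the PSU came across with the cloned home
[oracle@host02]$ $GRID_HOME/OPatch/opatch lspatches
[oracle@host01]$ $RDBMS_HOME/OPatch/datapatch -verbose   # completes the SQL portion of the PSU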

4 comments


    • Adhika on April 30, 2016 at 13:31

    If that happens in production, I am sure that at some point management would ask what the cause of the issue was and how to avoid it in the future.
    I think that is the most difficult part.

    1. At some point, management may ask such a thing. But there are always tradeoffs. I’d point out to my boss that I had a decision to make. I could have spent another week or more tracking down the root cause, during which time that node would be out of service. Or I could take more immediate action to restore service at the expense of never knowing what the root cause was. I’d probably choose the latter almost every time with this sort of thing. As for avoiding it in the future, I’m going to hazard a guess that this is a one-off issue and I won’t see it again. But if it does come back, that is when I will dig into the root cause more diligently. If this only happens once, then I’m not concerned about the root cause. If it happens more than once, then I certainly don’t want to rebuild that node again.

  1. This SQL script took about 75 minutes to complete on my 2-node RAC database. That is quite long, so I decided to try to reduce the duration.

    1. Which SQL script are you referring to? This blog post doesn’t have any SQL script in it.
      Thanks,
      Brian
