Oracle RAC N+1 Redundancy

I find that when people design an Oracle RAC architecture, they often do not think of N+1 redundancy in their implementation plans. There are two reasons to implement Oracle RAC: availability and scalability. For the purposes of this discussion, I am focusing only on the availability side. If your RAC deployments are for scalability reasons only, then this topic may not apply to you.

So what is N+1 Redundancy? Simply put, if you need N units of something, then for redundancy purposes, you should have N+1 of that item. Let’s look at a database server. It must have a power supply. That is a requirement. Without a working power supply, the server will not function at all. The minimum number of power supplies is 1. If we want this server to have a high degree of availability, we will make sure it has N+1 power supplies, or in this case, dual power supplies. If there is only one power supply and it fails, it takes the server with it. If we have an extra power supply, a spare unit, the loss of one power supply will not take down the server with it. Redundancy is a great thing to have and an essential component to a high availability infrastructure.

When designing an Oracle RAC system, the DBA needs to determine how many nodes are needed to support the end users’ demands. If the DBA determines 4 nodes are needed, and this RAC cluster must exhibit high availability traits, then it is vital for the DBA to create a 5-node cluster (4+1). If the resource demands are sufficient to keep 4 nodes busy and one node is lost, the remaining 3 will not be able to keep up with the workload. If the DBA builds the RAC system with N+1 capability in mind, then the loss of one node will not be noticeable to the end users. If the DBA builds the RAC cluster without N+1 redundancy, then the loss of one node may hurt end-user performance so badly that the entire cluster might as well be down. When designing your RAC implementations, strive for N+1 redundancy.
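
To make the arithmetic concrete, here is a minimal sketch in Python. It assumes the workload can be expressed as a single peak-load number and that every node offers the same capacity in the same units; both are simplifications, and the function names and figures are purely illustrative, not taken from any Oracle tool.

```python
import math


def nodes_required(peak_load: float, node_capacity: float) -> int:
    """Return N: the smallest node count whose combined capacity covers peak load.

    Assumption: load and capacity use the same (hypothetical) unit and
    all nodes are identical.
    """
    if peak_load <= 0 or node_capacity <= 0:
        raise ValueError("peak_load and node_capacity must be positive")
    return math.ceil(peak_load / node_capacity)


def survives_node_loss(node_count: int, peak_load: float, node_capacity: float) -> bool:
    """Return True if the cluster still covers peak load after losing one node."""
    return (node_count - 1) * node_capacity >= peak_load


# Illustrative numbers only: a peak load of 360 units on nodes rated at 100 units.
n = nodes_required(360, 100)
print("N =", n)
print("N nodes survive a node loss:  ", survives_node_loss(n, 360, 100))
print("N+1 nodes survive a node loss:", survives_node_loss(n + 1, 360, 100))
```

Rerunning a check like this as usage grows is one way to notice when a cluster that was built as N+1 has quietly become just N.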

I remember two years ago, I had a RAC cluster that lost a node. No problem, we still had two nodes available. As I watched the performance of the two remaining nodes, they seemed to be pretty overwhelmed. Our call center started receiving complaints. I worked with other administrators on the IT team to get that node back up and running as fast as possible, but a quick recovery is not always possible when the outage is hardware-related and parts need to be replaced. After the node was back in service, I monitored the cluster performance for several weeks. Our usage had grown since this system was initially designed. We had initially designed this system with N+1 redundancy in mind, but our usage grew and N went from 2 to 3. Our current 3-node cluster was no longer N+1 redundant. So I worked with management to put into the next year’s budget enough funds to procure a new node and make sure Oracle was licensed on it. I sleep much better at night knowing that I am back to N+1 redundancy.

Like many implementations out there, my RAC system is not the only High Availability feature built into our infrastructure. This RAC database is a primary to a physical standby database with Oracle’s Data Guard. When I discuss RAC standby databases with other Oracle DBAs, I’m surprised how many of them are not thinking of N+1 capability for their standby. The physical standby database is my safety net in case the primary data center is unavailable for some reason. I’ve seen so many Oracle DBAs implement a single-instance standby for a multi-node RAC primary. Ouch! I hope they never have to fail over. Their entire multi-node RAC cluster’s workload will struggle mightily on that single-instance standby. So as you’re designing your RAC implementations for both the primary and the standby, consider your N+1 redundancy implications on the architecture design.
Where I probably differ from many people is that my physical standby implementations are not N+1 capable, but rather N. I skip the redundant extra node for my physical standby. Why is that? Purely from a cost perspective. My physical standby is just a safety net. I want it to work for me the day that I need it. But I hopefully never need it. The physical standby is my insurance policy in case risk becomes reality. For me, that extra “+1” at the standby site is over-insurance. I can save on the physical hardware and Oracle licensing.

So let’s say the day comes and I do fail over to the standby. I have lost my N+1 redundancy. But what are the chances that I’m going to lose the primary data center *and* lose one of the nodes in my standby cluster? Pretty slim. The likelihood of failures at two sites at the same time is pretty small. At this point, our IT team is evaluating why our primary data center is lost and when we can most likely return our operations to that facility. If the primary data center lost all its power and the utility company says service will be restored by tomorrow, then we’ll simply run at the standby data center even though we only have N nodes for the RAC database there. However, if the primary data center was wiped out by a fire, it will likely take many months before it is up and running again. It is at this point that I need to plan on getting that physical standby up to N+1 redundancy, as we will be using that standby as a primary for a much longer period. So we rush order another server and add it to the cluster as soon as possible. So I design my standby RAC database as N, not N+1, with an eye on increasing it to N+1 in short order if we determine we will be using the standby for real for a longer period of time.

There is one other special case I would like to discuss: the DBA determines that N=1. For the current workload requirements, one node is sufficient. But we want to have high availability, so we design a two-node RAC cluster for the primary database. We now have N+1 redundancy built into the primary. Following my last paragraph, my standby database only needs 1 node. The mistake I see some people make is to create the standby as a single-instance database. On the surface, their logic makes sense: the primary is N+1 and the standby is N. Where I differ is that I make the standby a one-node RAC cluster, not a pure single-instance implementation. The reason is future growth. At some point, the DBA may find that N no longer equals 1 at the primary. Usage has grown and N needs to be 2 now. The DBA wants to grow the primary to 3 nodes (2+1). This is easily done with zero downtime by adding a new node to the cluster and extending the RAC database to that new node. But it’s not so easy to make the standby a 2-node cluster if the one node that exists is not RAC-enabled. If a pure single-instance standby is all that exists, the DBA needs to scrap it and rebuild it as a two-node cluster. If the DBA had foresight and installed Grid Infrastructure as if the physical standby were a single-node cluster, then all the DBA has to do is add a new node, just like they did on the primary side.

As you’re designing your RAC implementations, make sure you have N+1 capability on the primary and at least N on the standby. If a company determines that the standby is critical enough, they may want to implement N+1 at the standby as well. If the DBA determines that N=1, consider making the standby at least a single-node RAC cluster.
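
These sizing rules can be summarized as a small Python sketch. The function name, the `standby_critical` flag, and the dictionary keys are hypothetical and only restate the recommendations above; they are not part of any Oracle utility.

```python
def recommend_topology(n: int, standby_critical: bool = False) -> dict:
    """Suggest node counts for a RAC primary and its physical standby.

    n is the number of nodes needed to carry the workload (N).
    standby_critical is an assumed flag for shops that want N+1
    at the standby site as well.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "primary_nodes": n + 1,  # N+1 on the primary
        "standby_nodes": n + 1 if standby_critical else n,  # at least N on the standby
        # Build the standby as a RAC cluster even when it has one node,
        # so a node can be added later without a rebuild.
        "standby_as_rac": True,
    }


print(recommend_topology(1))
print(recommend_topology(4, standby_critical=True))
```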