I was participating in a recent thread on the OTN community where someone was asking questions about downgrading after a database upgrade. One of the responses asked how many people actually practice database downgrades. I created this poll to find out.
I was surprised to find one contribution to that thread which said:
I have done my fair share of upgrades – and never had to downgrade even once
Now that poster didn’t explicitly say it, but it was almost as if that individual was saying that practicing downgrades was a waste of time because they won’t ever need it. I’ll give the poster the benefit of the doubt and that this Oracle employee was not actually saying this. I’m not trying to pick on this individual. I’ll let this thread provide me the opportunity to discuss the topic from a more generic viewpoint. (Update: the poster who prompted me to write this blog entry has come back to the thread in the time it took me to write this and did say, ” did not mean to imply that we should not ‘test’ downgrades.” )
Back in July, I wrote a blog post about The Data Guardian. In that blog post, I said:
the database exists for one reason, to provide efficient and timely access to the data.
The DBA needs to protect the data. That is job #1. Job #2 is for the DBA to provide efficient and timely access to the data. What good is having the data if the people who need access to it cannot get to the data? If those people have terrible performance when interacting with the data, then they might as well have no access.
As the DBA, we need to perform risk management. We need to determine what risks might become reality. The DBAs job is to measure those risks and determine two plans of action. What steps can be taken to avoid that risk becoming reality and what steps do I need to take to resolve the issue when that risk does become a reality?
Even a junior-level DBA will understand the importance of backups. Backups are a risk management strategy. If data is lost, we can recover the data from the backup. And even a junior-level DBA understands the importance of being able to restore from the backup.
In this OTN thread, I wrote this:
Painful fact of anyone in IT: The moment we become complacent and willfully refuse to perform risk mitigation is the moment in time the wheels are set in motion and that risk becomes reality. It is that moment our careers hang in the balance. Will you be prepared to respond?
To me, this is a Murphy’s Law sort of thing. I’ve said similar things in the past. The idea (and its the whole point of this blog entry) is that if I don’t take appropriate risk management steps, then I’m just asking the gods to turn that risk into reality. If I refuse to adjust my rear view mirror and use it when I’m backing up my vehicle, well that’s the day I back into something. If I refuse to tie my shoelaces, well that’s the day I step on one and trip. They day I refuse to wear protective googles when using a powertool is the day I get something in my eye. The day I go to the beach and refuse to put on sun screen is the day I’ll come home with a sunburn. You get the idea.
Some readers may be thinking that I’m crazy and that the universe doesn’t have this master plan to screw with me just because I’m being complacent. And I would agree. So I’ll say it another way, if I do not plan to mitigate risk, then I have done nothing to stop it from becoming a reality. The chances of it becoming a reality do not decrease because of my inaction.
There are two major components to risk management. 1) determining the probability of that risk item occurring and 2) determining the impact when that risk does occur. The items that have the highest probability of occurring are mitigated first. This is easy and something that many working on risk management often do. They put the risk items into a spreadsheet and fill in some value for the probability of that risk occurring. When complete, they sort on the probability column and start risk mitigation from the top down. Many risk management strategies draw a line somewhere in the middle of the list and decide any risk item below that line has too low probability that we won’t worry about that risk item. We can’t mitigate all possible risks in the universe. There just isn’t enough time to handle it all. So we have to draw the line somewhere.
One of the failings I see all the time is that risk management does not spend much time focusing on the impact of that risk becoming reality. The spreadsheet needs to include a similar column providing a rating of the impact to the business for that risk item. The risk manager needs to sort the spreadsheet on this column as well. Any items that have a big impact needs to have risk mitigation activities even if that item has a low probability of occurring! Sadly, too many in the risk management business fail to include this step of assessing the risk impact. Again, when the spreadsheet is sorted by impact to the business, a line is drawn somewhere.
One may find that risk items with a HIGH probability have a LOW or even VERY LOW impact to the business. I like risk management spreadsheets that include a third column which is “probability x impact”. This column helps understand the relationship between the two risk components.
Side Bar: Notice how when I talk about risk management I talk about probability and impact. If you aren’t thinking about both of these areas, then you are only performing half the risk management you should be.
Let’s go back to the database upgrade question that prompted this blog post. I think that everyone reading this blog article should agree that upgrading an Oracle database is a risky proposition. There are so many different things that could go wrong with an Oracle database upgrade. The probability of an upgrade failure is HIGH. Risk mitigation items often include, but are not limited to, practicing the upgrade on clones of production and backing up the database before the upgrade process begins. Why do we do this? Well the impact to the business is VERY HIGH. If we fail when upgrading our production database, then our business users have no access to the data. We aren’t a very good Data Guardian if we cannot get past this failure. If we practice the upgrade sufficiently in non-production environments, we can reduce the probability of the risk item to MEDIUM. But in all likelihood, we cannot reduce that specific risk probability to LOW. That is why we take the backup before the upgrade begins. Should still have problems even though we have done our level-best reduce the probability of that risk item, the impact to the business is still VERY HIGH. So the DBA’s risk remediation strategy is to take notes on where and what caused the upgrade to fail, and to restore from the backup. The database is up and running and we have eliminated the impact to the business. The DBA then goes back to the drawing board to determine how to resolve what went wrong. The DBA is attempting to reducing the probability of that problem occurring again when they back at a later point in time to do the upgrade process again.
So let’s go back to the comment in the OTN thread where it seemed to be saying that practicing database downgrades isn’t worth the time. I disagree. And my disagreement has everything to do with the impact to the business. I do agree with the the comment the poster said in their reply.
thorough testing of all of the critical parts will identify any issues and have them resolved before the production upgrade.
I agree with that 100%. Why do we do this “thorough testing”? It is all because of risk mitigation. We are attempting to reduce the probability that the upgrade will cause poor performance or cause application functionality to break. But even as that poster said, “There will always be issues that pop-up in production after the upgrade because it is impossible to test 100% of your application.” Again, I agree 100% with what this poster is saying here. But what about the impact to the business? I’ll get to that in a minute, but first I have to digress a bit in this next paragraph…
I recently upgraded a critical production system from 188.8.131.52 to the 184.108.40.206 version. Where I work, we have more application testing than I’ve ever seen in my other jobs. We have a full QA team that does testing for us. We even have a team that is in charge of our automated testing efforts. We have automated robots that exercise our application code nightly. On top of all of that, we have another automated routine that whenever people push code changes to Test or Prod, this routine does a quick examination of critical code paths. I upgraded development environments (more than 15 of them) to 220.127.116.11 and then waited one month. I then upgraded Test and waited 3 weeks before I upgraded production. There were issues found and resolved before we upgraded production. But even after all of that, I had big issues once production was upgraded. You can visit my blog posts in mid-October to mid-December to see some of those issues. I was very close to downgrading this database but I managed to work through the issues instead. Now back to the point I was making…
After the upgrade is complete, the database is opened for business. Application users are now allowed to use the application. What happens inside the database at this point? Transactions! And transactions mean data changes. At the point in time the DBA opens the database for business after an upgrade is complete, data changes start occurring. After all this that’s the whole point of the the database, isn’t it? Capture data changes and make data available to the application’s end users.
So what happens if you’re in the boat I was last Fall with my database upgrade? I was hitting things that we did not see in non-production, even after all of our testing. The impact to the business was HIGH. I need to be able to reduce this impact to the business. I had three options. 1) Fix the issues, one by one. 2) Restore from the backup I took before the upgrade so that I could get the database back to the old version. 3) Downgrade the database and go back to the drawing board. I chose the first option. as I always have during my career. But what if that was not sufficient? It can take time to resolve the issues. Some businesses simply cannot afford that kind of time with that negative impact to the business. How many websites have been abandoned because performance was terrible or things didn’t work correctly? And for the strong majority of production databases out there, option 2 has a very terrible impact to the business! You’ll lose transactions after the upgrade was completed! The DBA won’t be able to roll forward past the upgrade while keeping the database at the old version, so data will be lost and for many production databases, this is unacceptable. The business may be able to afford one hour of data loss, but how many people would pull the trigger on this action within one hour of the upgrade? In all likelihood, this action would be performed days after the upgrade and the impact to the business for that kind of data loss is well above VERY HIGH. So that leaves option 3 as the option with the lowest impact to the business to help resolve whatever impacts the business is experiencing after the upgrade.
You can probably tell from that last paragraph that I feel that it is important for the Oracle DBA to know how to downgrade their database after an upgrade is complete. I’ll concede that the probability of the DBA needing to perform a downgrade is VERY LOW. But the impact of not downgrading may be catastrophic to the business. (There’s those two words again). Because the probability is low, I don’t practice downgrades often, but because the impact of not being able to downgrade is very high, I do practice them once in awhile.
So in closing, I’m going to go back to that Murphy’s Law thing again. The universe is not conspiring against me, but as the Data Guardian, I need to practice good risk management principles. That means assessing the probability and the impact of risk items imposed by my change. While the universe and the gods may not make Murphy’s Law or its cousins kick into gear, I’m not going myself any favors by mitigating risk items. I am not reducing the probability one bit.