Survive to disk failures it is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. Check in this post how to use the undocumented feature “mount restricted force for recovery” to resurrect diskgroup and lose less data when multiple failures occur.
Diskgroup redundancy is a key factor for ASM resilience, where you can survive to disk failures and still continue to run databases. I will not extend about ASM disk redundancy here, but usually, you can configure your diskgroup without redundancy (EXTERNAL), double redundancy (NORMAL), triple redundancy (HIGH), and even fourth redundancy (EXTEND for stretch clusters).
If you want to understand more about redundancy you have a lot of articles at MOS and on the internet that provide useful information. One good is this. The idea is simple, spread multiple copies in different disks. And can even be better if you group disks in the same failgroups, so, your data will have multiple copies in separate places.
As an example, this a key for Exadata, where every storage cell is one independent failgroup and you can survive to one entire cell failure (or double full, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied at a “normal” environment, where you can create failgroup to disks attached to controller A, and another attached to controller B (so the failure of one storage controller does not affect all failgroups). At ASM, if you do not create failgroup, each disk is a different one in diskgroups that have redundancy enabled.
When Exadata X8M was released during the last Open World I made one post about the technical details about it. You can check it here: Exadata X8M (this post received some good shares and reviews). If you read it, you can see that I focused in more internal details (like torn blocks, one side path, two sides read/writes, and others), that differ from the normal analyses for Exadata X8M.
But recently I was invited by Oracle to participate exclusive workshop about Exadata X8M and I needed to share some details that picked me up. The workshop was done directly from Oracle Solution Center in Santa Clara Campus (it is an amazing place that I had an opportunity to visit in 2015, and have a rich history – if you have the opportunity, visit), and cover some technical details and with the hands-on part.
Unfortunately, I can’t share everything (I even don’t know if I can share something), but see the info below.
In my previous post, I showed how the clone to tape occurs for ZDLRA. But as explained, the clones occur through the scheduler and follow some rules. For full backup, as an example, it clones the last available.
But sometimes, it is needed to call the clone for some specific backup, maybe to do long-term storage to follow some regimentation/law. And if we leverage this for the clone, jobs can maybe take to long, or clone more that you need.
As you saw in my last post, the configuration to enable clone to tape for ZDLRA it is not complicated, but you need to take care of some details to avoid errors. Besides that, ZDLRA relies on OSB to do that (when configured with native tape support) and this has some details that you need to be aware of.
In this post, I will show how the clone to tape works for ZDLRA. And how you can check some details about OSB.
With ZDLRA you can clone your backups to tape using two ways. The first is using third-party software and the second is using Oracle Secure Backup (OSB). This integration from ZDLRA and OSB is the native way to do that.
Cloning backups to tape will help to offload backups from ZDLRA (reducing the space usage if you need to sustain long recovery windows), and add another layer of protection (since you can put tapes in a third site).
Here I will show how easy is to configure the OSB backup and how to integrate it into your backup policy.
Upgrade GRID infrastructure is one activity that usually is postponed because it involves a sensible area that, when not works, causes big downtime until be fixed. But, in the last versions, it is not a complicated task and if you follow the basic rules, it works without problems.
Here I will show a little example of how to upgrade the GI from 18.6.0 to 19.5. The steps below were executed at Exadata running version 18.104.22.168.0.191012 and GI 22.214.171.124, but can be done in every environment that supports Oracle GI.
Understand metrics for Exadata Storage Server is important to understand how all the software features are being used and all the details from that. Here I will discuss one case where the FC_IO_BY_R_SEC metric can show not precise values. And I will discuss one missing metric that can save a lot.
If you have doubts about metrics, you can check my post about metrics, it was an introduction, but cover some aspects of how to read and use it. You can check my other post where I show how to use metric DB_FC_IO_BY_SEC to identify database problems that can be hidden when checking only from the database side.
Metrics for Exadata deliver to you one way to deeply see, and understand, what it is happening for Exadata Storage Server and Exadata Software. Understand it is fundamental to identify and solve problems that can be hidden (or even unsee) from the database side. In this post, I will explain details about these metrics and what you can do using them.
My last article about Exadata Storage Server metrics was about one example of how to use them to identify problems that do not appear in the database side. In that post, I showed how I used the metric DB_FC_IO_BY_SEC to identify bad queries.
The point for Exadata (that I made in that article), is that most of the time, Exadata is so powerful that bad statements are handled without a problem because of the features that exist (flashcache, smartio, and others). But another point is that usually, Exadata is a high consolidated environment, where you “consolidate” a lot of databases and it is normal that some of them have different workloads and needs. Using metrics can help you to do a fine tune of your environment, but besides that, it delivers to you one way to check and control everything that’s happening.
In this post, I will not explain each metric one by one, but guide you to understand metrics and some interesting and important details about them.
Recently I made an Exadata stack upgrade/update to the last 19.2 version (126.96.36.199.0.191012) and I upgraded the GI from 18c to 19c (last 19c version – 188.8.131.52.191015) and after that, TFA does not work.
Since I don’t want to complete execute a TFA clean and reinstallation I tried to find the error and the solution. Here I want to share with you the workaround (since there is no solution yet) that I discovered and used to fix the error.
The actual environment is:
Old Grid Infrastructure: Version 184.108.40.206.190416
New Grid Infrastructure: Version 220.127.116.11.191015
Exadata domU: Version 18.104.22.168.0.191012 running kernel 4.1.12-124.30.1.el7uek.x86_64
After upgrade the GI from 18c to 19c, the TFA does not work. If you try to start it or collect log using it, you can receive errors. In the environment described here, the TFA was running fine with the 18c version, and the rootupgrade script from 18c to 19c does not report an error.
And to be more precise, the TFA upgrade from 18c to 19c called by rootupgrade was ok (according to the log – I will show later). But even after that, the error occurs.
The provided solution as usual (by MOS support): download the lastest TFA and reinstall the actual one. Unfortunately, I not like this approach because can lead to an error during GI upgrade for next releases (like 20) and updates (19.6 as an example).