The replication for ZDRLA works differently than a “normal” for Oracle Database that uses Data Guard (or even Golden Gate). The point is to replicate the ingested backup “as is” between ZDLRA’s and not datafile block replication. And, of course, it is completely different from tape clones.
ZDLRA replication is not just sent backup from one site to another, it is how to increase your protection and be part of the disaster recovery strategy. The replication does not occur just for “rman backups”, but also for archivelogs generated for Real-Time Redo. And adding, this is how you integrate ZDLRA at your MAA architecture that makes the difference and how you protect your environment and reach zero RPO. There are several points about replication, how it operates, modes, and integration for Oracle MAA universe. I will discuss some points here in this post.
The architecture for ZDLRA replication it is simple. There are two important definitions:
Upstream: It is the ZDLRA that receives the backup and forward it to another ZDLRA
Downstream: Is the ZDLRA that receives the backup from another ZDLRA
Oracle Data Guard Broker allows the database administrators to automate some tasks and an easy way to configure properly a lot of features and details for data guard environments. The Fast-Start FailOver (FSFO) allows the broker to automatically failover to standby database in case of failure of the primary. But until 19c the only option is always to trigger the failover. This changed at 19c with a nice new feature that allows us to put FSFO in Observe-Only Mode.
In this post, I will focus just on new features for FSFO like Observer-Only Mode and Health Conditions for it. Lag and other details will not be covered here.
The Observe-Only Mode is a simple change that allows putting the FSFO to just observing/monitoring the DG environment, but in case of failure, it does not change the roles between primary and standby. Simple like that. As the Broker documentation for Observe-Only Mode says:
The observe-only mode enables you to test the impact of using fast-start failover in your configuration, without making any actual changes to the configuration.
Tasks for ZDLRA are the pillar of how the backups are processed, everything is one task. So, when you ingest incremental backup one task is created but can occur that it get a freeze at ORDERING_WAIT state. These tasks are hard to identify and can create a big problem for your virtual full backup and backup strategy. Below I will show how they occur and how to solve the problem.
The process of patch ZDLRA is not complicated, but it is important to be aware of some details. The most important is from where you are until where you want to go. This is crucial because it will define what commands you will need to execute.
If you read the previous post about the process, you can notice that I was running the ZDLRA 12.2 version, and forwarded to 19.2 version. In that case, I needed to use the upgrade path since I was changing the major release and the racli commands had the “upgrade” parameter.
In this post I will show how to do a simple update (or patch apply) for ZDLRA, this means that I will remain inside the same major release for recovery appliance library. Some steps and checks are the same.
Whatever you need to do (patch or upgrade), the startup point it is the note 1927416.1 that cover the supported versions for ZDLRA. There it is possible to find all the supported versions for the recovery appliance library as well as the Exadata versions. Please, not upgrade the Exadata stack with a version that is not listed on this page.
When you change the parameters for the database is possible to specify the db_unique_name and allow more control where you want to apply/use it. This is very useful to limit the scope, but you need to be aware of some collateral effects. Even not present at the official doc, you can use it. But check here some details that you need to take care of.
The process to patch Exadata stack and software changed in the last years and it became easier. Now, with patchmgr to be used for all (database servers, storage cells, and switches) the process is much easier to control the steps. Here I will show the steps that are involved in this process.
Independent if it is ZDLRA or Exadata, the process for Engineering System is the same. So, this post can be used as a guide for the Exadata patch apply as well. In 2018 I already made a similar process about how to patch/upgrade Exadata to 18c (you can access here) and even made a partial/incomplete post for 12c in 2015.
The process will be very similar and can be done in rolling and non-rolling mode. In the first, the services continue to run and you don’t need to shutdown databases, but will take more time because the patchmgr applies server by server. At the second, you need to shutdown the entire GI and the patch is applied in parallel and will be faster.
The proceed to patch/upgrade ZDLRA is not complicated, but as usual, some details need to be checked before starting the procedure. Since it is one engineering system based at Exadata, the procedure has one part that (maybe) needs to upgrade this stack too. But, is possible to upgrade just the recovery appliance library.
Whatever if need or no to upgrade the Exadata stack, the upgrade for recovery appliance library is the same. The commands and checks are the same. The procedure described in this post cover the upgrade of the recovery appliance library. For Exadata stack, it is in another post.
Where we are
Before even start the patch/upgrade it is important to know exactly which version you are running. To do this execute the command racli version at you database node:
HAIP (High Availability IP) is not supported for the Exadata environment but can occur (if you did not create the cluster using OEDA) that HAIP became in use. And this particularity true for ZDLRA. So, during the upgrade from the previous version (12.2) to a higher version, it is needed to remove HAIP.
Usually, when we upgrading from 12.2 to 18c the HAIP is removed from Exadata. If the upgrade is from 12.1, and HAIP is there, it continues and is not removed by the upgrade process. If you are using HAIP and your GI is 12.1, this procedure as-is described here can’t be used (need some adaptation), because of some requirements from ASM+ACFS+DB. But since this is a preliminary step from a GI upgrade, the focus is to disable and remove it from GI.
The HAIP is not needed for Exadata because by architecture the InfiniBand network already defines (per server) two IP’s to avoid the single point of failure. So, it is not needed to create an additional layer (HAIP and virtual IP), that does the same that already exists by network design.
The REPLACE DISK command was released with 12.1 and allow to do an online replacement for a failed disk. This command is important because it reduces the rebalance time doing just the SYNC phase. Comparing with normal disk replacement (DROP and ADD in the same command), the REPLACE just do mirror resync.
Basically, when the REPLACE command is called, the rebalance just copy/sync the data from the survivor disk (the partner disk from the mirror). It is faster since the previous way with drop/add execute a complete rebalance from all AU of the diskgroup, doing REBALANCE and SYNC phase.
The replace disk command is important for the SWAP disk process for Exadata (where you add the new 14TB disks) since it is faster to do the rebalance of the diskgroup.
Survive to disk failures it is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. Check in this post how to use the undocumented feature “mount restricted force for recovery” to resurrect diskgroup and lose less data when multiple failures occur.
Diskgroup redundancy is a key factor for ASM resilience, where you can survive to disk failures and still continue to run databases. I will not extend about ASM disk redundancy here, but usually, you can configure your diskgroup without redundancy (EXTERNAL), double redundancy (NORMAL), triple redundancy (HIGH), and even fourth redundancy (EXTEND for stretch clusters).
If you want to understand more about redundancy you have a lot of articles at MOS and on the internet that provide useful information. One good is this. The idea is simple, spread multiple copies in different disks. And can even be better if you group disks in the same failgroups, so, your data will have multiple copies in separate places.
As an example, this a key for Exadata, where every storage cell is one independent failgroup and you can survive to one entire cell failure (or double full, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied at a “normal” environment, where you can create failgroup to disks attached to controller A, and another attached to controller B (so the failure of one storage controller does not affect all failgroups). At ASM, if you do not create failgroup, each disk is a different one in diskgroups that have redundancy enabled.