Recently I wrote a post about a small issue I ran into with Oracle Data Guard. In that scenario, an outage in the standby datacenter caused a healthy primary database to shut down with the error “ORA-16830: primary isolated…”. As a reminder, the database was running in Maximum Availability mode with Fast-Start Failover enabled and (the most important detail) the Observer was also running in the standby datacenter.
The point of my previous post was that no single document gives full details about the pros and cons of where to put your observer. Wherever you place it, in the primary datacenter or in the standby one, there are small details to check. Even the best (ideal) scenario, with a third datacenter, can be tough to sustain.
Here I will show one option that can improve the reliability of your MAA/DG environment. At the very least, you will have more options when deciding how to protect your database. Below I show how to configure and use multiple observers, but if you want to skip ahead to a small concern, you can go directly to the end of the post.
More Than One
Basically, the improvement is to add more than one observer to protect your DG environment. It is simple to configure, it is available since 12.2, and you can have up to three of them. In the simplest form, you just need to:
- Install the default Oracle Client infrastructure on the new server.
- Add TNS entries for both sides (primary and standby).
- Open DGMGRL.
- Run the “start observer” command.
Check how easy it is:
    [oracle@dbobss ~]$ dgmgrl sys/oracle@orcls
    DGMGRL for Linux: Release 126.96.36.199.0 - Production on Sun May 5 16:30:58 2019
    Copyright (c) 1982, 2017, Oracle and/or its affiliates. All rights reserved.
    Welcome to DGMGRL, type "help" for information.
    Connected to "orcls"
    Connected as SYSDBA.
    DGMGRL> start observer
    [W000 05/05 16:31:40.34] FSFO target standby is orcls
    [W000 05/05 16:31:42.53] Observer trace level is set to USER
    [W000 05/05 16:31:42.53] Try to connect to the primary.
    [W000 05/05 16:31:42.53] Try to connect to the primary orcl.
    [W000 05/05 16:31:42.54] The standby orcls is ready to be a FSFO target
    [W000 05/05 16:31:42.54] Reconnect interval expired, create new connection to primary database.
    [W000 05/05 16:31:42.54] Try to connect to the primary.
    [W000 05/05 16:31:43.68] Connection to the primary restored!
    [W000 05/05 16:31:44.68] Disconnecting from database orcl.
    …
When using multiple observers, you can have up to three observers at the same time. However, there is only one master observer, and it is the one responsible for Fast-Start Failover and for protecting the system. If you lose the master observer, the Broker, the primary, and the standby decide together which observer will be the next master. Up to and including 19c, the observers do not work as a quorum (that is, there is no voting system that decides a role switch) to protect the DG.
The interesting part about multiple observers is that they give you another way to customize your environment. Remember that in my first post I described how complex it is (based on the pros and cons) to choose the best place for the observer. Now, with multiple observers, you can put one in each datacenter and switch between them depending on which side you want to protect.
Now, my example environment has two databases and three observers:
Note that I have one observer in each datacenter and one in an external location. Inside the broker you can see:
    DGMGRL> show observer
    Configuration - dgconfig
      Primary: orcl
      Target: orcls
    Observer "dbobss" - Master
      Host Name: dbobss
      Last Ping to Primary: 1 second ago
      Last Ping to Target: 1 second ago
    Observer "dbobsp" - Backup
      Host Name: dbobsp
      Last Ping to Primary: 1 second ago
      Last Ping to Target: 0 seconds ago
    Observer "dbobst" - Backup
      Host Name: dbobst
      Last Ping to Primary: 2 seconds ago
      Last Ping to Target: 2 seconds ago
    DGMGRL>
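If you want to check this output from a script, a small parser is enough. The sketch below is only an illustration: it assumes the text layout shown above (one `Observer "name" - Master|Backup` header followed by its ping lines), and the function names `parse_show_observer` and `master_of` are hypothetical helpers, not part of any Oracle tool.

```python
import re

# Sample text copied from the "show observer" output above.
SAMPLE = """\
Configuration - dgconfig
  Primary: orcl
  Target: orcls
Observer "dbobss" - Master
  Host Name: dbobss
  Last Ping to Primary: 1 second ago
  Last Ping to Target: 1 second ago
Observer "dbobsp" - Backup
  Host Name: dbobsp
  Last Ping to Primary: 1 second ago
  Last Ping to Target: 0 seconds ago
Observer "dbobst" - Backup
  Host Name: dbobst
  Last Ping to Primary: 2 seconds ago
  Last Ping to Target: 2 seconds ago
"""

# Assumed line formats, based only on the output shown in this post.
OBSERVER_RE = re.compile(r'^Observer "([^"]+)" - (Master|Backup)$')
PING_RE = re.compile(r'^Last Ping to (Primary|Target):\s+(\d+) seconds? ago$')


def parse_show_observer(text):
    """Return one dict per observer: name, role, and ping ages in seconds."""
    observers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = OBSERVER_RE.match(line)
        if header:
            current = {"name": header.group(1), "role": header.group(2), "pings": {}}
            observers.append(current)
            continue
        ping = PING_RE.match(line)
        if ping and current is not None:
            # Lines that do not match (for example an unknown ping) are skipped.
            current["pings"][ping.group(1)] = int(ping.group(2))
    return observers


def master_of(observers):
    """Return the name of the master observer, or None if there is none."""
    for observer in observers:
        if observer["role"] == "Master":
            return observer["name"]
    return None


if __name__ == "__main__":
    parsed = parse_show_observer(SAMPLE)
    print("master:", master_of(parsed))
    for observer in parsed:
        print(observer["name"], observer["role"], observer["pings"])
```

In a real environment you would feed the function the text captured from DGMGRL instead of `SAMPLE`; how you capture it (and with which credentials) is up to you.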
In case of failure, the Broker, the primary, and the standby decide which observer will be the next master. The time to decide is fixed: the switch occurs after 30 seconds, and it needs to be coordinated and agreed upon by both the primary and the standby. Unfortunately, there is no way to reduce this 30-second check. You can find a good reference on this (with more technical details) in these articles: here and here (in German); and in the others that I already pointed to in the past: here, here and here.
In my environment, I shut down the machine running the master observer (dbobss). This is the broker log on the primary:
    05/05/2019 17:15:34
    FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
    Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database
    FSFP: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)
    Succeeded in switching master observer from observer 'dbobss' to 'dbobsp'
    FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
    FSFP: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
    Master observer begins pinging this instance
    Fore: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
And this is the broker log on the standby:
    05/05/2019 17:15:34
    drcx: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
    drcx: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
    05/05/2019 17:15:37
    drcx: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
    drcx: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
    drcx: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)
    05/05/2019 17:15:39
    Master observer begins pinging this instance
Notice in the logs above that both sides (primary and standby) agreed on the change. After the failure, you can see the events SET SWOB INPRG (switch observer in progress), SET OBID (set observer ID), and CLR SWOB INPRG (clear switch observer in progress), along with the UNOBSERVED state that confirms the failure was detected. In this case, the observer dbobsp was chosen as the new master observer. You can see here the output when the broker trace level is set to support.
After you bring your observer back online, you can simply set the master observer to the one you want, if needed:
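To follow these state changes without reading the whole log, you can extract just the SetState events. This is a rough sketch that assumes the line format seen in the logs above; `state_sequence` and `switch_completed` are hypothetical helper names, and the "completed" rule is my own simplification, not an official definition.

```python
import re

# Primary-side broker log lines copied from this post.
PRIMARY_LOG = """\
05/05/2019 17:15:34
FSFP: FSFO SetState(st=43 "SET SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
Data Guard Broker initiated a master observer switch since the current master observer cannot reach the primary database
FSFP: FSFO SetState(st=12 "SET OBID", fl=0x0 "", ob=0x32cc2ad6, tgt=0, v=0)
Succeeded in switching master observer from observer 'dbobss' to 'dbobsp'
FSFP: FSFO SetState(st=44 "CLR SWOB INPRG", fl=0x0 "", ob=0x0, tgt=0, v=0)
FSFP: FSFO SetState(st=16 "UNOBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
Master observer begins pinging this instance
Fore: FSFO SetState(st=15 "OBSERVED", fl=0x0 "", ob=0x0, tgt=0, v=0)
"""

# Captures the state number and the state name of each SetState call.
SETSTATE_RE = re.compile(r'FSFO SetState\(st=(\d+) "([^"]+)"')


def state_sequence(log_text):
    """Return the (state number, state name) pairs in log order."""
    return [(int(number), name) for number, name in SETSTATE_RE.findall(log_text)]


def switch_completed(states):
    """Simplified check: a switch started, was cleared, and OBSERVED followed."""
    names = [name for _, name in states]
    if "SET SWOB INPRG" not in names:
        return False
    after_start = names[names.index("SET SWOB INPRG") + 1:]
    return "CLR SWOB INPRG" in after_start and "OBSERVED" in after_start


if __name__ == "__main__":
    states = state_sequence(PRIMARY_LOG)
    for number, name in states:
        print(number, name)
    print("switch completed:", switch_completed(states))
```

The same two functions work on the standby log, where the events appear in a different order.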
    DGMGRL> set masterobserver to dbobss;
    Sent the proposed master observer to the data guard broker configuration. Please run SHOW OBSERVER to see if master observer switch actually happens.
    DGMGRL>
With multiple observers you have more control over how to protect your DG: you can have one observer at each site and choose the side that you want to protect. You can write a script that checks the database role and changes the master observer to protect the desired role.
Remember my previous post: if you choose to protect the primary (with the observer in the same datacenter) and your entire primary datacenter fails, FSFO does not occur because the standby does not decide alone. If you choose to protect the standby (with the observer in the same datacenter) and there is a datacenter or network failure on the standby side, this can lead to a complete shutdown of a healthy primary database because it becomes “isolated”.
Since multiple observers still use a hierarchical decision model, the decision rests with only one observer. Even if you have three observers, if you put the master observer at the same site as the standby and they become isolated, they still decide alone, and because of FSFO the primary still shuts down since it thinks it is isolated (even though the primary continues to receive connections from the other two observers).
Because of the current design, even if you set “FastStartFailoverThreshold” to 240, for example, the automatic switch from the master observer to another one does not occur, because the standby side cannot be reached to confirm the change of observer. Maybe in future versions (20, 21…) we will see a change in this design so that, when multiple observers are in use, a voting/quorum method decides the role change for FSFO. Of course, even a quorum approach can cause problems if you put two observers in the same datacenter and that site becomes isolated, but it could mitigate problems in some other cases.
In my next post I will dig deeper into this, with some examples and log/trace analysis. You will see some details of what happens when the standby is isolated and you use multiple observers.
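The core of such a script can be very small. The sketch below only covers the decision: given which database is currently primary and which observer is currently master, it returns the `set masterobserver` command to run, or nothing if no change is needed. The `PREFERRED` mapping is an assumption for illustration (I assume dbobsp sits in the datacenter of orcl and dbobss in the datacenter of orcls, and that you want the master observer next to the primary); `plan_master_switch` is a hypothetical name.

```python
# Assumed policy: keep the master observer in the same datacenter as the
# current primary. The names come from this post; the placement is a guess.
PREFERRED = {
    "orcl": "dbobsp",
    "orcls": "dbobss",
}


def plan_master_switch(current_primary, current_master, preferred=PREFERRED):
    """Return the DGMGRL command to run, or None if nothing should change."""
    wanted = preferred.get(current_primary)
    if wanted is None:
        # Unknown primary: do nothing rather than guess.
        return None
    if wanted == current_master:
        return None
    return f"set masterobserver to {wanted};"


if __name__ == "__main__":
    # With orcl as primary and dbobss as master, the policy asks for a switch.
    print(plan_master_switch("orcl", "dbobss"))
    # Already on the preferred observer: prints None.
    print(plan_master_switch("orcl", "dbobsp"))
```

If you prefer to protect the standby side instead, invert the mapping. Either way, run SHOW OBSERVER afterwards, since the command only sends a proposal to the broker.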
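To make the difference concrete, here is a toy model of the two approaches. It is not how Data Guard is implemented in any version: it only encodes the rule described in this post (the primary considers itself isolated when it loses both the standby and the master observer) and compares it with a hypothetical majority rule. All function names are invented for the example.

```python
# The three observers from the example environment in this post.
OBSERVERS = ["dbobsp", "dbobss", "dbobst"]


def isolated_single_master(reachable, master):
    """Current behaviour as described here: only the master observer counts."""
    return "standby" not in reachable and master not in reachable


def isolated_quorum(reachable, observers):
    """Hypothetical rule: isolated only without a majority of observers."""
    seen = sum(1 for name in observers if name in reachable)
    return "standby" not in reachable and 2 * seen <= len(observers)


if __name__ == "__main__":
    # Standby datacenter is down: the primary loses the standby and dbobss,
    # but still talks to the other two observers.
    standby_site_down = {"dbobsp", "dbobst"}
    print("single master (dbobss):",
          isolated_single_master(standby_site_down, "dbobss"))
    print("quorum:", isolated_quorum(standby_site_down, OBSERVERS))

    # Two observers share the failed site: even a quorum does not help.
    two_observers_lost = {"dbobst"}
    print("quorum, two lost:", isolated_quorum(two_observers_lost, OBSERVERS))
```

In the first scenario the single-master rule declares the primary isolated while the majority rule does not; in the second, both rules do, which is the caveat about placing two observers in the same datacenter.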