{"id":699,"date":"2020-03-23T22:16:33","date_gmt":"2020-03-24T01:16:33","guid":{"rendered":"http:\/\/www.fernandosimon.com\/blog\/?p=699"},"modified":"2020-07-19T19:10:47","modified_gmt":"2020-07-19T22:10:47","slug":"asm-mount-restricted-force-for-recovery","status":"publish","type":"post","link":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/","title":{"rendered":"ASM, Mount Restricted Force For Recovery"},"content":{"rendered":"<p style=\"text-align: justify;\">Surviving disk failures is crucial to avoid data corruption, but sometimes, even with ASM redundancy, multiple failures can happen. This post shows how to use the undocumented feature \u201c<strong>mount restricted force for recovery<\/strong>\u201d to resurrect a diskgroup and lose less data when multiple failures occur.<\/p>\n<p style=\"text-align: justify;\">Diskgroup redundancy is a key factor in ASM resilience: it lets you survive disk failures and keep your databases running. I will not go into detail about ASM disk redundancy here, but usually you can configure a diskgroup with no redundancy (EXTERNAL), double redundancy (NORMAL), triple redundancy (HIGH), and even a fourth level (EXTENDED, for stretch clusters).<\/p>\n<p style=\"text-align: justify;\">If you want to understand more about redundancy, there are many articles on MOS and on the internet with useful information. One good one <a href=\"https:\/\/www.doag.org\/formes\/pubfiles\/8586114\/2016-INF-Emre_Baransel-A_Deep_Dive_into_ASM_Redundancy_in_Exadata-Praesentation.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">is this.<\/a> The idea is simple: spread multiple copies across different disks. 
It gets even better if you group disks into failgroups, so that your data has multiple copies in separate places.<\/p>\n<p style=\"text-align: justify;\">As an example, this is key for Exadata, where every storage cell is an independent failgroup and you can survive the failure of one entire cell (or two, depending on the redundancy of your diskgroup) without data loss. The same idea can be applied in a \u201cnormal\u201d environment, where you can create one failgroup for the disks attached to controller A and another for those attached to controller B (so the failure of one storage controller does not affect all failgroups). In ASM, if you do not create failgroups, each disk becomes its own failgroup in diskgroups that have redundancy enabled.<\/p>\n<p style=\"text-align: justify;\"><!--more Click here to read more...--><\/p>\n<p style=\"text-align: justify;\">The image below represents Exadata, but it is a fair representation in general. Basically, your data will be in at least two different failgroups:<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-700 size-full\" src=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png\" alt=\"\" width=\"886\" height=\"457\" srcset=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png 886w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata-300x155.png 300w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata-768x396.png 768w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata-624x322.png 624w\" sizes=\"auto, (max-width: 886px) 100vw, 886px\" \/><\/a><\/p>\n<h2 style=\"text-align: justify;\">Environment<\/h2>\n<p style=\"text-align: 
justify;\">In the example I use here, I have one diskgroup called DATA with 7 (seven) disks, each in its own failgroup. The redundancy of this diskgroup is NORMAL, which means that each block is copied to two failgroups. If two failures occur, I will probably have data loss\/corruption. Look:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                           FAILGROUP                      LABEL                           PATH\r\n------------------------------ ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI01                        CELLI01                        CELLI01                         ORCL:CELLI01\r\nCELLI02                        CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                        CELLI03                        CELLI03                         ORCL:CELLI03\r\nCELLI04                        CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                        CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                        CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                        CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                         RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                    SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n9 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">My GI version is 19.6.0.0, but this can be used from 12.1.0.2 onward (and works for 11.2.0.4 in some versions). 
On this server, I have three databases running: DBA19, DBB19, and DBC19.<\/p>\n<p style=\"text-align: justify;\">So, with everything running correctly, the data from my databases is spread across the failgroups, with each block stored in two of them (this is just an illustration, not an exact map of where the blocks of my databases are):<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-701 size-full\" src=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies.png\" alt=\"\" width=\"257\" height=\"305\" srcset=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies.png 257w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-253x300.png 253w\" sizes=\"auto, (max-width: 257px) 100vw, 257px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Remember that NORMAL redundancy needs just two copies. So some blocks of datafile 1 of DBA19, as an example, can be stored in CELLI01 and CELLI04. And if your database is small (and your failgroups are big), and you are lucky too, the entire database can be stored in just these two places. In case of a failure that involves only the CELLI02 and CELLI03 failgroups, your data (from DBA19) can be intact.<\/p>\n<h2 style=\"text-align: justify;\">Understanding the failure<\/h2>\n<p style=\"text-align: justify;\">Unfortunately, failures (will) happen, and several can happen at the same time. In the diskgroup DATA above, after the second failure, your diskgroup will be dismounted instantly. 
Usually when this occurs, if you can\u2019t recover from the hardware error, you need to recreate the diskgroup and then restore and recover your databases from backup.<\/p>\n<p style=\"text-align: justify;\">If you are lucky and the failures occur at the same time, you can (most of the time) bring the failed disks back and mount the diskgroup, because there is no difference between the failed disks\/failgroups. The problem occurs when you have one failure (say, the CELLI03 failgroup disappears) and after some time another failgroup fails (say, CELLI07). The detail is that between the failures the databases continued to run and change data on disk. So when the first failgroup returns, its contents differ from the rest.<\/p>\n<p style=\"text-align: justify;\">Another very important point to understand is the time available to repair the failure. If a disk\/failgroup fails in ASM, the attributes <em>disk_repair_time<\/em> and <em>failgroup_repair_time<\/em> define how long you have to repair the failure before the rebalance of data takes place. The first (disk_repair_time) is the time you have to repair a disk in case of failure; if your failgroup has more than one disk, only the failed one is rebalanced. The second (failgroup_repair_time) is the time you have to repair a failed failgroup (when it fails completely).<\/p>\n<p style=\"text-align: justify;\">The interesting point here is that from the moment of failure until this clock runs out you are susceptible to another failure. If it occurs (more failures than your mirror protection tolerates) you will lose the diskgroup. Another fact is that between the failures your databases continue to run, so if you bring back the first failed disk\/failgroup, you need to sync it.<\/p>\n<p style=\"text-align: justify;\">These \u201crepair times\u201d give you time to fix\/recover the failure and avoid the rebalance. 
Think about the architecture: diskgroups with redundancy are usually big and protect big environments (think of an Exadata, as an example, where each disk can have 14TB and one cell can have up to 12 of them), and rebalancing this amount of data takes a lot of time. To avoid this, if your failed disk is replaced before this time expires, only a resync of the changed blocks is needed.<\/p>\n<p style=\"text-align: justify;\">A \u201cdefault configuration\u201d has these values:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select dg.name,a.value,a.name\r\n  2  from v$asm_diskgroup dg, v$asm_attribute a\r\n  3  where dg.group_number=a.group_number\r\n  4  and a.name like '%time'\r\n  5  \/\r\n\r\nNAME                                     VALUE           NAME\r\n---------------------------------------- --------------- ----------------------------------------\r\nDATA                                     12.0h           disk_repair_time\r\nDATA                                     24.0h           failgroup_repair_time\r\nRECO                                     24.0h           failgroup_repair_time\r\nRECO                                     12.0h           disk_repair_time\r\nSYSTEMDG                                 24.0h           failgroup_repair_time\r\nSYSTEMDG                                 12.0h           disk_repair_time\r\n\r\n6 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">But think of a scenario where more than one failure occurs, the first in CELLI01 at 08:00 am and the second in CELLI06 at 10:00 am. Now the surviving disks hold two hours of newer block versions. If you fix the first failure (CELLI01), there is no guarantee that it holds the latest version of everything, and a normal mount will not work.<\/p>\n<p style=\"text-align: justify;\">And it is here that <strong><em>mount restricted force for recovery<\/em><\/strong> enters. 
It allows you to resurrect the diskgroup and helps you restore less. In the example above, if the failures occur in CELLI01 and CELLI06 but your datafiles are in CELLI02 and CELLI07, you lose nothing. Or you may need to restore just some tablespaces and not the whole database. So there is more to gain than to lose.<\/p>\n<h2 style=\"text-align: justify;\">Mount restricted force for recovery<\/h2>\n<p style=\"text-align: justify;\">Here, I will simulate multiple disk failures (more than one) and show how you can use <strong><em>mount restricted force for recovery<\/em><\/strong>. Please be careful and follow all the steps correctly to avoid mistakes and to understand what to do and what is happening.<\/p>\n<p style=\"text-align: justify;\">So, here I have the DATA diskgroup, with normal redundancy and 7 (seven) failgroups, and the DBA19, DBB19, and DBC19 databases running.<\/p>\n<p style=\"text-align: justify;\">As the first step, I will simulate a complete failure of the CELLI03 failgroup. In my environment, to allow more control, I have one iSCSI target for each failgroup (this allows me to disconnect them one by one if needed). 
The CELLI03 died:<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-03.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-702 size-full\" src=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-03.png\" alt=\"\" width=\"257\" height=\"305\" srcset=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-03.png 257w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-03-253x300.png 253w\" sizes=\"auto, (max-width: 257px) 100vw, 257px\" \/><\/a><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[root@asmrec ~]# iscsiadm -m session\r\ntcp: [11] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.d65b214fca9a (non-flash) --CELLI04\r\ntcp: [14] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.637b3bbfa86d (non-flash) --CELLI07\r\ntcp: [17] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.2f4cdb93107c (non-flash) --CELLI05\r\ntcp: [2] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.bb66b92348a7 (non-flash)  --CELLI03\r\ntcp: [20] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.57c0a000e316 (non-flash) --(SYS)\r\ntcp: [23] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.89ef4420ea4d (non-flash) --CELLI06\r\ntcp: [5] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.eff4683320e8 (non-flash)  --CELLI01\r\ntcp: [8] 172.16.0.3:3260,1 iqn.2006-01.com.openfiler:tsn.7d8f4c8f5012 (non-flash)  --CELLI02\r\n[root@asmrec ~]#\r\n[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -u\r\nLogging out of session [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260]\r\nLogout of [sid: 2, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.\r\n[root@asmrec ~]#<\/pre>\n<p 
style=\"text-align: justify;\">And at ASM alertlog we can see:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">2020-03-22T17:14:11.589115+01:00\r\nNOTE: process _user8100_+asm1 (8100) initiating offline of disk 9.4042310133 (CELLI03) with mask 0x7e in group 1 (DATA) with client assisting\r\nNOTE: checking PST: grp = 1\r\n2020-03-22T17:14:11.589394+01:00\r\nGMON checking disk modes for group 1 at 127 for pid 40, osid 8100\r\n2020-03-22T17:14:11.589584+01:00\r\nNOTE: checking PST for grp 1 done.\r\nNOTE: initiating PST update: grp 1 (DATA), dsk = 9\/0xf0f0c1f5, mask = 0x6a, op = clear mandatory\r\n2020-03-22T17:14:11.589746+01:00\r\nGMON updating disk modes for group 1 at 128 for pid 40, osid 8100\r\ncluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1\r\nWARNING: Write Failed. group:1 disk:9 AU:1 offset:4190208 size:4096\r\npath:ORCL:CELLI03\r\n         incarnation:0xf0f0c1f5 synchronous result:'I\/O error'\r\n         subsys:\/opt\/oracle\/extapi\/64\/asm\/orcl\/1\/libasm.so krq:0x7f9182f72210 bufp:0x7f9182f78000 osderr1:0x3 osderr2:0x2e\r\n         IO elapsed time: 0 usec Time waited on I\/O: 0 usec\r\nWARNING: found another non-responsive disk 9.4042310133 (CELLI03) that will be offlined<\/pre>\n<p style=\"text-align: justify;\">So, the failure occurred at 17:14. 
The full output from the ASM alertlog can be found here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-Failure-CELLI03.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-Failure-CELLI03.txt<\/a><\/p>\n<p style=\"text-align: justify;\">And we can see that the disk disappeared from ASM (but was not deleted or dropped):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI01                                  CELLI01                        CELLI01                         ORCL:CELLI01\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                                  CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n9 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">At this point, ASM starts counting down the 12-hour clock (as defined in my 
repair attributes). The failgroup was not dropped and no rebalance started, because ASM is optimistic that you will fix the issue within this period.<\/p>\n<p style=\"text-align: justify;\">But after some time I had a second failure in the diskgroup:<\/p>\n<p style=\"text-align: justify;\"><a href=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-01.png\" target=\"_blank\" rel=\"noopener noreferrer\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-704 size-full\" src=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-01.png\" alt=\"\" width=\"257\" height=\"305\" srcset=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-01.png 257w, https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Block-Copies-FAILED-01-253x300.png 253w\" sizes=\"auto, (max-width: 257px) 100vw, 257px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Now in the ASM alertlog you can see that the diskgroup was dismounted (along with several other messages). Below is an excerpt from the alertlog. The full output (which I think deserves a look) is here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-Failure-CELLI03-and-CELL01<\/a><\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">2020-03-22T17:18:39.699555+01:00\r\nWARNING: Write Failed. group:1 disk:1 AU:1 offset:4190208 size:4096\r\npath:ORCL:CELLI01\r\n         incarnation:0xf0f0c1f3 asynchronous result:'I\/O error'\r\n         subsys:\/opt\/oracle\/extapi\/64\/asm\/orcl\/1\/libasm.so krq:0x7f9182f833d0 bufp:0x7f91836ef000 osderr1:0x3 osderr2:0x2e\r\n         IO elapsed time: 0 usec Time waited on I\/O: 0 usec\r\nWARNING: Hbeat write to PST disk 1.4042310131 in group 1 failed. 
[2]\r\n2020-03-22T17:18:39.704035+01:00\r\n...\r\n...\r\n2020-03-22T17:18:39.746945+01:00\r\nNOTE: cache closing disk 9 of grp 1: (not open) CELLI03\r\nERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.\r\n2020-03-22T17:18:39.747462+01:00\r\nERROR: too many offline disks in PST (grp 1)\r\n2020-03-22T17:18:39.759171+01:00\r\nNOTE: cache dismounting (not clean) group 1\/0xB48031B9 (DATA)\r\nNOTE: messaging CKPT to quiesce pins Unix process pid: 12050, image: oracle@asmrec.oralocal (B001)\r\n2020-03-22T17:18:39.761807+01:00\r\nNOTE: halting all I\/Os to diskgroup 1 (DATA)\r\n2020-03-22T17:18:39.766289+01:00\r\nNOTE: LGWR doing non-clean dismount of group 1 (DATA) thread 1\r\nNOTE: LGWR sync ABA=23.3751 last written ABA 23.3751\r\n...\r\n...\r\n2020-03-22T17:18:40.207406+01:00\r\nSQL&gt; alter diskgroup DATA dismount force \/* ASM SERVER:3028300217 *\/\r\n...\r\n...\r\n2020-03-22T17:18:40.841979+01:00\r\nErrors in file \/u01\/app\/grid\/diag\/asm\/+asm\/+ASM1\/trace\/+ASM1_rbal_8756.trc:\r\nORA-15130: diskgroup \"DATA\" is being dismounted\r\n2020-03-22T17:18:40.853738+01:00\r\n...\r\n...\r\nERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline.\r\n2020-03-22T17:18:40.861939+01:00\r\nERROR: too many offline disks in PST (grp 1)\r\n...\r\n...\r\n2020-03-22T17:18:43.214368+01:00\r\nErrors in file \/u01\/app\/grid\/diag\/asm\/+asm\/+ASM1\/trace\/+ASM1_rbal_8756.trc:\r\nORA-15130: diskgroup \"DATA\" is being dismounted\r\n2020-03-22T17:18:43.214885+01:00\r\nNOTE: client DBC19:DBC19:asmrec no longer has group 1 (DATA) mounted\r\n2020-03-22T17:18:43.215492+01:00\r\nNOTE: client DBB19:DBB19:asmrec no longer has group 1 (DATA) mounted\r\nNOTE: cache deleting context for group DATA 1\/0xb48031b9\r\n...\r\n...\r\n2020-03-22T17:18:43.298551+01:00\r\nSUCCESS: alter diskgroup DATA dismount force \/* ASM 
SERVER:3028300217 *\/\r\nSUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA\r\n2020-03-22T17:18:43.352003+01:00\r\nSQL&gt; ALTER DISKGROUP DATA MOUNT  \/* asm agent *\/\/* {0:1:9} *\/\r\n2020-03-22T17:18:43.372816+01:00\r\nNOTE: cache registered group DATA 1\/0xB44031BF\r\nNOTE: cache began mount (first) of group DATA 1\/0xB44031BF\r\nNOTE: Assigning number (1,8) to disk (ORCL:CELLI02)\r\nNOTE: Assigning number (1,0) to disk (ORCL:CELLI04)\r\nNOTE: Assigning number (1,11) to disk (ORCL:CELLI05)\r\nNOTE: Assigning number (1,3) to disk (ORCL:CELLI06)\r\nNOTE: Assigning number (1,2) to disk (ORCL:CELLI07)\r\n2020-03-22T17:18:43.514642+01:00\r\ncluster guid (e4db41a22bd95fc6bf79d2e2c93360c7) generated for PST Hbeat for instance 1\r\n2020-03-22T17:18:46.089517+01:00\r\nNOTE: detected and added orphaned client id 0x10010\r\nNOTE: detected and added orphaned client id 0x1000e<\/pre>\n<p style=\"text-align: justify;\">So, the second failure occurred at 17:18 and led to a forced dismount of the diskgroup. You can see messages like \u201c<em>NOTE: cache dismounting (not clean)<\/em>\u201d, \u201c<em>ERROR: too many offline disks in PST (grp 1)<\/em>\u201d, and even \u201c<em>ERROR: disk 1 (CELLI01) in group 1 (DATA) cannot be offlined because all disks [1(CELLI01), 9(CELLI03)] with mirrored data would be offline<\/em>\u201d.<\/p>\n<p style=\"text-align: justify;\">So, probably some data was lost. And if you consider that data was changed in the databases during the 4 minutes between the failures, the mess is big. 
If you want to see the alertlog from databases, check here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-From-Databases-Alertlog-at-Failure<\/a><\/p>\n<p style=\"text-align: justify;\">And now we have this at ASM:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n                                                                        CELLI02                         ORCL:CELLI02\r\n                                                                        CELLI04                         ORCL:CELLI04\r\n                                                                        CELLI05                         ORCL:CELLI05\r\n                                                                        CELLI06                         ORCL:CELLI06\r\n                                                                        CELLI07                         ORCL:CELLI07\r\n\r\n7 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">And if we try to mount we receive an error due to disk offline:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data mount;\r\nalter diskgroup data mount\r\n*\r\nERROR at line 1:\r\nORA-15032: not all 
alterations performed\r\nORA-15040: diskgroup is incomplete\r\nORA-15042: ASM disk \"9\" is missing from group number \"1\"\r\nORA-15042: ASM disk \"1\" is missing from group number \"1\"\r\n\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\"><strong>Now comes the key decision<\/strong>. If you have important data that is worth the effort of a recovery attempt, you can continue. It is your decision, and it depends on several details. Since the diskgroup is dismounted, the repair timer is not counting, so you have days to work on the recovery. Sometimes one day of downtime is better than several days spent recovering all databases from the last backup.<\/p>\n<p style=\"text-align: justify;\">Imagine that you can bring back online the first failed failgroup (CELLI03), which is 4 minutes behind on data:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[root@asmrec ~]# iscsiadm -m node -T iqn.2006-01.com.openfiler:tsn.bb66b92348a7 -p 172.16.0.3:3260 -l\r\nLogging in to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] (multiple)\r\nLogin to [iface: default, target: iqn.2006-01.com.openfiler:tsn.bb66b92348a7, portal: 172.16.0.3,3260] successful.\r\n[root@asmrec ~]#\r\n<\/pre>\n<p style=\"text-align: justify;\">And if you try to mount the diskgroup normally you will receive an error (the alertlog output from this attempt can be seen here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-Mout-With-One-Disk-Online.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-Mout-With-One-Disk-Online<\/a>)<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data mount;\r\nalter diskgroup data mount\r\n*\r\nERROR at line 1:\r\nORA-15032: not all alterations performed\r\nORA-15017: diskgroup \"DATA\" cannot be mounted\r\nORA-15066: offlining disk \"1\" in group \"DATA\" may result in a data loss\r\n\r\n\r\nSQL&gt;<\/pre>\n<p 
style=\"text-align: justify;\"><strong>So, now we can try the mount restricted force for recovery<\/strong>:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data mount restricted force for recovery;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">The ASM alertlog (which you can see in full here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-Mout-Restricted-Force-For-Recovery<\/a>) reports messages related to the diskgroup cache and to the disks that need to be checked. And now we are like this:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI01                                  CELLI01\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                                  CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                        
  ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n                                                                        CELLI03                         ORCL:CELLI03\r\n\r\n10 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">The next step is to bring online the failgroup that came back:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data online disks in failgroup CELLI03;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">Doing this ASM will resync this failgroup (using this block as the last version) and bring the cache of this disk online. At ASM alertlog you can see messages like (full output here at <a href=\"https:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-ALERTLOG-Output-Online-Restored-Failgroup.txt\" target=\"_blank\" rel=\"noopener noreferrer\">ASM-ALERTLOG-Output-Online-Restored-Failgroup<\/a>):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">2020-03-22T17:27:47.729003+01:00\r\nSQL&gt; alter diskgroup data online disks in failgroup CELLI03\r\n2020-03-22T17:27:47.729551+01:00\r\nNOTE: cache closing disk 1 of grp 1: (not open) CELLI01\r\n2020-03-22T17:27:47.729640+01:00\r\nNOTE: cache closing disk 9 of grp 1: (not open) CELLI03\r\n2020-03-22T17:27:47.730398+01:00\r\nNOTE: GroupBlock outside rolling migration privileged region\r\nNOTE: initiating resync of disk group 1 disks\r\nCELLI03 (9)\r\n\r\nNOTE: process _user6891_+asm1 (6891) initiating offline of disk 9.4042310248 (CELLI03) with mask 0x7e in group 1 (DATA) without client assisting\r\n2020-03-22T17:27:47.737580+01:00\r\n...\r\n...\r\n2020-03-22T17:27:47.796524+01:00\r\nNOTE: disk validation pending for 1 disk in group 1\/0x1d7031d4 (DATA)\r\nNOTE: Found ORCL:CELLI03 for disk CELLI03\r\nNOTE: completed disk validation for 1\/0x1d7031d4 
(DATA)\r\n2020-03-22T17:27:47.935467+01:00\r\n...\r\n...\r\n2020-03-22T17:27:48.116572+01:00\r\nNOTE: cache closing disk 1 of grp 1: (not open) CELLI01\r\nNOTE: cache opening disk 9 of grp 1: CELLI03 label:CELLI03\r\n2020-03-22T17:27:48.117158+01:00\r\nSUCCESS: refreshed membership for 1\/0x1d7031d4 (DATA)\r\n2020-03-22T17:27:48.123545+01:00\r\nNOTE: initiating PST update: grp 1 (DATA), dsk = 9\/0x0, mask = 0x5d, op = assign mandatory\r\n...\r\n...\r\n2020-03-22T17:27:48.142068+01:00\r\nNOTE: PST update grp = 1 completed successfully\r\n2020-03-22T17:27:48.143197+01:00\r\nSUCCESS: alter diskgroup data online disks in failgroup CELLI03\r\n2020-03-22T17:27:48.577277+01:00\r\nNOTE: Attempting voting file refresh on diskgroup DATA\r\nNOTE: Refresh completed on diskgroup DATA. No voting file found.\r\n...\r\n...\r\n2020-03-22T17:27:48.643277+01:00\r\nNOTE: Starting resync using Staleness Registry and ATE scan for group 1\r\n2020-03-22T17:27:48.696075+01:00\r\nNOTE: Starting resync using Staleness Registry and ATE scan for group 1\r\nNOTE: header on disk 9 advanced to format #2 using fcn 0.0\r\n2020-03-22T17:27:49.725837+01:00\r\nWARNING: Started Drop Disk Timeout for Disk 1 (CELLI01) in group 1 with a value 43200\r\n2020-03-22T17:27:57.301042+01:00\r\n...\r\n2020-03-22T17:27:59.687480+01:00\r\nNOTE: cache closing disk 1 of grp 1: (not open) CELLI01\r\nNOTE: reset timers for disk: 9\r\nNOTE: completed online of disk group 1 disks\r\nCELLI03 (9)\r\n\r\n2020-03-22T17:27:59.714674+01:00\r\nERROR: ORA-15421 thrown in ARBA for group number 1\r\n2020-03-22T17:27:59.714805+01:00\r\nErrors in file \/u01\/app\/grid\/diag\/asm\/+asm\/+ASM1\/trace\/+ASM1_arba_8786.trc:\r\nORA-15421: Rebalance is not supported when the disk group is mounted for recovery.\r\n2020-03-22T17:27:59.715047+01:00\r\nNOTE: stopping process ARB0\r\nNOTE: stopping process ARBA\r\n2020-03-22T17:28:00.652115+01:00\r\nNOTE: rebalance interrupted for group 1\/0x1d7031d4 (DATA)\r\n<\/pre>\n<p style=\"text-align: 
justify;\">And now we have this at ASM:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI01                                  CELLI01\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n9 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">The rebalance does not continue because it is not allowed while the diskgroup is mounted in restricted mode for recovery:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select * from gv$asm_operation;\r\n\r\n   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID\r\n---------- ------------ ----- --------- ---- 
---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------\r\n         1            1 REBAL COMPACT   WAIT          1                                                                                                               0\r\n         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0\r\n         1            1 REBAL REBUILD   WAIT          1                                                                                                               0\r\n         1            1 REBAL RESYNC    WAIT          1                                                                                                               0\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">But since the failgroup came online in a \u201cforced way\u201d, the old cache (from CELLI01) needs to be cleaned. And since it does not hold the latest version, some files may have been corrupted. 
To check this, you can look at the *arb* process trace files in the ASM trace directory:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[root@asmrec trace]# ls -lFhtr *arb*\r\n...\r\n...\r\n-rw-r----- 1 grid oinstall 6.4K Mar 22 17:10 +ASM1_arb0_3210.trm\r\n-rw-r----- 1 grid oinstall  44K Mar 22 17:10 +ASM1_arb0_3210.trc\r\n-rw-r----- 1 grid oinstall  984 Mar 22 17:27 +ASM1_arb0_8788.trm\r\n-rw-r----- 1 grid oinstall 2.1K Mar 22 17:27 +ASM1_arb0_8788.trc\r\n-rw-r----- 1 grid oinstall  882 Mar 22 17:27 +ASM1_arba_8786.trm\r\n-rw-r----- 1 grid oinstall 1.2K Mar 22 17:27 +ASM1_arba_8786.trc\r\n[root@asmrec trace]#<\/pre>\n<p style=\"text-align: justify;\">Looking at one of the most recent ones, we can see that some extents (which do not exist in the recovered failgroup, or whose cached version is not the latest one) were filled with dummy (BADFDA7A) data:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[root@asmrec trace]# cat +ASM1_arb0_8788.trc\r\nTrace file \/u01\/app\/grid\/diag\/asm\/+asm\/+ASM1\/trace\/+ASM1_arb0_8788.trc\r\nOracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production\r\nVersion 19.6.0.0.0\r\nBuild label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417\r\nORACLE_HOME:    \/u01\/app\/19.0.0.0\/grid\r\nSystem name:    Linux\r\nNode name:      asmrec.oralocal\r\nRelease:        4.14.35-1902.10.8.el7uek.x86_64\r\nVersion:        #2 SMP Thu Feb 6 11:02:28 PST 2020\r\nMachine:        x86_64\r\nInstance name: +ASM1\r\nRedo thread mounted by this instance: 0 &lt;none&gt;\r\nOracle process number: 40\r\nUnix process pid: 8788, image: oracle@asmrec.oralocal (ARB0)\r\n\r\n\r\n*** 2020-03-22T17:27:59.044949+01:00\r\n*** SESSION ID:(402.55837) 2020-03-22T17:27:59.044969+01:00\r\n*** CLIENT ID:() 2020-03-22T17:27:59.044975+01:00\r\n*** SERVICE NAME:() 2020-03-22T17:27:59.044980+01:00\r\n*** MODULE NAME:() 2020-03-22T17:27:59.044985+01:00\r\n*** ACTION NAME:() 2020-03-22T17:27:59.044989+01:00\r\n*** CLIENT DRIVER:() 
2020-03-22T17:27:59.044994+01:00\r\n\r\n WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 266, extent 22: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery\r\n WARNING: group 1, file 258, extent 7: filling extent with BADFDA7A during recovery\r\n\r\n*** 2020-03-22T17:27:59.680119+01:00\r\nNOTE: initiating PST update: grp 1 (DATA), dsk = 9\/0x0, mask = 0x7f, op = assign mandatory\r\nkfdp_updateDsk(): callcnt 195 grp 1\r\nPST verChk -0: req, id=266369333, grp=1, requested=91 at 03\/22\/2020 17:27:59\r\nNOTE: PST update grp = 1 completed successfully\r\nNOTE: kfdsFilter_freeDskSrSlice for Filter 0x7fbaf6238d38\r\nNOTE: kfdsFilter_clearDskSlice for Filter 0x7fbaf6238d38 (all:TRUE)\r\nNOTE: completed online of disk group 1 disks\r\nCELLI03 (9)\r\n[root@asmrec trace]#<\/pre>\n<p style=\"text-align: justify;\">As you can imagine, this leads to files that need to be restored from backup. But note that only some data is affected, not everything. Remember from the beginning of the post that this depends on how your data is distributed across the ASM failgroups. If you are lucky, only a small amount of data is impacted. This depends on many factors, such as how long the failgroup was offline, the size of the failgroup, the activity of your databases, and many others. 
But the gains can be good and make it worth the effort.<\/p>\n<p style=\"text-align: justify;\">After that, we can dismount the diskgroup normally:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data dismount;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">And mount it again:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; alter diskgroup data mount;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">Now that the diskgroup is mounted cleanly, you can continue with the rebalance:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select * from gv$asm_operation;\r\n\r\n   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID\r\n---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------\r\n         1            1 REBAL COMPACT   WAIT          1                                                                                                               0\r\n         1            1 REBAL REBALANCE ERRS          1                                                         ORA-15421                                             0\r\n         1            1 REBAL REBUILD   WAIT          1                                                                                                               0\r\n         1            1 REBAL RESYNC    WAIT          1                                                                                                               0\r\n\r\nSQL&gt; alter diskgroup DATA rebalance;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">The state on the ASM side is:<\/p>\n<pre class=\"EnlighterJSRAW\" 
data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI01                                  CELLI01\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n9 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">As you can see, CELLI01 was not removed yet (I will talk about it later). But activities can continue and the databases can be checked.<\/p>\n<h3 style=\"text-align: justify;\">Database side<\/h3>\n<p style=\"text-align: justify;\">On the database side, we need to check what we lost and what needs to be recovered. 
Since I am using a cluster, GI tried to start the databases (and as you can see, two came up):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[oracle@asmrec ~]$ ps -ef |grep smon\r\nroot      8254     1  2 13:53 ?        00:04:40 \/u01\/app\/19.0.0.0\/grid\/bin\/osysmond.bin\r\ngrid      8750     1  0 13:54 ?        00:00:00 asm_smon_+ASM1\r\noracle   11589     1  0 17:31 ?        00:00:00 ora_smon_DBB19\r\noracle   11751     1  0 17:31 ?        00:00:00 ora_smon_DBA19\r\noracle   18817 29146  0 17:44 pts\/9    00:00:00 grep --color=auto smon\r\n[oracle@asmrec ~]$<\/pre>\n<h4 style=\"text-align: justify;\">DBA19<\/h4>\n<p style=\"text-align: justify;\">The first one I checked was DBA19; I used RMAN to run VALIDATE DATABASE:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[oracle@asmrec ~]$ rman target \/\r\n\r\nRecovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:45:21 2020\r\nVersion 19.6.0.0.0\r\n\r\nCopyright (c) 1982, 2019, Oracle and\/or its affiliates.  
All rights reserved.\r\n\r\nconnected to target database: DBA19 (DBID=828667324)\r\n\r\nRMAN&gt; validate database;\r\n\r\nStarting validate at 22-MAR-20\r\nusing target database control file instead of recovery catalog\r\nallocated channel: ORA_DISK_1\r\nchannel ORA_DISK_1: SID=260 device type=DISK\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\ninput datafile file number=00001 name=+DATA\/DBA19\/DATAFILE\/system.256.1035153873\r\ninput datafile file number=00004 name=+DATA\/DBA19\/DATAFILE\/undotbs1.258.1035153973\r\ninput datafile file number=00003 name=+DATA\/DBA19\/DATAFILE\/sysaux.257.1035153927\r\ninput datafile file number=00007 name=+DATA\/DBA19\/DATAFILE\/users.259.1035153975\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:03:45\r\nList of Datafiles\r\n=================\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n1    OK     0              17722        117766          5042446\r\n  File Name: +DATA\/DBA19\/DATAFILE\/system.256.1035153873\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              79105\r\n  Index      0              13210\r\n  Other      0              7723\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n3    OK     0              19445        67862           5042695\r\n  File Name: +DATA\/DBA19\/DATAFILE\/sysaux.257.1035153927\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              7988\r\n  Index      0              5531\r\n  Other      0              34876\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n4    FAILED 1              49  
         83247           5042695\r\n  File Name: +DATA\/DBA19\/DATAFILE\/undotbs1.258.1035153973\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              0\r\n  Index      0              0\r\n  Other      511            83151\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n7    OK     0              93           641             4941613\r\n  File Name: +DATA\/DBA19\/DATAFILE\/users.259.1035153975\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              65\r\n  Index      0              15\r\n  Other      0              467\r\n\r\nvalidate found one or more corrupt blocks\r\nSee trace file \/u01\/app\/oracle\/diag\/rdbms\/dba19\/DBA19\/trace\/DBA19_ora_19219.trc for details\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\nincluding current control file for validation\r\nincluding current SPFILE in backup set\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:01\r\nList of Control File and SPFILE\r\n===============================\r\nFile Type    Status Blocks Failing Blocks Examined\r\n------------ ------ -------------- ---------------\r\nSPFILE       OK     0              2\r\nControl File OK     0              646\r\nFinished validate at 22-MAR-20\r\n\r\nRMAN&gt; shutdown abort;\r\n\r\nOracle instance shut down\r\n\r\nRMAN&gt; startup mount;\r\n\r\nconnected to target database (not started)\r\nOracle instance started\r\ndatabase mounted\r\n\r\nTotal System Global Area    1610610776 bytes\r\n\r\nFixed Size                     8910936 bytes\r\nVariable Size                859832320 bytes\r\nDatabase Buffers             734003200 bytes\r\nRedo Buffers                   7864320 bytes\r\n\r\nRMAN&gt; run{\r\n2&gt; restore datafile 4;\r\n3&gt; recover 
datafile 4;\r\n4&gt; }\r\n\r\nStarting restore at 22-MAR-20\r\nallocated channel: ORA_DISK_1\r\nchannel ORA_DISK_1: SID=249 device type=DISK\r\n\r\nchannel ORA_DISK_1: starting datafile backup set restore\r\nchannel ORA_DISK_1: specifying datafile(s) to restore from backup set\r\nchannel ORA_DISK_1: restoring datafile 00004 to +DATA\/DBA19\/DATAFILE\/undotbs1.258.1035153973\r\nchannel ORA_DISK_1: reading from backup piece \/tmp\/9puro5qr_1_1\r\nchannel ORA_DISK_1: piece handle=\/tmp\/9puro5qr_1_1 tag=BKP-DB-INC0\r\nchannel ORA_DISK_1: restored backup piece 1\r\nchannel ORA_DISK_1: restore complete, elapsed time: 00:00:45\r\nFinished restore at 22-MAR-20\r\n\r\nStarting recover at 22-MAR-20\r\nusing channel ORA_DISK_1\r\n\r\nstarting media recovery\r\nmedia recovery complete, elapsed time: 00:00:02\r\n\r\nFinished recover at 22-MAR-20\r\n\r\nRMAN&gt; alter database open;\r\n\r\nStatement processed\r\n\r\nRMAN&gt; exit\r\n\r\n\r\nRecovery Manager complete.\r\n[oracle@asmrec ~]$<\/pre>\n<p style=\"text-align: justify;\">As you can see, datafile 4 FAILED and needed to be restored and recovered. Luckily, the redo was not affected and the open was OK. Since the damaged file was the UNDO, I used shutdown abort (an immediate shutdown could take an eternity and, with the undo unavailable, nothing was happening inside the database anyway).<\/p>\n<p style=\"text-align: justify;\">But as you saw, just one datafile was corrupted. Of course, with bigger databases and bigger failgroups, more files will be corrupted. 
But it is a shot that can be worth it.<\/p>\n<h4 style=\"text-align: justify;\">DBB19<\/h4>\n<p style=\"text-align: justify;\">The second was DBB19 and I used the same approach, VALIDATE DATABASE:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[oracle@asmrec ~]$ export ORACLE_SID=DBB19\r\n[oracle@asmrec ~]$\r\n[oracle@asmrec ~]$ rman target \/\r\n\r\nRecovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 17:55:20 2020\r\nVersion 19.6.0.0.0\r\n\r\nCopyright (c) 1982, 2019, Oracle and\/or its affiliates.  All rights reserved.\r\n\r\nPL\/SQL package SYS.DBMS_BACKUP_RESTORE version 19.03.00.00 in TARGET database is not current\r\nPL\/SQL package SYS.DBMS_RCVMAN version 19.03.00.00 in TARGET database is not current\r\nconnected to target database: DBB19 (DBID=1336872427)\r\n\r\nRMAN&gt; validate database;\r\n\r\nStarting validate at 22-MAR-20\r\nusing target database control file instead of recovery catalog\r\nallocated channel: ORA_DISK_1\r\nchannel ORA_DISK_1: SID=374 device type=DISK\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\ninput datafile file number=00001 name=+DATA\/DBB19\/DATAFILE\/system.261.1035154051\r\ninput datafile file number=00003 name=+DATA\/DBB19\/DATAFILE\/sysaux.265.1035154177\r\ninput datafile file number=00004 name=+DATA\/DBB19\/DATAFILE\/undotbs1.267.1035154235\r\ninput datafile file number=00007 name=+DATA\/DBB19\/DATAFILE\/users.268.1035154241\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:35\r\nList of Datafiles\r\n=================\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n1    OK     0              16763        116487          3861452\r\n  File Name: +DATA\/DBB19\/DATAFILE\/system.261.1035154051\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0             
 78871\r\n  Index      0              13010\r\n  Other      0              7836\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n3    OK     0              19307        62758           3861452\r\n  File Name: +DATA\/DBB19\/DATAFILE\/sysaux.265.1035154177\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              7459\r\n  Index      0              5158\r\n  Other      0              30796\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n4    OK     0              1            35847           3652497\r\n  File Name: +DATA\/DBB19\/DATAFILE\/undotbs1.267.1035154235\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              0\r\n  Index      0              0\r\n  Other      0              35839\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n7    OK     0              85           641             3759202\r\n  File Name: +DATA\/DBB19\/DATAFILE\/users.268.1035154241\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              70\r\n  Index      0              15\r\n  Other      0              470\r\n\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\nincluding current control file for validation\r\nincluding current SPFILE in backup set\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:01\r\nList of Control File and SPFILE\r\n===============================\r\nFile Type    Status Blocks Failing Blocks Examined\r\n------------ ------ -------------- ---------------\r\nSPFILE       OK     0              
2\r\nControl File OK     0              646\r\nFinished validate at 22-MAR-20\r\n\r\nRMAN&gt; VALIDATE CHECK LOGICAL DATABASE;\r\n\r\nStarting validate at 22-MAR-20\r\nusing channel ORA_DISK_1\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\ninput datafile file number=00001 name=+DATA\/DBB19\/DATAFILE\/system.261.1035154051\r\ninput datafile file number=00003 name=+DATA\/DBB19\/DATAFILE\/sysaux.265.1035154177\r\ninput datafile file number=00004 name=+DATA\/DBB19\/DATAFILE\/undotbs1.267.1035154235\r\ninput datafile file number=00007 name=+DATA\/DBB19\/DATAFILE\/users.268.1035154241\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:35\r\nList of Datafiles\r\n=================\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n1    OK     0              16763        116487          3861452\r\n  File Name: +DATA\/DBB19\/DATAFILE\/system.261.1035154051\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              78871\r\n  Index      0              13010\r\n  Other      0              7836\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n3    OK     0              19307        62758           3861452\r\n  File Name: +DATA\/DBB19\/DATAFILE\/sysaux.265.1035154177\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              7459\r\n  Index      0              5158\r\n  Other      0              30796\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n4    OK     0              1            35847           3652497\r\n  File Name: +DATA\/DBB19\/DATAFILE\/undotbs1.267.1035154235\r\n  Block 
Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              0\r\n  Index      0              0\r\n  Other      0              35839\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n7    OK     0              85           641             3759202\r\n  File Name: +DATA\/DBB19\/DATAFILE\/users.268.1035154241\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              70\r\n  Index      0              15\r\n  Other      0              470\r\n\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\nincluding current control file for validation\r\nincluding current SPFILE in backup set\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:01\r\nList of Control File and SPFILE\r\n===============================\r\nFile Type    Status Blocks Failing Blocks Examined\r\n------------ ------ -------------- ---------------\r\nSPFILE       OK     0              2\r\nControl File OK     0              646\r\nFinished validate at 22-MAR-20\r\n\r\nRMAN&gt; exit\r\n\r\n\r\nRecovery Manager complete.\r\n[oracle@asmrec ~]$\r\n<\/pre>\n<p style=\"text-align: justify;\">As you saw, no failures for DBB19. 
I also ran VALIDATE CHECK LOGICAL DATABASE because, since the first validate returned no failed files, I wanted to check the blocks logically as well.<\/p>\n<h4 style=\"text-align: justify;\">DBC19<\/h4>\n<p style=\"text-align: justify;\">The same approach for the last database, but this time datafile 3 failed:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">[oracle@asmrec ~]$ export ORACLE_SID=DBC19\r\n[oracle@asmrec ~]$ rman target \/\r\n\r\nRecovery Manager: Release 19.0.0.0.0 - Production on Sun Mar 22 18:01:33 2020\r\nVersion 19.6.0.0.0\r\n\r\nCopyright (c) 1982, 2019, Oracle and\/or its affiliates.  All rights reserved.\r\n\r\nconnected to target database (not started)\r\n\r\nRMAN&gt; startup mount;\r\n\r\nOracle instance started\r\ndatabase mounted\r\n\r\nTotal System Global Area    1610610776 bytes\r\n\r\nFixed Size                     8910936 bytes\r\nVariable Size                864026624 bytes\r\nDatabase Buffers             729808896 bytes\r\nRedo Buffers                   7864320 bytes\r\n\r\nRMAN&gt; validate database;\r\n\r\nStarting validate at 22-MAR-20\r\nusing target database control file instead of recovery catalog\r\nallocated channel: ORA_DISK_1\r\nchannel ORA_DISK_1: SID=134 device type=DISK\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for validation\r\ninput datafile file number=00001 name=+DATA\/DBC19\/DATAFILE\/system.262.1035154053\r\ninput datafile file number=00004 name=+DATA\/DBC19\/DATAFILE\/undotbs1.270.1035154249\r\ninput datafile file number=00003 name=+DATA\/DBC19\/DATAFILE\/sysaux.266.1035154181\r\ninput datafile file number=00007 name=+DATA\/DBC19\/DATAFILE\/users.271.1035154253\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:03:15\r\nList of Datafiles\r\n=================\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n1    OK     0         
     17777        117764          4188744\r\n  File Name: +DATA\/DBC19\/DATAFILE\/system.262.1035154053\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              79161\r\n  Index      0              13182\r\n  Other      0              7640\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n3    FAILED 1              19272        66585           4289434\r\n  File Name: +DATA\/DBC19\/DATAFILE\/sysaux.266.1035154181\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              7311\r\n  Index      0              4878\r\n  Other      511            35099\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n4    OK     0              1            84522           4188748\r\n  File Name: +DATA\/DBC19\/DATAFILE\/undotbs1.270.1035154249\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              0\r\n  Index      0              0\r\n  Other      0              84479\r\n\r\nFile Status Marked Corrupt Empty Blocks Blocks Examined High SCN\r\n---- ------ -------------- ------------ --------------- ----------\r\n7    OK     0              93           641             3717377\r\n  File Name: +DATA\/DBC19\/DATAFILE\/users.271.1035154253\r\n  Block Type Blocks Failing Blocks Processed\r\n  ---------- -------------- ----------------\r\n  Data       0              65\r\n  Index      0              15\r\n  Other      0              467\r\n\r\nvalidate found one or more corrupt blocks\r\nSee trace file \/u01\/app\/oracle\/diag\/rdbms\/dbc19\/DBC19\/trace\/DBC19_ora_22091.trc for details\r\nchannel ORA_DISK_1: starting validation of datafile\r\nchannel ORA_DISK_1: specifying datafile(s) for 
validation\r\nincluding current control file for validation\r\nincluding current SPFILE in backup set\r\nchannel ORA_DISK_1: validation complete, elapsed time: 00:00:01\r\nList of Control File and SPFILE\r\n===============================\r\nFile Type    Status Blocks Failing Blocks Examined\r\n------------ ------ -------------- ---------------\r\nSPFILE       OK     0              2\r\nControl File OK     0              646\r\nFinished validate at 22-MAR-20\r\n\r\nRMAN&gt; run{\r\n2&gt; restore datafile 3;\r\n3&gt; recover datafile 3;\r\n4&gt; }\r\n\r\nStarting restore at 22-MAR-20\r\nusing channel ORA_DISK_1\r\n\r\nchannel ORA_DISK_1: starting datafile backup set restore\r\nchannel ORA_DISK_1: specifying datafile(s) to restore from backup set\r\nchannel ORA_DISK_1: restoring datafile 00003 to +DATA\/DBC19\/DATAFILE\/sysaux.266.1035154181\r\nchannel ORA_DISK_1: reading from backup piece \/tmp\/0buro5rh_1_1\r\nchannel ORA_DISK_1: piece handle=\/tmp\/0buro5rh_1_1 tag=BKP-DB-INC0\r\nchannel ORA_DISK_1: restored backup piece 1\r\nchannel ORA_DISK_1: restore complete, elapsed time: 00:00:45\r\nFinished restore at 22-MAR-20\r\n\r\nStarting recover at 22-MAR-20\r\nusing channel ORA_DISK_1\r\n\r\nstarting media recovery\r\n\r\narchived log for thread 1 with sequence 25 is already on disk as file +RECO\/DBC19\/ARCHIVELOG\/2020_03_22\/thread_1_seq_25.323.1035737103\r\narchived log for thread 1 with sequence 26 is already on disk as file +RECO\/DBC19\/ARCHIVELOG\/2020_03_22\/thread_1_seq_26.329.1035739907\r\narchived log for thread 1 with sequence 27 is already on disk as file +RECO\/DBC19\/ARCHIVELOG\/2020_03_22\/thread_1_seq_27.332.1035741283\r\narchived log file name=+RECO\/DBC19\/ARCHIVELOG\/2020_03_22\/thread_1_seq_25.323.1035737103 thread=1 sequence=25\r\nmedia recovery complete, elapsed time: 00:00:03\r\nFinished recover at 22-MAR-20\r\n\r\nRMAN&gt; alter database open;\r\n\r\nStatement processed\r\n\r\nRMAN&gt; exit\r\n\r\n\r\nRecovery Manager 
complete.\r\n[oracle@asmrec ~]$<\/pre>\n<h3 style=\"text-align: justify;\">Dropping failgroup<\/h3>\n<p style=\"text-align: justify;\">If fixing the failed failgroup takes too long, ASM drops it automatically once the drop timeout expires. But we can also do this manually with FORCE (note that without FORCE it fails):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01;\r\nALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01\r\n*\r\nERROR at line 1:\r\nORA-15032: not all alterations performed\r\nORA-15084: ASM disk \"CELLI01\" is offline and cannot be dropped.\r\n\r\n\r\nSQL&gt;\r\nSQL&gt; ALTER DISKGROUP data DROP DISKS IN FAILGROUP CELLI01 FORCE;\r\n\r\nDiskgroup altered.\r\n\r\nSQL&gt;<\/pre>\n<p style=\"text-align: justify;\">And after the rebalance finishes, all of its disks are removed:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"no-highlight\">SQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\n_DROPPED_0001_DATA                       CELLI01\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03                                  CELLI03                        CELLI03                         ORCL:CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  
CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n9 rows selected.\r\n\r\nSQL&gt; select * from gv$asm_operation;\r\n\r\n   INST_ID GROUP_NUMBER OPERA PASS      STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE                                       CON_ID\r\n---------- ------------ ----- --------- ---- ---------- ---------- ---------- ---------- ---------- ----------- -------------------------------------------- ----------\r\n         1            1 REBAL COMPACT   WAIT          1          1          0          0          0           0                                                       0\r\n         1            1 REBAL REBALANCE WAIT          1          1          0          0          0           0                                                       0\r\n         1            1 REBAL REBUILD   RUN           1          1        292        642        666           0                                                       0\r\n         1            1 REBAL RESYNC    DONE          1          1          0          0          0           0                                                       0\r\n\r\nSQL&gt; select * from gv$asm_operation;\r\n\r\nno rows selected\r\n\r\nSQL&gt; select NAME,FAILGROUP,LABEL,PATH from v$asm_disk order by FAILGROUP, label;\r\n\r\nNAME                                     FAILGROUP                      LABEL                           PATH\r\n---------------------------------------- ------------------------------ ------------------------------- ------------------------------------------------------------\r\nCELLI02                                  CELLI02                        CELLI02                         ORCL:CELLI02\r\nCELLI03         
                         CELLI03                        CELLI03                         ORCL:CELLI03\r\nCELLI04                                  CELLI04                        CELLI04                         ORCL:CELLI04\r\nCELLI05                                  CELLI05                        CELLI05                         ORCL:CELLI05\r\nCELLI06                                  CELLI06                        CELLI06                         ORCL:CELLI06\r\nCELLI07                                  CELLI07                        CELLI07                         ORCL:CELLI07\r\nRECI01                                   RECI01                         RECI01                          ORCL:RECI01\r\nSYSTEMIDG01                              SYSTEMIDG01                    SYSI01                          ORCL:SYSI01\r\n\r\n8 rows selected.\r\n\r\nSQL&gt;<\/pre>\n<h2 style=\"text-align: justify;\">The steps for MOUNT RESTRICTED FORCE FOR RECOVERY<\/h2>\n<p style=\"text-align: justify;\">To summarize, the required steps are (in order):<\/p>\n<ol style=\"text-align: justify;\">\n<li>Bring <strong>the failed disk\/failgroup back online<\/strong><\/li>\n<li>Execute <strong><em>alter diskgroup &lt;DG&gt; mount restricted force for recovery<\/em><\/strong><\/li>\n<li>Bring the failgroup online with <strong><em>alter diskgroup &lt;DG&gt; online disks in failgroup &lt;FG&gt;<\/em><\/strong><\/li>\n<li>Cleanly dismount the diskgroup with <strong><em>alter diskgroup &lt;DG&gt; dismount<\/em><\/strong><\/li>\n<li>Cleanly mount it again with <strong><em>alter diskgroup &lt;DG&gt; mount<\/em><\/strong><\/li>\n<li>Check the databases for failures and recover them<\/li>\n<\/ol>\n<h2 style=\"text-align: justify;\">Undocumented feature<\/h2>\n<p style=\"text-align: justify;\">So the question is: why is it undocumented? I don\u2019t have the answer, but I can guess at some reasons. For me, the most important one is that it does not give you a full, clean return. You still need to restore and recover from backup. 
You may lose a lot of data.<\/p>\n<p style=\"text-align: justify;\">Of course, this example is a controlled scenario: I have just a few databases, and my failgroup has just one disk. In real life the problem will be worse. More diskgroups can be affected, such as RECO\/REDO\/FRA. You will probably lose some redo logs and archivelogs too, so you can\u2019t do a clean recovery. You may even need to recover the OCR and voting disk of the cluster.<\/p>\n<p style=\"text-align: justify;\">This is where correct architecture design matters: if you need more protection on the ASM side, you can use HIGH redundancy to survive up to two failures without interruption. This is the reason the SYSTEMDG (or OCR\/voting disk) is placed in a high-redundancy diskgroup on Exadata.<\/p>\n<p style=\"text-align: justify;\">Outages and failures can occur in different layers of your environment. But storage\/disk failures are catastrophic for databases because they can lead to data corruption, and then you need backups to recover. They can occur in any environment, from traditional storage to Exadata. In 2016 I had one on an old Exadata V2, used just for DEV databases, where two storage cells crashed (one hour apart), and I needed this procedure to save some files and reduce the downtime by avoiding a restore of everything (more than 10TB).<\/p>\n<p style=\"text-align: justify;\">So it is good to know this kind of procedure because it can save time. 
But whether to use it is your decision; check whether it is worth it in your case.<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\">Some references that you can check:<\/p>\n<ul style=\"text-align: justify;\">\n<li><a href=\"https:\/\/support.oracle.com\/epmos\/faces\/DocContentDisplay?id=1968642.1\" target=\"_blank\" rel=\"noopener noreferrer\">Recover from diskgroup failure using the 12.1.0.2 \u201cmount restricted force for recovery\u201d feature &#8211; An Example (Doc ID 1968642.1)<\/a><\/li>\n<li><a href=\"https:\/\/support.oracle.com\/epmos\/faces\/DocContentDisplay?id=1404123.1\" target=\"_blank\" rel=\"noopener noreferrer\">How to change the DISK_REPAIR_TIME timer after disk goes offline from failgroup (Doc ID 1404123.1)<\/a><\/li>\n<li><a href=\"https:\/\/support.oracle.com\/epmos\/faces\/DocContentDisplay?id=1968607.1\" target=\"_blank\" rel=\"noopener noreferrer\">The ASM Priority Rebalance feature &#8211; An Example (Doc ID 1968607.1)<\/a><\/li>\n<\/ul>\n<p style=\"text-align: justify;\">&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p style=\"text-align: justify;\"><strong>Disclaimer<\/strong>: <em>\u201cThe postings on this site are my own and don\u2019t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community. Post protected by copyright.\u201d<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Survive to disk failures it is crucial to avoid data corruption, but sometimes, even with redundancy at ASM, multiple failures can happen. Check in this post how to use the undocumented feature \u201cmount restricted force for recovery\u201d to resurrect diskgroup and lose less data when multiple failures occur. 
Diskgroup redundancy is a key factor for [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[44,29,77,6,56,5,91],"tags":[80,100,69,125,65],"class_list":["post-699","post","type-post","status-publish","format-standard","hentry","category-backup","category-database","category-engineeredsystems","category-exadata","category-grid-infrastructure","category-oracle","category-rman","tag-asm","tag-engineered-systems","tag-exadata","tag-gi","tag-oracle"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.7 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>ASM, Mount Restricted Force For Recovery - Fernando Simon<\/title>\n<meta name=\"description\" content=\"Check how to use the undocumented ASM &quot;mount restricted force for recovery&quot; to restore and resurrect crashed diskgroup with failed failgroups\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"ASM, Mount Restricted Force For Recovery - Fernando Simon\" \/>\n<meta 
property=\"og:description\" content=\"Check how to use the undocumented ASM &quot;mount restricted force for recovery&quot; to restore and resurrect crashed diskgroup with failed failgroups\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/\" \/>\n<meta property=\"og:site_name\" content=\"Fernando Simon\" \/>\n<meta property=\"article:published_time\" content=\"2020-03-24T01:16:33+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-07-19T22:10:47+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png\" \/>\n<meta name=\"author\" content=\"Simon\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Simon\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"33 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/\"},\"author\":{\"name\":\"Simon\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/#\\\/schema\\\/person\\\/386da956604bca0d5be5dd52210c1dd9\"},\"headline\":\"ASM, Mount Restricted Force For 
Recovery\",\"datePublished\":\"2020-03-24T01:16:33+00:00\",\"dateModified\":\"2020-07-19T22:10:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/\"},\"wordCount\":2513,\"commentCount\":1,\"image\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/www.fernandosimon.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/ASM-Failgroup-Exadata.png\",\"keywords\":[\"ASM\",\"Engineered Systems\",\"Exadata\",\"GI\",\"Oracle\"],\"articleSection\":[\"Backup\",\"Database\",\"Engineered Systems\",\"Exadata\",\"Grid Infrastructure\",\"Oracle\",\"RMAN\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/\",\"url\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/\",\"name\":\"ASM, Mount Restricted Force For Recovery - Fernando Simon\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/www.fernandosimon.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/ASM-Failgroup-Exadata.png\",\"datePublished\":\"2020-03-24T01:16:33+00:00\",\"dateModified\":\"2020-07-19T22:10:47+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/#\\\/schema\\\/person\\\/386da956604bca0d5be5dd52210c1dd9\"},\"description\":\"Check how to use the 
undocumented ASM \\\"mount restricted force for recovery\\\" to restore and resurrect crashed diskgroup with failed failgroups\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#primaryimage\",\"url\":\"http:\\\/\\\/www.fernandosimon.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/ASM-Failgroup-Exadata.png\",\"contentUrl\":\"http:\\\/\\\/www.fernandosimon.com\\\/blog\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/ASM-Failgroup-Exadata.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/asm-mount-restricted-force-for-recovery\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"ASM, Mount Restricted Force For Recovery\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/\",\"name\":\"Fernando Simon\",\"description\":\"Have you hugged your backup 
today?\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/#\\\/schema\\\/person\\\/386da956604bca0d5be5dd52210c1dd9\",\"name\":\"Simon\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g\",\"caption\":\"Simon\"},\"sameAs\":[\"http:\\\/\\\/www.fernandosimon.com\"],\"url\":\"https:\\\/\\\/www.fernandosimon.com\\\/blog\\\/author\\\/simon\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"ASM, Mount Restricted Force For Recovery - Fernando Simon","description":"Check how to use the undocumented ASM \"mount restricted force for recovery\" to restore and resurrect crashed diskgroup with failed failgroups","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/","og_locale":"en_US","og_type":"article","og_title":"ASM, Mount Restricted Force For Recovery - Fernando Simon","og_description":"Check how to use the undocumented ASM \"mount restricted force for recovery\" to restore and resurrect crashed diskgroup with failed failgroups","og_url":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/","og_site_name":"Fernando Simon","article_published_time":"2020-03-24T01:16:33+00:00","article_modified_time":"2020-07-19T22:10:47+00:00","og_image":[{"url":"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png","type":"","width":"","height":""}],"author":"Simon","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Simon","Est. 
reading time":"33 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#article","isPartOf":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/"},"author":{"name":"Simon","@id":"https:\/\/www.fernandosimon.com\/blog\/#\/schema\/person\/386da956604bca0d5be5dd52210c1dd9"},"headline":"ASM, Mount Restricted Force For Recovery","datePublished":"2020-03-24T01:16:33+00:00","dateModified":"2020-07-19T22:10:47+00:00","mainEntityOfPage":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/"},"wordCount":2513,"commentCount":1,"image":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#primaryimage"},"thumbnailUrl":"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png","keywords":["ASM","Engineered Systems","Exadata","GI","Oracle"],"articleSection":["Backup","Database","Engineered Systems","Exadata","Grid Infrastructure","Oracle","RMAN"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/","url":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/","name":"ASM, Mount Restricted Force For Recovery - Fernando 
Simon","isPartOf":{"@id":"https:\/\/www.fernandosimon.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#primaryimage"},"image":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#primaryimage"},"thumbnailUrl":"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png","datePublished":"2020-03-24T01:16:33+00:00","dateModified":"2020-07-19T22:10:47+00:00","author":{"@id":"https:\/\/www.fernandosimon.com\/blog\/#\/schema\/person\/386da956604bca0d5be5dd52210c1dd9"},"description":"Check how to use the undocumented ASM \"mount restricted force for recovery\" to restore and resurrect crashed diskgroup with failed failgroups","breadcrumb":{"@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#primaryimage","url":"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png","contentUrl":"http:\/\/www.fernandosimon.com\/blog\/wp-content\/uploads\/2020\/03\/ASM-Failgroup-Exadata.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.fernandosimon.com\/blog\/asm-mount-restricted-force-for-recovery\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.fernandosimon.com\/blog\/"},{"@type":"ListItem","position":2,"name":"ASM, Mount Restricted Force For Recovery"}]},{"@type":"WebSite","@id":"https:\/\/www.fernandosimon.com\/blog\/#website","url":"https:\/\/www.fernandosimon.com\/blog\/","name":"Fernando Simon","description":"Have you hugged your backup 
today?","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.fernandosimon.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.fernandosimon.com\/blog\/#\/schema\/person\/386da956604bca0d5be5dd52210c1dd9","name":"Simon","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a3dbc48de62fffb1829befb4a588d789ec6dc5e05977afabb3407a5f37a16482?s=96&d=mm&r=g","caption":"Simon"},"sameAs":["http:\/\/www.fernandosimon.com"],"url":"https:\/\/www.fernandosimon.com\/blog\/author\/simon\/"}]}},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p5ofTp-bh","_links":{"self":[{"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/posts\/699","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/comments?post=699"}],"version-history":[{"count":0,"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/posts\/699\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/media?parent=699"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2
\/categories?post=699"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.fernandosimon.com\/blog\/wp-json\/wp\/v2\/tags?post=699"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}