Why did deploying an Oracle Data Guard physical standby by RMAN duplicate never finish?

From: Quanwen Zhao <quanwenzhao_at_gmail.com>
Date: Sun, 21 May 2023 21:40:27 +0800
Message-ID: <CABpiuuQTLGkZKfU7p1CurEd-GU6ZAooGyGU4MTkNxgwm-37uqg_at_mail.gmail.com>



Hello my oracle friends :-),

One of my customers wants to deploy an Oracle Data Guard physical standby by RMAN duplicate from my current Oracle 11.2.0.4.0 single-instance database. After a series of tedious DG parameter configurations on both the primary and the physical standby database, I started using RMAN duplicate to push the memory script from the primary to the physical standby.

My current Oracle 11.2.0.4.0 database holds about *2.0 TB* of data files, and the hardware resources are as follows: logical CPUs: 64, physical memory: 512 GB; disk I/O read and write speed is probably a bit of a bottleneck.
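
For reference, this is roughly how I sample the disks while the copy runs, to see whether I/O really is the limit (just a sketch; the sysstat tools and the 5-second interval are my own choice, nothing specific to the deployment):

> # Rough I/O sampling during the copy (sketch only).
> iostat -xm 5 3      # per-device utilization and MB/s read/written, 3 samples
> vmstat 5 3          # run queue, swap activity, and I/O wait, 3 samples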

First, I used the following shell script to deploy it.

cat duplicate_dg.sh
> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
> /home/oracle/duplicate_dg_`date +%Y%m%d_%H%M%S`.log
> run {
> allocate channel p1 type disk;
> allocate channel p2 type disk;
> allocate channel p3 type disk;
> allocate channel p4 type disk;
> allocate channel p5 type disk;
> allocate channel p6 type disk;
> allocate auxiliary channel d1 type disk;
> allocate auxiliary channel d2 type disk;
> allocate auxiliary channel d3 type disk;
> allocate auxiliary channel d4 type disk;
> allocate auxiliary channel d5 type disk;
> allocate auxiliary channel d6 type disk;
> duplicate target database for standby nofilenamecheck from active database;
> release channel p6;
> release channel p5;
> release channel p4;
> release channel p3;
> release channel p2;
> release channel p1;
> release channel d6;
> release channel d5;
> release channel d4;
> release channel d3;
> release channel d2;
> release channel d1;
> }
> exit;
> EOF
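
While it runs I keep an eye on progress from the primary with a small script like the one below. It is only a sketch (it assumes I can connect locally as SYSDBA on the primary); RMAN publishes its copy progress in V$SESSION_LONGOPS:

> # Rough progress check for the running duplicate (a sketch, not my real script).
> sqlplus -S / as sysdba << EOF
> set linesize 200 pagesize 100
> col opname format a35
> select sid, serial#, opname, sofar, totalwork,
>        round(sofar/totalwork*100, 1) as pct_done
>   from v\$session_longops
>  where opname like 'RMAN%'
>    and totalwork > 0
>    and sofar <> totalwork;
> exit;
> EOF
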
But after *3.5 hours*, the RMAN duplicate log file showed the following error:

......
> executing command: SET NEWNAME
>

Starting backup at 2023-05-19 23:14:31
> channel p1: starting datafile copy
> input datafile file number=00041
> name=/oracle/oradata/xxxxxx/datafile/data_1.dbf
> channel p2: starting datafile copy
> input datafile file number=00042
> name=/oracle/oradata/xxxxxx/datafile/data_2.dbf
> channel p3: starting datafile copy
> input datafile file number=00043
> name=/oracle/oradata/xxxxxx/datafile/data_3.dbf
> channel p4: starting datafile copy
> input datafile file number=00044
> name=/oracle/oradata/xxxxxx/datafile/data_4.dbf
> channel p5: starting datafile copy
> input datafile file number=00045
> name=/oracle/oradata/xxxxxx/datafile/data_5.dbf
> channel p6: starting datafile copy
> input datafile file number=00046
> name=/oracle/oradata/xxxxxx/datafile/data_6.dbf
> output file name=/oracle/oradata/xxxxxxx/datafile/data_7.dbf
> tag=TAG20230519T231432
> channel p1: datafile copy complete, elapsed time: 00:42:08
> channel p1: starting datafile copy
> input datafile file number=00007
> name=/oracle/oradata/xxxxxx/datafile/data_7.1005681629
> output file name=/oracle/oradata/xxxxxx/datafile/data_9.dbf
> tag=TAG20230519T231432
> channel p5: datafile copy complete, elapsed time: 00:42:07
> channel p5: starting datafile copy
> input datafile file number=00008
> name=/oracle/oradata/xxxxxx/datafile/data_8.1005682331
> output file name=/oracle/oradata/xxxxxx/datafile/data_10.dbf
> tag=TAG20230519T231432
> channel p6: datafile copy complete, elapsed time: 00:42:07
> ......
>
>
> *RMAN-03009: failure of backup command on p1 channel at 05/20/2023 02:45:38*
> *ORA-00603: ORACLE server session terminated by fatal error*
> *ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 8262' for more than 900 seconds*
> ......
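
Next time it happens I intend to map the osid reported by ORA-00239 back to a database session before it disappears. A minimal sketch, assuming SYSDBA access on the primary and using the osid from the message above:

> # Sketch: find the session holding the control file enqueue, by its OS pid.
> sqlplus -S / as sysdba << EOF
> set linesize 200 pagesize 100
> col program format a40
> col event format a40
> select s.sid, s.serial#, p.spid, s.program, s.event
>   from v\$session s, v\$process p
>  where p.addr = s.paddr
>    and p.spid = '8262';   -- the osid named in the ORA-00239 error
> exit;
> EOF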

My initial thought was to reduce the number of RMAN channels and limit the copy rate, so I changed the previous shell script as follows:

cat duplicate_dg.sh
> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
> /home/oracle/duplicate_dg_`date +%Y%m%d-%H%M%S`.log
> run {
>
> *allocate channel p1 type disk maxpiecesize 16g maxopenfiles 4 rate 40M;*
> *allocate auxiliary channel d1 type disk maxpiecesize 16g maxopenfiles 4 rate 40M;*
> duplicate target database for standby nofilenamecheck from active database;
> release channel d1;
> release channel p1;
> }
> exit;
> EOF
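
As a side note, I suppose the same throttle could also be made persistent with CONFIGURE instead of per-run ALLOCATE commands; this is only a sketch I have not tried on this system yet (using OS authentication on the primary for brevity):

> rman target / << EOF
> CONFIGURE DEVICE TYPE DISK PARALLELISM 1;
> CONFIGURE CHANNEL DEVICE TYPE DISK MAXPIECESIZE 16G MAXOPENFILES 4 RATE 40M;
> CONFIGURE AUXILIARY CHANNEL DEVICE TYPE DISK MAXPIECESIZE 16G MAXOPENFILES 4 RATE 40M;
> SHOW ALL;
> exit;
> EOF
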
After re-running that shell script, the ORA-00239 error never appeared again, but after 8 hours (having copied *1.3 TB* of data files) my primary Oracle database crashed outright. I saw the following errors in */var/log/messages*:

......
> May 21 16:03:19 aaaaaa kernel: [112149.846996] *oracle invoked
> oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0*
> May 21 16:03:19 aaaaaa kernel: [112149.847013] oracle cpuset=/
> mems_allowed=0
> May 21 16:03:19 aaaaaa kernel: [112149.847036] CPU: 37 PID: 48996 Comm:
> oracle Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
> May 21 16:03:19 aaaaaa kernel: [112149.847038] Hardware name: Xen HVM
> domU, BIOS 4.7.4-1.16 07/13/2018
> May 21 16:03:19 aaaaaa kernel: [112149.847040] 0000000000000000
> ffff88011c157528 ffffffff816c6e40 ffff887dd8bb3800
> May 21 16:03:19 aaaaaa kernel: [112149.847043] 0000000000000000
> ffff88011c157578 ffffffff8118c15e ffff880100000000
> May 21 16:03:19 aaaaaa kernel: [112149.847046] ffffffff000200da
> ffff880148932a00 ffff880d0e7c3800 ffff880d0e7c42b0
> May 21 16:03:19 aaaaaa kernel: [112149.847048] Call Trace:
> May 21 16:03:19 aaaaaa kernel: [112149.847057] [<ffffffff816c6e40>]
> dump_stack+0x63/0x83
> May 21 16:03:19 aaaaaa kernel: [112149.847062] [<ffffffff8118c15e>]
> dump_header+0x8e/0xe0
> May 21 16:03:19 aaaaaa kernel: [112149.847065] [<ffffffff8118c767>]
> oom_kill_process+0x1d7/0x3c0
> May 21 16:03:19 aaaaaa kernel: [112149.847071] [<ffffffff812a8485>] ?
> security_capable_noaudit+0x15/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847083] [<ffffffff8108d707>] ?
> has_capability_noaudit+0x17/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847089] [<ffffffff8118cc18>]
> __out_of_memory+0x2c8/0x370
> May 21 16:03:19 aaaaaa kernel: [112149.847094] [<ffffffff8118ce09>]
> out_of_memory+0x69/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847099] [<ffffffff8119209f>]
> __alloc_pages_slowpath+0x6af/0x760
> May 21 16:03:19 aaaaaa kernel: [112149.847102] [<ffffffff81192401>]
> __alloc_pages_nodemask+0x2b1/0x2d0
> May 21 16:03:19 aaaaaa kernel: [112149.847105] [<ffffffff810b61c8>] ?
> sched_clock_cpu+0xa8/0xc0
> May 21 16:03:19 aaaaaa kernel: [112149.847109] [<ffffffff811de104>]
> alloc_pages_vma+0xd4/0x230
> May 21 16:03:19 aaaaaa kernel: [112149.847113] [<ffffffff811cceed>]
> read_swap_cache_async+0xfd/0x160
> May 21 16:03:19 aaaaaa kernel: [112149.847115] [<ffffffff811cd05e>]
> swapin_readahead+0x10e/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847118] [<ffffffff811a148e>]
> shmem_swapin+0x5e/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847121] [<ffffffff816c6f3d>] ?
> io_schedule_timeout+0xdd/0x110
> May 21 16:03:19 aaaaaa kernel: [112149.847124] [<ffffffff811fad5e>] ?
> swap_cgroup_record+0x4e/0x60
> May 21 16:03:19 aaaaaa kernel: [112149.847127] [<ffffffff8131c703>] ?
> radix_tree_lookup_slot+0x13/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847129] [<ffffffff81187d6e>] ?
> find_get_entry+0x1e/0xa0
> May 21 16:03:19 aaaaaa kernel: [112149.847132] [<ffffffff81189788>] ?
> pagecache_get_page+0x38/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847135] [<ffffffff811a47a0>]
> shmem_getpage_gfp+0x540/0x820
> May 21 16:03:19 aaaaaa kernel: [112149.847137] [<ffffffff811a54ba>]
> shmem_fault+0x6a/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847141] [<ffffffff8129a0ae>]
> shm_fault+0x1e/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847144] [<ffffffff811b877d>]
> __do_fault+0x3d/0xa0
> May 21 16:03:19 aaaaaa kernel: [112149.847149] [<ffffffff810ebfd7>] ?
> current_fs_time+0x27/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847153] [<ffffffff810c8783>] ?
> __wake_up+0x53/0x70
> May 21 16:03:19 aaaaaa kernel: [112149.847155] [<ffffffff811b89c5>]
> do_read_fault+0x1e5/0x300
> May 21 16:03:19 aaaaaa kernel: [112149.847157] [<ffffffff811b8cfc>] ?
> do_shared_fault+0x19c/0x1d0
> May 21 16:03:19 aaaaaa kernel: [112149.847159] [<ffffffff811bc703>]
> handle_pte_fault+0x1e3/0x230
> May 21 16:03:19 aaaaaa kernel: [112149.847163] [<ffffffff810738a0>] ?
> pte_alloc_one+0x30/0x50
> May 21 16:03:19 aaaaaa kernel: [112149.847165] [<ffffffff811b7e27>] ?
> __pte_alloc+0xd7/0x190
> May 21 16:03:19 aaaaaa kernel: [112149.847167] [<ffffffff811bc90c>]
> __handle_mm_fault+0x1bc/0x330
> May 21 16:03:19 aaaaaa kernel: [112149.847169] [<ffffffff811bcb32>]
> handle_mm_fault+0xb2/0x1a0
> May 21 16:03:19 aaaaaa kernel: [112149.847171] [<ffffffff8106ddf3>] ?
> __do_page_fault+0xe3/0x480
> May 21 16:03:19 aaaaaa kernel: [112149.847173] [<ffffffff8106de7c>]
> __do_page_fault+0x16c/0x480
> May 21 16:03:19 aaaaaa kernel: [112149.847175] [<ffffffff8106e337>]
> do_page_fault+0x37/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847177] [<ffffffff816c7b9a>] ?
> schedule_user+0x1a/0x60
> May 21 16:03:19 aaaaaa kernel: [112149.847181] [<ffffffff816cdb18>]
> page_fault+0x28/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847183] Mem-Info:
> May 21 16:03:19 aaaaaa kernel: [112149.847193] active_anon:18878013
> inactive_anon:30402260 isolated_anon:576
> May 21 16:03:19 aaaaaa kernel: [112149.847193] active_file:4697
> inactive_file:4559 isolated_file:613
> May 21 16:03:19 aaaaaa kernel: [112149.847193] unevictable:0 dirty:0
> writeback:0 unstable:0
> May 21 16:03:19 aaaaaa kernel: [112149.847193] slab_reclaimable:242933
> slab_unreclaimable:49152
> May 21 16:03:19 aaaaaa kernel: [112149.847193] mapped:16529750
> shmem:47921098 pagetables:2963894 bounce:0
> May 21 16:03:19 aaaaaa kernel: [112149.847193] free:532442 free_pcp:223
> free_cma:1
> May 21 16:03:19 aaaaaa kernel: [112149.847197] Node 0 DMA free:15828kB
> min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15988kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB
> mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:68kB
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB
> local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847202] lowmem_reserve[]: 0 3455
> 515685 515685
> May 21 16:03:19 aaaaaa kernel: [112149.847206] Node 0 DMA32 free:2049148kB
> min:436kB low:544kB high:652kB active_anon:0kB inactive_anon:28kB
> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB
> isolated(file):0kB present:3915776kB managed:3540104kB mlocked:0kB
> dirty:0kB writeback:0kB mapped:12kB shmem:8kB slab_reclaimable:8240kB
> slab_unreclaimable:1700kB kernel_stack:32kB pagetables:248kB unstable:0kB
> bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB
> pages_scanned:4 all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847210] lowmem_reserve[]: 0 0
> 512229 512229
> May 21 16:03:19 aaaaaa kernel: [112149.847214] Node 0 Normal free:64792kB
> min:65092kB low:81364kB high:97636kB active_anon:75512308kB
> inactive_anon:121608500kB active_file:18788kB inactive_file:18236kB
> unevictable:0kB isolated(anon):2304kB isolated(file):2452kB
> present:532930560kB managed:524523364kB mlocked:0kB dirty:0kB writeback:0kB
> mapped:66118988kB shmem:191684384kB slab_reclaimable:963492kB
> slab_unreclaimable:194840kB kernel_stack:19120kB pagetables:11855328kB
> unstable:0kB bounce:0kB free_pcp:892kB local_pcp:0kB free_cma:4kB
> writeback_tmp:0kB pages_scanned:312092 all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847218] lowmem_reserve[]: 0 0 0 0
> May 21 16:03:19 aaaaaa kernel: [112149.847221] Node 0 DMA: 1*4kB (U) 2*8kB
> (U) 2*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB
> (U) 1*2048kB (M) 3*4096kB (M) = 15828kB
> May 21 16:03:19 aaaaaa kernel: [112149.847233] Node 0 DMA32: 26*4kB (UM)
> 25*8kB (UM) 33*16kB (UEM) 40*32kB (UEM) 17*64kB (UEM) 4*128kB (EM) 6*256kB
> (UEM) 2*512kB (M) 5*1024kB (UEM) 3*2048kB (UMR) 496*4096kB (M) = 2049152kB
> May 21 16:03:19 aaaaaa kernel: [112149.847246] Node 0 Normal: 16336*4kB
> (UMC) 81*8kB (UM) 11*16kB (UR) 9*32kB (UR) 1*64kB (R) 0*128kB 2*256kB (R)
> 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 68568kB
> May 21 16:03:19 aaaaaa kernel: [112149.847258] Node 0 hugepages_total=0
> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> May 21 16:03:19 aaaaaa kernel: [112149.847259] Node 0
> hugepages_total=153641 hugepages_free=153641 hugepages_surp=0
> hugepages_size=2048kB
> May 21 16:03:19 aaaaaa kernel: [112149.847261] 48948485 total pagecache
> pages
> May 21 16:03:19 aaaaaa kernel: [112149.847262] 1018682 pages in swap cache
> May 21 16:03:19 aaaaaa kernel: [112149.847264] Swap cache stats: add
> 32771424, delete 31752742, find 10338170/12219018
> May 21 16:03:19 aaaaaa kernel: [112149.847265] Free swap = 0kB
> May 21 16:03:19 aaaaaa kernel: [112149.847266] Total swap = 16498684kB
> May 21 16:03:19 aaaaaa kernel: [112149.847267] 134215581 pages RAM
> May 21 16:03:19 aaaaaa kernel: [112149.847268] 0 pages HighMem/MovableOnly
> May 21 16:03:19 aaaaaa kernel: [112149.847269] 2191643 pages reserved
> May 21 16:03:19 aaaaaa kernel: [112149.847270] 4096 pages cma reserved
> May 21 16:03:19 aaaaaa kernel: [112149.847271] 0 pages hwpoisoned
> ......

Oh my god! You know, I have already configured *HugePages* at the OS level and disabled *THP (Transparent Huge Pages)*.

[root_at_aaaaaa ~]# *cat /proc/meminfo | grep Huge*
> AnonHugePages: 0 kB
> *HugePages_Total:   153641*
> *HugePages_Free:     73222*
> *HugePages_Rsvd:     73182*
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> [root_at_aaaaaa ~]#
> [root_at_aaaaaa ~]# *cat /sys/kernel/mm/transparent_hugepage/enabled*
> always madvise *[never]*
> [root_at_aaaaaa ~]#
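
To work out where all that anonymous/shmem memory came from, I plan to cross-check the SGA segments against the HugePages pool and /dev/shm. A rough sketch of the checks (nothing here is specific to my box):

> # Sketch of the memory sanity checks on the primary host.
> ipcs -m                      # shared memory segments (SGA) and their sizes
> grep -i huge /proc/meminfo   # HugePages total / free / reserved, AnonHugePages
> df -h /dev/shm               # memory here is NOT backed by HugePages (AMM / MEMORY_TARGET)
> free -m                      # overall memory and swap usage
> # The alert log prints a "Large Pages Information" section at instance startup,
> # and with USE_LARGE_PAGES=ONLY the instance refuses to start unless the whole
> # SGA fits into the HugePages pool, which makes this kind of drift obvious.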

As for how to avoid the out-of-memory condition: most articles suggest setting *"vm.lower_zone_protection = 250"* in */etc/sysctl.conf* on x86_64 systems, or installing the *hugemem kernel rpm package* on x86 systems.

Could you help me troubleshoot this puzzling issue? Thanks in advance!

Best Regards
Quanwen Zhao

--
http://www.freelists.org/webpage/oracle-l