Re: Why never finished deploying oracle data guard physical standby by rman duplicate?

From: Mladen Gogala <gogala.mladen_at_gmail.com>
Date: Sun, 21 May 2023 12:32:21 -0400
Message-ID: <70f61042-d742-52d0-4ba8-caac2746b1f0_at_gmail.com>



On 5/21/23 09:40, Quanwen Zhao wrote:
> Hello my oracle friends :-),
>
> One of my customers wants to deploy an Oracle Data Guard physical standby
> via RMAN duplicate for my current Oracle 11.2.0.4.0 single-instance
> database. After a series of tedious DG parameter configurations on both
> the primary and the physical standby database, I started using RMAN
> duplicate to push the memory script from the primary to the physical standby.
>
> My current Oracle 11.2.0.4.0 database has about *2.0 TB* of
> data files, and the hardware resources are as below:
> Logical CPUs: 64, Physical Memory: 512 GB, and disk I/O read/write
> speed is probably a bit of a bottleneck.
>
> First, I used the following shell script to deploy it.
>
> cat duplicate_dg.sh
> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
> /home/oracle/duplicate_dg_`date +%Y%m%d_%H%M%S`.log
> run {
> allocate channel p1 type disk;
> allocate channel p2 type disk;
> allocate channel p3 type disk;
> allocate channel p4 type disk;
> allocate channel p5 type disk;
> allocate channel p6 type disk;
> allocate auxiliary channel d1 type disk;
> allocate auxiliary channel d2 type disk;
> allocate auxiliary channel d3 type disk;
> allocate auxiliary channel d4 type disk;
> allocate auxiliary channel d5 type disk;
> allocate auxiliary channel d6 type disk;
> duplicate target database for standby nofilenamecheck from active
> database;
> release channel p6;
> release channel p5;
> release channel p4;
> release channel p3;
> release channel p2;
> release channel p1;
> release channel d6;
> release channel d5;
> release channel d4;
> release channel d3;
> release channel d2;
> release channel d1;
> }
> exit;
> EOF
>
>
> But after *3.5 hours*, the RMAN duplicate log file showed the
> following error:
>
> ......
> executing command: SET NEWNAME
>
> Starting backup at 2023-05-19 23:14:31
> channel p1: starting datafile copy
> input datafile file number=00041
> name=/oracle/oradata/xxxxxx/datafile/data_1.dbf
> channel p2: starting datafile copy
> input datafile file number=00042
> name=/oracle/oradata/xxxxxx/datafile/data_2.dbf
> channel p3: starting datafile copy
> input datafile file number=00043
> name=/oracle/oradata/xxxxxx/datafile/data_3.dbf
> channel p4: starting datafile copy
> input datafile file number=00044
> name=/oracle/oradata/xxxxxx/datafile/data_4.dbf
> channel p5: starting datafile copy
> input datafile file number=00045
> name=/oracle/oradata/xxxxxx/datafile/data_5.dbf
> channel p6: starting datafile copy
> input datafile file number=00046
> name=/oracle/oradata/xxxxxx/datafile/data_6.dbf
> output file name=/oracle/oradata/xxxxxxx/datafile/data_7.dbf
> tag=TAG20230519T231432
> channel p1: datafile copy complete, elapsed time: 00:42:08
> channel p1: starting datafile copy
> input datafile file number=00007
> name=/oracle/oradata/xxxxxx/datafile/data_7.1005681629
> output file name=/oracle/oradata/xxxxxx/datafile/data_9.dbf
> tag=TAG20230519T231432
> channel p5: datafile copy complete, elapsed time: 00:42:07
> channel p5: starting datafile copy
> input datafile file number=00008
> name=/oracle/oradata/xxxxxx/datafile/data_8.1005682331
> output file name=/oracle/oradata/xxxxxx/datafile/data_10.dbf
> tag=TAG20230519T231432
> channel p6: datafile copy complete, elapsed time: 00:42:07
> ......
> *RMAN-03009: failure of backup command on p1 channel at 05/20/2023
> 02:45:38
> ORA-00603: ORACLE server session terminated by fatal error
> ORA-00239: timeout waiting for control file enqueue: held by 'inst
> 1, osid 8262' for more than 900 seconds*
> ......
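>
> (A rough sketch of how one might check which session holds the CF
> (control file) enqueue, assuming SYSDBA access on the primary while
> the copy is running; this is purely a hypothetical diagnostic query,
> not part of my duplicate script:)
>
> SELECT s.sid, p.spid AS osid, s.program, l.lmode, l.request
>   FROM v$lock l
>   JOIN v$session s ON s.sid = l.sid
>   JOIN v$process p ON p.addr = s.paddr
>  WHERE l.type = 'CF';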
>
>
> My initial thought was to reduce the number of RMAN channels and limit
> the speed, so I changed the previous shell script as follows:
>
> cat duplicate_dg.sh
> rman target sys/xxxxxx_at_aa auxiliary sys/xxxxxx_at_bb << EOF >
> /home/oracle/duplicate_dg_`date +%Y%m%d-%H%M%S`.log
> run {
> *allocate channel p1 type disk maxpiecesize 16g maxopenfiles 4
> rate 40M;
> allocate auxiliary channel d1 type disk maxpiecesize 16g
> maxopenfiles 4 rate 40M;*
> duplicate target database for standby nofilenamecheck from active
> database;
> release channel d1;
> release channel p1;
> }
> exit;
> EOF
>
>
> After re-running that shell script, the ORA-00239 error never appeared
> again, but after 8 hours (having copied *1.3 TB* of data files) my
> primary Oracle database crashed outright. I saw the following errors
> in */var/log/messages*:
>
> ......
> May 21 16:03:19 aaaaaa kernel: [112149.846996] *oracle invoked
> oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0*
> May 21 16:03:19 aaaaaa kernel: [112149.847013] oracle cpuset=/
> mems_allowed=0
> May 21 16:03:19 aaaaaa kernel: [112149.847036] CPU: 37 PID: 48996
> Comm: oracle Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
> May 21 16:03:19 aaaaaa kernel: [112149.847038] Hardware name: Xen
> HVM domU, BIOS 4.7.4-1.16 07/13/2018
> May 21 16:03:19 aaaaaa kernel: [112149.847040]  0000000000000000
> ffff88011c157528 ffffffff816c6e40 ffff887dd8bb3800
> May 21 16:03:19 aaaaaa kernel: [112149.847043]  0000000000000000
> ffff88011c157578 ffffffff8118c15e ffff880100000000
> May 21 16:03:19 aaaaaa kernel: [112149.847046]  ffffffff000200da
> ffff880148932a00 ffff880d0e7c3800 ffff880d0e7c42b0
> May 21 16:03:19 aaaaaa kernel: [112149.847048] Call Trace:
> May 21 16:03:19 aaaaaa kernel: [112149.847057]
>  [<ffffffff816c6e40>] dump_stack+0x63/0x83
> May 21 16:03:19 aaaaaa kernel: [112149.847062]
>  [<ffffffff8118c15e>] dump_header+0x8e/0xe0
> May 21 16:03:19 aaaaaa kernel: [112149.847065]
>  [<ffffffff8118c767>] oom_kill_process+0x1d7/0x3c0
> May 21 16:03:19 aaaaaa kernel: [112149.847071]
>  [<ffffffff812a8485>] ? security_capable_noaudit+0x15/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847083]
>  [<ffffffff8108d707>] ? has_capability_noaudit+0x17/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847089]
>  [<ffffffff8118cc18>] __out_of_memory+0x2c8/0x370
> May 21 16:03:19 aaaaaa kernel: [112149.847094]
>  [<ffffffff8118ce09>] out_of_memory+0x69/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847099]
>  [<ffffffff8119209f>] __alloc_pages_slowpath+0x6af/0x760
> May 21 16:03:19 aaaaaa kernel: [112149.847102]
>  [<ffffffff81192401>] __alloc_pages_nodemask+0x2b1/0x2d0
> May 21 16:03:19 aaaaaa kernel: [112149.847105]
>  [<ffffffff810b61c8>] ? sched_clock_cpu+0xa8/0xc0
> May 21 16:03:19 aaaaaa kernel: [112149.847109]
>  [<ffffffff811de104>] alloc_pages_vma+0xd4/0x230
> May 21 16:03:19 aaaaaa kernel: [112149.847113]
>  [<ffffffff811cceed>] read_swap_cache_async+0xfd/0x160
> May 21 16:03:19 aaaaaa kernel: [112149.847115]
>  [<ffffffff811cd05e>] swapin_readahead+0x10e/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847118]
>  [<ffffffff811a148e>] shmem_swapin+0x5e/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847121]
>  [<ffffffff816c6f3d>] ? io_schedule_timeout+0xdd/0x110
> May 21 16:03:19 aaaaaa kernel: [112149.847124]
>  [<ffffffff811fad5e>] ? swap_cgroup_record+0x4e/0x60
> May 21 16:03:19 aaaaaa kernel: [112149.847127]
>  [<ffffffff8131c703>] ? radix_tree_lookup_slot+0x13/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847129]
>  [<ffffffff81187d6e>] ? find_get_entry+0x1e/0xa0
> May 21 16:03:19 aaaaaa kernel: [112149.847132]
>  [<ffffffff81189788>] ? pagecache_get_page+0x38/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847135]
>  [<ffffffff811a47a0>] shmem_getpage_gfp+0x540/0x820
> May 21 16:03:19 aaaaaa kernel: [112149.847137]
>  [<ffffffff811a54ba>] shmem_fault+0x6a/0x1c0
> May 21 16:03:19 aaaaaa kernel: [112149.847141]
>  [<ffffffff8129a0ae>] shm_fault+0x1e/0x20
> May 21 16:03:19 aaaaaa kernel: [112149.847144]
>  [<ffffffff811b877d>] __do_fault+0x3d/0xa0
> May 21 16:03:19 aaaaaa kernel: [112149.847149]
>  [<ffffffff810ebfd7>] ? current_fs_time+0x27/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847153]
>  [<ffffffff810c8783>] ? __wake_up+0x53/0x70
> May 21 16:03:19 aaaaaa kernel: [112149.847155]
>  [<ffffffff811b89c5>] do_read_fault+0x1e5/0x300
> May 21 16:03:19 aaaaaa kernel: [112149.847157]
>  [<ffffffff811b8cfc>] ? do_shared_fault+0x19c/0x1d0
> May 21 16:03:19 aaaaaa kernel: [112149.847159]
>  [<ffffffff811bc703>] handle_pte_fault+0x1e3/0x230
> May 21 16:03:19 aaaaaa kernel: [112149.847163]
>  [<ffffffff810738a0>] ? pte_alloc_one+0x30/0x50
> May 21 16:03:19 aaaaaa kernel: [112149.847165]
>  [<ffffffff811b7e27>] ? __pte_alloc+0xd7/0x190
> May 21 16:03:19 aaaaaa kernel: [112149.847167]
>  [<ffffffff811bc90c>] __handle_mm_fault+0x1bc/0x330
> May 21 16:03:19 aaaaaa kernel: [112149.847169]
>  [<ffffffff811bcb32>] handle_mm_fault+0xb2/0x1a0
> May 21 16:03:19 aaaaaa kernel: [112149.847171]
>  [<ffffffff8106ddf3>] ? __do_page_fault+0xe3/0x480
> May 21 16:03:19 aaaaaa kernel: [112149.847173]
>  [<ffffffff8106de7c>] __do_page_fault+0x16c/0x480
> May 21 16:03:19 aaaaaa kernel: [112149.847175]
>  [<ffffffff8106e337>] do_page_fault+0x37/0x90
> May 21 16:03:19 aaaaaa kernel: [112149.847177]
>  [<ffffffff816c7b9a>] ? schedule_user+0x1a/0x60
> May 21 16:03:19 aaaaaa kernel: [112149.847181]
>  [<ffffffff816cdb18>] page_fault+0x28/0x30
> May 21 16:03:19 aaaaaa kernel: [112149.847183] Mem-Info:
> May 21 16:03:19 aaaaaa kernel: [112149.847193]
> active_anon:18878013 inactive_anon:30402260 isolated_anon:576
> May 21 16:03:19 aaaaaa kernel: [112149.847193]  active_file:4697
> inactive_file:4559 isolated_file:613
> May 21 16:03:19 aaaaaa kernel: [112149.847193]  unevictable:0
> dirty:0 writeback:0 unstable:0
> May 21 16:03:19 aaaaaa kernel: [112149.847193]
>  slab_reclaimable:242933 slab_unreclaimable:49152
> May 21 16:03:19 aaaaaa kernel: [112149.847193]  mapped:16529750
> shmem:47921098 pagetables:2963894 bounce:0
> May 21 16:03:19 aaaaaa kernel: [112149.847193]  free:532442
> free_pcp:223 free_cma:1
> May 21 16:03:19 aaaaaa kernel: [112149.847197] Node 0 DMA
> free:15828kB min:0kB low:0kB high:0kB active_anon:0kB
> inactive_anon:0kB active_file:0kB inactive_file:0kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:15988kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> slab_unreclaimable:68kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847202] lowmem_reserve[]: 0
> 3455 515685 515685
> May 21 16:03:19 aaaaaa kernel: [112149.847206] Node 0 DMA32
> free:2049148kB min:436kB low:544kB high:652kB active_anon:0kB
> inactive_anon:28kB active_file:0kB inactive_file:0kB
> unevictable:0kB isolated(anon):0kB isolated(file):0kB
> present:3915776kB managed:3540104kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:12kB shmem:8kB slab_reclaimable:8240kB
> slab_unreclaimable:1700kB kernel_stack:32kB pagetables:248kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:4 all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847210] lowmem_reserve[]: 0
> 0 512229 512229
> May 21 16:03:19 aaaaaa kernel: [112149.847214] Node 0 Normal
> free:64792kB min:65092kB low:81364kB high:97636kB
> active_anon:75512308kB inactive_anon:121608500kB
> active_file:18788kB inactive_file:18236kB unevictable:0kB
> isolated(anon):2304kB isolated(file):2452kB present:532930560kB
> managed:524523364kB mlocked:0kB dirty:0kB writeback:0kB
> mapped:66118988kB shmem:191684384kB slab_reclaimable:963492kB
> slab_unreclaimable:194840kB kernel_stack:19120kB
> pagetables:11855328kB unstable:0kB bounce:0kB free_pcp:892kB
> local_pcp:0kB free_cma:4kB writeback_tmp:0kB pages_scanned:312092
> all_unreclaimable? yes
> May 21 16:03:19 aaaaaa kernel: [112149.847218] lowmem_reserve[]: 0
> 0 0 0
> May 21 16:03:19 aaaaaa kernel: [112149.847221] Node 0 DMA: 1*4kB
> (U) 2*8kB (U) 2*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U)
> 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15828kB
> May 21 16:03:19 aaaaaa kernel: [112149.847233] Node 0 DMA32:
> 26*4kB (UM) 25*8kB (UM) 33*16kB (UEM) 40*32kB (UEM) 17*64kB (UEM)
> 4*128kB (EM) 6*256kB (UEM) 2*512kB (M) 5*1024kB (UEM) 3*2048kB
> (UMR) 496*4096kB (M) = 2049152kB
> May 21 16:03:19 aaaaaa kernel: [112149.847246] Node 0 Normal:
> 16336*4kB (UMC) 81*8kB (UM) 11*16kB (UR) 9*32kB (UR) 1*64kB (R)
> 0*128kB 2*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB =
> 68568kB
> May 21 16:03:19 aaaaaa kernel: [112149.847258] Node 0
> hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=1048576kB
> May 21 16:03:19 aaaaaa kernel: [112149.847259] Node 0
> hugepages_total=153641 hugepages_free=153641 hugepages_surp=0
> hugepages_size=2048kB
> May 21 16:03:19 aaaaaa kernel: [112149.847261] 48948485 total
> pagecache pages
> May 21 16:03:19 aaaaaa kernel: [112149.847262] 1018682 pages in
> swap cache
> May 21 16:03:19 aaaaaa kernel: [112149.847264] Swap cache stats:
> add 32771424, delete 31752742, find 10338170/12219018
> May 21 16:03:19 aaaaaa kernel: [112149.847265] Free swap  = 0kB
> May 21 16:03:19 aaaaaa kernel: [112149.847266] Total swap = 16498684kB
> May 21 16:03:19 aaaaaa kernel: [112149.847267] 134215581 pages RAM
> May 21 16:03:19 aaaaaa kernel: [112149.847268] 0 pages
> HighMem/MovableOnly
> May 21 16:03:19 aaaaaa kernel: [112149.847269] 2191643 pages reserved
> May 21 16:03:19 aaaaaa kernel: [112149.847270] 4096 pages cma reserved
> May 21 16:03:19 aaaaaa kernel: [112149.847271] 0 pages hwpoisoned
> ......
>
>
> Oh, my god! You know, I've already configured *HugePages* on the OS and
> disabled *THP (Transparent Huge Pages)*.
>
> [root_at_aaaaaa ~]# *cat /proc/meminfo | grep Huge*
> AnonHugePages:         0 kB
> *HugePages_Total:   153641
> HugePages_Free:    73222
> HugePages_Rsvd:    73182*
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> [root_at_aaaaaa ~]#
> [root_at_aaaaaa ~]# *cat /sys/kernel/mm/transparent_hugepage/enabled*
> always madvise *[never]*
> [root_at_aaaaaa ~]#
>
>
> As for how to avoid the out-of-memory condition: most articles suggest
> setting *"vm.lower_zone_protection = 250"* in */etc/sysctl.conf* on x86_64
> systems, or installing the *hugemem kernel rpm package* on x86 systems.
>
> Could you help me troubleshoot this puzzling issue? Thanks
> in advance!
>
> Best Regards
> Quanwen Zhao

You can turn off the out-of-memory murderer by setting vm.overcommit_memory = 1 (echo 1 > /proc/sys/vm/overcommit_memory). If you have enough memory and swap, some pages will simply be swapped out; if you don't, the machine will crash. In that case you'll have to decrease your SGA and leave more memory to be managed by Linux.
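
For example (a minimal sketch, assuming root access; the exact sysctl.conf layout may differ on your distribution):

# apply immediately
sysctl -w vm.overcommit_memory=1       # equivalent to: echo 1 > /proc/sys/vm/overcommit_memory

# make it persistent across reboots
echo "vm.overcommit_memory = 1" >> /etc/sysctl.conf
sysctl -p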

Regards

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com


--
http://www.freelists.org/webpage/oracle-l
Received on Sun May 21 2023 - 18:32:21 CEST