There have been some big new security items added to 12cR1, such as SHA2 in DBMS_CRYPTO, code-based security in PL/SQL, Data Redaction, unified audit, and even privilege analysis, but also, as I hinted in some previous blogs, there are....[Read More]
Posted by Pete On 31/07/13 At 11:11 AM
Replicating data can be a time-consuming process to set up. Fortunately, Oracle GoldenGate provides a few tools to help ease the complexity of setup. One of these tools is the "macro". Macros are used to simplify and automate the work associated with setting up and replicating data. So what exactly is a macro? Oracle defines a macro as:
A macro is a built-in automation tool that enables you to call a stored set of processing steps from within the Oracle GoldenGate parameter file.
In a nutshell, a macro is a stored set of commands that are used on a frequent basis, consisting of parameters for anything from simple to complex series of substitutions, calculations or conversions. Macros may be written inline in the parameter file or stored in a macro library.
What this post will show you is how to add a macro to a library, have it accept parameters, and then pass those parameters to it from within the parameter file during replication.
The first thing that needs to be done is to set up a standard directory to contain all the macro files. A directory called "dirmac" needs to be created under $OGG_HOME.
Note: The "dirmac" directory is something that you will need to create yourself with the following command: "mkdir -p $OGG_HOME/dirmac". A macro directory can be created anywhere you like; I personally try to keep all OGG items together.
In the macro directory, create a file to use as the macro library. Ideally, you should have a file for each type of process running in that $OGG_HOME. Once the file is created, edit the library file and add the desired macros.
> mkdir -p $OGG_HOME/dirmac
> cd $OGG_HOME/dirmac
> touch <library_name>.mac
> vi <library_name>.mac
When the macro library file is open for editing, add the desired macros. Remember, a macro library can house more than one macro. In the example below, you will see two examples. The first example sets tokens that can be called. The second example defines a macro that maps tokens and header information to a specific table.
-- Example 1 --

MACRO #src_csn_ts
BEGIN
SRC_CSN_TS = @GETENV('GGHEADER','COMMITTIMESTAMP')
END;

MACRO #src_icnt
BEGIN
ICNT = @GETENV('STATS', 'INSERT')
END;

MACRO #src_ucnt
BEGIN
UCNT = @GETENV('STATS', 'UPDATE')
END;

MACRO #src_dcnt
BEGIN
DCNT = @GETENV('STATS', 'DELETE')
END;

MACRO #src_dmlcnt
BEGIN
DMLCNT = @GETENV('STATS', 'DML')
END;

-- Example 2 --

MACRO #hb_mappings
PARAMS (#src_schema)
BEGIN
MAP #src_schema.RANDOM_VALUES, target SCOTT.GG_REP_OP_STATUS,
INSERTMISSINGUPDATES
COLMAP (
SRC_DB_SCHEMA=@token('SRC_SCHEMA'),
GG_REPLICAT_NAME=@GETENV('GGENVIRONMENT','GROUPNAME'),
TGT_LAST_UPDATE_DT=@DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')),
SRC_CSN_TS=@token('SRC_CSN_TS'),
ICNT=@token('ICNT'),
UCNT=@token('UCNT'),
DCNT=@token('DCNT'),
DMLCNT=@token('DMLCNT')
);
END;
Notice the PARAMS statement in example 2. When using macros, this can get a bit confusing, since defining a macro uses the hash mark (#) and parameters in the PARAMS statement use the hash mark (#) as well. Also notice that the parameter #src_schema is used in the MAP statement inside the macro. This is how the value for #src_schema is passed into the macro from the parameter file.
Now, let's take a look at a parameter file.
In my test environment, I have the following processes running:
GGSCI (oel.acme.com) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
JAGENT      STOPPED
EXTRACT     RUNNING     EXT         00:00:09      00:00:09
EXTRACT     RUNNING     PMP         00:00:00      00:00:07
REPLICAT    RUNNING     REP         00:00:00      00:00:04
Taking a look at the extract parameter file (capture side): in order to use the macros from example 1, add an INCLUDE statement that references the macro library to the parameter file. Then, in the TABLE statement, the macros for the tokens can be referenced inside the TOKENS clause.
-- Verifies parameter file syntax. COMMENT OUT AFTER TESTING.
--CHECKPARAMS
--Specifies the name of the extract process
EXTRACT EXT
--Set Oracle Environment Variables
SETENV (ORACLE_HOME="/oracle/app/product/11.2.0.4/dbhome_1")
SETENV (ORACLE_SID="bc11g")
--Oracle Login
USERID ggate, PASSWORD ggate
--Warns for a long running transaction
WARNLONGTRANS 1h, CHECKINTERVAL 30m
--Trace process info
--TRACE ./dirrpt/trace_ext.trc
--Specifies the location of the remote trail file on target machine
EXTTRAIL ./dirdat/lt
--Ignore transactions for golden gate user
TRANLOGOPTIONS EXCLUDEUSER GGATE
--Resolves the TABLES to be replicated ON START-UP
WILDCARDRESOLVE IMMEDIATE
INCLUDE ./dirmac/ops_info.mac
--Table Mappings
TABLE SCOTT.RANDOM_VALUES, TOKENS(#src_csn_ts(), #src_icnt(), #src_ucnt(), #src_dcnt(), #src_dmlcnt());
Once the extract parameter file is updated, the extract needs to be restarted. Upon restart of the extract, keep an eye out and make sure the process doesn't abend. The VIEW REPORT command can be used to check the report file during startup. Once the extract starts, you can see how the macro is used and expanded in the parameter file. On the extract side, the needed info will be captured and placed in the associated trail files.
In the middle, the pump really doesn’t need any changes or restarting.
Now on the replicat (apply) side: in order to use the macro defined in example 2 above, the replicat parameter file needs to be edited to include the macro library and the statements that call the macro. The next code block shows the contents of my replicat parameter file.
--Specifies the name of the replicat load process.
REPLICAT REP
-- Verifies parameter file syntax. COMMENT OUT AFTER TESTING.
--CHECKPARAMS
SETENV (ORACLE_HOME="/oracle/app/product/11.2.0.4/dbhome_1")
SETENV (ORACLE_SID="bc11g")
--Oracle login.
USERID ggate, PASSWORD ggate
--Suppress triggers - enable for 11.2.0.2 or later
--DBOPTIONS SUPPRESSTRIGGERS
ALLOWDUPTARGETMAP
--The source ddl and target ddl are identical
ASSUMETARGETDEFS
--Tracing info
--TRACE ./dirrpt/trace_rep.trc
--Specifies name and location of a discard file.
DISCARDFILE ./dirrpt/REP.dsc, append, megabytes 200
--Resolves the TARGETs to be replicated during process start-up
--WILDCARDRESOLVE IMMEDIATE
--Specify error handling rules:
REPERROR(default, discard)
REPERROR(default2, discard)
--Table Mappings
INCLUDE ./dirmac/ops_info.mac
map SCOTT.RANDOM_VALUES, target SCOTT.RANDOM_VALUES_HIST;
#hb_mappings(SCOTT);
You will notice that I have included the INCLUDE statement to call the macro library. Then the macro from example 2, which does the table mapping for the desired information, is invoked using #hb_mappings() (last line of the parameter file example). Passing the schema name is as simple as placing it between the parentheses. As the example above shows, I'm passing SCOTT as the schema I want to use.
Upon restart of the replicat, by looking at the report (VIEW REPORT), I can see where the macro library is read and how the macro is translated into a map statement for the replicat to use.
-- Report Output (summarized) --

map SCOTT.RANDOM_VALUES, target SCOTT.RANDOM_VALUES_HIST;
#hb_mappings(SCOTT);
MAP SCOTT.RANDOM_VALUES, target SCOTT.GG_REP_OP_STATUS,
INSERTMISSINGUPDATES
COLMAP (
SRC_DB_SCHEMA=@token('SRC_SCHEMA'),
GG_REPLICAT_NAME=@GETENV('GGENVIRONMENT','GROUPNAME'),
TGT_LAST_UPDATE_DT=@DATE('YYYY-MM-DD HH:MI:SS.FFFFFF','JTS',@GETENV('JULIANTIMESTAMP')),
SRC_CSN_TS=@token('SRC_CSN_TS'),
ICNT=@token('ICNT'),
UCNT=@token('UCNT'),
DCNT=@token('DCNT'),
DMLCNT=@token('DMLCNT')
);
After the replicat has restarted, the table in the mapping statement (SCOTT.GG_REP_OP_STATUS) can be checked to see if any data was inserted. Image 1 below shows the output of the data I requested to be replicated using the macro.
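If you want to check the table yourself, a quick query along these lines works too (a sketch; the column list simply mirrors the COLMAP used in example 2):

-- Check the operation-status rows populated by the #hb_mappings macro
select src_db_schema, gg_replicat_name, src_csn_ts,
       icnt, ucnt, dcnt, dmlcnt, tgt_last_update_dt
from   scott.gg_rep_op_status
order  by tgt_last_update_dt desc;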
This should have shown you a way to use macros within your replication environment.
Filed under: Golden Gate
It's just a patchset. The delivery that is there to stabilize a release with all the bug fixes. But it comes with a lot of new features as well. And not only the one that has been advertised as the future of the database. It's a huge release.
Let's have a look at what's new.
First, it seems that it will be the only patchset for 12.1.
Then, there is that In-Memory option we have awaited for a while. There have been some demos by Larry Ellison on Exadata and even on the Oracle SPARC M6. Of course, if you have 32 TB of memory, we can understand the need for an In-Memory optimized storage. For a more real-life usage of that option, stay tuned on our blog. We investigate the features in the context of our customers' concerns, to fit their needs. For example, In-Memory addresses cases where some customers use Active Data Guard to offload reporting/real-time analytics to another server. But unfortunately In-Memory is not populated on a physical standby. We probably have to wait for 12.2 for that.
In-Memory is an option, so available only in Enterprise Edition.
There are other new features related to large memory. There is a part of the buffer cache dedicated to big tables (you just set the percentage) to be cached for In-Memory Parallel Query. And there is also a mode where the whole database is kept in the buffer cache. About performance and Parallel Query, a new transformation has been introduced to optimize the group by operation when joining a fact table to its dimensions.
The second new feature is the range-partitioned hash cluster. Oracle CLUSTER segments are a very old feature but not widely used. A hash cluster is the fastest way to access a row because the key can be directly transformed to a rowid. Unfortunately maintenance is not easy, especially when the volume increases. And we have partitioning, which is the way to ease maintenance of growing tables, but until today we couldn't partition a hash cluster. I mean, not in a supported way, because Oracle uses it on SPARC for the TPC benchmarks - applying a specific patch (10374168) for it.
Well, the good news is that we can finally partition hash clusters with the simple syntax:
create cluster democ1 (sample_time timestamp,sample_id number)
hashkeys 3600 hash is sample_id size 8192
partition by range (sample_time) (
partition P12 values less than( timestamp'2014-04-26 12:00:00' )
);
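The cluster itself only allocates the structure; a table still has to be created in it. A minimal sketch (the table name and the extra column are made up for illustration):

create table samples (
  sample_time  timestamp,
  sample_id    number,
  sample_value number
)
cluster democ1 (sample_time, sample_id);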
Another nice feature is Attribute Clustering. Lots of other RDBMSs have the ability to arrange rows, but Oracle puts any insert anywhere in a heap table, depending only on where some free space is left. The alternative is an IOT, of course. But it can be good to try to cluster rows on one or several columns. It's better for index access, it's better for cache efficiency, it's better for storage indexes (or in-memory min/max), for ILM compression, etc. We can finally do it, and I'll blog soon about that.
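As a teaser, the basic 12.1.0.2 syntax looks like this (a minimal sketch; the table and columns are made up, and there are more options such as interleaved ordering and join clustering):

create table sales_ac (
  sale_id     number,
  customer_id number,
  amount      number
)
clustering by linear order (customer_id);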
Attribute Clustering is not an option, but it is available only in Enterprise Edition.
I think those two features are my favorite ones, because the best optimization we can do, without refactoring the application design, is to place data in the way it will be retrieved.
The trend today is to store unstructured data as JSON. XML was nice, but it's verbose. JSON is easier to read, and even PostgreSQL can store JSON in its latest version. So Oracle has it in 12.1.0.2: you can store and index it. Once again, stay tuned on this blog to see how it works.
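To give an idea (a sketch with made-up names), a column is declared as JSON with a check constraint and can then be queried with the new JSON functions:

create table orders_json (
  id  number primary key,
  doc varchar2(4000) check (doc is json)
);

insert into orders_json values (1, '{"customer":"dbi","amount":42}');

select json_value(doc, '$.customer') as customer from orders_json;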
Something important was missing in Oracle SQL. How do you set up a read-only user? You grant only the select privilege? But that's too much, because with the select privilege we can lock a table (with LOCK or SELECT FOR UPDATE). So we now have a READ privilege to prevent that. That's my favorite new feature for developers.
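For example (the table and grantee names are illustrative):

-- 12.1.0.2: READ allows queries but not LOCK TABLE or SELECT ... FOR UPDATE
grant read on scott.emp to app_reader;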
Then there are a few improvements on multitenant, such as the possibility to save the state of a pluggable database so that it can be automatically opened when the CDB starts up. We already addressed that for 12.1.0.1 in our Database Management Kit. An undocumented parameter, _multiple_char_set_cdb, lets us imagine that we will be able to have different charactersets for the PDBs - probably in the future. Currently it's set to false.
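The new syntax is straightforward (the PDB name is illustrative):

alter pluggable database pdb1 open;
alter pluggable database pdb1 save state;
-- from now on, PDB1 is opened automatically when the CDB starts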
And once again, as beta testing partners, we have put the pressure on to have a fix for what we consider a serious availability bug. The behaviour in the 12.1.0.2 beta was even worse regarding CDB availability, and I finally had a bug opened (Bug 19001390 - PDB SYSTEM TABLESPACE MEDIA FAILURE CAUSES THE WHOLE CDB TO CRASH) that should be fixed in 12.1.
About fixes, some restrictions are now gone: we can finally use ILM with multitenant, and we can have supplemental logging while using an online partition move. And you can have Flashback Data Archive in multitenant as well.
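For instance, an online partition move (12c syntax; table and partition names are made up) can now be used even with supplemental logging enabled:

alter table sales move partition sales_2014_q1 online;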
All that is good news, but remember: even if it's only the 4th digit that is increased in the version number, it's a brand new version with lots of new features. So, when do you plan to upgrade? 11g is supported until January 2015. Extended support is free until January 2016 given that you are on the terminal patchset (11.2.0.4). So either you don't want to be on the latest release, and you will have to upgrade to 11.2.0.4 before the end of the year, waiting for 12.2 maybe in 2016. Or you want those new features and will probably go to 12.1.0.2 for 2015.
Talking about upgrades, there's bad news. We thought that multitenancy could accelerate upgrade time. Because the data dictionary is shared, you just have to plug a PDB into a newer-version CDB and it's upgraded. And we show that in our 12c new features workshop by applying a PSU. But we have tested the upgrade to 12.1.0.2 in the same way, and it's not that simple. Plugging is quick when you have only new patches that did not change the dictionary. It's still true for PSUs when the dictionary changes are limited to the root container. But when you upgrade to 12.1.0.2 you have to synchronize all the PDB dictionaries (all that magic behind object links and metadata links) and that takes time. It takes the same time as upgrading a non-CDB. Conclusion: you don't save time when you do it by plug/unplug.
But I have good news as well, because I've tested a one-minute downtime migration from 11.2.0.4 to 12.1.0.2. Dbvisit replicate, the affordable replication solution, supports multitenant in its latest version, both as source and target. If your application is compatible (which is easy to check with the 30-day trial) then it's a good way to migrate without stress and with minimal downtime. It's available for Standard Edition as well, but currently the 12.1.0.2 download can install only an Enterprise Edition.
I am still here. Just wanted to let you know that I am still in the business of working with Discoverer even though Oracle recently announced that it would be de-supported. If you need help just get in touch.
Also, you may not be aware, but we have updated our Discoverer Handbook with the latest 11g version. You can find it on Amazon.
Chris Newfield has an excellent post at Remaking the University about the University of California’s budget situation and how it relates to the recent Moody’s negative outlook on higher education finances. The whole article is worth reading, but one section jumped off the page for me [emphasis added].
The sadder example of ongoing debt is the request for “external financing for the UCPath project.” UC Path was UCOP’s flagship solution to UC inefficiencies that were allegedly wasting taxpayers’ money–in other words, new enterprise software for the systemwide consolidation of payroll and human resources functions. This is boring, important back office stuff, hardly good material for a political campaign to show the state “UC means business,” but that’s what it became. Rather than funding each campus’s decades-old effort to upgrade its systems on its own, UCOP sought centralization, which predictably introduced new levels of cost, complexity, and inefficiency, since centralization is often not actually efficient.
I had heard nothing good about UC Path from people trying to implement it on campuses, and have tried to ignore it, but this week it has resurfaced as a problem at the Regental level. The project timeline has grown from 48 to 72 months, and its costs are said to be $220 million (it had spent $131 million by May 2014). Worse, the repayment schedule has mushroomed from seven to twenty years. Annual payments are to be something like $25 million. Campuses are to be taxed to pay for 2015-era systems until 2035, which is like taking out a twenty year mortgage to pay for your refrigerator, except that your fridge will be working better in 2035 than next year’s PeopleSoft product. Since the concurrent budget document notes efficiency savings of $30 million per year (top of page 4), UCOP may be spending $220 million to save a net $5 million per year over a couple of decades–and going into debt to do it. In the end, an efficiency measure has turned into a literal liability.
What the hell – a $220 million project to save money? How did this project get in this much trouble?
The UCPath project concept originated in 2009 with the project announcement coming in late 2011. The goal is to replace the Payroll Personnel System (PPS) that runs separately for each of the 11 UC locations with Oracle’s PeopleSoft payroll and HR systems. PPS is over 30 years old, and there are major risk issues with such an old system as well as a host of inefficient processes. The original project plans were based on a $170 million budget [1] with the first wave of go-live for the Office of the President and 3 campuses scheduled for early 2013. All campuses would be live on the new system by late 2014. [2]
In a presentation to the Board of Regents in January 2012:
Over the same period, cost reductions are expected to be approximately $750 million from technology efficiency gains, process standardization and consolidation of transactional activities into a UC-wide shared services center. Overall, the project has a net present value of approximately $230 million (at a nine percent discount rate) with breakeven in year 5.
Subsequent promises were made in March of 2012:
“We think this project is likely to pay for itself within five years, and UC could be accruing over $100 million in annual savings by the eighth year,” said Peter Taylor, UC’s chief financial officer. “We also expect to deliver HR and payroll services with increased efficiency, accuracy and quality.”
At the Board of Regents’ meeting last week, the project team gave the first update to the regents since January 2012 (itself a troubling sign). See this Sharestream video from 2:56:10 – 3:22:40.
By Fall 2013 the project was in trouble, and UC leadership brought in new leadership for the project: Mark Cianca as Deputy CIO and Sabu Varghese as Program Director. Their first act was to do a health check on the project, and the results were not pretty (as described in last week’s Board of Regents’ meeting).
- The project team and implementation partner (Oracle) had treated the project as a software replacement rather than a fundamental business transformation initiative.
- The individual campuses had not been consulted on changes in business processes, and in fact they had not even been asked to sign off on future state business processes that each campus would have to run to stay in operation.
- The new project team had to go through more than 100 future state processes with campuses and get agreement on how to proceed.
The result, as described by UC President Janet Napolitano at last week’s meeting, was the team having to “reboot the entire project”.
Based on the reboot, the current plan is $220 million with first wave complete by February 2016 and all campuses live by mid 2017. That’s $50 million over budget and 24 months over schedule.
But the planning is not complete. They are working up their “final” replan of budget and timeline, which they will present in January 2015.
How solid is the current estimate? The implementation schedule is listed as the highest risk, even with the delays.
The project financing has changed so much that UC is now facing the need to use external financing over a much longer term, as described in the material for last week’s board meeting.
Therefore, this item seeks approval to refinance the UCPath loan out of CapEquip and into external financing to achieve the financing customization required. As indicated above, the original repayment plan based on the $220.5 million budget was expected to have been repaid with annual debt service of $25 million. This would have resulted in a 12-year loan term once principal was to be repaid. In January 2015, UCPath project leadership plans to present a revised project timeline, a revised project budget and a revised estimated loan repayment schedule. Project leadership will work with the campus budget officers (and campus budget department staff) to develop: (1) an appropriate campus cost allocation strategy; (2) an estimated repayment schedule that will reflect commencement of principal repayments in conjunction with the final campus deployment (estimated to be early 2017); and (3) an estimated 15-20 year loan repayment period.

Notes
- The new project team seems quite credible, and for the most part they addressed the right points during the briefing. Kudos to UC for making this change in leadership.
- This is a major project turnaround (or reboot, in Napolitano’s words), but I’m not sure that UC had communicated the significance of the project changes to system campuses (and certainly not to the media).
- I would view the current plan of $220 million and Q1 2017 full deployment as a best-case situation – the team told the regents that they were going to update the plan, and ERP projects almost never come in earlier than planned.
- The actual amount is much higher than $220 million based on this footnote: “The $10 million in tenant improvements approved for the UCPath Center Riverside site as well as the $17.4 million purchase of the facility (UCPath is currently projected to use no more than 50 percent of the building) are not included in the figures above.”
- How do you go 2.5 years between updates from what is now a quarter billion dollar project?
- What about the current estimate of benefits – is it $30 million per year as Chris described or closer to $100 million per year? One big concern I have is that the information on project benefits was not updated or presented to the regents, nor did the regents ask about it. While I question the $25 million financing and $30 million benefits numbers, I think Chris got it exactly right by noting how UC administration is failing to ask hard questions:
Moving forward, I’m afraid that officials are going to have to get much better at admitting mistakes like UCPath, and then actually undoing them. I couldn’t listen to the recording of the UCPath conversation, but Cloudminder made it sound like a lot of restrained finger-pointing with no solution in sight. Did anyone say, “well, this seemed like a good idea at the time, but it’s not. Let’s just cancel it, figure out where we went wrong, and come up with something better”?
It is possible that continuing with the rebooted project is the right answer, but UC is not even asking the question. Failing to ask whether 15-20 year financing of a new ERP makes sense seems like a major oversight. Won’t this lock UC into an Oracle system that is already antiquated for another two decades or more? It seems stunning to me that UC is planning to commit to $220 million of external financing without asking some basic questions.
[1] One regent last week stated the original request was actually $156 million.
[2] All public projects should fear the Wayback Machine for checking old web pages.
The post University of California’s $220 million payroll project reboot appeared first on e-Literate.
So there I was, troubleshooting a GoldenGate issue, puzzled as to why GoldenGate transactions were not seen from the Oracle database.
I had the transaction XID correct; however, I was filtering for ACTIVE transactions on the Oracle side, which was causing the issue.
Please allow me to share a test case so that you don’t get stumped like I did.
Identify the current log sequence and update a table:
ARROW:(SOE@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             196

ARROW:(SOE@san):PRIMARY> update INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where PRODUCT_ID=171 and WAREHOUSE_ID=560;

1 row updated.

ARROW:(SOE@san):PRIMARY>
From GoldenGate, find transactions that have been open for at least 10 minutes:
$ ./ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.21 18343248 OGGCORE_11.2.1.0.0OGGBP_PLATFORMS_140404.1029_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 4 2014 15:18:36

Copyright (C) 1995, 2014, Oracle and/or its affiliates. All rights reserved.

GGSCI (arrow.localdomain) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     ESAN        00:00:00      00:00:05
EXTRACT     STOPPED     PSAN_LAS    00:00:00      68:02:14
REPLICAT    STOPPED     RLAS_SAN    00:00:00      68:02:12

GGSCI (arrow.localdomain) 2> send esan, status

Sending STATUS request to EXTRACT ESAN ...

EXTRACT ESAN (PID 2556)
Current status: Recovery complete: At EOF

Current read position:
Redo thread #: 1
Sequence #: 196
RBA: 5861376
Timestamp: 2014-07-21 10:52:59.000000
SCN: 0.1653210
Current write position:
Sequence #: 7
RBA: 1130
Timestamp: 2014-07-21 10:52:52.621948
Extract Trail: /u01/app/ggs01/dirdat/ss

GGSCI (arrow.localdomain) 3> send esan, showtrans duration 10m

Sending showtrans request to EXTRACT ESAN ...

Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 196, RBA 4955152

------------------------------------------------------------
XID:          3.29.673
Items:        1
Extract:      ESAN
Redo Thread:  1
Start Time:   2014-07-21:10:41:41
SCN:          0.1652053 (1652053)
Redo Seq:     196
Redo RBA:     4955152
Status:       Running

GGSCI (arrow.localdomain) 4>
Note that Redo Seq: 196 matches the sequence that was current when the update was performed in the Oracle database.
Also, note XID: 3.29.673
Let’s find the transaction in the database and notice that the XID matches between GoldenGate and the Oracle database.
ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS     SID  SERIAL# USERNAME  STATUS    SCHEMANAME  SQLID          CHILD
-------------------- ---------------- -------- ----- -------- --------- --------- ----------- -------------- -----
07/21/14 10:41:39    3.29.673         INACTIVE   105        9 SOE       INACTIVE  SOE         6cmmk52wfnr7r      0

ARROW:(SYS@san):PRIMARY> @xplan.sql
Enter value for sqlid: 6cmmk52wfnr7r
Enter value for child: 0

SQL_ID  6cmmk52wfnr7r, child number 0
-------------------------------------
update INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where
PRODUCT_ID=171 and WAREHOUSE_ID=560

Plan hash value: 2141863993

-----------------------------------------------------------------------------------
| Id  | Operation          | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT   |              |       |       |     3 (100)|          |
|   1 |  UPDATE            | INVENTORIES  |       |       |            |          |
|*  2 |   INDEX UNIQUE SCAN| INVENTORY_PK |     1 |    14 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PRODUCT_ID"=171 AND "WAREHOUSE_ID"=560)

20 rows selected.

ARROW:(SYS@san):PRIMARY>
For fun, I switched the logfile a few times and performed another update.
ARROW:(MDINH@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             196

ARROW:(MDINH@san):PRIMARY> alter system switch logfile;

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> /

System altered.

ARROW:(MDINH@san):PRIMARY> select max(sequence#)+1 from v$log_history;

MAX(SEQUENCE#)+1
----------------
             200

ARROW:(MDINH@san):PRIMARY> update SOE.INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where PRODUCT_ID=170;

883 rows updated.

ARROW:(MDINH@san):PRIMARY>
Check GoldenGate transactions to find 2 open transactions, one from Redo Seq: 196 and one from Redo Seq: 200
GGSCI (arrow.localdomain) 1> send esan, showtrans

Sending SHOWTRANS request to EXTRACT ESAN ...

Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 196, RBA 4955152

------------------------------------------------------------
XID:          3.29.673
Items:        1
Extract:      ESAN
Redo Thread:  1
Start Time:   2014-07-21:10:41:41
SCN:          0.1652053 (1652053)
Redo Seq:     196
Redo RBA:     4955152
Status:       Running

------------------------------------------------------------
XID:          4.20.516
Items:        883
Extract:      ESAN
Redo Thread:  1
Start Time:   2014-07-21:11:03:20
SCN:          0.1654314 (1654314)
Redo Seq:     200
Redo RBA:     5136
Status:       Running

GGSCI (arrow.localdomain) 2>
Let’s kill the session owning the transaction from the SOE user.
ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS     SID  SERIAL# USERNAME  STATUS    SCHEMANAME  SQLID          CHILD
-------------------- ---------------- -------- ----- -------- --------- --------- ----------- -------------- -----
07/21/14 10:41:39    3.29.673         INACTIVE   105        9 SOE       INACTIVE  SOE         6cmmk52wfnr7r      0
07/21/14 11:03:19    4.20.516         INACTIVE    18       53 MDINH     INACTIVE  MDINH       a5qywm8993bqg      0

ARROW:(SYS@san):PRIMARY> @xplan.sql
Enter value for sqlid: a5qywm8993bqg
Enter value for child: 0

SQL_ID  a5qywm8993bqg, child number 0
-------------------------------------
update SOE.INVENTORIES set QUANTITY_ON_HAND=QUANTITY_ON_HAND-10 where
PRODUCT_ID=170

Plan hash value: 1060265186

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT  |                |       |       |    28 (100)|          |
|   1 |  UPDATE           | INVENTORIES    |       |       |            |          |
|*  2 |   INDEX RANGE SCAN| INV_PRODUCT_IX |   900 | 12600 |     4   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PRODUCT_ID"=170)

20 rows selected.

ARROW:(SYS@san):PRIMARY> alter system kill session '105,9' immediate;

System altered.

ARROW:(SYS@san):PRIMARY> @trans.sql

START_TIME           XID              STATUS     SID  SERIAL# USERNAME  STATUS    SCHEMANAME  SQLID          CHILD
-------------------- ---------------- -------- ----- -------- --------- --------- ----------- -------------- -----
07/21/14 11:03:19    4.20.516         INACTIVE    18       53 MDINH     INACTIVE  MDINH       a5qywm8993bqg      0

ARROW:(SYS@san):PRIMARY>
Verify that the transaction from the killed session has been removed from GoldenGate:
GGSCI (arrow.localdomain) 1> send esan, status

Sending STATUS request to EXTRACT ESAN ...

EXTRACT ESAN (PID 2556)
Current status: Recovery complete: At EOF

Current read position:
Redo thread #: 1
Sequence #: 200
RBA: 464896
Timestamp: 2014-07-21 11:06:40.000000
SCN: 0.1654584
Current write position:
Sequence #: 7
RBA: 1130
Timestamp: 2014-07-21 11:06:37.435383
Extract Trail: /u01/app/ggs01/dirdat/ss

GGSCI (arrow.localdomain) 2> send esan, showtrans

Sending SHOWTRANS request to EXTRACT ESAN ...

Oldest redo log file necessary to restart Extract is:

Redo Log Sequence Number 200, RBA 5136

------------------------------------------------------------
XID:          4.20.516
Items:        883
Extract:      ESAN
Redo Thread:  1
Start Time:   2014-07-21:11:03:20
SCN:          0.1654314 (1654314)
Redo Seq:     200
Redo RBA:     5136
Status:       Running

GGSCI (arrow.localdomain) 3>
-- trans.sql
set lines 200 pages 1000
col xid for a16
col username for a18
col schemaname for a18
col osuser for a12
select t.start_time,
       t.xidusn||'.'||t.xidslot||'.'||t.xidsqn xid,
       s.status,
       s.sid, s.serial#, s.username, s.status, s.schemaname,
       decode(s.sql_id,null,s.prev_sql_id) sqlid,
       decode(s.sql_child_number,null,s.prev_child_number) child
from   v$transaction t, v$session s
where  s.saddr = t.ses_addr
order  by t.start_time
;
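The xplan.sql script referenced above is not included in the post; a minimal equivalent (an assumption, not the author's exact script) could be:

-- xplan.sql (sketch): show the plan for a given sql_id / child number
select * from table(dbms_xplan.display_cursor('&sqlid', &child));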
It’s been a few weeks since I returned from another awesome Kscope conference and I just realized that I never wrote about it.
For me, it was the first time visiting Seattle, and I really liked it even though I only managed to walk around the downtown area. I had some concerns about what the weather would be like, but everything worked out very well, with clear skies, temperatures in the mid-70s and no rain!
The Sunday symposiums, the conference sessions and the hands-on labs provided really good content. I particularly enjoyed all the presentations delivered by Jonathan Lewis and Richard Foote.
My friend Amy Caldwell won the contest to have a dinner with ODTUG’s President Monty Latiolais and she was very kind to invite me as her guest. We had a good time talking about the past, present and future of ODTUG and it was enlightening and inspirational to say the least.
My presentation on row pattern matching went well, but the attendance wasn't the best, mostly because I had to present in the last time slot, when people were in party mode and ready to head to the EMP Museum for the big event. Nevertheless, I had attendees like Dominic Delmolino, Kim Berg Hansen, Alex Zaballa, Leighton Nelson, Joel Kallman and Patrick Wolf, who had good questions about my topic.
As I said before, the big event took place at the EMP Museum and I believe everyone had a good time visiting the music and sci-fi exhibits and enjoying the food, drinks and music.
Next year, Kscope will take place in Hollywood, Florida. If you're a Developer, DBA or Architect working with Oracle products, that's where you want to be from June 21-25. I suggest you register and book your hotel room right away because it's going to sell out really fast.
Hope to see you there!
Filed under: Kscope Tagged: Kscope
Recently, I’ve been doing some work with a client where tokens need to be used. It came to my attention that the basic usage of tokens is misunderstood. Let’s see if I can clear this up a bit for people reading.
In Oracle GoldenGate, tokens are a way to capture and store data in the header of the trail file (more info on trail headers here). Once a token has been defined, captured and stored in the header, it can be retrieved, on the apply side, and used in many ways to customize what information is delivered by Oracle GoldenGate.
Defining a token is pretty simple; however, keep these three points in mind:
- You define the token and associated data
- The token area of the trail file header permits up to a total of 2,000 bytes (token name, associated data, and length of data)
- Use the TOKENS option of the TABLE parameter in the extract
In order to define a token in an extract, the definition should follow this basic syntax:
TABLE <schema>.<table>, TOKENS( SRC_CSN_TS = @GETENV('GGHEADER','COMMITTIMESTAMP'));
In the example above, the token will be populated with the timestamp of the last commit on the table it is defined against. After restarting the extract, the token (SRC_CSN_TS) will be included in the header of the trail file.
Once the trail file is shipped to the target side and read by the replicat, the token is mapped to a column in the target table.
MAP <schema>.<table>, target <schema>.<table>, COLMAP ( SRC_CSN_TS=@token('SRC_CSN_TS') );
Image 1 is a view of a table where I have mapped the token (SRC_CSN_TS) to a target table column to keep track of the committed timestamps of transactions on the source system.
Tokens are simple to create and use, and they are a powerful feature for mapping data between environments.
Filed under: Golden Gate
Oracle Magazine July - August, 2014
Be sure to check out the July - August issue of Oracle Magazine. It includes articles on the next generation of SPARC servers, the Cloud, Big Data Integration, the new Oracle Enterprise Manager Database Express and much more.
A few weeks back, Mike Olson of Cloudera spoke at Spark Summit on how Spark relates to the future of Hadoop. The presentation can be found here:
In particular I want to draw attention to the statement made at 1:45 in the presentation that describes Spark as the "natural successor to MapReduce" - it becomes clear very quickly that what Olson is talking about is batch processing. This is fascinating as everyone I've talked to immediately points out one obvious thing: Spark isn't a general purpose batch processing framework - that is not its design center. The whole point of Spark is to enable fast data access and interactivity.
The guys that clearly "get" Spark - unsurprisingly - are DataBricks. In talking with Ion and company, it's clear they understand the use cases where Spark shines - data scientist driven data exploration and algorithmic development, machine learning, etc. - things that take advantage of the memory mapping capabilities and speed of the framework. And they have offered an online service that allows users to rapidly extract value from cloud friendly datasets, which is smart.
Cloudera's idea of pushing SQL, Pig and other frameworks onto Spark is actually a step backwards - it is a proposal to recreate all the problems of MapReduce 1: it fails to understand the power of refactoring resource management away from the compute model. Spark would have to reinvent and mature models for multi-tenancy, resource management, scheduling, security, scale-out, etc. that are frankly already there today for Hadoop 2 with YARN.
The announcement of an intent to lead an implementation of Hive on Spark got some attention. This was something that I looked at carefully with my colleagues almost 2 years ago, so I'd like to make a few observations on why we didn't take this path then.
The first was maturity - of the Spark implementation, of Hive itself, and of Shark. Candidly, we knew Hive itself worked at scale but needed significant enhancement and refactoring, both for new features on the SQL front and to work at interactive speeds. And we wanted to do all this in a way that did not compromise Hive's ability to work at scale - for real big data problems. So we focused on the mainstream of Hive and the development of a Dryad-like runtime for optimal execution of operators in physical plans for SQL, in a way that meshed deeply with YARN. That model took the learnings of the database community and scale-out big data solutions and built on them "from the inside out", so to speak.
Anyone who has been tracking Hadoop for, oh, the last 2-3 years will understand intuitively that the right architectural approach needs to be based on YARN. What I mean is that the query execution must - at the query task level - be composed of tasks that are administered directly by YARN. This is absolutely critical for multi-workload systems (this is one reason why a bolt-on MPP solution is a mistake for Hadoop - it is at best a tactical model while the system evolves). This is why we are working with the community on Tez, a low-level framework for enabling YARN-native, domain-specific execution engines. For Hive-on-Tez, Hive is the engine and Tez provides the YARN-level integration for resource negotiation and coordination for DAG execution: a DAG of native operators analogous to the execution model found in the MPP world (when people compare Tez and Spark, they are fundamentally confused - Spark could be run on Tez, for example, for a much deeper integration with Hadoop 2). This model allows the full range of use cases, from interactive to massive batch, to be administered in a deeply integrated, YARN-native way.
Spark will undoubtedly mature into a great tool for what it is designed for: in memory, interactive scenarios - generally script driven - and likely grow to subsume new use cases we aren't anticipating today. It is, however, exactly the wrong choice for scale out big data batch processing in anything like the near term; worse still, returning to a monolithic general purpose compute framework for all Hadoop models would be a huge regression and is a disastrously bad idea.
There was a discrepancy in the failgroups of a couple of ASM disks in Exadata. In Exadata, the cell name corresponds to the failgroup name, but there were a couple of disks with different failgroup names. The following plan can be used to rectify the issue online, without any downtime:
1) Check disks and their failgroup:
col name format a27
col path format a45
SQL> select path,failgroup,mount_status,mode_status,header_status,state from v$asm_disk order by failgroup, path;
o/100.100.00.000/DBFSDG_CD_09_mycellnet0 mycellNET0 CACHED ONLINE MEMBER NORMAL
o/100.100.00.000/DATA_CD_08_mycellnet0 mycell_NET0 CACHED ONLINE MEMBER NORMAL
2) Drop Disks:
ALTER DISKGROUP DATA DROP DISK DATA_CD_08_mycellnet0 REBALANCE POWER 32;
3) Wait for rebalancing to finish:
select * from gv$asm_operation;
4) Add the disks to the correct failgroups
ALTER DISKGROUP DATA ADD failgroup mycellNET0 DISK 'o/100.100.00.000/DATA_CD_08_mycellnet0' rebalance power 32;
5) Wait for the rebalance to complete:
select * from v$asm_operation;
6) Verify the incorrect failgroup has gone
select name,path,failgroup from v$asm_disk where failgroup in ('mycell_NET0') order by name;
select path,failgroup,mount_status,mode_status,header_status,state from v$asm_disk order by failgroup, path;
One of the well-known best practices for HDFS is to store data in few large files, rather than a large number of small ones. There are a few problems related to using many small files but the ultimate HDFS killer is that the memory consumption on the name node is proportional to the number of files stored in the cluster and it doesn’t scale well when that number increases rapidly.
MapR has its own implementation of the Hadoop filesystem (called MapR-FS) and one of its claims to fame is that it scales and works well with small files. In practice, though, there are a few things you should do to ensure that the performance of your map-reduce jobs does not degrade when they are dealing with too many small files, and I'd like to cover some of those.

The problem
I stumbled upon this when investigating the performance of a job in production that was taking several hours to run on a 40-node cluster. The cluster had spare capacity but the job was progressing very slowly and using only 3 of the 40 available nodes.
When I looked into the data that was being processed by the active mappers, I noticed that the vast majority of the splits being read were in blocks replicated to the same 3 cluster nodes. There was a significant data distribution skew towards those 3 nodes, and since map-reduce tasks prefer to execute on nodes where the data is local, the rest of the cluster sat idle while those 3 nodes were IO bound and processing heavily.

MapR-FS architecture
Differently from HDFS, MapR-FS doesn't have name nodes. The file metadata is distributed across different data nodes instead. This is the key to getting rid of the name node memory limitation of HDFS, and it lets MapR-FS handle a lot more files, small or large, than an HDFS cluster.
Files in MapR-FS have, by default, blocks of 256MB. Blocks are organised in logical structures called “containers”. When a new block is created it is automatically assigned to one existing container within the volume that contains that file. The container determines the replication factor (3 by default) and the nodes where the replicas will be physically stored. Containers are bound to a MapR volume and cannot span multiple volumes.
There’s also a special container in MapR-FS called a “name container”, which is where the volume namespace and file chunk locations are stored. Besides the metadata, the name container always stores the first 64KB of the file’s data.
Also, there’s only a single name container per MaprFS volume. So the metadata for all the files in a volume, along with the files’ first 64KB of data, will be all stored in the same name container. The larger the number of files in a volume, the more data this container will be replicating across the same 3 cluster nodes (by default).
So, if your data set is comprised of a very large number of small files (with sizes around 64KB or less) and is all in the same volume, most of the data will be stored in the same 3 cluster nodes, regardless of the cluster size. Even if you had a very large cluster, whenever you ran a map-reduce job to process those files, the job's tasks would be pretty much allocated on only 3 nodes of the cluster due to data locality. Those 3 nodes would be under heavy load while the rest of the cluster would sit idle.

Real impact
To give you an idea of the dimension of this problem, the first time I noticed this in production was due to a Hive query that was causing high load on only 3 nodes of a 40-node cluster. The job took 5 hours to complete. When I looked into the problem I found that the table used by the Hive query had tens of thousands of very small files, many of them smaller than 64K, due to the way the data was being ingested.
We coalesced the table to combine all those small files into a much smaller number of bigger ones. The job ran again after that, without any changes, and completed in just 15 minutes!! To be completely fair, we also changed the table's file format from SequenceFile to RCFile at the same time we coalesced the data, which certainly brought some additional performance improvements. But, from the 3-node contention I saw during the first job run, I'm fairly convinced that the main issue in this case was the data distribution skew due to the large number of small files.

Best practices
This kind of problem is mitigated when large files are used, since only a small fraction of the data (everything below the 64KB mark) will be stored in the name container, with the rest distributed across other containers and, therefore, other nodes. We’ll also have a smaller number of files (for a similar data volume), which reduces the problem even more.
If your data is ingested in a way that creates many small files, plan to coalesce those files into larger ones on a regular basis. One good tool for that is Edward Capriolo’s File Crusher. This is also (and especially) applicable to HDFS.
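As a rough illustration of a coalescing pass (Hive SQL, with made-up table names; similar in spirit to what we did above), the data can simply be rewritten into a new table, optionally switching to a more compact file format at the same time:

-- Hive: rewrite a table full of small files into fewer, larger RCFile files
CREATE TABLE events_coalesced STORED AS RCFILE AS SELECT * FROM events;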
Best practice #1: Keep your data stored in large files. Pay special attention to incremental ingestion pipelines, which may create many small files, and coalesce them on a regular basis.
A quick and dirty workaround for the 3-node contention issue explained above would be to increase the replication factor for the name container. This would allow more nodes to run map-reduce tasks on that data. However, it would also use a lot more disk space just to achieve the additional data locality across a larger number of nodes. This is NOT an approach I would recommend to solve this particular problem.
Instead, the proper way to solve this in MapR-FS is to split your data across different volumes. Especially if you're dealing with a large number of small files that cannot be coalesced, splitting them across multiple volumes will keep the number of files per volume (and per name container) under control, and it will also spread the small files' data evenly across the cluster, since each volume will have its own name container, replicated across a different set of nodes.
The volumes may, or may not, follow your data lifecycle, with monthly, weekly or even daily volumes, depending on the amount of data being ingested and files being created.
Best practice #2: Use MapR-FS volumes to plan your data distribution and keep the number of files per volume under control.
Yesterday, Cloudera released the score reports for their Data Science Challenge 2014 and I was really ecstatic when I received mine with a “PASS” score! This was a real challenge for me and I had to put a LOT of effort into it, but it paid off in the end!
Note: I won’t bother you in this blog post with the technical details of my submission. This is just an account of how I managed to accomplish it. If you want the technical details, you can look here.
I first learned about the challenge last year, when Cloudera ran it for the first time. I was intrigued, but after reading more about it I realised I didn't have what would be required to complete the task successfully.
At the time I was already delving into the Hadoop world, even though I was still happily working as an Oracle DBA at Pythian. I had studied the basics and the not-so-basics of Hadoop and the associated fauna, and had just passed my first Hadoop certifications (CCDH and CCAH). However, there was (and is) still so much to learn! I knew that to take the challenge I would have to invest a lot more time in my studies.
“Data Science” was still a fuzzy buzzword for me. It still is, but at the time I had no idea what was behind it. I remember reading this blog post about how to become a data scientist. A quick look at the map in that post turned me off: apart from the “Fundamentals” track in it, I had barely any idea what the rest of the map was about! There was a lot of work to do to get there.

There's no free lunch
But as I started reading more about Data Science, I began to realise how exciting it was and how interesting the problems it could help tackle were. By now I had already put my DBA career on hold and joined the Big Data team. I felt a huge gap between my expertise as a DBA and my skills as a Big Data engineer, so I put a lot of effort into studying the cool things I wanted to know more about.
The online courses at Coursera, edX, Stanford and the like were a huge help, and soon I started wading through courses and courses, sometimes many at once: Scala, R, Python, more Scala, data analysis, machine learning, and more machine learning, etc. That was not easy and it was a steep learning curve for me. The more I read and studied, the more I realised how much more there was to learn. And there still is...

The Medicare challenge
But when Cloudera announced the 2014 Challenge, early this year, I read the disclaimer and realised that this time I could understand it! Even though I had just scratched the surface of what Data Science is meant to encompass, I actually had tools to attempt tackling the challenge.
“Studies shall not stop!!!”, I soon found, as I had a lot more to learn to first pass the written exam (DS-200) and then tackle the problem proposed by the challenge: to detect fraudulent claims in the US Medicare system. It was a large undertaking, but I took it one step at a time and eventually managed to complete a coherent and comprehensive abstract to submit to Cloudera, which, as I gladly found yesterday, was good enough to give me a passing score and the “CCP: Data Scientist” certification from Cloudera!

I'm a (Big Data) Engineer
What’s next now? I have only one answer: Keep studying. There’s so much cool stuff to learn. From statistics (yes, statistics!) to machine learning, there’s still a lot I want to know about and that keeps driving me forward. I’m not turning into a Data Scientist, at least not for a while. I am an Engineer at heart; I like to fix and break things at work and Data Science is one more of those tools I want to have to make my job more interesting. But I want to know more about it and learn how to use it properly, at least to avoid my Data Scientist friends cringing away every time I tell tell I’m going to run an online logistic regression!