Dear all,
Many thanks for your input and your time.
Answers are embedded below.
HAND
M. Mueller
<huge snip>
> Where is the lob data to be loaded, in your nfs partition?
Yes; we can't hold them locally on the DB server (space shortage). During
this load the *.dat files are local on the DB server. I noticed slightly
better load performance when at least the .dat file resides on a local
filesystem.
> Can you post your CREATE TABLE statement, your sqlldr
> control file and some details of the Text indexing you do?
> It might help trigger a "flash" in anyone listening here.
a) create table:
CREATE TABLE GUIDE_DOCS (
id NUMBER(16, 0)
,wwwhost_id NUMBER(10, 0) NOT NULL
,document_id NUMBER(10, 0) NOT NULL
,url VARCHAR2(2000) NOT NULL
,date_fetched DATE NOT NULL
,date_checked DATE NOT NULL
,date_lastmod DATE NOT NULL
,language VARCHAR2(4)
,content_type VARCHAR2(40) NOT NULL
,content_length NUMBER(10, 0) NOT NULL
,content BLOB NOT NULL
,md5 VARCHAR2(40) NOT NULL
,CONSTRAINT guide_docs_id_pk PRIMARY KEY (id)
,CONSTRAINT guide_docs_wwwhostid_fk FOREIGN KEY (wwwhost_id)
REFERENCES wwwhosts(id)
) TABLESPACE GUIDEMETA
LOB("CONTENT") STORE AS (TABLESPACE GUIDEPAGES);
CREATE INDEX "GUIDE_DOCS_HOST_DOC_IDX"
ON "GUIDE_DOCS"("WWWHOST_ID", "DOCUMENT_ID");
b) control file:
OPTIONS (ERRORS=1000000, DIRECT=FALSE, BINDSIZE=2000000, ROWS=512)
LOAD DATA
INFILE "orainfo00.dat" BADFILE "orainfo.00.bad"
INTO TABLE GUIDE_DOCS
APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( WWWHOST_ID INTEGER EXTERNAL,
DOCUMENT_ID INTEGER EXTERNAL,
DATE_CHECKED DATE "YYYY-MM-DD-HH24-MI-SS",
DATE_FETCHED DATE "YYYY-MM-DD-HH24-MI-SS",
DATE_LASTMOD DATE "YYYY-MM-DD-HH24-MI-SS",
MD5 CHAR(40),
CONTENT_TYPE CHAR(40),
CONTENT_LENGTH INTEGER EXTERNAL,
LANGUAGE CHAR(4),
TITLE FILLER CHAR,
URL CHAR(2000),
FILE FILLER CHAR,
CONTENT LOBFILE(FILE) TERMINATED BY EOF
)
c) details of Oracle Text indexing:
The application consists of subject-specific Internet search engines.
Each month our web crawling robots deliver 2-3 million new/changed HTML
files. These files are loaded as BLOBs, and subsequently the (currently)
7 subject-specific Oracle Text indexes are created from scratch.
snippet from the index creating procedure:
EXECUTE IMMEDIATE 'CREATE INDEX ' || IDX || ' ON ' || TAB
  || '(DOC) INDEXTYPE IS CTXSYS.CONTEXT '
  || 'PARAMETERS(''DATASTORE GUIDE_DOCS LEXER GUIDE_LEXER '
  || 'SECTION GROUP CTXSYS.HTML_SECTION_GROUP '
  || 'FILTER CTXSYS.NULL_FILTER'')';
After optimization of the Oracle Text indexes, old duplicate records
(those not contained in the freshly created index) are deleted (approx.
1-1.5 million rows). Hence the table fragmentation.
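For clarity, the cleanup step is conceptually something like the
following. This is a purely hypothetical sketch: the actual duplicate
criterion is not shown in this post, so matching on md5 and keeping the
row with the highest id is an assumption for illustration only.

```sql
-- Hypothetical sketch of the monthly cleanup that causes the
-- fragmentation; the duplicate criterion (same md5, keep the
-- newest id) is an assumption, not the actual application logic.
DELETE FROM guide_docs g
 WHERE EXISTS (SELECT 1
                 FROM guide_docs d
                WHERE d.md5 = g.md5
                  AND d.id  > g.id);
```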
>
>>application). Direct load exits when I try to load many records with a
>>null lob. After a few initial tests I returned to slower conventional
>>path :( .
>
>
> Bugger!
indeed...
>
>>32 Gb is max size for a datafile on Linux. We haven't encountered any
>>problems with the size yet.
>
>
> Thanks for the info. Suse, wasn't it?
SuSE SLES 8, a nice OS and distribution :)
>
>
>>Good question, I never performed a load at this magnitude. When I load
>>about 3 Million records it takes about 1 h.
>
>
> That's not too bad, considering you're loading lobs as well.
> The rest of the time is spent creating the Text indexes?
Creating the indexes and optimizing them. Depending on the available
time frame, optimization is currently either time-bounded and scheduled
via dbms_job, using ctx_ddl.optimize_index(FULL, x min), or a complete
optimization with ctx_ddl.optimize_index(FAST).
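For reference, the time-bounded variant could be scheduled roughly like
this. A hedged sketch only: the index name GUIDE_IDX, the 30-minute
budget, and the daily interval are illustrative assumptions, not our
actual job definition.

```sql
-- Hypothetical sketch: submit a daily DBMS_JOB that runs a
-- time-bounded FULL optimization (30-minute budget) on one
-- Text index. GUIDE_IDX and the interval are made-up values.
DECLARE
  jobno BINARY_INTEGER;
BEGIN
  DBMS_JOB.SUBMIT(
    job       => jobno,
    what      => 'ctx_ddl.optimize_index(''GUIDE_IDX'', ctx_ddl.optlevel_full, 30);',
    next_date => SYSDATE,
    interval  => 'SYSDATE + 1'   -- run once a day
  );
  COMMIT;
END;
/
```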
>
> Partitioning looks very promising for scaling. Do you
> have a date column you could partition on? I'm thinking
> one partition for every period of 7 days, drop oldest and
> create new ones when needed?
Management wants to scale up from currently 1 TB to 10 TB over the next
2 years. Additionally, other document types (XML and PDF files) should
be indexed in the near future. We can see two theoretical approaches:
a) one main table per subject-specific Oracle Text index (redundancy in
HTML files approx. 20-30%)
b) one big partitioned main table (I know we need EE for this). Maybe
range partitioning on wwwhost_id is possible. I'll investigate whether
the data are evenly distributed, but I lack experience with partitioned
tables containing BLOBs. Is it possible to defragment one partition at
a time? Any other ideas how to deal with this situation?
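Regarding b), a minimal sketch of what range partitioning on wwwhost_id
with per-partition LOB storage might look like. The boundary values,
partition names, and tablespace names are assumptions, and the column
list is abbreviated; this requires EE with the Partitioning option.

```sql
-- Hypothetical sketch (Oracle EE + Partitioning option assumed).
-- Boundary values and tablespace names are illustrative only.
CREATE TABLE guide_docs_part (
  id          NUMBER(16,0) PRIMARY KEY,
  wwwhost_id  NUMBER(10,0) NOT NULL,
  content     BLOB NOT NULL
  -- ... remaining columns as in GUIDE_DOCS ...
)
PARTITION BY RANGE (wwwhost_id) (
  PARTITION p_low  VALUES LESS THAN (500000)
    LOB (content) STORE AS (TABLESPACE guidepages_p1),
  PARTITION p_high VALUES LESS THAN (MAXVALUE)
    LOB (content) STORE AS (TABLESPACE guidepages_p2)
);

-- Defragmenting a single partition rebuilds that partition's
-- segment (including its LOB segment) without touching the rest:
ALTER TABLE guide_docs_part MOVE PARTITION p_low
  LOB (content) STORE AS (TABLESPACE guidepages_p1);
-- Affected indexes on that partition then need a rebuild, e.g.:
-- ALTER INDEX some_local_idx REBUILD PARTITION p_low;
```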
>
> Still, very strange that you get such low CPU activity even with
> concurrent loads. I'd say the main problem is the nfs data source.
> Do a test by putting some data in a local disk, then do a trial
> load from there and another of the same volume from nfs.
> Compare CPU and I/O activity across the system for both. That
> should tell you for sure if the prob is the db handling of lob's
> or the nfs.
I'll try it on a development box.
>
> A visit to Metalink for any outstanding bugs (features?) on
> lob's in your version of the software may pay off really well.
I looked on Metalink and found nothing in the bug DB. We have already
applied the latest patch.
>
> Cheers
> Nuno Souto
> wizofoz2k_at_yahoo.com.au.nospam
Received on Thu Aug 07 2003 - 05:06:36 CDT