Dear all,
Many thanks for your input and your time.
Answers are embedded below.
HAND
M. Mueller
<huge snip>
> Where is the lob data to be loaded, in your nfs partition?
Yes; we can't hold them locally on the DB server (space shortage). During
this load the *.dat files are local on the DB server. I noticed slightly
better load performance when at least the .dat file resides on a local
filesystem.
> Can you post your CREATE TABLE statement, your sqlldr
> control file and some details of the Text indexing you do?
> It might help trigger a "flash" in anyone listening here.
a) create table:
CREATE TABLE GUIDE_DOCS (
id NUMBER(16, 0)
,wwwhost_id NUMBER(10, 0) NOT NULL
,document_id NUMBER(10, 0) NOT NULL
,url VARCHAR2(2000) NOT NULL
,date_fetched DATE NOT NULL
,date_checked DATE NOT NULL
,date_lastmod DATE NOT NULL
,language VARCHAR2(4)
,content_type VARCHAR2(40) NOT NULL
,content_length NUMBER(10, 0) NOT NULL
,content BLOB NOT NULL
,md5 VARCHAR2(40) NOT NULL
,CONSTRAINT guide_docs_id_pk PRIMARY KEY (id)
,CONSTRAINT guide_docs_wwwhostid_fk FOREIGN KEY (wwwhost_id)
REFERENCES wwwhosts(id)
) TABLESPACE GUIDEMETA
LOB("CONTENT") STORE AS (TABLESPACE GUIDEPAGES);
CREATE INDEX "GUIDE_DOCS_HOST_DOC_IDX"
ON "GUIDE_DOCS"("WWWHOST_ID", "DOCUMENT_ID");
b) control file:
OPTIONS (ERRORS=1000000, DIRECT=FALSE, BINDSIZE=2000000, ROWS=512)
LOAD DATA
INFILE "orainfo00.dat" BADFILE "orainfo.00.bad"
INTO TABLE GUIDE_DOCS
APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
( WWWHOST_ID INTEGER EXTERNAL,
DOCUMENT_ID INTEGER EXTERNAL,
DATE_CHECKED DATE "YYYY-MM-DD-HH24-MI-SS",
DATE_FETCHED DATE "YYYY-MM-DD-HH24-MI-SS",
DATE_LASTMOD DATE "YYYY-MM-DD-HH24-MI-SS",
MD5 CHAR(40),
CONTENT_TYPE CHAR(40),
CONTENT_LENGTH INTEGER EXTERNAL,
LANGUAGE CHAR(4),
TITLE FILLER CHAR,
URL CHAR(2000),
FILE FILLER CHAR,
CONTENT LOBFILE(FILE) TERMINATED BY EOF
)
c) details of Oracle Text indexing:
The application consists of subject-specific Internet search engines.
Each month our web crawling robots deliver 2-3 million new/changed HTML
files. These files are loaded as BLOBs, and subsequently the (currently)
7 subject-specific Oracle Text indexes are created from scratch.
snippet from the index creating procedure:
EXECUTE IMMEDIATE 'CREATE INDEX ' || IDX || ' ON ' || TAB
  || '(DOC) INDEXTYPE IS CTXSYS.CONTEXT '
  || 'PARAMETERS(''DATASTORE GUIDE_DOCS LEXER GUIDE_LEXER '
  || 'SECTION GROUP CTXSYS.HTML_SECTION_GROUP '
  || 'FILTER CTXSYS.NULL_FILTER'')';
After optimization of the Oracle Text indexes, old duplicate records
(those not contained in the freshly created index) are deleted (approx.
1-1.5 million rows). Hence the table fragmentation.
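For clarity, the cleanup step is conceptually something like the
following. This is a purely hypothetical sketch: the actual duplicate
criterion is not shown in this post, so matching on md5 and keeping the
row with the highest id is an assumption for illustration only.

```sql
-- Hypothetical sketch of the monthly cleanup that causes the
-- fragmentation; the duplicate criterion (same md5, keep the
-- newest id) is an assumption, not the actual application logic.
DELETE FROM guide_docs g
 WHERE EXISTS (SELECT 1
                 FROM guide_docs d
                WHERE d.md5 = g.md5
                  AND d.id  > g.id);
```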
>
>>application). Direct load exits when I try to load many records with a
>>null lob. After a few initial tests I returned to slower conventional
>>path :( .
>
>
> Bugger!
indeed...
>
>>32 Gb is max size for a datafile on Linux. We haven't encountered any
>>problems with the size yet.
>
>
> Thanks for the info. Suse, wasn't it?
SuSE SLES 8, a nice OS and distribution :)
>
>
>>Good question, I never performed a load at this magnitude. When I load
>>about 3 Million records it takes about 1 h.
>
>
> That's not too bad, considering you're loading lobs as well.
> The rest of the time is spent creating the Text indexes?
Creating the indexes and optimizing them. Depending on the available
time frame, optimization is currently either time-bounded and scheduled
via dbms_job, using ctx_ddl.optimize_index(FULL, x min), or a complete
optimization with ctx_ddl.optimize_index(FAST).
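For reference, the time-bounded variant could be scheduled roughly like
this. A hedged sketch only: the index name GUIDE_IDX, the 30-minute
budget, and the daily interval are illustrative assumptions, not our
actual job definition.

```sql
-- Hypothetical sketch: submit a daily DBMS_JOB that runs a
-- time-bounded FULL optimization (30-minute budget) on one
-- Text index. GUIDE_IDX and the interval are made-up values.
DECLARE
  jobno BINARY_INTEGER;
BEGIN
  DBMS_JOB.SUBMIT(
    job       => jobno,
    what      => 'ctx_ddl.optimize_index(''GUIDE_IDX'', ctx_ddl.optlevel_full, 30);',
    next_date => SYSDATE,
    interval  => 'SYSDATE + 1'   -- run once a day
  );
  COMMIT;
END;
/
```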
>
> Partitioning looks very promising for scaling. Do you
> have a date column you could partition on? I'm thinking
> one partition for every period of 7 days, drop oldest and
> create new ones when needed?
Management wants to scale up from currently 1 TB to 10 TB over the next
2 years. Additionally, other document types (XML and PDF files) should
be indexed in the near future. We can see two theoretical approaches:
a) one main table per subject-specific Oracle Text index (redundancy in
HTML files approx. 20-30%)
b) one big partitioned main table (I know we need EE for this). Maybe
range partitioning on wwwhost_id is possible. I'll investigate whether
the data are evenly distributed, but I lack experience with partitioned
tables containing BLOBs. Is it possible to defragment one partition at
a time? Any other ideas how to deal with this situation?
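Regarding b), a minimal sketch of what range partitioning on wwwhost_id
with per-partition LOB storage might look like. The boundary values,
partition names, and tablespace names are assumptions, and the column
list is abbreviated; this requires EE with the Partitioning option.

```sql
-- Hypothetical sketch (Oracle EE + Partitioning option assumed).
-- Boundary values and tablespace names are illustrative only.
CREATE TABLE guide_docs_part (
  id          NUMBER(16,0) PRIMARY KEY,
  wwwhost_id  NUMBER(10,0) NOT NULL,
  content     BLOB NOT NULL
  -- ... remaining columns as in GUIDE_DOCS ...
)
PARTITION BY RANGE (wwwhost_id) (
  PARTITION p_low  VALUES LESS THAN (500000)
    LOB (content) STORE AS (TABLESPACE guidepages_p1),
  PARTITION p_high VALUES LESS THAN (MAXVALUE)
    LOB (content) STORE AS (TABLESPACE guidepages_p2)
);

-- Defragmenting a single partition rebuilds that partition's
-- segment (including its LOB segment) without touching the rest:
ALTER TABLE guide_docs_part MOVE PARTITION p_low
  LOB (content) STORE AS (TABLESPACE guidepages_p1);
-- Affected indexes on that partition then need a rebuild, e.g.:
-- ALTER INDEX some_local_idx REBUILD PARTITION p_low;
```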
>
> Still, very strange that you get such low CPU activity even with
> concurrent loads. I'd say the main problem is the nfs data source.
> Do a test by putting some data in a local disk, then do a trial
> load from there and another of the same volume from nfs.
> Compare CPU and I/O activity across the system for both. That
> should tell you for sure if the prob is the db handling of lob's
> or the nfs.
I'll try it on a development box.
>
> A visit to Metalink for any outstanding bugs (features?) on
> lob's in your version of the software may pay off really well.
I looked on Metalink and found nothing in the bug DB. We have already
applied the latest patch.
>
> Cheers
> Nuno Souto
> wizofoz2k_at_yahoo.com.au.nospam
Received on Thu Aug 07 2003 - 05:06:36 CDT