OraFAQ Forum: Text & interMedia » PDF to HTML convert using ctxsys.auto

Database versions: 11.2.0.1.0 / 12.1.0.1.0


PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650141]

Sun, 17 April 2016 04:25

bwelter
Messages: 4
Registered: January 2012
Location: Netherlands

Junior Member

Converting the same PDF document gives different results in Oracle 11.2 and Oracle 12.1.
I am using plaintext => false to get HTML output.

Code:
declare
  l_blob blob; -- holds the PDF
  l_clob clob; -- result of the conversion
begin
  -- load the blob with the PDF:
  ...
  -- create the policy:
  ctx_ddl.create_policy('test_policy', 'ctxsys.auto_filter');
  ......
  -- convert the PDF (plaintext => false requests HTML output):
  ctx_doc.policy_filter(policy_name => 'test_policy', document => l_blob, restab => l_clob, plaintext => false);
  -- turn carriage returns into line feeds:
  l_clob := replace(trim(l_clob), chr(13), chr(10));
  -- mark the end and beginning of each line:
  l_clob := replace(l_clob, chr(10), chr(32) || '<<EOL>>' || chr(10) || '<<BOL>>');
  ....
end;

In the Oracle 12.1 database, l_clob contains:
<<BOL>><div class="c" style="top:592px;left:218px;font-size:9px;font-family:Arial, sans-serif;" <<EOL>>
<<BOL>>>TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</div> <<EOL>>
<<BOL>><div class="c" style="top:592px;left:38px;font-size:9px;font-family:Arial, sans-serif;" <<EOL>>

In the Oracle 11.2 database, the same PDF gives the following result in l_clob:
<<BOL>> <<EOL>>
<<BOL>><p><font size="1" face="Arial">TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</font></p> <<EOL>>
<<BOL>> <<EOL>>

I explicitly need this part of the converted PDF content:
..top:592px;left:218px..

Maybe it has something to do with settings?
What is the solution?

NB: I am aware of the fact that not all PDF documents contain nicely formatted texts and x-y positions. For my purpose now this is however a good solution.
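
To illustrate why the top/left values matter, here is a minimal sketch of how the positions could be pulled out of the 12.1-style output. It assumes the raw filter output before the <<BOL>>/<<EOL>> markers are added (the markers contain < and > and would confuse the pattern), and it assumes every div carries top and left in that order, as in the sample above. The variable names and the pattern are illustrative only, not part of Oracle Text. Run it with serveroutput enabled.

```sql
declare
  -- Sample copied from the 12.1 output above, without the <<BOL>>/<<EOL>> markers.
  l_html varchar2(4000) :=
       '<div class="c" style="top:592px;left:218px;font-size:9px;font-family:Arial, sans-serif;"' || chr(10)
    || '>TRANSFORMER SINGLE PHASE, PR AC440V SEC AC220/5,</div>';
  -- Assumed pattern: captures top, left and the text of one div.
  l_pat  constant varchar2(100) := 'top:([0-9]+)px;left:([0-9]+)px;[^>]*>([^<]*)</div>';
  l_occ  pls_integer := 1;
  l_top  pls_integer;
  l_left pls_integer;
  l_text varchar2(4000);
begin
  loop
    -- stop when there is no further matching div
    exit when regexp_instr(l_html, l_pat, 1, l_occ) = 0;
    l_top  := to_number(regexp_substr(l_html, l_pat, 1, l_occ, null, 1));
    l_left := to_number(regexp_substr(l_html, l_pat, 1, l_occ, null, 2));
    l_text := regexp_substr(l_html, l_pat, 1, l_occ, null, 3);
    dbms_output.put_line('top=' || l_top || ' left=' || l_left || ' text=' || l_text);
    l_occ := l_occ + 1;
  end loop;
end;
/
```

The 11.2-style output (<p><font ...>) has no top/left attributes at all, so a pattern like this finds nothing there; that is the information missing from the 11.2 result.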

[Updated on: Sun, 17 April 2016 04:27]


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650154 is a reply to message #650141]

Sun, 17 April 2016 09:49

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

As far as I know, there is no setting that affects that. I have heard that Oracle uses a third-party filter for auto_filter and that it has changed between versions. So, it is probably just a difference between versions. To verify this, you might try posting your question on the OTN Oracle Text forum. Oracle Text product manager Roger Ford usually responds there.

https://community.oracle.com/community/database/text/content

[Updated on: Sun, 17 April 2016 09:50]


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650155 is a reply to message #650154]

Sun, 17 April 2016 10:48

bwelter
Messages: 4
Registered: January 2012
Location: Netherlands

Junior Member

I posted the question there. Thanks!


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650171 is a reply to message #650155]

Mon, 18 April 2016 07:20

bwelter
Messages: 4
Registered: January 2012
Location: Netherlands

Junior Member

The answer I got from Roger:
AUTO_FILTER is designed to create indexable text from formatted files. It makes no claims to produce any specific layout in the output files.

I don't think there are any settings which will enable 12c to produce the same output as 11g. However, it might be possible to take the ctxhx executable from an 11g installation and put it into the 12c environment. I'm not sure if there are library files that might need to be transferred as well.


Re: PDF to HTML convert using ctxsys.auto_filter different result db 11.2 and 12.1 [message #650181 is a reply to message #650171]

Mon, 18 April 2016 13:37

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

Thanks for letting us know. So, do you plan to try using the 11g ctxhx in 12c? If so, please let us know if that works or not.

