Re: utl_smtp Hanging when opening a connection

From: Ian M <noemailherethanks_at_news.com>
Date: Fri, 30 Nov 2007 01:29:31 +0100
Message-ID: <474f58b9$0$227$e4fe514c@news.xs4all.nl>

joel garry wrote:
> On Nov 29, 1:24 pm, Ian M <noemailheretha..._at_news.com> wrote:

>> collins.pa..._at_googlemail.com wrote:
>>> We are using utl_smtp to send emails from our database. We currently
>>> run Oracle 9.2 on RHEL4, with Postfix as our mailserver.

>
> Which 9.2? Some platform-specific bugs have been fixed in the later
> patches. One always has to wonder what wasn't found. I'm sure you've
> seen metalink Note:390852.1, in the realm of silly code idiosyncrasies
> one could easily miss.
>

>>> Lately, we have noticed that the process creating the email is
>>> hanging, which often requires the database to be taken down to fully
>>> kill off the session. (This is on our test system, fortunately).

>
> Have you tried killing the session from the OS level, rather than db?
> PMON may then be more accommodating than SMON. How different is your
> test system from your production?
>

>>> Having looked at a number of posts on various forums, I have seen
>>> that this is not an uncommon problem. Oracle have included a timeout
>>> parameter in utl_smtp.open_connection, but this is not implemented for
>>> write processes (in version 9.1 to 10.2, although more may be the
>>> same), and from comments, I have seen that this timeout functionality
>>> does not apply to opening the connection itself.
>>> Has anybody come up with a solution to programmatically cause the open
>>> connection to timeout if a connection is not established in a
>>> reasonable time, and if so, can you please help me.
>>> Many thanks in anticipation,
>>> Paul
>> Hi Paul,
>>
>> I am not sure about RHEL4 but I had a similar situation a few years back
>> on a HP box with frequent external network problems.
>>
>> To reduce the impact of this I amended the server's TCP settings (I think
>> it was tcp_ip_abort_cinterval, though I'm not 100% sure). This caused the
>> failed open connection attempts to close much faster which was useful
>> for server scanning failures etc.
>>

>
> The details of those sorts of things tend to be very platform and OS
> version specific. This tcp twiddle was no longer needed when the
> system was upgraded to hp-ux 11i, for example:
> http://groups.google.com/group/comp.databases.oracle.server/msg/b8f8ae3a5be16fa1?dmode=source
>
> Not long ago I saw an hp-ux box that isn't meant to talk to the
> outside world. So resolv.conf didn't have anything in it. So
> sendmail started up with the wrong domain. So when the raid
> monitoring software tried to send local mail (with a domain specified)
> to root to say a disk had failed, it would just get stuck in mqueue.
> And then sendmail would put more mqueue files out there to say it had
> tried and failed to send a message. Then more to say it's been trying
> for 5 days. The five day window would allow about 32000 messages to
> hang around there (or was that an inode issue?), plus about 255
> sendmail and 255 rmail processes. When I tried to rid the mqueue of
> the files, that allowed the processes to take over all the processors,
> not very nice to telnet. I eventually got all that sorted out, added
> the dns server reference to resolv.conf, killed/restarted sendmail,
> database was able to continue, all was well with the world except for
> a hot-swappable disk. And except everything that depended on not
> having a domain specified was broken. Fortunately that quickly blew
> up a process that was configured to send me mail from another machine
> when the standby log transport failed, so I noticed it before anything
> else messed up.
>
> Moral: Check everything, even on a system that appears to be working
> and no one complains about and has monitoring software that notifies
> you when things go wrong.
>
> jg
> --
> @home.com is bogus.
> mo' money, mo' money, mo'money! http://www.internetnews.com/bus-news/article.php/3712566

Hi Joel,

I found your comment on 11i interesting and, frankly, surprising, as I was not aware of a change here. I decided to test it, and the results may surprise you.

# uname -a
HP-UX XXXXXXXX B.11.11 U 9000/800.....

I checked what the current value setting was:

# /usr/bin/ndd -get /dev/tcp tcp_ip_abort_cinterval
75000

The IP 123.123.123.123 is blocked by a firewall, so a connection attempt to it behaves much like any network-down problem.

S=$SECONDS;telnet 123.123.123.123 1521;echo "$(($SECONDS-$S)) Seconds." ...
75 Seconds.

So the connection sits in SYN_SENT for the full 75 seconds (75000 ms) before the attempt is aborted.

I then set this value to 10 seconds.
# /usr/bin/ndd -set /dev/tcp tcp_ip_abort_cinterval 10000

S=$SECONDS;telnet 123.123.123.123 1521;echo "$(($SECONDS-$S)) Seconds." ...
10 Seconds.
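
For anyone who wants to repeat the test, the timing one-liner could be wrapped in a small function. This is only a sketch: the name elapsed_seconds is made up for illustration, and it assumes a date command that understands +%s (on a shell that provides $SECONDS, as in the commands above, that counter would do the same job).

```shell
# Hypothetical helper, not part of the original test: run a command and
# print how many whole seconds it took.
# Assumption: date supports the +%s (seconds since epoch) format.
elapsed_seconds() {
    start=$(date +%s)
    # Only the duration matters here, so output and exit status are ignored.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo "$((end - start))"
}

# Example, using the firewalled address and port from the test above:
# elapsed_seconds telnet 123.123.123.123 1521
```

Running it once before and once after changing tcp_ip_abort_cinterval should show the same kind of difference as the two results above, give or take a second of rounding.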

Obviously I agree completely with the comment about platform and OS specifics; the moral is indeed to check everything.

Regards,
Ian.

Received on Thu Nov 29 2007 - 18:29:31 CST