Oracle FAQ | Your Portal to the Oracle Knowledge Grid |
![]() |
![]() |
Home -> Community -> Mailing Lists -> Oracle-L -> RE: Database Monitoring tool
What I found happening with monitoring tools that use agents is that these
do not connect to the database in the same way that the applications
connect; and there were times when the agent thought everything was A-OK
when they weren't. So I wrote a ksh script which has grown to 1402 lines.
Without handing out a copy of
the entire script, I think I can give you a good idea of the approach I
took. You will probably see things in the following script examples that
are probably not necessary, but the result of extreme paranoia. My
philosophy is: there is no such thing as too much paranoia. I included some
script examples on what I thought might be some of the more obscure, less
common stuff you see in a ksh script.
The high-level logic is:
Taking these in order:
It always good to write programs (contrary to the beliefs of some, scripts ARE programs) so that you avoid hard-coding configuration info. That way the script is as generic as possible, and all specific info is in config files. Unless you want to drive people crazy, you want to store the status of the databases at the last test and page only when statuses change rather keep sending out page after page for the same problem. The script also examines the config files themselves and reports any formatting errors therein.
When the script starts, it attempts to create a dummy soft link to a non-existent file. I believe this is an atomic operation. If this succeeds, we are good to go. If it fails, then it is almost certainly because the soft link already exists. So I use the ps command to parse running processes to see if another copy of this script is running. This looks like:
CREATE_LOCK_FILE () {
## The creation of the lock file is not necessary to check if another
copy of the script
## is running. But most of the time another copy will NOT be running, so
I think this
## will save the overhead of running ps and parsing its output. If the
lock file is not
## successfully created, then we run ps and parse its output to confirm
if another copy
## of the script is running.
## It also serves as an easy manual check that the script is running.
ln -s DO_BE_DO_DB_DOO "${ADMIN_DIR}/lock_file" 2> /dev/null if [ $? -ne 0 ]; then
MY_NAME=`basename "$0"` MYPID=$$ MYPPID=`ps -eo pid -o ppid -o args | sed 's/^ *//g;s/ */ /g' | awk '$1 == PID {print $2}' PID=$MYPID` { ps -eo pid -o args | sed 's/^ *//g;s/ */ /g' | awk ' $1 == PID {next} $1 == PPID {next} $2 == "vi" {next} $0 ~ "awk" {next} $0 ~ XXX {print "ALREADY RUNNING"} ' PID=$MYPID PPID=$MYPPID XXX=$MY_NAME } | while read LINE; do if [ "$LINE" = "ALREADY RUNNING" ]; then echo "`date '+%D %H:%M'`: Exiting. Already running." return 1 fi done
return 0
}
Note: This is in a ksh function, hence the "return 1" rather than an exit. (Good, modular programming and all that, you know.)
Your script should have a trap function to remove the lock_file and any other needed cleanup if it gets a trappable signal. With the number of databases I need to check, and the fact that the free space check uses a significant amount of time, the script will usually run for about 2 minutes (give or take). But if the network is super-slow, or the target computers under extreme load, it is possible that the script will take longer. The script runs every 12 minutes (that's 5 times per hour) as a cron job.
In order to provide redundancy, the script is written so that when it gets its config info, it learns if it is on the primary box, or the backup box. In both cases, it tests the other box to see if it should continue with the database tests, or let the other box do it (since I am the secondary and the primary is up). It sends a notification if it sees a problem with the other box.
To test the database, the script connects to the database as system via sqlplus and selects the name of the database from v$database to see if it gets what it expects to get. The monitoring script runs on a box that is on a different subnet than any of the databases; so, in effect, I am implicitly monitoring the network (sort of) too. This also (sort of) monitors the listener setup on the remote box and the accuracy of my tnsnames.ora file. The philosophy here is one of trying to kill as many birds with one stone as I can. If I get the name of the database OK, then I run an anonymous block that looks at the free space to see if there are any tables with a NEXT extent greater than any free extent left in the tablespace. That is all that is monitored. You can get carried away with this and make the script so fat that it becomes unwieldy and difficult to manage. The test is as follows:
## Try to connect to database
## The sqlplus connection test is run in the background in case it hangs.
## Because it is running in the background, edits to the sid_status file
## by the background process cannot be done since we might have multiple
## background processes trying to edit the file at the same time.
## So we must APPEND to an intermediate test_results file.
## Then do the sid_status edits later.
(
## Start by assuming we have unknown error (status=2), ## then prove otherwise. STATUS=2 { sqlplus -s <<-XXX ${LOCAL_DB_USER}/${LOCAL_DB_PASS}@${LOCAL_SID} whenever sqlerror exit failure whenever oserror exit failure set serveroutput on set heading off set feedback off select name from v\$database; declare cursor c1 is select tablespace_name from dba_tablespaces; v_maxspace number := 0; v_maxnext number := 0; v_maxnext_t number := 0; v_maxnext_i number := 0; begin dbms_output.enable(200000); for tspace in c1 loop select max(bytes) into v_maxspace from dba_free_space where tablespace_name = tspace.tablespace_name; select nvl(max(next_extent),0) into v_maxnext_t from dba_tables where tablespace_name = tspace.tablespace_name; select nvl(max(next_extent),0) into v_maxnext_i from dba_indexes where tablespace_name = tspace.tablespace_name; if v_maxnext_t > v_maxnext_i then v_maxnext := v_maxnext_t; else v_maxnext := v_maxnext_i; end if; if v_maxnext > v_maxspace then dbms_output.put_line('BLOWOUT IMMINENT'); end if; end loop; end; / exit; XXX } | sed '/^$/d; s/^[ ]*//g' | while read LINE; do parse and analyze stuff ..... done
If you have replication set up, you should probably monitor the status of that; but I use a separate, more basic script for that and run it much less frequently.
Since, for a number of reasons, sqlplus can hang (usually, database is hung or waaaaaaay slow, or on Linux, if network is broke or remote box is down), I take the entire chunk of the script that tests the database and run it in the background. These are kicked off one after the other, so all databases are tested simultaneously. Since the driving script can't get the output from all these, they write to an output file. Then, when all the background tests have finished, the script takes the output file and uses it to update the file that records the status of the database at the last test and the status at this test. Since, when I wrote the script, I did not know all the possible statuses I might get, these statuses are recorded as numbers which represent an index of a ksh array. As I discovered new things that could be wrong, I just added to the array. Currently, the array looks like:
STATUS_LIST[0]='OK' STATUS_LIST[1]='TNSPING FAILED' STATUS_LIST[2]='UNSPECIFIED ERROR' STATUS_LIST[3]='SQLPLUS HUNG' STATUS_LIST[4]='NEXT EXTENT GREATER THAN AVAILABLE' STATUS_LIST[5]='ORACLE NOT AVAILABLE' STATUS_LIST[6]='FAILED TO ALLOCATE SHARED MEMORY' STATUS_LIST[7]='STARTUP OR SHUTDOWN IN PROGRESS' STATUS_LIST[8]='LISTENER COULD NOT RESOLVE SID' STATUS_LIST[9]='TNS COULD NOT RESOLVE NAME' STATUS_LIST[10]='NO LISTENER' STATUS_LIST[11]='PROTOCOL ADAPTER ERROR' STATUS_LIST[12]='MAX PROCESSES EXCEEDED' STATUS_LIST[13]='INVALID USERNAME/PASSWORD' STATUS_LIST[14]='NO APPROPRIATE SERVICE HANDLER' STATUS_LIST[15]='FAILED TO START DEDICATED SERVER' STATUS_LIST[16]='IO ERROR READING FILE' STATUS_LIST[17]='FILE CANNOT BE READ' STATUS_LIST[18]='ERROR WRITING TO AUDIT TRAIL' STATUS_LIST[19]='SNAPSHOT TOO OLD' STATUS_LIST[20]='SHUTDOWN IN PROGRESS' STATUS_LIST[21]='IMMEDIATE SHUTDOWN IN PROGRESS' STATUS_LIST[22]='END OF FILE ON COMMUNICATION' STATUS_LIST[23]='ARCHIVER ERROR. CONNECT INTERNAL ONLY' STATUS_LIST[24]='ORA-00600 INTERNAL ERROR' STATUS_LIST[96]='BLANK STATUS' STATUS_LIST[97]='INVALID SQLPLUS OPTION' STATUS_LIST[98]='JUST TESTING' STATUS_LIST[99]='CONFIG FILE ERROR'
The matching part of the script that conducts the tests has an awk script like:
STATUS=`echo "$LINE" | awk ' substr($0,1,3) == substr(LOCAL_SID,1,3) {print "0"; exit} $0 == "BLOWOUT IMMINENT" {print "4"; exit} /^ORA-01034/ {print "5"; exit} /^ORA-04031/ {print "6"; exit} /^ORA-01033/ {print "7"; exit} /^ORA-12505/ {print "8"; exit} /^ORA-12514/ {print "9"; exit} /^ORA-12541/ {print "10"; exit} /^ORA-12560/ {print "11"; exit} /^ORA-00020/ {print "12"; exit} /^ORA-01017/ {print "13"; exit} /^ORA-12519/ {print "14"; exit} /^ORA-12500/ {print "15"; exit} /^ORA-01115/ {print "16"; exit} /^ORA-00376/ {print "17"; exit} /^ORA-02002/ {print "18"; exit} /^ORA-01555/ {print "19"; exit} /^ORA-01090/ {print "20"; exit} /^ORA-01089/ {print "21"; exit} /^ORA-03113/ {print "22"; exit} /^ORA-00257/ {print "23"; exit} /^ORA-00600/ {print "24"; exit} /[Ee][Rr][Rr][Oo][Rr]/ {print "2"; exit} /^SP2-0306/ {print "97"; exit} ## If no matches above then keep current value of status. {print STATUS} ' LOCAL_SID="$LOCAL_SID" STATUS="$STATUS"`
Note: the awk script is part of that "parse and analyze" section of the while loop above.
If you're wondering about the "BLOWOUT IMMINENT", that's what the anonymous block returns that tests for free space.
After all these background processes are kicked off, the script runs the "jobs" command every 30 seconds, and keeps doing so until all the jobs have completed; or the maximum allotted time for the jobs has been reached; at which point, the jobs are killed and the status for those databases is logged as SQLPLUS HUNG. Here's what that looks like:
CHECK_JOBS_FOR_TIMEOUT () {
MAX_INDEX=$1
CUR_INDEX=1
## Give sqlplus sessions plenty of time to do their thing. ## The background SQL stuff might take a while to complete. ## Adjust how long the script waits by changing the value in the whilecondition
COUNT=1
while [ $COUNT -le 8 ]; do
## The first jobs command is to clear out any "jobs completed" messages.
jobs > /dev/null if [ -z "`jobs`" ]; then break; fi sleep 30 COUNT=$(( $COUNT + 1 ))
## Kill any remaining jobs.
for JOB_NUMBER in `jobs | sed 's/\([^0-9]*\)\([1-9][0-9]*\)\(.*\)/\2/'`;
do
kill %${JOB_NUMBER} echo "killed job number $JOB_NUMBER" >> "$LOG_FILE"done
## Then make note of any sqlplus sessions that did not appear to finish
## by checking to see if there is an entry in the test_results file.
## Note: Because some jobs can (and usually do) finish before other jobs
are
## started, the job numbers get recycled. So you can't match the index
in
## the JOB_SID array to its corresponding job number from the jobs
command.
while [ $CUR_INDEX -le $MAX_INDEX ]; do
## Read the test_results file to see if test for SID completed. SID_WAS_FOUND=`egrep -w \^${JOB_SID[$CUR_INDEX]} "$TEST_RESULTS"` ## If test got hung up and didn't complete, then record status 3. if [ -z "$SID_WAS_FOUND" ]; then ## Use status=3 for sqlplus hung. echo "${JOB_SID[$CUR_INDEX]} 3" >> "$TEST_RESULTS" echo "NOT DONE: Sqlplus did not complete on ${JOB_SID[$JOB_NUMBER]}" >> "$LOG_FILE" fi CUR_INDEX=$(( $CUR_INDEX + 1 ))
return 0
}
The majority of the script are checks looking for Murphy's Law (Paranoia is good!) rather than actually testing the databases. I'll include the "main" section of the script so you get a good idea of what is being done.
ONE VERY IMPORTANT NOTE: The box that runs this is an old 166 Mhz Pentium running Red Hat 7.1. The ksh that comes with Linux is crap. However, the fine folk in charge of the genuine ksh93 have made it available for download. Also, if you (as I did) write your script on one of the non-open source Unixes (i.e. Solaris, AIX, etc) DO NOT use good programming practice and use absolute paths for utility names. The egotistical propeller-heads in charge of Linux have decided, in their Great Wisdom, that the Unixes of the past didn't do things the way they should have; so now the mantle for fixing all the world's problems is resting on the shoulders of all-knowing, all-seeing, gurus of Linux. And you lowly pee-ons, who are just trying to get work done, should be a grateful recipient of their wisdom. If you are used to using nawk, stop. The unlimited intelligence of the Linux Propeller-heads has divined that you should no longer be using it ... sorry if it breaks your stuff ... but that's the price of "progress". (Hey guys. When, you take time off from sun-burning your tonsils long enough to fix ksh, get back to me. Oh, but dear me! I shouldn't be writing ksh scripts!)
######################### BEGIN MAIN ########################SET_VARIABLES
CREATE_LOCK_FILE
if [ $? -ne 0 ]; then exit $?; fi
CHECK_OTHER_BOX
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi
MAINTAIN_ATTIC
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi
MAINTAIN_OUTPUT_FILES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi
CHECK_CONFIG_FILES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi
SET_MORE_VARIABLES
if [ $? -ne 0 ]; then
rm "${ADMIN_DIR}/lock_file"
exit $?
fi
#------------- begin loop to build SID_LIST
SID_LIST=''
## The following line creates environment variables DEFINED in config file
(db_times)
## Note: the readability of the TIME_FILE was tested in CHECK_CONFIG_FILES
eval `sed -n '/^DEFINE/s/\(DEFINE *\)\(.*\)/export \2/p' "$TIME_FILE"`
## BUILD LIST OF SIDS TO TEST
## The following sed statement is multiple lines.
sed -n '
s/=/ /g
/^SID/s/\(SID *\)\(.*\)/\2/p
' "$TIME_FILE" | while read SID TIME_LIST; do
## If there is no entry for the SID in the sid_status file,
## create a default entry.
ADD_TO_STATUS_FILE "$SID"
if [ $? -ne 0 ]; then continue; fi
## Check that the monitoring time entries in the db_times file are legal
VALIDATE_TIME_LIST $SID $TIME_LIST
if [ $? -ne 0 ]; then continue; fi
## See if we are monitoring this SID at this time.
GET_TIME_TO_MONITOR
if [ $? -ne 0 ]; then continue; fi
## Skip any SIDS that have a config file error.
## Note that this is the only error that goes directly to the sid_status
file.
## This error is generated by the previous function calls.
## All other test output after this goes to the test_results first prior
## to being put into the sid_status file.
CONF_STAT=`awk '$1 == SID {print $3}' SID="$SID" "$SID_STATUS"`
if [ "$CONF_STAT" = "99" ]; then continue; fi
## If we are monitoring this SID, then add it to the list.
SID_LIST="${SID_LIST} ${SID}"
done
#------------- end loop to build SID_LIST
#------------- begin loop to test SIDs
JOB_INDEX=1
for SID in $SID_LIST; do
## Build an array of SIDs that we are testing.
JOB_SID[$JOB_INDEX]="$SID"
JOB_INDEX=$(( $JOB_INDEX + 1 ))
## This function will set the variable DB_PASS for the user SYSTEM.
GET_SYSTEM_PASSWORD "$SID"
if [ $? -ne 0 ]; then continue; fi
## This function will set the variable CONNECT which is the SQLNET
connect string.
GET_TNS_ENTRY "$SID"
if [ $? -ne 0 ]; then continue; fi
## See if we can sqlplus to the SID and do a simple select.
TEST_THE_SID "$DB_USER" "$DB_PASS" "$SID"
if [ $? -ne 0 ]; then continue; fi
## Give the OS a little time between starting SQL Plus process.
## Maybe this will prevent bogus SQL Plus error messages (SP2-0306)
sleep 2
done
#------------- end loop to test SIDs
## See if there are any sqlplus sessions that did not complete by checking
## if there are test results for the session in the test_results file.
MAX_INDEX=$(( $JOB_INDEX - 1 ))
CHECK_JOBS_FOR_TIMEOUT $MAX_INDEX
PUT_TEST_RESULTS_INTO_STATUS_FILE
CREATE_HTML_FILE
PAGE_AND_EMAIL_PEOPLE
if [ $TIME -ge $MAINT_TIME -a $TIME -lt $(( $MAINT_TIME + $CRON_INTERVAL ))
]; then
SEND_DAILY_SUMMARY
fi
## If there is a secondary, fail-over box, update it with current info. if [ "`hostname`" = "$PRIMARY_BOX" -a -n "$SECONDARY_BOX" ]; then
RCP_SID_STATUS
fi
rm "${ADMIN_DIR}/lock_file"
exit 0
######################### END MAIN ########################
-- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Stephen Lee INET: slee_at_dollar.com Fat City Network Services -- 858-538-5051 http://www.fatcity.com San Diego, California -- Mailing list and web hosting services --------------------------------------------------------------------- To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).Received on Fri Nov 01 2002 - 11:15:23 CST
![]() |
![]() |