Today I got a call from a friend who reported performance degradation on one of the production databases. Connecting with SQL*Plus or RMAN was noticeably slow, so I ran the “top” command and checked the processes running on the system. When I ran ps -ef, I saw hundreds of perl processes running:
[sourcecode]oracle 15560 1 3 Jan11 ? 05:50:07 /opt/oracle/product/10.2/db_1/perl/bin/perl /opt/oracle/product/10.2/db_1/sysman/admin/scripts/db/dbresp.pl
oracle 16309 1 3 Jan11 ? 05:44:53 /opt/oracle/product/10.2/db_1/perl/bin/perl /opt/oracle/product/10.2/db_1/sysman/admin/scripts/db/dbresp.pl
…..
…..[/sourcecode]
Since dbresp.pl is located under the sysman folder, I figured it was related to EM, so I checked the EM agent trace file:
[sourcecode]tail -50 emagent.trc | more
2011-01-11 08:51:37 Thread-4096777120 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2011-01-11 08:51:37 Thread-4096777120 ERROR command: failed to kill process 24963 running perl: (errno=3: No such process)
2011-01-11 08:51:37 Thread-4096777120 ERROR engine: [oracle_database,prod_db,Response] : nmeegd_GetMetricData failed : Metric execution timed out in 600 seconds
2011-01-11 09:06:37 Thread-4113513376 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2011-01-11 09:06:37 Thread-4113513376 ERROR command: failed to kill process 25393 running perl: (errno=3: No such process)
2011-01-11 09:06:37 Thread-4113513376 ERROR engine: [oracle_database,prod_db,Response] : nmeegd_GetMetricData failed : Metric execution timed out in 600 seconds
2011-01-11 09:21:37 Thread-4096777120 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2011-01-11 09:21:37 Thread-4096777120 ERROR command: failed to kill process 26068 running perl: (errno=3: No such process)
2011-01-11 09:21:37 Thread-4096777120 ERROR engine: [oracle_database,prod_db,Response] : nmeegd_GetMetricData failed : Metric execution timed out in 600 seconds
2011-01-11 09:36:37 Thread-4099926944 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds[/sourcecode]
Wow… Interesting output. Every 15 minutes the Response metric times out after 600 seconds, and the agent then fails to kill the perl process it was running. I checked Metalink and found the following note: Server Has 100% Of Cpu Because Of Dbresp.pl [ID 764140.1]
Unfortunately, as a solution the note referred me to another Metalink note, “Ext/Mod Problem Performance Agent High CPU Consumption Gen”, which says to change the alert.log file name to resolve the issue. That wasn’t a real solution for me, so I decided to take down EM and kill all the processes:
[sourcecode]emctl stop dbconsole[/sourcecode]
Then I ran the following command, which lists all dbresp.pl processes and generates a script that kills them:
[sourcecode]ps -ef | grep dbresp.pl | grep -v grep | awk '{print "kill -9 " $2}' > kill.sh
more kill.sh
kill -9 23989
kill -9 24569
kill -9 25145
kill -9 25723
…..
…..[/sourcecode]
Next, I made it executable and ran it:
[sourcecode]oracle@host:~> chmod 755 kill.sh
oracle@host:~> ./kill.sh
oracle@host:~>
oracle@host:~> ps -ef | grep dbresp
oracle 32454 29520 0 10:48 pts/0 00:00:00 grep dbresp [/sourcecode]
After killing all unnecessary processes, CPU usage went down.
To work around this bug, you can use a cron job that checks the number of running dbresp.pl processes and, when there are too many, takes down EM, kills all of them and starts EM again.
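A minimal sketch of such a cron-driven check could look like the script below. The threshold of 20, the RUN_WATCHDOG switch and the function names are my own assumptions for illustration. They do not come from the Metalink note, so adjust them to your system.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: THRESHOLD, RUN_WATCHDOG and the function
# names are assumptions made for this example.
THRESHOLD=${THRESHOLD:-20}

# Count running dbresp.pl processes. The [d] in the pattern keeps grep from
# matching its own command line. grep -c exits non-zero when the count is 0,
# so "|| true" keeps the function's exit status clean.
count_dbresp() {
    ps -ef | grep -c '[d]bresp\.pl' || true
}

# Succeed only when the given count is above the threshold.
needs_cleanup() {
    [ "$1" -gt "$THRESHOLD" ]
}

# Same steps as done by hand above: stop EM, kill leftovers, start EM again.
cleanup_dbresp() {
    emctl stop dbconsole
    for pid in $(ps -ef | grep '[d]bresp\.pl' | awk '{print $2}'); do
        kill -9 "$pid"
    done
    emctl start dbconsole
}

# Do the actual work only when RUN_WATCHDOG=1, so the functions can be
# sourced and tried out without touching EM.
if [ "${RUN_WATCHDOG:-0}" = "1" ]; then
    n=$(count_dbresp)
    if needs_cleanup "$n"; then
        cleanup_dbresp
    fi
fi
```

Saved under a hypothetical path such as /home/oracle/dbresp_watchdog.sh, it could be scheduled with a crontab entry like `*/15 * * * * RUN_WATCHDOG=1 /home/oracle/dbresp_watchdog.sh`. Cron runs with a minimal environment, so the script would also need ORACLE_HOME, ORACLE_SID and PATH set for emctl to work.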
If you have another solution, please let me know.