Kamran Agayev's Oracle Blog

Oracle Certified Master

Archive for April, 2019

Second OCM exam is cleared. New book and online course are on the way

Posted by Kamran Agayev A. on 23rd April 2019

Two months ago, after long preparation, I decided to upgrade my OCM certification and registered for the exam in Shanghai. A few years ago, when I cleared the 10g OCM exam, I started preparing for the upgrade right away. I did a lot of research and hands-on practice, and then thought it would be great to collect everything I had in a single book. It took me almost 2 years to publish the book. A few months after it was published, I started getting emails from readers about how the book had helped them during their preparation, and I was happy to see them passing the exam! With a lot of different projects going on in those days, I didn’t manage to take the exam myself. And unfortunately, the 11g OCM 1-day exam was retired, which meant that I had to take a 2-day exam again. But that was OK: if it was the only option, there was nothing else I could do.

I will not talk about how hard my trip was, but eventually the exam day arrived. The exam consisted of 9 sections over 2 days, with a lot of different practical tasks. I also wouldn’t like to go into more detail about the questions and so on, but what I realized was that the book I had published even before taking the OCM 11g exam covered almost everything I encountered during the exam 😊 Reviewing topics directly from my book helped me stay confident during the exam.

A few weeks passed, and I got a happy email from Oracle: I had passed the exam and become a 2x OCM. Now it’s time for the third and last one )) And it means that I’ve already started my preparation, along with a new book which will be published in a few months.

For those of you who want to clear the OCM 11g exam: believe it or not, my book covers almost all the topics. And after clearing the second OCM exam, I decided to start an online course to help you with your preparation individually. So stay tuned, and I will announce the course information shortly 😊

Posted in Uncategorized | 2 Comments »

PRCR-1079 : Failed to start resource oranode1-vip. CRS-2680 Clean failed. CRS-5804: Communication error with agent process

Posted by Kamran Agayev A. on 15th April 2019

Last week we had a clusterware issue on one of our critical 3-node RAC environments. On the first node, the network resource was restarted, which ended up abnormally killing all sessions on that node. The Oracle VIP that was running on that node failed over to the third node. The first node was up and running, but didn’t accept connections, because the instance was trying to register using the LOCAL_LISTENER parameter, which specified oranode1-vip, and that VIP was no longer running on that node. We tried to relocate the VIP back to the first node, but that failed because it couldn’t be stopped. Every time we tried to stop or relocate it, the clean process started and failed within a few minutes.

Neither Oracle Support nor we found any useful information in the clusterware log files. Although 2 instances were up and running, the load was so high that they could barely handle all the connections. A ping to oranode1-vip succeeded, but we weren’t able to stop the VIP, even in force mode. We couldn’t start it either, because it hadn’t stopped successfully and couldn’t be cleaned up. The status was “enabled” and “not running”, but ping was OK:

db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled 
VIP oranode1-vip is not running 
db-bash-$

The crsctl stat res command showed that the VIP was OFFLINE and had failed over to node03:

db-bash-$ crsctl stat res -t
oranode1-vip  1 OFFLINE UNKNOWN node03

And it failed when we tried to start it:

db-bash-$ srvctl start vip -i oranode1-vip
PRCR-1079 : Failed to start resource oranode1-vip
CRS-2680: Clean of 'oranode1-vip' on 'node03' failed
CRS-5804: Communication error with agent process

We cleared the socket files of the first node from the /var/tmp/.oracle folder, restarted CRS, and checked whether the VIP failed back, but it didn’t. Support asked us to stop the second node, clear its socket files, and start it again to see if anything changed, but we didn’t do it, because a single node wouldn’t have been able to handle all the connections.

In the end, we checked for the virtual IP’s interface at the OS level, and found it on node03:

db-bash-$ netstat -win
lan900:805 1500 #### #### 2481604 0 51 0 0
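As a rough sketch of how such a leftover interface could be spotted more quickly, the netstat output can be filtered for logical (aliased) interfaces, i.e. names containing a colon such as lan900:805. The helper name list_logical_ifaces is hypothetical and not part of any Oracle or HP-UX tooling; it only assumes that the interface name is the first column, as in the output above.

```shell
# Hypothetical helper (an assumption for illustration, not an Oracle/HP-UX tool):
# reads "netstat -in"-style output on stdin and prints the names of logical
# (aliased) interfaces, i.e. first-column values that contain a colon.
list_logical_ifaces() {
  awk '$1 ~ /:/ { print $1 }'
}

# Usage on the node where the VIP is suspected to be stuck:
#   netstat -win | list_logical_ifaces
```

Comparing that list with the VIPs that the clusterware reports as running on the node should make an orphaned alias stand out.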

 

Instead of restarting CRS on the production database (which takes 10 minutes), we decided to bring that interface down at the OS level. On HP-UX, this is done with the ifconfig … down command.

Before running this command on the production environment, we tried it on the test environment and realized that the down parameter alone is not enough. We had to provide the 0.0.0.0 IP address along with the down parameter to bring the interface down. So we ran the following command:

ifconfig lan900:805 0.0.0.0 down

And it disappeared from the list. Next, we started the VIP using the srvctl start vip command, and it succeeded!
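The whole recovery can be summarized in a small dry-run sketch. It uses only the commands shown in this post. The wrapper function recover_vip and the DRY_RUN variable are hypothetical names introduced for illustration, and by default nothing is executed: the steps are only printed so they can be reviewed (and tried on a test environment) first.

```shell
# Sketch only, under the assumptions stated above: recover_vip and DRY_RUN
# are hypothetical. The steps are printed unless DRY_RUN=0 is set explicitly.
recover_vip() {
  iface="$1"   # stale logical interface, e.g. lan900:805
  vip="$2"     # VIP resource name, e.g. oranode1-vip
  for cmd in \
    "ifconfig $iface 0.0.0.0 down" \
    "srvctl start vip -i $vip" \
    "srvctl status vip -i $vip"
  do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $cmd"
    else
      # Stop at the first failing step instead of continuing blindly
      $cmd || return 1
    fi
  done
}

# Usage (dry run, prints the three steps):
#   recover_vip lan900:805 oranode1-vip
```

Keeping the default as a dry run is deliberate: the ifconfig step has to be run on the node that is wrongly holding the interface (node03 in our case), and it is worth double-checking the interface name before bringing anything down on production.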

Lessons learned:

  • Perform all actions on the test environment first (if you are not sure what can happen) before trying them on the production environment
  • Don’t rush to “restart” or “reboot” the instance, the cluster or the node. Sometimes it just doesn’t solve your problem, and after a restart the system may not even start up correctly (because of changed parameters, configuration, etc.)
  • In 24 hours, the severity #1 SR was assigned to 6 different engineers. It takes a lot of time to gather log files, submit them and have them reviewed by an Oracle engineer before his/her shift ends. Sometimes you just don’t have time to wait for an answer from Oracle; you have to do it on your own and take all the risks. That requires experience.

Posted in RAC issues | No Comments »