PRCR-1079 : Failed to start resource oranode1-vip. CRS-2680: Clean failed. CRS-5804: Communication error with agent process
Posted by Kamran Agayev A. on April 15th, 2019
Last week we had a clusterware issue on one of our critical 3-node RAC environments. On the first node, the network resource was restarted, which ended up killing all sessions on that node abnormally. The Oracle VIP that was running on that node failed over to the third node. The first node was up and running, but it didn't accept connections, because the instance was trying to register with the listener through the LOCAL_LISTENER parameter, which pointed to oranode1-vip, and that VIP was no longer running on the node. We tried to relocate the VIP back to the first node, but the operation failed because the clusterware couldn't stop it. Every time we tried to stop or relocate it, the clean process started and then failed within a few minutes.
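For context, this is roughly how the problem looks from the database side; the parameter value and output shown here are illustrative, not copied from our system:

db-bash-$ sqlplus / as sysdba
SQL> show parameter local_listener

NAME            TYPE    VALUE
--------------- ------- ------------------------------------------------------
local_listener  string  (ADDRESS=(PROTOCOL=TCP)(HOST=oranode1-vip)(PORT=1521))

SQL> alter system register;

System altered.

The statement completes, but PMON/LREG still cannot reach the address behind LOCAL_LISTENER while the VIP is down, so the local listener keeps reporting no services for that instance and the node does not accept new connections.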
Neither Oracle Support nor we found any useful information in the clusterware log files. Although two instances were still up and running, the load was so high that they could barely handle all the connections. Pinging oranode1-vip succeeded, but we weren't able to stop the resource even in force mode. We couldn't start it either, because it had neither stopped nor been cleaned up successfully. The status was "enabled" and "not running", yet the VIP still answered ping:
db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled
VIP oranode1-vip is not running
db-bash-$
From the crsctl stat res output we could see that the resource was OFFLINE and had failed over to node03:
db-bash-$ crsctl stat res -t
oranode1-vip
      1        OFFLINE   UNKNOWN      node03
And it failed when we tried to start it:
db-bash-$ srvctl start vip -i oranode1-vip
PRCR-1079 : Failed to start resource oranode1-vip
CRS-2680: Clean of 'oranode1-vip' on 'node03' failed
CRS-5804: Communication error with agent process
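Since CRS-5804 points at the agent process, the logs worth checking are the CRSD log and the root agent log (the VIP resource is managed by orarootagent). The paths below assume a pre-12.2 Grid Infrastructure layout; on newer releases these logs live under the ADR in $ORACLE_BASE/diag/crs. As mentioned above, in our case none of them contained anything readable:

$GRID_HOME/log/<hostname>/alert<hostname>.log
$GRID_HOME/log/<hostname>/crsd/crsd.log
$GRID_HOME/log/<hostname>/agent/crsd/orarootagent_root/orarootagent_root.log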
We cleared the socket files of the first node from the /var/tmp/.oracle folder, restarted the CRS and checked whether the VIP failed back, but it didn't. Support then asked us to stop the second node as well, clear its socket files and start it again to see if anything changed, but we didn't do that, because a single node wouldn't have been able to handle all the connections.
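For completeness, the clean-up on the first node looked roughly like this (run as root; treat it as a sketch, since the exact steps can differ between platforms and Grid Infrastructure versions):

# as root on the first node
crsctl stop crs -f
mv /var/tmp/.oracle /var/tmp/.oracle.bak    # move the stale socket files aside instead of deleting them
crsctl start crs

Moving the directory aside rather than removing it keeps the old socket files around in case they are needed later.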
In the end, we checked the virtual IP interface at the OS level and found it on node03:
db-bash-$ netstat -win
lan900:805 1500 #### #### 2481604 0 51 0 0
Instead of restarting the CRS on the production node (which takes about 10 minutes), we decided to bring that interface down at the OS level. On HP-UX this is done with the ifconfig … down command.
Before running this command in the production environment, we tried it in the test environment and realized that the down parameter alone is not enough: you also have to pass the 0.0.0.0 IP address along with the down parameter to remove that logical interface. So we ran the following command to bring it down:
ifconfig lan900:805 0.0.0.0 down
The interface disappeared from the list. Next, we started the VIP using the srvctl start vip command, and this time it succeeded!
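The final checks looked roughly like this; the node name in the last line is an assumption, and the exact wording of the status output may differ between versions:

db-bash-$ netstat -win | grep "lan900:805"    # no output: the stale logical interface is gone
db-bash-$ srvctl start vip -i oranode1-vip
db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled
VIP oranode1-vip is running on node: node01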
Lessons learned:
- Perform all actions in a test environment first (especially if you are not sure what can happen) before trying them in production
- Don't rush to "restart" or "reboot" the instance, the cluster or the node. Sometimes it simply doesn't solve the problem, and after a restart the system may not even come up correctly (because of changed parameters, configuration and so on)
- Within 24 hours, our severity 1 SR was assigned to 6 different engineers. It takes a lot of time to gather log files, upload them and have them reviewed by an Oracle engineer before his or her shift changes. Sometimes you simply don't have time to wait for an answer from Oracle; you have to act on your own and take all the risks. That requires experience.