Kamran Agayev's Oracle Blog

Oracle Certified Master

Archive for the 'RAC issues' Category

Real Application Clusters

Investigation on why database doesn’t start after successfully dropping a diskgroup

Posted by Kamran Agayev A. on 24th December 2020

A few months ago, while performing a storage migration, I faced an interesting issue that could have led to downtime had I not noticed a hidden warning in a log file.

The plan was to create a new ASM diskgroup with normal redundancy using two disks from different storage arrays, simulate a disk crash, and confirm that there would be no data loss if one of the arrays failed. After creating the diskgroup, creating a test tablespace on it and corrupting the header of one of the disks, everything looked fine, so we decided to drop the diskgroup and start adding the new disks as failgroups to the other diskgroups.
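For reference, such a diskgroup can be created roughly as follows (a minimal sketch; the diskgroup name and disk paths here are hypothetical and simply illustrate one disk from each storage array in its own failgroup):

SQL> create diskgroup DGTEST normal redundancy
       failgroup FG_STORAGE1 disk '/dev/asm-stor1-disk1'
       failgroup FG_STORAGE2 disk '/dev/asm-stor2-disk1';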

 

Below is a scenario I created in my test environment that reproduces the same problem.

  • First of all, I get the locations of the controlfiles and datafiles (and, of course, the redo log files) to determine which diskgroups contain physical files:

 

SQL> show parameter control
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string      +CFILE2/TESTDB/CONTROLFILE/current.256.1046097231

SQL> select name from v$datafile;
NAME
--------------------------------------------------------------------------------
+DATA/TESTDB/DATAFILE/system.278.1046088963
+DATA/TESTDB/DATAFILE/sysaux.277.1046088795
+DATA/TESTDB/DATAFILE/undotbs1.280.1046089213
+DATA/TESTDB/DATAFILE/undotbs2.288.1046089391
+DATA/TESTDB/DATAFILE/users.279.1046089209
SQL>

 

As you can see, two diskgroups are involved: +CFILE2 and +DATA. Next, I run the srvctl config database command and grep for the list of diskgroups used by this database. We see the same two – +CFILE2 and +DATA:

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2
-bash-4.1$

 

  • Next, I query the V$ASM_DISKGROUP view to get the list of all diskgroups available in ASM:
SQL> col name format a40
SQL> set linesize 150

 

SQL> select group_number, name, state, type, total_mb, free_mb from v$asm_diskgroup;
GROUP_NUMBER NAME                                     STATE       TYPE     TOTAL_MB    FREE_MB
------------ ---------------------------------------- ----------- ------ ---------- ----------
           4 TESTDG                                   MOUNTED     EXTERN       1019        923
           3 DATA                                     CONNECTED   EXTERN      15342       8239
           1 CFILE2                                   CONNECTED   EXTERN       1019        892
SQL>

 

  • We have three diskgroups – +CFILE2, +DATA and +TESTDG. Next, I create a new tablespace in the +TESTDG diskgroup so that it becomes part of the database configuration:

 

SQL> create tablespace mytbs datafile '+TESTDG' size 10m;
Tablespace created.
SQL>

 

  • Once the tablespace is created in the new diskgroup, the diskgroup becomes part of the database configuration and a dependency is established between the database and the diskgroup, as can be seen in the database alert.log file:

 

Alert.log file
Wed Jul 29 09:02:47 2020
create tablespace mytbs datafile '+TESTDG' size 10m
Wed Jul 29 09:02:48 2020
NOTE: ASMB mounting group 4 (TESTDG)
NOTE: Assigning number (4,0) to disk (/dev/asm-disk5)
SUCCESS: mounted group 4 (TESTDG)
NOTE: grp 4 disk 0: TESTDG_0000 path:/dev/asm-disk5
Wed Jul 29 09:02:50 2020
NOTE: dependency between database testdb and diskgroup resource ora.TESTDG.dg is established
Completed: create tablespace mytbs datafile '+TESTDG' size 10m

 

 

  • Output of the ASM alert.log file:

 

Wed Jul 29 09:02:48 2020
NOTE: client testdb1:testdb:rac-scan mounted group 4 (TESTDG)
Wed Jul 29 09:02:49 2020
NOTE: Advanced to new COD format for group TESTDG

 

  • From the output of the crsd.trc file it can be seen that a hard dependency is created between the diskgroup and the database:

 

2020-07-29 09:02:50.015412 :UiServer:204928768: {1:32997:407} Container [ Name: UI_REGISTER
                API_HDR_VER:
                TextMessage[3]
                API_REGUPDATE_TAG:
                TextMessage[1]
                ASYNC_TAG:
                TextMessage[1]
                ATTR_LIST:
TextMessage[MANAGEMENT_POLICY=AUTOMATICSTART_DEPENDENCIES=+hard(ora.TESTDG.dg)+pullup(ora.TESTDG.dg)STOP_DEPENDENCIES=+hard(shutdown:ora.TESTDG.dg)]
                CLIENT:
                TextMessage[]
                CLIENT_NAME:
                TextMessage[Unknown process]
                CLIENT_PID:
                TextMessage[8543]
                CLIENT_PRIMARY_GROUP:
                TextMessage[oinstall]
                LOCALE:
                TextMessage[AMERICAN_AMERICA.AL32UTF8]
                NO_WAIT_TAG:
                TextMessage[1]
                QUEUE_TAG:
                TextMessage[1]
                RESOURCE:
                TextMessage[ora.testdb.db]
                UPDATE_TAG:
                TextMessage[1]
]
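Instead of digging through crsd.trc, the same information can be pulled directly from the resource profile (a quick sketch using the resource name from above; at this point the output should list ora.TESTDG.dg among the hard start dependencies):

-bash-4.1$ crsctl stat res ora.testdb.db -p | grep DEPENDENCIES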

 

– Now to see the new list of diskgroups which are part of the database configuration, we run the following command:

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2,TESTDG
-bash-4.1$

As you can see, the +TESTDG diskgroup is now also part of the database configuration. Next, to simulate a storage failure or disk crash, I corrupt the disk of the +TESTDG diskgroup using the dd command:

 

-bash-4.1$ dd if=/dev/zero of=/dev/asm-disk5 bs=1024 count=10000
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 0.125557 s, 81.6 MB/s
-bash-4.1$

– And check the alert.log file. Once it is detected that the disk of this external-redundancy diskgroup is corrupted, the database instance crashes:

 

Wed Jul 29 09:19:45 2020
USER (ospid: 27939): terminating the instance
Wed Jul 29 09:19:47 2020
Instance terminated by USER, pid = 27939

 

  • And in the alert.log file of the ASM instance, it can be seen that ASM tries to offline the disk and fails, because the diskgroup has external redundancy:

 

Wed Jul 29 09:19:49 2020
NOTE: SMON did instance recovery for group DATA domain 3
NOTE: SMON detected lock domain 4 invalid at system inc 6 07/29/20 09:19:49
NOTE: SMON starting instance recovery of group TESTDG domain 4 inc 6 (mounted) at 07/29/20 09:19:49
NOTE: SMON will attempt offline of disk 0 - no header
NOTE: cache initiating offline of disk 0 group TESTDG
NOTE: process _smon_+asm1 (5245) initiating offline of disk 0.3916011317 (TESTDG_0000) with mask 0x7e in group 4 (TESTDG) with client assisting
NOTE: initiating PST update: grp 4 (TESTDG), dsk = 0/0xe9699735, mask = 0x6a, op = clear
Wed Jul 29 09:19:49 2020
GMON updating disk modes for group 4 at 14 for pid 18, osid 5245
ERROR: disk 0(TESTDG_0000) in group 4(TESTDG) cannot be offlined because the disk group has external redundancy.
Wed Jul 29 09:19:49 2020
ERROR: too many offline disks in PST (grp 4)

 

  • Now, we try to start the database

 

-bash-4.1$ srvctl start database -d testdb
PRCR-1079 : Failed to start resource ora.testdb.db
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node1/crs/trace/crsd_oraagent_oracle.trc".

 

It fails, because it can't access the datafile that resides in the failed diskgroup. Here's the output of the trace file:

 

CRS-2674: Start of 'ora.testdb.db' on 'node1' failed
CRS-2632: There are no more servers to try to place resource 'ora.testdb.db' on that would satisfy its placement policy
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node2/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.testdb.db' on 'node2' failed

 

  • Output of alert.log file:

 

Wed Jul 29 09:22:15 2020
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_ora_28674.trc:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
ORA-1157 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:32997:676} */...
Wed Jul 29 09:22:17 2020
License high water mark = 1
Wed Jul 29 09:22:17 2020
USER (ospid: 28854): terminating the instance
Wed Jul 29 09:22:18 2020
Instance terminated by USER, pid = 28854

 

  • Next, we offline the datafile and restart the database:

 

SQL> alter database datafile 2 offline;
Database altered.
SQL>

 

-bash-4.1$ srvctl stop database -d testdb -stopoption abort
-bash-4.1$ srvctl start database -d testdb

 

The database is UP! Great! But… we have only solved the physical-file problem that was preventing the database from starting. We still have the failed diskgroup in the configuration of the database resource:

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2,TESTDG
-bash-4.1$

 

This means that once we restart the clusterware stack, the database resource will NOT start, because it has a hard dependency on a diskgroup that is part of its configuration and that has FAILED.

Let’s restart the crs and check the status of the database:

 

-bash-4.1# crsctl stop crs
-bash-4.1# crsctl start crs

 

  • From the output of the ASM alert.log file, it can be seen that ASM tried to mount the diskgroup and failed:

 

Wed Jul 29 09:41:09 2020
ERROR: ALTER DISKGROUP TESTDG MOUNT  /* asm agent *//* {1:42096:2} */
Wed Jul 29 09:41:09 2020

WARNING: Disk Group DATA containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "TESTDG" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15013: diskgroup "DATA" is already mounted

 

  • CRS is up
[root@node1 oracle]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@node1 oracle]#

 

  • Since we restarted CRS on the first node, the instance is not running there anymore, while it is still up on the second node – and it will go down as well on the next CRS or node restart.

 

[root@node1 oracle]# srvctl status database -d testdb
Instance testdb1 is not running on node node1
Instance testdb2 is running on node node2
[root@node1 oracle]#

 

  • If we try to restart the instance on the first node, it fails:

 

-bash-4.1$ srvctl start instance -d testdb -i testdb1
PRCR-1013 : Failed to start resource ora.testdb.db
PRCR-1064 : Failed to start resource ora.testdb.db on node node1
CRS-2674: Start of 'ora.TESTDG.dg' on 'node1' failed
-bash-4.1$

 

The following message appears in the ASM trace file when we try to start the instance:

 

Wed Jul 29 09:44:28 2020
ERROR: ALTER DISKGROUP ALL MOUNT FOR testdb /* asm agent *//* {1:42096:192} *//* incarnation::1*/

 

It's scary! You have a failed diskgroup that doesn't contain ANY physical files, yet it prevents you from starting the database instance, because the database resource depends on it. The only way out is to modify the database resource configuration and remove the diskgroup as follows:

 

-bash-4.1$ srvctl modify database -d testdb -diskgroup DATA,CFILE2

  • Now, if we check the crsd trace file again, we can see that only two diskgroups – +DATA and +CFILE2 – remain in the hard dependency:

 

2020-07-29 09:46:09.329870 :UiServer:2822174464: {1:42096:285} Container [ Name: UI_REGISTER
                API_HDR_VER:
                TextMessage[3]
                API_REGUPDATE_TAG:
                TextMessage[1]
                ATTR_LIST:
                TextMessage[START_DEPENDENCIES=hard(ora.DATA.dg,ora.CFILE2.dg) weak(type:ora.listener.type,global:type:ora.scan_listener.type,uniform:ora.ons,global:ora.gns) pullup(ora.DATA.dg,ora.CFILE2.dg)STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.CFILE2.dg)]
                CLIENT:
                TextMessage[]
                CLIENT_NAME:
                TextMessage[/usr/bin/java]
                CLIENT_PID:
                TextMessage[13981]
                CLIENT_PRIMARY_GROUP:
                TextMessage[oinstall]
                LOCALE:
                TextMessage[AMERICAN_AMERICA.US7ASCII]
                QUEUE_TAG:
                TextMessage[1]
                RESOURCE:
                TextMessage[ora.testdb.db]
                UPDATE_TAG:
                TextMessage[1]
]

 

To make sure it’s successfully modified, run the following command and check the output:

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2

 

– Now we should be able to start the instance:

-bash-4.1$ srvctl start instance -d testdb -i testdb1
-bash-4.1$

 

  • Output of the alert.log file
Wed Jul 29 09:47:30 2020
AQPC started with pid=55, OS id=14549
Starting background process CJQ0
Completed: ALTER DATABASE OPEN /* db agent *//* {1:42096:364} */

 

What I actually faced that night was that the diskgroup had been dropped successfully from ASMCA, but the hard dependency was not removed from the clusterware configuration (as the crsd trace file showed), and I decided not to restart CRS, suspecting that the database would not start up because of this dependency. The diskgroup was already empty – it contained no physical datafiles and was dismounted and dropped successfully – but its hard dependency in the database resource configuration was not removed, probably because of a bug. This means that if we had rebooted both nodes or restarted CRS after dropping the diskgroup, the database would not have started, and that would have led to downtime.

Lessons learned:

  • Make sure to check the alert.log files of the database and ASM instances, as well as the cluster log and trace files, after performing any change (even dropping a diskgroup in a production environment, and even if it succeeded).
  • After making a cluster-level change, restart CRS or even perform a node reboot to verify that everything is still okay after the change.
  • Don't stop the entire database. Restart CRS or the database instances in a rolling fashion, making sure at least one instance is available at all times.
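For example, a rolling restart of the two-node cluster used above would look roughly like this (a sketch; the crsctl commands are run as root, and the same steps are repeated on the second node only after the first instance is back up):

-bash-4.1$ srvctl stop instance -d testdb -i testdb1
-bash-4.1# crsctl stop crs
-bash-4.1# crsctl start crs
-bash-4.1$ srvctl start instance -d testdb -i testdb1
-bash-4.1$ srvctl status database -d testdb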

 

Posted in RAC issues | No Comments »

PRCR-1079 : Failed to start resource oranode1-vip. CRS-2680 Clean failed. CRS-5804: Communication error with agent process

Posted by Kamran Agayev A. on 15th April 2019

Last week we had a clusterware issue in one of our critical 3-node RAC environments. On the first node, the network resource was restarted, which abnormally killed all sessions on that node, and the Oracle VIP that was running there failed over to the third node. The first node was up and running, but it didn't accept connections, because the instance was registering with the listener using the LOCAL_LISTENER parameter, which pointed to oranode1-vip – a VIP that was no longer running on that node. We tried to relocate the VIP back to the first node, but the relocation failed because the resource could not be stopped. Every time we tried to stop or relocate it, the clean process started and failed after a few minutes.

Neither Oracle Support nor we could find any useful information in the clusterware log files. Although two instances were up and running, the load was so high that they could barely handle all the connections. Pinging oranode1-vip succeeded, but we were unable to stop the resource even in force mode. We couldn't start it either, because it had not stopped and cleaned up successfully. The status was "enabled" and "not running", yet the ping was still okay:

db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled 
VIP oranode1-vip is not running 
db-bash-$

From the crsctl stat res command we could see that it was OFFLINE and had failed over to node03:

 

db-bash-$ crsctl stat res -t
oranode1-vip  1 OFFLINE UNKNOWN node03

 

And it failed when we tried to start it:

db-bash-$ srvctl start vip -i oranode1-vip   
PRCR-1079 : Failed to start resource oranode1-vip  
CRS-2680: Clean of 'oranode1-vip  ' on 'node03' failed 
CRS-5804: Communication error with agent process

 

We cleared the socket files of the first node from the /var/tmp/.oracle folder, restarted CRS and checked whether the VIP failed back, but it didn't. Support asked us to stop the second node, clear its socket files and start it again to see if anything changed, but we didn't do that, because a single remaining node would not have been able to handle all the connections.
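For reference, the socket-file cleanup on the first node was roughly the following (a sketch; run as root, only with the clusterware stack on that node already stopped, and move the directory aside instead of deleting it):

crsctl stop crs -f
mv /var/tmp/.oracle /var/tmp/.oracle.bak
crsctl start crs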

In the end, we checked the virtual IP's interface at the OS level and found it still plumbed on node03:

db-bash-$ netstat -win
lan900:805 1500 #### #### 2481604 0 51 0 0

 

Instead of restarting CRS on the production node (which takes about 10 minutes), we decided to bring that interface down at the OS level. On HP-UX, that is the ifconfig … down command.

Before running this command in the production environment, we tried it in the test environment and realized that the down parameter alone is not enough: we had to assign the 0.0.0.0 IP address along with the down parameter to bring the interface down. So we ran the following command:

ifconfig lan900:805 0.0.0.0 down

And the interface disappeared from the list. Next, we started the VIP using the srvctl start vip command, and it succeeded!

Lessons learned:

  • Perform all actions in a test environment first (especially if you are not sure what might happen) before trying them in production.
  • Don't rush to "restart" or "reboot" the instance, cluster or node. Sometimes it simply doesn't solve your problem, and after the restart the system may not start up correctly (because of changed parameters, configurations, etc.).
  • Within 24 hours, our severity 1 SR was assigned to 6 different engineers. It takes a lot of time to gather log files, submit them and have them reviewed by an Oracle engineer before his or her shift changes. Sometimes you simply don't have time to wait for an answer from Oracle; you have to act on your own and take all the risks. That requires experience.

Posted in RAC issues | No Comments »

How to pass Oracle Database 12c: RAC and Grid Infrastructure Administration exam – 1Z0-068 and become Oracle Certified Expert

Posted by Kamran Agayev A. on 3rd May 2018

In this post I will talk about my journey of preparing for and passing the 12c RAC and Grid Infrastructure Administration exam.

 

About the exam

Check the following link to get more information about the exam from Oracle University page:

https://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=5001&get_params=p_exam_id:1Z0-068

 

The exam consists of 3 parts:

– Oracle 12c ASM Administration
– Oracle 12c Grid Infrastructure Installation and Administration
– Oracle 12c RAC Administration

 

I don't want to scare you, but the exam is quite hard. The bad part is that you fail the entire exam if you fail any one of the sections, which means you have to be well prepared for all three parts. In my case, I was good at ASM and RAC Administration, but not so comfortable with the Grid Infrastructure Installation and Administration part, which I barely passed.

You may be an Oracle high availability expert and still fail the exam. You might have plenty of experience and yet fail because of rarely used (or at least uncommon) features and topics that you never practiced, never read about, or read about only superficially, because most of the questions test theoretical knowledge rather than practical experience. I have been managing highly available cluster databases for the last 8 years, and it was still really hard to answer some questions about features I had never faced and never saw a reason to try.
There were a lot of questions like "Choose four options, where blah blah blah…", where you have to choose 4 options out of 7. You might know 3 of the correct answers, but that 1 wrong option can still cost you the question.

Remember, you have to achieve a minimum score in each of the 3 sections in order to pass. You might score 100% in two sections, fail the third one, and end up failing the entire exam.

 

How to prepare for the exam?

You have to read the documentation and play with ASM, RAC database and Grid Infrastructure A LOT!

If you want to learn Oracle 12c Grid Infrastructure installation, check the following video tutorial:

http://www.oraclevideotutorials.com/video/installing-oracle-12cr2-grid-infrastructure

 

Check the videos section in oraclevideotutorials.com to find out some clusterware related hands-on practices:

http://www.oraclevideotutorials.com/videos

 

The only available book related to the exam (mostly the RAC part) is the following one, well worth reading, written by my friends Syed Jaffar, Kai Yu and Riyaj Shamsudden:

Expert Oracle RAC 12c
https://www.amazon.com/Expert-Oracle-RAC-Experts-Voice/dp/1430250445/

 

In my OCM preparation book, I have two chapters that can help you during the preparation:

Chapter 7 – Grid Infrastructure and ASM

Chapter 8 – Real Application Clusters.

 

To get free trial pdf copy of the book, go to www.ocmguide.com , or purchase it from the following link:

https://www.amazon.com/Oracle-Certified-Master-Exam-Guide/dp/1536800791/

 

During the exam, I regretted having skipped some chapters of the documentation and having skimmed others only superficially. I highly recommend checking the ASM, RAC and Grid Infrastructure documentation and making sure you go through all of it at least once. Here are the links:

 

Real Application Clusters Administration and Deployment Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/racad/toc.htm

 

Clusterware Administration and Deployment Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cwadd/toc.htm

 

Automatic Storage Management Administrator’s Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/ostmg/toc.htm

 

Setting deadlines and booking the exam

Most of us (including me) postpone the exam and don't set deadlines for the preparation or for the exam itself. My advice: set an approximate exam date and make a plan for each month, week and day. Then pick a date and book the exam! Yes, book it – you still have the chance to rebook if you don't feel ready, as long as it's more than 24 hours before the exam. Registering weeks before the exam date will push you to complete your preparation on time.

 

I booked the exam for Tuesday, rebooked it to Wednesday, then to Thursday, and then to Friday :). On Wednesday I decided to reschedule it to the following Monday, and in the evening I was shocked to see that I had not actually rescheduled it to Friday – the exam was still set for the next day (Thursday), just a few hours away! :)

 

I didn't feel ready – I still had a few sections left where I felt weak – and I was even about to cancel the exam and not attend, but then I decided to push hard and try. And if I was going to lose, I decided to lose like a champ :)

 

So I stayed awake until 3am, took a nap until 6am and made my last preparations until 9am. I took the exam at 10am, completely exhausted, overworked and sleepy.

Fortunately, I passed the exam, and I wish you the same.

[Image: Oracle Certified Expert – Oracle Database 12c RAC and Grid Infrastructure Administrator certification badge]

This was my experience with the Oracle Database 12c: RAC and Grid Infrastructure Administration exam (1Z0-068). Let me know if you plan to take the exam, and I will guide you through it in more detail.

Good luck!

Posted in RAC issues | 1 Comment »

Using deprecated ASM parameter might prevent your Cluster to start

Posted by Kamran Agayev A. on 20th October 2017

A few days ago, I was testing some ASM parameters in my 3-node 12.2 Clusterware environment and used the ASM_PREFERRED_READ_FAILURE_GROUPS parameter to see how I could force ASM to read from a specific failure group. The tests were successful, but I didn't know that this parameter is deprecated in 12.2, and beyond that, I didn't imagine it could cause downtime and prevent the Clusterware from starting.

Here's a scenario you can try in your own test environment. First of all, I set this parameter to a failure group and then reset it back to an empty value:

SQL> alter system set ASM_PREFERRED_READ_FAILURE_GROUPS='';

System altered.

SQL> 
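For reference, the 'set' form of the command looks roughly like this (a sketch; the diskgroup and failgroup names are hypothetical – the value format is <diskgroup>.<failgroup>):

SQL> alter system set asm_preferred_read_failure_groups='DATA.FG1' sid='+ASM1';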

 

Then I made some hardware changes to my nodes and rebooted them. After the nodes came back up, I checked the status of the clusterware, and it was down on all of them.

 

[oracle@oratest01 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager

 

 

[oracle@oratest01 ~]$ crsctl check cluster -all
**************************************************************
oratest01:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
**************************************************************
oratest02:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
oratest03:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************

 

Next, I checked whether the ohasd and crsd background processes were up:

[root@oratest01 oracle]# ps -ef|grep init.ohasd|grep -v grep
root      1252     1  0 02:49 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
[root@oratest01 oracle]#

[root@oratest01 oracle]# ps -ef|grep crsd|grep -v grep
[root@oratest01 oracle]#

 

OHASD was up and running, but CRSD was not. The ASM instance must be up in order to bring up CRSD, so I checked whether the ASM instance was running, but it was also down:

[oracle@oratest01 ~]$ ps -ef | grep smon
oracle    5473  3299  0 02:50 pts/0    00:00:00 grep --color=auto smon
[oracle@oratest01 ~]$

 

 

 

Next, I decided to check the log files. I logged in to adrci to find the centralized Clusterware log directory:

 

[oracle@oratest01 ~]$ adrci
ADRCI: Release 12.2.0.1.0 - Production on Fri Oct 20 02:51:59 2017
Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.
ADR base = "/u01/app/oracle"
adrci> show home
ADR Homes:
diag/rdbms/_mgmtdb/-MGMTDB
diag/rdbms/proddb/proddb1
diag/asm/user_root/host_4288267646_107
diag/asm/user_oracle/host_4288267646_107
diag/asm/+asm/+ASM1
diag/crs/oratest01/crs
diag/clients/user_root/host_4288267646_107
diag/clients/user_oracle/host_4288267646_107
diag/tnslsnr/oratest01/asmnet1lsnr_asm
diag/tnslsnr/oratest01/listener_scan1
diag/tnslsnr/oratest01/listener_scan2
diag/tnslsnr/oratest01/listener_scan3
diag/tnslsnr/oratest01/listener
diag/tnslsnr/oratest01/mgmtlsnr
diag/asmtool/user_root/host_4288267646_107
diag/asmtool/user_oracle/host_4288267646_107
diag/apx/+apx/+APX1
diag/afdboot/user_root/host_4288267646_107
adrci> exit
[oracle@oratest01 ~]$ cd /u01/app/oracle/diag/crs/oratest01/crs
[oracle@oratest01 crs]$ cd trace

 

[oracle@oratest01 trace]$ tail -f evmd.trc
2017-10-20 02:54:26.533 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
2017-10-20 02:54:27.552 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]
2017-10-20 02:54:28.574 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

 

From the evmd.trc file it can be seen that the OCR context was not initialized. Then I checked the clusterware alert.log file:

 

[oracle@oratest01 trace]$ tail -f alert.log
2017-10-20 02:49:49.613 [OCSSD(3825)]CRS-1605: CSSD voting file is online: AFD:DATA1; details in /u01/app/oracle/diag/crs/oratest01/crs/trace/ocssd.trc.
2017-10-20 02:49:49.627 [OCSSD(3825)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.
2017-10-20 02:49:58.812 [OCSSD(3825)]CRS-1601: CSSD Reconfiguration complete. Active nodes are oratest01 .
2017-10-20 02:50:01.154 [OCTSSD(5351)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 5351
2017-10-20 02:50:01.161 [OCSSD(3825)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2017-10-20 02:50:02.099 [OCTSSD(5351)]CRS-2403: The Cluster Time Synchronization Service on host oratest01 is in observer mode.
2017-10-20 02:50:03.233 [OCTSSD(5351)]CRS-2407: The new Cluster Time Synchronization Service reference node is host oratest01.
2017-10-20 02:50:03.235 [OCTSSD(5351)]CRS-2401: The Cluster Time Synchronization Service started on host oratest01.
2017-10-20 02:50:10.454 [ORAAGENT(3362)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/oracle/diag/crs/oratest01/crs/trace/ohasd_oraagent_oracle.trc"
2017-10-20 02:50:18.692 [ORAROOTAGENT(3198)]CRS-5019: All OCR locations are on ASM disk groups [DATA], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/oratest01/crs/trace/ohasd_orarootagent_root.trc".

 

CRS didn't start because ASM was not up and running. Checking why ASM hadn't started after the server boot seemed like a good starting point for the investigation, so I logged in and tried to start the ASM instance:

 

[oracle@oratest01 ~]$ sqlplus / as sysasm
SQL*Plus: Release 12.2.0.1.0 Production on Fri Oct 20 02:55:12 2017
Copyright (c) 1982, 2016, Oracle.  All rights reserved.
Connected to an idle instance.

SQL> startup
ORA-01078: failure in processing system parameters
SQL> startup
ORA-01078: failure in processing system parameters
SQL> startup
ORA-01078: failure in processing system parameters
SQL>

 

I checked the ASM alert.log file, but it didn't provide enough information about why ASM failed to start:

NOTE: ASM client -MGMTDB:_mgmtdb:clouddb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ufg_20658_-MGMTDB__mgmtdb.trc
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:clouddb connection state (reg:2993645709)
2017-10-20T02:47:20.588256-04:00
NOTE: client +APX1:+APX:clouddb deregistered
2017-10-20T02:47:21.201319-04:00
NOTE: detected orphaned client id 0x10004.
2017-10-20T02:48:49.613505-04:00
WARNING: Write Failed, will retry. group:2 disk:0 AU:9067 offset:151552 size:4096
path:AFD:DATA1
incarnation:0xf0a9ba5e synchronous result:’I/O error’
subsys:/opt/oracle/extapi/64/asm/orcl/1/libafd12.so krq:0x7f8fced52240 bufp:0x7f8fc9262000 osderr1:0xfffffff8 osderr2:0xc28
IO elapsed time: 0 usec Time waited on I/O: 0 usec
ERROR: unrecoverable error ORA-15311 raised in ASM I/O path; terminating process 20200

 

The problem seemed to be in the ASM parameter file, so I decided to start ASM with a minimal set of parameters first and investigate afterwards. To do this, I searched for the string "parameters" in the ASM alert.log file to get the list of non-default parameters and the parameter file location:

[oracle@oratest01 trace]$ more +ASM1_alert.log
Using parameter settings in server-side spfile +DATA/clouddb/ASMPARAMETERFILE/registry.253.949654249
System parameters with non-default values:
  large_pool_size          = 12M
  remote_login_passwordfile= "EXCLUSIVE"
  asm_diskstring           = "/dev/sd*"
  asm_diskstring           = "AFD:*"
  asm_diskgroups           = "NEW"
  asm_diskgroups           = "TESTDG"
  asm_power_limit          = 1
  _asm_max_connected_clients= 4
NOTE: remote asm mode is remote (mode 0x202; from cluster type)
2017-08-11T10:22:24.834431-04:00
Cluster Communication is configured to use IPs from: GPnP

 

Then I created a temporary parameter file (/home/oracle/pfile_asm.ora) based on those settings and started the instance with it.
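The content of that temporary pfile is not shown here; a minimal sketch, with the diskstring and diskgroup values taken from the listing above, might look like this:

# /home/oracle/pfile_asm.ora - minimal pfile just to get ASM up
*.instance_type='asm'
*.asm_diskstring='/dev/sd*','AFD:*'
*.asm_diskgroups='DATA'
*.asm_power_limit=1
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

Starting ASM with it: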

SQL> startup pfile='/home/oracle/pfile_asm.ora';
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  8629704 bytes
Variable Size            1107055160 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL> exit

 

Great! ASM is up. Now I can extract the original parameter file from the spfile and try to start ASM with it:

 

[oracle@oratest01 ~]$ sqlplus / as sysasm

SQL> create pfile='/home/oracle/pfile_orig.ora' from spfile='+DATA/clouddb/ASMPARAMETERFILE/registry.253.957837377';
File created.
SQL>

 

And here is the content of my original ASM parameter file:

[oracle@oratest01 ~]$ more /home/oracle/pfile_orig.ora
+ASM1.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value
+ASM2.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value
+ASM3.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value
+ASM3._asm_max_connected_clients=5
+ASM2._asm_max_connected_clients=8
+ASM1._asm_max_connected_clients=5
*.asm_diskgroups='DATA','ACFSDG'#Manual Mount
*.asm_diskstring='/dev/sd*','AFD:*'
*.asm_power_limit=1
*.asm_preferred_read_failure_groups=''
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

 

Good. Now let’s start ASM with it:

SQL> shut abort
ASM instance shutdown

SQL> startup pfile='/home/oracle/pfile_orig.ora';
ORA-32006: ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter has been deprecated
ORA-01078: failure in processing system parameters
SQL>

 

Whoa. ASM failed to start because of a deprecated parameter?! Let's remove it and start ASM without the ASM_PREFERRED_READ_FAILURE_GROUPS parameter:

[oracle@oratest01 ~]$ sqlplus / as sysasm
Connected to an idle instance.

SQL> startup pfile='/home/oracle/pfile_orig.ora';
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  8629704 bytes
Variable Size            1107055160 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL>

 

It started! Next, I create the ASM server parameter file from this pfile and restart the instance:

SQL> create spfile='+DATA' from pfile='/home/oracle/pfile_orig.ora';
File created.

SQL> shut immediate
ASM diskgroups dismounted
ASM instance shutdown

SQL> startup
ASM instance started

Total System Global Area 1140850688 bytes
Fixed Size                  8629704 bytes
Variable Size            1107055160 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL>

 

With ASM up and running, I restarted the clusterware on all nodes and checked the status:

[root@oratest01 ~]$ crsctl stop cluster -all
[root@oratest01 ~]$ crsctl start cluster -all

[oracle@oratest01 ~]$ crsctl check cluster -all
**************************************************************
oratest01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
CRS-4404: The following nodes did not reply within the allotted time:
oratest02, oratest03

 

The first node was up, but I wasn't able to get the clusterware status of the other nodes and got a CRS-4404 error. To solve it, kill the gpnpd process on all nodes and run the command again:

 

[oracle@oratest01 ~]$ ps -ef | grep gpn
oracle    3418     1  0 02:49 ?        00:00:15 /u01/app/12.2.0.1/grid/bin/gpnpd.bin
[oracle@oratest01 ~]$ kill -9 3418
[oracle@oratest01 ~]$ ps -ef | grep gpn
oracle   16169     1  3 06:52 ?        00:00:00 /u01/app/12.2.0.1/grid/bin/gpnpd.bin

[oracle@oratest01 ~]$ crsctl check cluster -all
**************************************************************
oratest01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
oratest02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
oratest03:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
[oracle@oratest01 ~]$

 

From this blog post you can learn step-by-step clusterware startup troubleshooting – and why you should not use deprecated ASM parameters.

Posted in RAC issues | No Comments »

[INS-20802] Creating Container Database for Oracle Grid Infrastructure Management Repository failed

Posted by Kamran Agayev A. on 24th July 2017

After dealing with the root.sh script to configure a 3-node clusterware environment, I finally succeeded, but then the post-configuration step in OUI returned the following error and was unable to create the container database for the Oracle Grid Infrastructure Management Repository:

[Screenshot: OUI error dialog – INS-20802: Creating Container Database for Oracle Grid Infrastructure Management Repository failed]

There was no useful information in the log file mentioned in the dialog; however, tracing of the database creation job was enabled, and I was able to find a long trace file under the log directory, where I saw the following messages:

set newname for datafile 1 to new;
set newname for datafile 3 to new;
set newname for datafile 4 to new;
restore datafile 1;
restore datafile 3;
restore datafile 4; }
[Thread-159] [ 2017-07-24 04:24:07.299 EDT ] [RMANEngine.executeImpl:1321] Notify reader to start reading
[Thread-177] [ 2017-07-24 04:24:07.300 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=echo set off
[Thread-177] [ 2017-07-24 04:24:07.300 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.305 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN> 2> 3> 4> 5> 6> 7> 8> 9> 10> 11> 12> 13>
[Thread-177] [ 2017-07-24 04:24:07.305 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.546 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.547 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.562 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.563 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.578 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.585 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=Starting restore at 24-JUL-17
[Thread-177] [ 2017-07-24 04:24:07.792 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=allocated channel: ORA_DISK_1
[Thread-177] [ 2017-07-24 04:24:07.797 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: SID=18 device type=DISK
[Thread-177] [ 2017-07-24 04:24:08.051 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:08.383 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: starting datafile backup set restore
[Thread-177] [ 2017-07-24 04:24:08.385 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: specifying datafile(s) to restore from backup set
[Thread-177] [ 2017-07-24 04:24:08.386 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: restoring datafile 00001 to +DATA
[Thread-177] [ 2017-07-24 04:24:08.387 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: reading from backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00571: ===========================================================
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00571: ===========================================================
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-03002: failure of restore command at 07/24/2017 04:24:23
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=ORA-19870: error while restoring backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=ORA-19872: Unexpected end of file at block 4800 while decompressing backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.495 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:23.496 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN>
[Thread-177] [ 2017-07-24 04:24:23.496 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=echo set on
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=set echo off;
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:1031] hasError is true
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:1037] ERROR TRACE DETECTED

 

So, in order to create the container database, the installer was trying to restore a seed database from a backup piece and hit the following error:

ORA-19872: Unexpected end of file at block 4800 while decompressing backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb

So the problem was with the backup piece of the MGMT seed database. The file permissions were okay, so I compared the size of the extracted backup piece with the one inside the downloaded zip file:

[root@oratest01 ~]# cd /u01/app/12.2.0.1/grid/assistants/dbca/templates/
[root@oratest01 templates]# ll
total 131452
-rw-r--r-- 1 oracle oinstall 5734 Jan 26 10:48 DomainServicesCluster_GIMR.dbc
-rw-r----- 1 oracle oinstall 18628608 Jan 26 10:46 MGMTSeed_Database.ctl
-rw-r----- 1 oracle oinstall 5177 Jan 26 10:48 MGMTSeed_Database.dbc
-rw-r----- 1 oracle oinstall 39321600 Jan 26 10:46 MGMTSeed_Database.dfb
-rw-r----- 1 oracle oinstall 10578 Jun 10 2016 New_Database.dbt
-rw-r----- 1 oracle oinstall 76619776 Jan 26 10:11 pdbseed.dfb
-rw-r----- 1 oracle oinstall 6579 Jan 26 10:11 pdbseed.xml

 

It was 39 MB in the extracted folder, but 104 MB in the zip file itself.
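One quick way to make such a comparison is to list the file's size inside the zip without extracting it (a sketch; the zip file name here is illustrative):

unzip -l linuxx64_12201_grid_home.zip | grep MGMTSeed_Database.dfb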

 

 

 

 

 

Somehow the file had not been extracted correctly. I moved the existing template files to a backup folder, copied the backup pieces from the downloaded installation zip file to the same directory on the first node, and restarted the configuration – and it succeeded.
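Roughly, the fix looked like this (a sketch; the staging path and zip file name are illustrative, and the internal zip layout may differ between releases):

cd /u01/app/12.2.0.1/grid/assistants/dbca/templates
mkdir backup_templates && mv MGMTSeed_Database.* pdbseed.* backup_templates/
cd /u01/app/12.2.0.1/grid
unzip -o /stage/linuxx64_12201_grid_home.zip 'assistants/dbca/templates/*'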

[Screenshot: the Grid Infrastructure Management Repository configuration completing successfully after the retry]

Good Luck!

Posted in RAC issues | No Comments »

Perl related issues when running ./rootcrs.pl to deconfigure the node

Posted by Kamran Agayev A. on 24th July 2017

Today, while deconfiguring a failed node from the clusterware, I faced some Perl-related issues that prevented me from running the ./rootcrs.pl command. After installing a few required packages, I was able to deconfigure the node. Check the steps below and let me know whether they helped you or whether you hit different errors.

 

[root@oratest02 install]$ ./rootcrs.pl -deconfig -force -verbose
-bash: ./rootcrs.pl: /usr/bin/perl: bad interpreter: No such file or directory

 

I checked for perl and didn't find it:
[root@oratest02 install]$ which perl
/usr/bin/which: no perl in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/oracle/.local/bin:/home/oracle/bin:/home/oracle/.local/bin:/home/oracle/bin:/u01/app/12.2.0.1/grid/bin)

As perl wasn't installed by default, install it:

[root@oratest02 install]# yum install perl -y

Then I got the following errors and installed the required perl modules as follows:

[root@oratest02 install]$ ./rootcrs.pl -deconfig -force -verbose
Can't locate Env.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . . ./../../perl/lib) at crsinstall.pm line 286.
BEGIN failed--compilation aborted at crsinstall.pm line 286.
Compilation failed in require at ./rootcrs.pl line 165.
BEGIN failed--compilation aborted at ./rootcrs.pl line 165.


[root@oratest02 install]# yum install perl-Env -y


[root@oratest02 install]$ ./rootcrs.pl -deconfig -force -verbose
Can't locate XML/Parser.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 . . ./../../perl/lib) at crsutils.pm line 770.
BEGIN failed--compilation aborted at crsutils.pm line 770.
Compilation failed in require at crsinstall.pm line 290.
BEGIN failed--compilation aborted at crsinstall.pm line 290.
Compilation failed in require at ./rootcrs.pl line 165.
BEGIN failed--compilation aborted at ./rootcrs.pl line 165.
[root@oratest02 install]# yum install perl-XML-Parser -y

Finally, I was able to run rootcrs.pl and deconfigure the node from the clusterware.

Posted in RAC issues | No Comments »

Interim patch apply best practices in Oracle

Posted by Kamran Agayev A. on 2nd October 2015

Yesterday, after successfully applying an interim patch to a 3-node clusterware environment, I decided to share my experience with you. In this blog post, you will find some best practices that I believe should be followed before and during patch installation to keep downtime and failure risks to a minimum.

First of all, make sure you read the following metalink notes:

Master Note For OPatch (Doc ID 293369.1)
FAQ: OPatch/Patch Questions/Issues for Oracle Clusterware (Grid Infrastructure or CRS) and RAC Environments (Doc ID 1339140.1) 

 

Before applying any interim patches or upgrading the database or the clusterware, make sure you have answers to the following questions:

– Have you tested the patch installation? 

– Have you tested rollback of the patch? 

– What will you do if you can't roll back the patch with the default rollback mechanism?

– What will you do if you fail to open the database after the patch installation?

– Do you have a backup? Have you tested it? What if you don't have enough time to restore?

 

Here is the list of things I prefer to do before applying any interim patch to a mission-critical environment:

– Backup the home folder that is going to be patched

tar -cvf grid_home_before_patch.tar /home/oracle/app/11.2.0

If the patch installation goes wrong and you can't roll back the patch using the default method, restore the backup of the installation home folder and bring the database (or clusterware) back up.

– Make sure you have a full backup of the database 

Most probably you will not need to restore it, but you never know what might happen.

– Make sure your backup is recoverable

Yes, this might be a topic for discussion, but I strongly believe (as the author of an RMAN backup and recovery book :) ) in the "If you don't test your backup, you don't have a backup" philosophy. Restore it and make sure that a) the backup is restorable/recoverable and b) your restore/recover scripts work fine. In my experience, I once had a restore fail while restoring a backup for developers for testing purposes, because of untested recovery scripts. I also heard of a case (a couple of years ago, at a wonderful OOW session) where one of the attendees described how they failed to restore a backup when their production environment went down, and that multi-day downtime cost them a couple of million dollars.
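A quick way to at least verify that the backup pieces are readable, without actually restoring anything, is RMAN validation (a sketch; a full restore test to a separate host remains the stronger proof):

RMAN> restore database validate;
RMAN> restore controlfile validate;
RMAN> restore archivelog all validate;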

– Make sure you have a Standby database.

Why? Imagine you took a 30-minute downtime window to apply the patch and, for whatever reason, you could not complete it and cannot roll back because you are in the middle of the patch apply procedure trying to fix the issue. Or you simply can't roll back the patch. You are stuck, you don't have time to solve it, you are forced to open the database right away – and you can't do that either. In this critical situation, you can redirect the applications to the standby database. Build a standby database and make sure the archived log files of the production database are being shipped to the standby server. You can also perform a failover to test your standby database and then rebuild it.

– Test the patch apply procedure on the test environment with the “same binaries”. 

Clone the database and clusterware software to a test machine (or install the same release and apply the same patches as in production) and apply the patch there. Hit the errors in the test environment before you hit them in production.

– Make sure there are no sessions running in the background that use binaries from the home being patched.

Yesterday, when I was trying to apply an interim patch to the 3-node clusterware (11.2.0.4), I ran into the following error:

Backing up files…

UtilSession failed: Map failed
java.lang.OutOfMemoryError: Map failed
Log file location: /home/oracle/11.2.0/grid_1124/cfgtoollogs/opatch/opatch2015-10-01_17-40-56PM_1.log

OPatch failed with error code 73

The reason was not memory at all – we had plenty of free memory at the time. Some binary files were still in use by background processes, even though the whole clusterware stack was down. I killed those processes and the installation proceeded.

The following metalink note also might be useful – OPatch Apply/Rollback Errors: ‘Prerequisite check “CheckActiveFilesAndExecutables” failed’ (Doc ID 747049.1)

– Download the latest OPatch

Check the following metalink note to download the latest OPatch. How To Download And Install The Latest OPatch Version (Doc ID 274526.1)

Download it and extract it under the home folder that is going to be patched. Add $GRID_HOME/OPatch or $ORACLE_HOME/OPatch to the PATH environment variable and make sure the which opatch command returns a result.
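In practice that looks roughly like this (a sketch; OPatch is delivered as patch 6880880, and the zip file name below is illustrative):

cd $GRID_HOME
mv OPatch OPatch.old
unzip /stage/p6880880_122010_Linux-x86-64.zip
export PATH=$GRID_HOME/OPatch:$PATH
which opatch
opatch version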

– Make sure you have enough free space in the mount point where the home folder resides.

A few years ago, when I was about to install a patch in a production environment, I decided to try it in the test environment first (10 minutes before patching the production database) and ended up with a "no free space to proceed with the installation" error. The mount point of the home folder was full, because OPatch takes a backup of the binaries and library files being patched. Check the following metalink note for more information: Opatch Fails Updating Archives with "No space left on device" Error (Doc ID 1629444.1).

 – Bring the instance down before patching

If you have Grid Infrastructure installed with a RAC database and you plan to apply the patch node by node without downtime, first bring down the instance on the node you are patching, using the following command:

srvctl stop instance -d RACDB -i RACDB1

Why? Because if you start installing the patch and run the rootcrs.pl -unlock command – the first step, which brings the clusterware stack down – the database instance will be closed in ABORT mode and none of the sessions will be failed over.

– Try to rollback the patch installation at test environment after installing it

Why? To get a feel for how to roll back that specific patch (and to see whether you get any errors) in case the installation fails and you can't proceed, or the installation succeeds but causes another bug or problem. Check the following metalink note to learn how to roll back a patch, and run opatch lsinventory afterwards to make sure it was rolled back: How to Rollback a Failed Interim Patch Installation (Doc ID 312767.1).

Sometimes the rollback might fail as well :) In that case, the best option is to restore the whole home folder from your backup – though this is not mentioned in the metalink note OPatch Fails to Rollback Patch Due to Relink Errors (Doc ID 1534583.1).

– Debug the OPatch if it is stuck 

You can set the OPATCH_DEBUG=TRUE environment variable to debug OPatch. If it doesn't generate enough information, use truss (or strace on Linux) to trace OPatch. Check the following metalink note to learn how to use truss with OPatch: How To Use Truss With OPatch? (Doc ID 470225.1)

Opatch might also hang due to corrupted jar and java executables. Check this metalink note – opatch napply Hanging (Doc ID 1055397.1)

 

That's all I have :) Please let me know if this post helped you, and share your experience with me :) Have successful patching days ahead! :)

Posted in Administration, RAC issues | 8 Comments »

ORA-00304: requested INSTANCE_NUMBER is busy

Posted by Kamran Agayev A. on 27th August 2015

There are a lot of explanations and different solutions for the error "ORA-00304: requested INSTANCE_NUMBER is busy". In my case today, while I was trying to shut down one of the cluster instances, the shutdown hung. There was no further information about the hang in the log and trace files, so I went with shutdown abort and startup, and got the following message:

SQL> startup

ORA-00304: requested INSTANCE_NUMBER is busy

SQL>

The second node of the RAC database was up and running, and its instance_number was set to 2. After a little investigation, I found that there was still one OS process related to the closed database running (even though the database was shut down). I killed that process, started the first instance again, and it opened successfully.
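A rough sketch of how to find and kill such a leftover process (the instance name here is only an example; double-check what a process belongs to before killing anything on a production server):

# look for leftover background or server processes of the closed instance
ps -ef | grep RACDB1 | grep -v grep
# kill the offender only after confirming it belongs to that instance
kill -9 <pid>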

Posted in Administration, RAC issues | 3 Comments »

Default listener “LISTENER” is not configured when running DBCA

Posted by Kamran Agayev A. on 6th January 2015

When running DBCA to create a new database, you may get the following message:

Default Listener “LISTENER” is not configured in Grid Infrastructure home. Use NetCA to configure Default Listener and rerun DBCA

[Screenshot: DBCA warning – Default Listener "LISTENER" is not configured in Grid Infrastructure home. Use NetCA to configure Default Listener and rerun DBCA]

Actually, there's no need to run NetCA; all you need to do is create the listener resource with srvctl as follows:

srvctl add listener

srvctl start listener

 

Posted in RAC issues | No Comments »

Node names are missing from ./runInstaller output

Posted by Kamran Agayev A. on 4th January 2015

While installing the Oracle Database software after an Oracle Grid Infrastructure installation (11gR2 – 11.2.0.4), I was supposed to see the list of all nodes on which the Oracle software should be installed. Instead, I got nothing:

[Screenshot: runInstaller node selection screen with an empty node list]

I checked the status of the clusterware; it was up and running on both nodes:

[oracle@node1 bin]$ ./olsnodes
node1
node2
[oracle@node1 bin]$ ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

 

Then I checked the inventory.xml file and found that the CRS="true" flag was missing:

[oracle@node1 bin]$ cat /etc/oraInst.loc | grep inventory_loc
inventory_loc=/u01/app/oraInventory

[oracle@node1 bin] cd /u01/app/oraInventory/ContentsXML/

[oracle@node1 bin] more inventory.xml

<output trimmed ---------------->

<HOME NAME="Ora11g_gridinfrahome1" LOC="/u01/app/product/11.2.0.3/grid" TYPE="O" IDX="1">

</output trimmed --------------->

After running the following command to update the inventory.xml file, the node list appeared:

[oracle@node1 ~]$ cd /u01/app/product/11.2.0.3/grid/oui/bin/
[oracle@node1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME="/u01/app/product/11.2.0.3/grid" CRS=true
Starting Oracle Universal Installer…

Checking swap space: must be greater than 500 MB. Actual 3919 MB Passed
The inventory pointer is located at /etc/oraInst.loc
The inventory is located at /u01/app/oraInventory
'UpdateNodeList' was successful.

 

[oracle@node1 bin] more inventory.xml

<HOME NAME="Ora11g_gridinfrahome1" LOC="/u01/app/product/11.2.0.3/grid" TYPE="O" IDX="1" CRS="true">

 

[Screenshot: runInstaller node selection screen now showing both nodes]

Posted in RAC issues | No Comments »