Kamran Agayev's Oracle Blog

Oracle Certified Master

Investigation on why database doesn’t start after successfully dropping a diskgroup

Posted by Kamran Agayev A. on December 24th, 2020

A few months ago, while performing a storage migration, I faced an interesting issue that could have led to downtime if I hadn’t noticed a hidden warning in the log file.

The plan was to create a new ASM diskgroup with normal redundancy from 2 disks on different storage arrays, test a disk crash, and confirm that there would be no data loss if one of the storage arrays failed. After creating the diskgroup, creating a test tablespace on it, and corrupting the header of one of the disks, everything was OK, so we decided to drop the diskgroup and start adding the new disks as a failgroup to the other diskgroups.


Below I reproduce the same problem with a scenario in my test environment.

  • First of all, I get the location of the controlfiles and datafiles (and, of course, the redo log files) to see which diskgroups contain physical files:


SQL> show parameter control
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string      +CFILE2/TESTDB/CONTROLFILE/current.256.1046097231

SQL> select name from v$datafile;


As you see, we have 2 diskgroups involved: +CFILE2 and +DATA. Next, I run the srvctl config database command and grep the list of diskgroups used by this database. We see the same result: +CFILE2 and +DATA.


-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2


  • Next, I query V$ASM_DISKGROUP view to get list of all diskgroups that are available in ASM:
SQL> col name format a40
SQL> set linesize 150


SQL> select group_number, name, state, type, total_mb, free_mb from v$asm_diskgroup;
GROUP_NUMBER NAME                                     STATE       TYPE     TOTAL_MB    FREE_MB
------------ ---------------------------------------- ----------- ------ ---------- ----------
           4 TESTDG                                   MOUNTED     EXTERN       1019        923
           3 DATA                                     CONNECTED   EXTERN      15342       8239
           1 CFILE2                                   CONNECTED   EXTERN       1019        892


  • We have three diskgroups: +CFILE2, +DATA and +TESTDG. Next, I will create a new tablespace in the diskgroup +TESTDG so that it becomes part of the database configuration:


SQL> create tablespace mytbs datafile '+TESTDG' size 10m;
Tablespace created.


  • Once I create a tablespace in the new diskgroup, the diskgroup becomes part of the database configuration and a dependency is established between the database and the diskgroup. This can be seen in the alert.log file of the database:


Alert.log file
Wed Jul 29 09:02:47 2020
create tablespace mytbs datafile '+TESTDG' size 10m
Wed Jul 29 09:02:48 2020
NOTE: ASMB mounting group 4 (TESTDG)
NOTE: Assigning number (4,0) to disk (/dev/asm-disk5)
SUCCESS: mounted group 4 (TESTDG)
NOTE: grp 4 disk 0: TESTDG_0000 path:/dev/asm-disk5
Wed Jul 29 09:02:50 2020
NOTE: dependency between database testdb and diskgroup resource ora.TESTDG.dg is established
Completed: create tablespace mytbs datafile '+TESTDG' size 10m
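A check for this note can be scripted. The following is a minimal Python sketch, assuming the note is worded exactly as in the alert.log excerpt above (other releases may word it differently); the function name find_diskgroup_dependencies is hypothetical.

```python
import re

# Pattern copied from the alert.log line shown above; this wording is an
# assumption and may differ between Oracle releases.
DEPENDENCY_NOTE = re.compile(
    r"dependency between database (\S+) and diskgroup resource "
    r"ora\.(\w+)\.dg is established"
)


def find_diskgroup_dependencies(alert_log_text):
    """Return (database, diskgroup) pairs for every dependency note found."""
    return [(m.group(1), m.group(2)) for m in DEPENDENCY_NOTE.finditer(alert_log_text)]


sample = (
    "Wed Jul 29 09:02:50 2020\n"
    "NOTE: dependency between database testdb and diskgroup resource "
    "ora.TESTDG.dg is established\n"
)
print(find_diskgroup_dependencies(sample))
```

Running something like this against the database alert.log after a storage change gives a quick list of diskgroups the clusterware has tied to the database.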



  • Output of the ASM alert.log file:


Wed Jul 29 09:02:48 2020
NOTE: client testdb1:testdb:rac-scan mounted group 4 (TESTDG)
Wed Jul 29 09:02:49 2020
NOTE: Advanced to new COD format for group TESTDG


  • From the output of the crsd.trc file it can be seen that there’s a hard dependency between diskgroup and the database:


2020-07-29 09:02:50.015412 :UiServer:204928768: {1:32997:407} Container [ Name: UI_REGISTER
                TextMessage[Unknown process]


– Now to see the new list of diskgroups which are part of the database configuration, we run the following command:

-bash-4.1$ srvctl config database -d testdb | grep Disk

As you see, the diskgroup +TESTDG is now also part of the database configuration. Next, to simulate a storage failure or disk crash, I corrupt the disk of the diskgroup +TESTDG using the dd command as follows:


-bash-4.1$ dd if=/dev/zero of=/dev/asm-disk5 bs=1024 count=10000
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 0.125557 s, 81.6 MB/s

– And check the alert.log file. Once it’s detected that the disk of the diskgroup with external redundancy is corrupted, database instance will crash:


Wed Jul 29 09:19:45 2020
USER (ospid: 27939): terminating the instance
Wed Jul 29 09:19:47 2020
Instance terminated by USER, pid = 27939


  • And from the alert.log file of an ASM instance, it can be seen that the disk is offlined:


Wed Jul 29 09:19:49 2020
NOTE: SMON did instance recovery for group DATA domain 3
NOTE: SMON detected lock domain 4 invalid at system inc 6 07/29/20 09:19:49
NOTE: SMON starting instance recovery of group TESTDG domain 4 inc 6 (mounted) at 07/29/20 09:19:49
NOTE: SMON will attempt offline of disk 0 - no header
NOTE: cache initiating offline of disk 0 group TESTDG
NOTE: process _smon_+asm1 (5245) initiating offline of disk 0.3916011317 (TESTDG_0000) with mask 0x7e in group 4 (TESTDG) with client assisting
NOTE: initiating PST update: grp 4 (TESTDG), dsk = 0/0xe9699735, mask = 0x6a, op = clear
Wed Jul 29 09:19:49 2020
GMON updating disk modes for group 4 at 14 for pid 18, osid 5245
ERROR: disk 0(TESTDG_0000) in group 4(TESTDG) cannot be offlined because the disk group has external redundancy.
Wed Jul 29 09:19:49 2020
ERROR: too many offline disks in PST (grp 4)


  • Now, we try to start the database


-bash-4.1$ srvctl start database -d testdb
PRCR-1079 : Failed to start resource ora.testdb.db
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node1/crs/trace/crsd_oraagent_oracle.trc".


It fails because it can’t access the datafile in the failed diskgroup. Here’s the output of the trace file:


CRS-2674: Start of 'ora.testdb.db' on 'node1' failed
CRS-2632: There are no more servers to try to place resource 'ora.testdb.db' on that would satisfy its placement policy
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node2/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.testdb.db' on 'node2' failed


  • Output of alert.log file:


Wed Jul 29 09:22:15 2020
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_ora_28674.trc:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
ORA-1157 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:32997:676} */...
Wed Jul 29 09:22:17 2020
License high water mark = 1
Wed Jul 29 09:22:17 2020
USER (ospid: 28854): terminating the instance
Wed Jul 29 09:22:18 2020
Instance terminated by USER, pid = 28854


  • Next, we offline the datafile and restart the database:


SQL> alter database datafile 2 offline;
Database altered.


-bash-4.1$ srvctl stop database -d testdb -stopoption abort
-bash-4.1$ srvctl start database -d testdb


Database is UP! Great! But … we only solved the physical file dependency problem that was preventing the database from starting. We still have the failed diskgroup in the configuration of the database resource:


-bash-4.1$ srvctl config database -d testdb | grep Disk


It means that once we restart the clusterware stack, the database resource will NOT start, because it has a hard dependency on a diskgroup that is part of its configuration and that has FAILED …

Let’s restart the crs and check the status of the database:


-bash-4.1# crsctl stop crs
-bash-4.1# crsctl start crs


  • From the output of the ASM alert.log file, it can be seen that ASM tried to mount the diskgroup and failed:


Wed Jul 29 09:41:09 2020
ERROR: ALTER DISKGROUP TESTDG MOUNT  /* asm agent *//* {1:42096:2} */
Wed Jul 29 09:41:09 2020

WARNING: Disk Group DATA containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "TESTDG" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15013: diskgroup "DATA" is already mounted


  • CRS is up
[root@node1 oracle]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@node1 oracle]#


  • As we restarted the crs on the first node only, the instance is not running on the first node. It is still up on the second node, but it will go down at the next crs or node restart.


[root@node1 oracle]# srvctl status database -d testdb
Instance testdb1 is not running on node node1
Instance testdb2 is running on node node2
[root@node1 oracle]#


  • If we try to restart the instance in the first node, we’ll fail:


-bash-4.1$ srvctl start instance -d testdb -i testdb1
PRCR-1013 : Failed to start resource ora.testdb.db
PRCR-1064 : Failed to start resource ora.testdb.db on node node1
CRS-2674: Start of 'ora.TESTDG.dg' on 'node1' failed


A message appears in the ASM trace file when you try to start the instance:


Wed Jul 29 09:44:28 2020
ERROR: ALTER DISKGROUP ALL MOUNT FOR testdb /* asm agent *//* {1:42096:192} *//* incarnation::1*/


It’s scary! You have a failed diskgroup that doesn’t contain ANY physical file, and it prevents you from starting the database instance because the database resource depends on it. The only way out is to modify the database resource configuration and remove the diskgroup as follows:


-bash-4.1$ srvctl modify database -d testdb -diskgroup DATA,CFILE2

  • Now if we check the crsd.log file, we can see that only two diskgroups, +DATA and +CFILE2, remain as hard dependencies:


2020-07-29 09:46:09.329870 :UiServer:2822174464: {1:42096:285} Container [ Name: UI_REGISTER
                TextMessage[START_DEPENDENCIES=hard(ora.DATA.dg,ora.CFILE2.dg) weak(type:ora.listener.type,global:type:ora.scan_listener.type,uniform:ora.ons,global:ora.gns) pullup(ora.DATA.dg,ora.CFILE2.dg)STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.CFILE2.dg)]
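The hard start dependencies can also be pulled out of such a line programmatically. This is a rough Python sketch, assuming the attribute keeps the START_DEPENDENCIES=hard(...) layout shown in the crsd.log excerpt above; both function names are hypothetical.

```python
import re


def hard_start_dependencies(text):
    """Return the resources listed in the hard(...) group of START_DEPENDENCIES."""
    match = re.search(r"START_DEPENDENCIES=hard\(([^)]*)\)", text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def diskgroup_names(resources):
    """Keep only ora.<NAME>.dg resources and return the bare diskgroup names."""
    return [r[4:-3] for r in resources if r.startswith("ora.") and r.endswith(".dg")]


# Shortened version of the crsd.log line shown above.
sample = (
    "TextMessage[START_DEPENDENCIES=hard(ora.DATA.dg,ora.CFILE2.dg) "
    "weak(type:ora.listener.type) pullup(ora.DATA.dg,ora.CFILE2.dg)"
    "STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg)]"
)
print(diskgroup_names(hard_start_dependencies(sample)))
```

Only the START_DEPENDENCIES group is matched here; the STOP_DEPENDENCIES group is ignored on purpose, since the start dependency is what blocks the instance.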


To make sure it’s successfully modified, run the following command and check the output:


-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2
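To spot a leftover diskgroup before it bites, the configured list can be compared with the diskgroups that actually hold database files. Below is a minimal Python sketch under stated assumptions: the srvctl output contains a "Disk Groups:" line as shown above, the file paths come from queries such as v$datafile and v$controlfile, and all function names plus the sample values marked hypothetical are illustrative only.

```python
def parse_disk_groups(srvctl_output):
    """Extract names from the 'Disk Groups: A,B' line of srvctl config output."""
    for line in srvctl_output.splitlines():
        if line.strip().startswith("Disk Groups:"):
            value = line.split(":", 1)[1]
            return {name.strip() for name in value.split(",") if name.strip()}
    return set()


def diskgroup_of(asm_path):
    """'+DATA/TESTDB/DATAFILE/x.256.1' -> 'DATA'; None for non-ASM paths."""
    if not asm_path.startswith("+"):
        return None
    return asm_path[1:].split("/", 1)[0]


def unused_diskgroups(srvctl_output, file_paths):
    """Diskgroups in the resource configuration that hold none of the given files."""
    used = {diskgroup_of(path) for path in file_paths} - {None}
    return parse_disk_groups(srvctl_output) - used


# Hypothetical configuration line that still lists a dropped diskgroup.
config = "Disk Groups: DATA,CFILE2,TESTDG"
files = [
    "+CFILE2/TESTDB/CONTROLFILE/current.256.1046097231",
    "+DATA/TESTDB/DATAFILE/system.258.1046097001",  # hypothetical datafile name
]
print(unused_diskgroups(config, files))
```

Note that an offlined datafile still appears in v$datafile, so in the test scenario above the file list would need to be filtered by status first.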


– Now we should be able to start the instance:

-bash-4.1$ srvctl start instance -d testdb -i testdb1


  • Output of the alert.log file
Wed Jul 29 09:47:30 2020
AQPC started with pid=55, OS id=14549
Starting background process CJQ0
Completed: ALTER DATABASE OPEN /* db agent *//* {1:42096:364} */


What I faced that night was that the diskgroup was successfully dropped from ASMCA, but according to the crsd.log file the hard dependency was not removed from the clusterware configuration. I decided not to restart the crs, thinking the database would not start up because of this dependency. The diskgroup was already empty (it contained no physical datafiles) and was dismounted and dropped successfully, but its hard dependency in the database resource was not changed, probably because of a bug. This means that if we had rebooted both nodes or restarted the crs after dropping the diskgroup, the database wouldn’t have started, and that would have led to downtime.

Lessons learned:

  • Make sure to check the alert.log files of the database and ASM instances, as well as the cluster log and trace files, after you perform any change (even something like dropping a diskgroup in the production environment, and even if it succeeded).
  • After making a cluster-level change, make sure to restart the crs or even reboot a node to confirm that everything is OK after the change.
  • Don’t stop the entire database. Restart the crs or the db instances in a rolling fashion, and make sure you have at least one instance available at all times.
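The first lesson can be partly automated. The following Python sketch filters a log excerpt for lines that deserve a second look; the pattern (lines starting with ERROR: or WARNING:, or containing an ORA- code) is an assumption based on the excerpts in this post, not a complete list of what Oracle can log, and suspicious_lines is a hypothetical helper name.

```python
import re

# Assumed markers, taken from the alert.log excerpts shown in this post.
SUSPICIOUS = re.compile(r"^(ERROR:|WARNING:)|\bORA-\d+")


def suspicious_lines(log_text):
    """Return the log lines that match one of the assumed problem markers."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]


sample = "\n".join([
    "NOTE: SMON will attempt offline of disk 0 - no header",
    "ERROR: too many offline disks in PST (grp 4)",
    "WARNING: Disk Group DATA containing voting files is not mounted",
    'ORA-15017: diskgroup "TESTDG" cannot be mounted',
])
for line in suspicious_lines(sample):
    print(line)
```

A filter like this does not replace reading the logs, but it makes a hidden warning harder to miss after a change that reported success.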


Posted in RAC issues | No Comments »

Exadata storage cell rolling restart caused datafile and redo log file header block corruptions

Posted by Kamran Agayev A. on February 8th, 2020

24 hours have passed and I am still at work, struggling to start up a database that was corrupted during a cell storage rolling restart procedure. I had never seen some of the Oracle error messages that I saw today. So here is what happened:


An Exadata storage cell failed during a so-called “rolling cell storage restart”. The data file headers of some files were corrupted just because of the rolling restart of the storage cells, and Oracle can’t read the mirror copy in the normal redundancy diskgroup either!!! Both are corrupted! An SR was created, but there’s no reply!


Read of datafile '+###1/###i/datafile/###_6077.1015929889' (fno 1367) header failed with ORA-01208
 Rereading datafile 1367 header from mirror side 'DA1_CD_05_CELADM02' failed with ORA-01208
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ckpt_360497.trc:
 ORA-63999: data file suffered media failure
 ORA-01122: database file 1367 failed verification check
 ORA-01110: data file 1367: '+DATAC1/###i/datafile/###.6077.1015929889'
 ORA-01208: data file is an old version - not accessing current version


Instance terminated in both RAC nodes!


License high water mark = 107
 Instance terminated by CKPT, pid = 360497
 USER (ospid: 173989): terminating the instance
 Instance terminated by USER, pid = 173989


Instance can’t be opened and media recovery is required!


Abort recovery for domain 0
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ora_175176.trc:
 ORA-01113: file 10 needs media recovery
 ORA-01110: data file 10: '+DATAC1/###i/datafile/###_4932.1028366333'
 ORA-1113 signalled during: ALTER DATABASE OPEN /* db agent *//* {0:3:84} */...
 NOTE: Deferred communication with ASM instance
 NOTE: deferred map free for map id 1127
 Fri Feb 07 16:27:06 2020
 License high water mark = 1
 USER (ospid: 175310): terminating the instance
 Instance terminated by USER, pid = 175310


Datafiles are corrupted!


Abort recovery for domain 0
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ora_75964.trc:
 ORA-01122: database file 410 failed verification check
 ORA-01110: data file 410: '+DATAC1/###i/datafile/###_5420.1007284567'
 ORA-01207: file is more recent than control file - old control file
 ORA-1122 signalled during: alter database open...


OMG! I didn’t do anything! I tried to restore some datafiles from backup and recover them. V$RECOVER_FILE is empty now. Tried to start the database:


Abort recovery for domain 0
 Aborting crash recovery due to error 742
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ora_75964.trc:
 ORA-00742: Log read detects lost write in thread %d sequence %d block %d
 ORA-00312: online log 4 thread 1: '+DATAC1/###i/onlinelog/group_4.961.997203859'
 Abort recovery for domain 0
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ora_75964.trc:
 ORA-00742: Log read detects lost write in thread %d sequence %d block %d
 ORA-00312: online log 4 thread 1: '+DATAC1/###i/onlinelog/group_4.961.997203859'
 ORA-742 signalled during: alter database open...


This is the first time I have ever seen the “Log read detects lost write” message! It means LGWR thinks the changes were written to the redo log files, but they were not! Meanwhile, SR 1 was created 2 hours ago and there was still no response from Oracle! After an investigation we found that the CURRENT logfile, which resides in the normal redundancy disk group, is corrupted! An Oracle support guy replied to the SR telling us to run the “recover database until cancel” command :) Then a second guy came in and said not to try this :)
During the datafile restore, the first block (which is the header) seemed to be corrupted in both ASM allocation units on different disks (cells)!!!


computed block checksum: 0x0
 Reading datafile '+DATAC1/###i/datafile/###899.1007892167' for corruption at rdba: 0x67400001 (file 413, block 1)
 Read datafile mirror 'DAC1_CD_07_CELADM02' (file 413, block 1) found same corrupt data (no logical check)
 Read datafile mirror 'DAC1_CD_02_CELADM01' (file 413, block 1) found same corrupt data (no logical check)
 Hex dump of (file 414, block 1) in trace file /u01/app/oracle/diag/rdbms/###i/###I_2/trace/###I_2_ora_122826.trc
 Corrupt block relative dba: 0x67800001 (file 414, block 1)
 Bad header found during kcvxfh v8

Started restoring from backup (20TB) to a different machine. It seems to be the only way to restore the service. Andddd ….. Recovery interrupted!


Errors with log /backup1/###I/ARCH/thread_2_seq_1169.14245.1031731123
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_1/trace/###I_1_pr00_81255.trc:
 ORA-00310: archived log contains sequence 1169; sequence 1160 required
 ORA-00334: archived log: '/backup1/###I/ARCH/thread_2_seq_1169.14245.1031731123'
 ORA-310 signalled during: ALTER DATABASE RECOVER LOGFILE '/backup1/###I/ARCH/thread_2_seq_1169.14245.1031731123' ...
 Signalling error 1152 for datafile 2!


RMAN is looking for the archived log file that was backed up and deleted in the beginning of the backup and wasn’t restored.
Aaaaandddddd …….. “OPEN RESETLOGS would get error” message! Are you kidding me?


Errors in file /u01/app/oracle/diag/rdbms/###i/###I_1/trace/###I_1_pr00_81255.trc:
 ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
 ORA-01152: file 2 was not restored from a sufficiently old backup
 ORA-01110: data file 2: '+DATAC1/###i/datafile/sysaux.2948.1031790467'
 ORA-1547 signalled during: ALTER DATABASE RECOVER CANCEL ...


Cataloged some missing backup files, restored required archived log files and the recovery proceeded. But we got another error!


File #142 added to control file as 'UNNAMED00142'. Originally created as:
 Errors with log /backup2/###I/ARCH/thread_2_seq_1176.9418.1031737965
 Recovery interrupted!
 Recovery stopped due to failure in applying recovery marker (opcode 17.30).
 Datafiles are recovered to a consistent state at change 8534770504316 but controlfile could be ahead of datafiles.
 Media Recovery failed with error 1244
 Errors in file /u01/app/oracle/diag/rdbms/###i/###I_1/trace/###I_1_pr00_323548.trc:
 ORA-00283: recovery session canceled due to errors
 ORA-01244: unnamed datafile(s) added to control file by media recovery
 ORA-01110: data file 142: '+DATAC1/###i/datafile/###797.1031737697'


Some datafiles had been added after the last controlfile backup (and controlfile autobackup was not enabled), so those datafiles were created with UNNAMED names. I renamed the datafiles and started the recovery again.

In the end, the database was opened successfully. We went through 4 different support engineers, some of whom seemed junior to me. They just copied some steps from MetaLink notes and sent them to me. The root cause is still under investigation.

Posted in Uncategorized | 4 Comments »

Solution for ORA-27154: post/wait create failed ; ORA-27302: failure occurred at: sskgpbitsper

Posted by Kamran Agayev A. on June 21st, 2019

Today, while creating an empty database on an Exadata machine with plenty of free space and memory, we got the following error:

SYS@TEST> startup nomount
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semget failed with status: 28
ORA-27301: OS failure message: No space left on device
ORA-27302: failure occurred at: sskgpbitsper


The problem wasn’t related to disk space at all, even though the error message says “No space left on device”.

From the error output, I noticed “OS system dependent operation:semget“, where “sem” means “semaphore“. Although there was enough free memory and disk space, the process couldn’t allocate the necessary semaphores, either because the kernel parameter wasn’t configured correctly or because all available semaphore sets were already in use. To get information about semaphores and shared memory, I ran the ipcs command:

[oracle@node2~]$ ipcs
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status 
0x00000000 0 root 644 64 2 dest 
0x00000000 32769 root 644 16384 2 dest 
0x00000000 65538 root 644 280 2 dest 
0x00000000 98307 root 644 80 2 
0x00000000 131076 root 644 16384 2 
0x00000000 163845 root 644 280 2 
0x00000000 262602758 oracle 640 4096 0
------ Semaphore Arrays --------
key semid owner perms nsems 
0x61000625 98306 root 666 1 
0x00000000 163844 root 666 3 
0x00000000 1769477 root 666 3 
0x00000000 4096006 root 666 3 
0xd9942a14 3604487 oracle 600 514 
0xd9942a15 3637256 oracle 600 514 
0xd9942a16 3670025 oracle 600 514 
0x192b36e8 219578379 oracle 640 1004 
0x5f94bc50 6062092 oracle 640 1004 
0x00000000 286752781 root 666 3 
0xaa3762f4 6324238 oracle 640 154


The list was long, so I decided to count the rows

[oracle@node2 ~]$ ipcs -s | wc -l


So overall I have 256 semaphore sets allocated. Then I checked the /etc/sysctl.conf file for the kernel.sem parameter:

[oracle@node2 ~]$ more /etc/sysctl.conf | grep sem
kernel.sem = 1024 60000 1024 256
[oracle@node2 ~]$

You can get more detailed output from ipcs -ls command as follows:


[oracle@node2 ~]$ ipcs -ls
------ Semaphore Limits --------
max number of arrays = 256
max semaphores per array = 1024
max semaphores system wide = 60000
max ops per semop call = 1024
semaphore max value = 32767


The last value of kernel.sem (shown as “max number of arrays” in the ipcs -ls output) is the maximum number of semaphore sets for the entire OS. In this case you have two options to solve the problem:

  • Increase the max number of arrays parameter in the /etc/sysctl.conf file
  • Remove unnecessary semaphores


Increasing the max number of arrays parameter is the easiest (and the fastest) way. Here is how it works:


1. Get the value for the SEM parameter:

[root@node2 ~]# cat /etc/sysctl.conf | grep sem
kernel.sem = 1024 60000 1024 256
[root@node2 ~]#

2. Edit the file and change the last value to 260 (more than the value you get from the “ipcs -s | wc -l” command), then run the following command to load the new setting from /etc/sysctl.conf:

/sbin/sysctl -p

3. Create a dummy parameter file and start the instance in NOMOUNT mode to see if the oracle user can get a semaphore from the memory:

[oracle@node2 dbs] more initTEST.ora


[oracle@node2 ~] export ORACLE_SID=TEST
[oracle@node2 ~] sqlplus / as sysdba
SYS@TEST> startup nomount
ORACLE instance started.
Total System Global Area 2137886720 bytes
Fixed Size 2254952 bytes
Variable Size 956303256 bytes
Database Buffers 1090519040 bytes
Redo Buffers 88809472 bytes


It worked!


The second option to solve the problem is to find the ‘aged’ semaphores in memory and remove them. Each semaphore is linked to a PID in the OS. In the following example I have 256 semaphores overall, of which 23 belong to the oracle user (db instances etc.) and 229 belong to the root user. Most of the processes that held these semaphores died a long time ago, but the semaphores didn’t age out. To find the PID associated with a semaphore, we run the ipcs command with the -i parameter. First, let’s get the list of semaphores owned by the oracle user and check one of them as follows:

[root@node2 ~]# ipcs -s | grep oracle 
0xcfe88130 3473414 oracle 600 514 
0xcfe88131 3506183 oracle 600 514 
0xcfe88132 3538952 oracle 600 514 
0xf0720010 411041803 oracle 640 802 
0xf8121f34 145653772 oracle 640 1004 
0xf0720011 411074573 oracle 640 802 
0xf0720012 411107342 oracle 640 802 
0xc5d91710 196444189 oracle 640 504 
0x86d48ae8 44236836 oracle 640 304 
0x67556608 199786542 oracle 640 876 
0x67556609 199819311 oracle 640 876 
0x6755660a 199852080 oracle 640 876 
0x6755660b 199884849 oracle 640 876 
0x6755660c 199917618 oracle 640 876 
0x806b87cc 157450352 oracle 640 752 
0x806b87cd 157483121 oracle 640 752 
0x806b87ce 157515892 oracle 640 752 
[root@node2 ~]#


Next, we run ipcs command with -i parameter to get the list of PIDs as follows:

[root@node2 ~]# ipcs -s -i 157450352 | more

Semaphore Array semid=157450352
uid=1001 gid=1002 cuid=1001 cgid=1002
mode=0640, access_perms=0640
nsems = 752
otime = Fri Jun 21 18:51:23 2019 
ctime = Fri Jun 21 18:51:23 2019 
semnum value ncount zcount pid 
0 1 0 0 315611 
1 4893 0 0 315611 
2 10236 0 0 315611 
3 32760 0 0 315611 
4 0 0 0 0 
5 0 0 0 0 
6 0 0 0 315729 
7 0 1 0 315731 
8 0 0 0 0 
9 0 1 0 315739 
10 0 0 0 0 
11 0 1 0 315743 
12 0 0 0 315745 
13 0 1 0 315747 
14 0 1 0 315749


Next, we run ps command and check the PID:

[root@node2 ~]# ps -fp 315729
oracle 315729 1 0 2018 ? 01:14:13 ora_pmon_SNEWDB
[root@node2 ~]#


As you see, we found out that the specific semaphore is associated with the database instance. Now let’s repeat the same steps for the semaphores of the root user:

[oracle@node2 ~]$ ipcs -s |grep root
0x61000625 98306 root 666 1
0x00000000 163844 root 666 3
0x00000000 1769477 root 666 3
0x00000000 4096006 root 666 3
0x00000000 248774666 root 666 3
0x00000000 286752781 root 666 3
0x00000000 6357007 root 666 3


Now we run the ipcs -s -i command for one of these semaphores (semid 248774666) to find the PID:

[oracle@node2 ~]$ ipcs -s -i 248774666
Semaphore Array semid=248774666
uid=0 gid=11140 cuid=0 cgid=11140
mode=0666, access_perms=0666
nsems = 3
otime = Sun Dec 16 18:34:22 2018
ctime = Sun Dec 16 18:34:22 2018
semnum value ncount zcount pid
0 1024 0 0 156155
1 32000 0 0 156155
2 0 0 0 156155


If we check the PID in the system, we see that the process no longer exists:

[oracle@node2 ~]$ ps -fp 156155
[oracle@node2 ~]$
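The two manual steps above (read the PIDs from ipcs -s -i, then check each with ps) can be sketched in Python 3 as follows. This assumes the five-column “semnum value ncount zcount pid” layout shown above and a POSIX system; the function names are hypothetical.

```python
import os


def semaphore_pids(ipcs_detail_text):
    """Collect non-zero PIDs from the 'semnum value ncount zcount pid' table."""
    pids = set()
    in_table = False
    for line in ipcs_detail_text.splitlines():
        fields = line.split()
        if fields[:1] == ["semnum"]:
            in_table = True  # rows after this header are semaphore entries
            continue
        if in_table and len(fields) == 5 and all(f.isdigit() for f in fields):
            pid = int(fields[4])
            if pid:
                pids.add(pid)
    return pids


def process_exists(pid):
    """Signal 0 tests for existence without affecting the process (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Table rows copied from the ipcs -s -i 248774666 output above.
sample = """semnum value ncount zcount pid
0 1024 0 0 156155
1 32000 0 0 156155
2 0 0 0 156155
"""
print(semaphore_pids(sample))
```

The pid column records the last process that operated on each semaphore, so a missing PID is a strong hint rather than proof that the array is orphaned; it is worth a second look before running ipcrm.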


Now we can safely remove that semaphore from the memory using ipcrm command in order to release space for new semaphores:

[root@node2 ~]# ipcrm -s 248774666
[root@node2 ~]#
Let's check if it was removed:
[root@node2 ~]# ipcrm -s 248774666
ipcrm: invalid id (248774666)
[root@node2 ~]#


As you see, we found a semaphore whose associated process no longer exists in the system and removed it to make space for new semaphores. Now let’s start the instance:

SYS@TEST> startup nomount
ORACLE instance started.
Total System Global Area 2137886720 bytes
Fixed Size 2254952 bytes
Variable Size 956303256 bytes
Database Buffers 1090519040 bytes
Redo Buffers 88809472 bytes


Posted in Administration | No Comments »

Connect to Oracle from Python – write your first Python script!

Posted by Kamran Agayev A. on June 3rd, 2019

Python is getting more popular nowadays because it is reliable and efficient, it has great corporate sponsors, and its amazing libraries help you save time during the initial development cycle.

It’s much easier to connect to an Oracle Database from Python by using the cx_Oracle module. To get more information about the cx_Oracle module, check the following links:




In this blog post, I will show how to install Python and configure the environment and connect to the database.

First of all, make sure you have an internet connection and install Python with yum as follows:

yum install python

After python is installed, install easy_install on Linux in order to download and manage Python packages easily using the following command:

wget http://bootstrap.pypa.io/ez_setup.py -O -| sudo python


Next install pip using easy_install as follows:


Now install cx_Oracle module using pip as follows:



Now install Oracle instant client:

cd /etc/yum.repos.d
wget https://yum.oracle.com/public-yum-ol7.repo
yum install -y yum-utils
yum-config-manager --enable ol7_oracle_instantclient
yum list oracle-instantclient*



Now install the Oracle Instant Client basic and sqlplus packages as follows:



After installing Oracle client, configure environment variables as follows:

vi .bashrc
export CLIENT_HOME=/usr/lib/oracle/18.3/client64


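Run the .bashrc file to set the environment variables and write your first Python script as follows:

vi connect.py 
import cx_Oracle
# Placeholder connect string (the original was not preserved); use your own credentials and DSN.
con = cx_Oracle.connect('user/password@host:1521/service_name')
print(con.version)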


If we run this script, we will get Oracle Database version in the output:

[root@oratest ~]python connect.py
[root@oratest ~]


Now let’s use split function in Python and split the version into “Version, Release and Patchset” sections as follows:

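import cx_Oracle
con = cx_Oracle.connect('user/password@host:1521/service_name')  # placeholder; use your own credentials and DSN
ver = con.version.split('.')  # e.g. ['11', '2', '0', '4', '0']
print('Version: %s\nRelease: %s\nPatchset: %s' % (ver[0], ver[1], ver[3]))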

[root@oratest ~]python connect.py
Version: 11
Release: 2
Patchset: 4
[root@oratest ~]
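The split logic can be tried without a database connection. This is a small sketch that assumes a version string of the form '11.2.0.4.0' (an assumed value, chosen to be consistent with the output above); split_version is a hypothetical helper.

```python
def split_version(version):
    """Split a dotted Oracle version string into the three parts printed above."""
    parts = version.split(".")
    if len(parts) < 4:
        raise ValueError("unexpected version string: %r" % (version,))
    # Index 3 (not 2) holds the patchset number, as in the script above.
    return {"version": parts[0], "release": parts[1], "patchset": parts[3]}


info = split_version("11.2.0.4.0")  # assumed sample value
print(info)
```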


Now let’s create a table in Oracle and write a simple python code to query and print all rows in the table:

SQL> create table test_table(id number, name varchar2(10));
Table created.
SQL> insert into test_table values(1,'Oracle DB');
1 row created.
SQL> insert into test_table values(2,'SQL');
1 row created.
SQL> insert into test_table values(3,'PL/SQL');
1 row created.


Now create a python code to query the table:

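import cx_Oracle
con = cx_Oracle.connect('user/password@host:1521/service_name')  # placeholder; use your own credentials and DSN
cur = con.cursor()
cur.execute('select * from test_table order by 1')
for result in cur:
    print(result)
cur.close()
con.close()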


[root@oratest ~]python connect.py
(1,'Oracle DB')
[root@oratest ~]
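The query pattern can be tried without an Oracle database, because cx_Oracle follows the Python DB-API 2.0 interface. The sketch below uses the standard library's sqlite3 module as a stand-in (an assumption made only so the example runs anywhere); fetch_all_rows is a hypothetical helper that should work the same way with a cx_Oracle connection.

```python
import sqlite3


def fetch_all_rows(connection, sql):
    """Run a query on any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()  # always release the cursor, even if the query fails


# sqlite3 stands in for Oracle here; the table mirrors test_table above.
con = sqlite3.connect(":memory:")
con.execute("create table test_table(id integer, name varchar(10))")
con.executemany(
    "insert into test_table values(?, ?)",
    [(1, "Oracle DB"), (2, "SQL"), (3, "PL/SQL")],
)
rows = fetch_all_rows(con, "select * from test_table order by 1")
for row in rows:
    print(row)
con.close()
```

With cx_Oracle, only the connect call changes; the cursor handling stays the same.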

Congratulations! You’ve installed/configured Python, connected to an Oracle database, queried the table and printed the output!

Posted in Administration | 3 Comments »

Second OCM exam is cleared. New book and online course are on the way

Posted by Kamran Agayev A. on April 23rd, 2019

2 months ago, after a long preparation, I decided to upgrade my OCM certification and registered for the exam in Shanghai. A few years ago, when I cleared the 10g OCM exam, I started my preparations for the upgrade right away. I did a lot of research and hands-on practice and then thought it would be great to collect everything I had in a single book. It took me almost 2 years to publish the book. A few months after the book was published, I started getting emails from readers about how the book had helped them during their preparations, and I was happy to see them passing the exam! Having a lot of different projects during those days, I didn’t manage to take the exam myself. And unfortunately the 11g OCM 1-day exam was retired, which meant that I had to take a 2-day exam again! But it was OK. If that was the only option, there was nothing I could do.

I will not talk about how hard my travel was, but eventually the exam day arrived. It consisted of 9 sections (2 days) with a lot of different practical tasks. I also wouldn’t like to go into more detail regarding the questions and so on, but what I realized was that the book I had published even before taking the OCM 11g exam covered almost everything that I had during the exam 😊 Reviewing topics directly from my book helped me to be confident during the exam.

A few weeks passed, and I got a happy email from Oracle saying that I had passed the exam and become 2xOCM. Now it’s time for the third and last one )) And it means that I’ve already started my preparation, along with the new book, which will be published in a few months.

For those of you who want to clear the OCM 11g exam: believe it or not, my book covers almost all the topics. And after clearing the second OCM exam, I decided to start an online course and help you with your preparation individually. So stay tuned and I will announce the course information shortly 😊


Posted in Uncategorized | 2 Comments »

PRCR-1079 : Failed to start resource oranode1-vip. CRS-2680 Clean failed. CRS-5804: Communication error with agent process

Posted by Kamran Agayev A. on April 15th, 2019

Last week we had a clusterware issue on one of the critical 3 node RAC environment. In the first node, network resource is restarted by ending up killing all sessions on that node abnormally. Oracle VIP that was running on that node failed over to the third node. The first node was up and running, but didn’t accept connections because it was trying to register the instance using LOCAL_LISTENER parameter where the oranode1-vip was specified that was not running on that node. We tried to relocate it back to the first node, but it failed because it couldn’t stop it. Everytime we tried to stop or relocate it, the cleaning process started and failed in a few minutes.

Neither support nor we found any useful information in the clusterware log files. Although there were 2 instances up and running, the load was so high that they could barely handle all the connections. Ping to oranode1-vip succeeded, but we weren’t able to stop the VIP, even in force mode. We weren’t able to start it either, because it hadn’t stopped successfully and the clean-up hadn’t completed. The status was “enabled” and “not running”, but ping was OK:

db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled 
VIP oranode1-vip is not running 

From the crsctl stat res command we could see that it was OFFLINE and had failed over to node03:


db-bash-$ crsctl stat res -t
oranode1-vip  1 OFFLINE UNKNOWN node03


And it failed when we tried to start it:

db-bash-$ srvctl start vip -i oranode1-vip   
PRCR-1079 : Failed to start resource oranode1-vip  
CRS-2680: Clean of 'oranode1-vip  ' on 'node03' failed 
CRS-5804: Communication error with agent process


We cleared the socket files of the first node from the /var/tmp/.oracle folder, restarted the CRS and checked whether the VIP failed back, but it didn’t. Support asked us to stop the second node, clear its socket files and start it to see if anything changed, but we didn’t do it, because a single node wouldn’t have been able to handle all the connections.
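
The socket-file cleanup above could be sketched roughly as follows. This is only a dry-run illustration, not the exact procedure we used: the GRID_HOME path is a made-up placeholder, and stopping the stack before removing the files is my assumption about a safe order. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the socket-file cleanup on one node (to be run as root).
# Assumptions: GRID_HOME is a hypothetical path; the stack is stopped before
# its socket files are removed.
GRID_HOME=${GRID_HOME:-/u01/app/grid}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cleanup_sockets() {
    run "$GRID_HOME/bin/crsctl" stop crs       # stop the clusterware stack on this node
    run sh -c 'rm -f /var/tmp/.oracle/*'       # remove the stale socket files
    run "$GRID_HOME/bin/crsctl" start crs      # start the stack again
    run "$GRID_HOME/bin/crsctl" stat res -t    # check where the VIP is running now
}

cleanup_sockets
```

Review the printed commands first; setting DRY_RUN=0 executes them for real.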

In the end, we checked the interface of the virtual IP at the OS level, and found it on node03:

db-bash-$ netstat -win
lan900:805 1500 #### #### 2481604 0 51 0 0
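
To spot such leftover logical interfaces quickly, the netstat output can be filtered. The sketch below is only an illustration: it assumes that a logical interface is any name of the form card:index (like lan900:805 above), and the function name is made up.

```shell
#!/bin/sh
# Sketch: list logical (aliased) interfaces from netstat-style output.
# Assumption: a logical interface name looks like <card>:<index>.

logical_interfaces() {
    # Reads netstat output on stdin and prints the first column
    # whenever it has the <card>:<index> form.
    awk '$1 ~ /^[A-Za-z0-9]+:[0-9]+$/ { print $1 }'
}

# Sample line copied from the output above (address columns are masked).
sample='lan900:805 1500 #### #### 2481604 0 51 0 0'
printf '%s\n' "$sample" | logical_interfaces

# On a live node the real command could be piped in instead:
#   netstat -win | logical_interfaces
```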


Instead of restarting the CRS of the production database (which takes 10 minutes), we decided to bring that interface down at the OS level. On HP-UX, that is the ifconfig … down command.

Before running this command in the production environment, we tried it in the test environment and realized that the down parameter alone is not enough: we had to provide the IP address along with the down parameter to bring the interface down. So we ran the following command to bring it down:

ifconfig lan900:805 down

And it disappeared from the list. Next, we started the VIP using the srvctl start vip command, and it succeeded!
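
Put together, the fix can be sketched as a small dry-run script. It is only an illustration under a few assumptions: the interface and VIP names are the ones from this case, the ifconfig line is used exactly as printed above (the form with the IP address is not shown here, so I don’t guess it), and the run helper is made up for the example.

```shell
#!/bin/sh
# Dry-run sketch: bring the stale logical interface down on the node that
# still holds it (node03 in this case), then start the VIP again.
IFACE=${IFACE:-lan900:805}           # logical interface found with netstat
VIP_NAME=${VIP_NAME:-oranode1-vip}   # VIP resource name
DRY_RUN=${DRY_RUN:-1}                # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

fix_vip() {
    run ifconfig "$IFACE" down              # command as given in the post
    run netstat -win                        # confirm the interface is gone
    run srvctl start vip -i "$VIP_NAME"     # start the VIP
    run srvctl status vip -i "$VIP_NAME"    # verify that it is running
}

fix_vip
```

Keeping DRY_RUN=1 is a cheap way to rehearse the steps in a test environment before touching production.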

Lessons learned:

  • Perform all actions in the test environment first (if you are not sure what can happen) before trying them in the production environment
  • Don’t rush to “restart” or “reboot” the instance, the cluster or the node. Sometimes it just doesn’t solve your problem. Even after a restart, the system may not start up correctly (because of changed parameters, configuration, etc.)
  • In 24 hours, our severity #1 SR was assigned to 6 different engineers. It takes a lot of time to gather log files, submit them and have them reviewed by an Oracle engineer before his/her shift ends. Sometimes you just don’t have time to wait for an answer from Oracle; you have to do it on your own and take all the risks. That requires experience.

Posted in RAC issues | No Comments »

ODev Yathra Tour 2018 – discovering Incredible India

Posted by Kamran Agayev A. on August 10th, 2018

Last month, after a long brainstorm, I decided to take my chance and confirmed my participation in the Indian Oracle ODev Yathra tour. Although I had visited India (Hyderabad) 2 times in the past for the Sangam conferences, I wanted to discover more of India and decided to take 4 cities out of 7.

For the Yathra Tour I submitted 2 papers:

The first one was “8 ways to migrate your On-Premises database to Oracle Cloud”, where I talked about different ways to migrate a database to the Oracle Cloud based on the downtime and the migration requirements.

The second session, “Create, configure and manage Disaster Recovery in Oracle Cloud for On-Premises database”, was about creating, configuring and managing DR in the Oracle Cloud using different techniques, as well as configuring high level database, backup and network security.

When the agenda was published, I got a lot of messages from DBAs in the cities where I was not scheduled to speak, saying that they were looking forward to meeting me. So I talked to Sai Ram, the organizer of the Yathra Tour, and he managed to put me on the agenda of the remaining cities. I made one of the hardest decisions of my life and took all the cities. I had (and still have) a lot of ongoing projects at my company and had health issues that made it hard for me to travel long distances for two weeks. But I decided to push my limits and go beyond them.

So, finally, the travel started. I took my first flight to Abu Dhabi, and from Abu Dhabi to Chennai. I landed in Chennai, took a cab to the hotel, had some rest and was in the lobby at 7.30 AM the next morning. Yes, this was the usual checkout time from the hotel every day :) In the morning I met Oracle Fusion expert Basheer Khan, Machine Learning PM Sandesh and Exadata PM Gurmit, and we had breakfast together. Then we took a cab and went to the venue.

So the daily routine for the conference was: 7.30 AM checkout from the hotel, cab drive to the venue, registration, introductory speech by Sai and other AIOUG members, then delivering presentations, having lunch (most of the time a spicy Indian lunch :) ), closing ceremony at 6.00 PM, driving to the airport, flying to the next city, a bunch of security checks, etc., driving to the hotel, check-in and off to bed at 1.00 AM, and then checkout at 7.30 AM and off to the next venue again. Scary, right? :)

The next city was Bengaluru. As we had one extra day there, I decided to have lunch outside in a random restaurant. The place was near the hotel and I ordered biryani as always :) Although I asked for “less spicy” biryani, I was served the spicy one. My tongue was burned and I could hardly drink tea for the next 2 days :) But it was very delicious. In the evening I took a small trip to MG (Mahatma Gandhi) Road. It was a very crowded, fascinating place, and I barely got rid of a man who was chasing me and trying to sell me a chess set for 1500 rupees (which originally cost 600 rupees) :) He didn’t know I train JiuJitsu :)

The next city was Ahmedabad. And I was not the only person visiting this city for the first time. Actually, none of us (mostly Indian speakers) had visited Ahmedabad before :) The roads of this city were wide, and I was told that the Ahmedabad guys are the coolest guys in India )) My session was after lunch, so I managed to sleep a little bit more and arrived at the venue later. Unfortunately, I didn’t manage to visit the open-air barber whom I was filming with curiosity. He waved at me and invited me to try his service, but I was running late for my session.

Ahmedabad_1 Ahmedabad_2

The next city was Hyderabad, and the airport was very familiar to me, as I had already visited Hyderabad 2 times before. Again, I was fortunate to have only one session after lunch, so I arrived at the venue a little bit later and met a lot of friends I had made during my previous visits. All of us were off to the airport right after the conference.

Hyderabad_1 Hyderabad_2

And we headed to Pune. I was happy, because we had an extra day in Pune. We arrived in the city in the evening, and the next day, after having lunch in the hotel, I skipped the city tour with the speakers who were more energized than me :) Instead, I found a Starbucks coffee shop, spent a few hours reading a book (Ikigai, a Japanese concept that means “a reason for being”) and relaxed a lot. The next day, we checked out from the hotel early in the morning and went to the Oracle office, which was a bit far from the hotel, so fortunately we did a city tour along the way )) The venue was huge and beautiful, and there was a coffee machine that I used a lot to stay alive. We had very interactive sessions, and after the conference the bus was waiting to take us to Mumbai! It took us approximately 4 hours to reach Mumbai, but we enjoyed the trip a lot. In the following link you can see part of our trip in Connor’s video shoot :)

Pune1 Pune_2

The Mumbai meetup was awesome. I got more questions in just a single session than in the rest of the tour :) and ended up finishing the 45-minute session in an hour and a half! But it was not just a presentation: because of those questions, the session was more like a discussion, which I liked a lot!

Mumbai_1 Mumbai_2

And after the conference, we headed to the airport to fly to the last city, Gurgaon! The next morning I was extremely tired and could barely walk or stand straight. But I got a lot of positive energy from the attendees and did 2 sessions successfully. As my flight was the next day at 4.00 AM, I returned to the hotel, had some rest, headed to the airport and returned to my lovely country, Azerbaijan.

Gurgaon_1 Gurgaon_2 Gurgaon_3

So overall, the trip was awesome! It was hard, but it was worth it. I made new friendships, met online friends who had been using my blog posts for years, got a lot of positive feedback, listened to stories about how my blog posts saved their lives, etc. :) and it motivated me to write more blog posts in the future. I also attended sessions of other speakers and learned a lot, both in terms of presentation and technical skills.

I would like to thank the ODev Yathra Tour organizers, especially Sai Ram for all he did to make us feel at home, the AIOUG staff, the ACE program, especially Jennifer and Lori, for supporting us, and all the attendees for taking the time to attend our sessions. I love India and the community a lot and am looking forward to visiting the amazing and incredible India again!

Posted in Uncategorized | 1 Comment »

Download and install Oracle Database 18c – NOW!

Posted by Kamran Agayev A. on July 25th, 2018

Most of you have already seen that Oracle Database 18c has been released. If you haven’t downloaded and installed it yet, let’s do it!

First of all, check the following address and download the installation of Oracle Database 18c:


If you want to download it from the host itself, you can use wget by providing username, password and the installation zip file as follows:

wget --http-user=YOUR_USERNAME --http-password=YOUR_PASSWORD --no-check-certificate --output-document=LINUX.X64_180000_db_home.zip "https://download.oracle.com/otn/linux/oracle18c/180000/LINUX.X64_180000_db_home.zip"

If you want to get more information on this technique, check the following metalink note:

Using WGET to download My Oracle Support Patches (Doc ID 980924.1)


Next, unzip the file and run ./runInstaller :
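
Starting with 18c, the database software is delivered as an image, so the zip file is extracted directly into the directory that becomes the Oracle home and the installer is started from there. Here is a dry-run sketch of that step; the ORACLE_HOME path and the run helper are only assumptions for the example, so adjust them to your own environment.

```shell
#!/bin/sh
# Dry-run sketch: unpack the 18c image into the future Oracle home and
# launch the installer from there.
ORACLE_HOME=${ORACLE_HOME:-/u01/app/oracle/product/18.0.0/dbhome_1}  # hypothetical path
ZIP_FILE=${ZIP_FILE:-LINUX.X64_180000_db_home.zip}                   # file downloaded above
DRY_RUN=${DRY_RUN:-1}                                                # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_18c() {
    run mkdir -p "$ORACLE_HOME"
    run unzip -q "$ZIP_FILE" -d "$ORACLE_HOME"           # the zip content becomes the Oracle home
    run sh -c "cd '$ORACLE_HOME' && ./runInstaller"      # opens the graphical installer
}

install_18c
```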

Choose the first option and click Next:

If you don’t want to choose the components and configure the advanced options, choose “Desktop class” and click Next:

Provide the Oracle Base and database file locations, the database name and the SYS password, and click Next:

Check the summary information and click Install:

The installation (actually relinking) will proceed, and you will be asked to run the root.sh script as the root user. Run it and click OK to proceed:

The installation will create a database, provide the OEM page and finish.

Click Close, switch to the terminal, log in to the database and start getting your hands dirty with Oracle 18c!
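
A first login could look like the sketch below. It assumes that ORACLE_HOME and ORACLE_SID are already set in your session and that you connect as the OS user who owns the installation; the function name is made up for the example.

```shell
#!/bin/sh
# Sketch: a small SQL*Plus script that prints the version banner.
# Assumptions: ORACLE_HOME and ORACLE_SID are set, and the OS user is
# allowed to connect "/ as sysdba".

version_check_sql() {
    # Printed on stdout so that it can be reviewed or piped to sqlplus.
    cat <<'EOF'
set pagesize 0 feedback off
select banner from v$version;
exit
EOF
}

version_check_sql

# To actually run it against the new database:
#   version_check_sql | sqlplus -s / as sysdba
```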

In the next posts, I will share 18c new features with practical use cases. Good luck!


Posted in Administration | 2 Comments »

The most horrific Oracle messages you might get in the production database – or – why DBAs get older

Posted by Kamran Agayev A. on May 10th, 2018

If you are a production DBA of a mission-critical system, then you might have already seen the following critical, I would say mortal, messages in your alert.log file.

  • Your database was up and running; you shut it down and start it again, and it fails to MOUNT the database and aborts
  • The database hung with millions of online transactions and aborted. You start the instance, switch to MOUNT mode, do some maintenance tasks and try to open the database and …. wait …. wait …. wait …..
  • system01.dbf contains corrupted blocks
  • After it takes 15 hours to restore the database, you run the recover database command and get the following errors:
  • When you are done with restore/recover, open the database with the RESETLOGS option and see the following errors:
  • When you have missing datafiles of a 10 TB tablespace due to hard disk corruption and don’t have a backup
  • Incomplete recovery due to missing archived log files, and most probably you are going to fail using the *._allow_resetlogs_corruption parameter as well
  • When your database hangs, you get a hard disk corruption and lose some datafiles, and it takes an hour and a half to perform an instance recovery, and you just wait all that time for the database to open:
  • Aaaand the most annoying message during the recovery

I will keep updating this post with your and my screenshots. Feel free to send me screenshots of cases where you were stressed but eventually managed to solve the database issue.

Posted in Administration | 2 Comments »

How to pass Oracle Database 12c: RAC and Grid Infrastructure Administration exam – 1Z0-068 and become Oracle Certified Expert

Posted by Kamran Agayev A. on May 3rd, 2018

In this post I will talk about how I prepared for and passed the 12c RAC and Grid Infrastructure Administration exam.


About the exam

Check the following link to get more information about the exam from Oracle University page:



The exam consists of 3 parts:

– Oracle 12c ASM Administration
– Oracle 12c Grid Infrastructure Installation and Administration
– Oracle 12c RAC Administration


I don’t want to scare you, but the exam is quite hard. The bad thing is that you fail the entire exam if you fail one of the sections. This means that you have to be well prepared for all 3 parts. As for me, I was good at ASM and RAC Administration but was not comfortable with the Grid Infrastructure Installation and Administration part, which I barely passed.

You may be an Oracle high availability expert and still fail the exam. You might have experience but fail because of useless (or maybe uncommon) features and topics that you didn’t practice, didn’t read about, or read only superficially, because most of the questions check not your practical experience but your theoretical knowledge. I have managed highly available cluster databases for the last 8 years, and it was really hard to answer some of the questions about things that I had never faced and had never seen a reason to try.
There were a lot of questions like “Choose four options, where blah blah blah ….”, where you have to choose 4 options out of 7. You might know 3 correct answers, but because of that 1 wrong option you might fail.

Next, you have to achieve a minimum score in all 3 sections in order to pass the entire exam. You might complete 2 sections with 100%, fail the third one and end up failing the entire exam.


How to prepare for the exam?

You have to read the documentation and play with ASM, RAC database and Grid Infrastructure A LOT!

If you want to learn Oracle 12c Grid Infrastructure installation, check the following video tutorial:



Check the videos section on oraclevideotutorials.com to find some clusterware-related hands-on practices:



The only available book related to the exam (mostly the RAC part) is the following one, which is worth reading and was written by friends of mine, Syed Jaffar, Kai Yu and Riyaj Shamsudeen:

Expert Oracle RAC 12c


In my OCM preparation book, I have two chapters that can help you during the preparation:

Chapter 7 – Grid Infrastructure and ASM

Chapter 8 – Real Application Clusters.


To get a free trial PDF copy of the book, go to www.ocmguide.com , or purchase it from the following link:



During the exam, I regretted skipping some chapters in the documentation and reading some of them only superficially. I highly recommend checking the ASM, RAC and Grid Infrastructure documentation and making sure you go through the entire documentation at least once. Here are the links to the documentation:


Real Application Clusters Administration and Deployment Guide



Clusterware Administration and Deployment Guide



Automatic Storage Management Administrator’s Guide



Setting deadlines and booking the exam

Most of you (including me) postpone the exam and don’t set deadlines for the preparation and for the exam itself. My advice: set an approximate date for the exam and make a plan for each month, week and day. Then set a date and book the exam! Yes, book it, as you have a chance to rebook if you don’t feel ready, as long as it’s more than 24 hours before the exam. Registering weeks before the exam date will push you to complete your preparation on time.


I booked the exam for Tuesday, rebooked it to Wednesday, then to Thursday, and then to Friday :). On Wednesday I decided to reschedule it to the next Monday, and in the evening I was shocked when I saw that I hadn’t actually rescheduled it to Friday. It would happen tomorrow (on Thursday)! In just a few hours! :)


I didn’t feel ready and still had a few incomplete sections where I felt weak. I was even about to cancel the exam and not attend, but then decided to push hard and try. And if I lost, I decided to lose like a champ :)


So I stayed awake till 3am, took a nap till 6am and made my last preparations till 9am. I attended the exam at 10am and was completely exhausted, overworked and sleepy.

Fortunately, I passed the exam, and I wish you the same.

This is my experience with the Oracle Database 12c: RAC and Grid Infrastructure Administration exam (1Z0-068). Let me know if you plan to take the exam, so I can guide you through it in more detail.

Good luck!

Posted in RAC issues | 1 Comment »