Tuesday, January 12, 2021

How to Change Database Name Using Control file trace



Scenario :- The Unix team took a snapshot of the current QA server and created a new server named QAtest.

On this new server, I have to open the database under the new name QRT (the old name is PL0).

Let's log in on the new server.

export ORACLE_SID=PL0

 $ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Tue Jan 12 22:27:38 2021

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select name from v$database;
NAME
---------
PL0
SQL> select open_mode,name from V$database;
OPEN_MODE            NAME
-------------------- ---------    =====> DB name PL0
MOUNTED              PL0


  • Take a backup of the existing redo log files, which are needed for recovery. ======>

$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Tue Jan 12 22:24:04 2021
Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select member from v$logfile;
MEMBER
--------------------------------------------------------------------------------
/oracle/PL0/origlogB/log_g4m1.dbf
/oracle/PL0/mirrlogB/log_g4m2.dbf
/oracle/PL0/origlogA/log_g3m1.dbf
/oracle/PL0/mirrlogA/log_g3m2.dbf
/oracle/PL0/origlogB/log_g2m1.dbf
/oracle/PL0/mirrlogB/log_g2m2.dbf
/oracle/PL0/origlogA/log_g1m1.dbf
/oracle/PL0/mirrlogA/log_g1m2.dbf
8 rows selected.
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options


$ cp /oracle/PL0/origlogB/log_g4m1.dbf /oracle/PL0/origlogB/log_g4m1.dbf_bkp
$ cp /oracle/PL0/mirrlogB/log_g4m2.dbf /oracle/PL0/mirrlogB/log_g4m2.dbf_bkp
$ cp /oracle/PL0/origlogA/log_g3m1.dbf /oracle/PL0/origlogA/log_g3m1.dbf_bkp
$ cp /oracle/PL0/mirrlogA/log_g3m2.dbf /oracle/PL0/mirrlogA/log_g3m2.dbf_bkp
$ cp /oracle/PL0/origlogB/log_g2m1.dbf /oracle/PL0/origlogB/log_g2m1.dbf_bkp
$ cp /oracle/PL0/mirrlogB/log_g2m2.dbf /oracle/PL0/mirrlogB/log_g2m2.dbf_bkp
$ cp /oracle/PL0/origlogA/log_g1m1.dbf /oracle/PL0/origlogA/log_g1m1.dbf_bkp
$ cp /oracle/PL0/mirrlogA/log_g1m2.dbf /oracle/PL0/mirrlogA/log_g1m2.dbf_bkp
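
Copying each member by hand is error-prone. A small loop can do the same job; this is a sketch using throwaway /tmp paths, and the members.txt file name is an assumption (e.g. spooled output of `select member from v$logfile`):

```shell
#!/bin/sh
# Demo setup: throwaway files standing in for redo log members.
mkdir -p /tmp/redo_demo
: > /tmp/redo_demo/log_g1m1.dbf
: > /tmp/redo_demo/log_g1m2.dbf
printf '%s\n' /tmp/redo_demo/log_g1m1.dbf /tmp/redo_demo/log_g1m2.dbf > /tmp/redo_demo/members.txt

# Back up every member listed in members.txt with a _bkp suffix.
while read -r member; do
  [ -f "$member" ] && cp "$member" "${member}_bkp"
done < /tmp/redo_demo/members.txt
```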



  • Take a controlfile backup as trace; the trace file is written under:

/oracle/diag/rdbms/qrt/QRT/trace/

SQL> startup Mount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 5.3447E+10 bytes
Fixed Size                  2265864 bytes
Variable Size            1.7717E+10 bytes
Database Buffers         3.5702E+10 bytes
Redo Buffers               26480640 bytes
Database mounted.
SQL> alter database backup controlfile to trace;

Database altered.

SQL> select * from V$diag_info;

   
   INST_ID NAME
---------- ----------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
         1 Diag Cdump
/oracle/diag/rdbms/qrt/QRT/cdump

         1 Health Monitor
/oracle/diag/rdbms/qrt/QRT/hm

         1 Default Trace File
/oracle/diag/rdbms/qrt/QRT/trace/QRT_ora_12755.trc


Edit the controlfile trace: keep the CREATE CONTROLFILE statement, change REUSE to SET, and set the new database name QRT. Save the edited statement as a script (new_con.sql, run below).

STARTUP NOMOUNT
CREATE CONTROLFILE SET DATABASE "QRT" RESETLOGS  ARCHIVELOG
    MAXLOGFILES 255
    MAXLOGMEMBERS 2
    MAXDATAFILES 3000
    MAXINSTANCES 2
    MAXLOGHISTORY 22495
LOGFILE
  GROUP 1 (
    '/oracle/QRT/origlogA/log_g1m1.dbf',
    '/oracle/QRT/mirrlogA/log_g1m2.dbf'
  ) SIZE 250M BLOCKSIZE 512,
  GROUP 2 (
    '/oracle/QRT/origlogB/log_g2m1.dbf',
    '/oracle/QRT/mirrlogB/log_g2m2.dbf'
  ) SIZE 250M BLOCKSIZE 512,
  GROUP 3 (
    '/oracle/QRT/origlogA/log_g3m1.dbf',
    '/oracle/QRT/mirrlogA/log_g3m2.dbf'
  ) SIZE 250M BLOCKSIZE 512,
  GROUP 4 (
    '/oracle/QRT/origlogB/log_g4m1.dbf',
    '/oracle/QRT/mirrlogB/log_g4m2.dbf'
  ) SIZE 250M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '/oracle/QRT/sapdata1/system_1/system.data1',
  '/oracle/QRT/sapdata1/undo_1/undo.data1',
  '/oracle/QRT/sapdata2/undo_2/undo.data2',
  '/oracle/QRT/sapdata1/undo_3/undo.data3',
  '/oracle/QRT/sapdata2/undo_4/undo.data4',
  '/oracle/QRT/sapdata1/sysaux_1/sysaux.data1',
  '/oracle/QRT/sapdata2/dat_1/dat.data1',
  '/oracle/QRT/sapdata2/dat_2/dat.data2',
  '/oracle/QRT/sapdata5/dat_3/dat.data3',
  '/oracle/QRT/sapdata3/dat_4/dat.data4'
CHARACTER SET UTF8
;
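
The two manual edits (REUSE → SET with the new name, and the old paths rewritten) can be sketched with sed. This is only an illustration: the real trace contains both NORESETLOGS and RESETLOGS variants and needs a manual review; the trace.trc/new_con file names here are assumptions:

```shell
#!/bin/sh
# Demo input: two representative lines from a controlfile trace.
printf 'CREATE CONTROLFILE REUSE DATABASE "PL0" RESETLOGS  ARCHIVELOG\n' > /tmp/trace_demo.trc
printf "  '/oracle/PL0/origlogA/log_g1m1.dbf',\n" >> /tmp/trace_demo.trc

# REUSE -> SET with the new name; old datafile/logfile paths rewritten.
sed -e 's/CREATE CONTROLFILE REUSE DATABASE "PL0"/CREATE CONTROLFILE SET DATABASE "QRT"/' \
    -e 's|/oracle/PL0/|/oracle/QRT/|g' /tmp/trace_demo.trc > /tmp/new_con_demo.sql
cat /tmp/new_con_demo.sql
```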

  • Shut down the database instance.

$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Tue Jan 12 22:33:57 2021

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> shut abort
ORACLE instance shut down.

  • Rename (back up) the existing controlfiles listed in the pfile ====>
dp3:oraQRT 71> grep con initQRT.ora
*.control_file_record_keep_time=30
*.control_files='/oracle/QRT/origlogA/cntrl/cntrlQRT.dbf','/oracle/QRT/origlogB/cntrl/cntrlQRT.dbf','/oracle/QRT/sapdata1/cntrlrlQRT.dbf'
dp3:oraQRT 72>


$ cd /oracle/QRT/origlogA/cntrl/
$ mv cntrlQRT.dbf cntrlQRT.dbf_bkp

$ mv /oracle/QRT/origlogB/cntrl/cntrlQRT.dbf /oracle/QRT/origlogB/cntrl/cntrlQRT.dbf_bkp
$ mv /oracle/QRT/sapdata1/cntrlrlQRT.dbf /oracle/QRT/sapdata1/cntrlrlQRT.dbf_bkp
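
The renames can also be driven straight from the control_files line in the pfile, so no path is missed. A sketch using throwaway /tmp paths (the pfile line format matches the grep output above):

```shell
#!/bin/sh
# Demo setup: two throwaway controlfiles and a pfile naming them.
mkdir -p /tmp/ctl_demo
: > /tmp/ctl_demo/cntrlQRT1.dbf
: > /tmp/ctl_demo/cntrlQRT2.dbf
echo "*.control_files='/tmp/ctl_demo/cntrlQRT1.dbf','/tmp/ctl_demo/cntrlQRT2.dbf'" > /tmp/ctl_demo/initQRT.ora

# Parse the control_files value, strip quotes, rename each file with _bkp.
grep "^\*\.control_files=" /tmp/ctl_demo/initQRT.ora |
  sed "s/^\*\.control_files=//; s/'//g" |
  tr ',' '\n' |
  while read -r f; do
    [ -f "$f" ] && mv "$f" "${f}_bkp"
  done
```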

Now start up the database under the new name QRT.

export ORACLE_SID=QRT
$ ps -ef|grep pmon
oraQRT     9987   8763  0 22:36 pts/2    00:00:00 grep pmon
$
$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Tue Jan 12 22:36:56 2021
Copyright (c) 1982, 2013, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> @new_con.sql
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 5.3447E+10 bytes
Fixed Size                  2265864 bytes
Variable Size            1.7717E+10 bytes
Database Buffers         3.5702E+10 bytes
Redo Buffers               26480640 bytes
Control file created.
SQL> alter database open resetlogs;
alter database open resetlogs
*
ERROR at line 1:
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/oracle/QRT/sapdata1/system_1/system.data1'

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01589: must use RESETLOGS or NORESETLOGS option for database open

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 31328916885 generated at 01/09/2021 12:53:52 needed for
thread 1
ORA-00289: suggestion : /oracle/QRT/oraarch/QRTarch1_14978_1036372415.dbf
ORA-00280: change 31328916885 for thread 1 is in sequence #14978

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/oracle/PL0/mirrlogB/log_g2m2.dbf_bkp
Log applied.
Media recovery complete.
SQL> alter database open resetlogs;


SQL> show parameter db_name
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      QRT
SQL>
SQL>
SQL> shut immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 5.3447E+10 bytes
Fixed Size                  2265864 bytes
Variable Size            1.7717E+10 bytes
Database Buffers         3.5702E+10 bytes
Redo Buffers               26480640 bytes
Database mounted.
Database opened.

SQL> select name from V$database;

NAME
---------
QRT

===============================EOD===============================

Monday, January 11, 2021

CRS STACK :- Series 3

                                              Cluster Ready Service STACK.


Below are the Components of CRS Stack.

  • Cluster Synchronization Service.
  • Oracle ASM.
  • Cluster Time Synchronization Service 
  • Event Management
  • GRID NAMING Service.
  • Oracle Agent.
  • Oracle Notification Service.
  • Oracle Root Agent.


Event Management:- 


This background process publishes events to all the members of the cluster. On Linux/Unix, the process name is evmd.bin, and on Windows, it is evmd.exe. 

[root@rac1 ~]# ps -ef|grep evmd
grid      2317     1  1 06:46 ?        00:00:55 /u002/app/oracle/product/12.1.0/grid/bin/evmd.bin
grid      2426  2317  0 06:46 ?        00:00:15 /u002/app/oracle/product/12.1.0/grid/bin/evmlogger.bin -o /u002/app/oracle/product/12.1.0/grid/log/[HOSTNAME]/evmd/evmlogger.info -l /u002/app/oracle/product/12.1.0/grid/log/[HOSTNAME]/evmd/evmlogger.log

The EVMD process runs as the grid user.
It spawns the child process evmlogger and generates an event when something happens in the cluster.
It scans the callout directory and invokes the callout scripts found there.
EVMD restarts automatically on failure, and its failure does not cause a node reboot.
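
As an illustration of a callout: scripts placed in the clusterware callout directory (typically $GRID_HOME/racg/usrco/) are invoked with the event payload as arguments. A minimal sketch that just appends each event to a log file (the log path here is an assumption):

```shell
#!/bin/sh
# Minimal FAN callout sketch: log the event payload with a timestamp.
# In a real deployment this file would live under $GRID_HOME/racg/usrco/
# and be executable; /tmp/fan_events.log is a demo path.
LOGFILE=${FAN_LOGFILE:-/tmp/fan_events.log}
echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGFILE"
```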

[root@rac1 evmd]# ps -ef|grep evmd.bin
grid      2317     1  1 07:36 ?        00:01:10 /u002/app/oracle/product/12.1.0/grid/bin/evmd.bin
root      9175  5073  0 09:04 pts/0    00:00:00 grep --color=auto evmd.bin
[root@rac1 evmd]# kill -9 2317
[root@rac1 evmd]#  ps -ef|grep evmd.bin
grid     10510     1  3 09:08 ?        00:00:00 /u002/app/oracle/product/12.1.0/grid/bin/evmd.bin
root     10528  5073  0 09:08 pts/0    00:00:00 grep --color=auto evmd.bin
[root@rac1 evmd]#

The EVMD process receives FAN events posted by clients and distributes them to the clients that subscribed to them.

Consider a two-node cluster RAC1, RAC2: if RAC2 leaves the cluster, the OCSSD process on RAC1 posts a leave FAN event, and EVMD publishes and distributes this message to the CRSD process on RAC1.
The CRSD process always communicates with EVMD because it subscribes to the EVMD daemon for FAN events.



GRID NAMING Service:-

This process provides name resolution within the cluster.
From version 12cR1, one GNS can serve multiple clusters, in contrast to the earlier single-cluster version.
GNS is used to simplify network management.
If GNS is used, there is no need to manually set the SCAN and VIP addresses.
It also eliminates the need to configure VIP names and SCAN names in a Domain Name Server (DNS), as GNS provides the name resolution for the cluster.

When GNS configuration is selected, GNS takes over DNS responsibility for the cluster nodes. The prerequisites are:

1. Allocate one static IP address on public network for the GNS VIP.
2. Setup GNS as a delegate for the cluster's sub-domain on the DNS server.
3. Validate that there is a DHCP server that is addressable by the cluster nodes.

When installing RAC, there are two options:

1. Configure GNS and let it handle name resolution in the cluster subdomain.
2. Do not select GNS, and configure each node name and SCAN name with IP addresses defined in DNS.


Oracle ASM:-

From 11g, ASM is an important part of RAC; it provides disk and volume management for Oracle Clusterware and the Oracle database.

Below are the ASM processes:

[root@rac1 evmd]# ps -ef|grep asm
grid      3149     1  0 07:37 ?        00:00:00 asm_pmon_+ASM1
grid      3151     1  0 07:37 ?        00:00:02 asm_psp0_+ASM1
grid      3154     1  2 07:37 ?        00:03:32 asm_vktm_+ASM1
grid      3158     1  0 07:37 ?        00:00:06 asm_gen0_+ASM1
grid      3160     1  0 07:37 ?        00:00:00 asm_mman_+ASM1
grid      3164     1  0 07:37 ?        00:00:10 asm_diag_+ASM1
grid      3166     1  0 07:37 ?        00:00:00 asm_ping_+ASM1
grid      3168     1  0 07:37 ?        00:00:23 asm_dia0_+ASM1
grid      3170     1  0 07:37 ?        00:00:27 asm_lmon_+ASM1
grid      3172     1  0 07:37 ?        00:00:13 asm_lmd0_+ASM1
grid      3174     1  0 07:37 ?        00:00:41 asm_lms0_+ASM1
grid      3178     1  0 07:37 ?        00:00:15 asm_lmhb_+ASM1
grid      3180     1  0 07:37 ?        00:00:00 asm_lck1_+ASM1
grid      3182     1  0 07:37 ?        00:00:00 asm_dbw0_+ASM1
grid      3184     1  0 07:37 ?        00:00:00 asm_lgwr_+ASM1
grid      3186     1  0 07:37 ?        00:00:00 asm_ckpt_+ASM1
grid      3188     1  0 07:37 ?        00:00:00 asm_smon_+ASM1
grid      3190     1  0 07:37 ?        00:00:00 asm_lreg_+ASM1
grid      3192     1  0 07:37 ?        00:00:00 asm_pxmn_+ASM1
grid      3194     1  0 07:37 ?        00:00:01 asm_rbal_+ASM1
grid      3196     1  0 07:37 ?        00:00:04 asm_gmon_+ASM1
grid      3198     1  0 07:37 ?        00:00:00 asm_mmon_+ASM1
grid      3201     1  0 07:37 ?        00:00:02 asm_mmnl_+ASM1
grid      3203     1  0 07:37 ?        00:00:02 asm_lck0_+ASM1
grid      3205     1  0 07:37 ?        00:00:00 asm_gcr0_+ASM1
grid      3328     1  0 07:38 ?        00:00:00 asm_asmb_+ASM1
grid      3330     1  0 07:38 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3889     1  0 07:38 ?        00:00:00 mdb_asmb_-MGMTDB
grid      3893     1  0 07:38 ?        00:00:00 oracle+ASM1_asmb_-mgmtdb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle    8443     1  0 09:03 ?        00:00:00 ora_asmb_TEST1
grid      8458     1  0 09:03 ?        00:00:00 oracle+ASM1_asmb_test1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))


Cluster Time Synchronization Service :-


CTSS was introduced in 11g and handles time synchronization among the nodes in the cluster.

You use either NTP or CTSS to synchronize time among the cluster nodes.

If the NTP service is disabled, CTSS provides the time synchronization service.

[root@rac1 evmd]# ps -ef|grep ctss

root      2943     1  0 07:37 ?   00:01:23 /u002/app/oracle/product/12.1.0/grid/bin/octssd.bin reboot


Oracle Notification Service:-

ONS is installed and configured during the Clusterware installation.

When the state of a resource in the cluster changes, the ONS processes communicate with each other across the nodes and publish a message about the resource change.

CRS works closely with the ONS process: HA event information is triggered by CRS and shared with the ONS process, which publishes the HA event to the middleware tier.

The ONS process plays a very critical role in sharing FAN events with clients.

To use this setup, we have to build some logic so the application can receive HA event information quickly.

HA events are used for multiple things, for example to detect failures quickly in the cluster and publish them as FAN (Fast Application Notification) events. These FAN events are published by ONS.


Oracle Agent:-

[root@rac1 evmd]# ps -ef|grep oraagent

grid 2301     1  1 07:36 ?   00:01:5 /u002/app/oracle/product/12.1.0/grid/bin/oraagent.bin
grid      3424     1  1 07:38 ?        00:02:52 /u002/app/oracle/product/12.1.0/grid/bin/oraagent.bin
oracle    8299     1  0 09:03 ?        00:00:44 /u002/app/oracle/product/12.1.0/grid/bin/oraagent.bin

The oraagent runs as the grid and oracle users and handles the OHASD-managed resources owned by those users.
It runs server callout scripts when a FAN event occurs.



Oracle Root Agent.:-

The orarootagent is the agent responsible for managing all root-owned ohasd resources.
It also helps the CRSD process manage root-owned resources, for example the network and VIP resources.




==============================------EOD----==============================




Friday, January 8, 2021

System Tablespace Growing Fast



Current size is 44GB in my Prod Database.

select sum(bytes)/1024/1024/1024  "GB", TABLESPACE_NAME from dba_data_files group by TABLESPACE_NAME order by 1;


                      GB    TABLESPACE_NAME
------------------------ ------------------------------
                       44     SYSTEM


Free Size is 500 MB only.

SQL> select sum(bytes)/1024/1024 from dba_free_space
  2  where TABLESPACE_NAME='SYSTEM';

SUM(BYTES)/1024/1024
--------------------
             561.625


Now let's see what is consuming the most space.

SQL> column segment_name format a30
set numf 999,999,999,999,999,999

select * from
(select segment_name, bytes from dba_segments where tablespace_name = 'SYSTEM' order by bytes desc)
where rownum < 26;

SEGMENT_NAME                                      BYTES
------------------------------ ------------------------
AUD$                                     32,827,768,832  =====> 32 GB too Much space Consume.
SOURCE$                                   3,489,660,928
IDL_UB1$                                  1,902,116,864
I_SOURCE1                                 1,412,431,872
IDL_UB2$                                    752,877,568
ARGUMENT$                                   738,197,504
C_OBJ#                                      574,619,648
I_ARGUMENT1                                 478,150,656
I_ARGUMENT2                                 377,487,360
VIEW$                                       370,147,328
C_TOID_VERSION#                             291,504,128

SEGMENT_NAME                                      BYTES
------------------------------ ------------------------
I_COL2                                      270,532,608
I_COL1                                      234,881,024
IDL_CHAR$                                   218,103,808
I_COL3                                      142,606,336
DEPENDENCY$                                 125,829,120
IDL_SB4$                                    117,440,512
ACCESS$                                     109,051,904
I_DEPENDENCY2                               102,760,448
C_OBJ#_INTCOL#                              100,663,296
KOTAD$                                       87,031,808
HIST_HEAD$                                   83,886,080

SEGMENT_NAME                                      BYTES
------------------------------ ------------------------
I_DEPENDENCY1                                83,886,080
OBJ$                                         57,671,680
I_ACCESS1                                    55,574,528

25 rows selected.


Let's reduce the size.

Create a table tmpAud holding the last 60 days of audit rows:
 
SQL> sho user
USER is "SYS"
SQL>

create table tmpAud as
select * from sys.aud$
where to_date(trunc(ntimestamp#)) > trunc(sysdate-60);

Table created.

Cleanup and coalesce aud$

SQL>  truncate table sys.aud$;

Table truncated.

-- Repopulate
SQL> insert into sys.aud$ select * from tmpAud;
4407589 rows created.
SQL> commit;
Commit complete.

SQL> drop table tmpAud;
Table dropped.
SQL>
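
The cleanup above can be wrapped in a small generator script so the retention window is a parameter. This is a sketch: the file name is an assumption and the predicate is simplified to a plain timestamp comparison; review the generated script before running it as SYS:

```shell
#!/bin/sh
# Generate the AUD$ cleanup script with a configurable retention window
# (60 days, matching the post). /tmp/aud_cleanup.sql is a demo path.
DAYS_TO_KEEP=60
cat > /tmp/aud_cleanup.sql <<EOF
create table tmpAud as
  select * from sys.aud\$
  where ntimestamp# > trunc(sysdate - ${DAYS_TO_KEEP});
truncate table sys.aud\$;
insert into sys.aud\$ select * from tmpAud;
commit;
drop table tmpAud;
EOF
cat /tmp/aud_cleanup.sql
```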

Now let's re-check the space usage.

SQL> column segment_name format a30
set numf 999,999,999,999,999,999

select * from
(select segment_name, bytes from dba_segments where tablespace_name = 'SYSTEM' order by bytes desc)
where rownum < 26;

SEGMENT_NAME                                      BYTES
------------------------------ ------------------------
SOURCE$                                   3,489,660,928
IDL_UB1$                                  1,902,116,864
I_SOURCE1                                 1,412,431,872
AUD$                                        897,581,056  ===> Reduced from 32 GB to ~900 MB.



SQL> 
select sum(bytes)/1024/1024/1024 from SM$TS_FREE
where TABLESPACE_NAME='SYSTEM';

SUM(BYTES)/1024/1024/1024
-------------------------
               30.7945557  ====> 30 GB space free. We are good


                                                 ===EOD==

Sunday, January 3, 2021

Cluster Synchronization Service ===> series 2

                        

                          Cluster Synchronization Service.


                                         CRS STACK COMPONENTS

  • Cluster Ready Service.
  • Cluster Synchronization Service.
  • Oracle ASM.
  • Cluster Time Synchronization Service 
  • Event Management
  • GRID NAMING Service.
  • Oracle Agent.
  • Oracle Notification Service.
  • Oracle Root Agent. 


 Cluster Synchronization Service:-

OCSSD (the CSS daemon) :- This process is spawned by the cssdagent process. It runs in both clusterware and non-clusterware environments.

OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). It is a multi-threaded process that runs at an elevated priority and runs as the oracle/grid user.

This service manages and monitors the node membership in the cluster and updates the node status information in Voting Disk. 

It also updates and manages the cluster configuration and node membership when a node is added to or removed from the cluster.

The cssdagent process monitors the cluster and provides I/O fencing.

This service runs as the ocssd.bin process on Linux/Unix and OracleOHService (ocssd.exe) on Windows.

It provides synchronization services among the nodes.

OCSSD runs as the grid user:

grid     20834     1  1 Jan02 ?    00:10:32 /u002/app/oracle/product/12.1.0/grid/bin/ocssd.bin


OCSSD configures and maintains the cluster using the node cluster service and the group membership service. Internally, OCSSD uses two kinds of heartbeat services: network heartbeat and disk heartbeat.

Network heartbeat :- Ensures the nodes are reachable within the cluster.

Disk heartbeat :- Ensures no split brain occurs in the cluster; each node periodically casts its vote in the voting disk.

OCSSD provides a lock service, i.e. a cluster-wide serialization locking function, and it uses a FIFO mechanism to manage the locks.

OCR data is also updated by the OCSSD process.

OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery.

OCSSD process log location:

/u002/app/gridbase/diag/crs/rac2/crs/trace


If you kill ocssd.bin, or it fails, the server reboots to avoid a split-brain situation, so it is a very critical process.

[root@rac2 ~]# ps -ef|grep ocssd

grid     20834     1  1 Jan02 ?        00:10:32 /u002/app/oracle/product/12.1.0/grid/bin/ocssd.bin

root     22419 20010  0 16:05 pts/0    00:00:00 grep --color=auto ocssd

[root@rac2 ~]# kill -9 20834

[root@rac2 ~]#

My node got rebooted.


[root@rac2 ~]# uptime

 02:52:06 up 1 min,  1 user,  load average: 1.09, 0.49, 0.19 ===> Node rebooted & started 1 min before.

You can see that rac2 was evicted from the cluster in the alert.log of Node1.

[root@rac1 trace]# tail -100f alert.log

2021-01-04 04:03:45.630 [OCSSD(10356)]CRS-1612: Network communication with node rac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.620 seconds

2021-01-04 04:03:53.634 [OCSSD(10356)]CRS-1611: Network communication with node rac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds

2021-01-04 04:03:57.635 [OCSSD(10356)]CRS-1610: Network communication with node rac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds

2021-01-04 04:04:00.257 [OCSSD(10356)]CRS-1632: Node rac2 is being removed from the cluster in cluster incarnation 503751816

2021-01-04 04:04:00.271 [OCSSD(10356)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 .

2021-01-04 04:04:00.282 [CRSD(9324)]CRS-5504: Node down event reported for node 'rac2'.

2021-01-04 04:04:07.255 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'Generic'.

2021-01-04 04:04:07.255 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'ora.RACDB'.

2021-01-04 04:04:07.256 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'ora.TEST'.


This particular eviction happened because we hit the network timeout. CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made by CSSD.

If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes. 
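
A quick way to scan an alert log for the eviction-related messages shown above (CRS-1610/1611/1612/1632 and the CRS-1601 reconfiguration). The demo uses a throwaway file; the real log lives under the node's clusterware trace directory:

```shell
#!/bin/sh
# Pull eviction-related messages out of a clusterware alert log.
find_evictions() {
  grep -E 'CRS-16(10|11|12|32)|CRS-1601' "$1"
}

# Demo against a throwaway file built from the excerpt above.
cat > /tmp/demo_alert.log <<'EOF'
2021-01-04 04:04:00.257 [OCSSD(10356)]CRS-1632: Node rac2 is being removed from the cluster in cluster incarnation 503751816
2021-01-04 04:04:07.255 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'Generic'.
EOF
find_evictions /tmp/demo_alert.log   # prints only the CRS-1632 line
```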


Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin.

OCSSD :- This process is spawned by the cssdagent process.OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. One Thread Monitor network heartbeat. One thread monitor the Disk heartbeat.

CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), and monitoring to the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality).  

Responsible for spawning  ocssd.bin.

Monitor node & CSSD process for hangs.

This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent 


CSSDMONITOR - This proccess also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). 

This is a multi-threaded process that runs at an elevated priority and runs as the root user.

This Provide redundancy of CSSD Monitoring.

 Monitor node & CSSD process for hangs.

Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor

=================================EOD================================

Reference:-
Troubleshooting Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)

Saturday, January 2, 2021

CRSD Cluster Ready Service ====>Series: 1

                                   CRSD =========>Cluster Ready Service.               

           From 11g, Oracle Clusterware consists of two stacks.

OHASD STACK :- The High Availability Services stack, at the lower end of the clusterware processes, consisting of a set of background processes.

CRS STACK :- A separate stack, also consisting of a set of background processes, mainly handled by CRSD.

These two stacks consist of background processes that help run the cluster resources. With the help of the OHASD stack, the CRS stack starts/stops and provides its services to each resource in the cluster.

                         CRS STACK COMPONENTS

Cluster Ready Service.

Cluster Synchronization Service.

Oracle ASM.

Cluster Time Synchronization Service 

Event Management

GRID NAMING Service.

Oracle Agent.

Oracle Notification Service.

Oracle Root Agent. 


Every process in the CRS stack is critical. Let's explore the CRS stack first.


Cluster Ready Service.


[root@rac2 ~]# ps -ef|grep -v grep | grep crsd

root     32449     1  1 04:00 ?        00:02:27 /u002/app/oracle/product/12.1.0/grid/bin/crsd.bin reboot

[root@rac2 ~]#

Some Observation :- 

1) The CRSD process always runs as the root operating system user.

2) The CRSD process always runs from a different Oracle Home, the GRID_HOME.

3) CRSD manages the cluster resources based on the configuration information stored in the OCR.

4) The CRSD process generates an event when a resource's status changes.

5) CRSD starts/stops/monitors/manages cluster resources: Oracle instances/databases, listeners, VIPs, ASM instances, and resource failover operations.

6) CRSD is also responsible for automatically restarting failed components.

7) CRSD is responsible for updating the status of cluster resources in the OCR.

8) The public interface, private interface, and virtual IP (VIP) must all be up and running for CRS to work. Each node's interfaces should be pingable. Without these network interfaces, CRS cannot be installed.

  

CRSD Log file location:

 /u002/app/gridbase/diag/crs/rac2/crs/trace/crsd_oraagent_oracle.trc

To find the master node in the cluster, grep the crsd trace file:

cd /u002/app/gridbase/diag/crs/rac2/crs/trace

[root@rac2 trace]# grep "OCR MASTER" crsd.trc


What if the <node>.pid file associated with the CRSD process is missing?

cd /u002/app/oracle/product/12.1.0/grid/crs/init
ls -ltr
total 60
-rw-r--r--. 1 root root 4361 Sep 28 08:09 oka
-rw-r--r--. 1 root root 7250 Sep 28 08:09 ohasd.sles
-rw-r--r--. 1 root root 7160 Sep 28 08:09 ohasd
-rw-r--r--. 1 root root 9126 Sep 28 08:09 init.ohasd
-rw-r--r--. 1 root root 6159 Sep 28 08:09 afd.sles
-rw-r--r--. 1 root root 5905 Sep 28 08:09 afd
-rw-r--r--. 1 root root    0 Sep 28 08:13 rac2
-rw-r--r--. 1 root root    6 Dec 19 03:35 rac2.pid-bkp
-rw-r--r--  1 root root    5 Dec 19 10:14 rac2.pid ========> this Node Pid file for CRSD
[root@rac2 init]#

Let's remove the rac2.pid file and restart the cluster services from node 2.


[root@rac2 init]# rm rac2.pid
rm: remove regular file 'rac2.pid'? yes
[root@rac2 init]#

[root@rac2 init]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl start crs


[root@rac2 init]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services =======> the familiar error
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
[root@rac2 init]#


CRS-4535: Cannot communicate with Cluster Ready Services.

Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]


Check the trace and log files below for further troubleshooting.


/u002/app/gridbase/diag/crs/rac2/crs/trace/ohasd_orarootagent_root.trc


2020-12-19 22:46:08.316548 :CLSDYNAM:1366206208: [ora.crsd]{0:0:2} [check] PID FILE doesn't exist.
2020-12-19 22:46:08.316557 :CLSDYNAM:1366206208: [ora.crsd]{0:0:2} [check] PID  from /u002/app/oracle/product/12.1.0/grid/crs/init/rac2.pid
2020-12-19 22:46:08.316663 :  CLSDMC:1366206208: Connecting to ipc://rac2_DBG_CRSD
2020-12-19 22:46:08.316773 :  CLSDMC:1366206208: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2020-12-19 22:46:08.316815 :CLSDYNAM:1366206208: [ora.crsd]{0:0:2} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2020-12-19 22:46:08.316840 :CLSDYNAM:1366206208: [ora.crsd]{0:0:2} [check] Calling PID check for daemon

You can see the straightforward message about the missing CRSD process pid file.

Messages or Syslog from all nodes from the time of the problem:

  • Linux: /var/log/messages  ==============>
  • Sun: /var/adm/messages
  • HP-UX: /var/adm/syslog/syslog.log
  • IBM: /bin/errpt -a > messages.out
[root@rac2 trace]# grep CRSD alert.log =====> Checking CRSD status

2021-01-02 22:45:52.543 [CRSD(14337)]CRS-1012: The OCR service started on node rac2.
2021-01-02 22:45:54.451 [CRSD(14337)]CRS-1201: CRSD started on node rac2.
2021-01-02 22:51:26.657 [CRSD(17255)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 17255
[root@rac2 trace]# date
Sat Jan  2 22:52:43 IST 2021
[root@rac2 trace]# ps -ef|grep 17255
root     17657  2664  0 22:52 pts/0    00:00:00 grep --color=auto 17255
[root@rac2 trace]#

Still the process has not started.


Without the CRSD process, none of the dependent processes start, e.g. the database instances, ASM, and the listeners.

Even the stat command below does not work:

[root@rac2 trace]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
[root@rac2 trace]#

[root@rac2 trace]# ps -ef|grep pmon

root      3322 25238  0 02:35 pts/2    00:00:00 grep --color=auto pmon

[root@rac2 trace]# ps -ef|grep tns

root        22     2  0 Dec18 ?        00:00:00 [netns]

root      3339 25238  0 02:35 pts/2    00:00:00 grep --color=auto tns

[root@rac2 trace]#



[root@rac2 trace]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac2                     STABLE
ora.crf
      1        ONLINE  OFFLINE                               STABLE
ora.crsd
      1        ONLINE  OFFLINE                               STABLE
ora.cssd
      1        ONLINE  ONLINE       rac2                     STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2                     STABLE
ora.ctssd
      1        ONLINE  OFFLINE                               STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  INTERMEDIATE rac2                     STABLE
ora.gipcd
      1        ONLINE  ONLINE       rac2                     STABLE
ora.gpnpd
      1        ONLINE  ONLINE       rac2                     STABLE
ora.mdnsd
      1        ONLINE  ONLINE       rac2                     STABLE
ora.storage
      1        ONLINE  OFFLINE                               STABLE
--------------------------------------------------------------------------------

Now let's resolve this issue.

The fix is simply to recreate the <node>.pid file (here rac2.pid) under /u002/app/oracle/product/12.1.0/grid/crs/init:

[root@rac2 init]# touch rac2.pid

[root@rac2 init]#

[root@rac2 init]# ls -ltr

total 64

-rw-r--r--. 1 root root 4361 Sep 28 08:09 oka

-rw-r--r--. 1 root root 7250 Sep 28 08:09 ohasd.sles

-rw-r--r--. 1 root root 7160 Sep 28 08:09 ohasd

-rw-r--r--. 1 root root 9126 Sep 28 08:09 init.ohasd

-rw-r--r--. 1 root root 6159 Sep 28 08:09 afd.sles

-rw-r--r--. 1 root root 5905 Sep 28 08:09 afd

-rw-r--r--. 1 root root    0 Sep 28 08:13 rac2

-rw-r--r--. 1 root root    6 Dec 19 03:35 rac2.pid-bkp

-rw-r--r--  1 root root    5 Dec 19 10:14 rac2.pid_bkp_1

-rw-r--r--  1 root root    6 Jan  2 22:45 rac2.pid_bkp

-rw-r--r--  1 root root    0 Jan  2 22:53 rac2.pid


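The step above can be sketched as a small script: back up any existing <node>.pid and recreate it empty, so the clusterware can repopulate it on the next start. The recreate_pid_file helper is hypothetical; the directory and node name in the comment are the ones from this post.

```shell
#!/bin/sh
# Hypothetical helper sketching the fix: back up an existing <node>.pid
# and recreate it empty, so the clusterware can rewrite it on restart.
recreate_pid_file() {
    dir=$1; node=$2                               # init directory, node name
    if [ -f "$dir/$node.pid" ]; then
        cp "$dir/$node.pid" "$dir/$node.pid_bkp"  # keep the old copy
    fi
    : > "$dir/$node.pid"                          # create/truncate to empty
}

# On the node from this post:
#   recreate_pid_file /u002/app/oracle/product/12.1.0/grid/crs/init rac2
```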

Now restart the OHASD stack and the cluster processes:

[root@rac2 init]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl stop crs -f

CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'rac2'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'rac2'
CRS-2673: Attempting to stop 'ora.evmd' on 'rac2'
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'rac2'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'rac2' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'rac2' succeeded
CRS-2677: Stop of 'ora.evmd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'rac2'
CRS-2677: Stop of 'ora.gpnpd' on 'rac2' succeeded
CRS-2677: Stop of 'ora.cssd' on 'rac2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'rac2'
CRS-2677: Stop of 'ora.gipcd' on 'rac2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'rac2' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@rac2 init]# ls -ltr

total 68

-rw-r--r--. 1 root root 4361 Sep 28 08:09 oka

-rw-r--r--. 1 root root 7250 Sep 28 08:09 ohasd.sles

-rw-r--r--. 1 root root 7160 Sep 28 08:09 ohasd

-rw-r--r--. 1 root root 9126 Sep 28 08:09 init.ohasd

-rw-r--r--. 1 root root 6159 Sep 28 08:09 afd.sles

-rw-r--r--. 1 root root 5905 Sep 28 08:09 afd

-rw-r--r--. 1 root root    0 Sep 28 08:13 rac2

-rw-r--r--. 1 root root    6 Dec 19 03:35 rac2.pid-bkp

-rw-r--r--  1 root root    5 Dec 19 10:14 rac2.pid_bkp_1

-rw-r--r--  1 root root    6 Jan  2 22:45 rac2.pid_bkp

-rw-r--r--  1 root root    6 Jan  2 23:02 rac2.pid   <== a process ID has been written into the file

[root@rac2 init]#



[root@rac2 init]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl start crs

[root@rac2 init]# /u002/app/oracle/product/12.1.0/grid/bin/crsctl check crs

CRS-4638: Oracle High Availability Services is online

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

[root@rac2 init]#


[root@rac2 init]# ps -ef|grep pmon

grid     21444     1  0 23:03 ?        00:00:00 asm_pmon_+ASM2

oracle   21586     1  0 23:03 ?        00:00:00 ora_pmon_TEST2

root     23606  4013  0 23:08 pts/1    00:00:00 grep --color=auto pmon

[root@rac2 init]# ps -ef|grep tns

root        22     2  0 22:13 ?        00:00:00 [netns]

grid     21269     1  0 23:03 ?        00:00:00 /u002/app/oracle/product/12.1.0/grid/bin/tnslsnr ASMNET1LSNR_ASM -no_crs_notify -inherit

grid     21304     1  0 23:03 ?        00:00:00 /u002/app/oracle/product/12.1.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit

grid     21380     1  0 23:03 ?        00:00:00 /u002/app/oracle/product/12.1.0/grid/bin/tnslsnr LISTENER_SCAN1 -no_crs_notify -inherit

root     23626  4013  0 23:08 pts/1    00:00:00 grep --color=auto tns

[root@rac2 init]#


We can also confirm that the CRSD process itself is running:

[root@rac2 init]# ps -p 21098 -o cmd
CMD
/u002/app/oracle/product/12.1.0/grid/bin/crsd.bin reboot

[root@rac2 init]#
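As a final sanity check, the PID recorded in rac2.pid should belong to a live process. A minimal sketch — check_pid_file is a hypothetical helper, and note that "kill -0" only tests process existence/permission, it sends no signal:

```shell
#!/bin/sh
# Hypothetical helper: verify that the PID recorded in <node>.pid
# belongs to a running process (kill -0 sends no signal; it only
# checks whether the process exists and is signalable).
check_pid_file() {
    pid=$(cat "$1")
    if kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is running"
    else
        echo "stale pid file (pid $pid not running)"
    fi
}

# Example:
#   check_pid_file /u002/app/oracle/product/12.1.0/grid/crs/init/rac2.pid
```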

The cluster alert log confirms the restart:

2021-01-02 23:02:44.955 [OCSSD(20834)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 rac2 .
2021-01-02 23:02:47.671 [ORAROOTAGENT(21014)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 21014
2021-01-02 23:02:47.713 [OCTSSD(21027)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 21027
2021-01-02 23:02:48.861 [OCTSSD(21027)]CRS-2407: The new Cluster Time Synchronization Service reference node is host rac1.
2021-01-02 23:02:48.865 [OCTSSD(21027)]CRS-2401: The Cluster Time Synchronization Service started on host rac2.
2021-01-02 23:02:48.831 [OCTSSD(21027)]CRS-2408: The clock on host rac2 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2021-01-02 23:02:57.657 [OSYSMOND(21091)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 21091
2021-01-02 23:02:58.684 [CRSD(21098)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 21098
2021-01-02 23:03:01.625 [CRSD(21098)]CRS-1012: The OCR service started on node rac2.
2021-01-02 23:03:01.676 [CRSD(21098)]CRS-1201: CRSD started on node rac2.


=================================EOD=================================

Other possible reasons for the infamous message below are:

CRS-4535: Cannot communicate with Cluster Ready Services.

  • The OCR is inaccessible.
  • The ocr.loc content does not match the other cluster nodes.
  • The clocks differ between cluster nodes (time synchronization issue).
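For the ocr.loc mismatch case, a quick check is to diff the local copy against one fetched from another node. A sketch under assumed standard paths — compare_ocr_loc is a hypothetical helper:

```shell
#!/bin/sh
# Hypothetical helper: compare the local ocr.loc with a copy taken
# from another cluster node; a difference can leave CRSD unable to
# locate the OCR and produce CRS-4535.
compare_ocr_loc() {
    # $1 = local ocr.loc, $2 = copy fetched from the other node
    if diff -q "$1" "$2" >/dev/null 2>&1; then
        echo "ocr.loc matches"
    else
        echo "ocr.loc MISMATCH - reconcile before restarting CRS"
    fi
}

# On a live cluster, for example:
#   scp rac1:/etc/oracle/ocr.loc /tmp/ocr.loc.rac1
#   compare_ocr_loc /etc/oracle/ocr.loc /tmp/ocr.loc.rac1
```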