Sunday, January 3, 2021

Cluster Synchronization Service ===> series 2

                        

                          Cluster Synchronization Service.


                                         CRS STACK COMPONENTS

  • Cluster Ready Service.
  • Cluster Synchronization Service.
  • Oracle ASM.
  • Cluster Time Synchronization Service.
  • Event Management.
  • Grid Naming Service.
  • Oracle Agent.
  • Oracle Notification Service.
  • Oracle Root Agent. 


 Cluster Synchronization Service:-

OCSSD (CSS daemon) - This process is spawned by the cssdagent process. It runs in both
vendor clusterware and non-vendor clusterware environments.

It is a multi-threaded process.

OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle/grid user.

This service manages and monitors node membership in the cluster and updates the node status information on the voting disk.

It also updates and manages the cluster configuration and node membership when a node is added to or removed from the cluster.

The cssdagent process monitors the cluster and provides I/O fencing.

This service runs as the ocssd.bin process on Linux/Unix and as ocssd.exe (under OracleOHService) on Windows.

It provides synchronization services among the nodes.

OCSSD runs as the grid user:

grid     20834     1  1 Jan02 ?        00:10:32 /u002/app/oracle/product/12.1.0/grid/bin/ocssd.bin
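The owner can also be checked from a script by splitting the ps -ef line into its columns. The following is a minimal Python sketch, not an Oracle tool; the names PsEntry and parse_ps_line are hypothetical, and the sample line is the ps output shown above.

```python
from typing import NamedTuple


class PsEntry(NamedTuple):
    user: str
    pid: int
    ppid: int
    command: str


def parse_ps_line(line: str) -> PsEntry:
    """Split one `ps -ef` line into owner, PID, parent PID and command."""
    # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
    # The command may contain spaces, so split at most 7 times.
    fields = line.split(None, 7)
    if len(fields) < 8:
        raise ValueError(f"not a ps -ef line: {line!r}")
    return PsEntry(fields[0], int(fields[1]), int(fields[2]), fields[7])


# Sample line taken from the ps output in this post.
sample = ("grid     20834     1  1 Jan02 ?        00:10:32 "
          "/u002/app/oracle/product/12.1.0/grid/bin/ocssd.bin")
entry = parse_ps_line(sample)
print(entry.user, entry.pid, entry.ppid)
```

For the sample line, the owner is grid and the PID is 20834, which matches the ps output.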


OCSSD configures and maintains the cluster using the Node Cluster Service and the Group Membership Service. Behind the scenes, OCSSD uses two types of heartbeat: the network heartbeat and the disk heartbeat.

Network heartbeat:- Ensures that the nodes are reachable in the cluster.

Disk heartbeat:- Ensures that no split-brain occurs in the cluster. Each node periodically casts its vote on the voting disk.
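To make the network heartbeat timeout more concrete, here is a toy Python model of how a missing heartbeat escalates. It is only a sketch, not how OCSSD is implemented. The function name heartbeat_status is hypothetical, the 50%/75%/90% thresholds mirror the CRS-1612/CRS-1611/CRS-1610 warnings shown later in this post, and the 30-second timeout is an assumption for illustration (the real value is whatever `crsctl get css misscount` reports on your cluster).

```python
def heartbeat_status(seconds_since_last: float, timeout: float = 30.0) -> str:
    """Classify a peer node by how long its network heartbeat has been missing.

    `timeout` is an assumed value for illustration, not read from the cluster.
    """
    if seconds_since_last < 0 or timeout <= 0:
        raise ValueError("elapsed time must be >= 0 and timeout must be > 0")
    percent = 100.0 * seconds_since_last / timeout
    if percent >= 100:
        return "evict"
    # Check the highest warning threshold first.
    for threshold in (90, 75, 50):
        if percent >= threshold:
            return f"warn-{threshold}"
    return "ok"


for elapsed in (5, 15, 22.5, 27, 30):
    print(elapsed, heartbeat_status(elapsed))
```

With the assumed 30-second timeout, the warnings start once the heartbeat has been missing for 15 seconds, and the node is a candidate for eviction at 30 seconds.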

OCSSD provides a lock service, that is, a cluster-wide serialization locking function.

It uses a FIFO mechanism to manage the locks.
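The FIFO idea can be illustrated with a small Python sketch. This is not OCSSD's actual implementation; it only shows what first-in, first-out lock granting means. The class FifoLock, its methods, and the node name rac3 are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class FifoLock:
    """Toy lock that is handed to waiters strictly in arrival order."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.waiters: Deque[str] = deque()

    def request(self, node: str) -> bool:
        """Return True if the lock is granted immediately, else queue the node."""
        if self.holder is None:
            self.holder = node
            return True
        self.waiters.append(node)
        return False

    def release(self, node: str) -> Optional[str]:
        """Release the lock and hand it to the longest-waiting node, if any."""
        if self.holder != node:
            raise ValueError(f"{node} does not hold the lock")
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder


lock = FifoLock()
print(lock.request("rac1"))   # granted at once, nobody holds the lock
print(lock.request("rac2"))   # queued behind rac1
print(lock.request("rac3"))   # queued behind rac2
print(lock.release("rac1"))   # next holder is the first waiter
```

The waiter that asked first (rac2) gets the lock before rac3, which is the serialization guarantee a FIFO queue gives.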

OCR data is also updated by the OCSSD process.

OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery.

OCSSD process log location:

/u002/app/gridbase/diag/crs/rac2/crs/trace


Killing ocssd.bin, or any failure of it, causes the server to reboot in order to avoid a split-brain situation, so it is a very critical process.

[root@rac2 ~]# ps -ef|grep ocssd

grid     20834     1  1 Jan02 ?        00:10:32 /u002/app/oracle/product/12.1.0/grid/bin/ocssd.bin

root     22419 20010  0 16:05 pts/0    00:00:00 grep --color=auto ocssd

[root@rac2 ~]# kill -9 20834

[root@rac2 ~]#

My node got rebooted.


[root@rac2 ~]# uptime

 02:52:06 up 1 min,  1 user,  load average: 1.09, 0.49, 0.19 ===> The node rebooted and came back up 1 minute ago.

You can see in the alert.log on node 1 (rac1) that rac2 was evicted from the cluster.

[root@rac1 trace]# tail -100f alert.log

2021-01-04 04:03:45.630 [OCSSD(10356)]CRS-1612: Network communication with node rac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.620 seconds

2021-01-04 04:03:53.634 [OCSSD(10356)]CRS-1611: Network communication with node rac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds

2021-01-04 04:03:57.635 [OCSSD(10356)]CRS-1610: Network communication with node rac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds

2021-01-04 04:04:00.257 [OCSSD(10356)]CRS-1632: Node rac2 is being removed from the cluster in cluster incarnation 503751816

2021-01-04 04:04:00.271 [OCSSD(10356)]CRS-1601: CSSD Reconfiguration complete. Active nodes are rac1 .

2021-01-04 04:04:00.282 [CRSD(9324)]CRS-5504: Node down event reported for node 'rac2'.

2021-01-04 04:04:07.255 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'Generic'.

2021-01-04 04:04:07.255 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'ora.RACDB'.

2021-01-04 04:04:07.256 [CRSD(9324)]CRS-2773: Server 'rac2' has been removed from pool 'ora.TEST'.


This particular eviction happened when we had hit the network timeout.  CSSD exited and the cssdagent took action to evict. The cssdagent knows the information in the error message from local heartbeats made from CSSD. 
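The countdown in the CRS-1612, CRS-1611 and CRS-1610 messages above can be cross-checked with a short script. The Python sketch below parses the three warning lines and adds the remaining seconds to each timestamp to get the projected removal time. The regular expression is written against the lines in this post only, and the function name projected_removal is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Pattern for the "missing for N% of timeout interval" warnings in this post.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[OCSSD\(\d+\)\](?P<code>CRS-\d+): .*?node (?P<node>\S+) \(\d+\) "
    r"missing for (?P<pct>\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in (?P<left>[\d.]+) seconds"
)


def projected_removal(line):
    """Return (node, percent, projected removal time) or None if no match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return m["node"], int(m["pct"]), ts + timedelta(seconds=float(m["left"]))


log_lines = [
    "2021-01-04 04:03:45.630 [OCSSD(10356)]CRS-1612: Network communication with node rac2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.620 seconds",
    "2021-01-04 04:03:53.634 [OCSSD(10356)]CRS-1611: Network communication with node rac2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.620 seconds",
    "2021-01-04 04:03:57.635 [OCSSD(10356)]CRS-1610: Network communication with node rac2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.620 seconds",
]

results = [projected_removal(line) for line in log_lines]
for node, pct, when in results:
    print(node, pct, when.strftime("%H:%M:%S.%f")[:-3])
```

All three warnings project the removal of rac2 at about 04:04:00.25, which agrees with the CRS-1632 message logged at 04:04:00.257.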

If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes. 


Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin.

OCSSD :- This process is spawned by the cssdagent process. OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. One thread monitors the network heartbeat, and another thread monitors the disk heartbeat.

CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality).

It is responsible for spawning ocssd.bin.

It monitors the node and the CSSD process for hangs.

This is a multi-threaded process that runs at an elevated priority and runs as the root user.

Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent 


CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality).

This is a multi-threaded process that runs at an elevated priority and runs as the root user.

It provides redundancy for CSSD monitoring.

It monitors the node and the CSSD process for hangs.

Startup sequence:

INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor

=================================EOD================================

Reference:-
Troubleshooting Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)