How to RCA for segment(s) marked down

Environment

Product: Pivotal Greenplum (GPDB)
Version: 4.3.x
OS: RHEL 6.x

Purpose

This article describes which facts to review in order to determine the cause of segments going down. There can be many reasons for segments being marked down, so it is important to understand the basic principles and take action accordingly.

Information Collection

In order to RCA a "Segment Down" event, the following information needs to be reviewed:

-- gp_segment_configuration table

select * from gp_segment_configuration gsc join pg_filespace_entry pfe on gsc.dbid=pfe.fsedbid where content= <contentID for segment which went down> ;
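
To quickly list every segment currently marked down before narrowing in on a specific contentID, the same table can be queried on its status column ('d' = down, 'u' = up):

select dbid, content, role, preferred_role, mode, status, hostname, port from gp_segment_configuration where status = 'd' order by content, role;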

-- gp_configuration_history table

select * from gp_configuration_history order by time desc;
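
To focus on a recent window rather than the entire history, a time filter can be added; the 7-day interval below is only an example and can be adjusted as needed:

select * from gp_configuration_history where time > now() - interval '7 days' order by time desc;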

-- DB log files from the date and time segments went down

Master logs ($MASTER_DATA_DIRECTORY/pg_log/)
Primary logs (segment_data_directory/pg_log/)
Mirror logs (segment_data_directory/pg_log/)
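
GPDB log files are typically named by date (gpdb-YYYY-MM-DD_HHMMSS.csv), so a quick first pass is to search the master log for the failure window. The date and search terms below are only examples and should be adapted to the event being investigated:

grep -iE "transition|mirror|FTS" $MASTER_DATA_DIRECTORY/pg_log/gpdb-2016-01-14_*.csv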

-- Relevant configuration parameters (gpconfig -s)

gpconfig -s gp_fts_probe_interval 
gpconfig -s gp_fts_probe_threadcount
gpconfig -s gp_fts_probe_timeout
gpconfig -s gp_segment_connect_timeout
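
If only the master's current values are needed, the same parameters should also be visible in pg_settings (gpconfig -s is still required to confirm the values on every segment):

select name, setting from pg_settings where name in ('gp_fts_probe_interval', 'gp_fts_probe_threadcount', 'gp_fts_probe_timeout', 'gp_segment_connect_timeout');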

-- The time (exact or approximate) when the segment went down

-- Identification of the segment - either the DBID or the combination of ContentID and Role (Primary/Mirror)

Note: The corresponding primary or mirror segment for a specific segment can be found in the "gp_segment_configuration" table - a primary/mirror pair will have the same "content" value.
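
For example, to see both members of the pair for one contentID (the contentID below is a placeholder):

select dbid, content, role, preferred_role, status, hostname, port from gp_segment_configuration where content = <contentID> order by role;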

How to do the RCA

From gp_segment_configuration and the master logs, identify a few things:

Which segments went down?
What time did they go down?
Is it only primaries?
Is it only mirrors?
Is it only one server or one rack?
Is it a mix of primaries and mirrors?

In most cases it is the mirror segments that are marked down. The general cause of mirror segments going down is an inability of the primary and mirror to maintain timely communication, with the primary segment not receiving confirmation from the mirror within the time limit controlled by gp_segment_connect_timeout.

CASE 1:

This is indicative of mirrors going down due to high workload or network overload. In this case, "gp_segment_connect_timeout" can be increased to allow a longer response time from the mirror. This is not a permanent fix; if the workload keeps increasing, another failure can happen later. A typical warning in the logs looks like the following:

2013-05-08 04:10:50.730638 EDT,,,p28480,th111540096,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=1500' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
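
Following the hint in the warning itself, the timeout can be raised with gpconfig and picked up with a configuration reload. The value 1800 below is only an example; choose a value appropriate for the workload:

gpconfig -c gp_segment_connect_timeout -v 1800
gpstop -u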

CASE 2:

Mirrors can also go down due to missing files. In this case, search the segment log files for entries referring to 'transition' and to missing files.
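
As a starting point for that search, something like the following can be used; the path placeholder, date pattern, and error strings are only examples, since the exact wording of missing-file messages varies:

egrep -i "transition|No such file|could not open" <segment_data_directory>/pg_log/gpdb-2016-01-*.csv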

CASE 3:

Primary segments are marked down. There can be multiple reasons for this. Start by reviewing the primary segment log files, searching through the timeframe when the segment was marked down. Look for the word "transition"; the log messages around it will help explain why the segments went down. The reason could be one of the following (a search example follows the list):

PANIC/SIGSEGV
Out of memory (OS or VMEM)
Network issues
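
A quick way to scan for these causes is to search the primary segment logs for the keywords above and, for OS-level out-of-memory kills, to check the kernel log on the segment host. The path placeholder and date pattern are only examples:

egrep -i "transition|PANIC|SIGSEGV|out of memory" <segment_data_directory>/pg_log/gpdb-2016-01-*.csv
grep -i "out of memory" /var/log/messages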

CASE 4:

The postmaster process on the primary segment periodically verifies that I/O on the segment data directory works properly (a file can be written and read). It does this by writing a file under the data directory ("fts_probe_file.bak"). If there is a problem with the I/O (for example, a stuck controller), the segment will not be able to respond to the FTS process on the master, and FTS will promote the mirror to primary and transition the primary to mirror. Symptoms of this issue are segments being transitioned while the segment servers appear "stuck" and nobody is able to connect to them.
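
A quick check on a host suspected of a stuck I/O path is to look for controller or disk errors in the kernel log and to attempt a small write in the affected data directory. The data directory path below is hypothetical; substitute the real segment data directory (if the write hangs, that in itself confirms the symptom):

dmesg | egrep -i "i/o error|scsi|reset"
touch /data/primary/gpseg0/io_test_file && rm /data/primary/gpseg0/io_test_file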

Long Term Trend Analysis

Often we need to analyze the past behavior of segment failures to identify long-term trends such as possible hardware issues. There is a psql script attached to this article (segment_failures.sql) which can be used for this purpose.

This script will analyze the last three months of segment failures and produce 3 reports:

  1. Any primary segments that have failed more than once within the reporting window.
  2. Any mirror segments that have failed more than once within the reporting window.
  3. Any server with segment failures, showing the date and time (to an hour granularity) and the number of segment failures (mirror and primary) within that hour window.

The output will look similar to the following:

[gpadmin@mdw ~]$ psql -p 54320 -f f.sql
Timing is on.
Primary segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
(0 rows)

Time: 15.570 ms
Mirror segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
 sdw1     |       2 |               2
 sdw1     |       3 |               2
(2 rows)

Time: 4.946 ms
Hosts and time with failures
 hostname |      failure time      | number_failures
----------+------------------------+-----------------
 sdw1     | 2016-01-14 10:00:00-08 |               4
 sdw1     | 2016-02-29 07:00:00-08 |               2
 sdw1     |                        |               6
          |                        |               6
(4 rows)

Time: 2.476 ms

Based on the above, certain segments can be flagged for investigation of potential failures. The third report can also be used to roll up the total number of segment failures per node for the reporting period.
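
If the attached script is not at hand, a minimal sketch of the kind of query behind the third report can be built from gp_configuration_history and gp_segment_configuration. The filter on the description text below is an assumption and may need adjusting to match the exact wording logged by FTS in your release; the attached script additionally rolls the counts up per host and overall, as shown in the sample output above:

select gsc.hostname, date_trunc('hour', gch.time) as failure_time, count(*) as number_failures
from gp_configuration_history gch
join gp_segment_configuration gsc on gch.dbid = gsc.dbid
where gch.time > now() - ('3 month')::INTERVAL
and gch."desc" ilike '%down%'
group by gsc.hostname, date_trunc('hour', gch.time)
order by gsc.hostname, failure_time;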

If you need to change the reporting period, simply alter the first line of the script:

\set report_interval ('3 month')::INTERVAL

Tips

  1. Details about segments going down are located in the segment database logs.
  2. Use gp_configuration_history to check whether there are any patterns.
  3. Use gpstate -e for a quick view of the state of segments.
  4. Catalog issues can also cause segments to go down.
  5. Always understand why a segment went down before suggesting a recovery.
  6. A full recovery (gprecoverseg -F) deletes all the data and files in the failed segment's data directory.
  7. For the reason above, running an incremental recovery after a full recovery will not work.

Check the document mentioned here for more information on segment failure analysis.
