Pivotal Knowledge Base

How to recover Pivotal HDB local file system segment files

Environment

Product Version
Pivotal HDB 2.0.x
Ambari 2.x

Purpose

After a disk replacement or file system inconsistencies, files within the HAWQ segment directory may be missing or corrupt. This article explains how to recover from this situation in Pivotal HDB 2.x.

You may need to use this procedure if you see a message similar to this while starting up a HAWQ segment: 

2016-07-15 04:11:20.903106 IST,,,p380707,th571824256,,,,0,,,seg-10000,,,,,"FATAL","58P01", "could not open directory ""base"": No such file or directory",,,,,,,,
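
To check whether existing segment logs contain this error, a small shell helper can search them for it. This is a minimal sketch: the function name has_missing_dir_error is hypothetical, and the log directory passed to it is an assumption, since this article does not state where the segment logs are kept.

```shell
# Hypothetical helper: succeed if any file under the given log directory
# contains a FATAL 58P01 "could not open directory" entry like the one above.
# $1 = directory holding the segment log files (an assumption; adjust to your site)
has_missing_dir_error() {
    log_dir="$1"
    # -r: search recursively, -q: no output, -s: ignore unreadable or missing files
    grep -rqs -E '"FATAL","58P01", *"could not open directory' "$log_dir"
}
```

For example, has_missing_dir_error /path/to/segment/logs && echo "segment files are missing" prints the message only when a matching log entry is found.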

Procedure

Before starting this procedure, identify the root cause so that the problem does not recur. Once the root cause is understood and addressed (for example, by replacing faulty hardware), follow the steps below.

1. Based on the error message, attempt to recover the missing files. These files contain no customer data, so spend only minimal effort on recovering them. The following steps re-create the files automatically with very little effort.

2. Log into Ambari.

3. Go to HAWQ / HAWQ Segments, locate the segment with corrupt segment data, and stop the segment if it is not already down.

4. In Ambari, under HAWQ / Configs, find the value of "HAWQ Segment Directory".

5. Log into the affected host as user root via SSH.

6. If the segment files still exist, back them up using the directory path found in step 4 (although the data is not needed, this preserves the logs in case root cause analysis (RCA) is needed):

[root@hawq20dn2 ~]# tar czf /root/segment_backup_20160717.tgz /data/hawq/segment
tar: Removing leading `/' from member names
[root@hawq20dn2 ~]#
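
The backup step can also be scripted so that it is skipped cleanly when the segment directory no longer exists. The following is a minimal sketch, not part of the original procedure: the function name backup_segment_dir and its two arguments are hypothetical, and the archive name simply mirrors the date-stamped name used above.

```shell
# Hypothetical helper: archive the segment directory if it exists.
# $1 = segment directory (the "HAWQ Segment Directory" value from step 4)
# $2 = existing directory in which to write the archive
backup_segment_dir() {
    seg_dir="${1%/}"
    dest_dir="$2"
    if [ ! -d "$seg_dir" ]; then
        echo "No segment directory at $seg_dir; nothing to back up" >&2
        return 0
    fi
    archive="$dest_dir/segment_backup_$(date +%Y%m%d).tgz"
    # -C stores member names relative to the parent directory, which avoids
    # tar's "Removing leading /" notice
    tar czf "$archive" -C "$(dirname "$seg_dir")" "$(basename "$seg_dir")" || return 1
    # Print the archive path so the caller can record it
    echo "$archive"
}
```

For the host in this article, the call would be backup_segment_dir /data/hawq/segment /root, run as root.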

7. If the segment files exist, move them aside:

[root@hawq20dn2 ~]# mv /data/hawq/segment/ /data/hawq/segment_old/
[root@hawq20dn2 ~]# ls -ltr /data/hawq/
total 8
drwx------. 16 gpadmin gpadmin 4096 Jul 15 03:36 master
drwx------. 17 gpadmin gpadmin 4096 Jul 15 07:07 segment_old
[root@hawq20dn2 ~]#
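
Note that if a segment_old directory is already present from an earlier attempt, mv places the segment directory inside it instead of renaming it. A guarded version of this step might look like the sketch below; the function name move_segment_aside is hypothetical, and the _old suffix follows the example above.

```shell
# Hypothetical helper: rename <segment dir> to <segment dir>_old, refusing to
# proceed if the target already exists.
# $1 = segment directory (the "HAWQ Segment Directory" value from step 4)
move_segment_aside() {
    seg_dir="${1%/}"            # drop a trailing slash, if any
    old_dir="${seg_dir}_old"
    # Nothing to move if the segment directory is already gone
    [ -d "$seg_dir" ] || return 0
    if [ -e "$old_dir" ]; then
        echo "$old_dir already exists; remove or rename it first" >&2
        return 1
    fi
    mv "$seg_dir" "$old_dir"
}
```

For the host in this article, the call would be move_segment_aside /data/hawq/segment.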

8. In Ambari, go to HAWQ / HAWQ Segments, locate the segment with corrupt data, and start the segment.

9. Ambari should then automatically take the following actions: 

  • Create /usr/local/hawq/etc/hawq-site.xml if it does not exist
  • Create all users and directories if they do not exist
  • Put all necessary configuration files in /data/hawq/segment/
  • Start all HAWQ segment services

Note: If access to Ambari is not possible, hawq init segment and hawq start segment can be used instead on the affected node.
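
The manual fallback in the note above can be wrapped as follows. This is only a sketch under stated assumptions: it assumes the hawq command is on the PATH of the user running it (typically gpadmin), it runs the two commands in the order the note lists them, and the run and recreate_segment names and the DRY_RUN variable are hypothetical additions for previewing the commands.

```shell
# Run a command, or only print it when DRY_RUN=1 (hypothetical convenience).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Re-create and start the local segment on the affected host.
# Stops at the first command that fails.
recreate_segment() {
    run hawq init segment || return 1
    run hawq start segment
}
```

Running DRY_RUN=1 recreate_segment first shows the commands without executing them.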

10. Confirm the new segment directory has been created:

[root@hawq20dn2 ~]# ls -ltr /data/hawq/
total 12
drwx------. 16 gpadmin gpadmin 4096 Jul 15 03:36 master
drwx------. 17 gpadmin gpadmin 4096 Jul 15 07:07 segment_old
drwx------. 16 gpadmin gpadmin 4096 Jul 15 07:10 segment
[root@hawq20dn2 ~]#

11. Confirm that the segment is up, both in the output of hawq state and in Ambari:

[gpadmin@hawq20dn2 ~]$ hawq state
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:--HAWQ instance status summary
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master instance = Active
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Master standby = hawq20dn1.lab
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Standby master state = Standby host passive
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment instance count from config file = 3
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Segment Status
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:------------------------------------------------------
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segments count from catalog = 3
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment valid (at master) = 3
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total segment failures (at master) = 0
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files missing = 0
20160715:07:12:00:411146 hawq_state:hawq20dn2:gpadmin-[INFO]:-- Total number of postmaster.pid files found = 3
[gpadmin@hawq20dn2 ~]$
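
This check can be automated by parsing the summary that hawq state prints. The sketch below relies on the exact wording of the "Total segment failures (at master)" line shown above, which may differ in other versions; the function name segments_all_valid is hypothetical.

```shell
# Hypothetical helper: read `hawq state` output on stdin and succeed only if
# the reported segment failure count is 0.
segments_all_valid() {
    # Extract the number from the "Total segment failures (at master) = N" line
    failures=$(sed -n 's/.*Total segment failures (at master) *= *\([0-9][0-9]*\).*/\1/p' | head -n 1)
    # Fails when the count is non-zero or the line was not found at all
    [ "$failures" = "0" ]
}
```

A typical call, run as gpadmin, would be hawq state | segments_all_valid && echo "all segments valid".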
