Pivotal Knowledge Base

GPRECOVERSEG Fails with Error: "Cannot Write: No Space Left on Device"

Environment

 Product      Version
 Pivotal HD   2.1
 HAWQ         1.2, 1.3.x

Symptom

When gprecoverseg is used to recover segments, it fails with the following error: "Cannot write: No space left on device."

20170126:11:19:30:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Starting gprecoverseg with args: -i /tmp/gprecoverseg -F
(...)
20170126:11:19:50:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-2 segment(s) to recover
20170126:11:19:50:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Ensuring 2 failed segment(s) are stopped
...
20170126:11:19:54:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Cleaning files from 2 segment(s)
.........
20170126:11:20:03:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Building template directory
20170126:11:20:03:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Creating template
20170126:11:20:04:120429 gprecoverseg:hawqmaster:gpadmin-[INFO]:-Starting copy of segment dbid 2 to location /tmp/GPSQL/gpsql_template20170126_112003
20170126:11:21:10:120429 gprecoverseg:hawqmaster:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 2
Command was: '/bin/tar -C /tmp/GPSQL/gpsql_template20170126_112003 -xf /tmp/GPSQL/gpsql_template20170126_112003/hawq_template20170126_112004'
rc=2, stdout='', stderr='/bin/tar: ./pg_distributedlog/016F: Wrote only 7680 of 10240 bytes
/bin/tar: ./pg_distributedlog/0170: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0171: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0172: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0173: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0174: Cannot write: No space left on device
/bin/tar: ./pg_distributedlog/0175: Cannot write: No space left on device

(...)

/bin/tar: ./postgresql.conf: Cannot write: No space left on device
/bin/tar: ./postmaster.pid: Cannot write: No space left on device
/bin/tar: Exiting with failure status due to previous errors
'
Traceback (most recent call last):
File "/usr/local/hawq/ext/python/lib/python2.6/logging/__init__.py", line 769, in emit
stream.write(fs % msg)
IOError: [Errno 28] No space left on device

When this happens, the / partition is 100% full:

[root@hawq21 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 9.8G 9.8G 0G 100% /
tmpfs 2.9G 0 2.9G 0% /dev/shm
/dev/sda1 477M 41M 411M 9% /boot
/dev/sda7 55G 22G 31G 41% /data
/dev/sda2 20G 45M 19G 1% /home
/dev/sda3 9.8G 24M 9.2G 1% /tmp
[root@hawq21 ~]#

Cause

  • When gprecoverseg runs on HAWQ 1.x, the master copies the entire segment directory from one of the running segments into the master's /tmp directory to create a template.
  • In the example above, the segment with DBID 2 was chosen as the source of the template.
  • The template is compressed and copied to /tmp/GPSQL/gpsql_template<TIMESTAMP>.
  • The template is then uncompressed, and pg_log and other directories are removed. Because the uncompressed copy can be large, this step can fail with "No space left on device" errors.
  • If X segments are being recovered, the contents are untarred X times into the /tmp directory, which increases the risk of running out of space.
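The space requirement described above can be estimated before recovery is started. The following is a minimal sketch and not part of HAWQ: the function name template_space_check and its arguments are hypothetical, and it assumes the worst case, in which every segment being recovered needs one full uncompressed copy of the source segment directory.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with HAWQ).
# Usage: template_space_check <segment_dir> <target_dir> <num_segments>
# Returns 0 if the filesystem holding <target_dir> has more free space than
# <num_segments> uncompressed copies of <segment_dir> would need.
template_space_check() {
    local seg_dir=$1 target=$2 count=$3
    local seg_kb avail_kb need_kb

    # Size of the source segment directory, in KB.
    seg_kb=$(du -sk "$seg_dir" | awk '{print $1}')

    # Free space on the filesystem that holds the target directory, in KB.
    # -P forces the POSIX single-line output format.
    avail_kb=$(df -Pk "$target" | awk 'NR==2 {print $4}')

    # Assumed worst case: one full copy per segment being recovered.
    need_kb=$((seg_kb * count))

    echo "need=${need_kb}KB avail=${avail_kb}KB"
    [ "$avail_kb" -gt "$need_kb" ]
}

# Example call; the segment path is a placeholder, substitute the real one:
# template_space_check /path/to/segment_dir /tmp 2 || echo "not enough space"
```

The estimate is deliberately pessimistic, since pg_log and some other directories are removed from the template after it is unpacked.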

Workaround

Make sure the free space on / is larger than the size of the segment directory that will be copied. Running du -sh ./* inside that directory shows where the space is being used. If there is not enough free space, move the log files out of the pg_log directory on the running segment that the files are being copied from.

Once gprecoverseg has completed, the log files can be moved back into the pg_log directory on the source segment.
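Moving the log files out and back can be sketched as follows. The function move_files is a hypothetical helper, not a HAWQ utility, and the paths in the usage comments are placeholders. It assumes the log files sit directly inside pg_log and that the holding directory is on a filesystem with enough free space.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with HAWQ).
# Usage: move_files <from_dir> <to_dir>
# Moves every regular file directly inside <from_dir> into <to_dir>.
# Subdirectories of <from_dir> are left untouched.
move_files() {
    local from=$1 to=$2
    mkdir -p "$to"
    find "$from" -maxdepth 1 -type f -exec mv {} "$to"/ \;
}

# Before recovery (placeholder paths, substitute the real ones):
# move_files /path/to/segment_dir/pg_log /data/pg_log_hold
# After gprecoverseg has completed:
# move_files /data/pg_log_hold /path/to/segment_dir/pg_log
```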

Alternatively, recover the segments in smaller groups rather than all at once, using gprecoverseg -i <file>.
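One possible way to build the smaller groups is to split the recovery file into chunks. This is only a sketch: split_recovery_file is a hypothetical helper, and it assumes that the file passed to -i describes exactly one segment per line, with no header or comment lines. It prints the gprecoverseg commands instead of running them so that each group can be reviewed and run one at a time.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with HAWQ).
# Usage: split_recovery_file <file> <group_size> <out_prefix>
# Splits <file> into pieces of at most <group_size> lines, named
# <out_prefix>aa, <out_prefix>ab, ..., and prints one command per piece.
# Assumption: each line of <file> describes exactly one segment.
split_recovery_file() {
    local file=$1 size=$2 prefix=$3 part
    split -l "$size" "$file" "$prefix"
    for part in "$prefix"*; do
        echo "gprecoverseg -i $part"
    done
}

# Example call (placeholder prefix; choose one that matches no existing files):
# split_recovery_file /tmp/gprecoverseg 1 /tmp/gprecoverseg_part_
```

Any other options used in the original invocation still need to be added to the printed commands by hand.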

 
