- Pivotal Greenplum Database (GPDB) 4.3.x
- Operating System: Red Hat Enterprise Linux 6.x
After running for a couple of hours, gpdbrestore failed on the segments. The error messages in the gpdbrestore log looked similar to the following:
20160402:11:30:54|gp_restore-[ERROR]:-Error executing query SELECT * FROM gp_read_backup_file('db_dumps/20160402', '20160402112535', 2) : ERROR: Backup File /data1/primary/gpseg0/db_dumps/20160402/gp_restore_status_0_2_20160402112535 Type 2 could not be found
20160402:11:30:54|gp_restore-[ERROR]:-restore failed for source dbid 2, target dbid 2 on host sdw1
Upon tracing, it was found that the segment reporting the error had encountered an Out of Memory (OOM) situation. To identify the segment, find the first occurrence of the "Type 2" error in the gpdbrestore log under the ~/gpAdminLogs directory, then check the restore status file for that segment. In this case it was seg0 on sdw1; the restore status file is located in the segment data directory.
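The lookup described above can be sketched with standard shell tools. The log line below is a sample copied from the error output, standing in for a real log under ~/gpAdminLogs:

```shell
# Find the first "Type 2" error in the gpdbrestore log and pull out the
# backup/status file path it names. LOG here is a temp file seeded with a
# sample line; point it at the real log under ~/gpAdminLogs instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
20160402:11:30:54|gp_restore-[ERROR]:-Error executing query SELECT * FROM gp_read_backup_file('db_dumps/20160402', '20160402112535', 2) : ERROR: Backup File /data1/primary/gpseg0/db_dumps/20160402/gp_restore_status_0_2_20160402112535 Type 2 could not be found
EOF

# First matching line, then the file path embedded in it.
first_err=$(grep -m1 'Type 2' "$LOG")
status_file=$(echo "$first_err" | grep -o '/[^ ]*gp_restore_status[^ ]*')
echo "$status_file"

# The actual status file lives in the segment data directory on that host,
# e.g.: ssh sdw1 cat /data1/primary/gpseg0/gp_restore_status_0_2_20160402112535
```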
[gpadmin@sdw1 ~]$ cat /data1/primary/gpseg0/gp_restore_status_0_2_20160402112535
20160402:11:29:14|gp_restore_agent-[INFO]:-Before fork of gp_restore_agent
20160402:11:29:14|gp_restore_agent-[INFO]:-Command Line: /bin/cat /data1/primary/gpseg0/db_dumps/20160402/gp_dump_0_2_20160402112535.gz | /bin/gunzip -c | /usr/local/GP-188.8.131.52/bin/psql -h sdw1 -p 37000 -U gpadmin -d spring -a
20160402:11:29:14|gp_restore_agent-[INFO]:-Starting monitor thread
ERROR: column "test" of relation "testbackup1" does not exist
out of memory
gunzip: stdout: Broken pipe
/bin/cat: write error: Broken pipe
20160402:11:30:54|gp_restore_agent-[ERROR]:-psql finished abnormally with return code 1.
20160402:11:30:54|gp_restore_agent-[ERROR]:-Finished with errors
Greenplum Database backups store the data as COPY commands, and the restore loads it back through standard input (STDIN). When COPY reads from STDIN and the target table does not exist, or its columns do not exactly match the metadata in the dump, the incoming data is buffered in memory instead of being written to the table, and an OOM situation can occur if the data set is large. These issues are typically the result of one of the following scenarios:
- The schema of a partitioned table was changed at the source. The child tables continue to exist under the old schema, and the backup references that old schema. This has been fixed in version 184.108.40.206 and above: if the backup is taken with version 220.127.116.11 or later, the dump includes ALTER statements for the child tables to change their schema.
- During data transfer between the PROD and DR clusters, the table structures sometimes do not match. This can happen, for example, if the table was altered on either side and a column was added or dropped. In that case, the structures differ and the data is buffered in memory during the restore.
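When a column mismatch is suspected, one quick check is to look at the COPY statements inside the segment dump file itself, since those name the columns the restore will feed over STDIN. A minimal sketch, using a small generated dump as a stand-in for a real gp_dump_0_2_<timestamp>.gz file:

```shell
# Extract the COPY statement(s) from a compressed segment dump. The dump
# created here is a sample; substitute the real dump file path from the
# db_dumps directory shown in the status file above.
dump=$(mktemp)
printf 'COPY testbackup1 (id, test) FROM stdin;\n1\tfoo\n\\.\n' | gzip -c > "$dump"

# List the first COPY statement; its column list is what the restore expects.
copy_stmt=$(gzip -dc "$dump" | grep -m1 '^COPY ')
echo "$copy_stmt"
```

Compare the column list printed here with the target table's definition (for example, via \d testbackup1 in psql); a column named in the dump but missing on the target, like "test" in the error above, reproduces this failure.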
If the schemas differ between parent and child tables, upgrade to version 18.104.22.168 or above and take the backup again. You can verify at the source database whether any parent and child schemas differ using the following query:
SELECT schemaname,
       tablename,
       partitionschemaname,
       partitiontablename
FROM   pg_partitions
WHERE  schemaname <> partitionschemaname;
Secondly, if the metadata already exists at the destination, make sure the table structure there matches the source. If it does not, a full restore is needed at the destination.
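That structural cross-check can be sketched as follows. The psql command in the comment is what you would run against each cluster (the hostnames mdw-prod/mdw-dr and table public.testbackup1 are placeholders); sample files stand in for its output here so the comparison itself is runnable:

```shell
# On each cluster, dump the column list of the suspect table, e.g.:
#   psql -h mdw-prod -d spring -Atc "SELECT attname, atttypid::regtype
#     FROM pg_attribute
#     WHERE attrelid = 'public.testbackup1'::regclass
#       AND attnum > 0 AND NOT attisdropped
#     ORDER BY attnum" > source_cols.txt
src=$(mktemp); dst=$(mktemp)
printf 'id|integer\ntest|text\n' > "$src"   # sample source columns
printf 'id|integer\n'            > "$dst"   # sample destination columns

# Any diff output means the structures do not match on the destination.
if diff -u "$src" "$dst"; then
    result="match"
else
    result="differ"
fi
echo "Structures $result"
```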
Refer to issue 25549 in the Resolved Issues section of the version 22.214.171.124 Release Notes.
For further information, please refer to the following: