Pivotal Knowledge Base

Follow

Greenplum External Table Error Handling

Environment

Product Version
 Pivotal Greenplum (GPDB)  4.3.x
 
Overview

External Error Tables

When external tables (GPDB Administrator Guide - Loading and Unloading Data) are created, there is an option to use error table, where any errors from scanning the files will be stored (for the purpose of diagnosing dirty data). This error handling feature is very important in loading and cleaning data with a small number of random formatting and data errors.
 
Up to GPDB 4.3.3 the error tables were created as normal heap tables (GPDB 4.3.2.0 Reference Guide - CREATE EXTERNAL TABLE - see LOG ERRORS INTO <error_table> clause). The access to these tables was with normal SQL statements (SELECT, TRUNCATE, etc.). The fact that the error data was written to these tables during scanning the external table (reader slice), created problems because multiple reader-slices would write into a heap table and that would result in corruption on rare occasions.
 
Resolution
 
To avoid the problem described above, from GPDB 4.3.3.0+, there is a new type of external error tables added, where the external error tables are not created as heap tables anymore (GPDB 4.3.6.0 Reference Guide - CREATE EXTERNAL TABLE - see the LOG ERRORS clause). This initiative avoids the problem of transactional and unsynchronized write to a heap table from multiple sessions. Consequently, the access to these external error tables is done through functions instead of SQL statements:
  • gp_read_error_log(<ext_table_name>)
  • gp_truncate_error_log(<ext_table_name>) 
The external error table is not named explicitly in the CREATE TABLE statement and is accessed by the name of the main external table through the functions.
 
Note: The old syntax (LOG ERRORS INTO <error_table>) and old way of external error table handling is still kept and functional for compatibility purposes, but the recommendation is to convert to new error table handling and new syntax (LOG ERRORS).
 
Note: If you upgrade to GPDB 4.3.3+ and do not change the external table creation syntax and external error table access, you will still be using the old heap external error tables and be exposed to the risks associated with that.
 
Pre 4.3.3.0 syntax:
CREATE 
[READABLE] EXTERNAL TABLE
<table name>
(<column definitions>)
LOCATION (‘<location>’)
FORMAT <format>
LOG ERRORS INTO <error table name>
SEGMENT REJECT LIMIT <reject limit>;
4.3.3.0+ syntax: 
CREATE 
[READABLE] EXTERNAL TABLE
<table name>
(<column definitions>)
LOCATION (‘<location>’)
FORMAT <format>
LOG ERRORS
SEGMENT REJECT LIMIT <reject limit>;
Example:
 
1. Prepare a data file:
lpetrovmac:tmp lpetrov$ cat a.txt
1
2
3
4
5
lpetrovmac:tmp lpetrov$
2. Create external table (note the "LOG ERRORS" clause with no name specified):
lpetrov=# create external table e1(a int) location ('file:///tmp/a.txt') format 'text' log errors segment reject limit 10 rows;
CREATE EXTERNAL TABLE
lpetrov=#
 
lpetrov=# \d e1
  External table "public.e1"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a      | integer | 
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter '    ' null '\N' escape '\'
External location: file:///tmp/a.txt
Segment reject limit: 10 rows
Error Log in File
3. Select data - no errors
lpetrov=# select * from e1;
 a 
---
 1
 2
 3
 4
 5
(5 rows)
4. Artificially create errors (external table will parse integers, file contains strings):
lpetrovmac:tmp lpetrov$ cat a.txt 
a1
a2
a3
a4
a5
lpetrovmac:tmp lpetrov$ 
5. Select data with errors and review the errors:
lpetrov=# select * from e1;
NOTICE:  Found 5 data formatting errors (5 or more input rows). Rejected related input data.
 a 
---
(0 rows)
 
lpetrov=# select * from gp_read_error_log('e1');
            cmdtime            | relname |            filename            | linenum | bytenum |                      errmsg                      | rawdata | rawbytes 
-------------------------------+---------+--------------------------------+---------+---------+--------------------------------------------------+---------+----------
 2015-10-21 11:19:19.028313-07 | e1      | file:///tmp/a.txt [/tmp/a.txt] |       1 |         | invalid input syntax for integer: "a1", column a | a1      | 
 2015-10-21 11:19:19.028313-07 | e1      | file:///tmp/a.txt [/tmp/a.txt] |       2 |         | invalid input syntax for integer: "a2", column a | a2      | 
 2015-10-21 11:19:19.028313-07 | e1      | file:///tmp/a.txt [/tmp/a.txt] |       3 |         | invalid input syntax for integer: "a3", column a | a3      | 
 2015-10-21 11:19:19.028313-07 | e1      | file:///tmp/a.txt [/tmp/a.txt] |       4 |         | invalid input syntax for integer: "a4", column a | a4      | 
 2015-10-21 11:19:19.028313-07 | e1      | file:///tmp/a.txt [/tmp/a.txt] |       5 |         | invalid input syntax for integer: "a5", column a | a5      | 
(5 rows)
6. Clearing the error log
lpetrov=# select gp_truncate_error_log('e1');
 gp_truncate_error_log 
-----------------------
 t
(1 row)
 
lpetrov=# select * from gp_read_error_log('e1');
 cmdtime | relname | filename | linenum | bytenum | errmsg | rawdata | rawbytes 
---------+---------+----------+---------+---------+--------+---------+----------
(0 rows)
 
lpetrovmac:errlog lpetrov$ pwd
/Users/lpetrov/greenplum-db-data/gpseg0/errlog
 
lpetrovmac:errlog lpetrov$ ls -ltr
lpetrovmac:errlog lpetrov$ 
7. Show the location of the error external table file:
lpetrovmac:errlog lpetrov$ pwd
/Users/lpetrov/greenplum-db-data/gpseg0/errlog
lpetrovmac:errlog lpetrov$ 
 
lpetrovmac:errlog lpetrov$ ls -ltr
total 8
-rw-------  1 lpetrov  staff  660 Oct 21 10:52 175758_482860
lpetrovmac:errlog lpetrov$ 
External error table file name format is <DBOID_TABOID>, where DBOID is the database oid and TABOID is the external table oid:
lpetrovmac:errlog lpetrov$ ls -ltr
total 8
-rw-------  1 lpetrov  staff  660 Oct 21 10:52 175758_482860
 
lpetrov=# select oid, datname from pg_database where datname = 'lpetrov';
  oid   | datname 
--------+---------
 175758 | lpetrov
(1 row)
 
lpetrov=# select oid from pg_class where relname = 'e1';
  oid   
--------
 482860
(1 row)
Related GPDB Parameters:
lpetrov=# show gp_initial_bad_row_limit;

gp_initial_bad_row_limit -------------------------- 1000 (1 row)

Comments

Powered by Zendesk