Pivotal Knowledge Base

Follow

Invalid query results when querying EXTERNAL TABLES with gpfdist location

Environment

Product Version
Pivotal Greenplum (GPDB) 4.3.x

Symptom

When querying an EXTERNAL TABLE with the location set to gpfdist, multistage query plans will produce invalid results without producing any error messages.

  • File based external tables do not exhibit this issue.
  • Hash based tables do not exhibit this issue.

Regardless of using a hash table, external table with gpfdist or file, the query planner is creating the same plan but:

  • Hash tables provide the correct results
  • External Tables with File definition provide correct results
  • External Tables with gpfdist definition never provide correct results unless gp_enable_multiphase_agg is off.

Error Message:

No error messages will be logged for this issue.  Master and segment logs will show normal activity while gpfdist, even with debugging, will show normal activity as well. 

Cause 

This issue has been raised to Pivotal Engineering via MPP-26406 and is awaiting their feedback on the cause to this issue.

RCA

1) Setup a file to load with gpfdist such as:

[gpadmin@gpdb-sandbox gpfdist]$ cat new_file_list.csv
a;1;1
b;2;2
c;3;3
d;4;4
e;5;5
f;6;6

2) Create the external table definition:

CREATE EXTERNAL TABLE ext_new_file_list (name text, id int, val int)
LOCATION ('gpfdist://localhost:9300/new_file_list.csv')
FORMAT 'text' (delimiter ';' null '' escape 'off')
LOG ERRORS INTO err_new_file_list SEGMENT REJECT LIMIT 100;

3. Start gpfdist -p 9300 where the file is located

4. Run some queries and notice you never get the correct values back:

distinct_repro=# select * from ext_new_file_list;
 name | id | val
------+----+-----
 a    |  1 |   1
 b    |  2 |   2
 c    |  3 |   3
 d    |  4 |   4
 e    |  5 |   5
 f    |  6 |   6
(6 rows)

distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     6 |     6 |     0
(1 row)

distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     0 |     0 |     6
(1 row)

distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     6 |     6 |     0
(1 row)

distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     6 |     6 |     0
(1 row)

5. Set gp_enable_multiphase_agg off and immediately the results are correct:

distinct_repro=# set gp_enable_multiphase_agg=off;
SET
distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     6 |     6 |     6
(1 row)

distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ;
 count | count | count
-------+-------+-------
     6 |     6 |     6
(1 row)

Resolution

This issue has been raised to Pivotal Engineering via MPP-26406 and is awaiting their resolution. 

Workaround

  • Set the GUC gp_enable_multiphase_agg=off
  • If possible, use a file based external table vs. gpfdist as it does not exhibit this problem

 

Comments

Powered by Zendesk