Pivotal Knowledge Base


How To - Join Two Tables Running on Two Separate Databases

Environment

Product: GPDB
Version: All versions

Purpose

GPDB doesn't have the DBLINK feature yet, so it currently doesn't allow joins on tables in different databases.

In this article we will look at an alternative way to join two tables in different databases, using the power of external tables and gpfdist.

Procedure

Below is a quick example of how to perform a join on two tables in different databases. Here, we are trying to join two tables, source_table and destination_table, that live in two different databases.

Please note: In this setup the source and destination are on the same host, but the same approach can also be used to join tables from databases running on different hosts.

Setup

  • On the source database, we have the source_table

    gpadmin=# \c source
    You are now connected to database "source" as user "gpadmin".
    
    source=# CREATE TABLE source_table (a char,b int,c int);
    NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
    HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
    CREATE TABLE
    
  • Load the sample data into the source table

    source=# INSERT INTO source_table values ('A',generate_series(1,10),1);
    INSERT 0 10
    source=# INSERT INTO source_table values ('B',generate_series(1,10),1);
    INSERT 0 10
    source=# INSERT INTO source_table values ('C',generate_series(1,10),1);
    INSERT 0 10
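
  • (Optional) Verify the load before exporting; with the inserts above, each of 'A', 'B' and 'C' should report 10 rows

        source=# SELECT a, count(*) FROM source_table GROUP BY a ORDER BY a;
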
    
  • On our destination database we have the table destination_table

    source=# \c destination
    You are now connected to database "destination" as user "gpadmin".
    destination=# CREATE TABLE destination_table (a char,d int,e int);
    NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
    HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
    CREATE TABLE
  • Load the sample data into destination_table

    destination=# INSERT INTO destination_table values ('A',generate_series(1,10),1);
    INSERT 0 10
    destination=# INSERT INTO destination_table values ('D',generate_series(1,10),1);
    INSERT 0 10
    destination=# INSERT INTO destination_table values ('E',generate_series(1,10),1);
    INSERT 0 10
    
  • Start gpfdist, serving /tmp on port 8081

    [gpadmin@gpdbsne ~]$ gpfdist -d /tmp -p 8081 -l /tmp/log1 &
    [1] 3114
    [gpadmin@gpdbsne ~]$ 2015-10-17 03:57:57 3114 INFO Before opening listening sockets - following listening sockets are available:
    2015-10-17 03:57:57 3114 INFO IPV6 socket: [::]:8081
    2015-10-17 03:57:57 3114 INFO IPV4 socket: 0.0.0.0:8081
    2015-10-17 03:57:57 3114 INFO Trying to open listening socket:
    2015-10-17 03:57:57 3114 INFO IPV6 socket: [::]:8081
    2015-10-17 03:57:57 3114 INFO Opening listening socket succeeded
    2015-10-17 03:57:57 3114 INFO Trying to open listening socket:
    2015-10-17 03:57:57 3114 INFO IPV4 socket: 0.0.0.0:8081
    Serving HTTP on port 8081, directory /tmp
    

Source database

  • Create a writable external table similar to source_table

    source=# CREATE WRITABLE EXTERNAL TABLE external_source_table (LIKE source_table) LOCATION ('gpfdist://gpdbsne:8081/source.out')
    source-# FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
    source-# distributed by (a);
    CREATE EXTERNAL TABLE
    
  • Write the data to the external table file

    source=# INSERT INTO external_source_table  SELECT * FROM source_table ;
    INSERT 0 30
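
The INSERT above makes GPDB write the rows through gpfdist into /tmp/source.out as plain pipe-delimited text, per the FORMAT clause of the writable external table. As a rough sketch of that file layout (the rows and the .sample filename below are illustrative, not taken from a real run, and the row order produced by the segments is not guaranteed):

```shell
# Recreate a few illustrative rows in the same TEXT format the
# writable external table emits: one row per line, fields joined
# by '|' in column order (a char, b int, c int).
cat > /tmp/source.out.sample <<'EOF'
A|1|1
A|2|1
B|1|1
EOF

# Splitting each line on the delimiter recovers the columns.
awk -F'|' '{print "a=" $1, "b=" $2, "c=" $3}' /tmp/source.out.sample
```

This is why the readable external table on the destination side can declare matching columns and the same DELIMITER and read the rows straight back.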
    

Destination database

  • On the destination database, create a readable external table with the same definition as source_table

    destination=# CREATE READABLE EXTERNAL TABLE external_source_table (a char , b int , c int) LOCATION ('gpfdist://gpdbsne:8081/source.out')
    destination-# FORMAT 'TEXT' ( DELIMITER '|' NULL ' ');
    CREATE EXTERNAL TABLE
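
  • As a quick sanity check, the external table should now read the file written from the source database; with the sample data above this should return 30

        destination=# SELECT count(*) FROM external_source_table;
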
    
  • Now perform the join of the two tables, destination_table and external_source_table, as shown below

    destination=# SELECT *
    destination-# FROM destination_table dest , external_source_table ext 
    destination-# where dest.a=ext.a 
    destination-# AND dest.d=ext.b;
     a | d  | e | a | b  | c 
    ---+----+---+---+----+---
     A |  1 | 1 | A |  1 | 1
     A |  2 | 1 | A |  2 | 1
     A |  3 | 1 | A |  3 | 1
     A |  4 | 1 | A |  4 | 1
     A |  5 | 1 | A |  5 | 1
     A |  6 | 1 | A |  6 | 1
     A |  7 | 1 | A |  7 | 1
     A |  8 | 1 | A |  8 | 1
     A |  9 | 1 | A |  9 | 1
     A | 10 | 1 | A | 10 | 1
    (10 rows)
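
  • Once the join is done, the intermediate objects can be dropped; a sketch (the external table on each side is named external_source_table in this example)

        destination=# DROP EXTERNAL TABLE external_source_table;
        destination=# \c source
        source=# DROP EXTERNAL TABLE external_source_table;

    Then remove /tmp/source.out and stop the background gpfdist process (for example, kill %1 in the shell where it was started).
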
    

Additional Information

For more information on external tables and gpfdist, please refer to the Greenplum Database documentation.

Comments

  • Faisal Ali

    Hi Sandeep,

    Regarding duplicate data

    I don't see the command where you write the data to the external table. How many times did you insert the data into the writable external table?

    source=# INSERT INTO external_source_table  SELECT * FROM source_table ;
    INSERT 0 30
    

    From the output I see there is duplicate data in source.out, so it seems you might have executed the insert 4 times; could you please confirm?

    Regarding performing this on different hosts

    On the destination database on the destination host, create an external table, and for the gpfdist location you can put something like:

    LOCATION ('gpfdist://<source_host>:8081/source.out')
    