Pivotal Knowledge Base

Tip - Copying data tables between servers that are not connected (using gpfdist)

Goal

How to copy the data of a table from one server to another server.

Solution

On Source Server

1. Create a directory under the /data filesystem, or under whichever filesystem has available space.

cd /data
mkdir transfer_tables

2. Start a "gpfdist" process for the directory created.

gpfdist -d /data/transfer_tables -p 8080 -l /tmp/gpfdist.log &

-- Check whether the gpfdist process started using the "ps -ef|grep gpfdist" command. The output would look something like this:

[gpadmin@mdw ~]$ ps -ef|grep dist
gpadmin   6576     1  0 May03 ?        00:00:00 gpfdist -d /data/transfer_tables -p 8080 -l gpfdist.log
gpadmin  15439  9932 10 17:00 pts/1    00:00:00 grep dist

3. Connect to the database and create a "writable" external table for each table that needs to be moved. -- The table to transfer in this example is "move_me".

gpadmin=# \d move_me
     Table "public.move_me"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 id     | integer           |
 name   | character varying |
Distributed by: (id)

-- Create a writable external table ext_of_move_me LIKE the move_me table using the command below:

gpadmin=# CREATE WRITABLE EXTERNAL TABLE ext_of_move_me(like move_me) location ('gpfdist://mdw:8080/move_me.csv') format 'text';
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE

-- Transfer the data into a flat file by inserting into the external table created:

gpadmin=# insert into ext_of_move_me select * from move_me;
INSERT 0 1000

-- You can see the file created in the directory that was given to gpfdist when it was started:

ls -ltrh /data/transfer_tables
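
-- For illustration, the listing would show the flat file named in the LOCATION clause of the writable external table. The size, permissions, and timestamp below are only placeholders:

[gpadmin@mdw ~]$ ls -ltrh /data/transfer_tables
total 40K
-rw------- 1 gpadmin gpadmin 38K May  3 10:15 move_me.csv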

4. Once you have created flat files for all the tables that need to be copied using the above procedure (step 3), you can use an external hard disk to move the files to the target machine, since there is no connectivity between the servers. A sketch of this copy is shown below.
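
-- For example, assuming the external disk is mounted at /mnt/usb (a hypothetical mount point), the flat files can be copied off the source and restored on the target with rsync or plain cp:

# On the source host: copy the flat files to the external disk
rsync -av /data/transfer_tables/ /mnt/usb/transfer_tables/

# On the target host, after attaching the disk: restore the files
mkdir -p /data/transfer_tables
rsync -av /mnt/usb/transfer_tables/ /data/transfer_tables/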

On Target Server

5. Once all the flat files are copied to the target machine (assuming you have enough space to hold them), start the gpfdist process, as in step 2. This assumes you copied the files to the same location as on the source; the location can be different, but then you have to change the directory in the gpfdist command:

gpfdist -d /data/transfer_tables -p 8080 -l /tmp/gpfdist.log &

6. Connect to the database and create a "readable" external table to load the data into the target tables. These external tables must have the same structure as the tables on the source system. If the table structure already exists on the target, you can reuse it.

-- Connect to the database and create a readable external table as shown below (assuming you don't have the structure on the target), specifying all the column names while creating it:

checkcat=# CREATE READABLE EXTERNAL TABLE ext_to_load_move_me(id int, name varchar) location ('gpfdist://mdw:8080/move_me.csv') FORMAT 'TEXT';
CREATE EXTERNAL TABLE
checkcat=#

NOTE: Here we have used the "move_me.csv" file; for other tables you might have copied the data into other files, so specify the file name accordingly in the command (see the example right below).
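
-- For illustration, a second (hypothetical) table called "orders", unloaded to "orders.csv" in step 3, would get its own readable external table in the same way. The column names and types here are assumptions:

checkcat=# CREATE READABLE EXTERNAL TABLE ext_to_load_orders(order_id int, amount numeric) location ('gpfdist://mdw:8080/orders.csv') FORMAT 'TEXT';
CREATE EXTERNAL TABLE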

-- Create the table in which you want to load the data:

checkcat=# CREATE TABLE move_me(id INT, name VARCHAR);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

-- Insert the data using the command below:

checkcat=# insert into move_me select * from ext_to_load_move_me ;
INSERT 0 1000
checkcat=# select count(*) from move_me;
count
-------
1000
(1 row)
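
-- As a quick sanity check, the row count loaded into the table can be compared with the row count read directly from the flat file. This is a minimal sketch and requires the gpfdist process to still be running:

checkcat=# select (select count(*) from ext_to_load_move_me) = (select count(*) from move_me) as counts_match;
 counts_match
--------------
 t
(1 row)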

Comments

  • Alfredo Gomez

    Hi there, I have been trying something slightly different from this process to make the data movement flexible and as efficient as possible, but I am not sure whether it is possible; maybe someone can help me here.
    Instead of having to unload all the data per table, generating one or more files, I want to use named pipes to stream the operation, that is:
    - create a writable external table using named pipes instead of files
    - create an external table in the target DB using the same location and named pipes
    - launch insert queries on both sides.
    Is this even possible? As the Greenplum documentation indicates, using named pipes is possible; I just do not know how to do it or what the syntax is, and searching many sources I have not found the way. Any help here would be great!

    Thanks!

  • Faisal Ali

    Hi Alfredo,

    Yes you can use named pipes with gpfdist.

    Maybe the example below will help you understand how to use it.

    • Creating necessary environment
    flightdata=# create database destination ;
    CREATE DATABASE
    flightdata=# create database source;
    CREATE DATABASE
    flightdata=# 
    flightdata=# 
    
    • Creating a source table
    flightdata=# \c source
    You are now connected to database "source" as user "gpadmin".
    source=# 
    source=# create table source_table as select * from pg_attribute;
    NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'attrelid' as the Greenplum Database data distribution key for this table.
    HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
    SELECT 3001
    
    • Create readable external table on the destination database
    source=# \c destination
    You are now connected to database "destination" as user "gpadmin".
    destination=# 
    destination=# CREATE EXTERNAL TABLE target_ext (like pg_attribute) location ('gpfdist://gpdbsne:8080/external_table_pipe') format 'text' (DELIMITER '|');
    CREATE EXTERNAL TABLE
    
    • Create a named pipe at source host
    [gpadmin@gpdbsne tmp]$ mkfifo external_table_pipe 
    [gpadmin@gpdbsne tmp]$ chmod u+w external_table_pipe 
    [gpadmin@gpdbsne tmp]$ ls -ltr | grep external
    prw-rw-r-- 1 gpadmin gpadmin    0 Dec  7 04:06 external_table_pipe
    
    • Start the gpfdist at source host
    [gpadmin@gpdbsne tmp]$ gpfdist -d /tmp/ -p 8080 -l /tmp/gpfdist.log &
    [1] 65020
    [gpadmin@gpdbsne tmp]$ 2015-12-07 04:07:58 65020 INFO Before opening listening sockets - following listening sockets are available:
    2015-12-07 04:07:58 65020 INFO IPV6 socket: [::]:8080
    2015-12-07 04:07:58 65020 INFO IPV4 socket: 0.0.0.0:8080
    2015-12-07 04:07:58 65020 INFO Trying to open listening socket:
    2015-12-07 04:07:58 65020 INFO IPV6 socket: [::]:8080
    2015-12-07 04:07:58 65020 INFO Opening listening socket succeeded
    2015-12-07 04:07:58 65020 INFO Trying to open listening socket:
    2015-12-07 04:07:58 65020 INFO IPV4 socket: 0.0.0.0:8080
    Serving HTTP on port 8080, directory /tmp
    
    • On the source database send out the data to named pipe
    source=# copy source_table  to '/tmp/external_table_pipe' delimiter '|';
    

    If you need to send specific rows:

    source=# copy ( select * from source_table where <column>=<value> )  to '/tmp/external_table_pipe' delimiter '|';
    
    • On the destination, pull the data
    destination=# select * from target_ext ;
    

    Thanks
    Faisal

  • Alfredo Gomez

    Hello Faisal, thanks for the information, this is pretty interesting; however, I am looking for something further.

    In the example, you are using gpfdist to read from the external table. What I want to achieve is to also leverage gpfdist to unload the data from the source database, so that the extraction of the source data is handled by the segment servers rather than the master.

    I have tried exactly this with gpfdist, but apparently it does nothing, and I am not sure why; maybe it is something in the configuration. I keep trying.

    Is it clearer now?

    Thanks!

  • Faisal Ali

    Hi

    Yes, you can do that as well, but make sure you have two gpfdist processes running on different ports accessing the same named pipe, as in the steps below (see also the concurrency note after them):

    • Create gpfdist
    gpfdist -d /tmp/ -p 8080 -l /tmp/gpfdist_8080.log -v &
    gpfdist -d /tmp/ -p 8081 -l /tmp/gpfdist_8081.log -v &
    
    • on source create writable external table with one port
    CREATE WRITABLE EXTERNAL TABLE source_ext (like source_table) location ('gpfdist://gpdbsne:8080/external_table_pipe') format 'text' (DELIMITER '|');
    
    • On destination create readable external table with other port
    CREATE EXTERNAL TABLE target_ext (like pg_attribute) location ('gpfdist://gpdbsne:8081/external_table_pipe') format 'text' (DELIMITER '|');
    
    • Send data from the source
    insert into source_ext select * from source_table;
    
    • From target pull the data
    select * from target_ext;
    
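    • Run both sides concurrently
    A named pipe blocks until both a writer and a reader have it open, so the insert on the source and the select on the destination have to run at the same time in separate sessions; running the insert on its own will appear to hang. A minimal sketch from the shell, reusing the database and table names created above:
    # source side: start the unload in the background
    psql -d source -c "insert into source_ext select * from source_table;" &
    # destination side: pull the data while the writer is waiting on the pipe
    psql -d destination -c "select count(*) from target_ext;"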

    Thanks

  • Alfredo Gomez

    aha!!! that was the problem!!!! two different gpfdists! let me try it!

    Thanks!
