Pivotal Knowledge Base


How to Read Data from the Greenplum Database into the Spark Using Greenplum-Spark Connector


  • Pivotal Greenplum Database 4.3.x and 5.x
  • Operating System- Red Hat Enterprise Linux 6.x and 7.x
  • Centos 7.4 node


This article explains the process to test the functionality of the Greenplum-Spark Connector. This will help you to successfully read data from a Greenplum Database (GPDB) table into your Spark cluster. The instructions in this article are written for a single-node GPDB cluster installed on Centos 7.4 and a standalone Apache Spark 2.2.1 cluster. This article is also written assuming the single-node Greenplum Database cluster is already installed and working. Both the GPDB cluster and Spark cluster are installed on the same Centos 7.4 node.


Follow the steps given below:

Step 1

Install Java as Scala (don't forget to set your "JAVA_HOME" and "JRE_HOME" environment variables).

Step 2

Install Scala

cd ~
wget http://downloads.lightbend.com/scala/2.12.4/scala-2.12.4.rpm
yum install scala-2.12.4.rpm

Confirm your installation using the "scala -version" command. 
You should get the following message or something similar: [root@mdw ~]# scala -version Scala code runner version 2.12.4 -- Copyright 2002-2017, LAMP/EPFL and Lightbend, Inc. [root@mdw ~]#

Step 3

Install Spark standalone

cd ~
wget http://apache.claz.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

tar xf spark-2.2.1-bin-hadoop2.7.tgz
mkdir /usr/local/spark
cp -r spark-2.2.1-bin-hadoop2.7/* /usr/local/spark/

Set the below environment variables in your .bashrc file and source it before you run Spark:

export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.2.1.jar

Now, test to see if you can start your scala shell using the "spark-shell" command to run Spark:

[root@mdw ~]# spark-shell
18/01/06 16:14:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/01/06 16:14:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1515273243158).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151), type in the expressions to have them evaluated.
Type :help for more information.


Step 4

Download the gpdb-spark connector from network.pivotal.io

Step 5

Download the postgresql JDBC driver using the below script:



cd `dirname $0`

wget -O postgresql-42.1.4.jar https://jdbc.postgresql.org/download/postgresql-42.1.4.jar

cd $current

Step 6

Place the gpdb-spark connector and postgresql jars in your spark installation jars directory. In our case we will place it in "/usr/local/spark/jars"

Step 7

Test from the scala command-line to make sure that both the gpdb-spark connector and postgresql drivers are successfully loaded into the Spark shell using the below commands:

To verify if GPDB-Spark connector driver was loaded successfully into the Spark shell:

scala> Class.forName("io.pivotal.greenplum.spark.GreenplumRelationProvider") res0: Class[_] = class io.pivotal.greenplum.spark.GreenplumRelationProvider

The above output confirms that the GPDB-Spark connector was loaded successfully.


To verify if the postgresql JDBC driver was loaded successfully into the Spark shell, perform,

scala> Class.forName("org.postgresql.Driver") res1: Class[_] = class org.postgresql.Driver

The above output confirms that the postgresql JDBC driver was loaded successfully.

Step 8

Now, let's read GPDB data into the Spark.

Create a Greenplum Database test table:

template1=# CREATE TABLE testtable (a bigint not null, b text) distributed by (a); 

template1=# \d testtable Table "public.testtable" 
Column | Type | Modifiers 
a | bigint | not null b | text | 
Distributed by: (a) 

template1=# INSERT INTO testtable VALUES (124312342,'hello'); 

template1=# SELECT * from testtable ; 
a | b 
124312342 | hello 
(1 row) 

Ingest the gpdb data into Spark via the scala shell command-line:

scala> :paste // Entering paste mode (ctrl-D to finish) 

val dataFrame = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider") .option("dbtable", "testtable") .option("url", "jdbc:postgresql://") .option("user", "gpadmin") .option("password", "pivotal") .option("driver", "org.postgresql.Driver") .option("partitionColumn", "a") .load() // Exiting paste mode, now interpreting. dataFrame: org.apache.spark.sql.DataFrame = [a: bigint, b: string] scala>

Verify the schema details of the table:

scala> dataFrame.printSchema 

root |-- a: long (nullable = false) |-- b: string (nullable = true) scala> --- 

Read the GPDB data:

scala> dataFrame.show() 

+---------+-----+ | a| b| +---------+-----+ |124312342|hello| +---------+-----+ scala> ---


scala> dataFrame.explain 

== Physical Plan == *Scan GreenplumRelation(StructType(StructField(a,LongType,false), StructField(b,StringType,true)),[Lio.pivotal.greenplum.spark.GreenplumPartition;@403f6c04,io.pivotal.greenplum.spark.GreenplumOptions@2693e39c) [a#0L,b#1] ReadSchema: struct<a:bigint,b:string> 


Additional Information 

Please review https://greenplum-spark.docs.pivotal.io/110/index.html for further details regarding the Greenplum-Spark Connector.


Powered by Zendesk