Pivotal Knowledge Base


How Hawq analyzes PXF external tables

Environment

Product Version: HAWQ 1.3.x

Purpose

It is no secret that database statistics are very important to query performance in an RDBMS. Here we will discuss how HAWQ handles table statistics for PXF external tables.

How ANALYZE is executed for PXF

  • When a user executes "ANALYZE <external tablename>", HAWQ sends a JSON API call to the PXF service defined in the external table's URI path. The call in this case is "getEstimatedStats".
  • The PXF service then executes the analyzer plugin class associated with the data type. In the case of the HdfsAnalyzer, PXF performs the following actions:
    • Goes to HDFS and fetches the number of blocks and the block size for the given data path, which can be a file, a directory, or a wildcard.
    • HdfsAnalyzer then reads the entire first block, counting how many tuples it contains. The total number of tuples is extrapolated from the first block alone, using the formula (<number of tuples in first block> * <number of blocks>).
  • The PXF service then sends the stats back to HAWQ in a JSON response.
  • HAWQ parses the response and writes the data into pg_class for the given relid.
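The estimation step above can be sketched as follows. This is a minimal illustration only; the function name and JSON field names are assumptions, not the actual PXF classes or wire format.

```python
# Minimal sketch of the HdfsAnalyzer estimation described above.
# estimate_stats and the field names are illustrative assumptions.

def estimate_stats(num_blocks: int, block_size: int,
                   tuples_in_first_block: int) -> dict:
    """Extrapolate table stats from the first HDFS block only."""
    return {
        "blockSize": block_size,
        "numberOfBlocks": num_blocks,
        # Total tuples = tuples counted in the first block * number of blocks.
        "numberOfTuples": tuples_in_first_block * num_blocks,
    }

# A file spanning 3 blocks whose first block holds 334 tuples
# is estimated at 334 * 3 = 1002 tuples.
print(estimate_stats(3, 64 * 1024 * 1024, 334)["numberOfTuples"])  # 1002
```

Because only the first block is actually read, the estimate is cheap but is only accurate when tuples are roughly the same size across all blocks.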

What are the Default Stats

These stats are applied in pg_class for any PXF table that either does not have an analyzer plugin or has not been analyzed:

  • Block Size = 64 MB
  • Number of Blocks = 1
  • Number of Tuples = 1,000,000
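Expressed as a sketch, the fallback is simply a set of constants. The names below are my own for illustration, not identifiers from the PXF source.

```python
# Illustrative constants for the fallback stats listed above;
# the names are assumptions, not taken from the PXF source.
DEFAULT_BLOCK_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB
DEFAULT_NUM_BLOCKS = 1
DEFAULT_NUM_TUPLES = 1_000_000

def default_stats() -> dict:
    """Stats HAWQ falls back to when no analyzer plugin is available."""
    return {
        "blockSize": DEFAULT_BLOCK_SIZE_BYTES,
        "numberOfBlocks": DEFAULT_NUM_BLOCKS,
        "numberOfTuples": DEFAULT_NUM_TUPLES,
    }
```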

Demo

Some Examples

Assume you have the following PXF profile defined (this is the default HDFS profile). Here we have the analyzer plugin defined in our profile:

    <profile>
        <name>HdfsTextSimple</name>
        <description>This profile is suitable for using when reading delimited single line records from plain text files
            on HDFS
        </description>
        <plugins>
            <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
            <accessor>com.pivotal.pxf.plugins.hdfs.LineBreakAccessor</accessor>
            <resolver>com.pivotal.pxf.plugins.hdfs.StringPassResolver</resolver>
            <analyzer>com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer</analyzer>
        </plugins>
    </profile>

Let's see how HAWQ statistics behave in this scenario:

create external table pxfread (a int, b int, c int, d int) LOCATION('pxf://node2:50070/tmp/mytest.txt?Profile=HdfsTextSimple') FORMAT 'text' (DELIMITER ',');

gpadmin=# select count(*) from pxfread;
 count
-------
  1001

gpadmin=# explain select * from pxfread;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather Motion 6:1  (slice1; segments: 6)  (cost=0.00..543.81 rows=1000000 width=32)
   ->  External Scan on pxfread  (cost=0.00..439.07 rows=166667 width=32)
(2 rows)

The total number of tuples in mytest.txt is 1001, but the explain plan reports 1,000,000 rows. This is because HAWQ is picking up the default stats for PXF external tables. (The rows=166667 on the External Scan is the per-segment estimate: 1,000,000 / 6 segments.)

Let's run ANALYZE and see if anything changes:

gpadmin=# analyze pxfread;
ANALYZE
gpadmin=# explain select * from pxfread;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Gather Motion 6:1  (slice1; segments: 6)  (cost=0.00..431.12 rows=1020 width=32)
   ->  External Scan on pxfread  (cost=0.00..431.01 rows=170 width=32)
(2 rows)

Now HAWQ estimates there are roughly 1,000 rows in the table, which is what we expect (1020 in total; rows=170 is again the per-segment estimate).

What happens if we don't have the analyzer plugin?

Let's copy the default HDFS profile to a new profile, excluding the analyzer plugin:

    <profile>
        <name>MyHdfsTextSimple</name>
        <description>This profile is suitable for using when reading delimited single line records from plain text files
            on HDFS
        </description>
        <plugins>
            <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
            <accessor>com.pivotal.pxf.plugins.hdfs.LineBreakAccessor</accessor>
            <resolver>com.pivotal.pxf.plugins.hdfs.StringPassResolver</resolver>
        </plugins>
    </profile>

Let's observe:

create external table pxfread_noanalyze (a int, b int, c int, d int) LOCATION('pxf://node2:50070/tmp/mytest.txt?Profile=MyHdfsTextSimple') FORMAT 'text' (DELIMITER ',');

gpadmin=# analyze pxfread_noanalyze;
WARNING:  skipping "pxfread_noanalyze" --- error returned: remote component error (500) from '172.17.0.12:51200':  type  Exception report   message   PXF 'Analyzer' class was not found. Please supply it in the LOCATION clause or use it in a PXF profile in order to run ANALYZE on this table    description   The server encountered an internal error that prevented it from fulfilling this request.    exception   java.lang.IllegalArgumentException: PXF 'Analyzer' class was not found. Please supply it in the LOCATION clause or use it in a PXF profile in order to run ANALYZE on this table
ANALYZE

Oops, the PXF service returns a warning because the PXF profile "MyHdfsTextSimple" does not have an analyzer plugin defined. Note that ANALYZE still completes; the table simply keeps the default stats.

Here is where the stats get stored:

gpadmin=# select relname, reltuples, relpages from pg_class where relname in ( 'pxfread', 'pxfread_noanalyze');
      relname      | reltuples | relpages
-------------------+-----------+----------
 pxfread           |      1020 |    24576
 pxfread_noanalyze |     1e+06 |     1000
(2 rows)

Notes

For more in-depth information on other PXF features, please see this video.
