Pivotal Knowledge Base

Follow

HBase Basics

Understand Master Servers, Region Servers and Regions.

Regions are a subset of the table’s data, and they are essentially a contiguous, sorted range of rows that are stored together.  Regions are the basic element of availability and distribution for tables.

HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode. Each Region Server is responsible to serve a set of regions, and one Region (i.e. range of rows) can be served only by one Region Server.

HMaster is the implementation of the Master Server. The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode. A cluster may have multiple masters, all Masters compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down),  then the remaining Masters jostle to take over the Master role.   

Region Splitting

Initially, there is only one region for a table.  When regions become too large after adding more rows, the region is split into two at the middle key, creating two roughly equal halves.

 

 

  • Pre splitting

Create a table with many regions by supplying the split points at the table creation time.  Pre splitting is recommended but have a caveat-- poorly chosen splitting points can end up with heterogeneous load distribution, which degrades the cluster performance.

  • Auto splitting

Default split policy. It splits the regions when the total data size of one of the stores in the region gets bigger than configured “hbase.hregion.max.filesize”, which has a default value of 10GB.

  • Manual splitting

Split regions that are unevenly loaded from CLI.

***********************************************************************************

Example:

hbase(main):024:0> split 'b07d0034cbe72cb040ae9cf66300a10c',    'b'
             0 row(s) in 0.1620 seconds

***********************************************************************************

Region Assignment and Balancing

Region Assignment at startup

When HBase starts regions are assigned as follows (short version):

  1. The Master invokes the AssignmentManager upon startup.
  2. The AssignmentManager looks at the existing region assignments in META.
  3. If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.
  4. If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
  5. META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.

Region Assignment at Fail-over

  1. The regions immediately become unavailable because the RegionServer is down.
  2. The Master will detect that the RegionServer has failed.
  3. The region assignments will be considered invalid and will be re-assigned just like the startup sequence.

Region Assignment Upon load Balancing

Periodically, and when there are no regions in transition, a load balancer will run and move regions around to balance the cluster's load. The balancer runs on the master to redistribute regions on the cluster. It is configured via hbase.balancer.period and defaults to 300000 (5 minutes).

HBase Data Model

HBase tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells -- the intersection of row and column coordinates -- are versioned. A cell’s content is an uninterrupted array of bytes.

Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key -- its primary key.

Column Family

A column name is made of its column family prefix and a qualifier. For example, the column contents:html is of the column family contents The colon character (:) delimits the column family from the column family qualifier.

Following is an example HBase table-- webtable

 

Row Key

Time Stamp

ColumnFamily Contents

ColumnFamily anchor

"com.cnn.www"

t9

 

anchor:cnnsi.com="CNN"

"com.cnn.www"

t8

 

anchor:my.look.ca="CNN.com"

"com.cnn.www"

t6

contents:html="<html>..."

 

"com.cnn.www"

t5

contents:html="<html>..."

 

"com.cnn.www"

t3

contents:html="<html>..."

 

 

A request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned and would also be the first one found since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from time stamp t6, the value of anchor:cnnsi.com from time stamp t9, the value of anchor:my.look.ca from time stamp t8.

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

Cells

A (row, column, version) tuple exactly specifies a cell in HBase. Cell content is uninterrupted bytes.

Versions

It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

The version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.

Basic HBase Operations

get: Returns attributes of a specified row. If no explicit version specified, the cell whose version has the largest value is returned .

put: Either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

scan: Iteration over multiple rows for specified attributes. Same behavior  as get when working with versions.

delete: removes a row from a table. There are three types:

  • Delete: for a specific version of a column.
  • Delete column: for all versions of a column.
  • Delete family: for all columns of a particular ColumnFamily

HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.

Catalog Tables

Catalog tables  are used by HBase to keep the info of the user tables. There are two catalog tables: -ROOT- and .META.

-ROOT-

ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows:
Key:

  •   .META. region key (.META.,,1)

Values:

  • info:regioninfo (serialized HRegionInfo instance of .META.)
  • info:server (server:port of the RegionServer holding .META.)
  • info:serverstartcode (start-time of the RegionServer process holding .META.)

.META.

The .META. table keeps a list of all regions in the system. The .META. table structure is as follows:
Key:

  • Region key of the format ([table],[region start key],[region id])

Values:

  • info:regioninfo (serialized HRegioninfo instance for this region)
  • info:server (server:port of the RegionServer containing this region)
  • info:serverstartcode (start-time of the RegionServer process containing this region)

Note: When you start HBase, the HMaster is responsible for assigning the regions to each HRegionServer. This also includes the "special" -ROOT- and .META. tables.

Write Ahead Log(WAL) and HLog

WAL

The WAL is the lifeline that is needed when disaster strikes. Similar to a BIN log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure.
The WAL is a standard Hadoop SequenceFile and it stores HLogKey's. These keys contain a sequential number as well as the actual data and are used to replay not-yet-persisted data after a server crash.
Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the Memstore the affected Store. This ensures that HBase has durable writes. Without WAL, there is the possibility of data loss in the case of a RegionServer failure before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer.

(Memstore, Store and how data is stored in HBase will be discussed below)

HLog

The class which implements the WAL is called HLog. The HLog class specification can be found here.

When a HRegion is instantiated the single HLog is passed on as a parameter to the constructor of HRegion.

HLog has the following important features:

  • HLog has the append() method, which internally eventually calls doWrite()
  • HLog is keeping track of the changes.
  • HLog has the facilities to recover and split a log left by a crashed HRegionServer.

HBase and Zookeeper

Overview

The ZooKeeper cluster acts as a coordination service for the entire HBase cluster, handling master selection, root region server lookup, node registration, and so on.  Region Servers and Master discovery via Zookeeper

All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you. It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.

When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native zoo.cfg file, or, the easier option is to just specify ZooKeeper options directly in conf/hbase-site.xml. A ZooKeeper configuration option can be set as a property in the HBase hbase-site.xml XML configuration file by prefacing the ZooKeeper option name with hbase.zookeeper.property. For example, the clientPort setting in ZooKeeper can be changed by setting the hbase.zookeeper.property.clientPort property.

HBase Directory Structure on ZK


 

  • Master
    • If more than one master, master election will be forced.
  • Root Region Server
    • This znode holds the location of the server hosting the root of all tables in hbase.
  • rs
    • A directory in which there is a znode per Hbase region server.
    • Region Servers register themselves with ZooKeeper when they come online
  • On Region Server failure (detected via ephemeral znodes and notification via ZooKeeper), the master splits the edits out per region

 

How Is Data Stored in HBase

The Big Picture




HBase handles basically two kinds of file types. One is used for the write-ahead log and the other for the actual data storage. The files are primarily handled by the HRegionServer's. But in certain scenarios even the HMaster will have to perform low-level file operations

The Data Flow

  1. A new client first contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) to find a particular row key. It does so by retrieving the server (i.e., host) name that hosts the -ROOT- region from Zookeeper.
  2. With that information the client can query that server to get the server that hosts the .META. table. Both of these details are cached and only looked up once.
  3. The client can query the .META. server and retrieve the server that has the row the client is looking for.
  4. Once the client has been told where the row resides (i.e., in what region), it caches this information and directly contacts the HRegionServer hosting that region. That way, the client has a pretty complete picture over time of where to get rows without needing to query the .META. server again.
  5. On the other hand, when HRegionServer opens the region, it creates a corresponding HRegion object. As the HRegion is "opened," it sets up a Store instance for each HColumnFamily for every table, as defined by the user beforehand. Each of the Store instances in turn can have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile. An HRegion also has a MemStore and a HLog instance.
  6. Whether the data should first be written to the "Write-Ahead-Log" (WAL) represented by the HLog class. The decision is based on the flag set by the client using Put.writeToWAL(boolean) method.
  7. Once the data is written (or not) to the WAL, it gets placed in the MemStore. At the same time, WAL checks to see if the MemStore is full. If so, a flush-to-disk is requested. When the request is served by a separate thread in the HRegionServer, it writes the data to an HFile located in the HDFS. The WAL also saves the last written sequence number so the system knows what was persisted so far.


 

 

 

 

Comments

Powered by Zendesk