Pivotal Knowledge Base

HBase region server fails with "error telling master we are up"

Environment

  • HBase 0.96

Symptom

The HBase region server logs the following exception:

2014-10-21 12:16:56,538 WARN  [regionserver60020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/3.3.84.44:60635 remote=hbaseMaster.domain.com/192.168.255.40:60000]
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1667)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1708)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:5402)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1924)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:790)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/3.3.84.44:60635 remote=hbaseMaster.domain.com/192.168.255.40:60000]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:573)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:858)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1532)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1421)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1650)
        ... 5 more

TCP sessions from the region server to the master are stuck in the SYN_SENT state:

[gpadmin@regionserver hbase]$ netstat -an | egrep 6000
tcp        0      1 3.3.84.44:21569             192.168.255.40:60000        SYN_SENT
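
The same symptom can be reproduced outside HBase with a plain TCP connection attempt. The following is a minimal sketch, not part of HBase: `can_connect` and its `source_ip` parameter are hypothetical names chosen for illustration. Binding the local end to a specific address imitates a process that has already chosen which interface address to use.

```python
import socket


def can_connect(host, port, timeout=5.0, source_ip=None):
    """Return True if a TCP connection to host:port completes within timeout.

    source_ip, if given, binds the local end of the socket to that address
    before connecting (hypothetical helper parameter, for illustration only).
    """
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection((host, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Covers connect timeouts (the SYN_SENT case), refused connections
        # and name resolution failures.
        return False
```

On the affected region server, a call such as `can_connect("hbaseMaster.domain.com", 60000, source_ip="3.3.84.44")` would be expected to return False, assuming the 3.x network cannot reach the master.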

Cause

The error shows that the region server's local address is on the 3.x subnet, while the remote master's address is on the 192.x subnet:

local=/3.3.84.44:60635 remote=hbaseMaster.domain.com/192.168.255.40:60000]

This can happen when multiple network interfaces are assigned to your servers. By default, the region server identifies its primary hostname and performs a DNS lookup to determine which interface it should bind to. Based on the following network routing table, the region server determines that it needs to bind to the bond0 interface, even though the master server is communicating on bond1:

[gpadmin@regionserver hbase]$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
3.3.84.0        0.0.0.0         255.255.254.0   U         0 0          0 bond0
192.168.254.0   0.0.0.0         255.255.254.0   U         0 0          0 bond1
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond1
0.0.0.0         3.3.84.1        0.0.0.0         UG        0 0          0 bond0
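
To see which interface each address in the log belongs to, the routing table above can be checked with a longest-prefix match. This is a simplified sketch: the routes are copied from the output above, `interface_for` is a hypothetical helper name, and real kernel route selection also considers metrics and policy rules that are ignored here.

```python
import ipaddress

# Routes copied from the `netstat -rn` output above; netmasks are written as
# prefix lengths (255.255.254.0 = /23, 255.255.0.0 = /16, 0.0.0.0 = /0).
ROUTES = [
    ("3.3.84.0/23", "bond0"),
    ("192.168.254.0/23", "bond1"),
    ("169.254.0.0/16", "bond0"),
    ("169.254.0.0/16", "bond1"),
    ("0.0.0.0/0", "bond0"),  # default route via gateway 3.3.84.1
]


def interface_for(address):
    """Return the interface of the most specific route matching address."""
    ip = ipaddress.ip_address(address)
    best = None
    for cidr, iface in ROUTES:
        net = ipaddress.ip_network(cidr)
        # Keep the first route seen with the longest matching prefix.
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


print(interface_for("192.168.255.40"))  # master address from the log
print(interface_for("3.3.84.44"))       # region server local address from the log
```

The master address falls in the bond1 subnet, while the local address the region server used belongs to bond0, which matches the mismatch seen in the exception.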

Fix

Set the hbase.regionserver.dns.interface parameter in /etc/gphd/hbase/conf/hbase-site.xml to bond1. This forces the region server to use bond1 on startup so that it can communicate with the HBase master on the correct network interface.

  
<property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond1</value>
</property>
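
After editing the file, the setting can be verified by parsing the configuration. The following is a minimal sketch that assumes the usual layout of `<property>` elements under a `<configuration>` root; `get_property` and `SAMPLE` are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Sample content standing in for hbase-site.xml (assumed layout).
SAMPLE = """
<configuration>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond1</value>
  </property>
</configuration>
"""

print(get_property(SAMPLE, "hbase.regionserver.dns.interface"))
```

To check a real installation, read the contents of /etc/gphd/hbase/conf/hbase-site.xml and pass them to `get_property` in place of `SAMPLE`.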
