How to fix stream of Dialhomes with Symptom Code 15.2 (SNMP configuration issue on host)

Environment

  • DCA v1
  • DCA v2

Problem

There are instances when hundreds of DialHomes are generated on DCA clusters with Symptom Code 15.2. This can cause the EMConnect front-end to block dial homes originating from the cluster (also known as a HIB event, code 29.AAA1.902). A snippet of the Dial Home is below:

SymptomCode : 15.2
Category : Status
Severity : Error
Status : OK
Component : hdw5 : hdw5
ComponentID : DCA2-DS-HD
FirstTime : 10/29/2013 01:06:29
Count : 1
Description : Network Device Operational Status: SNMP configuration issue on host

Also, the 'dcacheck' command returns many errors with "SNMP configuration issue on host" messages:

DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data1) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.2]
DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data2) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.3]
DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data3) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.4]
DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data4) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.5]
DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data5) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.6]
DCA_CHECK_ERROR host(hdw10.gphd.local): [Status=>error] [Disk Space Used Percentage on Hadoop Worker (/data6) Value=>SNMP configuration issue on host] [oid=>.1.3.6.1.4.1.2021.9.1.9.7]

Cause 

This issue is caused by a misconfiguration in one of the following areas:

  • Incorrect entries in the snmpd.conf file
  • Incorrect hostname-to-IP-address mapping in the /etc/hosts file (see the example below)
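For reference, a correct hostname-to-IP mapping in /etc/hosts looks like the following (the IP address and hostname here are taken from the ping example later in this article):

# Verify the node resolves to the expected address;
# the IP listed here must match the subnet used by snmpd.conf
$ grep hdw1 /etc/hosts
3.14.144.72    hdw1.gphd.local hdw1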

Troubleshooting 

1) Check whether the network address (in CIDR format) is correct for the "com2sec" directive in /etc/snmp/snmpd.conf.

The subnet entry here must match the IP address/subnet returned by hostname lookup (either from DNS or the /etc/hosts file). A quick way to check this is to ping any node, for example hdw1. If the IP address returned by the ping command does not fall within the network/subnet in the "com2sec" directive in /etc/snmp/snmpd.conf, then healthmon, snmpwalk, and dcacheck will fail.

$ ping hdw1
PING hdw1.gphd.local (3.14.144.72) 56(84) bytes of data.
64 bytes from hdw1.gphd.local (3.14.144.72): icmp_seq=1 ttl=64 time=0.014 ms
64 bytes from hdw1.gphd.local (3.14.144.72): icmp_seq=2 ttl=64 time=0.014 ms
64 bytes from hdw1.gphd.local (3.14.144.72): icmp_seq=3 ttl=64 time=0.014 ms

The correct com2sec directive entry for this example is the following:

$ cat /etc/snmp/snmpd.conf
com2sec internalUser 3.14.144.0/24 public
com2sec externalUser default public

Note that in the example above, the IP address returned by the ping command (3.14.144.72) falls within the subnet (3.14.144.0/24) specified in the com2sec directive in snmpd.conf.

In some cases, the default subnet 172.28.0.0/22 needs to be changed to match the subnet configured for the hostname. This is especially common on Hadoop nodes, where hostnames are usually mapped to the customer's external IP addresses rather than the DCA internal IP addresses (for more details, refer to the basic DCA overview doc).
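After correcting the subnet, a quick check that snmpd accepts queries is to walk the disk-usage table on a node. This is a sketch: the community string is taken from the snmpd.conf examples above, and the OID is the dskPercent column seen in the dcacheck errors.

# Walk the UCD-SNMP dskPercent column on hdw1;
# a timeout here indicates the com2sec subnet still does not match
$ snmpwalk -v 2c -c public hdw1 .1.3.6.1.4.1.2021.9.1.9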

2) Check whether the correct snmpd.conf template is applied for the host type, i.e. hdw (Hadoop worker node), hdm (Hadoop master node), or sdw (segment node).

$ ls -1 /opt/dca/etc/snmpd_templates/*
/opt/dca/etc/snmpd_templates/snmpd.conf.dia
/opt/dca/etc/snmpd_templates/snmpd.conf.hdc
/opt/dca/etc/snmpd_templates/snmpd.conf.hdm
/opt/dca/etc/snmpd_templates/snmpd.conf.hdw
/opt/dca/etc/snmpd_templates/snmpd.conf.mdw
/opt/dca/etc/snmpd_templates/snmpd.conf.sdw

For example, the Hadoop worker node template (snmpd.conf.hdw) has the following entries:

[root@batman-mdw ~]# cat /opt/dca/etc/snmpd_templates/snmpd.conf.hdw
com2sec internalUser 172.28.0.0/20 public
com2sec externalUser default public
group internalGroup v1 internalUser
group internalGroup v2c internalUser
group externalGroup v1 externalUser
group externalGroup v2c externalUser
view all included .1
view dcaview included .1.3.6.1.4.1.1139.23.1.1
access internalGroup "" any noauth exact all none none
access externalGroup "" any noauth exact dcaview none none
dontLogTCPWrappersConnects 1
master agentx
pass_persist 1.3.6.1.4.1.1139.23.1.2.1 /opt/dca/bin/dca_subagent
pass .1.3.6.1.4.1.3582 /usr/sbin/lsi_mrdsnmpmain
disk /
disk /data1
disk /data2
disk /data3
disk /data4
disk /data5
disk /data6
disk /data7
disk /data8
disk /data9
disk /data10

Note the additional entries for /data1, /data2, /data3, etc., which monitor the additional disks on Hadoop worker nodes.

A typical cause of misconfiguration is copying the same snmpd.conf to all nodes in the cluster, especially on PHD clusters. This breaks the SNMP configuration and, in turn, healthmon and Dial Homes.
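A quick way to spot this (a sketch, assuming the templates under /opt/dca/etc/snmpd_templates are available on the host being checked) is to diff the live configuration against the template for that host type:

# Compare the running config on a Hadoop worker against its template;
# any output indicates the node is not using the correct template
$ diff /etc/snmp/snmpd.conf /opt/dca/etc/snmpd_templates/snmpd.conf.hdw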

3) Once the fixes are in place, always run 'dcacheck' to verify the changes. 'dcacheck' is a standalone utility that functions as an SNMP client, similar to snmpwalk and healthmon, and walks the MIB tree to get the status of endpoints. If dcacheck comes back clean, barring real events and failures, it is a good indication that no false dial homes will be generated for 15.2 events.
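If dcacheck still reports an error for a specific host and OID, that exact OID can also be queried directly. The hostname and OID below are taken from the DCA_CHECK_ERROR messages in the Problem section; the public community string is assumed from the snmpd.conf examples above.

# Query the dskPercent value for /data1 on hdw10 directly;
# a valid integer response means SNMP is answering for that OID
$ snmpget -v 2c -c public hdw10.gphd.local .1.3.6.1.4.1.2021.9.1.9.2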

Note: If changes are made to the snmpd.conf files, please follow the steps in the article "How to restart and verify SNMP services on DCA" to recycle the services appropriately, depending on whether the system is a DCA V1 or V2.
