After a long period of inactivity on the Pivotal HDB server, the first query may fail as below:
HAWQ=# select count(*) from hawq.test;
ERROR: failed to acquire resource from resource manager, 5 of 5 segments are unavailable, exceeds 25.0% defined in GUC hawq_rm_rejectrequest_nseg_limit. The allocation request is rejected. (pquery.c:804)
Issuing the query a second time resolves the issue.
The Pivotal HDB master logs may show errors such as these over the past 24 hours:
2016-06-30 11:50:03.434494 BST,"gpadmin","test",p416321,th1875830912,"172.27.12.103","38205",2016-06-30 11:49:52 BST,595090,con1485,cmd8,seg-10000,,,x595090,sx1,"LOG","58030","fail to connect hdfs at hdfs://uatcluster/, errno = 13","AccessControlException: Failed to evaluate challenge: GSSAPI error in client while negotiating security context in gss_init_sec_context() in SASL library. This is most likely due insufficient credentials or malicious interactions.",,,,,
2016-07-06 01:47:52.146335 BST,,,p182370,th140327040,,,,0,con4,,seg-10000,,,,,"WARNING","01000","Resource manager sets host hawq.local down in cleanup phase for resource broker error.",,,,,,,0,,"resourcemanager.c",2637,
2016-07-06 01:47:51.945902 BST,,,p182375,th140327040,,,,0,con4,,seg-10000,,,,,"WARNING","01000","YARN mode resource broker failed to get container report. LibYarnClient::getContainerReports, Catch the Exception:YarnIOException: Unexpected exception: when calling ApplicationClientProtocol::getContainers in /data1/pulse2-agent/agents/agent1/work/LIBYARN-main-opt/rhel5_x86_64/src/libyarnserver/ApplicationClientProtocol.cpp: 195",,,,,,,0,,"resourcebroker_LIBYARN_proc.c",1748,
This is caused by a software defect in HAWQ 2.0.0 when Kerberos is enabled: the Kerberos ticket is renewed only during HDFS interactions and NOT during libYARN interactions. So, if the Kerberos ticket has expired when a YARN interaction is attempted, authentication fails.
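To confirm that an expired ticket is the trigger, the ticket cache on the master can be inspected before the first query of the day. The snippet below is a sketch only: the keytab path and principal are placeholder examples for your environment, not HDB defaults. `klist -s` exits non-zero when no valid ticket is present.

```shell
#!/bin/bash
# Sketch: check the Kerberos ticket state on the HAWQ master.
# The keytab path and principal below are assumed examples.
if klist -s; then
    echo "valid Kerberos ticket present"
else
    echo "ticket expired or missing; renewing from keytab"
    kinit -kt /etc/security/keytabs/gpadmin.keytab gpadmin@EXAMPLE.COM
fi
klist | head -n 5   # show the principal and ticket expiry times
```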
PERMANENT SOLUTION - Upgrade to HDB 2.0.1:
Pivotal HDB 2.0.1, which includes a fix for this issue, has been released. Refer to the release notes for details. Upgrading to HDB 2.0.1 is the recommended way to obtain the fix.
WORKAROUND - Force the ticket to renew every hour by issuing a query to HDFS:
1. Even with the fix included in HDB 2.0.1, HDB only renews the Kerberos ticket automatically at a fixed interval, configured with the system parameter server_ticket_renew_interval. By default this parameter is set to 12 hours, so in environments where the Kerberos ticket expires within 12 hours, server_ticket_renew_interval must be lowered to ensure the ticket is renewed before it expires.
Tips for changing this parameter:
Via Ambari, add a property under HAWQ / Configs / Advanced / Custom hawq-site.xml:
server_ticket_renew_interval = 18000000 #This will force the ticket to renew every 5 hours
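The parameter value is expressed in milliseconds, so a target interval in hours converts as hours × 3600 × 1000. A quick sanity check of the 5-hour value above:

```shell
# server_ticket_renew_interval is specified in milliseconds.
HOURS=5
INTERVAL_MS=$((HOURS * 60 * 60 * 1000))
echo "$INTERVAL_MS"   # prints 18000000
```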
2. Restart HAWQ services as requested by Ambari
3. Via PSQL, set up a dummy table:
psql -d postgres
create table crontable (id int);
insert into crontable values (1);
4. Create a script, /home/gpadmin/run.sh, that sources the greenplum_path.sh file and queries the dummy table created in step 3:
source /usr/local/hawq/greenplum_path.sh && psql -p 5443 -d postgres -c 'select count(*) from crontable;'
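Written out as a standalone script, /home/gpadmin/run.sh could look like the sketch below. The port 5443 and the postgres database come from the command above; adjust them to match your deployment.

```shell
#!/bin/bash
# run.sh -- issue a trivial HAWQ query so HDB touches HDFS and refreshes
# the Kerberos ticket. Port and database are deployment-specific.
source /usr/local/hawq/greenplum_path.sh
psql -p 5443 -d postgres -c 'select count(*) from crontable;'
```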
5. As user gpadmin, run "crontab -e" to edit the gpadmin crontab.
6. In the gpadmin crontab, add this line to run the script every 10 minutes, renewing the Kerberos ticket on each run:
*/10 * * * * /home/gpadmin/run.sh >>/home/gpadmin/crontab.log 2>&1
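Alternatively, the entry can be appended without opening an editor, and the log checked to confirm the job is firing. This is a sketch assuming the script and log paths used above:

```shell
# Append the job to gpadmin's crontab non-interactively.
(crontab -l 2>/dev/null; echo '*/10 * * * * /home/gpadmin/run.sh >>/home/gpadmin/crontab.log 2>&1') | crontab -
crontab -l | grep run.sh              # confirm the entry is installed
tail -n 5 /home/gpadmin/crontab.log   # recent runs should show the query output
```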