Pivotal Knowledge Base

Critical YARN Alerts in Ambari

Environment

 Product       Version
 Pivotal HDP   2.3.x
 OS            RHEL 6.x
 Others        Ambari 2.1.x

Symptom

Ambari repeatedly sends critical YARN alerts, each followed by an OK alert a few minutes later.

The Ambari notification email contains the following message:

Alert Summary: <ClusterName> - OK[0], Warning[0], Critical[1], Unknown[0]

Services Reporting Alerts
http://AmbariServer:8080/#/main/dashboard/metrics
CRITICAL [YARN]
YARN
CRITICAL App Timeline Web UI
Connection failed to http://AppTimelineServer:8188


Cause
 

This appears to be a known Ambari issue: the App Timeline Server cannot respond before the alert's connection timeout expires when it has a large timeline database to manage.

Resolution

A fix for this issue is planned for the Ambari 2.2.x release.

Workarounds - Choose whichever of the two approaches below suits your situation.

  • Clean up or relocate the App Timeline database. The database is recreated when the Timeline Server restarts:
  1. Stop App Timeline Server from Ambari.
  2. Clean up the App Timeline database. Its path is given by the property "yarn.timeline-service.leveldb-timeline-store.path" in yarn-site.xml.
  3. Start App Timeline Server from Ambari.
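
The path mentioned in step 2 can also be looked up with a short script instead of reading yarn-site.xml by hand. The following is a minimal Python sketch, not part of the original procedure. The function name get_hadoop_property is hypothetical, and the only assumption is the standard Hadoop configuration layout of <property> elements holding <name> and <value> children.

```python
import xml.etree.ElementTree as ET

# Property named in step 2 of this workaround.
LEVELDB_PATH_PROPERTY = "yarn.timeline-service.leveldb-timeline-store.path"


def get_hadoop_property(xml_text, name):
    """Return the <value> of the named <property> in Hadoop-style XML.

    Returns None when the property is not present.
    (Hypothetical helper for illustration only.)
    """
    root = ET.fromstring(xml_text)
    # Hadoop config files list <property> elements directly under the root.
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None
```

To use it, read the contents of yarn-site.xml on the Timeline Server host (the file's location depends on the installation) and pass them in together with LEVELDB_PATH_PROPERTY. Stop the App Timeline Server before touching the directory it returns.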
  • Increase the alert's connection timeout using the steps below.
    Note - Run these steps on the Ambari server.
    Note - 'hdp24a' in the examples below is the cluster name. Substitute your own cluster name, and likewise your own Ambari credentials and alert ID.
  1. Identify the alert ID of App Timeline Web UI.
[root@admin ~]# curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions
{
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions",
"items" : [
:
:
{
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74",
"AlertDefinition" : {
"cluster_name" : "hdp24a",
"id" : 74,    <<<!!! Note this id of App Timeline Web UI.
"label" : "App Timeline Web UI",
"name" : "yarn_app_timeline_server_webui"
}
},
:
:
]
}
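
On a cluster with many alert definitions, picking the ID out of the listing by eye is error-prone. The following minimal Python sketch searches the parsed JSON response for a given label. The function name find_alert_definition_id is hypothetical, and it assumes only the "items" / "AlertDefinition" / "label" / "id" structure visible in the output above.

```python
def find_alert_definition_id(listing, label):
    """Return the id of the alert definition with the given label, or None.

    `listing` is the parsed JSON body of the alert_definitions request.
    (Hypothetical helper for illustration only.)
    """
    for item in listing.get("items", []):
        definition = item.get("AlertDefinition", {})
        if definition.get("label") == label:
            return definition.get("id")
    return None
```

For example, after saving the response to a file and loading it with json.load, calling find_alert_definition_id(listing, "App Timeline Web UI") returns the ID to use in the following steps.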

  2. Retrieve the definition of the alert in JSON format.

[root@admin ~]# curl -H "X-Requested-By: ambari" -X GET -u admin:admin http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74 > alert.json


  3. Edit alert.json to increase "connection_timeout" from the default of 5 to 25.

[root@admin ~]# vi alert.json
"href" : "http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74", <<!! Remove this line.
:
:
"default_port" : 0.0, <<!! Remove this line
"connection_timeout" : 25.0 <<!! Change to 25 from default 5.
:
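
The same three edits can be scripted, which avoids leaving a stray comma behind when deleting lines by hand. This is a minimal Python sketch, not the article's own method. The function names are hypothetical, and because the exact nesting of "default_port" and "connection_timeout" is elided above, the code searches the whole document for those keys rather than assuming where they sit.

```python
import json


def prepare_alert_definition(definition, timeout=25.0):
    """Return a copy of the alert definition ready to be sent back.

    Drops the top-level "href", drops every "default_port" key, and sets
    every "connection_timeout" key to `timeout`.
    (Hypothetical helper for illustration only.)
    """
    def scrub(node):
        if isinstance(node, dict):
            cleaned = {}
            for key, value in node.items():
                if key == "default_port":
                    continue  # remove this entry entirely
                if key == "connection_timeout":
                    cleaned[key] = timeout
                else:
                    cleaned[key] = scrub(value)
            return cleaned
        if isinstance(node, list):
            return [scrub(value) for value in node]
        return node

    result = scrub(definition)
    result.pop("href", None)  # only the top-level href is removed
    return result


def update_alert_file(path, timeout=25.0):
    """Rewrite the JSON file at `path` in place with the edits applied."""
    with open(path) as handle:
        definition = json.load(handle)
    with open(path, "w") as handle:
        json.dump(prepare_alert_definition(definition, timeout), handle, indent=2)
```

Calling update_alert_file("alert.json") on the file saved in the previous step produces the same result as the manual edit. Keep a copy of the original file so the change can be reverted.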

  4. Apply the edited JSON file back.

[root@admin ~]# curl -X PUT -d @alert.json -i -u admin:admin -H 'X-Requested-By: ambari' http://localhost:8080/api/v1/clusters/hdp24a/alert_definitions/74
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
User: admin
Set-Cookie: AMBARISESSIONID=1g1rebkc8aziuciu8vi0jwgk;Path=/;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Content-Length: 0
Server: Jetty(8.1.17.v20150415)

  5. Restart the Ambari server.

[root@admin ~]# ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Using python /usr/bin/python
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.

 
