
clock_global job causes high memory usage

Environment

Pivotal Cloud Foundry (PCF) 1.12.3 

Component: clock-global

Symptom

The initial symptom is high memory usage on the clock-global VM. The "dips" in the graph below are the points where the job or the VM had become so unresponsive that Bosh restarted it.

[Graph: clock_global memory usage over time; the dips correspond to Bosh restarts]

Logs: “/var/vcap/sys/clock_global/cloud_controller_clock_ctl.log”  

[2017-11-10 XX:XX:19+0000] E, [2017-11-10TXX:XX:19.445328 #32029] ERROR -- : can't create Thread: Resource temporarily unavailable (ThreadError)
[2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `initialize'
[2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `new'
--------snip-----------
[2017-11-10 XX:XX:19+0000] /var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/timeout.rb:106:in `timeout'
[2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/app/jobs/timeout_job.rb:12:in

The number of open file handles is high, and nearly all of them belong to the clock_global process.

 $ lsof | wc -l 
406321
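
To confirm that the handles belong to the clock process rather than another job on the VM, the count can be scoped to that process. This is a minimal sketch; the pgrep pattern (cloud_controller_clock) is an assumption and may need adjusting to match the actual process name on your system:

 $ lsof -p $(pgrep -d, -f cloud_controller_clock) | wc -l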

After a restart, the number of file handles returns to normal. On tailing the logs, operators found a new error message.

==> cloud_controller_clock_ctl.log <==
[2017-11-10 XX:XX:16+0000] E, [2017-11-10TXX:XX:16.323780 #26820] ERROR -- : droplet presence (Sequel::ValidationFailed)
[2017-11-10 XX:XX:16+0000] /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/sequel-4.49.0/lib/sequel/model/base.rb:1824:in `save'
<snip>
[2017-11-10 XX:XX:16+0000] /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.1.326/lib/new_relic/agent.rb:443:in `disab

The number of file handles grows quickly, at a rate of roughly 400 new handles every 30 seconds, until the process starts emitting the first error message (the ThreadError shown above).
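
To measure the growth rate on a live system (and confirm the figure above), the same count used earlier can be sampled every 30 seconds. A minimal sketch:

 $ while true; do echo "$(date '+%H:%M:%S') $(lsof | wc -l)"; sleep 30; done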

 

Cause

The clock-global job periodically runs a process called diego_sync.

  1. There may be some orphaned tasks that should have been marked as failed, but the sync process could not mark them as failed because the corresponding droplet had been removed (this is the "droplet presence" Sequel::ValidationFailed error shown above).
  2. There is a bug in the sync routine that causes it to continuously spawn threads when the condition above occurs, but never clean up the old threads. This bug was introduced in ERT 1.12.3.

 

Resolution

Update the state of the affected 'missing-diego-task' tasks to FAILED in the Cloud Controller database (CCDB).

Steps to run on the clock-global VM:

1. Find the 'missing-diego-task' entries in 'cloud_controller_clock.log':

grep 'missing-diego-task' cloud_controller_clock.log | jq .data.task_guid | sort | uniq
"415644dd-1133-49c3-b9d8-60ae87c78b28"
"43ae4c03-94ee-40e4-8b71-71ba3a507be1"
"65ee10cf-07c2-488e-bba9-8af365b5d866"
"f3a423d6-7f76-49b9-a2b3-328b4b162810"

2. Update the CCDB for the above tasks, setting their state to FAILED.

Example script:

script.rb:

task_guids = ["415644dd-1133-49c3-b9d8-60ae87c78b28",
              "43ae4c03-94ee-40e4-8b71-71ba3a507be1",
              "65ee10cf-07c2-488e-bba9-8af365b5d866",
              "f3a423d6-7f76-49b9-a2b3-328b4b162810"]

TaskModel.where(guid: task_guids).each do |task|
  droplet_guid = task.droplet_guid
  droplet = DropletModel.find(guid: droplet_guid)
  # Only update tasks whose droplet no longer exists.
  if droplet.nil?
    puts "cancelling task #{task.guid}"
    # task.this updates the row directly, bypassing the model validations
    # that fail when the droplet is missing.
    task.this.update(state: "FAILED", failure_reason: "manually cancelled")
  end
end
Run the script through the Cloud Controller console:

cat script.rb | /var/vcap/jobs/cloud_controller_ng/bin/console
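
After running the script, it is worth verifying through the same console that the affected tasks are now in the FAILED state (the GUID below reuses one of the example values from step 1):

echo 'TaskModel.where(guid: ["415644dd-1133-49c3-b9d8-60ae87c78b28"]).each { |t| puts "#{t.guid} #{t.state}" }' | /var/vcap/jobs/cloud_controller_ng/bin/console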

Notes:

  1. Problems with the "clock_global" job are not catastrophic; however, they may lead to a gradual degradation of the ability to push apps (over days to weeks, depending on how frequently you push apps).
  2. The bug in the sync routine mentioned above may be triggered for reasons other than "ERROR -- : droplet presence (Sequel::ValidationFailed)".
  3. It is important to capture the logs before the system runs out of file handles, that is, directly after a restart.
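
As a practical way to follow note 3, the clock log referenced in the Symptom section can be tailed to a separate file immediately after a restart; the destination path below is only an example.

 $ tail -F /var/vcap/sys/clock_global/cloud_controller_clock_ctl.log > /tmp/cloud_controller_clock_capture.log &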
