Pivotal Knowledge Base

PCF Diego Cells at 100% inode Utilization

Environment

Pivotal Cloud Foundry 1.10.12 and above

Symptom

This issue affects Diego cells that are not deployed by Elastic Runtime, such as cells in Isolation Segments or in an open-source (OSS) deployment.

Running df -i on the Diego cell reports inode usage at or near 100%.
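
A minimal sketch of how this check could be scripted, assuming GNU df (the -P option keeps each filesystem on a single line). The function name inode_report and the default threshold of 90 are illustrative and not part of PCF.

```shell
# inode_report [THRESHOLD]: reads `df -iP` output on stdin and prints the
# mount point and inode usage of every filesystem at or above THRESHOLD
# percent (default 90). Hypothetical helper, not shipped with PCF.
inode_report() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                      # skip the header line
      pct = $5                    # IUse% column
      sub(/%/, "", pct)
      # Filesystems without inode accounting report "-" and are skipped.
      if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) {
        print $6 " " pct "%"
      }
    }'
}

# Usage on a Diego cell:
#   df -iP | inode_report 90
```

The function only reads standard input, so it can also be run against saved df output collected from several cells.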

The Diego deployment manifest should set cleanup_process_dirs_on_wait: true under garden:

/var/tempest/workspaces/default/deployments/cf-b726f387316441065827.yml:
  garden:
    cleanup_process_dirs_on_wait: true

With that property set, the garden start script (garden_ctl) passes the --cleanup-process-dirs-on-wait flag:

/var/vcap/data/jobs/garden/4456fe41ab6291aefe82ef966103d435676f45ca/bin/garden_ctl:
      --cleanup-process-dirs-on-wait \

The flag should also appear on the running gdn process:

ps -ef | grep -i gdn
root      514382  514381  2 Nov18 ?        14:24:19 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind- ...  --cleanup-process-dirs-on-wait

If the flag is not present, update the deployment manifest to include cleanup_process_dirs_on_wait: true.
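
The process check can be wrapped in a small helper. This is a sketch only: the function name gdn_flag_status is hypothetical, and it assumes the process appears in ps output as "gdn server", as in the sample line above.

```shell
# gdn_flag_status: reads a process listing on stdin and prints "set" if a
# `gdn server` command line carries --cleanup-process-dirs-on-wait,
# otherwise "missing" (also printed when no gdn process is listed).
# Hypothetical helper, not shipped with garden.
gdn_flag_status() {
  # The [g] bracket keeps this grep from matching its own command line.
  if grep '[g]dn server' | grep -q -e '--cleanup-process-dirs-on-wait'; then
    echo "set"
  else
    echo "missing"
  fi
}

# Usage on a Diego cell:
#   ps -ef | gdn_flag_status
```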

Error Message:

Applications crash with the following error:

runc exec: exit status 1: exec failed: open /var/vcap/data/garden/depot/... .../.pidfile: No space left on device

Cause 

The boolean garden property cleanup_process_dirs_on_wait was introduced in garden-runc-release v1.5.0: https://github.com/cloudfoundry/garden-runc-release/tree/v1.5.0. It defaults to false unless it is set explicitly in the deployment. When the option is disabled, garden leaves stale directories behind, and these eventually exhaust the available inodes.

Note: Versions of Elastic Runtime lower than 1.10.12 do not have this property because they use a garden release older than 1.5.0, so they are not affected by this problem. Refer to the release notes for the Garden versions packaged with ERT: https://docs.pivotal.io/pivotalcf/1-10/pcf-release-notes/runtime-rn.html

Resolution

Update the deployment manifest so that the boolean cleanup_process_dirs_on_wait is set to true.

For example:
vi /var/tempest/workspaces/default/deployments/p-isolation-segment-XXXX.yml

garden:
  cleanup_process_dirs_on_wait: true

Note: The deployment manifest varies depending on which product deployed garden. Check every manifest that contains a garden section and verify that cleanup_process_dirs_on_wait is set to true.
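
Checking several manifests by hand is error-prone, so the following sketch shows one way to scan them. It is a plain text match rather than a YAML parser, the function name manifest_flag_status is hypothetical, and the directory in the usage comment is the one from the examples in this article; adjust it for other environments.

```shell
# manifest_flag_status FILE: prints "true" if FILE contains a line of the
# form `cleanup_process_dirs_on_wait: true`, otherwise "not set".
# Text match only: it does not verify that the line sits under `garden:`.
manifest_flag_status() {
  if grep -Eq '^[[:space:]]*cleanup_process_dirs_on_wait:[[:space:]]*true[[:space:]]*$' "$1"; then
    echo "true"
  else
    echo "not set"
  fi
}

# Usage on the Ops Manager VM:
#   for f in /var/tempest/workspaces/default/deployments/*.yml; do
#     grep -q 'garden' "$f" && echo "$f: $(manifest_flag_status "$f")"
#   done
```

Any manifest reported as "not set" still needs a manual review before it is edited.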

Once the boolean value is set, run `bosh deploy <deployment name>` to apply the change.

Another option is to run bosh recreate on the Diego cells periodically until the fix is available.

Please note that any configuration change made in Ops Manager will overwrite manual changes to the deployment files.

This issue will be fixed in an upcoming release of PCF Isolation Segment.

Additional Information

Garden specification: https://github.com/cloudfoundry/garden-runc-release/blob/develop/jobs/garden/spec#L176-L178 