Pivotal Knowledge Base


Why Loggregator may lose logs

Environment

Product Version
Pivotal Cloud Foundry® (PCF) Elastic Runtime 1.7.x

Synopsis

This article discusses why the Loggregator may lose messages in Pivotal Cloud Foundry. 

Description

Loggregator simply transports log and metric messages in Pivotal Cloud Foundry Elastic Runtime and makes them available to users and external log management systems. Persisting the logs is the responsibility of whatever consumes them. Examples are aggregators such as ELK stacks or Splunk, or simply cf logs.

Log messages that are not immediately extracted and persisted are discarded. The only exception is the small number of logs held in a buffer and available through cf logs --recent. More details on the components of Loggregator can be found here.

Loggregator Design

Loggregator transports logs using the UDP protocol. The reason for this protocol choice is that Loggregator should be nonblocking to applications. With "fire and forget" UDP, Diego logging mechanisms never block on transmission.
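The "fire and forget" behaviour can be illustrated with a minimal Python sketch. This is not Loggregator or Diego code: the function names and the loopback "agent" socket are hypothetical stand-ins, and the only point shown is that a UDP send returns as soon as the datagram is handed to the operating system, whether or not anything ever reads it.

```python
import socket


def emit_log(sock, message, agent_addr):
    """Send one log line as a single UDP datagram.

    sendto() returns once the datagram is handed to the OS. No
    acknowledgement is awaited, so the sender cannot tell whether the
    receiver ever read the message.
    """
    return sock.sendto(message.encode("utf-8"), agent_addr)


def demo():
    # Hypothetical stand-in for a local agent: bound to a loopback port
    # but never reading from it.
    agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    agent.bind(("127.0.0.1", 0))
    app = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # The send succeeds immediately even though nobody consumes it.
        return emit_log(app, "hello from app", agent.getsockname())
    finally:
        app.close()
        agent.close()
```

The return value is only the number of bytes handed to the OS, which is why loss on a UDP link is invisible to the sender.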

Another design goal was good performance at scale. Loggregator scales horizontally by adding more Dopplers, Traffic Controllers, and Nozzles. This "fabric" of components should deliver messages as fast as it can, even when it has not been scaled up far enough to handle the entire load. For the same reason, UDP was chosen as the protocol from the Metrons to the Dopplers: it keeps logs flowing as freely as possible. Dopplers simply drop the UDP packets that they are not capable of consuming.
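To make the drop behaviour concrete, here is a small, deterministic Python simulation of a receiver with a bounded buffer. It is an illustrative model only, not Doppler's implementation: the tick-based timing, the buffer size, and the function name are all assumptions chosen for the example.

```python
def simulate_receiver(arrivals_per_tick, reads_per_tick, buffer_size, ticks):
    """Model a receiver that reads from a bounded buffer.

    Each tick, `arrivals_per_tick` messages arrive. Messages that do not
    fit in the buffer are dropped silently, as with UDP. The receiver
    then reads up to `reads_per_tick` messages.
    Returns (received, dropped).
    """
    buffered = 0
    received = 0
    dropped = 0
    for _ in range(ticks):
        # Arrivals: accept what fits, drop the rest.
        accepted = min(arrivals_per_tick, buffer_size - buffered)
        buffered += accepted
        dropped += arrivals_per_tick - accepted
        # Reads: the receiver drains as much as it can this tick.
        taken = min(reads_per_tick, buffered)
        buffered -= taken
        received += taken
    return received, dropped
```

With 10 arrivals and 8 reads per tick, `simulate_receiver(10, 8, 20, 100)` loses messages once the buffer fills, while `simulate_receiver(5, 8, 20, 100)` loses none. In this model the sender is never slowed down; only the overflow is discarded.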

Log Loss

The consequence of these design decisions is that delivery over UDP is not guaranteed. Both UDP links can, and do, lose log messages. A UDP message can be lost in two ways: the network drops the packet, or the receiving component cannot keep up when reading incoming UDP messages. The second mechanism is the dominant cause of loss in Loggregator, and it can occur on either of two links:

1) Metron --> Doppler

2) Application --> Metron

Predicted Message Loss Per-Doppler at Various Loads

Msgs/Sec    Loss
500         0.9%
1000        1.7%
1500        2.6%
2000        3.5%
2500        4.3%
3000        5.2%
3500        6.1%
4000        7.0%
4500        8.0%
5000        8.9%
5500        9.8%
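As a worked reading of the table, the sketch below converts a predicted loss percentage into an expected number of lost messages per second for the loads listed above. The function name is hypothetical, and the code deliberately looks up only the listed rows rather than interpolating between them, since the article gives no formula for intermediate loads.

```python
# Predicted per-Doppler loss (percent) at each listed load (msgs/sec),
# copied from the table above.
PREDICTED_LOSS_PERCENT = {
    500: 0.9, 1000: 1.7, 1500: 2.6, 2000: 3.5, 2500: 4.3, 3000: 5.2,
    3500: 6.1, 4000: 7.0, 4500: 8.0, 5000: 8.9, 5500: 9.8,
}


def expected_lost_per_second(msgs_per_sec):
    """Return the predicted number of messages lost per second.

    Raises KeyError for loads that are not listed in the table.
    """
    return msgs_per_sec * PREDICTED_LOSS_PERCENT[msgs_per_sec] / 100.0
```

For example, at 3000 msgs/sec the predicted 5.2% loss corresponds to roughly 156 lost messages per second on that Doppler.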



How to monitor log loss:

Scenario 1 - For loss between Metrons and Dopplers, compare the following metrics:

  Messages sent by Metron          --> MetronAgent.DopplerForwarder.sentMessages

  Messages received by Doppler     --> DopplerServer.dropsondeListener.receivedMessageCount

Scenario 2 - For loss within an individual Diego VM, compare the following metrics:

  Messages sent by Diego Executor  --> rep.logSenderTotalMessagesRead

  Messages processed by Metron     --> MetronAgent.DopplerForwarder.sentMessages
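The comparison in both scenarios comes down to the same arithmetic, sketched below in Python. The sketch assumes the metrics are cumulative counters sampled at the start and end of the same time window, and that values have already been summed across all senders and all receivers on the link. How the samples are collected is outside the scope of this example, and the function names are hypothetical.

```python
def counter_delta(start, end):
    """Messages counted during the window, given two cumulative samples."""
    return max(end - start, 0)


def loss_percent(sent, received):
    """Percentage of sent messages that the receiving side never counted.

    `sent` and `received` are message counts for the same time window.
    Returns 0.0 when nothing was sent, and never returns a negative value.
    """
    if sent <= 0:
        return 0.0
    lost = max(sent - received, 0)
    return 100.0 * lost / sent
```

For Scenario 1, `sent` would be the window delta of MetronAgent.DopplerForwarder.sentMessages and `received` the delta of DopplerServer.dropsondeListener.receivedMessageCount. For example, `loss_percent(1000, 983)` gives about 1.7.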

Future Improvements to Loggregator 

Loggregator will be upgraded to move its UDP links to TCP. This will solve the issues of Scenario 1, but not those of Scenario 2. However, Scenario 2 loss will no longer be invisible. The change allows explicit notification when messages are dropped because of a saturated Metron, and that notification will include data on how many messages from each app were discarded.

A good place to keep track of Loggregator changes is its GitHub repo and the product roadmap.

Additional Information 

For more information on this, please read the logreliabilityincloudfoundryloggregatorjuly2016.pdf white paper. 
