Pivotal Knowledge Base

Follow

My app has crashed, how do I interpret the crash log entry?

Symptoms

You have an application that has crashed.  Looking at cf logs or cf events shows a cryptic error like this.

App instance exited with guid 26799e0c-d098-48b3-8cab-313a4c75474c payload: 
{
    "instance"=>"f1cd7363-4b62-4236-582c-21ced40d4994", 
    "index"=>0,
    "reason"=>"CRASHED",
    "exit_description"=>
        "2 error(s) occurred:
        
        *2 error(s) occurred:
        
        * Exited with status 255
        * cancelled
        * cancelled", 
    "crash_count"=>1, 
    "crash_timestamp"=>1450113930812542373, 
    "version"=>"1c9b8366-56ce-4f97-911c-e3dcb740a40c"}

This article discusses the information present in the log output, what it means and how you can use this to track down application problems.

Cause

This message indicates that your application has crashed.  Application crashes can happen for a variety of reasons including the application has exceeded its memory limit, the application's health check has failed or the application itself has terminated.  In some cases, the crash log will tell you why the crash occurred, but in others, you may need to consult additional logs to narrow down on the culprit.  The remainder of this article discusses specific crash situations and their meanings.

Resolution

The message that is generated when an application process exits is a good starting point for debugging an application crash as it tells us some important bits of information.

  1. It tells us when the crash occurred.  This can be useful for correlating the crash information with other systems like monitoring and log search.
  2. It tells us how many times the app has crashed (see crash_count).
  3. It provides a brief description of the problem, which is often the most helpful part of the message.

Unfortunately, the exit_description itself can be a bit cryptic, but here's a break down of the common messages you'll see and what they mean.

Codependent step exited

Your application has exited normally and has returned an exit code of 0.  This often happens when a program finishes it's execution and is complete.  While this may seem like a normal thing, apps that run on PWS should never finish.  A well behaved application should continue to listen for HTTP requests or work indefinitely.  If you see this error, check for cases where your application might finish without any errors.

Exited with status 4 or status 64

There is nothing specific about an exit status of four, it is generic exit code that can mean a lot of things.  To narrow down the error further, you'll need to look at the circumstances and log message preceding the crash.

Look for the lines Timed out after 60s: health check never passed.  This message can occur when pushing or restarting an application and it indicates that your application did not start fast enough.  The platform puts a timeout, which defaults to 60 seconds, on an application starting up.  If it does not start within this time limit, you will see the error above and a crash with exit four.  To work around this, you could increase the timeout, which can go up to 180 seconds.  This is done with the -t argument to cf push or with the timeout attribute in a manifest.yml file.

    Ex:  cf push -t 180 my-cool-app

The other option is to adjust your application to start up within the given time limit. There's no easy way to do this and it might require application code changes.  The below list contains some of the common causes of slow application start ups.

  • Not enough memory allocated.  The CPU shares assigned to your application are proportionate to the amount of memory assigned.  A larger memory allocation will result in more CPU shares being allocated to your application, which in turn could make your application start faster.
  • Lack of entropy.  This is a common problem for Java based applications.  This KB article describes the problem more and provides a workaround.
  • Network timeouts.  Applications will often populate things like database connection pools at start up.  Establishing the connections for these pools is usually fast, however, connection failures may not be.  When there are problems, like when the connection is configured incorrectly, the application may hang until a timeout occurs.  If the connection pool is large enough and the timeout long enough, your application may not be able to start up in time.  The solution here is to make sure the connections are configured properly or to tune the timeouts so that you get errors more quickly.
  • Cache seeding.  Some applications that cache data will initialize said caches at start up.  There's nothing wrong with this, however, it can be time-consuming.  To make your start up time lower, you might consider lazily seeding caches as the data is accessed or deferring the work to a background thread that does not block the start up of your application.

Another potential error that you might see associated with an exit code of four is healthcheck failed: failure to make TCP connection: dial tcp <container-ip>:8080: getsockopt: connection refused.  This error message indicates that the health check for the application has failed.  It can happen on push or application restart if the application is not listening on the assigned port.  When this happens, make sure that your application is looking at the assigned port, which is placed into the $PORT environment variable, and listening for requests on that port.  The other possible startup issue that would generate this message is if the application or not starting up and listening on the port fast enough.  See the previous paragraphs for details on resolving that issue.

It is also possible for health checks to fail after an application has been running for a period of time.  In this case, you will see the same error: healthcheck failed: failure to make TCP connection: dial tcp <container-ip>:8080: getsockopt: connection refused.  The difference is that this means your application stopped listening for requests or did not respond to a health check request fast enough.  When this happens, the recommendation is to check the load on your application.  If you see high CPU usage this would be an indication that you need to scale your application.  If you're not seeing high CPU usage then you may have a concurrency problem in the application.  The recommendation for that scenario is to check for deadlock or blocking issues.

Lastly, it is possible to get an exit code of 64 which is a variation on exit code 4.  Exit code 4 means that the TCP health check failed because a connection was actively refused.  Exit code 64 means that a connection could not be established but instead of being refused, the connection attempt just timed out.  The difference is typically that a connection is actively refused when nothing is listening on the port, whereas a timeout can occur when something is listening but not responding fast enough or when a firewall silently drops the packets.

Exited with status 6 or 65

These error codes are the same as 4 and 64, except they indicate that an HTTP based health check has failed rather than a TCP based health check.  Again, a 6 indicates that the connection was actively refused whereas a 65 indicates that the connection timed out.

Exited with status 255 (out of memory)

This message is telling us that the application has exceeded it's memory quota and the system has killed it.  If you're seeing this issue, you'll need to take a closer look at the memory usage of your application.  Application logs may show additional information here, for example Java applications may indicate there was an OutOfMemoryError.  If application logs are not sufficient, a profiler or memory dump tool may be necessary to investigate the problem.

Exited with status X

Your application has exited in error and has returned exit code X.  This happens when the application encountered an error or exception, could not handle that and exited.  It's not possible to say from this what caused the error, but a well written application should have logged additional information to indicate why it has exited.  Look through your application logs prior to the time of the crash for more clues.

Note, the X in the error message will be an integer from 0 - 255 (unsigned char).  If your application returns a negative integer it will be converted to an unsigned char.  This means that if X is 255 the app could have returned either -1 or 255 since they are both the result when interpreted as an unsigned char.

Additional Information

When an app running on PWS crashes, PWS only gets a very small amount of information from the application.  This information is conveyed through the error code above.  In most cases, you'll need additional information to track down problems.  Here are some common ways to do that.

  • Logs.  PWS does not store your logs, we only buffer them for a short period of time.  If you do not happen to catch the crash right when it happens, it's likely that PWS will no longer have the logs that you need to debug the problem.  We strongly recommend that you setup a third party log management service to receive your logs and permanently store them.   This will allow you to easily search through them when there is a problem.
  • Monitoring and Statistics.  PWS can provide you with some high level stats about your application's memory and CPU usage, but it is often helpful to have more insight into the working of your application.  It's also helpful to track these stats over a period of time to watch for changes and trends.  For this, we recommend that you use a Monitoring service like NewRelic or AppDynamics.
  • Run multiple app instances.  For applications that require high availability or strive for 100% uptime, you'll want to run multiple instances of your application.  With multiple instances running, a crash of one instance will not take down your entire application.  The remaining instances will have an opportunity to handle any requests and prevent downtime.
  • When possible, debug application problems locally.  PWS is a place to run your production applications and like most production data centers, access to your running application is limited.  In many cases this makes debugging application problems more difficult or in some rare cases impossible.  Having a local environment where you can replicate problems, including load related problems, is strongly encouraged.

You may also be curious about the format of the crash description itself.  

For example:

    "exit_description"=>
        "2 error(s) occurred:
        
        *2 error(s) occurred:
        
        * Exited with status 255
        * cancelled
        * cancelled", 

Why are there two errors?  Why does it say "canceled"?  This is because Diego actually starts two processes in every application container.  There is your application process and there is a very small process which allows you to SSH into the application container.  When your application crashes, the SSH process is automatically canceled.  All of this output is then generated as Diego converts it's internal error structure into a human readable string.  For all intents and purposes, you can ignore the lines with "cancelled" and the fact that it's reporting two errors.  Instead focus on the "Exited with status X" to determine the cause of the crash.

As of the writing of this article, the Diego team is working to simplify and improve the output of the exit description.  There's not time frame for when this work will be complete, but a future update should remove the extra noise in the crash messages.

 

Comments

Powered by Zendesk