Pivotal Knowledge Base

App Usage Data and Events Data get corrupted after upgrading to or installing Pivotal Cloud Foundry 1.7.x

Environment

 Product & Versions affected                 Version
 Elastic Runtime new installs                1.6.x - 1.7.15
 Elastic Runtime upgrades from               1.7.0 - 1.7.15
 Elastic Runtime new installs NOT affected   1.7.16+

*This bug has been fixed in versions 1.7.16+. However, the issue can still be present in later versions of PCF if the foundation was upgraded from an affected version AND the issue was not addressed at or before the time of the upgrade. If you are experiencing the following symptoms and your foundation is on version 1.7.16+ (including 1.8.x and 1.9.x), it is likely that this issue was not addressed when the foundation was upgraded and is still present. Follow the instructions below for the scenario that best describes your case.

Symptom

Users could experience any of the following issues:

  • Rapidly inflating usage data in the Accounting Report and Usage Report in Apps Manager.
  • Crashing app-usage-server App: The App that handles this data, found in the System Org and System Space, repeatedly crashes.
  • The Push Apps Manager errand fails during Apply Changes or when run with the bosh run errand command.
  • cf logs app-usage-server --recent displays recent errors for the app-usage-server App.
  • cf apps displays the app-usage-server-venerable App in a started state.

Root Cause

This problem is caused by multiple instances of the App usage server suite running in a Pivotal Cloud Foundry (PCF) deployment. Multiple workers cause data replication, and calculations are performed on replicated data. This results in data corruption of the usage data. 

This issue has been observed in the following scenarios:

1. The user upgraded from PCF 1.6.x to PCF < 1.7.16. In this scenario, there was a bug in the App usage service deployment which moved the deployment to a new space and failed to clean up the Apps that were running in the old space. This resulted in multiple instances of the "app-usage" Apps running in both spaces.

2. The user installed a new PCF foundation on a version < 1.7.16. In this scenario, there was a bug in the App usage service deployment that left “venerable” Apps from a blue-green deployment in a running state. This resulted in multiple instances of the App running.

3. The user upgraded to PCF 1.7.16 or later from an affected version without addressing the bug first. In this scenario, the user may see issues with app usage data integrity. This is the result of one of the above scenarios not being addressed in prior versions.
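
As a quick triage aid for the scenarios above, the following is a minimal shell sketch that checks whether an Elastic Runtime version string predates the 1.7.16 fix. The function name predates_fix is hypothetical and not part of any PCF tooling, and it assumes a plain numeric major.minor.patch version. A foundation on 1.7.16 or later can still be affected if it was upgraded from an affected version, so a negative result here does not rule out scenario 3.

```shell
# Hypothetical helper (not part of PCF): succeeds (exit 0) when the given
# Elastic Runtime version is older than 1.7.16, the first release with the fix.
# Assumes a numeric major.minor.patch string such as 1.7.12.
predates_fix() {
  printf '%s\n' "$1" | awk -F. '{
    major = $1 + 0; minor = $2 + 0; patch = $3 + 0
    older = (major < 1) || (major == 1 && minor < 7) ||
            (major == 1 && minor == 7 && patch < 16)
    exit !older
  }'
}

# Example usage:
if predates_fix "1.7.12"; then
  echo "1.7.12 predates the fix"
fi
```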

Resolution

The resolution to follow depends on the following factors:

1. How long has the foundation been on the affected version?

The longer the foundation has run on an affected version of PCF, the more the integrity of the data is affected, so the best action to take depends on this factor. Refer to the Time Table below.

2. Production dependency of App Usage Data: Can it be deleted? Refer to "What Gets Deleted" at the end of this document.

Time Table

 Elastic Runtime Current Version  Date installed  Action                                           Result
 v1.6.x                           N/A             Upgrade to Elastic Runtime 1.7.18+               Installations that upgrade directly to ERT 1.7.18+ will not experience the issue. Proceed to upgrade directly to ERT 1.7.18+.
 v1.7.0 - 1.7.15                  <30 days ago    Upgrade to Elastic Runtime 1.7.18+ and email us  The installation likely has data quality issues that can be resolved.
 v1.7.0 - 1.7.15                  <60 days ago    Upgrade to Elastic Runtime 1.7.18+ and email us  The installation likely has data quality issues that can be partially resolved, and in some cases fully resolved.
 v1.7.0 - 1.7.15                  60+ days ago    Upgrade to Elastic Runtime 1.7.18+ and email us  The installation likely has data quality issues that may be difficult to resolve without data loss.

Repairing the Integrity of the Foundation

For 1.7.x only: Remove the app-usage-service Space and Apps.

If you have already upgraded to a version 1.8.0+, skip this section and go to the appropriate "All Versions" section below.

NOTE: The following steps remove the app-usage-service Space and Apps. This is a temporary fix that could be reverted when a new deployment occurs. Upgrading to ERT 1.7.18+ immediately following this procedure is strongly encouraged.

1. Log in to the CF API as admin and target the system org:

  • cf login -a https://<YOUR-APP-DOMAIN-API-ENDPOINT>
  • cf target -o system

2. List out all the spaces in the system org

  • cf spaces

3. If an apps-manager Space or app-usage-service Space exists, delete it to remove the Space and all Apps within it. Neither Space should be present in a PCF 1.7 installation:

  • cf delete-space <SPACE-NAME>

4. Get a list of all the Apps running in the System Space

  • cf target -o system -s system
  • cf apps

5. Confirm that none of app-usage-server-venerable, app-usage-worker-venerable, or app-usage-scheduler-venerable are running. If any of them is running, stop it:

  • cf stop <APP-NAME>

6. Validate: Re-running cf apps and cf spaces should now show that the app-usage-service Space has been removed and that the app-usage-server-venerable App is stopped.

7. Upgrade. You should now upgrade to ERT 1.7.18+ to permanently remedy this issue.
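
Steps 4 and 5 above can be scripted. The following is a minimal sketch, assuming the cf CLI v6 table layout for cf apps output, in which the first column is the App name and the second is the requested state. The function name started_venerable_apps is hypothetical. Check the output of cf apps on your foundation before relying on it.

```shell
# Hypothetical helper: reads `cf apps` output on stdin and prints the names of
# app-usage "venerable" Apps whose requested state is "started".
# Assumption: column 1 is the App name and column 2 is the requested state.
started_venerable_apps() {
  awk '$1 ~ /^app-usage-(server|worker|scheduler)-venerable$/ && $2 == "started" { print $1 }'
}

# Example usage (requires a logged-in cf CLI targeting the system org and space):
#   cf apps | started_venerable_apps | while read -r app; do cf stop "$app"; done
```

An empty result means none of the three venerable Apps is in a started state.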

All Versions Option 1: Repair and restore the data

Based on the factors listed above, the first step should be to try to repair the data. As previously mentioned, the length of the time on the affected version will determine if the data is likely to have integrity issues and whether or not it can be repaired.

For customers who are affected by this issue AND are using the usage service data for business-critical applications, please open a Support Ticket with the information below.

1. Obtain the results of the diagnostic tool, located at:

  • https://app-usage.<system-domain>/data_status_report

2. Include in the ticket all relevant details, such as the version history of the foundation in question and, if available, the output from the data_status_report page.

These results will be sent to the Apps Manager team to determine what level of recovery we can provide prior to executing the next steps.

At this point, Pivotal Support should have had a chance to review the results of the output above and determined if the data can be recovered. Next, we will need a copy of the Database, which can be obtained by creating a dump of the MySQL Database.

3. Obtain the MySQL root user's credentials from your installation

  • https://<YOUR-OPSMAN-DOMAIN>/api/v0/deployed/products/cf-*/credentials/.mysql.mysql_admin_credentials

4. From the Ops Manager VM, as the root user or with sudo, use mysqldump to export the Database. Depending on the size of the Database, this might take a while:

  • (sudo) mysqldump -h <IPADDR-MYSQL/0 vm> -u <ROOT> -p app_usage_service > app_usage_service_export.sql

5. Upload the file using https://pivotal.sendsafely.com.

We will then take that Database and repair it using internal tools and return the repaired Database dump to the customer. The restoration should be applied by or with the assistance of Pivotal field or support staff. The restoration process is as follows:

6. Using the cf CLI, log in to the affected foundation

  • cf login -a <your-affected-foundation>

7. Select System Org and System Space

  • cf target -o system -s system

8. Stop all three App Usage applications

  • cf stop app-usage-worker
  • cf stop app-usage-scheduler
  • cf stop app-usage-server

9. Export a backup of the current app_usage_service Database using mysqldump. Make sure that the export is suitable for importing in case you need to roll back (e.g., make sure DROP statements are included).

10. Import the repaired Database we provided

  • mysql -u [username] -p app_usage_service < [database name].sql

11. Start the Usage Service applications using the cf CLI in the following order

  • cf start app-usage-server
  • cf start app-usage-scheduler
  • cf start app-usage-worker

The data should start to look better and it should be 100% caught up after the Usage Service completes a full cycle at 2 AM server time. We advise waiting a full day to verify that it has worked. 
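
For the rollback backup in step 9, one way to confirm that the export includes DROP statements is to search the dump file for them. The sketch below assumes the backup was produced by mysqldump, which writes DROP TABLE IF EXISTS lines when --add-drop-table is in effect (it is enabled by default). The helper name dump_has_drop_statements and the backup file name are hypothetical.

```shell
# Hypothetical helper: succeeds when a dump file contains DROP TABLE
# statements, so that importing it replaces existing tables on rollback.
dump_has_drop_statements() {
  grep -q '^DROP TABLE IF EXISTS' "$1"
}

# Example usage (placeholders as in step 4; the backup file name is an assumption):
#   mysqldump -h <IPADDR-MYSQL/0 vm> -u <ROOT> -p --add-drop-table \
#     app_usage_service > app_usage_service_backup.sql
#   dump_has_drop_statements app_usage_service_backup.sql \
#     || echo "Backup has no DROP TABLE statements; re-export before importing"
```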

All Versions Option 2: Purge and reseed the data

Customers whose data can’t be recovered, OR who are not using Usage Data for business-critical applications, should purge and reseed their app usage data and app events using the following process.

**Warning** This process will completely ERASE the app_usage_service Database, as well as Cloud Controller’s current app events data. See the "What Gets Deleted" section below for details on what will be deleted.

1. Using the cf CLI, log in to your affected foundation

  • cf login -a <your-affected-foundation>

2. Select System Org and System Space

  • cf target -o system -s system

3. Stop all three App Usage applications

  • cf stop app-usage-worker
  • cf stop app-usage-scheduler
  • cf stop app-usage-server

4. From the Ops Manager VM, connect to the MySQL server of your affected foundation

  • bosh ssh mysql/0

5. Login to the MySQL Database using the root credentials from https://<YOUR-OPSMAN-DOMAIN>/api/v0/deployed/products/cf-*/credentials/.mysql.mysql_admin_credentials

  • mysql -u root -p

6. Drop the app_usage_service Database

  • DROP DATABASE app_usage_service;

7. Recreate an empty Database called app_usage_service

  • CREATE DATABASE app_usage_service;

8. Start the Usage Service applications using the cf CLI in the following order

  • cf start app-usage-server
  • cf start app-usage-scheduler
  • cf start app-usage-worker 
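
The stop order in step 3 and the start order in step 8 are the reverse of each other, which is easy to get wrong when typing the commands by hand. The following is a minimal sketch that wraps both sequences. The function names and the CF variable are hypothetical; setting CF=echo gives a dry run that prints the arguments instead of calling the cf CLI.

```shell
# Command used to talk to Cloud Foundry. Assumption: set CF=echo for a dry run.
CF="${CF:-cf}"

# Stop the Usage Service Apps in the order given in step 3.
stop_usage_apps() {
  for app in app-usage-worker app-usage-scheduler app-usage-server; do
    "$CF" stop "$app" || return 1
  done
}

# Start the Usage Service Apps in the order given in step 8.
start_usage_apps() {
  for app in app-usage-server app-usage-scheduler app-usage-worker; do
    "$CF" start "$app" || return 1
  done
}
```

Both functions stop at the first failing command, so a failed start of app-usage-server does not go on to start the scheduler and worker.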

You should now have the new Database populated with the indexes and tables shown below in the sample app_usage_service Database. 

What Gets Deleted when dropping the app_usage_service Database? 

Sample app_usage_service Database:

 [app_usage_service]> show tables;

 Tables_in_app_usage_service
 app_events
 app_events_fetcher_job_run_logs
 app_usage_rollover_job_run_logs
 daily_app_config_usages
 delayed_jobs
 monthly_app_config_usages
 old_app_data_deleter_job_run_logs
 old_service_data_deleter_job_run_logs
 persisted_monthly_usage_summaries
 platform_app_instance_counts
 schema_migrations
 service_events
 service_events_fetcher_job_run_logs
 service_instance_usages
 system_logs
 worker_check_ins
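
After a purge and reseed, the sample listing above can be used to check that the Usage Service recreated its tables. The following is a minimal sketch: it reads table names on stdin, one per line, and prints any table from the sample listing that is absent. The names EXPECTED_TABLES and missing_tables are hypothetical, and the exact set of tables may differ between versions, so treat a reported difference as a prompt to investigate and not as proof of a failure.

```shell
# Table names copied from the sample listing above.
EXPECTED_TABLES="app_events
app_events_fetcher_job_run_logs
app_usage_rollover_job_run_logs
daily_app_config_usages
delayed_jobs
monthly_app_config_usages
old_app_data_deleter_job_run_logs
old_service_data_deleter_job_run_logs
persisted_monthly_usage_summaries
platform_app_instance_counts
schema_migrations
service_events
service_events_fetcher_job_run_logs
service_instance_usages
system_logs
worker_check_ins"

# Hypothetical helper: prints each expected table that is not present on stdin.
missing_tables() {
  actual=$(cat)
  for table in $EXPECTED_TABLES; do
    printf '%s\n' "$actual" | grep -qxF "$table" || echo "$table"
  done
}

# Example usage (-N suppresses the column header in the mysql client output):
#   mysql -u root -p -N -e 'SHOW TABLES' app_usage_service | missing_tables
```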

Common Errors

cf logs app-usage-server --recent shows errors for the app-usage-server errand failure

"Connected, dumping recent logs for app app-usage-server in org system/space system as admin...
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `each'
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `migrate'
...

[APP/PROC/WEB/0]ERR Tasks: TOP => db:migrate
...
[API/0] OUT Process has crashed with type: "web"
[API/0] OUT App instance exited with guid <> payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* 1 error(s) occurred:\n\n* Exited with status 4\n* 2 error(s) occurred:\n\n* cancelled\n* cancelled", "crash_count"=>134, "crash_timestamp"=>..., "version"=>"..."}" 

Apps Manager reports incorrect accounting data

[Screenshot: accounting_data.png]

Deploy will fail as a result of Push Apps Manager errand failure

Running errand Push Apps Manager for Pivotal Elastic Runtime:

...
+ cf start app-usage-worker
+ echo '+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++'
+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++
...
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
FAILED
Start app timeout
Use 'cf logs app-usage-server --recent' for more information

Comments

  • Punal Patel

    How to "Connect to the SQL server of your affected foundation":
    1. “cf env app-usage-server” will show DATABASE_URL
    2. “mysql -h <host from DATABASE_URL> -u app_usage -p” prompts for password
    3. Enter password from DATABASE_URL
    4. mysql>show databases;
