Pivotal Knowledge Base

HAWQ MADlib Install Check Fails with Error: "The Database System is Starting Up"

Environment

 Product  Version
 Pivotal HDB  2.x
 Pivotal HDP  2.4, 2.5

Symptom

The MADlib install check fails with the following error:

[gpadmin@sandbox ~]$ /usr/local/hawq/madlib/bin/madpack install-check -p hawq
madpack.py : INFO : Detected HAWQ version 2.1.
TEST CASE RESULT|Module: array_ops|array_ops.sql_in|PASS|Time: 1270 milliseconds
madpack.py : ERROR : Failed executing /tmp/madlib.03z8sA/bayes/test/gaussian_naive_bayes.sql_in.tmp
madpack.py : ERROR : Check the log at /tmp/madlib.03z8sA/bayes/test/gaussian_naive_bayes.sql_in.log
TEST CASE RESULT|Module: bayes|gaussian_naive_bayes.sql_in|FAIL|Time: 10038 milliseconds
madpack.py : ERROR : Failed executing /tmp/madlib.03z8sA/bayes/test/bayes.sql_in.tmp
madpack.py : ERROR : Check the log at /tmp/madlib.03z8sA/bayes/test/bayes.sql_in.log
TEST CASE RESULT|Module: bayes|bayes.sql_in|FAIL|Time: 242 milliseconds
madpack.py : ERROR : SQL command failed:
SQL: DROP SCHEMA IF EXISTS madlib_installcheck_bayes CASCADE;
psql: FATAL:  the database system is starting up

Traceback (most recent call last):
  File "/usr/local/hawq_2_1_1_0/madlib/Versions/1.9.1/bin/../madpack/madpack.py", line 1369, in <module>
    main(sys.argv[1:])
  File "/usr/local/hawq_2_1_1_0/madlib/Versions/1.9.1/bin/../madpack/madpack.py", line 1356, in main
    _internal_run_query("DROP SCHEMA IF EXISTS %s CASCADE;" % (test_schema), True)
  File "/usr/local/hawq_2_1_1_0/madlib/Versions/1.9.1/bin/../madpack/madpack.py", line 176, in _internal_run_query
    return run_query(sql, show_error, con_args)
  File "/usr/local/hawq_2_1_1_0/madlib/Versions/1.9.1/bin/../madpack/madpack.py", line 141, in run_query
    raise Exception
Exception
[gpadmin@sandbox ~]$

When this error occurs, the HAWQ master log records a segmentation fault similar to the following, and a core dump is created:

2017-01-11 10:59:52.522303 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG5","00000","RM_PFREE at 7f52f63cd8d0 from linkedlist.c:60:cleanDQueue",,,,,,,0,,"memutilities.c",108,
2017-01-11 10:59:52.522321 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG5","00000","RM_PFREE at 7f52f63e4cf8 from linkedlist.c:60:cleanDQueue",,,,,,,0,,"memutilities.c",108,
2017-01-11 10:59:52.522338 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG5","00000","RM_PFREE at 7f52f63cfe48 from linkedlist.c:60:cleanDQueue",,,,,,,0,,"memutilities.c",108,
2017-01-11 10:59:52.522367 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG3","00000","Resource manager reads slaves file /usr/local/hawq/./etc/slaves.",,,,,,,0,,"resourcepool.c",4316,
2017-01-11 10:59:52.522376 GMT,,,p48549,th526657824,,,,0,,,seg-10000,,,,,"DEBUG4","00000","reaping dead processes",,,,,,,0,,"postmaster.c",3728,
2017-01-11 10:59:52.522404 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG3","00000","Current file change time stamp 1482313220",,,,,,,0,,"resourcepool.c",4337,
2017-01-11 10:59:52.522432 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG3","00000","Find FD 14 is read ready.",,,,,,,0,,"rmcomm_AsyncComm.c",313,
2017-01-11 10:59:52.522453 GMT,,,p48549,th526657824,,,,0,,,seg-10000,,,,,"DEBUG2","00000","server process (PID 56089) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",4748,
2017-01-11 10:59:52.522461 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG3","00000","commbuffer action mask 10, toclose 0, forced 0",,,,,,,0,,"rmcomm_AsyncComm.c",321,
2017-01-11 10:59:52.522469 GMT,,,p48549,th526657824,,,,0,,,seg-10000,,,,,"LOG","00000","server process (PID 56089) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",4748,
2017-01-11 10:59:52.522477 GMT,,,p48567,th526657824,,,,0,con4,,seg-10000,,,,,"DEBUG3","00000","FD 14 (client) is normally closed.",,,,,,,0,,"rmcomm_AsyncComm.c",356,
2017-01-11 10:59:52.522484 GMT,,,p48549,th526657824,,,,0,,,seg-10000,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",4486,
2017-01-11 10:59:52.522491 GMT,,,p48549,th526657824,,,,0,,,seg-10000,,,,,"DEBUG2","00000","sending SIGQUIT to process 56162",,,,,,,0,,"postmaster.c",4529, 
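To confirm that a master log contains the same signature, the excerpt above can be searched for the postmaster message that reports the crash. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the log file path is passed in as an argument because its location depends on the installation.

```shell
# Print every master log line that records a backend killed by signal 11.
# $1: path to a HAWQ master log file (caller-supplied; location varies by install).
find_segfaults() {
    grep -F 'was terminated by signal 11: Segmentation fault' "$1"
}

# Count only the "LOG"-level entries, so that the duplicate line written
# at a DEBUG level (as in the excerpt above) is not counted twice.
count_segfaults() {
    grep -F 'was terminated by signal 11: Segmentation fault' "$1" | grep -c -F '"LOG"'
}
```

A non-zero count from count_segfaults on the master log, with timestamps matching the failed install check, suggests the same crash described in this article.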

Cause 

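The crash is caused by a value of the configuration setting "hawq_rm_nvseg_perquery_perseg_limit" that the Resource Manager does not handle correctly. The resulting segmentation fault sends the master into crash recovery, and while recovery is in progress new connections are rejected with "the database system is starting up". The master should not go into crash recovery in this situation, so internal defect GPSQL-3341 has been filed.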

Resolution/Workaround

1. Open the Ambari GUI.

2. Locate the configuration "hawq_rm_nvseg_perquery_perseg_limit".

3. Change the setting "hawq_rm_nvseg_perquery_perseg_limit" to the default value of 6.

4. Restart all services as prompted by Ambari.

5. Rerun the MADlib install check.
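After the restart, the value can be checked and the install check rerun from the command line. The following is a minimal sketch under stated assumptions: it is run as gpadmin on the master with the HAWQ environment loaded, "hawq config -s" is available to display the current value, and madpack is at the path shown in the Symptom section. The function names and the MADPACK variable are hypothetical helpers, not part of HAWQ or MADlib.

```shell
# Name of the setting changed in step 3.
GUC=hawq_rm_nvseg_perquery_perseg_limit

# madpack location taken from the Symptom section; override if yours differs.
MADPACK="${MADPACK:-/usr/local/hawq/madlib/bin/madpack}"

# Display the value the cluster currently reports for the setting.
# The article expects the default value of 6 after the change.
show_limit() {
    hawq config -s "$GUC"
}

# Rerun the same install check that failed in the Symptom section.
rerun_install_check() {
    "$MADPACK" install-check -p hawq
}
```

Calling show_limit first and then rerun_install_check mirrors steps 3 through 5. In an Ambari-managed cluster the setting itself should still be changed through Ambari, as described above, so that the change is not overwritten.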
