Bond zero shows high network latency in gpcheckperf due to high hardware interrupts

Environment

  • DCA 2.1.1.0
  • DCA 2.1.0.0
  • DCA 2.0.4.0

Bug Reference

Symptom

In some cases, customers experience one or more of the following symptoms on DCA Version 2 systems running 2.0.4.0 or later code:

  • Mirror replication fails and mirrors go down
  • GPDB hangs at startup because primaries are not able to sync mirrors
  • GPDB query performance degrades over time

When running the gpcheckperf benchmark, obvious network-related performance issues are reported in the bandwidth test:

[root@pdca1 ~]# gpcheckperf -f /tmp/hh -r N -d /tmp
/usr/local/greenplum-db/./bin/gpcheckperf -f /tmp/hh -r N -d /tmp

-------------------
--  NETPERF TEST
-------------------

====================
==  RESULT
====================
Netperf bisection bandwidth test
sdw1 -> sdw2 = 1111.280000
sdw3 -> sdw4 = 1111.550000
sdw5 -> sdw6 = 1111.160000
sdw7 -> sdw8 = 1111.270000
sdw9 -> sdw10 = 1111.270000
sdw11 -> sdw12 = 1111.310000
sdw13 -> sdw14 = 1111.350000
sdw15 -> sdw16 = 1111.300000
sdw17 -> sdw18 = 1096.140000
sdw19 -> sdw20 = 790.800000
sdw21 -> sdw22 = 1051.670000
sdw23 -> sdw24 = 1051.780000
sdw25 -> sdw26 = 804.710000
sdw27 -> sdw28 = 1111.230000
sdw29 -> sdw30 = 1111.610000
sdw31 -> sdw32 = 1111.320000
sdw2 -> sdw1 = 1111.290000
sdw4 -> sdw3 = 1110.590000
sdw6 -> sdw5 = 1111.640000
sdw8 -> sdw7 = 1111.530000
sdw10 -> sdw9 = 1111.490000
sdw12 -> sdw11 = 1111.720000
sdw14 -> sdw13 = 1111.690000
sdw16 -> sdw15 = 1110.940000
sdw18 -> sdw17 = 1109.880000
sdw20 -> sdw19 = 1110.280000
sdw22 -> sdw21 = 1057.300000
sdw24 -> sdw23 = 1106.010000
sdw26 -> sdw25 = 260.060000
sdw28 -> sdw27 = 1111.630000
sdw30 -> sdw29 = 1111.260000
sdw32 -> sdw31 = 1111.440000

Summary:
sum = 33888.50 MB/sec
min = 260.06 MB/sec
max = 1111.72 MB/sec
avg = 1059.02 MB/sec
median = 1111.27 MB/sec

[Warning] connection between sdw19 and sdw20 is no good
[Warning] connection between sdw25 and sdw26 is no good
[Warning] connection between sdw26 and sdw25 is no good

Cause

The issue surfaces when the driver receives a network packet that contains a bad CRC. CRC errors are typically the result of a bad cable or a bad connection between the client NIC and the switch; in general they have hardware or environmental causes (for example, EM field disturbances; see Cyclic redundancy check). When the driver receives such a packet, it begins to send millions of hardware interrupts per second to the kernel. These spurious interrupts force the kernel to spend a great deal of CPU time processing them, degrading network performance in much the same way a denial-of-service attack would.
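
The interrupt rate attributable to the interconnect NICs can be estimated directly from /proc/interrupts. The snippet below is a minimal sketch, assuming the interfaces are named eth4 and eth5 as on the DCA; it samples the per-CPU counters one second apart and prints the difference.

# Rough estimate of the interrupt rate on the interconnect NICs
# (interface names eth4/eth5 are assumed). Sum the per-CPU counters
# from /proc/interrupts twice, one second apart, and print the delta.
before=$(grep -E 'eth4|eth5' /proc/interrupts | awk '{for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) sum += $i} END {print sum + 0}')
sleep 1
after=$(grep -E 'eth4|eth5' /proc/interrupts | awk '{for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) sum += $i} END {print sum + 0}')
echo "eth4/eth5 interrupts in the last second: $((after - before))"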

Verification

To confirm that a customer's reported performance problem is the issue described in this KB, make sure the symptoms match exactly. So far we have observed a very strong correlation between a high number of hardware interrupts and network CRC errors.

  • First, quickly check whether any servers are reporting a high number of interrupts per second (these checks can be run on all segment hosts at once; see the gpssh sketch after this list)
    => mpstat -I SUM -P ALL 1 1 | grep all | grep Average
    [ sdw9-cm] Average:     all   1344.00
    [ sdw6-cm] Average:     all   1306.93
    [ sdw7-cm] Average:     all   1479.00
    [ sdw4-cm] Average:     all 1335038.00
    [sdw14-cm] Average:     all   1348.00
    [sdw15-cm] Average:     all   1411.00
    [ sdw1-cm] Average:     all 1292665.35
    [ sdw5-cm] Average:     all   1207.00
    [ sdw2-cm] Average:     all   1198.05
    [ sdw3-cm] Average:     all   1197.03
    [sdw12-cm] Average:     all   1289.00
    [sdw13-cm] Average:     all   2311.88
    [sdw10-cm] Average:     all   1528.00
    [sdw11-cm] Average:     all   1260.00
    [sdw16-cm] Average:     all 632430.69
    [ sdw8-cm] Average:     all   1211.88
  • In the above case, sdw1, sdw4, and sdw16 are reporting between roughly 600,000 and 1.3 million interrupts per second while the database is stopped. Even on a fully loaded system we should not see more than 100,000 interrupts per second for a sustained period
  • If you have enough screen space, you can confirm that most of the interrupts are coming from the eth4 and eth5 interfaces using this command
    watch -n 1 -d "cat /proc/interrupts  | egrep 'eth4|eth5'"
  • Then cross-reference the hosts reporting high interrupt counts against the nodes that have received CRC errors
    => ethtool -S eth4 | egrep rx_crc_errors
    [ sdw9-cm]      rx_crc_errors: 0
    [ sdw6-cm]      rx_crc_errors: 0
    [ sdw7-cm]      rx_crc_errors: 0
    [ sdw4-cm]      rx_crc_errors: 0
    [sdw14-cm]      rx_crc_errors: 0
    [sdw15-cm]      rx_crc_errors: 0
    [ sdw1-cm]      rx_crc_errors: 164
    [ sdw5-cm]      rx_crc_errors: 0
    [ sdw2-cm]      rx_crc_errors: 0
    [ sdw3-cm]      rx_crc_errors: 0
    [sdw12-cm]      rx_crc_errors: 0
    [sdw13-cm]      rx_crc_errors: 0
    [sdw10-cm]      rx_crc_errors: 0
    [sdw11-cm]      rx_crc_errors: 0
    [sdw16-cm]      rx_crc_errors: 0
    [ sdw8-cm]      rx_crc_errors: 0
    => ethtool -S eth5 | egrep rx_crc_errors
    [ sdw9-cm]      rx_crc_errors: 0
    [ sdw6-cm]      rx_crc_errors: 0
    [ sdw7-cm]      rx_crc_errors: 0
    [ sdw4-cm]      rx_crc_errors: 12096
    [sdw14-cm]      rx_crc_errors: 0
    [sdw15-cm]      rx_crc_errors: 0
    [ sdw1-cm]      rx_crc_errors: 0
    [ sdw5-cm]      rx_crc_errors: 0
    [ sdw2-cm]      rx_crc_errors: 0
    [ sdw3-cm]      rx_crc_errors: 0
    [sdw12-cm]      rx_crc_errors: 0
    [sdw13-cm]      rx_crc_errors: 0
    [sdw10-cm]      rx_crc_errors: 0
    [sdw11-cm]      rx_crc_errors: 0
    [sdw16-cm]      rx_crc_errors: 1
    [ sdw8-cm]      rx_crc_errors: 0
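
The bracketed host prefixes in the output above (for example [ sdw9-cm]) come from running each check on every segment host at once. A minimal sketch of doing this with gpssh is shown below; the hostfile path is only an example and will differ per installation.

# Run the interrupt and CRC checks on all segment hosts in one pass.
# The hostfile path is an example; substitute the hostfile for your cluster.
gpssh -f /home/gpadmin/gpconfigs/hostfile_gpdb "mpstat -I SUM -P ALL 1 1 | grep all | grep Average"
gpssh -f /home/gpadmin/gpconfigs/hostfile_gpdb "ethtool -S eth4 | grep rx_crc_errors"
gpssh -f /home/gpadmin/gpconfigs/hostfile_gpdb "ethtool -S eth5 | grep rx_crc_errors"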

From the driver's perspective, it takes only a single bad CRC error to trigger the interrupt storm. We can see that sdw16 received only one CRC error, yet its mpstat output consistently reports about 600,000 interrupts per second, which matches this behavior. The more interrupts a host receives, the worse the problem gets over time.

Workarounds

When a system is exhibiting the symptoms described in this KB, there are several ways to work around the issue and restore performance:

  1. Reset the network interface using any of the following techniques (a verification sketch follows this list)
    • service network restart
    • ifconfig eth4 down; ifconfig eth4 up
    • ifconfig eth5 down; ifconfig eth5 up
  2. Reboot the server
  3. (Recommended) Identify the source of the CRC errors and eliminate them from the environment
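
As a minimal sketch, assuming sdw4 is the affected host and eth4/eth5 are the interconnect interfaces, the interface reset and a quick re-check of the interrupt rate might look like this:

# Bounce the interconnect interfaces on the affected host, then verify
# that the interrupt rate has returned to a normal level.
# Host and interface names are examples only; adjust for your system.
ssh sdw4 "ifconfig eth4 down; ifconfig eth4 up; ifconfig eth5 down; ifconfig eth5 up"
ssh sdw4 "mpstat -I SUM -P ALL 1 1 | grep all | grep Average"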

Fix

The hotfix and readme can be found here (direct download link here). The hotfix includes a new QLogic driver, version 3.2.63, which requires DCA version 2.1.0.0 or 2.1.1.0. You must be on version 2.1 or higher because the new driver's minimum OS requirement is RHEL 6.5.
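
To confirm which driver version a host is currently running, and that the OS meets the RHEL 6.5 requirement, something like the following can be run on each segment host (the interface name is an assumption):

# Show the loaded NIC driver and its version, and the installed OS release.
# eth4 is used as an example interface name.
ethtool -i eth4
cat /etc/redhat-release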
