- DCA 126.96.36.199
- DCA 188.8.131.52
- DCA 184.108.40.206
In some cases, customers experience one or more of the following symptoms on a DCA Version 2 running 220.127.116.11 or later code:
- Mirror replication fails and mirrors go down
- GPDB hangs at startup because primaries are not able to sync mirrors
- GPDB query performance degrades over time
When running the gpcheckperf benchmark, obvious network-related performance issues are reported in the bandwidth test:
[root@pdca1 ~]# gpcheckperf -f /tmp/hh -r N -d /tmp
/usr/local/greenplum-db/./bin/gpcheckperf -f /tmp/hh -r N -d /tmp
-------------------
--  NETPERF TEST
-------------------
====================
==  RESULT
====================
Netperf bisection bandwidth test
sdw1 -> sdw2 = 1111.280000
sdw3 -> sdw4 = 1111.550000
sdw5 -> sdw6 = 1111.160000
sdw7 -> sdw8 = 1111.270000
sdw9 -> sdw10 = 1111.270000
sdw11 -> sdw12 = 1111.310000
sdw13 -> sdw14 = 1111.350000
sdw15 -> sdw16 = 1111.300000
sdw17 -> sdw18 = 1096.140000
sdw19 -> sdw20 = 790.800000
sdw21 -> sdw22 = 1051.670000
sdw23 -> sdw24 = 1051.780000
sdw25 -> sdw26 = 804.710000
sdw27 -> sdw28 = 1111.230000
sdw29 -> sdw30 = 1111.610000
sdw31 -> sdw32 = 1111.320000
sdw2 -> sdw1 = 1111.290000
sdw4 -> sdw3 = 1110.590000
sdw6 -> sdw5 = 1111.640000
sdw8 -> sdw7 = 1111.530000
sdw10 -> sdw9 = 1111.490000
sdw12 -> sdw11 = 1111.720000
sdw14 -> sdw13 = 1111.690000
sdw16 -> sdw15 = 1110.940000
sdw18 -> sdw17 = 1109.880000
sdw20 -> sdw19 = 1110.280000
sdw22 -> sdw21 = 1057.300000
sdw24 -> sdw23 = 1106.010000
sdw26 -> sdw25 = 260.060000
sdw28 -> sdw27 = 1111.630000
sdw30 -> sdw29 = 1111.260000
sdw32 -> sdw31 = 1111.440000
Summary:
sum = 33888.50 MB/sec
min = 260.06 MB/sec
max = 1111.72 MB/sec
avg = 1059.02 MB/sec
median = 1111.27 MB/sec
[Warning] connection between sdw19 and sdw20 is no good
[Warning] connection between sdw25 and sdw26 is no good
[Warning] connection between sdw26 and sdw25 is no good
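To confirm that the flagged pairs are really the slow ones, the same bandwidth test can be rerun against just those hosts. A minimal sketch, assuming a hypothetical hostfile /tmp/hostfile_bad that lists only the flagged servers (sdw19, sdw20, sdw25, sdw26), one hostname per line:

# Rerun only the netperf bandwidth test (-r N) against the suspect hosts
gpcheckperf -f /tmp/hostfile_bad -r N -d /tmp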
The issue surfaces when the driver receives a network packet that contains a bad CRC. CRC errors are typically the result of a bad cable or a bad connection between the client NIC and the switch; we generalize them as hardware or environmental problems (e.g. EM field disturbances; see Cyclic redundancy check). When the driver receives such a packet, it begins to send millions of hardware interrupts per second to the kernel. These spurious interrupts force the kernel to spend a great deal of CPU time servicing them, degrading network performance, much as a denial-of-service attack would.
To confirm that a customer's reported performance issue is the one described in this KB, make sure the symptoms match exactly. So far we have observed a very strong correlation between a high number of hardware interrupts and network CRC errors.
- First, quickly check whether any servers are reporting a high number of interrupts per second:
=> mpstat -I SUM -P ALL 1 1 | grep all | grep Average
[ sdw9-cm] Average: all 1344.00
[ sdw6-cm] Average: all 1306.93
[ sdw7-cm] Average: all 1479.00
[ sdw4-cm] Average: all 1335038.00
[sdw14-cm] Average: all 1348.00
[sdw15-cm] Average: all 1411.00
[ sdw1-cm] Average: all 1292665.35
[ sdw5-cm] Average: all 1207.00
[ sdw2-cm] Average: all 1198.05
[ sdw3-cm] Average: all 1197.03
[sdw12-cm] Average: all 1289.00
[sdw13-cm] Average: all 2311.88
[sdw10-cm] Average: all 1528.00
[sdw11-cm] Average: all 1260.00
[sdw16-cm] Average: all 632430.69
[ sdw8-cm] Average: all 1211.88
- In the above case, sdw1, sdw4, and sdw16 are reporting 600k to 1.3 million interrupts per second while the database is stopped. Even on a fully loaded system we should not see more than 100k interrupts per second for a sustained period.
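The => prompt and [ host] prefixes in the output above come from gpssh. A minimal sketch of running the same check across all segment hosts in parallel, assuming a hypothetical hostfile /home/gpadmin/hostfile_segments with one segment hostname per line:

# gpssh runs the command on every host in the file and prefixes
# each output line with the host it came from
gpssh -f /home/gpadmin/hostfile_segments "mpstat -I SUM -P ALL 1 1 | grep all | grep Average"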
- If you have enough screen space, you can confirm that most of the interrupts are coming from the eth4 and eth5 interfaces with this command:
watch -n 1 -d "cat /proc/interrupts | egrep 'eth4|eth5'"
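A rough per-interface interrupts-per-second figure can also be computed directly from /proc/interrupts. A minimal sketch for eth4 (the awk pattern also matches per-queue lines such as eth4-0, which is what we want; non-numeric fields add zero to the sum):

# Sum the per-CPU interrupt counts on all eth4 lines, sample twice
# one second apart, and print the delta as interrupts/sec
before=$(awk '/eth4/ { for (i = 2; i < NF; i++) s += $i } END { print s }' /proc/interrupts)
sleep 1
after=$(awk '/eth4/ { for (i = 2; i < NF; i++) s += $i } END { print s }' /proc/interrupts)
echo "eth4 interrupts/sec: $((after - before))"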
- Then cross-reference the hosts reporting high interrupts against the nodes that have received CRC errors:
=> ethtool -S eth4 | egrep rx_crc_errors
[ sdw9-cm] rx_crc_errors: 0
[ sdw6-cm] rx_crc_errors: 0
[ sdw7-cm] rx_crc_errors: 0
[ sdw4-cm] rx_crc_errors: 0
[sdw14-cm] rx_crc_errors: 0
[sdw15-cm] rx_crc_errors: 0
[ sdw1-cm] rx_crc_errors: 164
[ sdw5-cm] rx_crc_errors: 0
[ sdw2-cm] rx_crc_errors: 0
[ sdw3-cm] rx_crc_errors: 0
[sdw12-cm] rx_crc_errors: 0
[sdw13-cm] rx_crc_errors: 0
[sdw10-cm] rx_crc_errors: 0
[sdw11-cm] rx_crc_errors: 0
[sdw16-cm] rx_crc_errors: 0
[ sdw8-cm] rx_crc_errors: 0

=> ethtool -S eth5 | egrep rx_crc_errors
[ sdw9-cm] rx_crc_errors: 0
[ sdw6-cm] rx_crc_errors: 0
[ sdw7-cm] rx_crc_errors: 0
[ sdw4-cm] rx_crc_errors: 12096
[sdw14-cm] rx_crc_errors: 0
[sdw15-cm] rx_crc_errors: 0
[ sdw1-cm] rx_crc_errors: 0
[ sdw5-cm] rx_crc_errors: 0
[ sdw2-cm] rx_crc_errors: 0
[ sdw3-cm] rx_crc_errors: 0
[sdw12-cm] rx_crc_errors: 0
[sdw13-cm] rx_crc_errors: 0
[sdw10-cm] rx_crc_errors: 0
[sdw11-cm] rx_crc_errors: 0
[sdw16-cm] rx_crc_errors: 1
[ sdw8-cm] rx_crc_errors: 0
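To scan a large cluster quickly, the zero counters can be filtered out, leaving only the hosts that have actually taken CRC errors. A sketch assuming the same hypothetical gpssh hostfile as above:

# Print only the hosts with a non-zero rx_crc_errors counter on
# either interconnect NIC (zero-valued lines are filtered out)
gpssh -f /home/gpadmin/hostfile_segments "ethtool -S eth4 | grep rx_crc_errors; ethtool -S eth5 | grep rx_crc_errors" | grep -v ': 0$'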
From the driver's perspective, a single bad CRC is enough to trigger the interrupt storm: sdw16 received only one CRC error, yet mpstat consistently reports about 600k interrupts per second on it, which matches this behavior. The more bad packets a client receives, the worse the problem gets over time.
When a system exhibits the symptoms described in this KB, there are several ways to work around the problem and restore performance:
- Reset the network interface using any of the following techniques (see the sketch after this list)
- service network restart
- ifconfig eth4 down; ifconfig eth4 up
- ifconfig eth5 down; ifconfig eth5 up
- Reboot the server
- (Recommended) Identify the source of the CRC errors and eliminate it from the environment
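A minimal sketch of the interface-bounce option above, run on an affected host; note that each NIC briefly drops traffic while it is bounced:

# Bounce both interconnect NICs to reset the driver and stop the
# interrupt storm
for nic in eth4 eth5; do
    ifconfig "$nic" down
    ifconfig "$nic" up
done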
The hotfix and readme can be found here (direct download link here). The hotfix includes a new QLogic driver, version 3.2.63, which requires DCA version 18.104.22.168 or 22.214.171.124. You must be on DCA 2.1 or higher because the new driver's minimum OS requirement is RHEL 6.5.
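After applying the hotfix, the loaded driver version can be verified with ethtool (eth4 shown; the same check applies to eth5):

# Report the driver name and version bound to eth4; after the
# hotfix this should show version 3.2.63
ethtool -i eth4 | egrep '^driver|^version'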