#950 Possible CPU time issues on the AWS EC2 T2 C8S machines in CentOS CI
Closed: Invalid by arrfab. Opened by mrc0mmand.

Hey!

Recently I noticed a strange issue with machines in the virt-ec2-t2-centos-8s-x86_64 pool I can't really put my finger on.

On Monday, our CI jobs utilizing the virt-ec2-t2-centos-8s-x86_64 started heavily timing out in certain tests for no apparent reason. Thinking we've maybe merged something we shouldn't I started debugging it further, but whilst doing that, all CI job started magically passing again, so I let it be. However, today the issue resurfaced with the same symptoms. This is actually the third time I noticed this, and it always "fixed" itself after several hours - for example, on Monday the last successful job was at 4:52 PM CET, then all jobs from 4:56 PM to 9:55 failed with the same symptoms. However, following day the jobs were passing without an issue (first one @ 7:28 AM).

There's nothing in any log/journal I could get my hands on. However, today I decided to take one of the problematic machines aside, and it looks like a lot of the vCPU time is actually "stolen" by the hypervisor:

# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-29-92.pool.ci.centos.org)  19/10/22    _x86_64_    (8 CPU)

15:39:18     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:23     all   11.32    0.00    2.80    0.55    1.41    0.23   45.37    0.00    0.00   38.32
15:39:23       0   12.02    0.00    4.39    0.38    0.76    0.76   65.84    0.00    0.00   15.84
15:39:23       1   10.28    0.00    2.39    1.28    1.47    0.37   37.98    0.00    0.00   46.24
15:39:23       2   13.15    0.00    1.98    0.36    0.90    0.00   52.79    0.00    0.00   30.81
15:39:23       3   10.22    0.00    2.87    1.43    1.43    0.36   38.35    0.00    0.00   45.34
15:39:23       4   12.57    0.00    2.67    0.00    1.90    0.19   53.71    0.00    0.00   28.95
15:39:23       5   10.24    0.00    3.35    0.37    1.86    0.00   39.85    0.00    0.00   44.32
15:39:23       6   12.45    0.00    1.83    0.37    1.47    0.18   45.79    0.00    0.00   37.91
15:39:23       7    9.65    0.00    2.97    0.19    1.48    0.00   29.50    0.00    0.00   56.22

15:39:23     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:28     all   12.40    0.00    2.68    0.00    1.67    0.26   50.64    0.00    0.00   32.36
15:39:28       0   13.38    0.00    2.23    0.00    2.23    0.19   54.65    0.00    0.00   27.32
15:39:28       1   11.13    0.00    2.97    0.00    1.67    0.37   38.22    0.00    0.00   45.64
15:39:28       2   10.59    0.00    2.69    0.00    1.62    0.54   39.86    0.00    0.00   44.70
15:39:28       3   15.89    0.00    1.36    0.00    1.94    0.00   65.12    0.00    0.00   15.70
15:39:28       4    9.56    0.00    2.94    0.00    2.21    0.00   36.76    0.00    0.00   48.53
15:39:28       5   13.52    0.00    5.57    0.00    0.20    0.20   80.52    0.00    0.00    0.00
15:39:28       6   14.29    0.00    1.28    0.00    2.38    0.00   52.56    0.00    0.00   29.49
15:39:28       7   11.15    0.00    2.52    0.00    1.08    0.72   40.83    0.00    0.00   43.71

15:39:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:33     all   12.36    0.00    2.53    0.00    1.45    0.09   55.86    0.00    0.00   27.71
15:39:33       0   12.95    0.00    1.69    0.00    1.69    0.19   53.28    0.00    0.00   30.21
15:39:33       1   14.45    0.00    1.90    0.00    1.33    0.19   64.26    0.00    0.00   17.87
15:39:33       2   11.03    0.00    2.85    0.00    1.42    0.00   47.51    0.00    0.00   37.19
15:39:33       3   11.19    0.00    2.35    0.00    2.35    0.00   46.21    0.00    0.00   37.91
15:39:33       4   11.40    0.00    2.57    0.00    1.84    0.18   47.79    0.00    0.00   36.21
15:39:33       5   12.48    0.00    4.35    0.00    0.38    0.19   72.97    0.00    0.00    9.64
15:39:33       6   13.74    0.00    1.81    0.00    1.08    0.00   62.21    0.00    0.00   21.16
15:39:33       7   11.80    0.00    2.72    0.00    1.45    0.00   53.72    0.00    0.00   30.31

15:39:33     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:38     all   10.95    0.00    3.11    0.95    2.62    0.74   54.25    0.00    0.00   27.38
15:39:38       0   10.88    0.00    3.56    0.75    2.63    0.56   55.72    0.00    0.00   25.89
15:39:38       1    9.14    0.00    3.23    0.72    2.33    0.54   47.49    0.00    0.00   36.56
15:39:38       2    9.49    0.00    3.65    2.55    2.92    1.09   52.37    0.00    0.00   27.92
15:39:38       3   11.11    0.00    3.01    1.32    3.20    0.94   52.17    0.00    0.00   28.25
15:39:38       4   12.12    0.00    2.69    0.77    1.92    0.38   51.92    0.00    0.00   30.19
15:39:38       5   12.98    0.00    2.29    0.00    2.10    0.38   61.45    0.00    0.00   20.80
15:39:38       6   12.22    0.00    2.82    0.19    2.63    0.75   58.08    0.00    0.00   23.31
15:39:38       7    9.93    0.00    3.55    1.24    3.19    1.24   55.14    0.00    0.00   25.71

15:39:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:43     all   12.16    0.00    3.37    0.30    1.77    0.30   59.29    0.00    0.00   22.81
15:39:43       0   12.57    0.00    2.55    0.73    1.64    0.55   56.83    0.00    0.00   25.14
15:39:43       1   14.29    0.00    1.28    0.00    2.75    0.00   61.72    0.00    0.00   19.96
15:39:43       2   12.79    0.00    2.70    0.00    1.26    0.18   56.94    0.00    0.00   26.13
15:39:43       3   15.05    0.00    2.02    0.00    2.02    0.37   65.14    0.00    0.00   15.41
15:39:43       4   11.76    0.00    2.94    0.37    2.02    0.37   55.33    0.00    0.00   27.21
15:39:43       5    8.80    0.00    5.75    0.54    1.97    0.36   55.66    0.00    0.00   26.93
15:39:43       6   10.59    0.00    4.65    0.00    1.67    0.56   58.36    0.00    0.00   24.16
15:39:43       7   11.45    0.00    5.15    0.76    0.76    0.00   64.69    0.00    0.00   17.18

15:39:43     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:48     all   10.93    0.00    3.91    0.14    1.87    0.19   53.29    0.00    0.00   29.67
15:39:48       0   16.03    0.00    1.91    0.38    1.15    0.00   73.28    0.00    0.00    7.25
15:39:48       1    6.17    0.00    7.48    0.00    1.87    0.56   49.53    0.00    0.00   34.39
15:39:48       2    8.10    0.00    3.50    0.55    3.31    0.37   27.81    0.00    0.00   56.35
15:39:48       3   10.54    0.00    3.88    0.18    1.48    0.37   53.42    0.00    0.00   30.13
15:39:48       4    8.24    0.00    3.76    0.00    2.69    0.18   35.48    0.00    0.00   49.64
15:39:48       5   18.11    0.00    0.58    0.00    0.77    0.00   75.14    0.00    0.00    5.39
15:39:48       6   12.62    0.00    4.27    0.00    1.17    0.00   62.72    0.00    0.00   19.22
15:39:48       7    8.18    0.00    5.76    0.00    2.42    0.00   51.49    0.00    0.00   32.16

15:39:48     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:53     all   11.40    0.00    3.86    0.26    1.28    0.26   51.64    0.00    0.00   31.30
15:39:53       0   13.13    0.00    2.44    0.38    1.69    0.00   58.35    0.00    0.00   24.02
15:39:53       1   11.65    0.00    2.77    0.37    2.22    0.37   45.29    0.00    0.00   37.34
15:39:53       2   13.99    0.00    1.68    0.19    1.49    0.75   51.68    0.00    0.00   30.22
15:39:53       3    8.40    0.00    5.34    0.19    1.34    0.00   47.71    0.00    0.00   37.02
15:39:53       4   10.67    0.00    3.81    0.19    1.33    0.19   42.48    0.00    0.00   41.33
15:39:53       5   16.60    0.00    1.13    0.75    0.38    0.38   65.85    0.00    0.00   14.91
15:39:53       6    7.23    0.00    9.77    0.00    0.78    0.20   70.90    0.00    0.00   11.13
15:39:53       7    9.32    0.00    4.18    0.00    0.95    0.19   31.37    0.00    0.00   53.99

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   11.65    0.00    3.17    0.32    1.72    0.30   52.92    0.00    0.00   29.92
Average:       0   12.99    0.00    2.68    0.37    1.69    0.32   59.64    0.00    0.00   22.31
Average:       1   11.00    0.00    3.14    0.34    1.95    0.34   49.16    0.00    0.00   34.06
Average:       2   11.31    0.00    2.72    0.52    1.84    0.41   47.02    0.00    0.00   36.18
Average:       3   11.75    0.00    2.97    0.45    1.96    0.29   52.45    0.00    0.00   30.11
Average:       4   10.88    0.00    3.06    0.19    1.99    0.21   46.12    0.00    0.00   37.55
Average:       5   13.19    0.00    3.30    0.24    1.11    0.22   64.23    0.00    0.00   17.71
Average:       6   11.92    0.00    3.71    0.08    1.60    0.24   58.52    0.00    0.00   23.92
Average:       7   10.22    0.00    3.82    0.32    1.63    0.32   46.71    0.00    0.00   36.99

This is all my wild speculation, but it looks like the region/cluster we're in is being quite oversaturated at some times? If so, is there anything we could to to battle this? The machine is almost unusable in such state for our tests.


Same issue just spotted on another machine:

# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-47-129.pool.ci.centos.org)     19/10/22    _x86_64_    (8 CPU)
...
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   47.46    0.23    5.14    2.99    1.58    0.23   15.75    0.00    0.00   26.62
Average:       0   45.01    0.23    5.27    3.99    1.66    0.35   14.81    0.00    0.00   28.69
Average:       1   45.90    0.21    4.85    6.58    1.53    0.21   11.78    0.00    0.00   28.95
Average:       2   46.35    0.18    6.08    7.16    1.46    0.20   13.60    0.00    0.00   24.97
Average:       3   49.17    0.26    5.19    3.09    1.57    0.26   16.04    0.00    0.00   24.41
Average:       4   46.51    0.27    5.73    0.42    1.66    0.21   18.27    0.00    0.00   26.94
Average:       5   50.20    0.23    4.62    0.23    1.55    0.20   18.10    0.00    0.00   24.85
Average:       6   45.84    0.20    4.70    1.69    1.61    0.23   16.45    0.00    0.00   29.27
Average:       7   50.67    0.23    4.72    0.73    1.61    0.21   16.97    0.00    0.00   24.85

However, in the meantime tests on n27-20-56.pool.ci.centos.org passed, and n27-29-11.pool.ci.centos.org seems to be headed the same way (the steal time there hangs around .5 - 1%).

I could also reproduce it manually on one node by running stress --cpu 8 - after some time the %steal time went from ~0% to ~80%:

# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-34-82.pool.ci.centos.org)  19/10/22    _x86_64_    (8 CPU)

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   19.08    0.00    0.00    0.00    0.27    0.00   80.64    0.00    0.00    0.00
Average:       0   12.75    0.00    0.00    0.00    0.22    0.00   87.03    0.00    0.00    0.00
Average:       1   19.08    0.00    0.00    0.00    0.15    0.00   80.77    0.00    0.00    0.00
Average:       2   22.55    0.00    0.00    0.00    0.24    0.00   77.21    0.00    0.00    0.00
Average:       3   25.94    0.00    0.00    0.00    0.38    0.00   73.68    0.00    0.00    0.00
Average:       4   19.17    0.00    0.00    0.00    0.30    0.00   80.53    0.00    0.00    0.00
Average:       5   21.31    0.00    0.00    0.00    0.39    0.00   78.30    0.00    0.00    0.00
Average:       6   17.37    0.00    0.00    0.00    0.30    0.00   82.33    0.00    0.00    0.00
Average:       7   20.48    0.00    0.00    0.00    0.27    0.00   79.25    0.00    0.00    0.00

This is really unfortunate, since without nested virt we rely fully on emulation which is quite CPU intensive. Cutting >50% of that time makes the machine unusable for tests which utilize QEMU.

Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, investigation, need-more-info

Per discussion with @cgranell , we decided that it would be worth discussing that in dedicated ci-users list, as what you're seeing here isn't an infra problem, but rather how EC t2 instance is defined by AWS themselves. :

Amazon EC2 T2 instances are Burstable Performance Instances that provide a baseline level of CPU performance with the ability to burst above the baseline.

So it works "as designed" at the Cloud level (you never know where your ec2 instance will be provisioned by AWS). Feel free to start a thread for awareness/visibility on the ci-users list, so that other tenants can also participate (they don't track the centos-infra tracker)

Metadata Update from @arrfab:
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

Log in to comment on this ticket.

Metadata