Hey!
Recently I noticed a strange issue with machines in the `virt-ec2-t2-centos-8s-x86_64` pool that I can't really put my finger on.
On Monday, our CI jobs utilizing the `virt-ec2-t2-centos-8s-x86_64` pool started heavily timing out in certain tests for no apparent reason. Thinking we had maybe merged something we shouldn't have, I started debugging it further, but while doing that all CI jobs magically started passing again, so I let it be. However, today the issue resurfaced with the same symptoms. This is actually the third time I've noticed this, and it always "fixed" itself after several hours - for example, on Monday the last successful job was at 4:52 PM CET, then all jobs from 4:56 PM to 9:55 PM failed with the same symptoms. However, the following day the jobs were passing without an issue (the first one at 7:28 AM).
There's nothing in any log/journal I could get my hands on. However, today I decided to take one of the problematic machines aside, and it looks like a lot of the vCPU time is actually "stolen" by the hypervisor:
```
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-29-92.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)

15:39:18     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:23     all   11.32    0.00    2.80    0.55    1.41    0.23   45.37    0.00    0.00   38.32
15:39:23       0   12.02    0.00    4.39    0.38    0.76    0.76   65.84    0.00    0.00   15.84
15:39:23       1   10.28    0.00    2.39    1.28    1.47    0.37   37.98    0.00    0.00   46.24
15:39:23       2   13.15    0.00    1.98    0.36    0.90    0.00   52.79    0.00    0.00   30.81
15:39:23       3   10.22    0.00    2.87    1.43    1.43    0.36   38.35    0.00    0.00   45.34
15:39:23       4   12.57    0.00    2.67    0.00    1.90    0.19   53.71    0.00    0.00   28.95
15:39:23       5   10.24    0.00    3.35    0.37    1.86    0.00   39.85    0.00    0.00   44.32
15:39:23       6   12.45    0.00    1.83    0.37    1.47    0.18   45.79    0.00    0.00   37.91
15:39:23       7    9.65    0.00    2.97    0.19    1.48    0.00   29.50    0.00    0.00   56.22

15:39:23     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:28     all   12.40    0.00    2.68    0.00    1.67    0.26   50.64    0.00    0.00   32.36
15:39:28       0   13.38    0.00    2.23    0.00    2.23    0.19   54.65    0.00    0.00   27.32
15:39:28       1   11.13    0.00    2.97    0.00    1.67    0.37   38.22    0.00    0.00   45.64
15:39:28       2   10.59    0.00    2.69    0.00    1.62    0.54   39.86    0.00    0.00   44.70
15:39:28       3   15.89    0.00    1.36    0.00    1.94    0.00   65.12    0.00    0.00   15.70
15:39:28       4    9.56    0.00    2.94    0.00    2.21    0.00   36.76    0.00    0.00   48.53
15:39:28       5   13.52    0.00    5.57    0.00    0.20    0.20   80.52    0.00    0.00    0.00
15:39:28       6   14.29    0.00    1.28    0.00    2.38    0.00   52.56    0.00    0.00   29.49
15:39:28       7   11.15    0.00    2.52    0.00    1.08    0.72   40.83    0.00    0.00   43.71

15:39:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:33     all   12.36    0.00    2.53    0.00    1.45    0.09   55.86    0.00    0.00   27.71
15:39:33       0   12.95    0.00    1.69    0.00    1.69    0.19   53.28    0.00    0.00   30.21
15:39:33       1   14.45    0.00    1.90    0.00    1.33    0.19   64.26    0.00    0.00   17.87
15:39:33       2   11.03    0.00    2.85    0.00    1.42    0.00   47.51    0.00    0.00   37.19
15:39:33       3   11.19    0.00    2.35    0.00    2.35    0.00   46.21    0.00    0.00   37.91
15:39:33       4   11.40    0.00    2.57    0.00    1.84    0.18   47.79    0.00    0.00   36.21
15:39:33       5   12.48    0.00    4.35    0.00    0.38    0.19   72.97    0.00    0.00    9.64
15:39:33       6   13.74    0.00    1.81    0.00    1.08    0.00   62.21    0.00    0.00   21.16
15:39:33       7   11.80    0.00    2.72    0.00    1.45    0.00   53.72    0.00    0.00   30.31

15:39:33     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:38     all   10.95    0.00    3.11    0.95    2.62    0.74   54.25    0.00    0.00   27.38
15:39:38       0   10.88    0.00    3.56    0.75    2.63    0.56   55.72    0.00    0.00   25.89
15:39:38       1    9.14    0.00    3.23    0.72    2.33    0.54   47.49    0.00    0.00   36.56
15:39:38       2    9.49    0.00    3.65    2.55    2.92    1.09   52.37    0.00    0.00   27.92
15:39:38       3   11.11    0.00    3.01    1.32    3.20    0.94   52.17    0.00    0.00   28.25
15:39:38       4   12.12    0.00    2.69    0.77    1.92    0.38   51.92    0.00    0.00   30.19
15:39:38       5   12.98    0.00    2.29    0.00    2.10    0.38   61.45    0.00    0.00   20.80
15:39:38       6   12.22    0.00    2.82    0.19    2.63    0.75   58.08    0.00    0.00   23.31
15:39:38       7    9.93    0.00    3.55    1.24    3.19    1.24   55.14    0.00    0.00   25.71

15:39:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:43     all   12.16    0.00    3.37    0.30    1.77    0.30   59.29    0.00    0.00   22.81
15:39:43       0   12.57    0.00    2.55    0.73    1.64    0.55   56.83    0.00    0.00   25.14
15:39:43       1   14.29    0.00    1.28    0.00    2.75    0.00   61.72    0.00    0.00   19.96
15:39:43       2   12.79    0.00    2.70    0.00    1.26    0.18   56.94    0.00    0.00   26.13
15:39:43       3   15.05    0.00    2.02    0.00    2.02    0.37   65.14    0.00    0.00   15.41
15:39:43       4   11.76    0.00    2.94    0.37    2.02    0.37   55.33    0.00    0.00   27.21
15:39:43       5    8.80    0.00    5.75    0.54    1.97    0.36   55.66    0.00    0.00   26.93
15:39:43       6   10.59    0.00    4.65    0.00    1.67    0.56   58.36    0.00    0.00   24.16
15:39:43       7   11.45    0.00    5.15    0.76    0.76    0.00   64.69    0.00    0.00   17.18

15:39:43     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:48     all   10.93    0.00    3.91    0.14    1.87    0.19   53.29    0.00    0.00   29.67
15:39:48       0   16.03    0.00    1.91    0.38    1.15    0.00   73.28    0.00    0.00    7.25
15:39:48       1    6.17    0.00    7.48    0.00    1.87    0.56   49.53    0.00    0.00   34.39
15:39:48       2    8.10    0.00    3.50    0.55    3.31    0.37   27.81    0.00    0.00   56.35
15:39:48       3   10.54    0.00    3.88    0.18    1.48    0.37   53.42    0.00    0.00   30.13
15:39:48       4    8.24    0.00    3.76    0.00    2.69    0.18   35.48    0.00    0.00   49.64
15:39:48       5   18.11    0.00    0.58    0.00    0.77    0.00   75.14    0.00    0.00    5.39
15:39:48       6   12.62    0.00    4.27    0.00    1.17    0.00   62.72    0.00    0.00   19.22
15:39:48       7    8.18    0.00    5.76    0.00    2.42    0.00   51.49    0.00    0.00   32.16

15:39:48     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
15:39:53     all   11.40    0.00    3.86    0.26    1.28    0.26   51.64    0.00    0.00   31.30
15:39:53       0   13.13    0.00    2.44    0.38    1.69    0.00   58.35    0.00    0.00   24.02
15:39:53       1   11.65    0.00    2.77    0.37    2.22    0.37   45.29    0.00    0.00   37.34
15:39:53       2   13.99    0.00    1.68    0.19    1.49    0.75   51.68    0.00    0.00   30.22
15:39:53       3    8.40    0.00    5.34    0.19    1.34    0.00   47.71    0.00    0.00   37.02
15:39:53       4   10.67    0.00    3.81    0.19    1.33    0.19   42.48    0.00    0.00   41.33
15:39:53       5   16.60    0.00    1.13    0.75    0.38    0.38   65.85    0.00    0.00   14.91
15:39:53       6    7.23    0.00    9.77    0.00    0.78    0.20   70.90    0.00    0.00   11.13
15:39:53       7    9.32    0.00    4.18    0.00    0.95    0.19   31.37    0.00    0.00   53.99

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   11.65    0.00    3.17    0.32    1.72    0.30   52.92    0.00    0.00   29.92
Average:       0   12.99    0.00    2.68    0.37    1.69    0.32   59.64    0.00    0.00   22.31
Average:       1   11.00    0.00    3.14    0.34    1.95    0.34   49.16    0.00    0.00   34.06
Average:       2   11.31    0.00    2.72    0.52    1.84    0.41   47.02    0.00    0.00   36.18
Average:       3   11.75    0.00    2.97    0.45    1.96    0.29   52.45    0.00    0.00   30.11
Average:       4   10.88    0.00    3.06    0.19    1.99    0.21   46.12    0.00    0.00   37.55
Average:       5   13.19    0.00    3.30    0.24    1.11    0.22   64.23    0.00    0.00   17.71
Average:       6   11.92    0.00    3.71    0.08    1.60    0.24   58.52    0.00    0.00   23.92
Average:       7   10.22    0.00    3.82    0.32    1.63    0.32   46.71    0.00    0.00   36.99
```
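As a side note, `%steal` can also be tracked without mpstat by diffing the `cpu` lines in `/proc/stat` (the 8th value after the CPU label is steal time in ticks). A minimal sketch of that calculation, using made-up snapshot values for illustration:

```python
# Hypothetical helper (not from the original report): computes %steal the same
# way mpstat does, from two snapshots of a "cpu" line in /proc/stat.
# Field order after the label: user nice system idle iowait irq softirq steal
# guest guest_nice.

def steal_percent(sample1: str, sample2: str) -> float:
    """Return the steal-time percentage between two /proc/stat cpu lines."""
    t1 = [int(x) for x in sample1.split()[1:]]
    t2 = [int(x) for x in sample2.split()[1:]]
    delta_total = sum(t2) - sum(t1)
    delta_steal = t2[7] - t1[7]  # 8th field is "steal"
    return 100.0 * delta_steal / delta_total if delta_total else 0.0

# Two made-up snapshots a few seconds apart (values are in clock ticks):
before = "cpu 1000 0 200 3000 50 10 5 400 0 0"
after  = "cpu 1450 0 300 4200 70 15 10 2420 0 0"
print(round(steal_percent(before, after), 1))  # → 53.2, i.e. ~53% stolen
```

On a live box you would take the two snapshots by reading `/proc/stat` a few seconds apart instead of hardcoding them.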
This is all wild speculation on my part, but it looks like the region/cluster we're in gets quite oversaturated at times? If so, is there anything we could do to combat this? The machine is almost unusable for our tests in such a state.
Same issue just spotted on another machine:
```
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-47-129.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)
...
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   47.46    0.23    5.14    2.99    1.58    0.23   15.75    0.00    0.00   26.62
Average:       0   45.01    0.23    5.27    3.99    1.66    0.35   14.81    0.00    0.00   28.69
Average:       1   45.90    0.21    4.85    6.58    1.53    0.21   11.78    0.00    0.00   28.95
Average:       2   46.35    0.18    6.08    7.16    1.46    0.20   13.60    0.00    0.00   24.97
Average:       3   49.17    0.26    5.19    3.09    1.57    0.26   16.04    0.00    0.00   24.41
Average:       4   46.51    0.27    5.73    0.42    1.66    0.21   18.27    0.00    0.00   26.94
Average:       5   50.20    0.23    4.62    0.23    1.55    0.20   18.10    0.00    0.00   24.85
Average:       6   45.84    0.20    4.70    1.69    1.61    0.23   16.45    0.00    0.00   29.27
Average:       7   50.67    0.23    4.72    0.73    1.61    0.21   16.97    0.00    0.00   24.85
```
However, in the meantime tests on `n27-20-56.pool.ci.centos.org` passed, and `n27-29-11.pool.ci.centos.org` seems to be headed the same way (the steal time there hovers around 0.5 - 1%).
I could also reproduce it manually on one node by running `stress --cpu 8` - after some time the %steal went from ~0% to ~80%:
```
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-34-82.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   19.08    0.00    0.00    0.00    0.27    0.00   80.64    0.00    0.00    0.00
Average:       0   12.75    0.00    0.00    0.00    0.22    0.00   87.03    0.00    0.00    0.00
Average:       1   19.08    0.00    0.00    0.00    0.15    0.00   80.77    0.00    0.00    0.00
Average:       2   22.55    0.00    0.00    0.00    0.24    0.00   77.21    0.00    0.00    0.00
Average:       3   25.94    0.00    0.00    0.00    0.38    0.00   73.68    0.00    0.00    0.00
Average:       4   19.17    0.00    0.00    0.00    0.30    0.00   80.53    0.00    0.00    0.00
Average:       5   21.31    0.00    0.00    0.00    0.39    0.00   78.30    0.00    0.00    0.00
Average:       6   17.37    0.00    0.00    0.00    0.30    0.00   82.33    0.00    0.00    0.00
Average:       7   20.48    0.00    0.00    0.00    0.27    0.00   79.25    0.00    0.00    0.00
```
This is really unfortunate, since without nested virt we rely fully on emulation, which is quite CPU intensive. Losing >50% of the CPU time on top of that makes the machine unusable for tests that utilize QEMU.
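For context, the fallback to emulation is easy to confirm on these machines: QEMU can only use KVM acceleration when `/dev/kvm` exists and is accessible in the guest, which requires nested virt on EC2. A minimal check (the function name is mine, the `/dev/kvm` path is standard):

```python
import os

def kvm_available() -> bool:
    """True if /dev/kvm exists and is read/write accessible, i.e. QEMU could
    use KVM acceleration instead of falling back to pure TCG emulation."""
    return os.access("/dev/kvm", os.R_OK | os.W_OK)

# On a non-nested-virt EC2 instance this reports 'tcg' (full emulation):
accel = "kvm" if kvm_available() else "tcg"
print(f"QEMU would use the '{accel}' accelerator on this host")
```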
Metadata Update from @arrfab: - Issue tagged with: centos-ci-infra, investigation, need-more-info
Per discussion with @cgranell, we decided that it would be worth discussing this on the dedicated ci-users list, as what you're seeing here isn't an infra problem, but rather how an EC2 t2 instance is defined by AWS themselves:
> Amazon EC2 T2 instances are Burstable Performance Instances that provide a baseline level of CPU performance with the ability to burst above the baseline.
So it works "as designed" at the cloud level (you never know where your EC2 instance will be provisioned by AWS). Feel free to start a thread for awareness/visibility on the ci-users list, so that other tenants can also participate (they don't track the centos-infra tracker).
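This also matches the on/off pattern in the reports: a burstable instance spends CPU credits while running above its baseline and is throttled (seen in the guest as steal time) once they run out, then recovers hours later as credits accrue. A toy sketch of those mechanics, with entirely made-up rates (real balances come from the CloudWatch `CPUCreditBalance` metric, not from this model):

```python
def simulate_credits(balance, earn_per_hr, usage_pct, baseline_pct, hours):
    """Toy model of a burstable instance: credits accrue at a fixed hourly
    rate and are spent whenever CPU usage exceeds the baseline. Returns the
    balance after each hour, floored at zero (= throttled to baseline)."""
    history = []
    for _ in range(hours):
        # Made-up spend rate: 0.6 credits per hour per % above baseline.
        spend = max(usage_pct - baseline_pct, 0) * 0.6
        balance = max(balance + earn_per_hr - spend, 0.0)
        history.append(round(balance, 1))
    return history

# A CI box pegged at 100% CPU drains its (made-up) starting balance in a few
# hours and then sits throttled at baseline until the load stops:
print(simulate_credits(balance=60, earn_per_hr=12, usage_pct=100,
                       baseline_pct=40, hours=8))
# → [36.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The numbers are purely illustrative; the real per-type baseline and earn rates are documented by AWS for each t2 size.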
Metadata Update from @arrfab: - Issue close_status updated to: Invalid - Issue status updated to: Closed (was: Open)