#1316 Duffy nodes stuck in provisioning/contextualizing loop #3
Closed: Fixed with Explanation by nphilipp. Opened by mrc0mmand.

It's that time of week again, so it looks like nodes are getting stuck in provisioning. I've been watching the virt-ec2-t2-centos-8s-x86_64 and metal-ec2-c5n-centos-8s-x86_64 pools for 30+ minutes, and the # of nodes in provisioning state hasn't changed:

{
  "action": "get",
  "pool": {
    "name": "metal-ec2-c5n-centos-8s-x86_64",
    "fill_level": 3,
    "levels": {
      "provisioning": 3,
      "ready": 0,
      "contextualizing": 0,
      "deployed": 0,
      "deprovisioning": 0
    }
  }
}
{
  "action": "get",
  "pool": {
    "name": "virt-ec2-t2-centos-8s-x86_64",
    "fill_level": 10,
    "levels": {
      "provisioning": 2,
      "ready": 8,
      "contextualizing": 0,
      "deployed": 1,
      "deprovisioning": 0
    }
  }
}
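The symptom above is that the "provisioning" count stays non-zero and unchanged across snapshots taken 30+ minutes apart. As a rough illustration (not part of Duffy itself — the function, snapshot structure, and pool dicts below just mirror the JSON output in this ticket), a stuck pool can be detected by comparing two such snapshots:

```python
# Illustrative sketch: flag pools whose "provisioning" level is non-zero
# and has not changed between two snapshots of the pool levels shown above.
# The snapshot layout mimics the ticket's JSON; it is not a Duffy API.

def stuck_pools(earlier: dict, later: dict) -> list[str]:
    """Return pool names where nodes appear stuck in provisioning."""
    stuck = []
    for name, levels in later.items():
        prev = earlier.get(name)
        if prev is None:
            continue
        if levels["provisioning"] > 0 and levels["provisioning"] == prev["provisioning"]:
            stuck.append(name)
    return stuck


# Levels taken from the two JSON blobs above, ~30 minutes apart.
snapshot_t0 = {
    "metal-ec2-c5n-centos-8s-x86_64": {"provisioning": 3, "ready": 0},
    "virt-ec2-t2-centos-8s-x86_64": {"provisioning": 2, "ready": 8},
}
snapshot_t30 = {
    "metal-ec2-c5n-centos-8s-x86_64": {"provisioning": 3, "ready": 0},
    "virt-ec2-t2-centos-8s-x86_64": {"provisioning": 2, "ready": 8},
}

print(stuck_pools(snapshot_t0, snapshot_t30))
# → ['metal-ec2-c5n-centos-8s-x86_64', 'virt-ec2-t2-centos-8s-x86_64']
```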

@nphilipp @dkirwan can someone please take a look?

I’ve just done that and unstuck things; a summary follows.

Here’s what I found and did:

  • I looked at a recent provisioning problem: Duffy wanted to register the new node under a hostname already in use by another node in the database, one which had been stuck in deprovisioning since the end of June.
  • Altogether, there were 29 nodes in the database which were stuck like that, with creation times between June 2022 and June 2023.
  • I set them all to failed and put the reason in their metadata.
  • I also set the nodes that had failed to provision to failed. The task backend then replenished the starved pools.
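The cleanup steps above amount to: find nodes stuck in deprovisioning since before a cutoff, mark them failed, and record the reason in their metadata so their hostnames no longer block new registrations. A minimal sketch of that logic, assuming a hypothetical nodes table with state, creation-date, and JSON metadata columns (the real Duffy schema and column names are not shown in this ticket):

```python
import json
import sqlite3

# Hypothetical schema standing in for Duffy's real one: each node has a
# hostname, a lifecycle state, a creation date, and a JSON metadata blob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (hostname TEXT, state TEXT, created TEXT, data TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        ("node-a", "deprovisioning", "2022-06-30", "{}"),  # stuck since June
        ("node-b", "deployed", "2023-06-01", "{}"),        # healthy
    ],
)

# Mark long-stuck deprovisioning nodes as failed and note why in metadata.
cutoff = "2023-07-01"
rows = conn.execute(
    "SELECT hostname, data FROM nodes WHERE state = 'deprovisioning' AND created < ?",
    (cutoff,),
).fetchall()
for hostname, data in rows:
    meta = json.loads(data)
    meta["error"] = "stuck in deprovisioning, manually set to failed"
    conn.execute(
        "UPDATE nodes SET state = 'failed', data = ? WHERE hostname = ?",
        (json.dumps(meta), hostname),
    )
conn.commit()

print(conn.execute("SELECT hostname, state FROM nodes ORDER BY hostname").fetchall())
# → [('node-a', 'failed'), ('node-b', 'deployed')]
```

With the stuck rows out of the way, the fill logic can register fresh nodes under the freed hostnames, which matches the observed effect of the pools replenishing afterwards.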

Metadata Update from @nphilipp:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

Metadata Update from @nphilipp:
- Issue assigned to nphilipp
