#723 centguard ban question
Closed: Fixed by pingou. Opened by pingou.

Earlier today, the centguard bot banned the entire IRC/Matrix bridge.

Here are the logs from the #centos-hyperscale room:

[03:19:45] vwbusguy[m] [~vwbusguy@fedora/vwbusguy] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:45] nb[m] [~nbm]@fedora/nb] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:45] salimma [~salimma@facebook/engineering/michel] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:45] davide [~davide@2001:470:69fc:105::c86] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:47] pingou[m] [~pingoufed@2001:470:69fc:105::1:447a] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:47] anitazha [~anitazha@2001:470:69fc:105::fd32] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:47] tdxt3d[m] [~tdmackeym@2001:470:69fc:105::1:d1e4] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:47] davdunc[m [~davdunc@2001:470:69fc:105::1:19bc] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:48] jsbillings [~jsbilling@2001:470:69fc:105::f8a2] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:49] ManuBretelle[m] [~chantrama@2001:470:69fc:105::1:4ab9] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:49] dbrandon[m] [~brandonde@2001:470:69fc:105::33ee] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:57] MarcinSkarbek[m] [~mskarbekf@2001:470:69fc:105::1:cb75] has quit IRC: Quit: Bridge terminating on SIGTERM
[03:19:57] centguard [~centguard@centos/bot/centguard] has set mode +b *!*@2001:470:69fc:105:*$##fix_your_connection

This ban affected all the people using Matrix to connect to IRC.
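To illustrate why a single mask caught everyone on the bridge, here is a minimal sketch that checks a few hostmasks assembled from the quit lines above against the mask portion of the ban. It uses Python's `fnmatchcase` as a stand-in for IRC wildcard matching, which is a simplification and not how the server or the bot implements it. The `is_banned` helper and the `nick!user@host` strings are put together for illustration only.

```python
from fnmatch import fnmatchcase

# Mask portion of the ban from the log above; the "$##fix_your_connection"
# suffix (the redirect target) is left out because it is not part of the match.
BAN_MASK = "*!*@2001:470:69fc:105:*"

# nick!user@host strings assembled from the quit lines in the log.
hostmasks = [
    "pingou[m]!~pingoufed@2001:470:69fc:105::1:447a",
    "anitazha!~anitazha@2001:470:69fc:105::fd32",
    "davide!~davide@2001:470:69fc:105::c86",
    "salimma!~salimma@facebook/engineering/michel",
]


def is_banned(hostmask, mask=BAN_MASK):
    """Hypothetical helper: case-insensitive wildcard match of hostmask against mask."""
    return fnmatchcase(hostmask.lower(), mask.lower())


matched = [h for h in hostmasks if is_banned(h)]
for h in hostmasks:
    print("matches" if h in matched else "no match", h)
```

Every user whose visible host falls under the bridge's `2001:470:69fc:105:` prefix matches, regardless of nick or ident. In this simplified string match the cloaked host does not.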

It's clear from the logs that this impacted quite a few people. This sounds to me as if we were banning one of Libera's servers after a netsplit.

Could we please configure that bot to not do this again?

Thanks


+1000 please

cc: @dcavalca @salimma @davdunc @dbrandonjohnson @chantra @anitazha @mskarbek @tdmackey

The centguard bot is managed by @gerdesas from the community. CentOS Infra doesn't maintain this bot, so we can't do much about it.

Well, it's interacting with the CentOS IRC channels and impacts the CentOS community's ability to engage.
Could this be passed on? And maybe could we allow SIGs to not use that bot in their channels?

Left out was the fact that the bot dropped the redirect 30 minutes after the incident happened (times in UTC):

07:50:00 <@centguard> [#centos-hyperscale] centguard sets [#299 -b !@2001:470:69fc:105:*$##fix_your_connection - systemd[m]!~systemd90@2001:470:69fc:105::1:df06, 30m 0s]

Historically it takes matrix a long time to stabilize. I do know that Libera staff have been working with EMS to reduce that time; I was not here so I don't know how it responded this morning.

Config updated to address the issue when I was made aware of it:

09:34:00 < Bahhumbug> ^^config channel #centos-hyperscale supybot.plugins.ChanTracker.cyclePermit
09:34:01 <@centguard> Bahhumbug: 3
09:34:04 < Bahhumbug> ^^config channel #centos-hyperscale supybot.plugins.ChanTracker.cyclePermit -1
09:34:04 <@centguard> Bahhumbug: The operation succeeded.
09:34:15 < Bahhumbug> Fixed.

The matrix bridge had planned maintenance and restarted. Due to the way the bridge assigns IPs, the bot sees it as a single /64 network bouncing and sets a redirect which, sadly, affects all matrix users.
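As a rough sketch of why the restart looks like one network bouncing: each bridged user has a distinct IPv6 address, but they all collapse into the same /64 once grouped by prefix. The code below is not the bot's actual logic. The `prefix64` and `prefixes_over_permit` helpers are hypothetical, the "more than `permit` quits" comparison is an assumption, any time window is ignored, and it assumes that a permit of -1 disables the check, as the config change above suggests.

```python
import ipaddress
from collections import Counter

# Addresses taken from the quit lines in the #centos-hyperscale log.
quits = [
    "2001:470:69fc:105::c86",
    "2001:470:69fc:105::1:447a",
    "2001:470:69fc:105::fd32",
    "2001:470:69fc:105::1:d1e4",
]


def prefix64(addr):
    """Return the /64 network that contains addr."""
    return ipaddress.ip_network(f"{addr}/64", strict=False)


def prefixes_over_permit(addresses, permit):
    """Hypothetical check: list /64 prefixes with more than `permit` quits.

    Assumption: a negative permit turns the check off entirely.
    """
    if permit < 0:
        return []
    counts = Counter(prefix64(a) for a in addresses)
    return [str(net) for net, n in counts.items() if n > permit]


print(prefixes_over_permit(quits, 3))   # old per-channel value shown in the log
print(prefixes_over_permit(quits, -1))  # value after the config change
```

Under these assumptions, four quits from four different users already exceed a permit of 3 for the shared prefix, while -1 never flags anything.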

Unfortunately this happened at one of the very, very few times I was not online. Other times when the bridge was unstable and this occurred, I was able to fix it within a moment or two, as it makes noise in our -ops channel when it occurs, often before the bridge itself was stable again :(

All of our active ops have access to the bot and the channels to fix this type of issue and manage things; as do Fabian, bstinson, smooge and zathrus so there is Infra and Red Hat oversight. The problem is that there's no documentation for this. I'll try to make time to throw some docs around this in our wiki.

EDIT: I checked: matrix did not stabilize until 0740 from what I can tell from my logs. So the redirect really only unduly affected people for ~10 minutes :)

Thanks for the explanation, I just got unlucky timing then :)

I'm going to close this ticket as fixed then, thanks again!

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)
