ejabberd cluster node disconnects #271
Reference: kosmos/chef#271
Today, I encountered a cluster without the draco node again. So I have just removed ejabberd@andromeda.kosmos.org from the cluster, after having removed the DNS record a couple of days ago. I have no idea what caused draco to leave the cluster before, or this time. But I guess if we want to solve it, we at least need to be notified when this happens (unless someone can find it in the logs already).

Whatever check we add for this could also automatically re-join the cluster node, of course. But whether we want to do that depends on why it's unintentionally leaving, I think.
Happened again today, so I wrote a quick Ruby script and added a cron job that is executed every minute, logging its output to /root/cron.log for now.

Changed title from "Notify on ejabberd cluster node disconnects" to "Investigate/fix ejabberd cluster node disconnects"

Not entirely related to the cluster disconnects, but also about disconnects: wormhole is getting disconnected exactly every hour, to the second, from when it connects to XMPP. Eventually it then fails to re-join MUC rooms, but without an error or event on the client side (at least none visible in the wormhole logs). The session should be kept for 24 hours, so I'm wondering if there's an error in that config somewhere.
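If the 24-hour expectation comes from stream management (XEP-0198) session resumption, the relevant knob might look something like the following ejabberd.yml fragment. This is a guess at the config in question, not the actual file, and the module/option names should be verified against the ejabberd version we run:

```yaml
# Hypothetical ejabberd.yml fragment: how long a detached session stays resumable
modules:
  mod_stream_mgmt:
    resume_timeout: 86400   # 24 hours, in seconds
```

Since the disconnect happens at exactly one-hour intervals, an idle timeout on whatever sits in between (HAProxy's timeout client / timeout server, for example) might also be worth ruling out.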
Just a quick status update on the original issue: the log file is still empty, and the two nodes are still connected.
Log file still empty. TFW you wish something broke so you can know why...
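For reference, a minimal sketch of what such a minute-cron monitor script could look like; the actual script isn't included in this issue, so the node names and ejabberdctl invocations here are assumptions:

```ruby
# Hypothetical sketch of the ejabberd cluster monitor script run from cron.
# Node names and ejabberdctl commands are assumptions, not the real script.

LOCAL_NODE = "ejabberd@andromeda.kosmos.org" # assumed surviving node
OTHER_NODE = "ejabberd@draco.kosmos.org"     # assumed node that keeps dropping out

# Parse the output of `ejabberdctl list_cluster` into plain node names.
def cluster_nodes(output)
  output.lines.map { |line| line.strip.delete("'") }.reject(&:empty?)
end

# True if the given node is absent from the cluster listing.
def node_missing?(output, node = OTHER_NODE)
  !cluster_nodes(output).include?(node)
end

# Print a timestamped line; cron redirects stdout to the log file.
def log_line(message)
  puts "#{Time.now.strftime('%Y-%m-%d %H:%M:%S')} #{message}"
end

# Check cluster membership and try to recover if the other node is gone.
def check_cluster
  output = `ejabberdctl list_cluster`
  return unless node_missing?(output)

  log_line("#{OTHER_NODE} missing from cluster, attempting re-join")
  # Assumption: tell the disconnected node to join back via the local one.
  system("ejabberdctl", "--node", OTHER_NODE, "join_cluster", LOCAL_NODE)
end
```

In the real script, check_cluster would be called at the bottom of the file, with a crontab entry along the lines of * * * * * /root/check_ejabberd_cluster.rb >> /root/cron.log 2>&1 — which would also explain why the log file stays empty as long as nothing is wrong.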
HAProxy on draco failed a bit this morning, so the script finally kicked into action for the first time. Unfortunately, it was missing a require "date" to actually import the DateTime functionality. Fixed now. :/

OK, so we have some logs now. However, even though ejabberd not recovering the cluster by itself is the same issue, I think the disconnects might be caused by something else. Anyway...
Since last week or so, our HAProxy on draco has been failing to forward connections for a short while every night (CET night / very early morning, American evening / late night). The situation seems to last for about 10 minutes each time, and sometimes it happens twice in a row.
When this happens, some Uptime Robot monitors of mine (Wiki and Mastodon) are catching it:
And the ejabberd monitor script logs confirm that it was indeed pretty much everything forwarded by HAProxy, not just Web properties. As the script tries to reconnect every minute, here's what that looks like:
Eventually, the connections succeed again, and the script is able to re-join the other node into the cluster.
Next steps

- For ejabberd, we simply don't have to (and shouldn't) route cluster connections through HAProxy in the first place: #310
- For HAProxy, we need to investigate what's actually happening there, so we can mitigate it: #314
Changed title from "Investigate/fix ejabberd cluster node disconnects" to "ejabberd cluster node disconnects"

The ejabberd cluster has been replaced by new nodes, which connect to each other via the private network now. However, they reconnect every few minutes for some reason, but do so automatically. So at least there is no need for any custom scripts anymore.

I think we should keep an eye on this and see if we can find out why they don't stay connected more persistently.
Just witnessed this in the logs on ejabberd-3:

The last problem there was ejabberd-3 waiting unsuccessfully for ejabberd-5 to sync the mnesia tables. The PR I just ref'ed solved the issue.

Finally closing this one for now, but I will keep an eye on the situation for the next few days.