ejabberd cluster node disconnects #271

Closed
opened 2020-12-31 08:56:39 +00:00 by raucao · 9 comments
Owner

Today, I encountered a cluster without the draco node again. So I have just removed `ejabberd@andromeda.kosmos.org` from the cluster, after having removed the DNS record a couple of days ago.

I have no idea what caused draco to leave the cluster before, or this time. But if we want to solve it, I guess we at least need to be notified when it happens (unless someone can already find the cause in the logs).

Whatever check we add for this could also automatically re-join the cluster node, of course. But whether we want to do that depends on why it's unintentionally leaving, I think.

raucao added the
service
xmpp
ops
labels 2020-12-31 08:56:39 +00:00
raucao self-assigned this 2021-01-05 16:33:01 +00:00
Author
Owner

Happened again today, so I wrote a quick Ruby script and added a cron job that runs every minute, logging its output to `/root/cron.log` for now:

require "date"

cluster_nodes = `/opt/ejabberd-20.04/bin/ejabberdctl list_cluster`

node_names = cluster_nodes.gsub("'","").lines.map(&:chomp)

if node_names.include?("ejabberd@draco.kosmos.org")
  # Cluster nodes are connected, so we're good
else
  # For some reason, draco isn't connected anymore
  puts "#{DateTime.now.to_s} ejabberd@draco.kosmos.org not found in node list, re-connecting..."
  `/opt/ejabberd-20.04/bin/ejabberdctl join_cluster ejabberd@draco.kosmos.org`
end
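
For completeness, the cron entry looks roughly like this (the script filename and location are placeholders, not the actual names on the server):

```plain
# /etc/cron.d/ejabberd-cluster-check -- filename and script path assumed
* * * * * root /usr/bin/ruby /root/check_ejabberd_cluster.rb >> /root/cron.log 2>&1
```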
raucao changed title from Notify on ejabberd cluster node disconnects to Investigate/fix ejabberd cluster node disconnects 2021-01-05 16:35:34 +00:00
Author
Owner

Not entirely related to the cluster disconnects, but also about disconnects:

`wormhole` is getting disconnected exactly every hour, to the second, counting from when it connects to XMPP. Eventually, it then fails to re-join MUC rooms, but without any error or event on the client side (at least none visible in the wormhole logs).

The session should be kept for 24 hours, so I'm wondering if there's an error in that config somewhere.

Author
Owner

Just a quick status update on the original issue: the log file is still empty, and the two nodes are still connected.

Author
Owner

Log file still empty. TFW you wish something broke so you can know why...

Author
Owner

HAProxy on draco failed a bit this morning, so the script finally kicked into action for the first time. Unfortunately, it was missing a `require "date"` to actually import the `DateTime` functionality. Fixed now. :/

Author
Owner

OK, so we have some logs now. However, even though ejabberd not recovering the cluster by itself is the same issue, I think the disconnects themselves might be caused by something else. Anyway...

Since about last week, our HAProxy on draco is failing to forward connections for a short while every night (CET night / very early morning, American evening / late night). The outage seems to last about 10 minutes each time, and sometimes it happens twice in a row.

When this happens, some Uptime Robot monitors of mine (Wiki and Mastodon) are catching it:

![Screenshot](https://gitea.kosmos.org/attachments/bc567f7f-cd64-4cc7-948c-3266941e9df5)

And the ejabberd monitor script logs confirm that it indeed affected pretty much everything forwarded by HAProxy, not just the Web properties. Since the script tries to reconnect every minute, here's what that looks like:

```plain
2021-03-04T04:02:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:03:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:04:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:05:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:06:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:07:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:08:02+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:09:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:10:02+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:11:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:12:02+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:13:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:14:02+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
2021-03-04T04:15:01+00:00 ejabberd@draco.kosmos.org not found in node list, re-connecting...
```

Eventually, the connections succeed again, and the script is able to re-join the other node into the cluster.

### Next steps

For ejabberd, we simply don't have to (and shouldn't) route through the HAProxy for cluster connections in the first place: #310

For HAProxy, we need to investigate what's actually happening there, so we can mitigate it: #314
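
As a rough sketch of what keeping cluster traffic off HAProxy could look like (not necessarily how #310 implements it): Erlang distribution only needs the other node's hostname to resolve to an address that is directly reachable, so e.g. an `/etc/hosts` entry pointing at the private network would bypass the proxy entirely. The address below is a placeholder:

```plain
# /etc/hosts on the node running the check script (address is a placeholder)
10.0.0.2    draco.kosmos.org
```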

raucao changed title from Investigate/fix ejabberd cluster node disconnects to ejabberd cluster node disconnects 2021-03-04 08:35:26 +00:00
raucao added this to the Current operational issues project 2021-03-04 08:35:30 +00:00
Author
Owner

The ejabberd cluster has been replaced by new nodes that connect to each other via the private network now. However, for some reason they disconnect and reconnect every few minutes, but they do so automatically, so at least there is no need for any custom scripts anymore.

I think we should keep an eye on this and see if we can find out why they don't stay connected more persistently.

Author
Owner

Just witnessed this in the logs on ejabberd-3:

```plain
2022-01-19 19:49:24.309 [info] <0.65.0> SIGTERM received - shutting down
2022-01-19 19:49:24.310 [error] <0.508.0> gen_server memsup terminated with reason: {port_died,normal}
2022-01-19 19:49:24.310 [error] <0.508.0> CRASH REPORT Process memsup with 0 neighbours exited with reason: {port_died,normal} in gen_server:handle_common_reply/8 line 751
2022-01-19 19:49:24.310 [error] <0.507.0> Supervisor os_mon_sup had child memsup started with memsup:start_link() at <0.508.0> exit with reason {port_died,normal} in context child_terminated
2022-01-19 19:49:24.311 [info] <0.107.0>@ejabberd_listener:stop_listener:402 Stop accepting TCP connections at [::]:5222 for ejabberd_c2s
2022-01-19 19:49:24.311 [info] <0.107.0>@ejabberd_listener:stop_listener:402 Stop accepting TLS connections at [::]:5223 for ejabberd_c2s
2022-01-19 19:49:24.311 [info] <0.107.0>@ejabberd_listener:stop_listener:402 Stop accepting TCP connections at [::]:5269 for ejabberd_s2s_in
2022-01-19 19:49:24.311 [info] <0.107.0>@ejabberd_listener:stop_listener:402 Stop accepting TLS connections at [::]:5443 for ejabberd_http
2022-01-19 19:49:24.311 [info] <0.107.0>@ejabberd_listener:stop_listener:402 Stop accepting TCP connections at 0.0.0.0:3478 for ejabberd_stun
2022-01-19 19:49:24.329 [info] <0.478.0>@mod_muc_room:terminate:877 Stopping MUC room puxov@muc.5apps.com
2022-01-19 19:49:24.336 [info] <0.107.0>@ejabberd_app:stop:109 ejabberd 20.12 is stopped in the node 'ejabberd@ejabberd-3
```
Author
Owner

The last problem there was `ejabberd-3` unsuccessfully waiting for `ejabberd-5` to sync the Mnesia tables. The PR I just ref'ed solved the issue.
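
For future reference, should this come up again: the Mnesia state on a node can be inspected with `ejabberdctl mnesia info`, which dumps Mnesia's own system report (running/stopped db nodes, table copies, and so on). Assuming `ejabberdctl` is on the PATH of the node in question:

```plain
# prints Mnesia's system report for the local node
ejabberdctl mnesia info
```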

Finally closing this one for now, but will keep an eye on the situation for the next few days.
