#981 Prosody sometimes gets stuck using 100% of the CPU
Reporter
Link Mauve
Owner
MattJ
Created
Updated
Stars
★★ (2)
Tags
Status-Fixed
Priority-Medium
Type-Defect
Patch
Link Mauve
on
What steps will reproduce the problem?
1. Have the server run for long enough, maybe
What is the expected output?
Prosody should continue to run properly.
What do you see instead?
Instead, it’s stuck using 100% of a core, not serving any client or s2s, not answering on mod_admin_telnet, not doing any syscall according to strace, but churning in what seems to be Lua land.
What version of the product are you using? On what operating system?
Prosody 0.10’s tip as of 2017-06-08, 2017-06-19, 2017-06-30 and 2017-09-07, on both amd64 and ARMv7.
Please provide any additional information below.
Sorry, I don’t think I have any. :(
Link Mauve
on
This happens on both Lua 5.1 and Lua 5.2.
Link Mauve
on
This problem has been identified in util.stanza, in maptags(), called by mod_mam_muc to filter out a malicious <stanza-id/>. The number of children and the number of tags seemingly differed, causing the loop to miss its exit condition and run forever.
Zash
on
The following code reproduces the problem in maptags:
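local st = require "util.stanza";
-- Build a message stanza whose only child is a <body/> tag.
local s = st.message({}, "Hello");
-- Swap the entry in 'tags' for a copy, so s[1] and s.tags[1] are no longer the same object.
s.tags[1] = st.clone(s.tags[1]);
-- On an affected version, this call never returns.
s:maptags(function () end);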
It happens if the top-level stanza object's children become out of sync with the subset of child nodes that are tags, kept in the 'tags' field. That should not happen with normal stanza manipulation, so the root cause is still unknown.
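To make the failure mode concrete, here is a minimal sketch that models a stanza as a plain Lua table and walks children and tags in lockstep. It is not the util.stanza source: new_stanza, the loop body and the out-of-sync check are assumptions written for illustration, and the check is there only so the sketch returns an error where an unguarded loop would spin.

```lua
-- Simplified stand-in for a stanza: the array part holds every child
-- (text and tags), while .tags holds only the children that are tags.
local function new_stanza()
	local body = { name = "body", tags = {}, "Hello" };
	return { name = "message", tags = { body }, body };
end

-- Illustrative maptags: advance through the children until every entry
-- of .tags has been matched against the child it is supposed to mirror.
local function maptags(stanza, callback)
	local tags = stanza.tags;
	local n_children, n_tags = #stanza, #tags;
	local i, curr_tag = 1, 1;
	while curr_tag <= n_tags do
		if i > n_children then
			-- Without this check, i would keep growing and the loop
			-- would never end, because tags[curr_tag] is never found.
			return nil, "children and tags are out of sync";
		end
		if stanza[i] == tags[curr_tag] then
			local ret = callback(stanza[i]);
			if ret == nil then
				-- Drop the tag from both lists and stay on the same index.
				table.remove(stanza, i);
				table.remove(tags, curr_tag);
				n_children, n_tags = n_children - 1, n_tags - 1;
				i, curr_tag = i - 1, curr_tag - 1;
			else
				stanza[i], tags[curr_tag] = ret, ret;
			end
			curr_tag = curr_tag + 1;
		end
		i = i + 1;
	end
	return stanza;
end

-- A consistent stanza is processed normally.
local good = new_stanza();
print(maptags(good, function (tag) return tag; end) == good);

-- Replacing the entry in .tags with a different table breaks the pairing.
local bad = new_stanza();
bad.tags[1] = { name = "body", tags = {}, "Hello" };
print(maptags(bad, function (tag) return tag; end));
```

Bounding the child index by the number of children is one way to turn the hang into a reportable error; whether the committed fix takes this exact form is not stated in this thread.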
Changes
tags Status-Accepted
MattJ
on
Link Mauve, can you provide a list of loaded modules on the affected server? If you have more than one server, the intersection of each should be enough.
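The intersection asked for here can be computed mechanically. The sketch below uses the two module lists posted in reply further down this thread, with the "mod_" prefix dropped for brevity; to_set and intersection are hypothetical helper names, not part of Prosody.

```lua
-- Module lists as posted in this thread ("mod_" prefix omitted).
local jabberfr = [[
roster saslauth tls dialback disco carbons pep private blocklist vcard
version uptime time ping register mam admin_adhoc admin_telnet bosh
websocket limits server_contact_info welcome watchregistrations
block_registrations checkcerts lastlog smacks smacks_offline cloud_notify
csi throttle_unsolicited firewall s2s_blacklist announce_all
secure_interfaces serverinfo measure_cpu measure_memory log_auth munin
measure_stanza_counts traceback
]];
local linkmauve = [[
roster saslauth tls dialback disco private vcard blocklist version uptime
time ping pep register admin_adhoc admin_telnet bosh http_files announce
welcome watchregistrations smacks smacks_offline carbons mam
poke_strangers secure_interfaces server_contact_info serverinfo
]];

-- Turn a whitespace-separated list into a set (name -> true).
local function to_set(list)
	local set = {};
	for name in list:gmatch("%S+") do
		set[name] = true;
	end
	return set;
end

-- Return the sorted names that appear in both lists.
local function intersection(a, b)
	local in_a, common = to_set(a), {};
	for name in pairs(to_set(b)) do
		if in_a[name] then
			common[#common + 1] = name;
		end
	end
	table.sort(common);
	return common;
end

local common = intersection(jabberfr, linkmauve);
for _, name in ipairs(common) do
	print("mod_" .. name);
end
```

With the lists as posted, 26 modules are common to both servers, and mod_cloud_notify is not among them.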
My guess is some module that isn't using util.stanza's API for manipulating stanzas.
In particular I'm aware of some very suspect code in mod_cloud_notify. I did read it a while back, and couldn't spot a bug, but it's complex enough that there's no guarantee I would. And there may well be others doing similar things...
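One way to hunt for such a module would be a consistency check run on stanzas at suspicious points. The sketch below is a hypothetical helper (tags_in_sync is not a util.stanza function) operating on plain tables shaped like stanzas: it verifies that 'tags' lists exactly the table-typed children, in order.

```lua
-- Returns true when stanza.tags mirrors the tag children of the stanza,
-- otherwise false plus a short description of the first mismatch.
local function tags_in_sync(stanza)
	local curr_tag = 1;
	for i = 1, #stanza do
		local child = stanza[i];
		if type(child) == "table" then
			if stanza.tags[curr_tag] ~= child then
				return false, ("child %d is not tags[%d]"):format(i, curr_tag);
			end
			curr_tag = curr_tag + 1;
		end
	end
	if curr_tag <= #stanza.tags then
		return false, "tags has entries that are not children";
	end
	return true;
end

-- Hypothetical misuse: append a child directly, bypassing the stanza API.
local tag = { name = "x", tags = {} };
local stanza = { name = "message", tags = {} };
table.insert(stanza, tag);      -- 'tags' is left untouched
print(tags_in_sync(stanza));

-- Keeping both lists updated, as the stanza API is expected to do,
-- restores the invariant.
table.insert(stanza.tags, tag);
print(tags_in_sync(stanza));
```

A check like this only detects the inconsistency after the fact; it would still need to be called from enough places to narrow down which module introduced it.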
Link Mauve
on
I never had mod_cloud_notify loaded on linkmauve.fr, and yet it exhibited the same symptoms twice (long ago).
Here are the modules we have loaded at JabberFR:
mod_roster
mod_saslauth
mod_tls
mod_dialback
mod_disco
mod_carbons
mod_pep
mod_private
mod_blocklist
mod_vcard
mod_version
mod_uptime
mod_time
mod_ping
mod_register
mod_mam
mod_admin_adhoc
mod_admin_telnet
mod_bosh
mod_websocket
mod_limits
mod_server_contact_info
mod_welcome
mod_watchregistrations
mod_block_registrations
mod_checkcerts
mod_lastlog
mod_smacks
mod_smacks_offline
mod_cloud_notify
mod_csi
mod_throttle_unsolicited
mod_firewall
mod_s2s_blacklist
mod_announce_all
mod_secure_interfaces
mod_serverinfo
mod_measure_cpu
mod_measure_memory
mod_log_auth
mod_munin
mod_measure_stanza_counts
mod_traceback
And at linkmauve.fr:
mod_roster
mod_saslauth
mod_tls
mod_dialback
mod_disco
mod_private
mod_vcard
mod_blocklist
mod_version
mod_uptime
mod_time
mod_ping
mod_pep
mod_register
mod_admin_adhoc
mod_admin_telnet
mod_bosh
mod_http_files
mod_announce
mod_welcome
mod_watchregistrations
mod_smacks
mod_smacks_offline
mod_carbons
mod_mam
mod_poke_strangers
mod_secure_interfaces
mod_server_contact_info
mod_serverinfo
https://linkmauve.fr/files/prosody-infinite-loop.patch fixes the infinite loop in question. I haven’t managed to identify the module causing the issue, but this patch at least fixes the symptoms.
"Fixed" in 7df29c5fbb9b.
A quick note for the record - the patch above had an off-by-one error, which was caught by unit tests I added.
Ideally we remove the fix once we identify the root cause. Link Mauve is running with a more verbose version of this patch, but in the meantime this commit will prevent anyone accidentally running into the same issue (whatever it is).
This issue seems to have resurfaced in #1856