The built-in mod_mam of Prosody 0.11 has fairly sophisticated logic for deciding which messages to store:
1. remove "useless" elements (as defined by `dont_archive_namespaces`)
2. if the message isn't empty, store it.
This is good for reducing the memory and network footprint of MAM, but it results in the following messages ending up in MAM (real examples, JIDs pseudonymized):
<message type='chat' xml:lang='en' firstname.lastname@example.org/poezio' id='b867fc06497a494e945a84b73d565c0a' email@example.com'>
<origin-id id='b867fc06497a494e945a84b73d565c0a' xmlns='urn:xmpp:sid:0'/>
<message type='chat' xml:lang='en' firstname.lastname@example.org/yaxim' id='7c0c163405834bfdb605eaf09333caa3' email@example.com/occupant'>
<origin-id id='7c0c163405834bfdb605eaf09333caa3' xmlns='urn:xmpp:sid:0'/>
<message type='chat' xml:lang='en' firstname.lastname@example.org/poezio' id='a0acfeda-d0ed-4f65-a091-b8c5bbb6a31e-21AA7' email@example.com/occupant'>
<message type='chat' xmlns='jabber:client' firstname.lastname@example.org/yaxim' id='Kr510-58' email@example.com/jitsi-31plbtd'>
Now obviously, stripping origin-id and thread ID from archived messages would be counter-productive. Therefore I suggest the following approach instead:
1. make a copy of the message
2. strip "useless" elements from the copy
3. if the copy isn't empty, store the original
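The three steps above can be sketched roughly as follows. This is an illustration in Python using `xml.etree.ElementTree`, not Prosody's actual Lua code. The contents of `DONT_ARCHIVE_NAMESPACES` and the helper names are assumptions made for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-in for dont_archive_namespaces; a real deployment may list other namespaces.
DONT_ARCHIVE_NAMESPACES = {
    "urn:xmpp:sid:0",                         # origin-id
    "http://jabber.org/protocol/chatstates",  # chat states
}


def namespace_of(element):
    """Return the namespace part of an ElementTree tag such as '{ns}name'."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def strip_namespaces(message, namespaces):
    """Remove direct children of `message` whose namespace is in `namespaces`."""
    for child in list(message):
        if namespace_of(child) in namespaces:
            message.remove(child)


def message_to_archive(message):
    """Return the unmodified message if it should be stored, else None."""
    probe = copy.deepcopy(message)                    # 1. make a copy
    strip_namespaces(probe, DONT_ARCHIVE_NAMESPACES)  # 2. strip "useless" elements from the copy
    return message if len(probe) > 0 else None        # 3. copy not empty: store the original
```

The point of the copy is that the emptiness check and the stored payload are decoupled: a message carrying only an origin-id is dropped, while a message with a body keeps its origin-id in the archive.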
Or, if you want to benefit from the storage reduction of the original solution, introduce a second variable `strip_archive_namespaces`:
1. strip elements that match `strip_archive_namespaces` (e.g. chat states)
2. make a copy of the message
3. strip `dont_archive_namespaces` elements from the copy (e.g. thread, origin-id, muc-x)
4. if the copy isn't empty, store the "original" from after step 1
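A rough Python sketch of this two-variable variant, again only an illustration and not Prosody code. Both namespace sets are example values chosen for the sketch, and `without_namespaces` and `prepare_for_archive` are hypothetical helper names. For simplicity the first stripping step also works on a copy instead of modifying the incoming message in place.

```python
import copy
import xml.etree.ElementTree as ET

# Example values only; both settings would be configurable in practice.
STRIP_ARCHIVE_NAMESPACES = {"http://jabber.org/protocol/chatstates"}
DONT_ARCHIVE_NAMESPACES = {"urn:xmpp:sid:0", "http://jabber.org/protocol/muc#user"}


def without_namespaces(message, namespaces):
    """Return a deep copy of `message` lacking direct children in `namespaces`."""
    clone = copy.deepcopy(message)
    for child in list(clone):
        tag = child.tag
        if tag.startswith("{") and tag[1:].split("}", 1)[0] in namespaces:
            clone.remove(child)
    return clone


def prepare_for_archive(message):
    """Return the element to store, or None if nothing worth archiving remains."""
    # Strip elements that should never reach the archive (e.g. chat states).
    stored = without_namespaces(message, STRIP_ARCHIVE_NAMESPACES)
    # Copy again and strip elements that alone do not justify archiving.
    probe = without_namespaces(stored, DONT_ARCHIVE_NAMESPACES)
    # Store the chat-state-free version only if something else is left.
    return stored if len(probe) > 0 else None
```

With this split, chat states never reach the archive, while origin-id survives in every message that is stored for some other reason.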
There is still a small issue with this overall approach, regarding mediated MUC invitations <https://xmpp.org/extensions/xep-0045.html#invite-mediated>:
If you strip the <x/> element, the invitation contained within will be gone as well. Hopefully, all mediated invitation implementations also provide a legacy body, so the message would end up in MAM nevertheless. We can only hope so.
Update: these messages make up roughly 12% of the data in my server's MAM:
sqlite> select count(value) from prosodyarchive where host="yax.im" and store="archive2";
sqlite> select count(value) from prosodyarchive where host="yax.im" and store="archive2" and not value like "%<body%" and not value like "%<received%" and length(value) <300;
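To turn the two counts into a ratio in one go, something like the following Python sketch could be used. It assumes only what the queries above show: a `prosodyarchive` table with `host`, `store` and `value` columns. The function name `near_empty_share` is made up for the example.

```python
import sqlite3

# Same filter as the second query above: no body, no receipt, short payload.
NEAR_EMPTY = (
    "value NOT LIKE '%<body%' "
    "AND value NOT LIKE '%<received%' "
    "AND length(value) < 300"
)


def near_empty_share(conn, host, store="archive2"):
    """Return the fraction of archived stanzas that match the near-empty filter."""
    base = "SELECT count(value) FROM prosodyarchive WHERE host = ? AND store = ?"
    total = conn.execute(base, (host, store)).fetchone()[0]
    matching = conn.execute(base + " AND " + NEAR_EMPTY, (host, store)).fetchone()[0]
    return matching / total if total else 0.0
```

Called with a connection to the server's SQLite database and the host name, it returns a value between 0 and 1.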
A lot of changes have been made to the "should we archive this" decision function in trunk. It now works as follows:
- Do not store messages of type headline since these are supposed to
be for transient notifications, most often PEP events.
- Do store messages of type error, since if you sent a message it is
of interest to know that delivery failed.
- Do not store messages of type groupchat, as MUCs will send one
message per joined resource and most often provide their own MAM.
- Follow [XEP-0334: Message Processing Hints] advising for or against
storing a message.
- Do store messages with a <body> and/or <subject> element, as these
carry messages for users.
- Do store encrypted messages for the same reason as with <body>, as
indicated by [XEP-0380: Explicit Message Encryption].
- Do store messages with [XEP-0184: Message Delivery Receipts]
requests, and the receipts themselves, as if something is important
enough to need such a receipt it is probably important enough to
store.
- Do store messages with [XEP-0333: Chat Markers], for the same
reasons as with receipts.
- Do store MUC invites, both mediated and direct.
- Do store messages with an [XEP-0353: Jingle Message Initiation]
payload, as users will want to know if they had missed calls.
- Anything not covered by the rules above is not stored.
After this, dont_archive_namespaces is applied and, as before, the message gets dropped if the result is empty. That shouldn't happen anymore unless elements matched by the rules above are configured to be stripped. The name is unfortunate; perhaps we should rename it?
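For illustration, the rule list above can be approximated in Python as follows. This is a sketch of the described rules in the order they are listed, not the actual trunk code: the evaluation order, the treatment of the individual hints and the helper names `has_child` and `should_store` are assumptions. It covers only the decision itself, not the later dont_archive_namespaces stripping.

```python
import xml.etree.ElementTree as ET

CLIENT = "jabber:client"
HINTS = "urn:xmpp:hints"
MUC_USER = "http://jabber.org/protocol/muc#user"

# Payloads that make a message worth storing: (namespace, element name or None for any).
STORE_PAYLOADS = [
    (CLIENT, "body"),
    (CLIENT, "subject"),
    ("urn:xmpp:eme:0", "encryption"),     # XEP-0380
    ("urn:xmpp:receipts", None),          # XEP-0184 requests and receipts
    ("urn:xmpp:chat-markers:0", None),    # XEP-0333
    ("jabber:x:conference", "x"),         # direct MUC invite
    ("urn:xmpp:jingle-message:0", None),  # XEP-0353
]


def has_child(message, namespace, name=None):
    """True if `message` has a direct child in `namespace` (optionally with `name`)."""
    prefix = "{" + namespace + "}"
    return any(
        child.tag.startswith(prefix) and (name is None or child.tag == prefix + name)
        for child in message
    )


def should_store(message):
    """Apply the listed rules in order; anything not matched is not stored."""
    msg_type = message.get("type", "normal")
    if msg_type == "headline":
        return False
    if msg_type == "error":
        return True
    if msg_type == "groupchat":
        return False
    if has_child(message, HINTS, "no-store") or has_child(message, HINTS, "no-permanent-store"):
        return False
    if has_child(message, HINTS, "store"):
        return True
    if any(has_child(message, ns, name) for ns, name in STORE_PAYLOADS):
        return True
    # Mediated MUC invite: <x xmlns='...muc#user'><invite/></x>
    path = "{%s}x/{%s}invite" % (MUC_USER, MUC_USER)
    return message.find(path) is not None
```

Under this sketch, a chat message whose only payload is an origin-id falls through every rule and is not stored, while a message carrying only a chat marker is.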
I believe under these new rules the messages you list would not have been stored. Is that enough to consider this issue fixed?
( Instead you get oh so many chat markers! )