Georg's Talk on Message routing

See: 2018-Summit Day one

The slides from Ge0rG's talk: https://op-co.de/tmp/whats-wrong-with-xmpp-2017.pdf

Sessions

 * Initial design
 * Long living stable TCP connection (stationary client and server)
 * Expensive setup (many RTTs for hello, TLS auth, session bind)
 * Bolt-on optimizations
 * roster versioning (delta transmission, still full presence flood)
 * Stream Management (limited to ~5mins, simulates availability, destruction of zombie sessions is challenging WRT messages)
 * Problems
 * Stale/dead "consume" messages

BIND2: Idea is to merge as much of the session setup stuff into one stanza.

Message Routing

 * It's broken
 * The fixes (carbons, MAM) are broken
 * Multi-client/mobile is broken
 * The big picture
 * Message routing 2.0

Georg: historically, message routing was based on assumption that messages are ephemeral.

As soon as user receives stored messages, they're deleted from server. Messages are sent to "most available" client. Eligible for message delivery based on priority in presence element.

If client is not online initially, nothing is received. Messages are sent to offline storage. First client to come online receives messages. Other clients receive nothing when coming online.

Routing rules depend on the message type (groupchat, normal, chat, headline, error) and on whether the message is sent to a bare or full JID.


 * chat: typical IM
 * normal: kind of email
 * headline: ticker-type or popup
 * groupchat: reserved for MUC
 * error

The exception is type "chat": such messages are rerouted to the most available client.
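
The legacy behaviour described above can be summarised in a small Python sketch. It is a simplified model of these notes, not a faithful implementation of RFC 6121: the `Session` type, the `route_legacy` function, the `"offline-storage"` marker and the choice of "highest priority wins" for "most available" are all assumptions made for illustration, and per-type differences in offline handling are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    resource: str
    priority: int  # taken from the client's presence; negative = not eligible


def route_legacy(msg_type: str, to_resource: Optional[str],
                 sessions: List[Session]) -> List[str]:
    """Return the resources that get the message (hypothetical, simplified)."""
    if to_resource is not None:
        # Full JID that matches an online session: deliver to exactly that one.
        for s in sessions:
            if s.resource == to_resource:
                return [s.resource]
        # No match: only type "chat" is rerouted as if sent to the bare JID.
        if msg_type != "chat":
            return []

    eligible = [s for s in sessions if s.priority >= 0]
    if not eligible:
        # Stored once; the first client to come online receives it.
        return ["offline-storage"]
    # "Most available" is implementation-defined; assumed here: highest priority.
    best = max(eligible, key=lambda s: s.priority)
    return [best.resource]
```

Note that every branch returns at most one target, which is exactly why other devices of the same user never see the message.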

After some years we have realized this doesn't correspond to what users expect. Users want messages on all devices. So we implemented carbons.

To cover the gap of offline clients, we've implemented MAM. When you reconnect to the server, you can query the server for archived messages.
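
As a rough illustration of such a catch-up query, the sketch below builds a MAM (XEP-0313) archive request in Python, asking for everything after the last archive ID the client has seen. The helper name `build_mam_catchup_query` and the sample IDs are made up for this example; the namespaces assume the `urn:xmpp:mam:2` version of MAM with Result Set Management (XEP-0059) paging.

```python
import xml.etree.ElementTree as ET

MAM_NS = "urn:xmpp:mam:2"                   # assumes the mam:2 namespace version
RSM_NS = "http://jabber.org/protocol/rsm"   # XEP-0059 Result Set Management


def build_mam_catchup_query(iq_id: str, last_seen_archive_id: str) -> str:
    """Build an <iq/> asking the user's own archive for newer messages."""
    iq = ET.Element("iq", {"type": "set", "id": iq_id})
    query = ET.SubElement(iq, f"{{{MAM_NS}}}query")
    rsm = ET.SubElement(query, f"{{{RSM_NS}}}set")
    after = ET.SubElement(rsm, f"{{{RSM_NS}}}after")
    after.text = last_seen_archive_id  # page forward from this archive ID
    return ET.tostring(iq, encoding="unicode")


print(build_mam_catchup_query("catchup-1", "archive-id-of-last-known-message"))
```

The client has to remember the last archive ID it processed, which is one reason the stability of those IDs matters so much in the rest of the talk.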

The top table is taken from the two RFCs. The entries in capitals are erroneous.

"most available" = implementation-defined. "non-negative presence" can be none, one or any subset of clients.


 * ejabberd: "most available" because it broke some clients

Message type problems


 * Routing depends on message type
 * Routing depends on full/bare to-JID
 * No message type for "system" messages like MAM results
 * XEPs and developers don't specify message type / use type incorrectly (e.g. XEP-0184)
 * Focus-stealing headline pop-up = UX nightmare
 * Delivery receipts don't mention message type at all

Message ID problems


 * Sender-defined identifier for a message
 * Optional
 * May contain low entropy - "1", "2", "3" etc.
 * May be rewritten by MUCs
 * XEP-0359: Unique and Stable Stanza IDs
 * MUST be unique and high-entropy value
 * Who is responsible for generation?
 * <stanza-id> (server) vs. <origin-id> (client) vs. message "id"
 * MAM is heavily dependent on this. Can use <stanza-id>s to make archive queries. The original message id doesn't play a role anymore. However, when sending a message to a MUC, you don't know whether the <stanza-id> or <origin-id> will be kept.
 * Last Message Correction, 0184 ACKs - Which ID to reference?
 * Delivery receipts - Which ID to reference?
 * Both these XEPs refer to message "id" and not to the XEP-0359 stanza id.
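
To make the three identifiers concrete, the following Python sketch parses a message that carries all of them. The sample stanza, the JIDs and the ID values are invented for illustration, and `collect_ids` is a hypothetical helper; `urn:xmpp:sid:0` is the XEP-0359 namespace.

```python
import xml.etree.ElementTree as ET

SID_NS = "urn:xmpp:sid:0"  # XEP-0359 namespace

# Invented example: low-entropy message id plus both XEP-0359 elements.
SAMPLE = """
<message xmlns='jabber:client' id='1' to='room@muc.example' type='groupchat'>
  <body>hello</body>
  <origin-id xmlns='urn:xmpp:sid:0' id='de305d54-75b4-431b-adb2-eb6b9e546013'/>
  <stanza-id xmlns='urn:xmpp:sid:0' by='room@muc.example' id='5f3dbc5e'/>
</message>
"""


def collect_ids(stanza_xml: str) -> dict:
    """Return the sender-defined id, the origin-id and all stanza-ids (by assigner)."""
    msg = ET.fromstring(stanza_xml.strip())
    origin = msg.find(f"{{{SID_NS}}}origin-id")
    stanza_ids = {
        el.get("by"): el.get("id")
        for el in msg.findall(f"{{{SID_NS}}}stanza-id")
    }
    return {
        "message_id": msg.get("id"),
        "origin_id": origin.get("id") if origin is not None else None,
        "stanza_ids": stanza_ids,
    }


print(collect_ids(SAMPLE))
```

A receipt or correction that references only `message_id` would point at "1" here, which illustrates the ambiguity the bullets above complain about.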

Conceptual problems


 * Users expect full history on all devices
 * Users expect "smart" notifications
 * Carbon-copy rules are vague, incomplete and overly complex
 * MAM rules are incomplete, moderately complex and different from Carbons
 * MAM does not give you the MAM-ID (unique <stanza-id> of sent messages)
 * MAM / offline messages / live messages -> race condition on session setup!
 * Battery saving (CSI) and push have different rules again.

Nightmare to debug these inconsistent rules. Would be great to have a consistent single ruleset.

Practical problems


 * Problems with Carbons / MAM / offline messages
 * Messages get rerouted / carbon-copied to incapable clients
 * Different rules -> disjoint histories on devices
 * Message gets CC'ed, receipt/error response doesn't
 * Won't get error response on mobile client after sending message on desktop client. Could be problematic
 * MAM + MUC = madness (3rd party hosted, NS versions!?)
 * OTR = madness


 * Interop problems with other XEPs:
 * Which entity shall send 0184 ACKs? One client? All clients? MAM?
 * Implicit assumption: Body-less messages = ephemeral = unimportant (Chat Markers, Chat States, ACKs, OMEMO, ...)

MAM archive query:

"For MUC need to query before joining, which requires membership or making messages globally available." When receiving CC'd messages, you might not know whether they come from a MUC service or not, and would have to do a disco query first to figure out whether the JID sending the message is a MUC or a private chat.

Delivery receipts:

"First client that receives message, sends receipt. What if all clients are offline? Then message is archived. First client that queries archive, should that client send receipt? The client synchronizing should keep track of receipts for messages and send receipts for messages which don't have receipts. Which means receipts have to be archived as well."

The big picture


 * This is a chat history database synchronisation problem
 * All "full-sync" clients should eventually arrive at the same chat history state, whether they were online, offline or in CSI battery saving mode.
 * Unified rules needed for all Message Routing XEPs:
 * Carbons, MAM, CSI, PUSH, ..., error responses
 * What affects Message Routing (for IM)?
 * Message persistence (will affect the chat history DB?)
 * Message urgency (for immediate CSI and PUSH pass-through)
 * We need explicit encoding for persistence and urgency
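
One way to picture the "same chat history state" goal is as a merge keyed by a stable, unique ID: whichever path a message arrives on (live, Carbons, MAM), every client ends up with the same set. The Python sketch below is only an illustration under that assumption; the `Entry` type and the `merge_history` function are hypothetical and not part of any XEP.

```python
from typing import Dict, Iterable, List, NamedTuple


class Entry(NamedTuple):
    archive_id: str   # assumed unique and stable, e.g. a server-assigned stanza-id
    timestamp: float
    body: str


def merge_history(local: Iterable[Entry], incoming: Iterable[Entry]) -> List[Entry]:
    """Union two histories by archive ID and return them in a deterministic order."""
    by_id: Dict[str, Entry] = {e.archive_id: e for e in local}
    for e in incoming:
        # A message already known under the same ID is not duplicated.
        by_id.setdefault(e.archive_id, e)
    # Sort by time, then ID, so all clients agree on the order.
    return sorted(by_id.values(), key=lambda e: (e.timestamp, e.archive_id))
```

Because the merge is a set union with a deterministic sort, the order in which a client learns about messages does not matter. This only works if every path delivers the same ID for the same message, which is the gap the ID problems above describe.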

Solution: Message Routing 2.0


 * Persistence: Change Semantics of Bare-JID/Full-JID recipient
 * Bare-JID = persistent, delivered to all clients (allowed by RFC6121)
 * Full-JID = ephemeral, single clients, never re-routed (conflicts RFC6121)
 * Need for ephemeral all-online-clients routing (Bare-JID + <no-archive/>?)
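
The proposed recipient semantics can be sketched as follows. This is a hypothetical Python model of the bullet points above, not an implementation of any existing server; `route_v2` and its `no_archive` flag (standing in for the tentative `<no-archive/>` marker) are invented names.

```python
from typing import List, Optional, Tuple


def route_v2(to_resource: Optional[str], online_resources: List[str],
             no_archive: bool = False) -> Tuple[List[str], bool]:
    """Return (recipients, persist) under the proposed Bare-/Full-JID semantics."""
    if to_resource is None:
        # Bare JID: delivered to all clients; persistent unless marked otherwise.
        return list(online_resources), not no_archive
    # Full JID: ephemeral, one client only, never re-routed or archived.
    if to_resource in online_resources:
        return [to_resource], False
    return [], False
```

Compared with the legacy rules, the message type no longer influences the decision at all: the recipient address alone carries the persistence information.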


 * Burn (Message) resource locking with fire!
 * Creates nasty race conditions
 * Or at least send all persistent messages to Bare-JID and ignore client caps for them


 * Urgency: strawman proposal, discussion needed!
 * by default: persistent = urgent
 * use Hints XEP for deviations in either direction
 * How to handle MUC/MIX mentions (E2EE vs. meta-data leak)?
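
The strawman urgency rule ("persistent = urgent" unless a hint says otherwise) might look like this in Python. The `Hint` enum is a placeholder for whatever hint elements would eventually be chosen; its member names and the `is_urgent` function are invented for this sketch.

```python
from enum import Enum
from typing import Optional


class Hint(Enum):
    URGENT = "urgent"          # hypothetical hint: pass through CSI / push now
    NOT_URGENT = "not-urgent"  # hypothetical hint: may be held back


def is_urgent(persistent: bool, hint: Optional[Hint] = None) -> bool:
    """Default: persistent messages are urgent; an explicit hint overrides that."""
    if hint is Hint.URGENT:
        return True
    if hint is Hint.NOT_URGENT:
        return False
    return persistent
```

The open question about MUC/MIX mentions is not covered here: with end-to-end encryption the server cannot see a mention, so it cannot derive the hint by itself.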

For encrypted MUC/MIX we might not want to send messages to all clients immediately, unless the user is mentioned.

What we need are server-side or account side notification preferences.


 * Incompatible with current clients / servers
 * Requires changes in client-server communication
 * Leverage bind2 -> session2
 * Apply new Bare-/Full-JID semantics
 * "MAM subscription" instead of Carbons? (+sent-msg ID reflection)
 * Requires transition logic for legacy XMPP clients/servers
 * Use <no-archive/> + <private/> on messages leaving session2 domain
 * Apply combined legacy{Carbons, MAM, CSI, Push} wisdom on messages entering session2 domain, derive Urgency and Persistence