I am delighted to read that Microsoft is working on an MQTT implementation in its Azure Service Bus, as shown in Clemens Vasters’ blog post. I welcome the addition of MQTT to Azure Service Bus and I am keen on a positive discussion on Clemens’ observations as a contribution to the future development of MQTT by the OASIS Technical Committee that is working on MQTT. Unfortunately Clemens’ article includes a number of criticisms of MQTT that I feel are inaccurate. I will address these below in, I hope, the spirit of positive debate.
Many of his points relate to the overall style of MQTT or are concerned with particular details such as whether it should have 1 or 2 byte length counts or the order in which fields appear in the CONNECT packet. However, the article makes three specific criticisms which I feel need to be answered, namely that:
- MQTT has characteristics that make it unsuitable for Internet of Things use cases
- It has features that are difficult, or perhaps impossible, to implement in an Internet Scale Messaging Server
- It is missing a number of features that you would expect in a Messaging Protocol
In order to keep this post relatively short, I will concentrate on these three points. Before doing that, I need to clarify a few points on the protocol itself.
i) Scope and intended use of MQTT
It's important to note that MQTT is not intended to be an enterprise messaging protocol, where bandwidth, processing power & network reliability can be taken as a given. It’s therefore not surprising that it lacks some of the features that you find in a protocol like AMQP. Instead MQTT was designed for constrained devices and networks. While constraints have evolved over the last fifteen years, factors such as cost per byte and client implementation size remain critically important for networks and applications powering the Internet of Things. Every additional feature adds to the overall complexity of the specification, and typically adds to the network cost of running the protocol and processing power required for implementations.
It’s also worth noting that MQTT has been used successfully in machine-to-machine (M2M) applications in many industries for over 15 years and its usage continues to grow rapidly. Rather than throw this up in the air and produce a radically new, untried protocol, the strategy being followed by OASIS is to standardise what already works, with minor adjustments to aid interoperability. Enhancements to the protocol can be added in future versions of the specification after they have had an appropriate amount of field testing. That has of necessity limited the amount of change that has occurred between Version 3.1 and 3.1.1, the proposed standard. I would encourage Clemens and Microsoft to engage actively with the future development of MQTT through the OASIS TC.
ii) Error handling
Clemens states that all errors (apart from problems on CONNECT) result in the server summarily dropping the network connection without giving any indication to the client as to why it is doing this. This is not the case, since:
- In the 3.1.1 MQTT specification, the SUBACK response includes a failure response code. The server is required to set this if the SUBSCRIBE request fails. This includes the case where the client is not authorised to subscribe to the topic in question.
- The server is permitted to accept an unauthorised PUBLISH message and not drop the network connection, provided that it doesn’t forward the unauthorised message on to any subscribers.
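As a rough illustration of the first point, here is a minimal Python sketch of how a client might read the return codes in a SUBACK packet. It assumes the MQTT 3.1.1 packet layout (packet type byte 0x90, a remaining-length field, a two-byte packet identifier, then one return code per requested topic filter, with 0x80 meaning failure). The function names are hypothetical, and the sketch only handles packets whose remaining length fits in a single byte.

```python
SUBACK_FAILURE = 0x80  # return code MQTT 3.1.1 uses for a refused subscription


def parse_suback(packet: bytes):
    """Return (packet_id, [return codes]) for a small SUBACK packet."""
    if len(packet) < 5 or packet[0] != 0x90:
        raise ValueError("not a SUBACK packet")
    remaining = packet[1]
    if remaining & 0x80:
        # Simplification for this sketch: no multi-byte remaining length.
        raise ValueError("multi-byte remaining length not handled here")
    body = packet[2:2 + remaining]
    if remaining < 3 or len(body) != remaining:
        raise ValueError("truncated SUBACK")
    packet_id = int.from_bytes(body[:2], "big")
    return packet_id, list(body[2:])


def describe(code: int) -> str:
    """Turn a single SUBACK return code into a readable outcome."""
    if code == SUBACK_FAILURE:
        return "refused"
    if code in (0, 1, 2):
        return f"granted, maximum QoS {code}"
    return "reserved"


# Two topic filters were requested: one granted at QoS 1, one refused.
packet_id, codes = parse_suback(bytes([0x90, 0x04, 0x00, 0x0A, 0x01, 0x80]))
outcomes = [describe(c) for c in codes]
```

A client that sees "refused" for a topic filter knows that the subscription failed without the connection having been dropped.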
It is true that the specification requires the connection be dropped if the server detects a protocol violation or if a transient error occurs, although in the transient error case it might be acceptable for a server to wait a bit before determining that the error has occurred.
The article makes much of the fact that the protocol doesn’t return error codes to the client, except in the case of CONNACK. Additional error reporting is being considered by the MQTT community for future versions of MQTT, but it is something that today’s MQTT applications are able to cope without. Why is that? We need to remember that MQTT is designed to handle large numbers of lightweight (often unattended) clients. If the protocol were to send detailed error information to the clients, it’s unlikely that they would be able to do much with it other than send it back to a central Problem Determination system. It’s more practical for error logging and diagnosis to be done centrally by the server, so in many cases there’s no real need to pass the information to the clients.
It also states that a mobile client cannot distinguish between a transient error on the server and a network dropout. A client that is interested in the difference might be able to get this information from the mobile device, but in any case the distinction between weak network, socket collapse or server hiccups is usually irrelevant in practice - you either have a connection or you don’t. Clients are not required to ‘aggressively reconnect’ - they would use some sensible exponential back-off approach.
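To make the back-off point concrete, here is a minimal Python sketch of one common approach, capped exponential back-off with random jitter. The MQTT specification does not mandate any particular reconnect policy, so the function name, the default values and the choice of "full jitter" are all assumptions made for illustration.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=random.random):
    """Yield reconnect delays in seconds: capped exponential growth with jitter.

    The ceiling doubles on every attempt until it reaches `cap`; each delay is
    drawn uniformly from [0, ceiling) so that many clients that lost their
    connection at the same moment do not all reconnect at once.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


# With the jitter pinned to its maximum, the ceilings themselves are visible.
ceilings = list(backoff_delays(rng=lambda: 1.0))
```

A client would sleep for each delay in turn before its next connection attempt, and start again from the first delay once a connection succeeds.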
iii) Durability of Session state data
Clemens mentions ambiguity in the specification where it says “some Sessions last only as long as the Network Connection, others can span multiple consecutive Network Connections.”
The situation is as follows: there are two sorts of Session, which I will call Durable Sessions and Non-Durable Sessions (admittedly it might be clearer if the specification were to use these terms explicitly). A client requests a Non-Durable Session by connecting with CleanSession = 1. If it does this, it accepts that the Session is transient and could terminate at any time. This means that a server is indeed entitled to kill a Non-Durable Session if it wishes, though presumably customers would think twice about using a server that does this a lot. If the client connects with CleanSession = 0 it is requesting or resuming a Durable Session, in which case the server is in general required to preserve the state across multiple connections, and run "offline-subscriptions".
Having said that, there are some words in Section 4.1 which do allow the server to kill a Durable Session and delete the state data that relates to that Session. The comments in that section indicate that this is something that should happen only in exceptional circumstances, or as a result of (manual or automated) administrative action. This is intended for use in cases where the client has been inactive for a long time, or where it is thought to have been compromised.
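For readers who want to see where this choice is made on the wire, here is a minimal Python sketch that builds an MQTT 3.1.1 CONNECT packet and sets the CleanSession bit. It assumes the 3.1.1 layout (protocol name "MQTT", protocol level 4, CleanSession in bit 1 of the connect flags), leaves out will, username and password fields, and only handles packets small enough for a one-byte remaining length. The function name and the client identifier are hypothetical.

```python
def build_connect(client_id: str, clean_session: bool, keep_alive: int = 60) -> bytes:
    """Build a minimal MQTT 3.1.1 CONNECT packet (no will, username or password)."""
    cid = client_id.encode("utf-8")
    if not cid and not clean_session:
        # 3.1.1 requires CleanSession = 1 when the client identifier is empty.
        raise ValueError("a Durable Session needs a non-empty client identifier")
    flags = 0x02 if clean_session else 0x00  # bit 1 of the connect flags
    variable_header = (
        b"\x00\x04MQTT"                  # length-prefixed protocol name
        + bytes([0x04, flags])           # protocol level 4 (3.1.1), connect flags
        + keep_alive.to_bytes(2, "big")  # keep-alive interval in seconds
    )
    payload = len(cid).to_bytes(2, "big") + cid
    remaining = len(variable_header) + len(payload)
    if remaining > 127:
        # Simplification for this sketch: no multi-byte remaining length.
        raise ValueError("packet too large for this sketch")
    return bytes([0x10, remaining]) + variable_header + payload


durable = build_connect("sensor-17", clean_session=False)     # requests or resumes a Durable Session
non_durable = build_connect("sensor-17", clean_session=True)  # requests a Non-Durable Session
```

The two packets differ in a single bit of the connect flags byte, which is all that distinguishes the two kinds of Session at connect time.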
It is important to note that MQTT is aimed at a wide variety of different implementation platforms, ranging from very lightweight clients and servers, which might not have any non-volatile storage, through to highly-reliable fault-tolerant servers. As a protocol spec, it’s not MQTT’s role to dictate what degree of reliability an implementation should provide.
It is not alone in this regard - for example the WS-ReliableMessaging specification (part-authored by Microsoft) says: “Persistence considerations related to an endpoint’s ability to satisfy the delivery assurances defined below are the responsibility of the implementation and do not affect the wire protocol. As such, they are out of scope of this specification.”
So with those pieces addressed by way of background, let’s take a look at Clemens’ non-stylistic points.
1) Suitability of MQTT for IoT use cases
The article claims that MQTT is not a good protocol for connecting IoT devices across long-haul connection links, so it’s important to examine the material concerns that it raises:
“Lack of reporting of errors to the client”
My views on this are set out in ii) above.
“Transient errors force a disconnect, which results in a renegotiation of a TLS connection and this is costly”
I would hope that, although they will occur, transient errors are likely to be rare - probably rarer than network failures. In any case TLS has a special path that allows a Session to be resumed quickly without going through the full TLS handshake.
“The 64k in-flight message limitation causes problems on high-bandwidth, high-latency connections”
The article gives an example of this, but it involves constant streaming of high volumes of data to a single client, with that data broken up into a large number of small MQTT messages. This isn't a normal use case for MQTT - but if you want to do this kind of thing, I see no reason why you couldn't break the data into a smaller number of larger messages, which would avoid the issue.
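A back-of-the-envelope calculation shows why message size matters here. The Python sketch below assumes that the in-flight limit comes from the 16-bit, non-zero packet identifier (at most 65,535 unacknowledged messages per direction) and that each message is acknowledged after one network round trip, as with QoS 1. The function name and the example figures are hypothetical, and the result is only an upper bound that ignores every other bottleneck.

```python
MAX_IN_FLIGHT = 2 ** 16 - 1  # 16-bit packet identifier, zero excluded


def max_throughput_bytes_per_s(message_bytes, rtt_seconds, in_flight=MAX_IN_FLIGHT):
    """Upper bound on payload throughput when every identifier is in use.

    At most `in_flight` messages can be outstanding, and under the stated
    assumption each one frees its identifier after one round trip.
    """
    return in_flight * message_bytes / rtt_seconds


# Illustrative figures only: a 0.5 second round trip.
small = max_throughput_bytes_per_s(message_bytes=100, rtt_seconds=0.5)     # about 13 MB/s
large = max_throughput_bytes_per_s(message_bytes=10_000, rtt_seconds=0.5)  # about 1.3 GB/s
```

Under these assumptions the ceiling scales linearly with message size, so packing the same data into fewer, larger messages raises it well beyond what most long-haul links can carry.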
“The ‘solicit push’ model of publish/subscribe is unsuitable - a pull model is better”
The push vs pull debate is somewhat a matter of taste, and the article does concede that some people find the absence of polling attractive. Solicit-push is not unique to MQTT; it appears in other specs, for example JMS or WS-Eventing. If MQTT had flow control then maybe some of this concern would go away.
“The specification is too liberal with regards to reliability requirements”
Sections ii) and iii) above discuss this point.
None of the above points should be a show-stopper for implementers of either MQTT servers or client solutions, but in any case I imagine that the OASIS TC would be happy to consider them for future releases. In addition, the use case quoted is relatively niche among IoT scenarios.
2) Difficulty of implementing an Internet Scale MQTT Server
The article says that the ‘solicit push’ model makes it hard, if not impossible, to implement QoS 1 or QoS 2 reliability on top of an existing multi-protocol broker such as that provided by the Azure Service Bus. I'm not sure whether it is ‘solicit push’ per se, or the ‘roll-forward’ semantics of QoS 1 and QoS 2 that’s the real issue here.
It’s relatively easy to do a single-server MQTT implementation - we have numerous examples of people who have done this successfully and it is possible to do a single server implementation that scales to over a million connected clients.
It is admittedly harder to do a distributed multi-server implementation of MQTT, but there are implementations available. If there are things that can be done to the protocol to make it easier to implement really high-scale distributed servers, without over-complicating the clients, then the OASIS TC should certainly look at them. However, in the meantime there are implementations out there that customers can use effectively in solutions today.
3) Missing Features you would expect in a Messaging Protocol
The article mentions a number of features that are not in today’s MQTT specification. A number of these are already under discussion in the MQTT community. They weren't included in the 3.1.1 specification as they represent reasonably significant extensions to 3.1. I can’t speak for the TC but they look like reasonable candidates for a future version of MQTT:
• Standardised way to add application-specific headers to a PUBLISH packet.
• Standardised way to communicate the payload type or encoding.
• Message time-to-live mechanism.
• Negative Acknowledgment for PUBLISH - this could include a ‘please try later’ response that is sent by the server when it identifies a transient error.
• Session timeout.
• Standardised way to chunk (segment) an application message across multiple PUBLISH packets.
• Explicit Flow control, in addition to that provided by TCP/IP itself.
In addition the article states that the specification should warn implementations that they have to authorise use of Client Identifiers, as otherwise it is possible for a malicious application to hijack someone else’s id and get access to their session. This is something that implementers are generally aware of, but it does no harm to draw attention to it.
The article mentions a number of other features, listed below. I expect that the TC would be happy to consider them, but as the article itself observes, there is a danger that if it were to add too many additional features MQTT would end up encroaching on AMQP and lose the conciseness and simplicity that it currently possesses. The features in question are:
• Application specific extensions other than in the PUBLISH packet
• Vendor-specific extensions to the protocol
• Standardised payload encoding schemes
• Support for multiplexing within a single connection
So to close, although Clemens points out a number of possible enhancements, many of which are already under consideration, I see nothing in his blog article to substantiate his claim that MQTT is unsuitable for Internet of Things use cases, or anything that seriously inhibits implementation of MQTT servers at internet scale. In fact MQTT’s popularity continues to increase due to many of the factors he calls out as inhibitors. It is a simple but elegant protocol, deliberately so, for ease of client implementation, low footprint and limited-bandwidth use cases. As such I am confident MQTT is well positioned for Internet of Things use cases both today and as the protocol evolves in the future. That said, I’m sure the OASIS TC would welcome Microsoft on board to help shape the future of this incredibly valuable standard.
Peter Niblett, STSM - IBM Connectivity, Mobile and Smarter Planet