As I described in the standardization post, the WebRTC model for real-time, browser-based applications does not envision that the browser will contain everything needed to act as a telephone or video conferencing unit. Instead, it specifies that the browser will contain the functions needed to run a Web application that works in conjunction with back-end servers to implement telephony functions as required. In this model, WebRTC implements the media plane but leaves the signalling plane up to the application. Different applications may prefer different protocols, such as SIP or something custom to the particular application. The key information that needs to be exchanged is the multimedia session description, which specifies the configuration necessary to establish the media plane. In other words, WebRTC does not specify a particular signalling model beyond the generic need to exchange SDP media descriptions in offer/answer fashion. The browser is therefore totally decoupled from the actual mechanism by which these offers and answers are communicated to the remote side.
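To make the decoupling concrete, here is a minimal sketch of an offer/answer exchange over a pluggable channel. The `SignallingChannel` interface and `createLoopbackPair` helper are hypothetical names invented for illustration, the in-memory pair stands in for whatever real transport an application chooses, and the SDP strings are placeholders rather than real session descriptions.

```typescript
// A signalling message carries either an SDP offer or an SDP answer.
type SdpMessage = { type: "offer" | "answer"; sdp: string };

// Hypothetical abstraction: anything able to deliver a message to the peer.
interface SignallingChannel {
  send(msg: SdpMessage): void;
  onMessage: (msg: SdpMessage) => void;
}

// In-memory stand-in for a real transport (WebSocket, long polling, ...).
// Whatever one side sends is delivered synchronously to the other side.
function createLoopbackPair(): [SignallingChannel, SignallingChannel] {
  const a: SignallingChannel = {
    send: (m) => b.onMessage(m),
    onMessage: () => {},
  };
  const b: SignallingChannel = {
    send: (m) => a.onMessage(m),
    onMessage: () => {},
  };
  return [a, b];
}

const [callerChannel, calleeChannel] = createLoopbackPair();
const exchangeLog: string[] = [];

// Callee: reply to every offer with an answer.
calleeChannel.onMessage = (msg) => {
  if (msg.type === "offer") {
    exchangeLog.push("callee got offer");
    calleeChannel.send({ type: "answer", sdp: "v=0 (placeholder answer)" });
  }
};

// Caller: record the answer when it arrives.
callerChannel.onMessage = (msg) => {
  if (msg.type === "answer") {
    exchangeLog.push("caller got answer");
  }
};

// Start the exchange with a placeholder offer.
callerChannel.send({ type: "offer", sdp: "v=0 (placeholder offer)" });
```

In a browser, the offer and answer strings would come from the peer connection object rather than from literals, but the channel that carries them could be swapped without touching that code.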
Signalling alternatives for WebRTC are a frequent subject of discussion. After some email exchanges with Emil Ivov (@emilivov) from Jitsi and Enrico Marocco (@emarok) from Telecom Italia, both active contributors to the IETF, Enrico put together the following thoughts on Signalling options for WebRTC Applications. Enrico also recently published this presentation (embedded below) providing a very good architectural introduction to WebRTC.
Signalling Options for WebRTC Applications
Enrico Marocco – Telecom Italia
Signalling is an essential part of any WebRTC application. In fact, it is an essential part of any interactive application that needs a continuous exchange of events with some remote entity: chat, gaming, and real-time collaboration, for example, but also seemingly basic features such as dynamic user-interface updates and form auto-completion.
Unlike common Internet applications, the web environment has two peculiarities that affect signalling in a rather significant way.
The first challenge is that HTTP, the one and only protocol web clients speak, is inherently unidirectional. Hacks that allow bi-directional communication over HTTP have existed for many years (and gained substantial popularity with the spread of AJAX technologies), but it is only very recently that the industry has converged on the WebSocket extension, which has become an official Internet standard. Even if WebSocket will most likely become the default transport channel for web signalling in the long run, today the mechanism does not offer the reliability robust applications require. Despite being available in some of the most popular web browsers, it still suffers from a general lack of support in very common HTTP intermediaries such as corporate proxies and transparent caching/optimisation servers.
WebSocket over TLS on port 443, indistinguishable from regular HTTPS, solves the problem only in part. In addition to the operational effort required for acquiring and managing TLS certificates (negligible in many cases), both clients and servers still need to deal with frequent and unpredictable timeout-induced disconnections. To get a rough idea of how easily a proxy can be misconfigured, one could count how many parameters influence timeouts in the popular Squid proxy server.
For these reasons, at this point in time a combination of standard WebSocket and well-established COMET-like hacks seems like the most pragmatic approach. Libraries such as socket.io, which selectively pick the best transport and mask complex fall-back and reconnection logic, as well as the Google App Engine Channel API or the Amazon Simple Notification Service, are not going to lose their appeal to WebRTC application developers any time soon.
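The kind of logic such libraries hide can be sketched in a few lines. The following is an illustrative simplification, not the behaviour of socket.io or any other library: the transport names, the preference order, and the back-off parameters are all assumptions chosen for the example.

```typescript
// Hypothetical transport names, ordered from most to least preferred.
type Transport = "websocket" | "xhr-streaming" | "long-polling";

const TRANSPORT_PREFERENCE: Transport[] = [
  "websocket",
  "xhr-streaming",
  "long-polling",
];

// Return the first transport in preference order that the caller-supplied
// probe reports as working, or undefined when none of them works.
function pickTransport(works: (t: Transport) => boolean): Transport | undefined {
  for (const candidate of TRANSPORT_PREFERENCE) {
    if (works(candidate)) {
      return candidate;
    }
  }
  return undefined;
}

// Exponential back-off for reconnection attempts: the delay doubles with
// each failed attempt (attempt 0 is the first retry) and is capped at maxMs.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 30000
): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

For example, behind a proxy that blocks WebSocket, `pickTransport((t) => t !== "websocket")` falls back to `"xhr-streaming"`, and after a timeout-induced disconnection the client would wait `reconnectDelayMs(0)`, then `reconnectDelayMs(1)`, and so on before retrying.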
Such a paradigm change, at first disturbing for some of us, the “old-dogs” of the communication industry, indeed translates into quite useful flexibility. In particular, fewer syntax and semantics constraints enable a more advanced usage of the signalling channel that is tailored to fit the specific requirements of the application. For instance, it enables straightforward signalling of application-specific events (e.g. for triggering user-interface updates) along with the SDP session descriptions required for establishing media connections.
At the risk of comparing apples and oranges, the specification of a simple SIP extension for carrying the equivalent of the User-to-user Information ISDN field (mostly used in call center systems for displaying custom information on agents’ screens) has taken years of standardization work and is still nowhere near complete.
JSON over Multiple Transports
The most intuitive signalling means for WebRTC applications is the transmission of JSON objects over the best available bi-directional transport — WebSocket, or, alternatively, some combination of COMET-like mechanisms. This, for one, is the approach adopted by Google in the early and currently most popular applications.
Another advantage of using JSON is that it does not impose any semantics on applications, thus allowing the communicating endpoints to exchange any kind of information: SDP blobs for establishing media connections, as well as custom events specific to the application logic. Additionally, by not forcing any particular identity scheme, it allows all kinds of user identification mechanisms. For example, identity can be based on simple usernames, email addresses, or, of course, on existing communication service identities such as phone numbers, Skype names, and SIP/XMPP URIs.
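As a rough illustration, a single JSON envelope can carry both SDP blobs and application events, with identities left as free-form strings. The field names (`kind`, `from`, `to`, `sdpType`, `name`, `data`) are assumptions made up for this sketch, not part of any standard.

```typescript
// Two message shapes share one envelope: SDP for media setup, and
// arbitrary application events. Identities are opaque strings, so a
// username, an email address, or a SIP/XMPP URI all work.
type SignalMessage =
  | { kind: "sdp"; from: string; to: string; sdpType: "offer" | "answer"; sdp: string }
  | { kind: "event"; from: string; to: string; name: string; data: unknown };

function encodeSignal(msg: SignalMessage): string {
  return JSON.stringify(msg);
}

// Parse and validate an incoming message. Returns null for anything that
// is not well-formed JSON or does not match one of the two shapes.
function decodeSignal(raw: string): SignalMessage | null {
  let v: any;
  try {
    v = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  if (typeof v !== "object" || v === null) return null;
  if (typeof v.from !== "string" || typeof v.to !== "string") return null;

  if (
    v.kind === "sdp" &&
    (v.sdpType === "offer" || v.sdpType === "answer") &&
    typeof v.sdp === "string"
  ) {
    return { kind: "sdp", from: v.from, to: v.to, sdpType: v.sdpType, sdp: v.sdp };
  }
  if (v.kind === "event" && typeof v.name === "string") {
    return { kind: "event", from: v.from, to: v.to, name: v.name, data: v.data };
  }
  return null;
}
```

An application could then send `encodeSignal({ kind: "event", from: "alice", to: "bob@example.com", name: "ui-update", data: { panel: "chat" } })` on the same channel it uses for offers and answers.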
Coupling JSON with a library that takes care of establishing and maintaining a reliable bi-directional channel with the signalling server is thus a simple, and at the same time effective, way to implement signalling in WebRTC applications. However, such simplicity comes with the associated cost of a custom gateway whenever there is a need to interconnect the web application with an external communication service.
SIP over WebSocket
In order to overcome the need for application-specific custom gateways, part of the telecommunications industry is looking with favor upon a fully-standardized approach based on the tunnelling of SIP — the signalling protocol of IP telephony networks — over WebSocket. SIP transports already exist for UDP, TCP, TLS and SCTP. If not exactly easy, the task of adding WebSocket support to an existing implementation is at least a well-known domain. By making existing SIP infrastructure accessible over WebSocket, service providers would be able to open their network to the web universe.
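For a feel of what travels inside a WebSocket frame in this approach, the sketch below assembles a minimal SIP request as text. It assumes the conventions of RFC 7118 (the SIP-over-WebSocket specification), in which the Via transport is `WS` or `WSS`; the `buildSipRequest` helper, its parameter names, and all example values are hypothetical, and a real user agent would add further headers such as Contact.

```typescript
// Hypothetical parameter bundle for the sketch; not a library API.
interface SipRequestParams {
  method: string;      // e.g. "OPTIONS" or "REGISTER"
  requestUri: string;  // e.g. "sip:example.com"
  from: string;        // caller's SIP URI
  to: string;          // callee's SIP URI
  fromTag: string;     // random tag identifying the caller's dialog side
  callId: string;      // unique call identifier
  cseq: number;        // command sequence number
  branch: string;      // Via branch; must begin with "z9hG4bK" (RFC 3261)
  viaHost: string;     // placeholder host, since browsers lack a listening address
  secure: boolean;     // true when tunnelled over wss://
}

// Serialize the request line and mandatory headers with CRLF line endings,
// followed by the empty line that terminates the header section.
function buildSipRequest(p: SipRequestParams): string {
  const transport = p.secure ? "WSS" : "WS";
  const lines = [
    `${p.method} ${p.requestUri} SIP/2.0`,
    `Via: SIP/2.0/${transport} ${p.viaHost};branch=${p.branch}`,
    "Max-Forwards: 70",
    `From: <${p.from}>;tag=${p.fromTag}`,
    `To: <${p.to}>`,
    `Call-ID: ${p.callId}`,
    `CSeq: ${p.cseq} ${p.method}`,
    "Content-Length: 0",
  ];
  return lines.join("\r\n") + "\r\n\r\n";
}

// Illustrative values only.
const sampleOptions = buildSipRequest({
  method: "OPTIONS",
  requestUri: "sip:example.com",
  from: "sip:alice@example.com",
  to: "sip:example.com",
  fromTag: "a1b2c3",
  callId: "call-0001",
  cseq: 1,
  branch: "z9hG4bK776asdhds",
  viaHost: "df7jal23ls0d.invalid",
  secure: true,
});
```

The resulting string would be sent as a single WebSocket message on a connection opened with the `sip` subprotocol, which is what lets an existing SIP proxy treat the browser like any other user agent.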
In particular, the delays with non-trickle ICE connectivity establishment happen when the user endpoint is configured with one or more network interfaces that cannot reach the STUN and TURN servers. This is a common situation with multi-homed devices such as smartphones that simultaneously connect to 3G/4G and WiFi networks, but also with laptops running VPNs, virtual machines, or simply configured with an unreachable IPv6 address. As a reference point, although with absolutely no scientific relevance, the sipML5 live demo running on a box with an active OpenVPN instance (at the very same time this article is being written) takes more than ten seconds to send the initial INVITE. Disconnecting the VPN takes the delay down to less than one second.
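The effect can be captured with a toy model. The sketch below assumes that a non-trickle endpoint sends its offer only once candidate gathering has finished on every interface, and that gathering on an interface that cannot reach the servers ends only when a timeout expires. The type and function names, the timeout value, and the round-trip times are all illustrative assumptions, not measurements.

```typescript
// One local network interface as seen by the ICE agent (hypothetical model).
interface InterfaceProbe {
  name: string;
  reachable: boolean; // can this interface reach the STUN/TURN servers?
  rttMs: number;      // round-trip time to the servers when reachable
}

// Gathering on one interface finishes after a round trip when the servers
// are reachable, and only after the full timeout when they are not.
function gatherTimeMs(probe: InterfaceProbe, timeoutMs: number): number {
  return probe.reachable ? probe.rttMs : timeoutMs;
}

// Delay before the offer can leave the endpoint. Without trickle, the
// slowest interface dictates the wait; with trickle, the offer goes out
// immediately and candidates follow later (modelled here as zero delay).
function offerDelayMs(
  probes: InterfaceProbe[],
  timeoutMs: number,
  trickle: boolean
): number {
  if (trickle) {
    return 0;
  }
  let worst = 0;
  for (const probe of probes) {
    worst = Math.max(worst, gatherTimeMs(probe, timeoutMs));
  }
  return worst;
}

const wifiOnly: InterfaceProbe[] = [{ name: "wifi", reachable: true, rttMs: 40 }];
const wifiPlusVpn: InterfaceProbe[] = [
  { name: "wifi", reachable: true, rttMs: 40 },
  { name: "vpn", reachable: false, rttMs: 0 },
];
```

With an assumed 10-second timeout, `offerDelayMs(wifiOnly, 10000, false)` is 40 ms while `offerDelayMs(wifiPlusVpn, 10000, false)` is the full 10000 ms: a single unreachable interface is enough to stall the offer.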
Signalling technologies will evolve. It is unlikely that there will be a clear winner. However, there will probably be losers.