As I described in the standardization post, the model used in WebRTC for real-time, browser-based applications does not envision that the browser will contain all the functions needed to operate as a telephone or video conferencing unit. Instead, it specifies that the browser will contain the functions needed to run a Web application that works in conjunction with back-end servers to implement telephony functions as required. Accordingly, WebRTC is meant to implement the media plane but to leave the signalling plane up to the application. Different applications may prefer to use different protocols, such as SIP or something custom to the particular application. In this approach, the key information that needs to be exchanged is the multimedia session description, which specifies the configuration necessary to establish the media plane. In other words, WebRTC does not specify a particular signalling model other than the generic need to exchange SDP media descriptions in the offer/answer fashion. The browser itself is totally decoupled from the actual mechanism by which these offers and answers are communicated to the remote side.
Signalling alternatives for WebRTC are a frequent subject of discussion. After some email exchanges with Emil Ivov (@emilivov) from Jitsi and Enrico Marocco (@emarok) from Telecom Italia, both active contributors to the IETF, Enrico put together the following thoughts on Signalling options for WebRTC Applications. Enrico also recently published this presentation (embedded below) providing a very good architectural introduction to WebRTC.
{“intro-by”, “victor”}
Signalling Options for WebRTC Applications
Enrico Marocco – Telecom Italia
Signalling is an essential part of any WebRTC application. In fact, it is an essential part of any interactive application that needs a continuous exchange of events with some remote entity — for example for chat, gaming, real-time collaboration, but also for seemingly basic features such as user-interface dynamic updates and form auto-completion.
Contrary to common Internet applications, the web environment has two peculiarities that affect signalling in a rather significant way.
The first challenge consists in the fact that HTTP — the one and only protocol web clients speak — is inherently mono-directional. Even if hacks that allow bi-directional communication over HTTP have existed for many years (and have gained substantial popularity with the spread of AJAX technologies), it is only very recently that the industry has found a point of convergence in the WebSocket extension which has become an official Internet standard. However, even if in the long run WebSocket will most likely become the default transport channel for web signalling, today the mechanism does not offer the reliability robust applications require. In fact, despite being available in some of the most popular web browsers, it still suffers from a general lack of support in very common HTTP intermediaries such as corporate proxies and transparent caching/optimisation servers.
WebSocket over TLS on port 443, indistinguishable from regular HTTPS, solves the problem only in part. In fact, in addition to the operational effort required for acquiring and managing TLS certificates — negligible in many cases — both clients and servers still need to deal with frequent and unpredictable timeout-induced disconnections (to get a rough idea of how easily a proxy can be misconfigured, one could go through the exercise of counting how many parameters influence timeouts in the popular Squid proxy server).
For these reasons, at this point in time a combination of standard WebSocket and well-established COMET-like hacks seems like the most pragmatic approach. Libraries such as socket.io, which selectively pick the best transport and mask complex fall-back and reconnection logic, as well as the Google App Engine Channel API or the Amazon Simple Notification Service, are not going to lose their appeal to WebRTC application developers any time soon.
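The fall-back and reconnection logic that such libraries hide can be sketched roughly as follows. This is only an illustration under stated assumptions, not how socket.io or any other library actually works: the transport names, the preference order, the backoff parameters and the failure threshold are all hypothetical.

```typescript
// Hypothetical transport names, ordered from most to least preferred.
type Transport = "websocket" | "long-polling" | "jsonp-polling";

const PREFERENCE: Transport[] = ["websocket", "long-polling", "jsonp-polling"];

// Pick the most preferred transport among those the environment reports as usable.
function pickTransport(usable: Transport[]): Transport | undefined {
  for (const candidate of PREFERENCE) {
    if (usable.indexOf(candidate) !== -1) return candidate;
  }
  return undefined;
}

// Exponential backoff with a cap: 1 s, 2 s, 4 s, ... never more than maxMs.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// After too many consecutive failures, move to the next transport in the list.
// The last transport in the list is kept, since there is nothing left to fall back to.
function nextTransport(current: Transport, failures: number, maxFailures = 3): Transport {
  if (failures < maxFailures) return current;
  const idx = PREFERENCE.indexOf(current);
  return PREFERENCE[Math.min(idx + 1, PREFERENCE.length - 1)];
}
```

A client behind a proxy that breaks WebSocket would, in this sketch, retry a few times with growing delays and then settle on long polling, without the application code noticing.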
The second peculiarity that differentiates WebRTC applications from other communication services is the substantial lack of a need for a standard protocol for client-server signalling. Networks based on SIP or XMPP technologies are supposed to have clients and servers from different manufacturers interoperate. Web clients, on the other hand, run client logic distributed (in form of JavaScript code) by the very same domain they are connected to. When a new version of the application is deployed, a simple page refresh, possibly triggered by the server itself, is all that is required to update the client as well.
Such a paradigm change — at first disturbing for some of us, the “old-dogs” of the communication industry — translates indeed into quite useful flexibility. In particular, fewer syntax and semantics constraints enable a more advanced usage of the signalling channel that is tailored to fit the specific requirements of the application. For instance, it enables straightforward signalling of application specific events (e.g. for triggering user-interface updates) along with the SDP session descriptions required for establishing media connections.
At the risk of comparing apples and oranges, the specification of a simple SIP extension for carrying the equivalent of the User-to-user Information ISDN field (mostly used in call center systems for displaying custom information on agents’ screens) has taken years of standardization work and is still nowhere near complete.
JSON over Multiple Transports
The most intuitive signalling means for WebRTC applications is the transmission of JSON objects over the best available bi-directional transport — WebSocket, or, alternatively, some combination of COMET-like mechanisms. This, for one, is the approach adopted by Google in the early and currently most popular applications.
JSON is a syntactic subset of JavaScript and has the great advantage of being natively interpreted by web browsers. Since all data structures and state information in web applications are stored in objects that have a direct unambiguous mapping to JSON, it is undoubtedly the on-the-wire format requiring the least effort for encoding, parsing and processing.
The other advantage of using JSON is that it does not impose any semantics on applications, thus allowing the communicating endpoints to exchange any kind of information: SDP blobs for establishing media connections, as well as custom events specific to the application logic. Additionally, by not forcing any particular identity scheme, it allows all kinds of user identification mechanisms. For example, identity can be based on simple usernames, email addresses, or, of course, on existing communication service identities such as phone numbers, Skype names, and SIP/XMPP URIs.
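As a rough illustration of SDP blobs and application events sharing one JSON channel, the sketch below defines a small message envelope with an encoder and a validating decoder. The message kinds and field names (kind, from, to, sdp, name, data) are hypothetical and not taken from any standard or existing application; the identities are treated as opaque strings.

```typescript
// Hypothetical message shapes; field names are illustrative only.
type SignalMessage =
  | { kind: "offer"; from: string; to: string; sdp: string }
  | { kind: "answer"; from: string; to: string; sdp: string }
  | { kind: "app-event"; from: string; to: string; name: string; data: unknown };

// Objects map directly to JSON, so encoding is a single call.
function encodeMessage(msg: SignalMessage): string {
  return JSON.stringify(msg);
}

// Decoding validates the few fields this sketch relies on and rejects the rest.
function decodeMessage(raw: string): SignalMessage {
  const obj = JSON.parse(raw);
  if (typeof obj !== "object" || obj === null) throw new Error("not an object");
  if (typeof obj.from !== "string" || typeof obj.to !== "string") {
    throw new Error("missing addressing");
  }
  switch (obj.kind) {
    case "offer":
    case "answer":
      if (typeof obj.sdp !== "string") throw new Error("missing sdp");
      return { kind: obj.kind, from: obj.from, to: obj.to, sdp: obj.sdp };
    case "app-event":
      if (typeof obj.name !== "string") throw new Error("missing event name");
      return { kind: "app-event", from: obj.from, to: obj.to, name: obj.name, data: obj.data };
    default:
      throw new Error("unknown message kind");
  }
}
```

Because from and to are plain strings, the same envelope works whether users are identified by a username, an email address or a SIP URI.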
Coupling JSON with a library that takes care of establishing and maintaining a reliable bi-directional channel with the signalling handling server is thus a simple, and at the same time effective, way for implementing signalling in WebRTC applications. However, such simplicity comes with the associated cost of a custom gateway whenever there will be a need to interconnect the web application with an external communication service.
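On the server side, the signalling handler for such an application can be very small. The following is a minimal in-memory sketch, assuming each connected client is represented by a delivery callback; the Relay class and its method names are hypothetical, and a real deployment would sit behind whatever bi-directional transport library is in use.

```typescript
// Callback a connected client registers to receive messages (hypothetical shape).
type Deliver = (from: string, payload: string) => void;

class Relay {
  private clients = new Map<string, Deliver>();

  // The identity is an opaque string: a username, an email address, a SIP URI, etc.
  join(identity: string, deliver: Deliver): void {
    this.clients.set(identity, deliver);
  }

  leave(identity: string): void {
    this.clients.delete(identity);
  }

  // Forwards the payload untouched; returns false when the recipient is not connected.
  send(from: string, to: string, payload: string): boolean {
    const deliver = this.clients.get(to);
    if (!deliver) return false;
    deliver(from, payload);
    return true;
  }
}
```

The relay never inspects the payload, which is what keeps it simple. It is also why a gateway is needed as soon as the other side speaks SIP or XMPP rather than the application's own JSON.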
SIP over WebSocket
In order to overcome the need for application-specific custom gateways, part of the telecommunications industry is looking with favor upon a fully-standardized approach based on the tunnelling of SIP — the signalling protocol of IP telephony networks — over WebSocket. SIP transports already exist for UDP, TCP, TLS and SCTP. If not exactly easy, the task of adding WebSocket support to an existing implementation is at least a well-known domain. By making existing SIP infrastructure accessible over WebSocket, service providers would be able to open their network to the web universe.
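To give an idea of what SIP looks like when carried over WebSocket, the sketch below builds a bare REGISTER request as a string. It follows the conventions of the SIP over WebSocket specification (RFC 7118) as far as I am aware of them: a WS or WSS transport token in the Via header, a transport=ws parameter in the Contact URI, and a placeholder host name under .invalid, since a browser does not know its own address. The parameter names and the example values are hypothetical, and a real stack would also handle authentication, retransmission rules and many more headers.

```typescript
interface RegisterParams {
  domain: string;   // SIP domain, e.g. "example.com"
  user: string;     // user part of the address of record
  viaHost: string;  // placeholder host, e.g. "df7jal23ls0d.invalid"
  branch: string;   // random part of the Via branch
  tag: string;      // From tag
  callId: string;
  cseq: number;
  secure: boolean;  // true when the WebSocket runs over TLS
}

// Builds a minimal REGISTER; headers are separated by CRLF and the
// message ends with an empty line because there is no body.
function buildRegister(p: RegisterParams): string {
  const viaTransport = p.secure ? "WSS" : "WS";
  const lines = [
    `REGISTER sip:${p.domain} SIP/2.0`,
    `Via: SIP/2.0/${viaTransport} ${p.viaHost};branch=z9hG4bK${p.branch}`,
    `Max-Forwards: 70`,
    `From: <sip:${p.user}@${p.domain}>;tag=${p.tag}`,
    `To: <sip:${p.user}@${p.domain}>`,
    `Call-ID: ${p.callId}`,
    `CSeq: ${p.cseq} REGISTER`,
    `Contact: <sip:${p.user}@${p.viaHost};transport=ws>`,
    `Content-Length: 0`,
  ];
  return lines.join("\r\n") + "\r\n\r\n";
}

// In a browser, the result would be sent on a WebSocket opened with the
// "sip" subprotocol, e.g. new WebSocket(url, "sip"). This is not done here
// so that the sketch stays runnable outside a browser.
```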
Despite the fact that parsing and encoding of SIP messages in JavaScript is suboptimal from a performance standpoint, opensource frameworks such as QoffeeSIP, sipML5 and jsSIP have shown that it is certainly feasible with an imperceptible impact on the user experience. The issues with such an approach are of a different nature. On the one hand, relying solely on WebSocket as a transport protocol (Hobson’s choice for a solution that needs to be “standard”) is going to be a big obstacle to deployments in those environments where HTTP middleboxes (e.g. corporate proxies or transparent content optimization systems) do not support it. On the other hand, the SIP protocol is not designed, and not easily adaptable, to make use of the Trickle ICE optimization essential for minimizing connectivity establishment time. In quite common situations, this can lead to delays intolerable for the end user.
In particular, the delays with non-trickle ICE connectivity establishment happen when the user endpoint is configured with one or more network interfaces that cannot reach the STUN and TURN servers. This is a common situation with multi-homed devices such as smartphones that simultaneously connect to 3G/4G and WiFi networks, but also with laptops running VPNs, virtual machines, or simply configured with a non-reachable IPv6 address. As a reference point, although with absolutely no scientific relevance, the sipML5 live demo running on a box with an active OpenVPN instance (at the very same time this article is being written) takes more than ten seconds to fire the initial INVITE out. Disconnecting the VPN takes the delay down to less than one second.
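The difference can be made concrete with a toy timing model. It is a deliberate simplification under stated assumptions: each interface is given a single gathering time, an unreachable interface only finishes when its attempt times out, and the figures used in the usage note (200 ms for WiFi, 10 000 ms for a VPN interface) are hypothetical values chosen to echo the anecdote above, not measurements.

```typescript
// One network interface and the time its candidate gathering takes to finish.
// For an interface that cannot reach the STUN/TURN servers, this is the timeout.
interface Iface {
  name: string;
  gatherMs: number;
}

// Without trickle, the offer is only sent once every interface has finished,
// so the slowest interface dictates the delay.
function offerDelayWithoutTrickle(ifaces: Iface[]): number {
  return ifaces.reduce((worst, i) => Math.max(worst, i.gatherMs), 0);
}

// With trickle, the offer leaves immediately and candidates follow as they
// appear, so the first candidate reaches the peer after the fastest interface.
function firstCandidateDelayWithTrickle(ifaces: Iface[]): number {
  if (ifaces.length === 0) return 0;
  return ifaces.reduce((best, i) => Math.min(best, i.gatherMs), Infinity);
}
```

With a WiFi interface at 200 ms and a VPN interface timing out at 10 000 ms, the first function returns 10 000 and the second 200: the unreachable interface no longer holds up the whole exchange.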
A Winner?
Of course other signalling options are possible — XMPP-based signalling is frequently discussed in WebRTC-related forums — and will certainly emerge. At the end of the day, most of the signalling and gateway servers will come with client-side toolkits that will mask the underlying protocols and that, especially in the case of JavaScript libraries, will enable smooth transition from an underperforming release to the next and more efficient one.
Signalling technologies will evolve. It is unlikely that there will be a clear winner. However there will probably be losers.
{“author”,”Enrico Marocco”}
Gustavo says
Nice article guys. It brought to mind this sentence from Rosenberg: “the need for having inter-provider standards is gone”
In my opinion the main decision to make is whether you should use a standard protocol (SIP or XMPP) or create your own ad-hoc protocol. If you have a running infrastructure based on SIP or XMPP, probably the best choice is to continue using them from browsers as well, to reduce the complexity and maintenance cost of protocol translators. If you are creating a new service/infrastructure, it could be a good choice to create your own simple ad-hoc JSON-based protocol for your specific use case.
Iñaki Baz Castillo says
Congratulations for the article.
About Trickle ICE for SIP, there are at least two efforts in the IETF:
– http://tools.ietf.org/html/draft-ivov-dispatch-sdpfrag-01
– http://tools.ietf.org/html/draft-ivov-mmusic-trickle-ice-sip-00
Once this subject is solved, SIP will be more suitable for WebRTC but still not the best possible protocol at all (nor is it the most optimal or reliable RTC protocol over UDP or TCP). It is 100% feasible for a developer to design a custom and minimal signaling protocol, perfectly optimized for a specific website and service. That has never been the goal of SIP.
Enrico Marocco says
Thing is that adding Trickle ICE to SIP is quite tricky. One reason is that, unless you can make the (very strong) assumption that the other endpoint supports it, you can do trickle only on the receiving side. Another reason is that it basically mandates either PRACK, or some retransmission-based hack. Not going to happen in the next couple of years.
Victor Pascual says
Most of the scenarios where I’m seeing SIP over Websockets being used are basically WebRTC-to-SIP ones. In that case, the interworking function is usually anchoring media (note standard ICE, DTLS-SRTP, etc. are not supported by most of the SIP equipment deployed out there) and in the same way would be terminating ICE Trickle. I don’t see this as a blocking issue.
Anton Roman says
Nice discussion Enrico, signalling is usually an underestimated portion of WebRTC. After having tested Quobis’ WebRTC Client, both in a pure web fashion but also with several gateway vendors interfacing towards “legacy” networks, under different network scenarios and conditions, I’d like to share the following considerations — focusing mainly on the WebRTC to SIP interworking scenario:
– Using one protocol or another really depends on the scenario you are dealing with. One needs to consider whether interworking towards existing networks/domains is required, which protocols are being used there, etc. In some integrations we even had to use different protocols for different services, namely one signalling protocol for audio/video sessions, a second for IM/Presence and a third for some private service. Because of all this, we decided to implement an abstraction layer in our client architecture so we could support multiple signalling protocols (popular choices seem to be SIPoWS, different flavors of JSONoWS and REST APIs) and add new ones without requiring an application re-design. This way we decouple the client and WebRTC core from the signalling libraries and hence reduce the cost of integrations in different networks and with different gateway vendors.
– Using SIPoWS when one wants to interconnect towards “legacy” SIP networks can make things easier. In this case, following a standards-based approach, the signalling gateway simply needs to perform transport layer interworking. Beyond that, at the application layer, we’ve seen customers willing to include specific SIP headers in the client itself and have them traverse the gateway function transparently. This is really easy using SIPoWS. We’ve been involved in a couple of cases where JSON was used and, for each new tiny header/parameter/value we wanted to include, the gateway vendor had to provide a new workspace image, hence impacting the trial progress (they mentioned that in the future this could be doable via config though).
– Some operators see “non-standard signalling” as a potential lock-in with their gateway vendor. Yes, there are standard APIs but those are rarely implemented. If in the future an operator decides to switch vendors or simply add a new one (e.g. for a new service), using standard signalling really makes things easier
– Yes, Websocket is a “new” technology but it’s rapidly gaining popularity and implementation support. From what we have experienced in the field, and having tested in both enterprise and carrier networks, there are usually no issues traversing proxies or other HTTP entities using WSS. In any case, customers do prefer to use encrypted signalling in most of these scenarios. When it comes to timeouts/disconnections, yes, we experienced some of these in the past but they were easily fixed via configuration. In any case, we expect websocket-related issues to disappear as implementations mature. Note that websocket isn’t used only for webrtc but also for many other HTTP services, and in fact will play an important role in HTTP-based communications
– When it comes to Trickle ICE and SIP, yes, it’s far from trivial but in a controlled scenario (like the interworking one) it is perfectly doable.
Any feedback is welcome 🙂
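The abstraction layer described in this comment might look, in very reduced form, like the sketch below. It is only a guess at the general shape, not Quobis’ actual design: the SignallingAdapter interface, the two adapters and the wire formats they produce are all hypothetical.

```typescript
// What the client core expects from any signalling protocol (hypothetical interface).
interface SignallingAdapter {
  readonly name: string;
  // Turns a protocol-neutral offer into whatever is sent on the wire.
  encodeOffer(to: string, sdp: string): string;
}

// A JSON-over-WebSocket flavour: the offer is a small JSON object.
const jsonAdapter: SignallingAdapter = {
  name: "json-over-ws",
  encodeOffer: (to, sdp) => JSON.stringify({ type: "offer", to, sdp }),
};

// A REST flavour: the offer becomes a request line followed by the SDP body.
const restAdapter: SignallingAdapter = {
  name: "rest",
  encodeOffer: (to, sdp) => `POST /sessions/${encodeURIComponent(to)}\n\n${sdp}`,
};

// The client core only knows the interface; adapters can be added at run time.
class SignallingClient {
  private adapters = new Map<string, SignallingAdapter>();

  register(adapter: SignallingAdapter): void {
    this.adapters.set(adapter.name, adapter);
  }

  offer(protocol: string, to: string, sdp: string): string {
    const adapter = this.adapters.get(protocol);
    if (!adapter) throw new Error(`no adapter registered for ${protocol}`);
    return adapter.encodeOffer(to, sdp);
  }
}
```

Adding a further protocol then means writing one more adapter and registering it, with no change to the code that creates the offer.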
Enrico Marocco says
Sure, when one can assume that the endpoint is always talking to a media anchoring interworking function, one can also make quite a lot of assumptions on the protocol and protocol extensions supported. And I agree that in that case there is actually a need for a standardised interface, to indeed avoid vendor lock-in. Whether such an interface is better achieved at the protocol level, or at the JavaScript level (e.g. AT&T’s orca.js), that’s going to be a matter of discussion for a while I believe.
Dean Elwood says
>>Because of all this, we decided to implement an abstraction layer in our client architecture so we could support multiple signalling protocols (popular choices seem to be SIPoWS, different flavors of JSONoWS and REST APIs) and add new ones without requiring and application re-design<<
My team has done exactly the same and abstracted signalling with a modular "drop in signalling stack of your choice" approach.
I think this will become a common model for a while.
Dean Elwood says
Sorry, forgot to add – fabulous article. Thank you.
Jeroen van Bemmel says
The overhead of mapping Trickle ICE to SIP is quite large, as each request requires a response (as opposed to some WebSocket-based proprietary option which could send unidirectional messages). Microsoft tried to add ‘BENOTIFY’ in the past, but that never got very far as far as I know.
This brings interesting questions on the scalability of WebRTC applications. It can be made to work, but does it work for millions of clients connected to the cloud? For example, Chrome by default enables TCP keep-alive on the WebSocket connection, sending a 60-byte packet every 45 seconds. For a handful of clients that is not an issue, but in a scenario with 1 BHCA this could represent 50% of the total signalling traffic
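The arithmetic behind that estimate can be sketched as follows. The 60-byte packet and 45-second interval come from the comment; the per-call signalling volume is not given there, so the 4800 bytes used in the usage note is an assumption chosen only to show what a 50% share would imply. Acknowledgements and lower-layer overhead are ignored.

```typescript
// Bytes of keep-alive traffic one client generates in an hour.
function keepAliveBytesPerHour(packetBytes: number, intervalSeconds: number): number {
  return (3600 / intervalSeconds) * packetBytes;
}

// Share of keep-alive traffic in the total signalling traffic of one client,
// given its call attempts per hour and an assumed signalling volume per call.
function keepAliveShare(
  packetBytes: number,
  intervalSeconds: number,
  callsPerHour: number,
  bytesPerCall: number
): number {
  const keepAlive = keepAliveBytesPerHour(packetBytes, intervalSeconds);
  return keepAlive / (keepAlive + callsPerHour * bytesPerCall);
}
```

With the figures from the comment, keepAliveBytesPerHour(60, 45) gives 80 packets of 60 bytes, that is 4800 bytes per hour. At one call attempt in the busy hour (1 BHCA) and an assumed 4800 bytes of signalling per call, keepAliveShare(60, 45, 1, 4800) is 0.5, which matches the 50% mentioned.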
James Rafferty says
Nice article, Enrico. It does a good job of summarizing why signaling in the WebRTC realm is still heavy lifting and is likely to evolve. I think your concluding statement said it well: “… At the end of the day, most of the signalling and gateway servers will come with client-side toolkits that will mask the underlying protocols…”. The attached presentation provided helpful background. As for UUI in SIP, we will hopefully be done soon. 🙂
Enrico Marocco says
Eh, James, that’s a plague the two of us are both cause and victim of 🙂
Philipp Hancke says
Enrico, XMPP-based signalling for webrtc has been around for quite a while at https://github.com/ESTOS/strophe.jingle.
Signalling is compatible with Jitsi, with partial media interoperability 😉
Lawrence Byrd says
Excellent and detailed article – this “WebRTC signaling” question is clearly a hot topic! My view is that the essential value of the “no signalling defined” position is that it opens up all kinds of innovation for alternative connectivity approaches that may or may not bear much resemblance to heavier classical or SIP-style signaling. While developers can choose to use SIP or XMPP, they can also choose not to. I was struck by PubNub’s demo at WebRTC Expo Atlanta, which showed how a global real-time publish/subscribe network, which may be in use for all sorts of other application purposes, could easily absorb the needs of WebRTC signaling. You might see PubNub as “reliable global WebSockets on steroids”. PubNub also reports on their VoIP customer RebTel (with 13 million users) essentially replacing their internal use of SIP with PubNub publish/subscribe. So alternative approaches are not necessarily “heavy lifting” (James), plus, btw, things like UUI information transfer are trivial in many of these alternatives :).
I am struck by the fact that Twitter, Facebook, Instagram, LinkedIn, Pinterest, Chatter, Jive, Box and many others are all doing global “signaling” of some sort at some level of “real time” and yet their internal global connectivity architectures look nothing like SIP. This is partly because they also have all sorts of other streaming, searching, liking, following and other semantics that traditional signaling has never dealt with, and run at a scale that would completely boggle an average enterprise SIP framework (and I am sure no Web developer wants to start casually adding IMS :).
For reference, other recent WebRTC signaling discussions include Tsahi Levent-Levi’s blog and my article at WebRTC World.
Stephane Tuffin says
Thanks for such a good article and discussion.
Using SIPoWS to access a Telco Gateway can raise some security concerns, as such Gateways can be rather transparent (even a good B2BUA can be quite transparent). The Telco network behind the Gateway is often a very sensitive asset, and the JavaScript running in the Browser offers fewer guarantees about its innocuity than previous SIP endpoints.
For Web apps that do not need to access legacy networks, PubNub and the likes do indeed seem to offer far more agility than plain SIP. By the way, a benchmark of such “Full Web” frameworks would be a great topic for a future article!
Another powerful aspect of WebRTC signalling that strikes my mind is the possibility to split Identity Provider stuff from signalling plane stuff. One could wonder how much of the WebRTC Identity Provider is going to be implemented by Browser makers and Players with relevant Identities.
Victor Pascual says
Update: Our ‘SIP over Websockets’ IETF draft has just been approved for publication as an RFC http://tools.ietf.org/html/draft-ietf-sipcore-sip-websocket-10
John Riordan says
SIP.js is another opensource SIP over WebSocket framework http://sipjs.com
Victor Pascual says
Thanks for sharing John — It’s great to see not only proprietary implementations for RFC7118 but also opensource code including sipml5, QoffeeSIP, sip-js, JsSIP and yours. In fact, I believe your implementation is a fork from JsSIP, right? Thanks again!