One of our first posts was a Wireshark analysis of Amazon’s Mayday service to see if it was actually using WebRTC. In the very early days of WebRTC, verifying a major deployment like this was an important milestone for the WebRTC community. More recently, Philipp Hancke – aka Fippo – did several great posts analyzing Google Hangouts and Mozilla’s Hello service in Firefox. These analyses validate that WebRTC can be successfully deployed by major companies at scale. They also provide valuable insight for developers and architects on how to build a WebRTC service.
These posts are awesome and of course we want more.
I am happy to say many more are coming. In an effort to disseminate factual information about WebRTC, Google’s WebRTC team has asked &yet – Fippo’s employer – to write a series of publicly available, in-depth, reverse engineering and trace analysis reports. Philipp has agreed to write summary posts outlining the findings and implications for the WebRTC community here at webrtcHacks. This analysis is very time consuming. Making it consumable for a broad audience is even more intensive, so webrtcHacks is happy to help with this effort in our usual impartial, non-commercial fashion.
Please see below for Fippo’s deconstruction of WhatsApp voice calling.
{“editor”: “chad hart“}
After some rumors (e.g. on TechCrunch), WhatsApp recently launched voice calls for Android. This spurred some interest in the WebRTC world with the usual suspects like Tsahi Levent-Levi chiming in and starting a heated debate. Unfortunately, the comment box on Tsahi’s BlogGeek.Me blog was too narrow for my comments so I came back here to webrtcHacks.
At that point, I had considered doing an analysis of some mobile services already and, thanks to support from the Google WebRTC team, I was able to spend a number of days looking at Wireshark traces from WhatsApp in a variety of scenarios.
Initially, I was merely trying to validate the capture setup (to be explained in a future blog post) but it turned out that there is quite a lot of interesting information here and even some lessons for WebRTC. So I ended up writing a full fifteen-page report which you can get here. It is a long story of packets (available for download here) which will be very boring if you are not an engineer, so let me try to summarize the key points here.
Summary
WhatsApp is using the PJSIP library to implement Voice over IP (VoIP) functionality. The captures show no signs of DTLS, which suggests the use of SDES encryption (see here for Victor’s past post on this). Even though STUN is used, the binding requests do not contain ICE-specific attributes. RTP and RTCP are multiplexed on the same port.
The audio codec cannot be fully determined from the captures. The sampling rate is 16 kHz, the codec bitrate is about 20 kbit/s, and the bandwidth stayed the same when muted. Update: after a MITM attack on the signaling channel (see the comments below) this is now known to be Opus/SILK.
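To make the "no ICE-specific attributes" observation concrete, here is a minimal sketch of how one could scan a STUN message for the attributes that RFC 5245 adds to binding requests. It assumes the packet follows the RFC 5389 framing (20-byte header with the magic cookie); the function name `findIceAttributes` is made up for this example and this is not the tooling used for the report.

```javascript
// STUN attribute types that RFC 5245 (ICE) adds to binding requests.
const ICE_ATTRIBUTES = new Map([
  [0x0024, 'PRIORITY'],
  [0x0025, 'USE-CANDIDATE'],
  [0x8029, 'ICE-CONTROLLED'],
  [0x802a, 'ICE-CONTROLLING'],
]);

const STUN_MAGIC_COOKIE = 0x2112a442;

// Returns the names of ICE-specific attributes found in a STUN message,
// or null if the bytes do not look like an RFC 5389 STUN message.
function findIceAttributes(bytes) {
  if (bytes.length < 20) return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // The two most significant bits of a STUN message are always zero.
  if ((view.getUint8(0) & 0xc0) !== 0) return null;
  if (view.getUint32(4) !== STUN_MAGIC_COOKIE) return null;

  // The length field counts the attribute bytes after the 20-byte header.
  const end = Math.min(20 + view.getUint16(2), bytes.length);
  const found = [];
  let offset = 20;
  while (offset + 4 <= end) {
    const type = view.getUint16(offset);
    const length = view.getUint16(offset + 2);
    if (ICE_ATTRIBUTES.has(type)) found.push(ICE_ATTRIBUTES.get(type));
    // Attribute values are padded to a multiple of four bytes.
    offset += 4 + Math.ceil(length / 4) * 4;
  }
  return found;
}

// A bare binding request: type 0x0001, length 0, magic cookie, zeroed transaction id.
const bareRequest = new Uint8Array(20);
bareRequest.set([0x00, 0x01, 0x00, 0x00, 0x21, 0x12, 0xa4, 0x42]);
console.log(findIceAttributes(bareRequest)); // []
```

An empty result for every binding request in a capture is what "STUN without ICE" looks like at the byte level; an ICE agent would include at least PRIORITY plus one of the two role attributes in its checks.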
An inspection of the binary using the strings tool shows both PJSIP and several strings hinting at the use of elements from the webrtc.org voice engine such as the acoustic echo cancellation (AEC), AECM, gain control (AGC), noise suppression and the high-pass filter.
Comparison with WebRTC
Feature | WebRTC/RTCWeb Specifications | WhatsApp |
SDES | MUST NOT offer SDES | probably uses SDES |
ICE | RFC 5245 | no ICE, STUN connectivity checks |
TURN usage | used as last resort | uses a similar mechanism first |
Audio codec | Opus or G.711 | Opus/SILK, 16 kHz at about 20 kbit/s |
Switching from a relayed session to a p2p session
The most impressive thing I found is the optimization for a fast call setup by using a relay initially and then switching to a peer-to-peer session. This also opens up the possibility for a future multi-party VoIP call which would certainly be supported by this architecture. The relay server is called “conf bridge” in the binary.
Let’s look at the first session to illustrate this (see the PDF for the full, lengthy description):
- The session kicks off (in packet #70) by sending TURN ALLOCATE requests to eight different servers. This request doesn’t use any standard STUN attributes which is easy to miss.
- After getting a response, the client exchanges some signaling traffic with the signaling server, so this is basically gathering a relayed candidate and sending an offer to the peer.
- Packet #132 shows the client sending something to one of those TURN servers. This turns out to be an RTCP packet, followed by some RTP packets, which can be seen by using Wireshark’s “decode as” functionality. This is somewhat unusual and misleading, as it is not using standard TURN functionality like send or data indications. Instead, it just sends raw RTP to that server.
- Packet #146 shows the first RTP packet from the peer. For about three seconds, the RTP traffic is relayed over this server.
- In the meantime, packet #294 shows the client sending a STUN binding request to the peer’s public IP address. Using a filter (ip.addr eq 172.16.42.124 and ip.addr eq 83.209.197.82) and (udp.port eq 45395 and udp.port eq 35574) clearly shows this traffic.
- The first response is received in packet #300.
- Now something really interesting happens. The client switches the destination of the RTP stream between packets #298 and #305. By decoding those as RTP we can see that the RTP sequence number increases just by one. See this screenshot:
Now, if we have decoded everything as RTP (which is something Wireshark doesn’t get right by default so it needs a little help), we can change the filter to rtp.ssrc == 0x0088a82d and see this clearly. The intent here is to first try a connection that is almost guaranteed to work (I recently used a similar rationale in the minimal viable SDP post) and then switch to a peer-to-peer connection in order to minimize the load on the TURN servers.
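What "the sequence number increases just by one" means at the byte level can be sketched by parsing the fixed RTP header from RFC 3550 and comparing two packets. This is only an illustration: `makeRtpPacket` is a hypothetical helper, and the sequence numbers below are made up rather than taken from the trace; only the SSRC is the one from the filter above.

```javascript
// Parses the 12-byte fixed RTP header defined in RFC 3550.
function parseRtpHeader(bytes) {
  if (bytes.length < 12) return null;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if ((view.getUint8(0) >> 6) !== 2) return null; // RTP version must be 2
  const second = view.getUint8(1);
  return {
    marker: (second & 0x80) !== 0,
    payloadType: second & 0x7f,
    sequenceNumber: view.getUint16(2),
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
  };
}

// True if `next` directly follows `prev` in the same stream, allowing for
// the 16-bit sequence number wrapping from 65535 back to 0.
function isContinuation(prev, next) {
  return prev.ssrc === next.ssrc &&
    ((prev.sequenceNumber + 1) & 0xffff) === next.sequenceNumber;
}

// Hypothetical helper that builds a minimal RTP packet for the example.
function makeRtpPacket(sequenceNumber, ssrc) {
  const bytes = new Uint8Array(12);
  const view = new DataView(bytes.buffer);
  view.setUint8(0, 0x80); // version 2, no padding, no extension, no CSRCs
  view.setUint8(1, 120);  // arbitrary dynamic payload type
  view.setUint16(2, sequenceNumber);
  view.setUint32(8, ssrc);
  return bytes;
}

// Made-up sequence numbers: last packet via the relay, first packet sent directly.
const lastViaRelay = parseRtpHeader(makeRtpPacket(4711, 0x0088a82d));
const firstDirect = parseRtpHeader(makeRtpPacket(4712, 0x0088a82d));
console.log(isContinuation(lastViaRelay, firstDirect)); // true
```

Same SSRC and a consecutive sequence number across two different destination addresses is the signature of a sender that moved an existing stream instead of starting a new one.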
Wow, that is pretty slick. It likely reduces the call setup time the user perceives. Let me repeat that: this is a hack which makes the user experience better!
By how much is hard to quantify. Only a large-scale measurement of both this approach and the standard approach can answer that.
Lessons for WebRTC
In WebRTC, we can do something similar, but it is a little more effort right now. We can set up the call with iceTransports: ‘relay’, which will skip host and server-reflexive candidates. Also, using a relay helps to guarantee the connection will work (in conditions where WebRTC will work at all).
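As a rough sketch, a relay-only configuration looks like the following. Note that the W3C specification spells the option `iceTransportPolicy`; `iceTransports` was the name Chrome used when this was written. The TURN URL and credentials are placeholders, and `relayOnlyConfig` is just a name chosen for this example.

```javascript
// Builds an RTCPeerConnection configuration restricted to relay candidates.
function relayOnlyConfig(turnServer) {
  return {
    iceServers: [turnServer],
    iceTransportPolicy: 'relay', // gather TURN candidates only
  };
}

// Placeholder server and credentials, not a real deployment.
const relayConfig = relayOnlyConfig({
  urls: 'turn:turn.example.org:3478',
  username: 'user',
  credential: 'secret',
});

// In a browser this would be passed to the constructor:
//   const pc = new RTCPeerConnection(relayConfig);
console.log(relayConfig.iceTransportPolicy); // relay
```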
There are some drawbacks to this approach in terms of round-trip-times due to TURN’s permission mechanism. Basically when creating a TURN-relayed candidate the following happens (in Chrome; Firefox’s behavior differs slightly):
- Chrome tries to create an allocation without authentication
- the TURN server asks for authentication
- Chrome retries to create an allocation with authentication
- the TURN server tells Chrome the address and port of the candidate.
- Chrome signals the candidate to the JavaScript layer via the onicecandidate callback. That is two full round-trip times.
- after adding a remote candidate, Chrome will create a TURN permission on the server before the server will relay traffic from the peer. This is a security mechanism described here.
- now STUN binding requests can happen over the relayed address. This uses TURN send and data indications. These add the peer’s address and port to each packet received.
- when agreeing on a candidate, Chrome creates a TURN channel for the peer’s address which is more efficient in terms of overhead.
Compared to this, the proprietary mechanism used by WhatsApp saves a number of round trips.
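A back-of-the-envelope tally shows the difference. The sketch below counts only client-to-TURN-server exchanges from the Chrome steps listed above and ignores signaling and connectivity checks. The 50 ms round-trip time is a made-up number, and treating the proprietary mechanism as a single allocate exchange is an assumption based on the trace described earlier, not something confirmed.

```javascript
// Round trips between client and TURN server before relayed media can flow.
// Each request/response pair counts as one round trip.
const standardTurnSteps = [
  { name: 'Allocate (unauthenticated, rejected)', roundTrips: 1 },
  { name: 'Allocate (authenticated)', roundTrips: 1 },
  { name: 'CreatePermission', roundTrips: 1 },
];

// Assumption: the proprietary mechanism needs only the initial allocate exchange.
const proprietarySteps = [
  { name: 'Allocate (proprietary)', roundTrips: 1 },
];

// Total waiting time for a list of steps at a given round-trip time.
function setupDelayMs(steps, rttMs) {
  const total = steps.reduce((sum, step) => sum + step.roundTrips, 0);
  return total * rttMs;
}

const assumedRttMs = 50; // hypothetical client-to-server round-trip time
console.log(setupDelayMs(standardTurnSteps, assumedRttMs)); // 150
console.log(setupDelayMs(proprietarySteps, assumedRttMs));  // 50
```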
If we started with just relay candidates, then, since this hides the IP addresses of the parties involved from each other, we might even establish the relayed connection and do the DTLS handshake before the callee accepts the call. This is known as transport warmup, and it reduces the perceived time until media starts flowing.
Once the relayed connection is established, we can call setConfiguration (formerly known as updateIce; currently not implemented) to remove the restriction to relay candidates, then call createOffer again with the iceRestart flag set to true. This triggers an ICE restart, which might determine that a P2P connection can be established.
Despite updateIce not being implemented, we can still switch from a relay to peer-to-peer today. ICE restarts work in Chrome so the only bit we’re missing is iceTransports: ‘relay’, which generates only relay candidates. The same effect can be simulated in JavaScript by dropping any non-relay candidates during the first iteration. It was pretty easy to implement this behaviour in my favorite SDP munging sample. The switch from relayed to P2P just works. The code is committed here.
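In a browser that does implement setConfiguration, the switch could look roughly like the sketch below. It uses the standard `getConfiguration`, `setConfiguration`, `createOffer` and `setLocalDescription` calls; `sendOffer` is a hypothetical callback standing in for whatever signaling channel the application uses, and answer handling is left out.

```javascript
// Lifts the relay-only restriction and starts an ICE restart.
// `pc` is an RTCPeerConnection; `sendOffer` is a hypothetical callback that
// delivers the new offer to the peer over the application's signaling channel.
async function switchToPeerToPeer(pc, sendOffer) {
  const config = pc.getConfiguration();
  pc.setConfiguration({ ...config, iceTransportPolicy: 'all' });

  // iceRestart makes the offer carry fresh ICE credentials, so both sides
  // gather candidates and run connectivity checks again.
  const offer = await pc.createOffer({ iceRestart: true });
  await pc.setLocalDescription(offer);
  sendOffer(pc.localDescription);
}
```

The existing relayed path keeps carrying media while the new checks run, so a failed restart should leave the call on the relay rather than dropping it.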
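Dropping non-relay candidates can be sketched as a small filter in front of the signaling code. This is an illustration of the idea and not necessarily how the committed sample does it; `sendCandidate` and `relayOnly` are hypothetical callbacks supplied by the application.

```javascript
// True if an ICE candidate line describes a relayed (TURN) candidate.
// Candidate lines carry their type after the "typ" token (RFC 5245).
function isRelayCandidate(candidateLine) {
  const parts = candidateLine.trim().split(/\s+/);
  const typIndex = parts.indexOf('typ');
  return typIndex !== -1 && parts[typIndex + 1] === 'relay';
}

// Wraps a signaling callback so that only relay candidates are forwarded
// while `relayOnly()` returns true.
function makeCandidateHandler(sendCandidate, relayOnly) {
  return (event) => {
    if (!event.candidate) return; // null candidate marks the end of gathering
    if (relayOnly() && !isRelayCandidate(event.candidate.candidate)) return;
    sendCandidate(event.candidate);
  };
}

// In a browser: pc.onicecandidate = makeCandidateHandler(send, () => firstIteration);
const hostLine = 'candidate:1 1 udp 2122260223 192.0.2.1 40000 typ host';
const relayLine = 'candidate:3 1 udp 41885439 198.51.100.7 50000 typ relay raddr 192.0.2.1 rport 40000';
console.log(isRelayCandidate(hostLine));  // false
console.log(isRelayCandidate(relayLine)); // true
```

Once the relayed call is up, flipping `relayOnly` to return false and doing the ICE restart lets host and server-reflexive candidates through on the second iteration.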
While ICE restart is inefficient currently, the actual media switch (which is hard) happens very seamlessly.
In my humble opinion
WhatsApp’s usage of STUN and RTP seems a little out of date. Arguably, the way STUN is used is very straightforward and makes things like implementing the switch from relayed calls to P2P mode easier. But ICE provides methods to accomplish the same thing in a more robust way. Using a custom TURN-like functionality that delivers raw RTP from the conference bridge saves a few bytes of overhead compared to TURN channels, but that overhead is typically negligible.
Not using DTLS-SRTP with ciphers capable of perfect forward secrecy is a pretty big issue in terms of privacy. SDES is known to have drawbacks and can be decrypted retroactively if the key (which is transmitted via the signaling server) is known. Note that the signaling exchange might still be protected the same way it is done for text messages.
In terms of user experience, the mid-call blocking of P2P showed that this scenario had been considered which shows quite some thought. Echo cancellation is a serious problem though. The webrtc.org echo cancellation is capable of a much better job and seems to be included in the binary already. Maybe the team there would even offer their help in exchange for an acknowledgement… or awesome chocolate.
{“author”: “Philipp Hancke“}
va says
These articles are GREAT. Fantastic info and totally appreciate the thoughtfulness / thoroughness of your analysis.
Adam Roach says
I’m curious about the assertion that WhatsApp uses SDES, yet also interops with Firefox — Firefox never implemented SDES. The only way to exchange media at all with Firefox’s WebRTC implementation is using DTLS-SRTP.
Philipp Hancke says
hey Adam, where do you see that assertion? Interoperating with anything WebRTC will be quite hard without ICE to start with.
Vipul Rastogi says
Just love the way they handled the most visible problem of WebRTC. Also their approach to anchor media first and p2p later means they already have very big infra which they can leverage further for multi-party conf.
Very recently I worked for one Communication giant who is changing their legacy feature server to work in this way but in legacy protocols. They otherwise used to host media all the time. Can you believe it?
I certainly think Whatsapp has done great job in identifying right implementation and now they should API their offering for wide adoption/customization etc.
Ben says
The codec priority is Opus > AMR > PCMU, but I don’t know what could prevent Opus from being used.
How did you test silence? Maybe it was still sending noise, comfort or otherwise? Did it sound fully muted on the other side?
Tom van der Geer says
Hi, great article!! I did some tests myself with the WhatsApp call feature, but from a signalling perspective. Using the node-whatsapi project, which implements the WhatsApp protocol for NodeJS, I was able to make a trace of an incoming call from an Android WhatsApp client. See the trace here: https://gist.github.com/tvandergeer/ecc1380641801d4c7c0f (obfuscated some privacy details)
Some conclusions/assumptions from this:
* They use the OPUS codec
* Several relays are tested for their latency. And then it most likely picks the one with the least latency
* The srtp tag indicates that SRTP is being used using a pre-shared key (192 bits) => SDES?
* Based on the presence of the p2p tag I assume that it will also attempt to connect the media directly besides using a relay
Disclaimer: I’m a contributor to the node-whatsapi project
Philipp Hancke says
Tom,
just noticed this. I have been pondering whether to do a MITM attack on the signaling channel to find an answer but estimated it would have taken days.
Thank you! It’s good to know for sure.
hira says
that’s nice article with great elaboration. thanks for your sharing.
Jose Muanes Pinto says
Hi all
I’m a student and I do not have knowledge in this matter (WebRTC) but I would like to learn about it.
So sorry if my question is not proper or idiotic, but I would like to know: is WhatsApp using WebRTC or not? IMHO, reading this great article, I could see that WhatsApp uses some elements of WebRTC.
So someone here could help, and give me an answer?
Thank you very much for your attention and time.
Jose M Pinto
Philipp Hancke says
Hi Jose,
it would not have classified as WebRTC back in 2015, mostly because of the lack of DTLS for encryption and STUN. I’ve never had the time to look in detail at the changes they did for video calling though.
Chad Hart says
I was asked this same question recently and here is how I responded:
WhatsApp uses WebRTC’s getUserMedia on its web interface (https://web.whatsapp.com/). Their native OSX appears to be based on Electron and looks to use the same.
We have not analyzed their new video calling functionality. I just did some quick scans and there is some evidence that they might at least use pieces of WebRTC for that (they include the WebRTC open source software notice). More analysis would be required to confirm how it is used.
WhatsApp’s main voice calling functionality is not based on WebRTC, but Fippo’s analysis showed they did use some pieces of the WebRTC library.
s says
Hello
Do you have any new findings regarding the most recent changes of WhatsApp, and how it uses encryption, XMPP or TLS? Your 2015 posting was very interesting, however I am looking for something more recent. Thank you.
Philipp Hancke says
a future reader (or me) might appreciate or enjoy this beautiful analysis done in early 2020:
https://medium.com/@schirrmacher/analyzing-whatsapp-calls-176a9e776213