Many in the industry, including myself, reference Amazon’s Kindle Fire HDX Mayday button as using WebRTC or at least as something that is WebRTC-like. The Kindle Fire HDX is not available everywhere, so if you have not seen it, Android Authority has a good video of this feature here.
We started considering 2 main approaches:
- Root the Kindle and search for signs of the WebRTC code
- Use Wireshark and examine what the app is sending over the wire to see if it does WebRTC-like things
The first approach is tricky and may not yield any useful information depending on how the app was built. The second approach is relatively simple, but yields less information since WebRTC streams should be encrypted and therefore difficult to decode unless you are the NSA.
I convinced an old friend of mine who recently purchased an Amazon Kindle Fire HDX to help investigate the Mayday button. We looked into rooting it for approach #1, but he was not willing to take the risk of bricking his brand new device. Therefore we went with approach #2.
You can find the complete capture here. He also took a video to help synchronize what happens on the screen with the trace (we removed the audio to help keep my friend anonymous). If you don’t know Wireshark and want to follow along you should probably look at some tutorials first.
What are we looking for?
As mentioned above, WebRTC uses encryption so we shouldn’t be able to see the signaling contents or listen/see the RTP media. However, there are a bunch of things that happen over the wire in a typical WebRTC session – like a Chrome to Chrome browser session:
- Some kind of signaling to a signaling server – this could be Victor’s RFC7118 or any other web-based signaling
- ICE negotiations with STUN & TURN – other VoIP technologies are much more likely to use Hosted NAT traversal from an SBC
- SRTP – 4 SRTP streams communicating to a media peer or TURN server (assuming bi-directional audio & video) – one in each direction for voice and one in each direction for video; this should be the majority of traffic
- WebRTC mandates DTLS for SRTP, so we should see DTLS negotiations on the same ports as our media
- RTP Bundling – the WebRTC RTP usage draft requires this, stating “Implementations are REQUIRED to support transport of all RTP media streams, independent of media type, in a single RTP session according to [I-D.ietf-avtcore-multi-media-rtp-session]”, meaning each of the above RTP streams should be on the same IP address, port, and transport (UDP or TCP)
- RTCP multiplexing – RTP and RTCP traffic should share the same ports according to the latest WebRTC RTP specs: “support for multiplexing RTP data packets and RTCP control packets on a single port for each RTP session is REQUIRED, as specified in [RFC5761]”
If you put this all together, this should look something like:
Resolving Our Topology
I am not a Wireshark decoding expert (maybe I am now), but I noticed a few things on my initial scans:
- Lots of STUN and TURN traffic
- A couple Amazon servers involved in RTP traffic
- TLS negotiations to encrypt traffic
It took me several iterations to figure out the topology, some of which I will walk through below. Wireshark has various tools to help identify the various entities in the capture, including:
- Statistics->Endpoints
These tools show various stats to help figure out what is going on.
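For readers who prefer to see the idea in code, here is a minimal sketch of what an endpoint or conversation summary does: group packets by their pair of endpoints and total the traffic. The function name, the tuple layout, and the sample records are all assumptions made for illustration (the addresses are from documentation ranges, not from the trace); this is not how Wireshark itself is implemented.

```python
from collections import defaultdict


def summarize_conversations(packets):
    """Group packets into bidirectional conversations and total their sizes.

    Each packet is a (src_ip, src_port, dst_ip, dst_port, length) tuple.
    Returns a list of (endpoint_pair, stats) sorted by bytes, busiest first.
    """
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src_ip, src_port, dst_ip, dst_port, length in packets:
        # Sort the two endpoints so both directions share one key.
        key = tuple(sorted([(src_ip, src_port), (dst_ip, dst_port)]))
        totals[key]["packets"] += 1
        totals[key]["bytes"] += length
    return sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)


# Hypothetical sample records, not taken from the capture.
sample = [
    ("192.0.2.10", 59522, "198.51.100.5", 3478, 1200),
    ("198.51.100.5", 3478, "192.0.2.10", 59522, 1100),
    ("192.0.2.10", 40000, "203.0.113.7", 443, 300),
]
ranked = summarize_conversations(sample)
for endpoints, stats in ranked:
    print(endpoints, stats)
```

The busiest conversations float to the top, which is usually where the media is.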
Here is a quick summary of what I found by digging deeper into the conversations identified above:
|User|User Port|Network|Network Port|Direction|Contents|
|---|---|---|---|---|---|
|Kindle|59522|TURN Server|3478|<->|RTP Dynamic 100 & 116, RTCP|
Keep in mind we actually took a couple of separate traces – I am only referring to the final clean one we did in this post but we noticed the same patterns in both.
It is hard to hide any traffic that may be happening in the background unrelated to the Mayday button, but the trace reveals a regular flow of traffic from www.amazon.com and Kconnect-us.amazon.com. Kconnect-us.amazon.com has variable body sizes with a minimum size of 60 bytes. I am guessing this is some kind of Kindle control channel that is always running.
There is also a bi-directional TLS stream that is set up between the Kindle and the edge server at the beginning of the call.
Let’s start with RTP since that should be the vast majority of traffic. We know audio was sent and received from the agent and that the agent was sending a video stream, so I expected at least 3 RTP streams – probably 4 since we also know the agent can see the screen and a common approach is to encapsulate this with a video codec. There is also a lot of UDP traffic going through this TURN server that is RTP. I did a few tricks to get all the RTP streams to show correctly:
- Setting protocol preferences for RTP to “Try to decode RTP outside of conversations” and treating RTP version 0 packets as STUN packets
- Manually decoding all traffic on ports 29548 as RTP
- Turned off RTP-EVENT decoding – I am not sure why, but Wireshark was showing a lot of RFC2833 events coming from the TURN server; the full range of telephony tones were clearly not being sent to the Kindle so I turned this decoding off and this filled in the missing packets in the stream
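Why would an "RTP version 0" packet be STUN? When STUN, DTLS, and RTP share a port, receivers tell them apart by the first byte of the UDP payload. The sketch below uses the first-byte ranges as summarized in RFC 7983; that RFC and the function name are my additions for illustration and are not something taken from the trace. The RTP version is the top two bits of the first byte, so a version of 0 lands in the same low range that STUN occupies.

```python
def classify_first_byte(payload: bytes) -> str:
    """Guess the protocol of a UDP payload from its first byte (RFC 7983 ranges)."""
    if not payload:
        return "empty"
    first = payload[0]
    if first <= 3:
        return "STUN"
    if 16 <= first <= 19:
        return "ZRTP"
    if 20 <= first <= 63:
        return "DTLS"
    if 64 <= first <= 79:
        return "TURN channel data"
    if 128 <= first <= 191:
        return "RTP/RTCP"  # top two bits are 10, i.e. RTP version 2
    return "unknown"


def rtp_version(payload: bytes) -> int:
    """Return the value of the top two bits of the first byte."""
    return payload[0] >> 6


print(classify_first_byte(bytes([0x00, 0x01])))  # STUN-style first byte
print(classify_first_byte(bytes([0x80, 0x00])))  # RTP version 2 first byte
```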
After applying these settings, I could see 4 distinct RTP streams under Telephony->RTP->Show All Streams:
Clicking on the Analyze button reveals more useful information. There are 2 G.711 streams – one from our Kindle to an “edge” server labeled na4.edge-gacd.amazon.com and one in the reverse direction. These each last 188 seconds, which corresponds to our call times.
The other ones are labeled RTPType-100 and RTPType-116. Payload types 96 through 127 are dynamic and are typically used for video, so this should be no surprise. The Analyze button indicates that these streams last 183 seconds. If you watch the video you will notice there is a 4-5 second lag before the video starts, so this makes sense.
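To make the payload type labels less mysterious, here is a minimal sketch of reading the fixed 12-byte RTP header. The payload type is the low 7 bits of the second byte, so it can never exceed 127. The function names and the sample packet are hypothetical; the static assignments for G.711 (0 for PCMU, 8 for PCMA) and the 96 to 127 dynamic range come from the RTP audio/video profile.

```python
import struct

STATIC_TYPES = {0: "PCMU (G.711 mu-law)", 8: "PCMA (G.711 A-law)"}


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header."""
    if len(packet) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,            # top two bits
        "marker": bool(second & 0x80),    # top bit of the second byte
        "payload_type": second & 0x7F,    # low seven bits
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def describe_payload_type(payload_type: int) -> str:
    """Label a payload type as a known static type, dynamic, or other."""
    if payload_type in STATIC_TYPES:
        return STATIC_TYPES[payload_type]
    if 96 <= payload_type <= 127:
        return "dynamic"
    return "other"


# Hypothetical packet: version 2, payload type 100, plus 4 bytes of payload.
example = struct.pack("!BBHII", 0x80, 100, 1, 160, 0x1234) + b"\x00" * 4
header = parse_rtp_header(example)
print(header["payload_type"], describe_payload_type(header["payload_type"]))
```

A dynamic type says nothing about the codec by itself; that mapping is normally agreed in signaling, which we cannot see here.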
Mayday uses rfc5766-turn-server
As you remember from Reid’s posts on NAT Traversal and Emil’s Trickle-ICE review, WebRTC sessions use the ICE framework with STUN and TURN to traverse NATs. These packets show a lot of useful information.
Wireshark labels STUN and TURN packets the same, but if you search on “TURN” it will bring you to Frame # 1042. Examining this packet revealed it was the Citrix-188.8.131.52 ‘Harding Grim’ TURN server:
A quick search on this reveals a header file containing this as a #define value in the source code for the popular rfc5766-turn-server by Oleg Moskalenko (see here for our Q&A with Oleg). Oleg – rfc5766-turn-server is awesome; you deserve some credit for being part of Amazon’s Mayday service!
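A server name like this typically travels in the STUN SOFTWARE attribute (type 0x8022), which TURN messages share because TURN reuses the STUN message format. As a rough sketch of what the dissector is doing, the code below walks the attributes of a STUN message and returns the SOFTWARE string. The function names and the "example-turn 1.0" string are hypothetical; the sample message is built in code and is not a packet from the trace.

```python
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value in every RFC 5389 STUN header
ATTR_SOFTWARE = 0x8022


def stun_software(message: bytes):
    """Return the SOFTWARE attribute of a STUN message, or None if absent."""
    if len(message) < 20:
        return None
    _msg_type, msg_len, cookie = struct.unpack("!HHI", message[:8])
    if cookie != MAGIC_COOKIE:
        return None
    offset = 20  # attributes start after the 20-byte header
    end = min(20 + msg_len, len(message))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == ATTR_SOFTWARE:
            return value.decode("utf-8", errors="replace")
        # Attribute values are padded to a multiple of 4 bytes.
        offset += 4 + attr_len + (-attr_len % 4)
    return None


def build_example(software: str) -> bytes:
    """Build a hypothetical Binding success response carrying SOFTWARE."""
    value = software.encode("utf-8")
    padded = value + b"\x00" * (-len(value) % 4)
    attribute = struct.pack("!HH", ATTR_SOFTWARE, len(value)) + padded
    header = struct.pack("!HHI", 0x0101, len(attribute), MAGIC_COOKIE) + b"\x00" * 12
    return header + attribute


print(stun_software(build_example("example-turn 1.0")))
```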
As mentioned above, WebRTC multiplexes RTCP on the same port as RTP. Wireshark does not have great handling of this. Sorting our RTP stream list (you can’t see the many individual RTCP packets in the graphic above) reveals the 2 video streams use RTCP multiplexing – look for all the “Reserved for RTCP Conflict Avoidance” packets.
RTCP on the 2 G.711 streams needed a bit more tweaking to get them to show correctly. When it is not multiplexed, RTCP is usually sent on the port one higher than the RTP port, and it looks like that is the situation with ports 29458 and 29459. I did a “decode as” for RTCP on 29459 to show this.
It looks as if RTCP is multiplexed with RTP on the Kindle end, so there was not much I could do there to force Wireshark to resolve all these packets as one RTCP stream.
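The “Reserved for RTCP Conflict Avoidance” label has a simple explanation. An RTCP packet type of 200 through 204 sits in the same byte where RTP keeps its marker bit and payload type, so reading it as RTP gives a marker bit of 1 and a payload type of 72 through 76, which is the reserved range. Here is a minimal sketch of that logic, in the spirit of RFC 5761. The function names are hypothetical, and the 192 to 223 test on the second byte is a common demultiplexing heuristic that I am assuming, not something I can confirm Amazon or Wireshark uses.

```python
RTCP_TYPES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}


def classify_rtp_or_rtcp(packet: bytes) -> str:
    """Tell RTP from RTCP when both share one port."""
    if len(packet) < 2 or packet[0] >> 6 != 2:
        return "not RTP/RTCP"
    second = packet[1]
    if 192 <= second <= 223:
        return "RTCP " + RTCP_TYPES.get(second, "type %d" % second)
    return "RTP payload type %d" % (second & 0x7F)


def misread_as_rtp_payload_type(rtcp_type: int) -> int:
    """Payload type a dissector shows if it decodes an RTCP packet as RTP."""
    return rtcp_type & 0x7F


def conventional_rtcp_port(rtp_port: int) -> int:
    """Without multiplexing, RTCP conventionally uses the next port up."""
    return rtp_port + 1


print(classify_rtp_or_rtcp(bytes([0x80, 200])))
print(misread_as_rtp_payload_type(200), misread_as_rtp_payload_type(204))
print(conventional_rtcp_port(29458))
```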
The agent drew on the screen between 20:46:26 and 20:46:31. I could not find any obvious candidate flows for this – many possibilities exist. If you watch the video, you can see the drawing appears in sections – not as flowing brush strokes. This implies the image may be sent in several bundles as opposed to a continuous stream of individual pixels to draw.
What does it all mean?
Why are the video streams going through a TURN server but not the audio? Why aren’t all the streams going to the same location? Why isn’t RTCP multiplexing used on the edge server but it is on the Kindle? What is controlling the signaling?
This would all have me confused, except I have a customer with a very similar topology. This customer is using WebRTC to video enable an existing contact center. They already had an SBC to:
- control SIP signaling,
- manage the audio streams,
- help with NAT traversal, and
- connect to the existing contact center infrastructure.
WebRTC is just being used to add video between the end points and the agents there. I believe that is probably the case here too:
Amazon runs a massive call center based on normal audio telephony. The infrastructure for agent Work Force Optimization (WFO) is critical for skills based routing, performance monitoring, and logging. This infrastructure is generally tightly integrated with the existing audio systems. It is reasonable to think that Amazon would not want to rebuild all of this from scratch for WebRTC and would prefer to leverage as much existing infrastructure as possible. Connecting the audio to these existing systems and using WebRTC to add video would allow them to do that. This also explains the lag between the video and audio. WFO systems are usually triggered off the start of audio – it would make sense that Amazon would start that as soon as possible.
So is this WebRTC?
So let’s refer back to our original list to see what WebRTC-like traits are present:
- Web-based signaling – very plausible but very hard to confirm without decrypting
- ICE negotiations with STUN & TURN – yes, lots of that along with an rfc5766-turn-server implementation
- SRTP – check
- DTLS-SRTP – nope; it is possible Amazon is using SDES. This would not be surprising since, as Victor covered in this post, DTLS was only mandated last August
- RTP bundle – not applicable here; the audio and video are going to separate places so there is not really an opportunity to combine them in a single IP:Port
- RTCP multiplexing – yep, everywhere but the Edge server/SBC which is not surprising
I am not going to impose a definition of WebRTC on you here, but I think there is a very high probability that this is WebRTC for the video with a lower probability on the audio. I think the more interesting, and perhaps unexpected, result is the layering of what looks very much like a WebRTC video flow on top of a more traditional VoIP audio connection. I was also pleasantly surprised to see the use of Oleg’s rfc5766-turn-server. It will be interesting to follow how/if this service changes as WebRTC matures.
Got Wireshark skillz? Please take a look at the trace yourself and comment below.
I’ll be at Enterprise Connect in Orlando next week if you happen to be there.