Earlier this month Fippo published a post analyzing Slack’s new WebRTC implementation. He did not have direct access or a team account for a thorough deep dive, not to mention he is supposed to be taking some time off this month. That left many with open questions: Is there more to the TURN network? How does multi-party calling work? How exactly is Slack using the Janus gateway? Fortunately WebRTC has an awesomely active and capable community that quickly picked up the slack (pun intended).
Last week Yoshimasa Iwase published a great post giving Slack a deeper dive. Unfortunately Google Translate did not do the Japanese original justice, so I asked him if we could provide a translation with some added details here. Iwase-san has been deeply involved with WebRTC for several years at NTT Communications and as an editor at HTML5Experts.jp (which is another great resource you should check out). He has also helped organize WebRTC events in Japan, and if you’ve seen him speak you won’t question that he knows his stuff.
Check out his in-depth analysis below.
{“editor”: “chad hart“}
Introduction
Inspired by Philipp Hancke’s (Fippo) Slack article, I started to analyze more deeply how Slack uses WebRTC. I really wanted to understand Slack’s WebRTC architecture to see if it could help inform other WebRTC engineers on how to build their own service or system. Slack has over 2 million active users at the time of writing, so seeing how they built their service should provide useful insight into how to build your own scalable WebRTC service.
Analysis method
As Philipp Hancke describes in the Blackbox Exploration series of analyses, there are some useful tools for analyzing a WebRTC service. I chose the same tools:
- Chrome webrtc-internals to see information like SDP and candidates and dump logs – Slack only supports Chrome so I couldn’t check about:webrtc in Firefox
- JavaScript – their files are minified but we can check some functions like “RTCPeerConnection” anyway
- Wireshark capture
Below is the result of my analysis.
Slack doesn’t use P2P
It’s common to use a P2P topology for 1-to-1 communication since this optimizes the user experience and minimizes the number of servers you need to maintain. Despite this, Slack forces all streams through a TURN server. In addition to TURN, they use Janus as the endpoint.
Here is a picture of the topology they use:
Even if you start a Slack WebRTC call with someone sitting next to you on the same LAN, the communication path will still follow the above topology. Their TURN servers are deployed on AWS. As far as I could test, there is no TURN server in the Tokyo region, so my WebRTC client (Chrome) connected to a TURN server in the Singapore region, which adds unnecessary latency. I’ll describe their TURN deployment in more detail next.
How do they force us to use TURN?
The answer is found in the SDP. Let’s check the SDP offer first and then the answer.
Offer from Janus
    type: offer, sdp: v=0
    o=- 643193054244 643193054243 IN IP4 127.0.0.1
    s=Room with no name..
    t=0 0
    a=group:BUNDLE audio
    a=msid-semantic: WMS janus ...*1
    m=audio 1 RTP/SAVPF 111 ...*2
    c=IN IP4 10.21.82.27
    a=mid:audio
    a=sendonly
    a=rtcp-mux
    a=ice-ufrag:X/x9
    a=ice-pwd:P8XwtXqt3z7yK0VDthjMmT
    a=ice-options:trickle
    a=fingerprint:sha-256 C5:5F:DA:7D:84:47:B1:BF:6B:55:16:62:48:31:3E:D3:F1:7B:25:89:92:4A:4B:4D:4D:D9:D5:AF:EA:D8:15:44
    a=setup:actpass
    a=connection:new
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtpmap:111 opus/48000/2
    a=fmtp:111 minptime=10; useinbandfec=1; usedtx=1
    a=ssrc:711812372 cname:janusaudio
    a=ssrc:711812372 msid:janus janusa0
    a=ssrc:711812372 mslabel:janus
    a=ssrc:711812372 label:janusa0
    a=candidate:4 1 udp 2013266431 10.21.82.27 12003 typ host ...*3
    a=candidate:8 1 udp 2013266431 172.31.1.90 12003 typ host ...*3
    a=candidate:4 2 udp 2013266430 10.21.82.27 12004 typ host ...*3
    a=candidate:8 2 udp 2013266430 172.31.1.90 12004 typ host ...*3
*1 shows that Slack uses Janus as its server-side WebRTC endpoint. Janus is a WebRTC gateway that can behave as an MCU, an SFU, or in other modes thanks to its plugin architecture. (See Lorenzo Miniero’s post for more on WebRTC gateways.)
*2 shows that “RTP/SAVPF” is still used. In fact, Lorenzo changed this to “UDP/TLS/RTP/SAVPF” immediately after Fippo’s post, so once Slack updates their Janus we should see the SDP updated here.
From *3, Slack sends us its ICE candidates inside the SDP, vanilla ICE style, which is common with an MCU or SFU since the server already knows its own candidates. The key point here is that the IP addresses are all private ones. This means we can’t connect to these addresses directly.
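To make the private-address observation concrete, here is a minimal JavaScript sketch that parses `a=candidate` lines like the ones marked *3 and checks whether each address falls in an RFC 1918 private range. The helper names `parseCandidate` and `isPrivateIPv4` are hypothetical and are not taken from Slack’s or Janus’s code.

```javascript
// Hypothetical helpers for illustration only; not taken from Slack or Janus.

// Parse one "a=candidate:..." SDP line into its basic fields.
function parseCandidate(line) {
  const body = line.replace(/^a=/, '').replace(/^candidate:/, '');
  const parts = body.trim().split(/\s+/);
  // Layout: foundation component transport priority address port "typ" type ...
  if (parts.length < 8 || parts[6] !== 'typ') {
    throw new Error('Not a candidate line: ' + line);
  }
  return {
    foundation: parts[0],
    component: Number(parts[1]),
    transport: parts[2].toLowerCase(),
    priority: Number(parts[3]),
    address: parts[4],
    port: Number(parts[5]),
    type: parts[7],
  };
}

// True if the IPv4 address is in an RFC 1918 private range
// (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16).
function isPrivateIPv4(address) {
  const octets = address.split('.').map(Number);
  const valid = octets.length === 4 &&
    octets.every((o) => Number.isInteger(o) && o >= 0 && o <= 255);
  if (!valid) {
    return false;
  }
  const [a, b] = octets;
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Two of the candidates from the offer above (the "...*3" markers removed).
const offerCandidates = [
  'a=candidate:4 1 udp 2013266431 10.21.82.27 12003 typ host',
  'a=candidate:8 1 udp 2013266431 172.31.1.90 12003 typ host',
];

for (const line of offerCandidates) {
  const c = parseCandidate(line);
  console.log(c.address, c.type, isPrivateIPv4(c.address) ? 'private' : 'public');
}
```

Both candidates come out as private, so a browser outside that network has no direct route to them.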
How do we connect to these private IPs?
The answer is found in the ICE candidates received. If you check chrome://webrtc-internals you will find host, srflx, and relay candidates as usual. (srflx should be absent when your machine has a global IP address.)
“host” and “srflx” are useless here because those addresses can’t reach the IPs in *3. “relay” (a.k.a. the TURN address) is the most important ICE candidate, and here is the concrete result:
    sdpMid: audio, sdpMLineIndex: 0, candidate: candidate:4184247995 1 udp 41754367 52.77.208.161 52017 typ relay raddr X.X.X.X rport 50512 generation 0 ufrag MQyVfDIb5jH9WrUh
Here you can see a global IP address, “52.77.208.161”. This is the address of a TURN server Slack deployed in their system. A reverse lookup of this global IP shows:
    $ dig -x 52.77.208.161
    ; <<>> DiG 9.8.3-P1 <<>> -x 52.77.208.161
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27428
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;161.208.77.52.in-addr.arpa. IN PTR
    ;; ANSWER SECTION:
    161.208.77.52.in-addr.arpa. 300 IN PTR ec2-52-77-208-161.ap-southeast-1.compute.amazonaws.com.
The TURN servers are deployed on AWS. Because an AWS EC2 instance has both a global IP and a private IP inside its VPC, the TURN server should have a private IP as well. That private IP can reach the private IPs in *3.
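The PTR record above already tells us which AWS region hosts the relay. As a small sketch, the hypothetical helper `parseEc2PublicDns` below splits such a name into the IP address and the region label. It assumes the `ec2-<ip>.<region>.compute.amazonaws.com` naming pattern seen in the `dig` output and returns `null` for names that do not follow it.

```javascript
// Hypothetical helper for illustration; the naming pattern is inferred
// from the PTR record shown above.
function parseEc2PublicDns(hostname) {
  // PTR answers end with a trailing dot; drop it before matching.
  const name = hostname.replace(/\.$/, '');
  const pattern = /^ec2-(\d+)-(\d+)-(\d+)-(\d+)\.([a-z0-9-]+)\.compute\.amazonaws\.com$/;
  const match = pattern.exec(name);
  if (!match) {
    return null;
  }
  return {
    ip: match.slice(1, 5).join('.'),
    region: match[5],
  };
}

const relay = parseEc2PublicDns('ec2-52-77-208-161.ap-southeast-1.compute.amazonaws.com.');
console.log(relay); // { ip: '52.77.208.161', region: 'ap-southeast-1' }
```

Here `ap-southeast-1` is the AWS Singapore region, which matches the latency observation made earlier.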
Slack’s TURN server
Speaking of the TURN server, Slack doesn’t use the active coturn project but the older rfc5766-turn-server project. I found this in the Wireshark dump:
The TURN server’s version is 3.2.3.96, which is an old version. The latest version of the coturn server is “3.2.5.9” according to GitHub.
In addition to the version, here is some more information about Slack’s TURN servers:
- The URI of the TURN server is turn:slack-calls9.slack-core.com:22466, and it seems that there are about 23 TURN servers deployed around the world. You can check this by looking up the names from slack-calls1.slack-core.com to slack-calls23.slack-core.com
- I was able to find TURN servers deployed at least in Singapore, Ireland, and Northern California
- TURN authentication information, user name and password, are dynamically obtained from JavaScript
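As a rough sketch of how the pieces in this list fit together, the snippet below generates the 23 host names and builds an `RTCConfiguration`-shaped object for one of them. The function names, the placeholder credentials, and the idea that a client would assemble the object this way are assumptions for illustration; only the host naming scheme and port 22466 (seen for slack-calls9) come from the analysis.

```javascript
// Hypothetical helpers for illustration; not Slack's actual client code.

// Host names follow the slack-callsN.slack-core.com scheme observed above.
function turnHostnames(count) {
  return Array.from({ length: count }, (_, i) => `slack-calls${i + 1}.slack-core.com`);
}

// Build an object in the shape RTCPeerConnection accepts as its configuration.
// In the real service the user name and password are fetched dynamically.
function buildRtcConfiguration(hostname, port, username, credential) {
  return {
    iceServers: [
      { urls: `turn:${hostname}:${port}`, username, credential },
    ],
  };
}

const hosts = turnHostnames(23);
// Index 8 is slack-calls9; the credentials are placeholders.
const config = buildRtcConfiguration(hosts[8], 22466, 'EXAMPLE_USER', 'EXAMPLE_PASSWORD');

console.log(hosts.length, config.iceServers[0].urls);
```

In a browser the resulting object could be passed to `new RTCPeerConnection(config)`. To repeat the lookup experiment, each generated name can be resolved with any DNS tool.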
Why do they force users through a TURN server?
There might be a few reasons:
- To reduce call setup time, like FaceTime and WhatsApp do
- To make use of TURN functions like authentication
That’s all I could find on their TURN setup, so let’s go back to their SDP.
Answer from Janus
Janus’ SDP Answer was as follows:
    type: answer, sdp: v=0
    o=- 643218903360 643218903359 IN IP4 127.0.0.1
    s=Room with no name..
    t=0 0
    a=group:BUNDLE audio
    a=msid-semantic: WMS janus
    m=audio 1 RTP/SAVPF 111
    c=IN IP4 10.21.82.27
    a=mid:audio
    a=recvonly ...*4
    a=rtcp-mux
    a=ice-ufrag:4c6U
    a=ice-pwd:PvqUXiHLeUIO7qgcKeVHhd
    a=ice-options:trickle
    a=fingerprint:sha-256 C5:5F:DA:7D:84:47:B1:BF:6B:55:16:62:48:31:3E:D3:F1:7B:25:89:92:4A:4B:4D:4D:D9:D5:AF:EA:D8:15:4
    a=setup:active
    a=connection:new
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtpmap:111 opus/48000/2
    a=fmtp:111 minptime=10; useinbandfec=1; usedtx=1
    a=candidate:17 1 udp 2013266431 10.21.82.27 12016 typ host
    a=candidate:34 1 udp 2013266431 172.31.1.90 12016 typ host
The structure is almost the same as the offer. The point I’d like to highlight is *4. (The offer above contains an a=sendonly attribute, which I skipped there to keep the explanation simple.) This means there are two media streams between the client and Janus: one is sendonly and the other is recvonly.
The following image depicts this:
If Janus behaves as an MCU, mixing the streams from each party into one, there should only be a single stream down to each client (browser or desktop application). On the other hand, if Janus behaves as an SFU, the number of downstreams will match the number of other clients, because an SFU doesn’t combine the streams, it just distributes them. I will investigate their multi-party calling model next.
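The MCU versus SFU expectation can be written down as a tiny function. This is only a sketch of the reasoning in the paragraph above; the name `expectedDownstreams` is hypothetical, and it assumes every participant publishes exactly one stream.

```javascript
// Hypothetical helper for illustration: how many downstreams one client
// should see, assuming each participant publishes a single stream.
function expectedDownstreams(mode, participants) {
  if (!Number.isInteger(participants) || participants < 2) {
    throw new RangeError('participants must be an integer >= 2');
  }
  if (mode === 'mcu') {
    return 1; // the MCU mixes everything into one stream
  }
  if (mode === 'sfu') {
    return participants - 1; // one forwarded stream per other participant
  }
  throw new Error('Unknown mode: ' + mode);
}

for (const mode of ['mcu', 'sfu']) {
  console.log(mode, expectedDownstreams(mode, 3));
}
```

For a three-person call, one downstream would point to an MCU and two downstreams would point to an SFU.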
How are they doing multiparty?
Slack provides multiparty voice chat to paying users. My team uses the paid service so we were able to do a 3-person voice chat experiment. The scenario of the experiment is pretty simple:
1. Member-A using Chrome starts to create a room
2. Member-B and Member-C using Desktop application join the room
After this experiment I had SDPs, dump data from chrome://webrtc-internals, and a packet capture. Here is the result of the analysis.
Slack’s Janus behaves as SFU
As I wrote before, Janus itself is a WebRTC gateway and provides many functions through various plugins. One of these plugins is a videoconferencing SFU. Our analysis indicates Slack uses Janus as an SFU. (I don’t know which SFU plugin is used: the one Janus officially provides or one Slack developed itself.)
Here is a picture of the multiparty topology:
How did I find this topology?
The answer is again found in the SDPs. In this experiment my Chrome browser got two offers from Janus, both receive-only from the browser’s point of view.
1st offer
    type: offer, sdp: v=0
    o=- 1934855961425 1934855961425 IN IP4 127.0.0.1
    s=Room with no name..
    t=0 0
    a=group:BUNDLE audio
    a=msid-semantic: WMS janus
    m=audio 1 RTP/SAVPF 111
    c=IN IP4 10.21.119.210
    a=mid:audio
    a=sendonly
    a=rtcp-mux
    a=ice-ufrag:TSS3
    a=ice-pwd:nNIA1e4IPKi80wTmQXELfH
    a=ice-options:trickle
    a=fingerprint:sha-256 C5:5F:DA:7D:84:47:B1:BF:6B:55:16:62:48:31:3E:D3:F1:7B:25:89:92:4A:4B:4D:4D:D9:D5:AF:EA:D8:15:44
    a=setup:actpass
    a=connection:new
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtpmap:111 opus/48000/2
    a=fmtp:111 minptime=10; useinbandfec=1; usedtx=1
    a=ssrc:1847733899 cname:janusaudio
    a=ssrc:1847733899 msid:janus janusa0
    a=ssrc:1847733899 mslabel:janus
    a=ssrc:1847733899 label:janusa0
    a=candidate:6 1 udp 2013266431 10.21.119.210 12005 typ host
    a=candidate:12 1 udp 2013266431 172.31.0.210 12005 typ host
    a=candidate:6 2 udp 2013266430 10.21.119.210 12006 typ host
    a=candidate:12 2 udp 2013266430 172.31.0.210 12006 typ host
2nd offer
    type: offer, sdp: v=0
    o=- 1934856623162 1934856623161 IN IP4 127.0.0.1
    s=Room with no name..
    t=0 0
    a=group:BUNDLE audio
    a=msid-semantic: WMS janus
    m=audio 1 RTP/SAVPF 111
    c=IN IP4 10.21.119.210
    a=mid:audio
    a=sendonly
    a=rtcp-mux
    a=ice-ufrag:tFU7
    a=ice-pwd:sQ1tpfAOrpm1okszeBmy39
    a=ice-options:trickle
    a=fingerprint:sha-256 C5:5F:DA:7D:84:47:B1:BF:6B:55:16:62:48:31:3E:D3:F1:7B:25:89:92:4A:4B:4D:4D:D9:D5:AF:EA:D8:15:44
    a=setup:actpass
    a=connection:new
    a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level
    a=rtpmap:111 opus/48000/2
    a=fmtp:111 minptime=10; useinbandfec=1; usedtx=1
    a=ssrc:3006424101 cname:janusaudio
    a=ssrc:3006424101 msid:janus janusa0
    a=ssrc:3006424101 mslabel:janus
    a=ssrc:3006424101 label:janusa0
    a=candidate:15 1 udp 2013266431 10.21.119.210 12014 typ host
    a=candidate:30 1 udp 2013266431 172.31.0.210 12014 typ host
    a=candidate:15 2 udp 2013266430 10.21.119.210 12015 typ host
    a=candidate:30 2 udp 2013266430 172.31.0.210 12015 typ host
As you can see, both SDPs have a=sendonly, which means two media streams are coming down to my Chrome browser, so Slack’s Janus doesn’t behave as an MCU.
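The same check can be automated. The sketch below reads the direction attribute out of each offer and counts how many are send-only from the server side. The helper names are hypothetical, the SDP strings are heavily abbreviated versions of the two offers, and the code only looks at the first direction attribute, which is enough for single-audio SDPs like these.

```javascript
// Hypothetical helpers for illustration; not taken from Slack or Janus.

// Return the first direction attribute in an SDP blob. When none is
// present, "sendrecv" is the default in SDP offer/answer (RFC 3264).
function mediaDirection(sdp) {
  const match = /^a=(sendonly|recvonly|sendrecv|inactive)\s*$/m.exec(sdp);
  return match ? match[1] : 'sendrecv';
}

// Count the offers in which the server only sends: one downstream each.
function countDownstreamOffers(offers) {
  return offers.filter((sdp) => mediaDirection(sdp) === 'sendonly').length;
}

// Abbreviated versions of the two offers (most lines omitted).
const firstOffer = [
  'v=0', 'm=audio 1 RTP/SAVPF 111', 'a=mid:audio', 'a=sendonly', 'a=ice-ufrag:TSS3',
].join('\r\n');
const secondOffer = [
  'v=0', 'm=audio 1 RTP/SAVPF 111', 'a=mid:audio', 'a=sendonly', 'a=ice-ufrag:tFU7',
].join('\r\n');

const downstreams = countDownstreamOffers([firstOffer, secondOffer]);
console.log(downstreams, downstreams > 1 ? 'consistent with an SFU' : 'could be an MCU');
```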
These SDPs show another interesting fact.
Multistream with Plan B or Unified Plan is not supported
There are several ways to send multiple streams (this is well explained in Dr. Alex’s presentation slides):
1. Multicast
   - a. Multiple PeerConnections, each with a single MediaStream
   - b. Single PeerConnection with multiple MediaStreams
2. Simulcast
3. SVC encoding
[1.a] has been available since the beginning of WebRTC, but it consumes a lot of resources such as ports. For [1.b] there are two major approaches: one based on Plan B and the other based on Unified Plan. (For your information, it seems Chromium has started to implement Unified Plan.)
The multistream approach used in Slack is [1.a], since each stream arrives in its own offer.
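One way to back this up from the SDPs is to compare the ICE credentials. Each offer carries its own `a=ice-ufrag` (TSS3 and tFU7) and its own ports, which suggests separate ICE sessions and therefore separate PeerConnections. The sketch below counts distinct ufrags; the helper names are hypothetical and the SDP strings are abbreviated.

```javascript
// Hypothetical helpers for illustration; not taken from Slack or Janus.

// Extract the ICE username fragment from an SDP blob, or null if absent.
function iceUfrag(sdp) {
  const match = /^a=ice-ufrag:(\S+)/m.exec(sdp);
  return match ? match[1] : null;
}

// Number of distinct ICE username fragments across a set of offers.
function distinctIceSessions(offers) {
  const ufrags = offers.map(iceUfrag).filter((u) => u !== null);
  return new Set(ufrags).size;
}

// Abbreviated versions of the two offers (most lines omitted).
const offers = [
  ['v=0', 'a=sendonly', 'a=ice-ufrag:TSS3', 'a=ice-pwd:nNIA1e4IPKi80wTmQXELfH'].join('\r\n'),
  ['v=0', 'a=sendonly', 'a=ice-ufrag:tFU7', 'a=ice-pwd:sQ1tpfAOrpm1okszeBmy39'].join('\r\n'),
];

console.log(distinctIceSessions(offers)); // 2
```

With Plan B or Unified Plan you would instead expect the extra stream to show up inside a single session description rather than as a second offer with fresh credentials.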
Signaling
Slack’s signaling is based on ScreenHero. You can see this in the minified JavaScript code. A snippet looks like this:
    _getServer("screenhero.rooms.join", ...)
    // ... //
    _getServer("screenhero.rooms.create", ...)
Room-based signaling is used, which is common in multiparty conference scenarios.
Conclusion
In conclusion, here are some of the main highlights:
- No peer-to-peer – all calls are relayed by an AWS-based TURN network using an older TURN server project
- Using Janus as an SFU to provide multi-party calls
- Signaling looks to be room based from ScreenHero
As a WebRTC engineer and a Slack user, I find it kind of disappointing that Slack doesn’t fully utilize TURN features such as ‘TURN/TCP’ and ‘TURN/TLS’. Since Slack has over 2 million active users, with this TURN configuration many users in strict network environments are unable to use voice chat for now. I hope they’ll change their settings in the near future to give Slack users a better UX.
Slack has already announced they’re going to add a video chat feature, so I’ll check their WebRTC usage when that is available and hopefully be back here to comment again.
{“author”: “Yoshimasa Iwase“}
Aswath Rao says
I am not convinced that the standards should encourage, let alone recommend use of TURN/TCP & TURN/TLS, if the LAN policy is to expressly prohibit RTP traffic. Such tunneling mechanism is a renegade act and the correct way is for the affected users to lobby their local admins to relax the policy.
Furthermore, we need to develop comm protocol between the applications and Middleboxes so that dynamic pin holes can be created. Use of these TURN schemes take the easy way out by pushing the problem under the rug.
Yoshimasa Iwase says
It makes sense that relaxing the policy in strict networks, such as in the enterprise world, is a good solution. One opinion I forgot to write in the article is that the best path is P2P (no relay). It would be better if Slack let us use host candidates.
Jeremy Noring says
Having deployed into dozens of large companies, lobbying local admins to “relax the policy” is often like arguing with a kitchen table. It simply will not happen.
The reality is deploying inside of big enterprise requires TURN/TCP and HTTP CONNECT support for bypassing proxies. I am 100% convinced that to have broad WebRTC adoption, these absolutely must be part of the standard.
Jeremy Noring says
The main reason Slack is forcing traffic through TURN is because otherwise they end up in a full-mesh configuration for multi-party chats.
For example, if all traffic is p2p and it’s a five person conference, each participant is sending four streams, and receiving four streams. This simply doesn’t scale well given that most people have limited upload bandwidth. With a media server in the mix, each party pushes a *single* stream, and receives four streams: much less stress on an individual user’s upload connection. For audio-only, you might get away with full mesh.
Furthermore, because they are an enterprise product, the reality is p2p is likely impossible. A TURN server isn’t a nicety; it’s a requirement. In many regards, it becomes easier to force all traffic through TURN than even bother with the complexities of ICE negotiation.
Lastly, if recording content is of any interest, there is no good option besides MITM’ing traffic and recording at an MCU/SFU.
Lorenzo Miniero says
I’m not sure I understand your comment, here. TURN does NOT allow you to change a full-mesh topology: that’s what SFUs are for.
The only thing TURN allows you to do is relaying media between two peers even in situations where this could be troublesome (e.g., see Symmetric NATs). In a full-mesh conference you’d still need to send/encode your stream multiple times even with TURN involved (you’d just have N relays transporting all of them).
What you explain is addressed by SFUs, which allow you to publish your media just once, and then take care of making the stream available to all interested parties themselves.
Jeremy Noring says
Apologies–let me explain. I skipped a part and that was probably confusing.
With any SFU, there still has to be a signaling channel. Almost without fail, that signaling channel is going to be some HTTPS-ish protocol over 443 (anything else is prone to failure for the same reasons WebRTC is prone to failure); for example, licode uses socket-io. However, this means the signaling server itself is likely listening on 443, which means: a separate TURN server is ultimately necessary.
In other words, SFU for enterprise environments necessitates the use of a separate TURN server to listen on 443. And at that point, rather than bother with complicated ICE negotiation, it’s a lot easier to simply force all traffic through the TURN server.
Jeremy Noring says
Or, more simply put: once you’re dealing with enterprise and need to use an SFU, it’s a lot easier to just pump everything through TURN than bother with the corner cases.
In any event, my original post is phrased incorrectly–sorry for being unclear.
Draken says
It’s easy to understand why they FORCE TURN servers. Exposing client IPs is becoming a big no-no in software dev. With the rise of DDoSing, and the ease with which it can be done using the services standing by, exposing clients’ IPs in any way is a bad idea.
Even Skype isn’t P2P anymore.