Time for another opinionated post. This time on… end-to-end encryption (e2ee). Zoom apparently claims it supports e2ee while it can not satisfy that promise. Is WebRTC any better?
Zoom does not have End to End Encryption
Let’s get to the bottom of things fast: Boo Zoom!
I reviewed how Zoom’s implements their web client last year.
I’m not really surprised of their general lack of e2ee given that their web client did not provide any encryption on top of TLS or WebRTC’s DataChannel. For reasons we will discuss below, this means they weren’t doing any obvious e2ee there.
Update (April 2nd): Zoom published a blog post saying are using e2ee in the main use-case. Which sounds great but how is that auditable, how are keys managed and what prevents them from switching it off at any time?
Is WebRTC Any Better?
Now that we’re done with finger pointing, how does the situation look in WebRTC land?
WebRTC is encrypted. By default. You can’t turn it off. It’s clearly secure! Sadly, the situation is a bit more complex.
Encrypting Real Time Media
WebRTC uses DTLS-SRTP for encryption. In a nutshell that means there is a (D)TLS handshake and then the encryption keys are derived from that. That uses self-signed certificates which are signalled in the SDP. This is a fair bit better than no encryption or SDES.
DTLS vs. SDES
The slides from the 2013 IETF meeting in Berlin discuss the topic of DTLS vs SDES in quite some detail and we also have a post on that decision if you want more history there.
There are two things to note here:
- DTLS requires an active attack. It is possible (using chrome://webrtc-internals or Firefox about:webrtc) to get hold of the remote DTLS fingerprint of a peer you’re connected to. But that is quite hard for the average user. It is possible to use end-to-end encryption for the signaling messages which then establish a binding between an identity and the fingerprint.
This even applies if your traffic is routed through a TURN server, which by design does not know the encryption keys negotiated via DTLS. - It is encrypted to the peer. Now in the multiparty case that peer is often a SFU. The same applies to Zoom. I looked at their native stuff a couple of years back and the payload of the UDP packet seemed pretty random which suggests a similar level of encryption.
Selective Forwarding Units (SFU)
Now there is a thing about SFUs. This is the defacto architecture used to relay media in the cloud when you need to scale a video conference past a few users. They need to do some fancy things with RTCP, the control protocol for media in order to work. Oscar Divorra described the details here and Gustavo and Sergio go into the details of layering here
They also need access to a tiny bit of information about the frame, in particular whether it is a keyframe in order to make simulcast work. You can see some of this here.
This can be solved by a technique called “frame marking” which pulls that bit of information out into an unencrypted header extension. The same goes for server-side speaker detection when it comes to audio.
Note it is a different story for 1:1 calls or calls that employ a peer-to-peer mesh architecture. These do offer e2ee by default – noting the DTLS caveats above.
WebRTC Insertable Streams to the rescue
Unlike an MCU an SFU does not need or want access to the unencrypted media. But they get it because there is no alternative yet. However, this is about to change with the Insertable Streams API that is being implemented by the Chrome WebRTC team right now:
It has been available in the native webrtc.org API for a while but Chrome bindings were missing. It is far from ready and needs considerably more testing. There were some pretty glaring bugs like not working in the other direction (fixed in less than 24 hours which was much appreciated). The bar is rising here but there is still quite some effort to be done before it is ready.
So yes, Zoom does not have end-to-end encryption. Quite often, WebRTC doesn’t either – not yet at least. If you are using a WebRTC service check their terms of service and privacy policy and make sure that you understand what they are saying about this. Hopefully we will see this change soon as WebRTC Insertable Streams matures.
Disclosure: I had a coffee with Eric Yuan, CEO of Zoom in early 2019 after he read (and hopefully enjoyed) the original post on how Zoom avoids WebRTC. He paid for the coffee and gave me nice swag even.
{“author”: “Philipp Hancke“}
Tubito Rodriguez says
Ah jeeze. Another reactionary article. Would have been more original if published before this zoom fiasco. Abysmal.
Chad Hart says
webrtcHacks has been publishing details on WebRTC and other RTC services since 2013, including our Blackbox series that attempts to reverse engineer how many popular RTC services work. We are happy a broader audience is taking some of the details of RTC technology more seriously and encourage readers to look through our archives.
Paul O'Brien says
This is the correct response. Well said Chad.
Peter Loron says
How about Jitsi or Signal? Have those been evaluated? Thanks!
Philipp Hancke says
we’re not making recommendations 🙂
Jitsi is using a well-known and understood SFU architecture and has been covered here many times on various aspects. They do lack e2ee because they do terminate encryption but that is something Emil Ivov is well aware of. You can always run your own server and audit the code in theory. Its a matter of delegating trust.
Signal has a very competent lead (moxie) and recently added a very experienced WebRTC expert to the team. While I haven’t met moxie I trust he knows his domain.
erotavlas says
What about the PERC project by IETF? It will bring e2ee conference.
Philipp Hancke says
see https://webrtchacks.com/true-end-to-end-encryption-with-webrtc-insertable-streams/ or the linked Jitsi post 🙂
Tim Makarios says
In the Zoom blog post you link to in your April 2nd update, they claim “in a meeting where all of the participants are using Zoom clients, and the meeting is not being recorded, we … do not decrypt it at any point before it reaches the receiving clients”. But people are saying that they do decrypt streams in order to scale their quality to each client’s available bandwidth. See, for example, https://twitter.com/maxhawkins/status/1248139006887342080 . This would make sense of Zoom clients coping with more incoming video streams than, say, Jitsi Meet clients.
So which is true? Was Zoom still spreading falsehoods when they clarified their non-use of end-to-end encryption? Or have they found some way to downscale video without decrypting it? Or are large Zoom video meetings only successful when everyone’s computers and connections are capable of handling many incoming high-definition video streams simultaneously?
Philipp Hancke says
There are two architecture approaches:
– mesh where every client sends to every other client; that works up to four clients, maybe up to eight with a lot of tuning) and
– SFU where a server redistributes. This typically involves simulcast (https://webrtchacks.com/sfu-simulcast/). SFUs terminate DTLS encryption (which is hop-by-hop) and so far do have access to the media. There are way to avoid this by encrypting the media with a key that is not known to the SFU which is what we did in https://jitsi.org/blog/e2ee/ after getting access to a new chrome api.
Zoom can do the same and they fully control both client and server. What they do in reality…
Tim Makarios says
Thanks for your reply.
As for what Zoom actually does, they claim they can “Bring HD video and audio to your meetings with support for up to 1000 video participants and 49 videos on screen”; which makes me wonder how many laptops on wifi connections would be able to handle putting 49 incoming HD video streams on screen in real time.
So if that Zoom claim isn’t itself misleading (and from what I’ve heard, Zoom meetings actually do cope with many participants, though I haven’t seen it in action myself), then that makes it seem much more likely that (at least in large meetings) they are in fact decrypting the incoming video streams to process them in some way before sending them to the clients.
The remaining possibility is that they might have some way of downscaling the video streams without decrypting them. Despite what I wrote in a comment awaiting moderation on Matthew Green’s blog ( https://blog.cryptographyengineering.com/2020/04/03/does-zoom-use-end-to-end-encryption/?unapproved=4240&moderation-hash=f55e0727d50a730e93eb40022e208eec#comment-4240 ), I no longer think it would require efficient fully homomorphic encryption to achieve this; you might be able to do it with a carefully designed compression scheme (think of something like FLIF) that would allow the SFU to simply drop specific parts of an encrypted video stream (for low-bandwidth recipients) without preventing it from being decoded into a lower-quality video stream.
Also, it seems to me that the question of whether Zoom decrypts video streams to downscale them in large meetings could be answered by some packet inspection of the kind Citizen Lab did for their April 3 report on Zoom’s encryption: Someone could check whether the vast majority of the encrypted data received by a client in a large meeting is identical to encrypted data sent by other clients.
As for Jitsi Meet, I think I’d find it easier to persuade my friends to use my Jitsi Meet server, rather than Zoom, if it did do downscaling at the server, thus making it less resource-intensive for clients. Having said that, I get excited by the prospect of end-to-end-encrypted group chats, too, and I recognize that there are limited resources available for Jitsi Meet development.
Philipp Hancke says
Well, the trick here is that since you can’t display 49 times 720 (1280×720 pixels usually) on a normal screen any the clients never receive that so many streams in that quality.
With simulcast each participants sends (unless muted) three resolutions, 720p, 640×360 and 320×180 (zoom might do more or less or different ratios) to the SFU.
From other participants the client knows in which size they are displayed and can ask the SFU to only forward only a resolution matching that size.
What you receive therefore depends a lot on the layout. For a “large presenter with thumbnail” you’ll get 720p plus a couple of 320×180, for the “brady bunch you might got for 25 times 640×360. https://jitsi.org/blog/new-feature-brady-bunch-style-layout/ explains this a bit.
There are more optimizations like dropping the framerate for thumbnail. Same goes for doing speaker detection (easy when people explicitly mute but possible with just audio levels; the jitsi folks have an adaption of an algorithm that works with just the levels)
The way SFUs are designed makes these routing decisions largely on the RTP header and only needs a single byte of the payload at most. See https://jitsi.org/blog/e2ee/ for a nice demo. The way we designed it there was even leaving the SFU unaware that the content is encrypted with a second layer.
Tim Makarios says
Interesting. I hadn’t thought of getting the clients to send multiple streams. I guess I’ve been influenced by my experience of Jitsi Meet maxing out my decade-old laptop’s CPU in any meetings with more than a handful of participants, even in low bandwidth mode (so that when I’m speaking, the other participants report that it sounds as if my microphone is intermittently cutting out) — my inclination was to try to get the client to do as little as possible.
But the way Jitsi Meet does it is probably suitable for most people’s machines (at least in richer places), even if it does require the client to do a little more work than downscaling at the server would.