Introduction to capture handle - a new Chrome Origin Trial that lets a WebRTC screen-sharing application communicate with the tab it is capturing. Example use cases discussed include detecting self-capture, improving the use of collaboration apps that are screen shared, and optimizing stream parameters of the captured content.

Pion seemingly came out of nowhere to become one of the biggest and most active WebRTC communities. Pion is a Go-based set of WebRTC projects. Golang is an interesting language, but it is not among the most popular programming languages out there, so what is so special about Pion? Why are there so many developers involved in this project? 

To learn more about this project and how it came to be among the most active WebRTC organizations, I interviewed its founder – Sean Dubois. We discuss Sean's background and how he got started in RTC. I really wanted to understand why he decided to build a new WebRTC project and why he continues to spend so much of his free time on it. ...  Continue reading

Chrome recently added the option of adding redundancy to audio streams using the RED format as defined in RFC 2198, and Fippo wrote about the process and implementation in a previous article. You should catch up on that post, but to summarize quickly: RED works by adding redundant payloads with different timestamps in the same packet. If you lose a packet on a lossy network, chances are another successfully received packet will contain the missing data, resulting in better audio quality.

That was in a simplified one-to-one scenario, but audio quality issues often have the most impact on larger multi-party calls. As a follow-up to Fippo’s post, Jitsi Architect and Improving Scale and Media Quality with Cascading SFUs author Boris Grozev walks us through his design and tests for adding audio redundancy to a more complex environment with many peers routing media through a Selective Forwarding Unit (SFU).

{“editor”: “chad hart“}

Fippo covered how to add redundancy packets in standard peer-to-peer calls without any middle boxes like a Selective Forwarding Unit (SFU). What happens when you stick an SFU in the middle? There are a couple more things to consider.

  • How do we handle conferences where clients have different RED capabilities? It may be the case that only a subset of the participants in a conference support RED. In fact this will often be the case today since RED is a relatively new addition to WebRTC/Chromium/Chrome.
  • Which streams should have redundancy? Should we add redundancy for all audio streams at the cost of additional overhead, or just the currently active speaker (or 2-3 speakers)?
  • Which legs should have redundancy? In multi-SFU cascading scenarios, do we need to add redundancy for the SFU-SFU streams?
  •  ...  Continue reading

Back in April 2020 a Citizen Lab report on Zoom's rather weak encryption stated that Zoom uses the SILK codec for audio. Sadly, the article did not contain the raw data to let me validate that and look at it further. Thankfully Natalie Silvanovich from Google's Project Zero helped me out using the Frida tracing tool and provided a short dump of some raw SILK frames. Analyzing these inspired me to take a look at how WebRTC handles audio. In terms of perception, audio quality is much more critical for the perceived quality of a call, as we tend to notice even small glitches. A mere ten seconds of this audio analysis were enough to set me off on quite an adventure investigating possible improvements to the audio quality provided by WebRTC.



    Software as a Service, Infrastructure as a Service, Platform as a Service, Communications Platform as a Service, Video Conferencing as a Service, but what about Gaming as a Service? There have been a few attempts at Cloud Gaming, most notably Google’s recently launched Stadia. Stadia is no stranger to WebRTC, but can others leverage WebRTC in the same way?

Thanh Nguyen set out to see if this was possible with his open source project, CloudRetro. CloudRetro is based on the popular Go-based WebRTC library, Pion (thanks to Sean of Pion for helping review here). In this post, Thanh gives an architectural review of how he built the project along with some of the benefits and challenges he ran into along the way.

{“editor”: “chad hart“}


Last year, Google announced Stadia, and it blew my mind. The idea is so unique and innovative that I kept questioning how it is even possible with the current state of technology. The motivation to demystify its technology spurred me to create an open-source version of Cloud Gaming. The result is fantastic. I would like to share my one-year adventure working on this project in the article below.

    TLDR: the short slide version with highlights

    Why Cloud Gaming is the future

    I believe Cloud Gaming will soon become the new generation of not only games but also other fields of computer science. Cloud Gaming is the pinnacle of the client/server model. It maximizes backend control and minimizes frontend work by putting game logic on a remote server and streaming images/audio to the client. The server handles heavy processing so the client is no longer limited by hardware constraints.

Looking at Google Stadia, it essentially allows you to play AAA games (i.e. high-end blockbuster games) on an interface like YouTube. The same methodology can be applied to other heavy offline applications like operating systems or 2D/3D graphic design, etc., so that we can run them consistently on low-spec devices across many platforms.

The future of this technology: imagine running Microsoft Windows 10 in a Chrome browser?

    Cloud Gaming is technically challenging

Gaming is one of the rare applications that require continuous, fast user reaction. If we click a page and it takes 2 seconds to load once in a while, it is still acceptable. Live broadcast video streams typically run many seconds behind, but still offer acceptable usability. However, if a game is delayed frequently by 500ms, it becomes unplayable. The target is to achieve extremely low latency to ensure the gap between game input and media is as small as possible. Therefore, the traditional video streaming approach is not applicable here.

    Cloud Gaming common pattern

    The Open Source CloudRetro Project

    I decided to create a POC of Cloud-Gaming so that I can verify whether it is possible with these tight network restrictions. I picked Golang for my POC because it is the language I am most familiar with and it turned out to work well for many other reasons. Go is simple with fast development speed. Go channels are excellent when dealing with concurrency and stream manipulation.

The project is an open-source, web-based cloud gaming service for retro games. The goal is to bring the most comfortable gaming experience and introduce network gameplay, like online multiplayer, to traditional retro games.

    You can reference the entire project repo here:

    CloudRetro Functionality

CloudRetro uses retro games to demonstrate the power of Cloud Gaming. It enables many unique gaming experiences.

• Portable gaming experience
  • Instant play when the page is opened; no download, no install
  • Runs in the browser on mobile and desktop, so you don't need any software to launch
• Gaming sessions can be shared across multiple devices and stored in cloud storage for next time
• The game is both streamed and playable, and multiple users can join the same game:
  • Crowdplay, like TwitchPlaysPokemon but more real-time and seamless
  • Online multiplayer for offline games without any network setup. Samurai Showdown is now playable with 2 players over the network in CloudRetro


      Requirement and Tech Stack

      Below is the list of requirements I set before starting the project.

      1. Singleplayer:

This requirement may sound irrelevant and straightforward, but it's one of my key findings that sets cloud gaming apart from traditional streaming services. If we focus on single-player, we can get rid of a centralized server or CDN because we don't need to distribute the stream session to a massive number of users. Rather than uploading streams to an ingest server or passing packets to a centralized WebSocket server, the service streams to the user directly over a WebRTC peer connection.
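
As an illustration of that direct path, a worker built on Pion might create a peer connection, add a single VP8 track, and write encoded frames into it, roughly as sketched below. This is a hypothetical sketch, not CloudRetro's actual code, and the signaling exchange with the coordinator is omitted.

// Hypothetical sketch: stream encoded game video straight to one player
// over a Pion peer connection, with no ingest server or CDN in between.
package worker

import (
    "time"

    "github.com/pion/webrtc/v3"
    "github.com/pion/webrtc/v3/pkg/media"
)

func startStream(encodedFrames <-chan []byte) error {
    pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
    if err != nil {
        return err
    }

    // One VP8 track per player.
    videoTrack, err := webrtc.NewTrackLocalStaticSample(
        webrtc.RTPCodecCapability{MimeType: webrtc.MimeTypeVP8}, "video", "cloudretro")
    if err != nil {
        return err
    }
    if _, err = pc.AddTrack(videoTrack); err != nil {
        return err
    }

    // Push each encoded frame into the track as it leaves the encoder.
    go func() {
        for frame := range encodedFrames {
            videoTrack.WriteSample(media.Sample{Data: frame, Duration: time.Second / 30})
        }
    }()

    // Offer/answer exchange with the coordinator omitted here.
    return nil
}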

      2. Low Latency media stream

When I researched Stadia, some articles mentioned the use of WebRTC. I figured out that WebRTC is a remarkable technology and fits this cloud gaming use case nicely. WebRTC is a project that provides web browsers and mobile applications with Real-Time Communication via simple APIs. It enables peer communication, is optimized for media, and has built-in standard codecs like VP8 and H264.

      I prioritized delivering the smoothest experience to users over keeping high-quality graphics. Some loss is acceptable in the algorithm. On Google Stadia, there is an additional step to reduce image size on the server, and image frames are rescaled to higher quality before rendering to peers.

      3. Distributed infrastructure with geographic routing.

No matter how optimized the compression algorithm and the code are, the network is still the crucial factor contributing the most to latency. The architecture needs a mechanism to pair the closest server with the user to reduce Round Trip Time (RTT). The architecture contains a single coordinator and multiple streaming servers distributed around the world: US West, US East, Europe, Singapore, China. All streaming servers are fully isolated. The system can adjust its distribution when a server joins or leaves the network. Hence, under high traffic, adding more servers allows horizontal scaling.

      4. Browser Compatible

      Cloud Gaming shines the best when it demands the least from users. This means being able to run on a browser. Browsers help bring the most comfortable gaming experience to users by removing software and hardware installs. Browsers also help provide cross-platform flexibility across mobile and desktop. Fortunately, WebRTC has excellent support across different browsers.

      5. Clear separation of Game interface and service

I see the cloud gaming service as a platform. One should be able to plug any game into the platform. Currently, I have integrated LibRetro with the cloud gaming service because LibRetro offers a beautiful gaming emulator interface for retro games like SNES, GBA, and PS.

      6. Room based mechanism for multiplayer, crowd play and deep-link to the game

      CloudRetro enables many novel gameplays like CrowdPlay and Online MultiPlayer for retro games. If multiple users open the same deep-link on different machines, they will see the same running game as a video stream and even be able to join the game as any player.

      Moreover, Game states are stored on cloud storage. This lets users continue their game at any time on any different device.

      7. Horizontal scaling

Like every SaaS nowadays, the service must be designed to be horizontally scalable. The coordinator-worker design enables adding more workers to serve more traffic.

      8. Cloud Agnostic

CloudRetro's infrastructure is hosted on various cloud providers (Digital Ocean, Alibaba, a custom provider) to target different regions. I Dockerize the infrastructure and configure network settings through bash scripts to avoid dependency on any one cloud provider. Combining this with WebRTC's NAT traversal, we gain the flexibility to deploy CloudRetro on any cloud platform and even on users' own machines.

      Architectural design

      Worker: (or streaming server as referred above) spawns games, runs encoding pipeline, and streams the encoded media to users. Worker instances are distributed around the world, and each worker can handle multiple user sessions concurrently.

      Coordinator: in charge of pairing the new user with the most appropriate worker for streaming. The coordinator interacts with workers over WebSocket.

      Game state storage: central remote storage for all game states. This storage enables some essential functionalities such as remote saving/loading.

      User flow

When a new user opens CloudRetro (steps 1 and 2 in the image below), the coordinator is asked for the frontend page along with the list of available workers. After that, at step 3, the client measures the latency to all candidates using an HTTP ping request. This list of latencies is then sent back to the coordinator so that it can determine the most suitable worker to serve the user. At step 4 below, the game is spawned and a WebRTC stream connection is established between the user and the designated worker.
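
On the coordinator side, the selection itself can be as simple as picking the worker with the lowest latency reported by the client. A minimal sketch, with hypothetical names that are not CloudRetro's actual ones:

package coordinator

// pickWorker returns the worker address with the lowest latency (in ms)
// reported by the client; an empty map yields an empty address.
func pickWorker(latencies map[string]int) (best string) {
    bestLatency := -1
    for addr, ms := range latencies {
        if bestLatency == -1 || ms < bestLatency {
            best, bestLatency = addr, ms
        }
    }
    return best
}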

      Inside the worker

      Inside a worker, game and streaming pipelines are kept isolated and exchange information via an interface. Currently, this communication is done by in-memory transmission over Golang Channels in the same process. Further segregation is the next goal –  i.e., running the game independently in a different process.

      The main pieces are:

      • WebRTC: Client-facing component where user input comes in and the encoded media from the server goes out.
      • Game Emulator: The game component. Thanks to Libretro library, the system is capable of running the game inside the same process and internally hooking media and input flow. In-game frames are captured and sent to the encoder.
• Image/Audio Encoder: The encoding pipeline, which accepts media frames, encodes them in the background, and outputs encoded images/audio.
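
A minimal sketch of how the channel-based hand-off between the emulator and the encoder could be wired; the names are illustrative, not CloudRetro's actual code:

package worker

// RawFrame is a captured emulator frame; EncodedFrame is its compressed form.
type RawFrame []byte
type EncodedFrame []byte

// runPipeline consumes raw frames from the emulator and emits encoded frames
// for the WebRTC component, keeping the encoder off the game loop.
func runPipeline(raw <-chan RawFrame, encode func(RawFrame) EncodedFrame) <-chan EncodedFrame {
    out := make(chan EncodedFrame, 1)
    go func() {
        defer close(out)
        for frame := range raw {
            out <- encode(frame)
        }
    }()
    return out
}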


      CloudRetro relies on WebRTC as the backbone, so before going into details about my implementation in Golang, the first section is dedicated to introducing WebRTC technology. It is an awesome technology that greatly helps me achieve sub-second latency streaming.


      WebRTC is designed to enable high-quality peer-to-peer connections on native mobile and browsers with simple APIs.

      NAT Traversal

WebRTC is renowned for its NAT traversal functionality. WebRTC is designed for peer-to-peer communication. It aims to find the most suitable direct route, avoiding NAT gateways and firewalls, for peer communication via a process named ICE. As part of this process, the WebRTC APIs find your public IP address using STUN servers and fall back to a relay server (TURN) when direct communication cannot be established.
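
Since CloudRetro is built on Pion, the ICE configuration boils down to a few lines. As a sketch (the public Google STUN server is only an example here, not necessarily what CloudRetro uses, and a TURN entry could be added as a fallback):

package worker

import "github.com/pion/webrtc/v3"

// Example Pion configuration with a STUN server so peers can discover their
// public (server-reflexive) address during ICE.
var peerConfig = webrtc.Configuration{
    ICEServers: []webrtc.ICEServer{
        {URLs: []string{"stun:stun.l.google.com:19302"}},
    },
}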

However, CloudRetro doesn't fully utilize this capability. Its peer connections are not between users and users but between users and cloud servers. The server side of the model has fewer restrictions on direct communication than a typical user device. This allows doing things like pre-opening inbound ports or using public IP addresses directly, as the server is not behind NAT.

I previously had ambitions of developing the project into a game distribution platform for cloud gaming. The idea was to let game creators contribute games and streaming resources. Users would be paired with game creators' providers directly. In this decentralized manner, CloudRetro is just a medium to connect third-party streaming resources with users, so it is more scalable when the burden of hosting does not rest on CloudRetro anymore. WebRTC NAT traversal will play an important role here, as it eases peer connection initialization with third-party streaming resources, making it effortless for creators to join the network.

      Video Compression

Video compression is an indispensable part of the pipeline that greatly contributes to a smooth streaming experience. Even though it is not compulsory to fully know all of VP8/H264's video coding details, understanding the concepts helps to demystify streaming speed parameters, debug unexpected behavior, and tune the latency.

      Video Compression for a streaming service is challenging because the algorithm needs to ensure the total encoding time + network transmission + decoding time is as small as possible. In addition, the encoding process needs to be in serial order and continuous. Some traditional encoding trade-offs are not applicable – like trading long encoding time for smaller file size and decoding time or compressing without order.

The idea of video compression is to omit non-essential bits of information while keeping an understandable level of fidelity for users. In addition to encoding individual static image frames, the algorithm infers the current frame from previous and future frames, so only the difference is sent. As you can see in the Pacman example below, only the differential dots are transferred.

      Audio Compression

Similarly, the audio compression algorithm omits data that cannot be perceived by humans. The audio codec with the best performance currently is Opus. Opus is designed to transmit audio over an ordered datagram protocol such as RTP (Real-time Transport Protocol). It produces lower latency than MP3 or AAC, with higher quality. The delay is usually around 5 to 66.5 ms.

      Pion, WebRTC in Golang

Pion is an open-source project that brings WebRTC to Golang. Rather than simply wrapping the native C++ WebRTC libraries, Pion is a native Golang implementation, offering better performance, better Golang integration, and control over the versions of the constituent WebRTC protocols.

The library also provides sub-second latency streaming with many great built-ins. It has its own implementations of STUN, DTLS, SCTP, etc., and some experiments with QUIC and WebAssembly. The open-source library itself is a really good source of learning, with great documentation, network protocol implementations, and cool examples.

      The Pion community, led by a very passionate creator, is lively and has many quality discussions about WebRTC. If you are interested in this technology, please join – you will learn many new things.

Writing CloudRetro in Golang

      Go Channel In Action

Thanks to Go's beautiful channel design, event streaming and concurrency problems are greatly simplified. As in the diagram, there are multiple components running in parallel in different goroutines. Each component manages its own state and communicates over channels. Golang's select statement enforces that one atomic event is processed per game tick. This means locking is unnecessary with this design. For example, when a user saves, a complete snapshot of the game state is required. This state needs to remain uninterrupted by incoming input until the save is complete. During each game tick, the backend can only process either a save operation or an input operation, so it is concurrency safe.
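
A simplified sketch of that loop shape (illustrative only, not CloudRetro's actual code): each tick the select statement handles exactly one event, so a save can never observe a half-applied input.

package worker

// Input is one player action; the save channel carries a reply channel that
// receives a consistent snapshot of the game state.
type Input struct{ Player, Key int }

func gameLoop(inputs <-chan Input, saves <-chan chan []byte, apply func(Input), snapshot func() []byte) {
    for {
        select {
        case in := <-inputs:
            apply(in) // advance game state with one input
        case reply := <-saves:
            reply <- snapshot() // snapshot taken with no input interleaved
        }
    }
}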

      Fan-in / Fan-out

This Golang pattern perfectly matches my use case for CrowdPlay and multiplayer. Following this pattern, all user inputs in the same room are fanned in to a central input channel. Game media is then fanned out to all users in the same room. Hence, we achieve game state sharing between multiple gaming sessions from different users.
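
Sketched in Go (reusing the hypothetical Input and EncodedFrame types from the sketches above), fan-in and fan-out look roughly like this:

package worker

// fanIn merges the input channels of every player in a room into one channel
// that feeds the single game loop.
func fanIn(players []<-chan Input) <-chan Input {
    merged := make(chan Input)
    for _, p := range players {
        go func(c <-chan Input) {
            for in := range c {
                merged <- in
            }
        }(p)
    }
    return merged
}

// fanOut copies every encoded frame to every viewer in the room.
func fanOut(frames <-chan EncodedFrame, viewers []chan<- EncodedFrame) {
    for f := range frames {
        for _, v := range viewers {
            v <- f
        }
    }
}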

      Synchronization between different sessions

      Downsides of Golang

Golang isn't perfect. Channels are slow. Compared to locks, Go channels are just a simpler way to handle concurrency and streaming events, but they do not give the best performance. There is complex locking logic under the hood of a channel. Hence, I made some adjustments to the implementation, reapplying locks and atomic values in place of channels to optimize performance.
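
For example, a hot path that only ever needs the latest value can publish it through sync/atomic instead of a channel. An illustrative sketch, not CloudRetro's actual code:

package worker

import "sync/atomic"

// latestFrame holds the most recent encoded frame as a []byte; writers simply
// overwrite it and readers load it lock-free, with no channel in the hot path.
var latestFrame atomic.Value

func publishFrame(f []byte) { latestFrame.Store(f) }

func currentFrame() []byte {
    if v := latestFrame.Load(); v != nil {
        return v.([]byte)
    }
    return nil
}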

      In addition, the Golang garbage collector is uncontrollable, so sometimes there are some suspicious long pauses. This greatly hurts the realtime-ness of this streaming application.


The project uses some existing Golang open-source VP8/H264 libraries for media compression and Libretro for game emulators. All of these libraries are just wrappers around C libraries using CGO. There are some drawbacks, which you can read about in this blog post by Dave. The issues I'm facing are:

• A crash in CGO is not caught, even with Golang recovery
• Unable to identify performance bottlenecks because we cannot detect granular issues under CGO

      Conclusion ...  Continue reading

As you may have heard, WhatsApp discovered a security issue in their client which was actively exploited in the wild. The exploit did not require the target to pick up the call, which is really scary.
Since there are not many facts to go on, let's do some tea-leaf reading…

    The security advisory issued by Facebook says

    A buffer overflow vulnerability in WhatsApp VOIP stack allowed remote code execution via specially crafted series of SRTCP packets sent to a target phone number.

This is not much detail; investigations are probably still ongoing. I would very much like to hear a post-mortem on how WhatsApp detected the abuse.

    We know there is an issue with SRTCP, the control protocol used with media transmission. This can mean two things:

    1. there is an issue with how RTCP packets are decrypted, i.e. at the SRTCP layer
    2. there is an issue in the RTCP parser

SRTCP is quite straightforward, so a bug in the RTCP parser is more likely. As I said last year, I was surprised Natalie Silvanovich's fuzzer (named "Fred" because why not?) did not find any issues in the RTCP parser.

We actually have a few hard facts provided by the binary diff from Checkpoint Research, in which they analyzed how the patched version differs.

    They found two interesting things:

    • there is an additional size check in the RTCP module, ensuring less than 1480 bytes
    • RTCP is processed before the call is answered

Let's talk about RTCP

RTCP, the RTP Control Protocol, is a rather complicated protocol described in RFC 3550. It provides feedback about how the RTP media stream is doing, such as packet loss. A UDP datagram can multiplex multiple individual RTCP packets into what is called a compound packet. When an RTCP compound packet is encrypted using SRTCP, all of the packets are encrypted together with a single authentication tag that is usually 12 bytes long.
To make demuxing compound packets possible, each individual RTCP packet specifies its length in a 16-bit field. For example, a sender report packet starts like this:

    The length field is defined as

    length: 16 bits
    The length of this RTCP packet in 32-bit words minus one,
    including the header and any padding. (The offset of one makes
    zero a valid length and avoids a possible infinite loop in
    scanning a compound RTCP packet, while counting 32-bit words
    avoids a validity check for a multiple of 4.)

which is… rather complicated. In particular, this definition means that the RTCP parser MUST validate the length field against the length of the datagram and the remaining bytes in the packet. Some RTCP packets even have additional internal length fields.
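
To make that concrete, here is a rough Go sketch of the validation a compound-packet parser has to perform. This is purely illustrative and has nothing to do with WhatsApp's actual (C++) code:

package rtcpcheck

import "errors"

// walkCompound iterates over the individual RTCP packets in a compound packet,
// checking each declared length against the bytes actually remaining.
func walkCompound(buf []byte) error {
    for len(buf) > 0 {
        if len(buf) < 4 {
            return errors.New("truncated RTCP header")
        }
        // The length field is in 32-bit words minus one: add 1, multiply by 4.
        words := int(buf[2])<<8 | int(buf[3])
        size := (words + 1) * 4
        if size > len(buf) { // the check that must not be skipped
            return errors.New("RTCP length field exceeds remaining bytes")
        }
        // ... hand buf[:size] to the per-type parser here ...
        buf = buf[size:]
    }
    return nil
}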

For the first packet in a compound packet, length validation is usually done by the library implementing SRTCP, like libSRTP. Mind you, WhatsApp probably uses PJSIP and PJMEDIA, or at least they did back in 2015 when I took a look.

    The length check for the second packet needs to be done by the RTCP library. I would not be surprised if this is where things went south. Been there, done that. And it remains a bit unclear whether the length field is validated against the remaining bytes. 1480 seems like a very odd number to check for though. At first I thought this made sense since it was 1492 minus the 12 bytes for the SRTCP tag but the maximum payload size of UDP turned out to be 1472 bytes, not 1492. So now I end up being confused again…

    Don’t process data from strangers

There is another issue here. As the New York Times article said, it looks like the victims received calls they never answered.

    Checkpoint’s analysis ...  Continue reading

    A while ago we looked at how Zoom was avoiding WebRTC by using WebAssembly to ship their own audio and video codecs instead of using the ones built into the browser’s WebRTC.  I found an interesting branch in Google’s main (and sadly mostly abandoned) WebRTC sample application apprtc this past January. The branch is named wartc… a name which is going to stick as warts!

    The repo contains a number of experiments related to compiling the library as WebAssembly and evaluating the performance. From the rapid timeline, this looks to have been a hackathon project.

    Project Architecture

    The project includes a few pieces:

    • encoding individual images using libwebp as a baseline
    • using libvpx for encoding video streams
    • using WebRTC for audio and video

    Transferring the resulting data is done using the WebRTC DataChannel. The high-level architecture is shown below:

I decided to apply the WebAssembly (aka wasm) techniques to the WebRTC sample pages since apprtc is a bit cumbersome to set up. The sample pages are an easier framework to work with and don't need a signaling server. Actually, you do not need any server of your own since you can simply run them from GitHub Pages.

    You can find the source code here.

    Basic techniques

    Before we get to WebAssembly, first we need to walk through the common techniques to capture and render RTC audio and video. We will use WebAudio to grab and play raw audio samples and the HTML5 canvas element to grab frames from a video and render images.

    Let’s look at each.


    Grab audio from a stream using WebAudio

To grab audio from a MediaStream one can use a WebAudio MediaStreamSource and a ScriptProcessor node. After connecting these, the ScriptProcessorNode's onaudioprocess callback is called with an object containing the raw samples, which can then be sent over the DataChannel.

    Render audio using WebAudio

Rendering audio is a bit more complicated. The solution that seems to have come up during the hackathon is quite a good hack. It creates an AudioContext with a square wave as input, connects it to a ScriptProcessorNode, and pulls data from a buffer (which is fed by the data channel) at a high frequency. Then the ScriptProcessorNode is connected to the AudioContext's destination node, which will play out things without needing an audio element, similar to what we have seen Zoom do.

    Try these here. Make sure to open about:webrtc or chrome://webrtc-internals to see the WebRTC connection in action.


    Grab images from a canvas element

    Grabbing image data from a canvas is quite simple. After creating a canvas element with the desired width and height, we draw the video element to the canvas context and then call getImageData  to get the data which we can then process further.

    Draw images to a canvas element

Drawing to a canvas is equally simple: create a frame of the right size and then set the frame's data to the data we want to draw. When this is done at a high enough frequency the result looks quite smooth. To optimize the process, this could be coordinated with the rendering using the requestAnimationFrame method.

    Try these here

    Encoding images using libwebp

Encoding images using libwebp is a baseline experiment. Each frame is encoded as an individual image, with no reference to other frames like in a video codec. If this example did not deliver acceptable visual quality, it would not make sense to expand the experiment to a more advanced stage.
The code is a very simple extension of the basic code that grabs and renders frames. The only difference is a synchronous call to

    The DataChannel limits the transmitted object size to  65kb. Anything larger needs to be fragmented, which means more code. In our sample we use a 320×240 resolution. At this low resolution, we come in below 65kb and do not need to fragment and reassemble the data.

    We show a side-by-side example of this and WebRTC here. The visual quality is comparable but the webp stream seems to be slightly delayed. This is probably only visible in a side-by-side comparison.

In terms of bitrate this can easily go up to 1.5mbps for a 320×240 stream (and using a decent CPU). WebRTC clearly delivers the same quality at a much lower bitrate. Also, you will notice the page gets throttled when in the background and setInterval is no longer executed at a high frequency.

    Encoding a video stream using libvpx

The WebP example encodes a series of individual images. But we are really encoding a series of images that follow each other in time and therefore repeat a lot of content that is easily compressed. Using libvpx, an actual video encoding/decoding library that contains the VP8 and VP9 codecs, has some benefits as we will see.

Once again, the basic techniques for grabbing and rendering the frames remain the same. The encoder and decoder are run in a WebWorker this time, which means we need to use postMessage to send and receive frames.
Sending is done with

    and the encoder will send a message containing the encoded frame:

Our data is bigger than 65kb this time so we need to fragment it before sending it over the DataChannel:

    and reassemble on the other side.

Looking at this in comparison with WebRTC, we get pretty close already; it is hard to spot a visual difference. The bitrate is much lower as well, only using about 500kbps compared to the 1.5mbps of the webp sample.

Note that if you want to tinker with libvpx it is probably better to use Surma's webm-wasm, which also gives you the source code used to build the WebAssembly module.

    Easy simulation of packet loss

    There is an interesting experiment one can do here: drop packets.

The easiest way to do so is to introduce some code that stops vpxenc._onmessage from sending some data:

The decoder will still attempt to decode the stream but there are quite severe artifacts until the next keyframe arrives. WebRTC (which is based on RTP) has built-in mechanisms to recover from packet loss by either requesting the sender to resend a lost packet (using an RTCP NACK) or to send a new keyframe (using an RTCP PLI). This could be added to the protocol that runs on top of the DataChannel of course, or the WebAssembly could simply emit and process RTP and RTCP packets like Zoom does.

    Compiling WebRTC to WebAssembly

    Find the sample here.

Last but not least, the hackathon team managed to compile WebRTC as WebAssembly, which is no small feat. Note that synchronization between audio and video is a very hard problem that we have ignored so far. WebRTC does this for us magically.

We are focusing on audio this time as this was the only thing we got to work. The structure is a bit different from the previous examples – this time we are using a Transport object that is defined in WebAssembly

    The sendPacket function is called with each RTP packet which is already guaranteed to be smaller than 65k which means we do not need to split it up ourselves.

This turned out to not work very well – audio was severely distorted. One of the problems is confusion between the platform/WebAudio sampling rate of 44.1kHz and the Opus sampling rate of 48kHz. This is fairly easy to fix though by replacing a couple of references to 480 with 441.

    In addition, the encoder seems to lock up at times, for reasons which are not really possible to debug without access to the source code used to build this. Simply recording the packets and playing them out in the decoder was working better. Clearly this needs a bit more work.


The hackathon results are very interesting. They show what is possible today, even without WebRTC NV's lower-level APIs, and give us a much better idea of what problems still need to be solved there. In particular, the current ways of accessing raw data and feeding it back into the engine shown above are cumbersome and inefficient at present. It would of course be great if the Google folks would be a bit more open about these results which are quite interesting (nudge 🙂 ) but… the repository is public at least.

Zoom's web client still achieves a better result…

{“author”: “Philipp Hancke“}

...  Continue reading


    QUIC-based DataChannels are being considered as an alternative to the current SCTP-based transport. The WebRTC folks at Google are experimenting  with it:

Let's test this out. We'll do a simple single-page example similar to the WebRTC datachannel sample that transfers text ...  Continue reading

    Deploying media servers for WebRTC has two major challenges, scaling beyond a single server as well as optimizing the media latency for all users in the conference. While simple sharding approaches like “send all users in conference X to server Y” are easy to scale horizontally, they are far from optimal in terms of the media latency which is a key factor in the user experience. Distributing a conference to a network of servers located close to the users and interconnected with each other on a reliable backbone promises a solution to both problems at the same time. Boris Grozev from the Jitsi team describes the cascading SFU problem in-depth and shows their approach as well as some of the challenges they ran into.

    {“editor”: “Philipp Hancke“}

    Real-time communication applications are very sensitive to network conditions such as throughput, delay, and loss. Lower bitrates lead to lower video quality and longer network latency leads to a longer end-to-end delay in audio and video. Packet loss can lead to “choppy” audio and video freezes due to video frame skipping.

    Because of this it is important to select an optimal path between the endpoints in a conference. When there are only two participants this is relatively straightforward – the ICE protocol is used by WebRTC to establish a connection between the two endpoints to exchange multimedia. The two endpoints connect directly if possible, and otherwise use a TURN relay server in less typical situations. WebRTC supports resolving a domain name to get the TURN server address, which makes it easy to select a local TURN server based on DNS, for example by using AWS Route53’s routing options.

However, when a conference has more participants routed through a centralized media server, the situation is much more complex. Many WebRTC services like Hangouts, Slack, and our own use a Selective Forwarding Unit (SFU) to more efficiently relay audio and video among 3 or more participants.

    The Star Problem

    In this case all endpoints connect to a central server (in a star topology) with which they exchange multimedia. It should be obvious that selecting the location of the server has a huge impact on user experience — if all participants in the conference are located in the US, using a server in Sydney is not a good idea.

    Most services use a simple approach which works well a lot of the time — they select a server close to the first participant in the conference. However, there are some cases where this isn’t optimal. For example, suppose we have the three participants as shown in the diagram above – two are based on the East Coast of the US and the third is in Australia. If the Australian participant (Caller C) joins the conference first, this algorithm selects the server in Australia (Server 2), but Server 1 in the US is a better choice since it is closer to the majority of participants.

    Scenarios such as these are not common, but they do happen. Assuming the order in which participants join is random, this happens in ⅓ of conferences with 3 participants where one is in a remote location.

    Another scenario which happens more often is illustrated in the diagram below: we have two groups of participants in two locations. In this case the order of joining doesn’t matter, we will always have some pairs of users that are close to each other, but their media has to go through a server in a remote location. For example, in the image below there are 2 Australian callers (C&D) and 2 US callers (A&B) .

    Switching to Server 1 is non-optimal for Callers C&D. Server 2 is non-optimal for callers A&B. Whether we use Server 1 or Server 2 there will be some participants connected through a non-optimal remote server.

    What if we weren’t limited to using one server? We could have every participant connected to a local server, we just have to interconnect the servers.

    Solution: Cascading

Postponing the question of how we actually interconnect the servers, let's first look at what effect this has on the conference.

    The SFU connection from C to D hasn’t changed – that still goes through Server 2. For the connection between A and B we use Server 1 instead of Server 2 as in the previous diagram  which is obviously better. The interesting part is actually the connection from A to C (or any of the others, for which the effects are analogous). Instead of using A<=>Server 2<=>C we use A<=>Server 1<=>Server 2<=>C.

    Non-intuitive trip time impacts

    Connecting SFU bridges like this has advantages and disadvantages. On the one hand, our results show that in such situations the end-to-end round-trip-time is higher when we add additional hops. On the other hand, reducing the round trip time from the client to the first server that it is connected to has an advantage on its own, because we can perform stream repair with lower latency on a hop-by-hop basis.

How does that work? WebRTC uses RTP, usually over UDP, to transfer media. This means that the transport is not reliable. When a UDP packet is lost in the network, it is up to the application to either ignore/conceal the loss, or request a retransmission using an RTCP NACK packet. For example, the application might choose to ignore lost audio packets and request retransmission for some but not all video packets (depending on whether they are required for decoding of subsequent frames or not).

    With cascaded bridges, these retransmissions can be limited to a local server. For example, in the A-S1-S2-C path, if a packet is lost between A and S1, S1 will notice and request retransmission. If a packet is lost between S2 and C, C will request retransmission and S2 will respond from its cache. And if a packet is lost between two servers, the receiving server can request a retransmission.

    Clients use a jitter buffer to delay the playback of video, in order to allow for delayed or retransmitted packets to arrive. The size of this buffer changes dynamically based in part on the round-trip time. When retransmissions are performed hop-by-hop, the latency is lower, and therefore the jitter buffer can be shorter, leading to lower overall delay.

    In short, even though the end-to-end round-trip-time is higher with an extra server, this could lead to lower end-to-end media delay (but we have yet to explore this effect in practice).

    Implementing a Cascading SFU

    So how do we implement this in Jitsi Meet, and how do we deploy it on

    Signaling vs. Media

    Let us look at signaling first. Since its inception, Jitsi Meet has separated the concept of a signaling server (which is now Jicofo) and a media server/SFU (jitsi-videobridge). This separation allowed us to implement support for cascaded bridges relatively easily. For one thing, we could just keep all the signaling logic in a central place — Jicofo. Second, we already had the protocol for signaling between Jicofo and Jitsi Videobridge (COLIBRI). We only had to add a small extension to it. We already had support for multiple SFUs connected to one signaling server (for load balancing). Now we had to add the option for one SFU to connect to multiple signaling servers.

    We ended up with two independent pools of servers — one pool of jicofo instances and one pool of jitsi-videobridge instances. The diagram below illustrates part of this.

    The second part of our system is the bridge-to-bridge communication. We wanted to keep this part as simple as possible, and therefore we decided to not do any explicit signaling between the bridges. All signaling happens between jicofo and jitsi-videobridge, and the connection between two bridges is only used for audio/video and data channel messages coming from clients.

    The Octo protocol

To coordinate this communication we came up with the Octo protocol, which wraps RTP packets in a simple fixed-length header and allows transporting string messages. In the current implementation, the bridges are connected to each other in a full mesh, but the design allows for other topologies as well, for example using a central relay server (a star of bridges), or a tree structure for each bridge.

Footnote: note that instead of prepending the Octo header, it could be added as an RTP header extension, making the streams between bridges pure (S)RTP. Future versions of Octo might use this approach.

    Second footnote: Octo doesn’t really stand for anything. We were initially planning to use a central relay, and for some reason it reminded us of an octopus, so we kept that name for the project.

In Jitsi Videobridge terminology, when a bridge is part of a multi-bridge conference, it has an additional Octo channel (actually one channel for audio and one for video). This channel is responsible for forwarding the media to all other bridges, as well as receiving media from all other bridges. Each bridge binds to a single port for Octo (4096 by default), which is why we need the conference ID field to be able to handle multiple conferences at once.
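
As a rough illustration only (Jitsi Videobridge is written in Java and the real Octo header carries more than shown here), prefixing an RTP packet with a conference ID might look like this:

package octosketch

import "encoding/binary"

// wrap prepends a minimal fixed-length header carrying the conference ID so
// that a single UDP port can multiplex packets from many conferences.
func wrap(confID uint32, rtp []byte) []byte {
    out := make([]byte, 4+len(rtp))
    binary.BigEndian.PutUint32(out[:4], confID)
    copy(out[4:], rtp)
    return out
}

// unwrap splits a received packet back into conference ID and RTP payload.
func unwrap(pkt []byte) (confID uint32, rtp []byte) {
    if len(pkt) < 4 {
        return 0, nil // malformed packet
    }
    return binary.BigEndian.Uint32(pkt[:4]), pkt[4:]
}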

    Currently the protocol does not have its own security mechanism and we delegate that responsibility to lower layers. This is something that we want to work on next, but for the time being the bridges need to be in a secure network (we use a separate AWS VPC).

    Use with Simulcast

One of the distinguishing features of Jitsi Meet is simulcast, where each participant sends multiple streams of different bitrates and the bridge helps select the ones that are needed. We wanted to make sure that this continues to work robustly, so we forward all of the simulcast streams between the bridges. This allows for quicker switching between streams (because the local bridge doesn't have to request a new stream). However, it is not optimal in terms of bridge-to-bridge traffic, because some of the streams are often not used and just consume extra bandwidth for no benefit.

    Active Speaker Selection

    We also wanted to continue to support following the active speaker in a conference (giving them the most real estate). This turned out to be easy — we just have each bridge do the dominant speaker identification independently, and notify its local clients (this is also the approach others have used). This means that the calculation is done multiple times, but it is not expensive, and allows us to simplify things (e.g. we don’t have to decide which bridge does DSI, and worry about routing the messages).

    Bridge Selection

    With the current implementation, the bridge selection algorithm is simple. When a new participant joins, Jicofo needs to decide which bridge to allocate to it. It does so based on the region of the client and the regions and load of the bridges available to it. If there is an available bridge in the same region as the client, it’s used. Otherwise, one of the existing conference bridges is used.
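
Restated as code, the rule is roughly the following. This is a simplified sketch; the real logic lives in Jicofo (in Java) and also weighs bridge load:

package selection

// Bridge describes a jitsi-videobridge instance known to the signaling server.
type Bridge struct {
    ID     string
    Region string
}

// selectBridge prefers a bridge in the client's region and otherwise falls
// back to a bridge already used by the conference.
func selectBridge(clientRegion string, available, inConference []Bridge) (Bridge, bool) {
    for _, b := range available {
        if b.Region == clientRegion {
            return b, true
        }
    }
    if len(inConference) > 0 {
        return inConference[0], true
    }
    return Bridge{}, false
}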

    For documentation about setting up Octo, see here.

    Deploying Cascading SFU’s

    We have now enabled geographical bridge cascading, as described above, on

    For this deployment we run all machines in Amazon AWS. We have servers (both signaling and media) in six regions:

    • us-east-1 (N. Virginia),
    • us-west-2 (Oregon),
    • eu-west-1 (Ireland),
    • eu-central-1 (Frankfurt),
    • ap-se-1 (Singapore) and
    • ap-se-2 (Sydney).

    We use a layer of geolocated HAProxy instances which help to determine which region a client is coming from. The domain is managed by Route53 and resolves to an HAProxy instance, which adds its own region to the HTTP headers of the request it forwards. This header is then used to set the value of the config.deploymentInfo.userRegion  variable made available to the client via the /config.js  file.

    For diagnostics and to demonstrate this feature, the user interface on shows how many bridges are in use, and where each participant is connected to. Scrolling over the top left part of your local thumbnail shows you the number of servers and the region of the server you are connected to. Scrolling over a remote thumbnail shows you the region of the server the remote participant is connected to, and the end-to-end round trip time between your browser and theirs (as E2E RTT).


    We initially launched Octo as an A/B test on in August. The initial results looked good and it is now enabled for everyone. We have a lot of data to go through and we are planning to look at how well Octo performs in detail and write more about it. We are also planning to use this work as the first stepping stone towards supporting larger conferences (for which a single SFU is not sufficient). So stay tuned for more about this in the coming months.

    If you have any questions or comments, you can drop us a message on our

    community forums ...  Continue reading