Search Results for: medooze

Pion seemingly came out of nowhere to become one of the biggest and most active WebRTC communities. Pion is a Go-based set of WebRTC projects. Golang is an interesting language, but it is not among the most popular programming languages out there, so what is so special about Pion? Why are there so many developers involved in this project? 

To learn more about this project and how it came to be among the most active WebRTC organizations, I interviewed its founder – Sean Dubois. We discuss Sean’s background and how he got started in RTC, so see the interview for those details. I really wanted to understand why he decided to build a new WebRTC project and why he continues to spend so much of his free time on it. ...  Continue reading

If you plan to have multiple participants in your WebRTC calls then you will probably end up using a Selective Forwarding Unit (SFU). Capacity planning for SFUs can be difficult – there are estimates to be made for where they should be placed, how much bandwidth they will consume, and what kind of servers you need.

To help network architects and WebRTC engineers make some of these decisions, webrtcHacks contributor Dr. Alex Gouaillard and his team at CoSMo Software put together a load test suite to measure load vs. video quality. They published their results for all of the major open source WebRTC SFUs. This suite is based on the Karoshi Interoperability Testing Engine (KITE), which Google funded and uses to show interoperability status. The CoSMo team also developed a machine learning based video quality assessment framework optimized for real time communications scenarios.

First an important word of caution – asking what kind of SFU is the best is kind of like asking what car is best. If you only want fast then you should get a Formula 1 car but that won’t help you take the kids to school. Vendors never get excited about these kinds of tests because it boils down their functionality into just a few performance metrics. These metrics may not have been a major part of their design criteria and a lot of times they just aren’t that important. For WebRTC SFUs in particular, just because you can load a lot of streams on an SFU, there may be many resiliency, user behavior, and cost optimization reasons for not doing that. Load tests also don’t take a deep look at the end-to-end user experience, ease of development, or all the other functional elements that go into a successful service. Lastly, a published report like this represents a single point in time – these systems are always improving so results might be better today.

That being said, I personally have had many cases where I wish I had this kind of data when building out cost models. Alex and his team have done a lot of thorough work here and this is a great sign of maturity in the WebRTC open source ecosystem. I personally reached out to each of the SFU development teams mentioned here to ensure they were each represented fairly. This test setup is certainly not perfect, but I do think it will be a useful reference for the community.

Please read on for Alex’s test setup and analysis summary.

{“editor”: “chad hart“}


One recurring question on the discuss-webrtc mailing list is “What is the best SFU”. This invariably produces a response of “Mine obviously” from the various SFU vendors and teams. Obviously, they cannot all be right at the same time!

You can check the full thread here. Chad Hart, then with Dialogic, answered kindly, recognizing the problem and expressing a need:

In any case, I think we need a global (same applied to all) reproducible and unbiased (source code available, and every vendor can tune their installation if they want) benchmark, for several scalability metrics.

Three years later my team and I have built such a benchmark system. I will explain how this system works and show some of our initial results below.

The Problem

Several SFU vendors provide load testing tools. Janus has Jattack. Jitsi has jitsi-hammer and even published some of their results. Jitsi in particular has done a great job with transparency and provides reliable data and enough information to reproduce the results. However, not all vendors have these tools and fewer still make them fully publicly available.  In addition, each tool is designed to answer slightly different questions for their own environments such as:

  • How many streams can a single server instance of chosen type and given bandwidth limit handle?
  • How many users can I support on the same instance?
  • How many users can I support in a single conference?
  • Etc.…

There was just no way to make a real comparative study – one that is independent, reproducible, and unbiased. The inherent ambiguity also opened the door for some unsavory behavior from some who realized they could get away with any claim because no one could actually check them. We wanted to produce some results that one does not have to take on faith and that could be peer-reviewed.

What use cases?

To have a good answer to “What is the best SFU?” you need to explain what you are planning to use it for.

We chose to work on the two use cases that seemed to gather the most attention, or at least those which were generating the most traffic on discuss-webrtc:

  1. Video conferencing – many-to-many, all equals, one participant speaking at a time (hopefully)
  2. Media streaming – one-to-many, unidirectional

Most video conferencing questions are focused on a single server instance. Having 20+ people in a given conference is usually plenty for most. Studies like this one show that in most social cases most of the calls are 1-1, and the average is around 3. This configuration fits a single small instance in any public cloud provider very well (as long as you get a 1Gbps NIC). You can then use very simple load balancing and horizontal scalability techniques since the ratio of senders to viewers is rarely high. Media streaming, on the other hand, typically involves streaming from a single source to thousands or tens of thousands of viewers. This requires a multi-server hierarchy.

We wanted to accommodate different testing scenarios and implement them in the same fashion across several WebRTC Servers so that the only difference is the system being tested, and the results are not biased.

For purposes of this post I will focus on the video conferencing scenario. For those that are interested, we are finalizing our media streaming test results and plan to present them  at Streaming Media West on November 14th.

The test suite

In collaboration with Google and many others, we developed KITE, a testing engine that would allow us to support all kinds of clients – browsers and native across mobile or desktop – and all kinds of test scenarios easily. It is used to test WebRTC implementations every day across browsers as seen on

Selecting a test client

Load testing is typically done with a single client to control for client impacts. Ideally you can run many instances of the test client in parallel in a single virtual machine (VM). Since this is WebRTC, it makes sense to use one of the browsers. Edge and Safari are limited to a single process, which does not make them very suitable. Additionally, Safari only runs on macOS or iOS, which only run on Apple hardware. It is relatively easy to spawn a million VMs on AWS if you’re running Windows or Linux. It’s quite a bit more difficult, and costly, to set up one million Macs, iPhones, or iPads for testing (Note, I am still dreaming about this though).

That leaves you with Chrome or Firefox which allow multiple instances just fine. It is our opinion that the implementation of webdriver for Chrome is easier to manage with fewer flags and plugins (i.e. H264) to handle, so we chose to use Chrome.

Systems Under Test

We tested the following SFUs:

To help make sure each SFU showed its best results, we contacted the teams behind each of these projects. We offered to let them set up the server themselves or connect to the servers and check their settings. We also shared the results so they could comment. That made sure we properly configured each system to perform optimally for our test.

Interestingly enough, during the life of this study we found quite a few bugs and worked with the teams to improve their solutions. This is discussed in more detail in the last section.

Test Setup

We used the following methodology to increase traffic to a high load. First we populated each video conference room with one user at a time until it reached 7 total users. We repeated this process until the total target number of users was reached: close to 500 simultaneous users.
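The ramp-up described above can be sketched as a simple join schedule. This is only an illustration, not the KITE implementation: the function name `ramp_up_schedule` and its parameters are hypothetical, and the defaults (7 users per room, 500 users in total) are the figures quoted in the text.

```python
from typing import Iterator, Tuple


def ramp_up_schedule(total_users: int = 500, room_size: int = 7) -> Iterator[Tuple[int, int]]:
    """Yield (room_index, position_in_room) in the order users would join.

    Hypothetical helper: rooms are filled one user at a time, and a new room
    is opened only once the current one holds `room_size` users.
    """
    for user in range(total_users):
        yield user // room_size, user % room_size


if __name__ == "__main__":
    schedule = list(ramp_up_schedule())
    rooms = {room for room, _ in schedule}
    print(f"{len(schedule)} users spread over {len(rooms)} rooms")
```

With these defaults the last room is only partly filled, since 500 is not a multiple of 7.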

The diagram below shows the elements in the testbed:


Most people interested in scalability questions will measure the CPU, RAM, and bandwidth footprints of the server as the “load” (streams, users, rooms…) ramps up. That is a traditional way of doing things that supposes that the quality of the streams, their bitrate… all stay equal.

WebRTC’s encoding engine makes this much more complex. Because WebRTC includes bandwidth estimation, bitrate adaptation, and overall congestion control mechanisms, one cannot assume streams will remain unmodified across the experiment. In addition to the usual metrics, the tester also needs to record client-side metrics like sent bitrate, bandwidth estimation results and latency. It is also important to keep an eye on the video quality, as it can degrade way before you saturate the CPU, RAM and/or bandwidth of the server.

On the client side, we ended up measuring the following:

  • Rate of success and failures (frozen video, or no video)
  • Sender and receiver bitrates
  • Latency
  • Video quality (more on that in the next section)

Measuring different metrics on the server side can be as easy as polling the getStats API yourself or integrating an existing solution. In our case, we measured:

  • CPU footprint,
  • RAM footprint,
  • Ingress and egress bandwidth,
  • number of streams,
  • along with a few other less relevant metrics.

The metrics above were not published in the scientific article because of space limitations, but should be released in a subsequent research report.

All of these metrics are simple to produce and measure with the exception of video quality. What is an objective measure of video quality? Several proxies for video quality exist such as Google rendering time, received frames, bandwidth usage, but none of these gave an accurate measure.

Video quality metric

Ideally a video quality metric would be visually obvious when impairments are present. This would allow one to measure the relative benefits of resilient techniques, such as Scalable Video Coding (SVC), where conceptually the output video has a looser correlation with jitter, packet loss, etc. than other encoding methods. See the below video from Agora for a good example of a visual comparison:

After doing some quick research on a way to automate this kind of visual quality measurement, we realized that nobody had developed a method to assess the video quality of a real-time stream as well as a human would in the absence of reference media. So, we went on to develop our own metric leveraging Machine Learning with neural networks. This allowed for real-time, on-the-fly video quality assessment. As an added benefit, it can be used without recording customer media, which is sometimes a legal or privacy issue.

The specifics of this mechanism are beyond the scope of this article but you can read more about the video quality algorithm here. The specifics of this AI-based algorithm have been submitted for publication and will be made public as soon as it is accepted.

Show me the money results

We set up the following five open-source WebRTC SFUs, using the latest source code downloaded from their respective public GitHub repositories (except for Kurento/OpenVidu, for which the Docker container was used):

Each was set up in a separate but identical virtual machine with the default configuration.


First a few disclaimers. All teams have seen and commented on the results for their SFUs.

The Kurento Media Server team is aware that their server is currently crashing early and we are working with them to address this. On Kurento/OpenVidu, we tested max 140 streams (since it crashes so early).

In addition, there is a known bug in libnice, which affected both Kurento/OpenVidu and Janus during our initial tests. After a libnice patch was applied as advised by the Janus team, their results significantly improved. However, the re-test with the patch on Kurento/OpenVidu actually proved even worse. Our conclusion was that there are other issues with Kurento. We are in contact with them and working on fixes, so the Kurento/OpenVidu results might improve soon.

The latest version of Jitsi Videobridge (up to the point of this publication) always became unstable at exactly 240 users. The Jitsi team is aware of that and working on the problem. They have however pointed out that their general advice is to rely on horizontal scaling with a larger number of smaller instances as described here. Note that a previous version (as of two months ago) did not have these stability issues but did not perform as well (see more on this in the next section). We chose to keep version 0.1.1077 as it made simulcast much better and improved the results significantly (up to 240 participants, that is). ...  Continue reading

Multi-party calling architectures are a common topic here at webrtcHacks, largely because group calling is widely needed but difficult to implement and understand. Most would agree Scalable Video Coding (SVC) is the most advanced, but also the most complex, multi-party calling architecture.

To help explain how it works we have brought in not one, but two WebRTC video architecture experts. Sergio Garcia Murillo is a long time media server developer and founder of Medooze. Most recently, and most relevant for this post, he has been working on an open source SFU that leverages VP9 and SVC (the first open source project to do this that I am aware of). In addition, frequent webrtcHacks guest author and renowned video expert Gustavo Garcia Bernando joins him.

Below is a great piece that walks through a complex technology and yet-to-be documented features in Chrome’s WebRTC implementation. Take notes!

{“editor”: “chad hart“}

One of the challenges of WebRTC multiparty solutions has always been how to adapt the video bitrate for participants with different capabilities. Traditional solutions are based on the Multi-point Control Unit (MCU) model. MCUs transcode (fully decode the stream and then re-encode it) to generate a different version of the stream for each participant with different quality, resolution, and/or frame rate.

Then the Selective Forwarding Unit (SFU) model, based on forwarding packets without any re-encoding, became very popular. This was largely due to its scalable and relatively inexpensive server-side architecture. SFUs were particularly popular for WebRTC. For the past couple of years, Chrome’s unofficial support for simulcast and temporal scalability within the VP8 codec provided one of the best ways to implement a WebRTC SFU. Simulcast requires that the endpoints send two or three versions of the same stream with different resolutions/qualities so that the SFU server can forward a different one to each destination. Fortunately when enabling simulcast in Chrome you get support for temporal scalability automatically (explained below). This means the SFU can also selectively forward different packets to provide different frame rates of each quality depending on the available peer bandwidth.

However, simulcast does have some drawbacks – its extra independently encoded streams result in extra bandwidth overhead and CPU usage.

Is there something better? Yes. Scalable Video Coding (SVC) is a more sophisticated approach to minimize this overhead while maintaining the benefits of the SFU model.

What is SVC?

Scalable Video Coding (SVC) refers to the codec capability of producing several encoded layers within the same bit stream. SVC is not a new concept – it was originally introduced as part of H264/MPEG-4 and was later standardized there in 2005. Unlike simulcast, which sends multiple streams with redundant information and packetization overhead, SVC aims to provide a more efficient implementation by encoding layers of information representing different bitrates within a single stream. Use of a single stream and this novel encoding approach helps to minimize network bandwidth consumption and client CPU encoding costs while still providing a lightweight video routing architecture.

3 Layer Types

There are three different kinds of layers:

  1. Temporal – different frame rates.
  2. Spatial – different image sizes.
  3. Quality – different encoding qualities.

VP9  supports quality layers as spatial layers without any resolution changes, so we will only refer to spatial and temporal layers from now on.

The SVC layer structure specifies the dependencies between the encoded layers. This makes it possible to pick one layer and drop all other non-dependent layers after encoding without hurting the decodability of the resulting stream.

In VP9 each layer is identified by an integer ID (starting at 0). Layers with higher IDs depend on lower layers.

Encoding & SFU Selection

VP9 SVC produces a “super frame” for each image frame it encodes. Super frames are composed of individual “layer frames” that belong to a single temporal and spatial layer. When transmitting VP9 SVC over RTP, each super frame is sent in a single RTP frame with an extra payload description in each RTP packet that allows an SFU to extract the individual layer frames. This way, the SFU can select only the ones required for the current spatial and temporal layer selection.
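The selection step can be sketched as a filter over layer frames. This is a simplified model rather than the code of any particular SFU: the `LayerFrame` type and `select_layer_frames` function are hypothetical names, and the sketch assumes, as stated above, that a layer only depends on layers with lower IDs. The temporal pattern in the usage example is also just an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LayerFrame:
    """One layer frame extracted from a super frame (illustrative model)."""
    super_frame: int  # index of the super frame this layer frame belongs to
    sid: int          # spatial layer ID, 0 is the base layer
    tid: int          # temporal layer ID, 0 is the base layer


def select_layer_frames(frames: Iterable[LayerFrame], max_sid: int, max_tid: int) -> List[LayerFrame]:
    """Keep only the layer frames at or below the selected spatial and temporal layers."""
    return [f for f in frames if f.sid <= max_sid and f.tid <= max_tid]


if __name__ == "__main__":
    # Four super frames, two spatial layers each, with an assumed T0, T2, T1, T2 pattern.
    temporal_pattern = [0, 2, 1, 2]
    stream = [
        LayerFrame(super_frame=i + 1, sid=sid, tid=tid)
        for i, tid in enumerate(temporal_pattern)
        for sid in (0, 1)
    ]
    for frame in select_layer_frames(stream, max_sid=0, max_tid=1):
        print(frame)
```

Filtering alone is enough when lowering the selection; raising it safely also depends on the switching points described later in the article.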

Controlling Bandwidth

An SFU can downscale (both temporally and spatially) at any given full layer frame. It is only able to upscale on the escalation points signaled on the payload description header.

Downscaling a temporal layer will produce a decrease in the decoded frames per second (FPS). Downscaling a spatial layer will produce a reduction in the decoded image size. Upscaling will provide the inverse effect.

Status of VP9 SVC in Chrome

Currently VP9 SVC is enabled in standard Chrome for screensharing only. However, VP9 SVC support for any encoded stream (at least since M57) can be enabled with a field trial command line parameter:

chrome --force-fieldtrials=WebRTC-SupportVP9SVC/EnabledByFlag_2SL3TL

If you do that, Chrome will SVC encode every VP9 stream it sends.  With the above setting, the VP9 encoder will produce 2 spatial layers (with size and size/2) and 3 temporal layers (at FPS, FPS/2 and FPS/4) and no quality layers. So for example, if encoding a VGA size image (640×480) at 30 FPS, you could switch between the following resolutions and framerates:
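The combinations implied by that configuration can be worked out directly from the description above (sizes of size and size/2, frame rates of FPS, FPS/2 and FPS/4). The helper below is a hypothetical sketch for the arithmetic only; it does not query Chrome or the encoder.

```python
from fractions import Fraction
from typing import List, Tuple


def layer_combinations(width: int, height: int, fps: int,
                       spatial_layers: int = 2,
                       temporal_layers: int = 3) -> List[Tuple[int, int, Fraction]]:
    """List (width, height, fps) for every spatial/temporal layer combination.

    Each extra spatial step halves width and height; each extra temporal
    step halves the frame rate.
    """
    combos = []
    for s in range(spatial_layers):
        for t in range(temporal_layers):
            combos.append((width // 2 ** s, height // 2 ** s, Fraction(fps, 2 ** t)))
    return combos


if __name__ == "__main__":
    for w, h, rate in layer_combinations(640, 480, 30):
        print(f"{w}x{h} @ {float(rate):g} FPS")
```

For a 640×480 source at 30 FPS this gives six combinations: 640×480 and 320×240, each at 30, 15 and 7.5 FPS.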

VP9 Payload format

Each RTP packet carrying a VP9 stream contains data from only one layer frame. Each packet also starts with a VP9 Payload Description. This payload description provides hints to the SFU about the layer frame dependencies and scalability structure.


Currently, Chrome Canary (58) uses 2 types of payload formats for VP9 SVC:

  1. Flexible mode – provides the ability to change the temporal layer hierarchies and patterns dynamically. The reference frames of each layer frame are provided in the payload description header. This mode is currently only used for screen sharing.
  2. Non-flexible mode – the reference frames of each frame within the group of frames (GOF) are specified in the scalability structure of the payload description and they are fixed until a new scalability structure is sent. This is currently the mode used for real time video.


This is the actual scalability structure of layer frame dependencies used by Chrome Canary for real time video in the non-flexible mode:

The frame dependencies are required by the decoder in order to know if a frame is “decodable” or not given the previously received frames.

Dependency descriptions

Luckily, the payload description also provides hints for an SFU in order to decide which frames may be dropped or not according to the desired temporal and spatial layer it decides to send to each endpoint. The main bits of the payload description that we will need to check are:

  • P: Inter-picture predicted layer frame, which specifies if the current layer frame depends on previous layer frames of the same spatial layer.
  • D: Inter-layer dependency used, which specifies if the current layer frame depends on the layer frame from the immediately previous spatial layer within the current super frame.
  • U: Switching up point, which specifies if the current layer frame depends on previous layer frames of the same temporal layer.
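As a rough sketch of where these bits live, the snippet below reads them from the start of a VP9 RTP payload. The bit positions reflect my reading of the VP9 RTP payload format (first octet I|P|L|F|B|E|V|Z, optional picture ID, then a layer indices octet laid out as TID|U|SID|D) and should be checked against the specification before being relied on. `Vp9LayerInfo` and `parse_layer_info` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vp9LayerInfo:
    p: bool   # inter-picture predicted layer frame
    tid: int  # temporal layer ID
    u: bool   # switching up point
    sid: int  # spatial layer ID
    d: bool   # inter-layer dependency used


def parse_layer_info(payload: bytes) -> Optional[Vp9LayerInfo]:
    """Read P, TID, U, SID and D from a payload description (assumed layout).

    Returns None when the L bit is not set, i.e. no layer indices are present.
    """
    if not payload:
        raise ValueError("empty payload")
    first = payload[0]
    has_picture_id = bool(first & 0x80)     # I bit
    p_bit = bool(first & 0x40)              # P bit
    has_layer_indices = bool(first & 0x20)  # L bit
    offset = 1
    if has_picture_id:
        if offset >= len(payload):
            raise ValueError("truncated picture ID")
        # M bit set: 15-bit picture ID over two octets, otherwise one octet.
        offset += 2 if payload[offset] & 0x80 else 1
    if not has_layer_indices:
        return None
    if offset >= len(payload):
        raise ValueError("truncated layer indices")
    layer = payload[offset]
    return Vp9LayerInfo(
        p=p_bit,
        tid=layer >> 5,
        u=bool(layer & 0x10),
        sid=(layer >> 1) & 0x07,
        d=bool(layer & 0x01),
    )
```

The sketch stops at the layer indices octet; the reference frame list (flexible mode) and the scalability structure follow it and are not parsed here.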

It is possible to switch up to a higher temporal layer on a layer frame whose U bit is set to 1, as subsequent higher temporal layer frames will not depend on any previous layer frame from a temporal layer higher than the current one. When a layer frame does not utilize inter-picture prediction (P bit set to 0) it is possible to up-switch to the current spatial layer’s frame from the directly lower spatial layer frame.

Dependency Model

Now let us look at how an actual VP9 SVC encoded stream looks, taken from a recent Chrome Canary capture. For each frame of the RTP stream we have decoded the payload description, extracted the representative bits, and used the scalability structure to draw the layer frame dependencies.

It looks like this:

T indicates temporal layers. S indicates spatial layers. S0 and T0 are the base layers. Super frames 1 to 4 comprise a group of frames, as do 5 to 8. You can see that each layer frame of the spatial layer S1 is dependent on the S0 layer frame of the same super frame. Also, it is clear that the second T2 frame of each scalability group is not scalable as it depends on the previous T2 and T1 frames.

Examples of selective forwarding

Using this model, let’s see how layers are selected for a given frame.


Let’s then downscale to T1 S1 at the T2 S1 layer frame of super frame 2 (in red). The result is that the size is not changed but the FPS is halved, as expected:

We could downscale further to T1  S0  at the same layer frame and the result will be a reduced image size (width and height halved) and also halved FPS:


To upscale temporally the SFU has to wait for a layer frame with the switching point flag (U bit set to 1) enabled. In the non-flexible mode this happens periodically according to the scalability structure. The SFU needs to wait for a layer frame that is not an inter-picture predicted layer frame (P bit set to 0) before it can upscale spatially.
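These rules can be sketched as a small per-receiver state machine. It follows the simplified rules in this article (downscale at any layer frame, upscale temporally on U set to 1, upscale spatially on P set to 0) and is not the logic of any specific SFU. `LayerFrameBits` and `LayerSelector` are hypothetical names, and a production SFU would need to handle more cases, such as packet loss and frames split across several packets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerFrameBits:
    sid: int  # spatial layer ID
    tid: int  # temporal layer ID
    p: bool   # True when the layer frame is inter-picture predicted
    u: bool   # True when the layer frame is a switching up point


@dataclass
class LayerSelector:
    current_sid: int
    current_tid: int
    target_sid: int
    target_tid: int

    def should_forward(self, frame: LayerFrameBits) -> bool:
        """Update the current layers if allowed, then decide whether to forward."""
        # Downscaling is allowed at any layer frame.
        self.current_sid = min(self.current_sid, self.target_sid)
        self.current_tid = min(self.current_tid, self.target_tid)
        # Spatial upscale: needs a frame of the next spatial layer that is not
        # inter-picture predicted and that we would forward temporally anyway.
        if (self.target_sid > self.current_sid
                and frame.sid == self.current_sid + 1
                and frame.tid <= self.current_tid
                and not frame.p):
            self.current_sid = frame.sid
        # Temporal upscale: wait for a switching up point on a spatial layer
        # that is already being forwarded (simplified rule from the text).
        if (self.target_tid > self.current_tid
                and frame.u
                and frame.sid <= self.current_sid):
            self.current_tid = self.target_tid
        return frame.sid <= self.current_sid and frame.tid <= self.current_tid
```

A receiver that starts at S0/T0 with a target of S1/T2 would keep getting only base layer frames until a P=0 frame arrives on S1, and only the lower frame rate until a frame with U=1 arrives.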

One way the SFU is able to force the encoder to produce a non inter-picture predicted layer frame is by sending an RTCP feedback Layer Refresh Request (LRR) message or a Picture Loss Indicator (PLI)/Full Intra Request (FIR). In our RTP example stream, we had to wait until frame 68 to see it. This happens to be initiated by an FIR between frames 67 and 68. Spatial layer S0 is not dependent on the previous temporal layer T0, so the scalability structure is restarted after that with a new Group of Frames.

So in the previous VP9 Stream the SFU will be able to upscale spatially on frame 68 and later temporally on frame 73. 

What’s missing

Today it is possible to enable VP9 SVC in Chrome (including stable) by passing a command line flag and automatically getting 2 spatial layers plus 3 temporal layers (as illustrated above). There are at least four issues that Google needs to solve before they make it a default option:

  1. Decide the best combination of Temporal and Spatial layers when enabling VP9 SVC or provide an API to configure that (but that probably requires part of the new ORTC-like APIs that are not yet available).
  2. Provide a way to enable or disable SVC on a per-session basis, so you can have a multiparty call with SVC and a 1:1 call using traditional VP9 to avoid the overhead of SVC encoding.
  3. Denoising (blurring of frames to remove imperfections) was disabled and it is still not yet enabled by default for VP9.
  4. CPU usage when using VP9 SVC is still very high – on mid to high end devices it takes some time to detect CPU overuse and scale down the resolution being sent.

Results / Demos

As stated earlier, the main goal behind using SVC codecs is to be able to generate different versions of a stream without having to transcode it.   That means we need to generate many different stream versions of various bitrates to adapt to the changing bandwidth availability of the participants.   

In the next figure we show a stream being sent at 2Mbps using VP9 SVC. With the 2SL3TL  configuration above, we can generate 6 different versions by selecting different temporal and spatial layers.   The lower layer (¼ resolution at ¼ frame rate) is around 250kbps and the other ones have more or less 200kbps difference between them.

You can test this yourself using the new open source Medooze SFU ...  Continue reading