Search Results

Search Results for: jitsi

I wanted to add local recording to my own Jitsi Meet instance. The feature wasn’t built in the way I wanted, so I set out on a hack to build something simple. That lead me down the road to  discovering that:

  1. getDisplayMedia for screen capture has many quirks,
  2. mediaRecorder for media recording has some of its own unexpected limitations, and
  3. Adding your own HTML/JavaScript to Jitsi Meet is pretty simple

Read on for plenty of details and some reference code. My result is located in this repo.


Atlassian’s HipChat acquired BlueJimp, the company behind the Jitsi open source project. Other than for positive motivation, why should WebRTC developers care? Well, Jitsi had its Jitsi Video Bridge (JVB) which was one of the few open source Selective Forwarding Units (SFU) projects out there. Jitsi’s founder and past webrtcHacks guest author, Emil Ivov, was a major advocate for this architecture in both the standards bodies and in the public. As we have covered in the past, SFU’s are an effective way to add multiparty video to WebRTC. Beyond this one component, Jitsi was also a popular open source project for its VoIP client, XMPP components, and much more. ...  Continue reading

Pion seemingly came out of nowhere to become one of the biggest and most active WebRTC communities. Pion is a Go-based set of WebRTC projects. Golang is an interesting language, but it is not among the most popular programming languages out there, so what is so special about Pion? Why are there so many developers involved in this project? 

To learn more about this project and how it came to be among the most active WebRTC organizations, I interviewed its founder – Sean Dubois. We discuss Sean’s background and how be got started in RTC, so see the interview for his background.  I really wanted to understand why he decided to build a new WebRTC project and why he continues to spend so much of his free time on it. ...  Continue reading

Chrome recently added the option of adding redundancy to audio streams using the RED format as defined in RFC 2198, and Fippo wrote about the process and implementation in a previous article. You should catch-up on that post, but to summarize quickly RED works by adding redundant payloads with different timestamps in the same packet. If you lose a packet in a lossy network then chances are another successfully received packet will have the missing data resulting in better audio quality.

That was in a simplified one-to-one scenario, but audio quality issues often have the most impact on larger multi-party calls. As a follow-up to Fippo’s post, Jitsi Architect and Improving Scale and Media Quality with Cascading SFUs author Boris Grozev walks us through his design and tests for adding audio redundancy to a more complex environment with many peers routing media through a Selective Forwarding Unit (SFU).

{“editor”, “chad hart“}

Fippo covered how to add redundancy packets in standard peer-to-peer calls without any middle boxes like a Selective Forwarding Unit (SFU).  What happens when you stick in a SFU in the middle? There are a couple more things to consider.

  • How do we handle conferences where clients have different RED capabilities? It may be the case that only a subset of the participants in a conference support RED. In fact this will often be the case today since RED is a relatively new addition to WebRTC/Chromium/Chrome.
  • Which streams should have redundancy? Should we add redundancy for all audio streams at the cost of additional overhead, or just the currently active speaker (or 2-3 speakers)?
  • Which legs should have redundancy? In multi-SFU cascading scenarios, do we need to add redundancy for the SFU-SFU streams?

Here we will discuss these questions, present what we recently implemented in Jitsi Videobridge, and share some more test results.

Back in April 2020 a Citizenlab reported on Zoom’s rather weak encryption and stated that Zoom uses the SILK codec for audio. Sadly, the article did not contain the raw data to validate that and let me look at it further. Thankfully Natalie Silvanovich from Googles Project Zero helped me out using the Frida tracing tool and provided a short dump of some raw SILK frames. Analysis of this inspired me to take a look at how WebRTC handles audio. In terms of perception, audio quality is much more critical for the perceived quality of a call as we tend to notice even small glitches. Mere ten seconds of this audio analysis were enough to set me off on quite an adventure investigating possible improvements to the audio quality provided by WebRTC.

A couple of weeks ago, the Chrome team announced an interesting Intent to Experiment on the blink-dev list about an API to do some custom processing on top of WebRTC. The intent comes with an explainer document written by Harald Alvestrand which shows the basic API usage. As I mentioned in my last post, this is the sort of thing that maybe able to help add End-to-End Encryption (e2ee) in middlebox scenarios to WebRTC.

I had been watching the implementation progress with quite some interest when former webrtcHacks guest author Emil Ivov of reached out to discuss collaborating on exploring what this API is capable of. Specifically, we wanted to see if WebRTC Insertable Streams could solve the problem of end-to-end encryption for middlebox devices outside of the user’s control like Selective Forwarding Units (SFUs) used for media routing.

The good news it looks like it can! Read below for details.

Before we get into the project, we should first recap how media encryption works with media server devices like SFU’s.

Media Encryption in WebRTC

WebRTC mandates encryption. It uses DTLS-SRTP for encrypting the media. DTLS-SRTP works by using a DTLS handshake to derive keys for encrypting the payload of the RTP packets. It is authenticated by comparing the a=fingerprint lines in the SDP that are exchanged via the signaling server with the fingerprints of the self-signed certificates used in the handshake. This can be called end-to-end encryption since the negotiated keys do not leave the local device and the browser does not have access to them. However, without authentication it is still vulnerable to man-in-the-middle attacks.

See our post about the mandatory use of DTLS for more background information on encryption and how WebRTC landed where it is today.

SFU Challenges

The predominant architecture for multiparty is a Selective Forwarding Unit (SFU). SFUs are basically packet routers that forward a single or small set of streams from one user to many other users. The basics are explained in this post.

In terms of encryption, DTLS-SRTP negotiation happens between each peer endpoint and the SFU. This means that the SFU has access to the unencrypted payload and could listen in. This is necessary for features like server-side recording. On the downside, it means you need to trust the entity running the SFU and/or the client code to keep that stream private. Zero trust is always best for privacy.

Unlike a more traditional VoIP Multipoint Control Unit (MCU) which decodes and mixes media, a SFU only routes packets. It does not care much about the content (apart from a number of bytes in the header and whether a frame is a keyframe). So theoretically the SFU should not need to decode and decrypt anything. SFU developers have been quite aware of that problem since the early days of WebRTC. Similarly, Google’s library has included a “frame encryption” approach for a while which was likely added for Google Duo but doesn’t exist in the browser. However, the “right” API to solve this problem universally only happened now with WebRTC Insertable Streams.

Make it Work

Our initial game plan looked like the following:

  1. Make it work

End-to-End Encryption Sample

Fortunately making it work was a bit easier since Harald Alvestrand had been working on a sample which simplified our job considerably. The approach taken in the sample is a very nice demonstration:

  1. opening two connections,
  2. applying the (intentionally weak, xor-ing the content with the key) encryption on both but
  3. only decryption on one of them.

You can test the sample on here.  Make sure you run the latest Chrome Canary (84.0.4112.0 or later) and that the experimental Web Platform Features flag is on.

The API is quite easy to use. A simple logging transform function looks like this:

The transform function is then called for every video frame. This includes an encoded frame object (named chunk) and a controller object. The controller object provides a way to pass the modified frame to the next step. In our case this is defined by the pipeTo  call above which is the packetizer.

Iterating improvements

With a fully working sample (video-only at first because audio was not yet implemented), we iterated quickly on some improvements such as key changes and not encrypting the frame header. The latter turned out to be very interesting visually. Initially, upon receiving the encrypted frame, the decoder of the virtual “middlebox” would just throw an error and the picture would be frozen. Exploiting some properties of the VP8 codec and not encrypting the first couple of bytes now tricks the decoder into thinking that frame is valid VP8. Which looks … interesting:

Give the sample a try yourself here.

Insertable Streams iterates on frames, not packets

The Insertable Streams API operates between the encoder/decoder and the packetizer that splits the frames into RTP packets. While it is not useful for

inserting your own encoder with WebAssembly ...  Continue reading

Editor’s Note: This post was originally published on October 23, 2018. Zoom recently started using WebRTC’s DataChannels so we have added some new details at the end in the DataChannels section.

Zoom has a web client that allows a participant to join meetings without downloading their app. Chris Koehncke was excited to see how this worked (watch him at the upcoming KrankyGeek event!) so we gave it a try. It worked, removing the download barrier. The quality was acceptable and we had a good chat for half an hour.

Opening chrome://webrtc-internals showed only getUserMedia being used for accessing camera and microphone but no  RTCPeerConnection like a WebRTC call should have. This got me very interested – how are they making calls without WebRTC?

Why don’t they use WebRTC?

The relationship between Zoom and WebRTC is a difficult one as shown in this statement from their website:

The Jitsi folks just did a

comparison of the quality ...  Continue reading

Deploying media servers for WebRTC has two major challenges, scaling beyond a single server as well as optimizing the media latency for all users in the conference. While simple sharding approaches like “send all users in conference X to server Y” are easy to scale horizontally, they are far from optimal in terms of the media latency which is a key factor in the user experience. Distributing a conference to a network of servers located close to the users and interconnected with each other on a reliable backbone promises a solution to both problems at the same time. Boris Grozev from the Jitsi team describes the cascading SFU problem in-depth and shows their approach as well as some of the challenges they ran into.

{“editor”: “Philipp Hancke“}

Real-time communication applications are very sensitive to network conditions such as throughput, delay, and loss. Lower bitrates lead to lower video quality and longer network latency leads to a longer end-to-end delay in audio and video. Packet loss can lead to “choppy” audio and video freezes due to video frame skipping.

Because of this it is important to select an optimal path between the endpoints in a conference. When there are only two participants this is relatively straightforward – the ICE protocol is used by WebRTC to establish a connection between the two endpoints to exchange multimedia. The two endpoints connect directly if possible, and otherwise use a TURN relay server in less typical situations. WebRTC supports resolving a domain name to get the TURN server address, which makes it easy to select a local TURN server based on DNS, for example by using AWS Route53’s routing options.

However, when a conference has more participants routed through a centralized media server the situation is much more complex. Many WebRTC services like Hangouts,, Slack, and our own, use a Selective Forwarding Units (SFU) to more efficiently relay audio and video among 3 or more participants.

The Star Problem

In this case all endpoints connect to a central server (in a star topology) with which they exchange multimedia. It should be obvious that selecting the location of the server has a huge impact on user experience — if all participants in the conference are located in the US, using a server in Sydney is not a good idea.

Most services use a simple approach which works well a lot of the time — they select a server close to the first participant in the conference. However, there are some cases where this isn’t optimal. For example, suppose we have the three participants as shown in the diagram above – two are based on the East Coast of the US and the third is in Australia. If the Australian participant (Caller C) joins the conference first, this algorithm selects the server in Australia (Server 2), but Server 1 in the US is a better choice since it is closer to the majority of participants.

Scenarios such as these are not common, but they do happen. Assuming the order in which participants join is random, this happens in ⅓ of conferences with 3 participants where one is in a remote location.

Another scenario which happens more often is illustrated in the diagram below: we have two groups of participants in two locations. In this case the order of joining doesn’t matter, we will always have some pairs of users that are close to each other, but their media has to go through a server in a remote location. For example, in the image below there are 2 Australian callers (C&D) and 2 US callers (A&B) .

Switching to Server 1 is non-optimal for Callers C&D. Server 2 is non-optimal for callers A&B. Whether we use Server 1 or Server 2 there will be some participants connected through a non-optimal remote server.

What if we weren’t limited to using one server? We could have every participant connected to a local server, we just have to interconnect the servers.

Solution: Cascading

Postponing the question of how do we actually interconnect the servers, let’s first look at what effect this has on the conference.

The SFU connection from C to D hasn’t changed – that still goes through Server 2. For the connection between A and B we use Server 1 instead of Server 2 as in the previous diagram  which is obviously better. The interesting part is actually the connection from A to C (or any of the others, for which the effects are analogous). Instead of using A<=>Server 2<=>C we use A<=>Server 1<=>Server 2<=>C.

Non-intuitive trip time impacts

Connecting SFU bridges like this has advantages and disadvantages. On the one hand, our results show that in such situations the end-to-end round-trip-time is higher when we add additional hops. On the other hand, reducing the round trip time from the client to the first server that it is connected to has an advantage on its own, because we can perform stream repair with lower latency on a hop-by-hop basis.

How does that work? WebRTC uses RTP, usually over UDP, to transfer media. This means that the transport is not reliable. When a UDP packet is lost in the network, it is up to the application to either ignore/conceal the loss, or request a retransmission using an RTCP NACK packet. For example the application might chose to ignore lost audio packets, and request retransmission for some but not all video packets (depending on whether they are required for decoding of subsequent frames or not).

With cascaded bridges, these retransmissions can be limited to a local server. For example, in the A-S1-S2-C path, if a packet is lost between A and S1, S1 will notice and request retransmission. If a packet is lost between S2 and C, C will request retransmission and S2 will respond from its cache. And if a packet is lost between two servers, the receiving server can request a retransmission.

Clients use a jitter buffer to delay the playback of video, in order to allow for delayed or retransmitted packets to arrive. The size of this buffer changes dynamically based in part on the round-trip time. When retransmissions are performed hop-by-hop, the latency is lower, and therefore the jitter buffer can be shorter, leading to lower overall delay.

In short, even though the end-to-end round-trip-time is higher with an extra server, this could lead to lower end-to-end media delay (but we have yet to explore this effect in practice).

Implementing a Cascading SFU

So how do we implement this in Jitsi Meet, and how do we deploy it on

Signaling vs. Media

Let us look at signaling first. Since its inception, Jitsi Meet has separated the concept of a signaling server (which is now Jicofo) and a media server/SFU (jitsi-videobridge). This separation allowed us to implement support for cascaded bridges relatively easily. For one thing, we could just keep all the signaling logic in a central place — Jicofo. Second, we already had the protocol for signaling between Jicofo and Jitsi Videobridge (COLIBRI). We only had to add a small extension to it. We already had support for multiple SFUs connected to one signaling server (for load balancing). Now we had to add the option for one SFU to connect to multiple signaling servers.

We ended up with two independent pools of servers — one pool of jicofo instances and one pool of jitsi-videobridge instances. The diagram below illustrates part of this.

The second part of our system is the bridge-to-bridge communication. We wanted to keep this part as simple as possible, and therefore we decided to not do any explicit signaling between the bridges. All signaling happens between jicofo and jitsi-videobridge, and the connection between two bridges is only used for audio/video and data channel messages coming from clients.

The Octo protocol

To coordinate this communication we came up with the Octo protocol, which wraps RTP packets in a simple fixed-length header, and allows to transport string messages. In the current implementation, the bridges are connected to each other in a full mesh, but the design allows for other topologies as well. For example, using a central relay server (a star of bridges), or a tree structure for each bridge.

Footnote: Note that instead of prepending the Octo header it could be added as an RTP header extension, making the streams between bridges pure (S)RTP. Future versions of Octo might use this approach

Second footnote: Octo doesn’t really stand for anything. We were initially planning to use a central relay, and for some reason it reminded us of an octopus, so we kept that name for the project.

In the Jitsi Videobridge terminology, when a bridge is part of a multi-bridge conference, it has a an additional Octo channel (actually one channel for audio and one for video). This channel is responsible for forwarding the media to all other bridges, as well as receiving media from all other bridges. Each bridge binds to a single port for Octo (4096 by default), which is why we need the conference ID field to be able to handle multiple conferences at once.

Currently the protocol does not have its own security mechanism and we delegate that responsibility to lower layers. This is something that we want to work on next, but for the time being the bridges need to be in a secure network (we use a separate AWS VPC).

Use with Simulcast

One of the distinguishing features of Jitsi Meet is simulcast where each participant sends multiple streams of different bitrates and the bridge helps select the ones that are needed. We wanted to make sure that this continues to work robustly, so we forward all of the simulcast streams between the bridges. This allows for quicker switching between streams (because the local bridge doesn’t have to request a new stream). However, it is not optimal in terms of bridge-to-bridge traffic, because some of the streams are often not used and just consume exra bandwidth for no benefit.

Active Speaker Selection

We also wanted to continue to support following the active speaker in a conference (giving them the most real estate). This turned out to be easy — we just have each bridge do the dominant speaker identification independently, and notify its local clients (this is also the approach others have used). This means that the calculation is done multiple times, but it is not expensive, and allows us to simplify things (e.g. we don’t have to decide which bridge does DSI, and worry about routing the messages).

Bridge Selection

With the current implementation, the bridge selection algorithm is simple. When a new participant joins, Jicofo needs to decide which bridge to allocate to it. It does so based on the region of the client and the regions and load of the bridges available to it. If there is an available bridge in the same region as the client, it’s used. Otherwise, one of the existing conference bridges is used.

For documentation about setting up Octo, see here.

Deploying Cascading SFU’s

We have now enabled geographical bridge cascading, as described above, on

For this deployment we run all machines in Amazon AWS. We have servers (both signaling and media) in six regions:

  • us-east-1 (N. Virginia),
  • us-west-2 (Oregon),
  • eu-west-1 (Ireland),
  • eu-central-1 (Frankfurt),
  • ap-se-1 (Singapore) and
  • ap-se-2 (Sydney).

We use a layer of geolocated HAProxy instances which help to determine which region a client is coming from. The domain is managed by Route53 and resolves to an HAProxy instance, which adds its own region to the HTTP headers of the request it forwards. This header is then used to set the value of the config.deploymentInfo.userRegion  variable made available to the client via the /config.js  file.

For diagnostics and to demonstrate this feature, the user interface on shows how many bridges are in use, and where each participant is connected to. Scrolling over the top left part of your local thumbnail shows you the number of servers and the region of the server you are connected to. Scrolling over a remote thumbnail shows you the region of the server the remote participant is connected to, and the end-to-end round trip time between your browser and theirs (as E2E RTT).


We initially launched Octo as an A/B test on in August. The initial results looked good and it is now enabled for everyone. We have a lot of data to go through and we are planning to look at how well Octo performs in detail and write more about it. We are also planning to use this work as the first stepping stone towards supporting larger conferences (for which a single SFU is not sufficient). So stay tuned for more about this in the coming months.

If you have any questions or comments, you can drop us a message on our

community forums ...  Continue reading

If you plan to have multiple participants in your WebRTC calls then you will probably end up using a Selective Forwarding Unit (SFU).  Capacity planning for SFU’s can be difficult – there are estimates to be made for where they should be placed, how much bandwidth they will consume, and what kind of servers you need.

To help network architects and WebRTC engineers make some of these decisions, webrtcHacks contributor Dr. Alex Gouaillard and his team at CoSMo Software put together a load test suite to measure load vs. video quality. They published their results for all of the major open source WebRTC SFU’s. This suite based is the Karoshi Interoperability Testing Engine (KITE) Google funded and uses on to show interoperability status. The CoSMo team also developed a machine learning based video quality assessment framework optimized for real time communications scenarios.

First an important word of caution – asking what kind of SFU is the best is kind of like asking what car is best. If you only want fast then you should get a Formula 1 car but that won’t help you take the kids to school. Vendors never get excited about these kinds of tests because it boils down their functionality into just a few performance metrics. These metrics may not have been a major part of their design criterion and a lot of times they just aren’t that important. For WebRTC SFU’s in particular, just because you can load a lot of streams on an SFU, there may be many resiliency, user behavior, and cost optimization reasons for not doing that. Load also tests don’t take a deep look at the end-to-end user experience, ease of development, or all the other functional elements that go into a successful service. Lastly, a published report like this represents a single point in time – these systems are always improving so result might be better today.

That being said, I personally have had many cases where I wish I had this kind of data when building out cost models. Alex and his team have done a lot of thorough work here and this is great sign for maturity in the WebRTC open source ecosystem. I personally reached out to each of the SFU development teams mentioned here to ensure they were each represented fairly. This test setup is certainly not perfect, but I do think it will be a useful reference for the community.

Please read on for Alex’s test setup and analysis summary.

{“editor”: “chad hart“}


One recurring question on the discuss-webrtc mailing list is “What is the best SFU”. This invariably produces a response of “Mine obviously” from the various SFU vendors and teams. Obviously, they cannot all be right at the same time!

You can check the full thread here. Chad Hart, then with Dialogic answered kindly recognizing the problem and expressed a need:

In any case, I think we need a global (same applied to all) reproducible and unbiased (source code available, and every vendor can tune their installation if they want) benchmark, for several scalability metrics.

Three years later my team and I have built such a benchmark system. I will explain how this system works and show some of our initial results below.

The Problem

Several SFU vendors provide load testing tools. Janus has Jattack. Jitsi has jitsi-hammer and even published some of their results. Jitsi in particular has done a great job with transparency and provides reliable data and enough information to reproduce the results. However, not all vendors have these tools and fewer still make them fully publicly available.  In addition, each tool is designed to answer slightly different questions for their own environments such as:

  • How many streams can a single server instance of chosen type and given bandwidth limit handle?
  • How many users can I support on the same instance?
  • How many users can I support in a single conference?
  • Etc.…

There was just no way to make a real comparative study – one that is independent reproducible, and unbiased. The inherent ambiguity also opened the door for some unsavory behavior from some who realized they could get away with any claim because no one could actually check them. We wanted to produce some results that one does not have to take on faith and that could be peer-reviewed.

What use cases?

To have a good answer to “What is the best SFU?” you need to explain what you are planning to use it for.

We chose to work on the two use cases that seemed to gather the most attention, or at least those which were generating the most traffic on discuss-webrtc:

  1. Video conferencing – many to many, all equals, one participant speaking at a time (hopefully) ,
  2. Media streaming – one-to many, unidirectional

Most video conferencing questions are focused on single server instance. Having 20+ people in a given conference is usually plenty for most. Studies like this one show that in most social cases most of the calls are 1-1, and the average is around 3. , This configuration fits very well a single small instance in any public cloud provider (as long as you get a 1Gbps NIC). You can then use very simple load balancing and horizontal scalability techniques since the ratio of senders to viewers is rarely high. Media streaming, on the other hand, typically involves streaming from a single source to thousands or tens of thousands of viewers. This requires a multi-server hierarchy.

We wanted to accommodate different testing scenarios and implement them in the same fashion across several WebRTC Servers so that the only difference is the system being tested, and the results are not biased.

For purposes of this post I will focus on the video conferencing scenario. For those that are interested, we are finalizing our media streaming test results and plan to present them  at Streaming Media West on November 14th.

The test suite

In collaboration with Google and many others, we developed KITE, a testing engine that would allow us to support all kinds of clients – browsers and native across mobile or desktop – and all kind of test scenarios easily. It is used to test WebRTC implementation everyday across browsers as seen on

Selecting a test client

Load testing is typically done with a single client to control for client impacts. Ideally you can run many instances of the test client in parallel in a single virtual machine (VM). Since this is WebRTC, it makes sense to use one of the browsers. Edge and Safari are limited to a single process, which does not make they very suitable. Additionally, Safari only runs MacOS or iOS, which only runs on Apple hardware. It is relatively easy to spawn a million VMs on AWS if you’re running Windows or Linux. It’s quite a bit more difficult, and costly, to setup one million Macs, iPhones, or iPads for testing (Note, I am still dreaming about this though).

That leaves you with Chrome or Firefox which allow multiple instances just fine. It is our opinion that the implementation of webdriver for Chrome is easier to manage with fewer flags and plugins (i.e. H264) to handle, so we chose to use Chrome.

Systems Under Test

We tested the following SFUs:

To help make sure each SFU showed its best results, we contacted the teams behind each of these projects. We offered to let them set up the server themselves or connect to the servers and check-up their settings. We also shared the results so they could comment. That made sure we properly configured each system to handle optimally for our test.

Interestingly enough, during the life of this study we found quite a few bugs and worked with the teams to improve their solutions. This is discussed more in detail in the last section.

Test Setup

We used the following methodology to increase traffic to a high load. First we populated each video conference rooms with one user at a time until it reached 7 total users. We repeated this process until the total target number of users was reached.  close to 500 simultaneous users.

The diagram below shows the elements in the testbed:


Most people interested in scalability questions will measure the CPU, RAM, and bandwidth footprints of the server as the “load” (streams, users, rooms…) ramps up. That is a traditional way of doing things that supposes that the quality of the streams, their bitrate… all stay equal.

WebRTC’s encoding engine makes this much more complex. WebRTC includes bandwidth estimation, bitrate adaptation and overall congestion control mechanism, one cannot assume streams will remain unmodified across the experiment. In addition to  the usual metrics, the tester also needs to record client-side metrics like sent bitrate, bandwidth estimation results and latency. It is also important to keep an eye on the video quality, as it can degrade way before you saturate the CPU, RAM and/or bandwidth of the server.

On the client side, we ended up measuring the following:

  • Rate of success and failures (frozen video, or no video)
  • Sender and receiver bitrates
  • Latency
  • Video quality (more on that in the next section)

Measuring different metrics on the server side can be as easy as pooling the getStats API yourself or integrating a solution like In our case, we measured:

  • CPU footprint,
  • RAM footprint,
  • Ingress and egress bandwidth in and out,
  • number of streams,
  • along with a few of other less relevant metrics.

The metrics above were not published in the Scientific article because of space limitation, but should be released in a subsequent Research Report.

All of these metrics are simple to produce and measure with the exception of video quality. What is an objective measure of video quality? Several proxies for video quality exist such as Google rendering time, received frames, bandwidth usage, but none of these gave an accurate measure.

Video quality metric

Ideally a video quality metric would be visually obvious when impairments are present.  This would allow one to measure the relative benefits of resilient techniques, such as like Scalable Video Coding (SVC), where conceptually the output video has a looser correlation with jitter, packet loss, etc. than other encoding methods. See the below video from Agora for a good example of a visual comparison:

After doing some quick research on a way to automate this kind of visual quality measurement, we realized that nobody had developed a method to assess the video quality as well as a human would in the absence of reference media with a  real-time stream. So, we went on to develop our own metric leveraging Machine Learning with neural networks. This allowed for real-time, on-the-fly video quality assessment. As an added benefit, it can be used without recording customer media, which is a sometimes a legal or privacy issue.

The specifics of this mechanism is beyond the scope of this article but you can read more about the video quality algorithm here. The specifics of this AI-based algorithm have been submitted for publication and will be made public as soon as it is accepted.

Show me the money results

We set up the following five open-source WebRTC SFUs, using the latest source code downloaded from their respective public GitHub repositories (except for Kurento/OpenVidu, for which the Docker container was used):

Each was setup in a separate but identical Virtual Machine and with default configuration.


First a few disclaimers. All teams have seen and commented on the result of their SFUs.

The Kurento Media Server team is aware that their server is currently crashing early and we are working with them to address this. On Kurento/OpenVidu, we tested max 140 streams (since it crashes so early).

In addition, there is a known bug in libnice, which affected both Kurento/OpenVidu and Janus during our initial tests.  After a libnice patch was applied as advised by the Janus team, their results significantly improved.  However, the re-test with the patch on Kurento/OpenVidu actually proved even worse. Our conclusion was that there are other issues with Kurento. We are in contact with them and working on fixes so, the Kurento/OpenVidu results might improve soon.

The latest version of Jitsi Videobridge (up to the point of this publication) always became unstable at exactly 240 users. The Jitsi team is aware of that and working on the problem. They have however pointed out that their general advice is to rely on horizontal scaling with a larger number of smaller instances as described here. Note that a previous version (as two months ago) did not have these stability issues but did not perform as well (see more on this in the next section). We chose to keep version 0.1.1077 as it included made simulcast much better and improved the results significantly (up to 240 participants, that is). ...  Continue reading