If you plan to have multiple participants in your WebRTC calls then you will probably end up using a Selective Forwarding Unit (SFU). Capacity planning for SFUs can be difficult – there are estimates to be made about where they should be placed, how much bandwidth they will consume, and what kind of servers you need.
To help network architects and WebRTC engineers make some of these decisions, webrtcHacks contributor Dr. Alex Gouaillard and his team at CoSMo Software put together a load test suite to measure load vs. video quality. They published their results for all of the major open source WebRTC SFUs. The suite is based on the Karoshi Interoperability Testing Engine (KITE), which Google funded and uses on webrtc.org to show interoperability status. The CoSMo team also developed a machine learning based video quality assessment framework optimized for real time communications scenarios.
First, an important word of caution – asking what kind of SFU is the best is kind of like asking what car is best. If you only want speed then you should get a Formula 1 car, but that won’t help you take the kids to school. Vendors never get excited about these kinds of tests because they boil their functionality down into just a few performance metrics. These metrics may not have been a major part of their design criteria, and a lot of times they just aren’t that important. For WebRTC SFUs in particular, just because you can load a lot of streams onto an SFU, there may be many resiliency, user behavior, and cost optimization reasons for not doing that. Load tests also don’t take a deep look at the end-to-end user experience, ease of development, or all the other functional elements that go into a successful service. Lastly, a published report like this represents a single point in time – these systems are always improving, so results might be better today.
That being said, I personally have had many cases where I wish I had this kind of data when building out cost models. Alex and his team have done a lot of thorough work here, and this is a great sign of maturity in the WebRTC open source ecosystem. I personally reached out to each of the SFU development teams mentioned here to ensure they were each represented fairly. This test setup is certainly not perfect, but I do think it will be a useful reference for the community.
Please read on for Alex’s test setup and analysis summary.
{“editor”: “chad hart“}
Introduction
One recurring question on the discuss-webrtc mailing list is “What is the best SFU”. This invariably produces a response of “Mine obviously” from the various SFU vendors and teams. Obviously, they cannot all be right at the same time!
You can check the full thread here. Chad Hart, then with Dialogic, answered kindly, recognizing the problem and expressing a need:
In any case, I think we need a global (same applied to all) reproducible and unbiased (source code available, and every vendor can tune their installation if they want) benchmark, for several scalability metrics.
Three years later my team and I have built such a benchmark system. I will explain how this system works and show some of our initial results below.
The Problem
Several SFU vendors provide load testing tools. Janus has Jattack. Jitsi has jitsi-hammer and even published some of their results. Jitsi in particular has done a great job with transparency and provides reliable data and enough information to reproduce the results. However, not all vendors have these tools and fewer still make them fully publicly available. In addition, each tool is designed to answer slightly different questions for its own environment, such as:
- How many streams can a single server instance of chosen type and given bandwidth limit handle?
- How many users can I support on the same instance?
- How many users can I support in a single conference?
- Etc.
There was just no way to make a real comparative study – one that is independent, reproducible, and unbiased. The inherent ambiguity also opened the door for some unsavory behavior from those who realized they could get away with any claim because no one could actually check them. We wanted to produce results that one does not have to take on faith and that could be peer-reviewed.
What use cases?
To have a good answer to “What is the best SFU?” you need to explain what you are planning to use it for.
We chose to work on the two use cases that seemed to gather the most attention, or at least those which were generating the most traffic on discuss-webrtc:
- Video conferencing – many-to-many, all equal, one participant speaking at a time (hopefully)
- Media streaming – one-to-many, unidirectional
Most video conferencing questions are focused on a single server instance. Having 20+ people in a given conference is usually plenty for most. Studies like this one show that in most social cases the majority of calls are one-to-one, and the average is around 3 participants. This configuration fits a single small instance in any public cloud provider very well (as long as you get a 1 Gbps NIC). You can then use very simple load balancing and horizontal scalability techniques since the ratio of senders to viewers is rarely high. Media streaming, on the other hand, typically involves streaming from a single source to thousands or tens of thousands of viewers. This requires a multi-server hierarchy.
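As a rough back-of-envelope sketch of why a single 1 Gbps instance goes a long way for the conferencing case (the per-stream bitrate below is an assumption for illustration, not a measured value):

```python
# Back-of-envelope: server egress bandwidth for video conference rooms.
# Each participant sends one stream up and receives the other n-1 from the server.
# The 500 kbps per-stream figure is an assumption, not a measurement.
PER_STREAM_KBPS = 500

def server_egress_mbps(participants: int, rooms: int = 1) -> float:
    streams_down = rooms * participants * (participants - 1)
    return streams_down * PER_STREAM_KBPS / 1000

print(server_egress_mbps(7))             # one room of 7  -> ~21 Mbps of egress
print(server_egress_mbps(7, rooms=20))   # 20 rooms of 7  -> ~420 Mbps, still under 1 Gbps
```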
We wanted to accommodate different testing scenarios and implement them in the same fashion across several WebRTC Servers so that the only difference is the system being tested, and the results are not biased.
For purposes of this post I will focus on the video conferencing scenario. For those that are interested, we are finalizing our media streaming test results and plan to present them at Streaming Media West on November 14th.
The test suite
In collaboration with Google and many others, we developed KITE, a testing engine that allows us to easily support all kinds of clients – browsers and native, across mobile and desktop – and all kinds of test scenarios. It is used to test WebRTC implementations every day across browsers, as seen on webrtc.org.
Selecting a test client
Load testing is typically done with a single type of client to control for client impacts. Ideally you can run many instances of the test client in parallel in a single virtual machine (VM). Since this is WebRTC, it makes sense to use one of the browsers. Edge and Safari are limited to a single process, which does not make them very suitable. Additionally, Safari only runs on macOS or iOS, which only run on Apple hardware. It is relatively easy to spawn a million VMs on AWS if you’re running Windows or Linux. It’s quite a bit more difficult, and costly, to set up one million Macs, iPhones, or iPads for testing (note, I am still dreaming about this though).
That leaves you with Chrome or Firefox, which allow multiple instances just fine. It is our opinion that the WebDriver implementation for Chrome is easier to manage, with fewer flags and plugins (i.e. H.264) to handle, so we chose to use Chrome.
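For illustration, here is a minimal sketch of how a single Chrome load client could be launched through Selenium WebDriver with Chrome’s built-in fake media devices standing in for a real camera and microphone. This is not the actual KITE code; the room URL and helper name are placeholders.

```python
# Minimal sketch: launch one Chrome-based load client with fake media devices.
# Assumes chromedriver is installed; the room URL below is a placeholder.
from selenium import webdriver

def launch_client(room_url: str) -> webdriver.Chrome:
    options = webdriver.ChromeOptions()
    # Replace the real camera/microphone with Chrome's built-in fake devices
    options.add_argument("--use-fake-device-for-media-stream")
    # Auto-accept the getUserMedia permission prompt
    options.add_argument("--use-fake-ui-for-media-stream")
    driver = webdriver.Chrome(options=options)
    driver.get(room_url)
    return driver

# Example: client = launch_client("https://sfu.example.com/room/load-test-1")
```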
Systems Under Test
We tested the following SFUs: Jitsi Meet, Janus Gateway, Medooze, Kurento/OpenVidu, and mediasoup. The exact versions and configurations are listed in the results section below.
To help make sure each SFU showed its best results, we contacted the teams behind each of these projects. We offered to let them set up the server themselves or to connect to our servers and check the settings. We also shared the results so they could comment. That made sure we had configured each system to perform optimally in our test.
Interestingly enough, during the life of this study we found quite a few bugs and worked with the teams to improve their solutions. This is discussed more in detail in the last section.
Test Setup
We used the following methodology to ramp up to a high load. First, we populated a video conference room with one user at a time until it reached 7 total users. We repeated this process, adding rooms of 7, until the target of close to 500 simultaneous users was reached.
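As a rough sketch of that ramp-up logic (reusing the hypothetical launch_client helper from the earlier sketch; this is not the actual KITE orchestration):

```python
# Sketch of the ramp-up: fill rooms of 7 users, one client at a time,
# until roughly 500 simultaneous users are connected.
ROOM_SIZE = 7
TARGET_USERS = 500

clients = []
room_index = 0
while len(clients) < TARGET_USERS:
    room_url = f"https://sfu.example.com/room/load-test-{room_index}"
    for _ in range(ROOM_SIZE):
        if len(clients) >= TARGET_USERS:
            break
        clients.append(launch_client(room_url))  # one more user joins the current room
    room_index += 1
```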
The diagram below shows the elements in the testbed:
Metrics
Most people interested in scalability questions will measure the CPU, RAM, and bandwidth footprints of the server as the “load” (streams, users, rooms…) ramps up. That is the traditional way of doing things, and it supposes that the quality of the streams, their bitrate, etc. all stay equal.
WebRTC’s encoding engine makes this much more complex. Because WebRTC includes bandwidth estimation, bitrate adaptation, and an overall congestion control mechanism, one cannot assume streams will remain unmodified across the experiment. In addition to the usual metrics, the tester also needs to record client-side metrics like sent bitrate, bandwidth estimation results, and latency. It is also important to keep an eye on the video quality, as it can degrade well before you saturate the CPU, RAM, and/or bandwidth of the server.
On the client side, we ended up measuring the following:
- Rate of success and failures (frozen video, or no video)
- Sender and receiver bitrates
- Latency
- Video quality (more on that in the next section)
Measuring client-side metrics can be as easy as polling the getStats API yourself (a minimal polling sketch follows the list below) or integrating a solution like callstats.io. On the server side, we measured:
- CPU footprint,
- RAM footprint,
- Ingress and egress bandwidth,
- number of streams,
- along with a few other less relevant metrics.
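Here is the client-side polling sketch mentioned above, using Selenium to run getStats in the browser. It assumes the page under test exposes its RTCPeerConnection as window.pc, which is a hypothetical hook – a real application needs its own way to reach the connection object.

```python
# Sketch: poll WebRTC getStats() from a Selenium-driven Chrome client.
# Assumes the page exposes its RTCPeerConnection as `window.pc` (hypothetical hook).
import json
import time

GET_STATS_JS = """
const done = arguments[arguments.length - 1];
window.pc.getStats(null).then(report => {
    const stats = [];
    report.forEach(s => stats.push(s));
    done(JSON.stringify(stats));
});
"""

def poll_stats(driver, interval_s=1.0, samples=10):
    """Collect `samples` getStats snapshots, one every `interval_s` seconds."""
    snapshots = []
    for _ in range(samples):
        raw = driver.execute_async_script(GET_STATS_JS)
        snapshots.append(json.loads(raw))
        time.sleep(interval_s)
    return snapshots
```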
The metrics above were not published in the scientific article because of space limitations, but they should be released in a subsequent research report.
All of these metrics are simple to produce and measure, with the exception of video quality. What is an objective measure of video quality? Several proxies for video quality exist, such as Google rendering time, received frames, and bandwidth usage, but none of these gave an accurate measure.
Video quality metric
Ideally, a video quality metric would capture impairments that are visually obvious to a human viewer. This would allow one to measure the relative benefits of resiliency techniques such as Scalable Video Coding (SVC), where conceptually the output video has a looser correlation with jitter, packet loss, etc. than other encoding methods. See the video below from Agora for a good example of a visual comparison:
https://www.youtube.com/watch?v=M71uov3OMfk
After doing some quick research on ways to automate this kind of visual quality measurement, we realized that nobody had developed a method to assess the video quality of a real-time stream as well as a human would in the absence of reference media. So we went on to develop our own metric leveraging machine learning with neural networks. This allows for real-time, on-the-fly video quality assessment. As an added benefit, it can be used without recording customer media, which is sometimes a legal or privacy issue.
The specifics of this mechanism are beyond the scope of this article, but you can read more about the video quality algorithm here. The details of this AI-based algorithm have been submitted for publication and will be made public as soon as the paper is accepted.
Show me the money results
We set up the following five open-source WebRTC SFUs, using the latest source code downloaded from their respective public GitHub repositories (except for Kurento/OpenVidu, for which the Docker container was used):
- Jitsi Meet (JVB version 0.1.1077),
- Janus Gateway (version 0.4.3) with its video room plugin,
- Medooze (version 0.32.0) SFU app,
- Kurento (from OpenVidu Docker container, Kurento Media Server version 6.7.0),
- mediasoup (version 2.2.3).
Each was set up in a separate but identical virtual machine with its default configuration.
Disclaimers
First, a few disclaimers. All teams have seen and commented on the results for their SFUs.
The Kurento Media Server team is aware that their server currently crashes early, and we are working with them to address this. On Kurento/OpenVidu, we tested a maximum of 140 streams (since it crashes so early).
In addition, there is a known bug in libnice which affected both Kurento/OpenVidu and Janus during our initial tests. After a libnice patch was applied as advised by the Janus team, their results significantly improved. However, the re-test with the patch on Kurento/OpenVidu actually proved even worse. Our conclusion was that there are other issues with Kurento. We are in contact with them and working on fixes, so the Kurento/OpenVidu results might improve soon.
The latest version of the Jitsi Videobridge (up to the point of this publication) always became unstable at exactly 240 users. The Jitsi team is aware of that and is working on the problem. They have, however, pointed out that their general advice is to rely on horizontal scaling with a larger number of smaller instances, as described here. Note that a previous version (from two months ago) did not have these stability issues but did not perform as well (see more on this in the next section). We chose to keep version 0.1.1077 as it included changes that made simulcast much better and improved the results significantly (up to 240 participants, that is).
Also note that nearly all of these products have had new releases since testing; some have made improvements since the results shown here were gathered.
Measurements
As a reference point, we chose one of the usual video test sequences and computed its video quality score using several video quality assessment metrics:
- SSIM – a common metric that compares the difference between a distorted image and its original
- VMAF – an aggregate measure of a few metrics developed and used by Netflix
- NARVAL – our algorithm which does not require a reference
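For the reference-based metrics, the computation itself is straightforward once you have matching frames from the original clip and the received recording. Below is a minimal SSIM sketch using OpenCV and scikit-image; it assumes the two recordings have the same resolution and are already time-aligned, which is the genuinely hard part with a live WebRTC stream (and one reason a reference-free metric is attractive).

```python
# Sketch: frame-by-frame SSIM between a reference clip and a received recording.
# Assumes equal resolution and prior time alignment of the two files.
import cv2
from skimage.metrics import structural_similarity as ssim

def mean_ssim(reference_path: str, received_path: str) -> float:
    ref = cv2.VideoCapture(reference_path)
    rec = cv2.VideoCapture(received_path)
    scores = []
    while True:
        ok_ref, frame_ref = ref.read()
        ok_rec, frame_rec = rec.read()
        if not (ok_ref and ok_rec):
            break
        # SSIM is computed on grayscale frames here for simplicity
        gray_ref = cv2.cvtColor(frame_ref, cv2.COLOR_BGR2GRAY)
        gray_rec = cv2.cvtColor(frame_rec, cv2.COLOR_BGR2GRAY)
        scores.append(ssim(gray_ref, gray_rec))
    ref.release()
    rec.release()
    return sum(scores) / len(scores) if scores else 0.0
```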
Note that the relationship between quality score and bitrate is not linear. If you slowly decrease the bandwidth from the reference value (1.7 Mbps), the quality score only decreases slightly until it hits a low bitrate threshold, and then it drops more drastically. To lose 10% of the perceived video quality, you need to reduce the bandwidth to 250 Kbps according to VMAF, or even 150 Kbps according to SSIM, and 100 Kbps according to NARVAL.
Tests on the SFUs showed the same pattern. Image 2 gives the bitrate as a function of the number of participants for each SFU. One can see here that WebRTC’s congestion control algorithms kick in early (at around 250 participants) and reduce the bitrate. However, Image 3 shows that the latency increases more linearly. Despite decreasing bandwidth and increasing latency, the video quality metric shown in Image 4 only reports quality degradation much later, around when the bandwidth goes below 200 Kbps. That shows again that bitrate and latency are not good proxies for video quality.
SFU improvements during testing
Beyond the results themselves presented above, what is interesting is to see the progress triggered by this study. Just getting visibility has allowed the respective teams to address the most egregious problems.
You can also observe that Janus was very quickly limited. They had identified this bottleneck in an external library, and a possible solution, but had never really assessed its true impact. One can clearly see the difference between the graphs in this section (first runs) and the graphs in the previous section (latest results), where Janus seems to perform the best.
Bitrate as a function of the load.
Before (left) and after (right) application of patches to Janus and Jitsi. We also added mediasoup results (in green). Medooze and Kurento/OpenVidu results are the same in both plots as no better results could be generated the second time around.
RTT, or latency, as a function of the load (logarithmic scale).
Before (left) and after (right) application of patches to Janus and Jitsi. We also added mediasoup results (in green). Medooze and Kurento/OpenVidu results come from the same dataset.
Finally, one reviewer of our original article pointed out that Medooze, written by Sergio Garcia Murillo, a CoSMo employee, ended up on top of our study, hinting at a possible bias caused by a conflict of interest. We went to great efforts to conduct all of our tests transparently and without bias. I think it is refreshing to see that in the latest results several SFUs end up being on par with or better than Medooze, removing the final worry some might have. It was good news for the Medooze team too – now they know what they have to work on (like the improvements made in Medooze 0.46) and they have a tool to measure their progress.
Conclusion
We hope we have shown that unbiased comparative testing of SFUs is now relatively easy to achieve thanks to KITE and a few other tools recently developed by CoSMo in collaboration with others in the WebRTC ecosystem. We will continue working with the different open source WebRTC SFU vendors to help them improve their software. We are planning to make as much as possible of the code used to generate these results public and, in any case, to provide access to the tool for public researchers on a non-profit basis. Eventually we would like to host these results as a “live” web page, where new results would be made available as new versions of the software are released.
See the full paper and summary slides presented this week at the IIT Real-Time Communications Conference.
In the following months we will provide results on different use cases, starting with streaming. If you really want to know how Janus SOLEIL, Medooze’s Millicast, Wowza, Ant Media, and others perform (or crash) in a streaming environment, without marketing and without bias, stay tuned.
{“author”: “Alex Gouaillard“}
Paul Gregoire says
Nice post with surprisingly low numbers of connections; Red5 Pro does at least double on regular hardware and on beefy servers we’re at almost 10x. Also you mention Wowza and Ant Media among others, without bias or marketing; not buying it, just because of the simple fact that you left us out and you know who we are.
Alex [email protected] says
Dear Paul,
Thank you for your comment.
As explained in the “What use cases?” section, the results presented here are for the video conferencing use case and not the streaming use case, which will be presented separately at Streaming Media West in a month or so.
For a video conferencing use case, 500 simultaneous users is a lot, and none of the previous studies went that far. For streaming, it is comparatively modest. The main reason is that in the video conferencing use case the number of streams going through the server grows quadratically with the number of participants: everyone receives everyone else’s stream from the server and sends one up, so a room of n participants puts n*(n-1) streams on the server. In streaming, if you have 500 viewers, you send 500 streams; in video conferencing, if you had 500 individuals in a single room, you would have roughly 250,000 streams. In their previous post, the Jitsi team showed that they can saturate a 1,000-stream server with only around 30 individuals (30*30 gets you close to 1,000). In our study, we limit the configuration to rooms of 7 individuals so the growth is only quadratic per room; the formula giving the number of streams on the server as a function of audience size is then:
– number of full rooms: n/7
– remaining users (a partial room): n%7 (always smaller than 7)
– each full room of 7 generates 7*6 = 42 streams
– total load on the server: (n/7)*42 + (n%7)*((n%7)-1)
which simplifies to roughly 6n.
So in our case we have almost 3,000 streams on the server with 500 users in video conferencing mode with rooms of 7.
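A quick sanity check of that arithmetic (plain Python, nothing SFU-specific):

```python
# Stream count for n users split into rooms of 7: each room of k users
# carries k*(k-1) streams through the server.
def server_streams(n_users: int, room_size: int = 7) -> int:
    full_rooms, remainder = divmod(n_users, room_size)
    return full_rooms * room_size * (room_size - 1) + remainder * (remainder - 1)

print(server_streams(500))  # -> 2988, i.e. almost 3,000 streams for 500 users
```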
As we indicated in our previous contact with your team, we would be happy to add red5 to the result pool of the streaming use case.
We usually request that each team set up their own server, to avoid mistakes on our side that could bias their results. We share results with the teams, and provide feedback and bug fixes. Most of the teams that have participated in this study got a better product out of it.
Let me know what we can do to involve you.
Paul Gregoire says
Alex, I appreciate the response and I will concede on one detail which is conference (many-to-many) vs streaming (one-to-many). There is certainly a difference in overhead between the separation of pubs/subs and conference participants.
Gustavo Garcia says
Thx for the comparison @agouaillard, nice job!
With the criteria being used, wouldn’t the simplest SFU (no SVC, no BWE, RTCP, NACK, VAD…) win because of the lower overhead of the processing being done?
Could that explain why the most feature-complete ones (Jitsi?) have worse results?
Alex [email protected] says
Thanks Gustavo.
The main goal of this study was to show that *comparative studies* are possible: a single test bed, all SFUs run under the same conditions, with the same use case, and so on and so forth. The goal was not to rank the SFUs per se, since the use case and the questions asked are by nature arbitrary. Similarly, whether Jitsi is the most feature-complete or not really depends on what you want to do with it. We wanted to provide an environment where people could ask their own questions and get answers by themselves. Since KITE separates the test infrastructure, the grid management, the tests, and the reporting, we can now reuse all the settings of this study with a different test to ask a different question. Progress.
I don’t think that Jitsi had worse results per se. The variations are big, and the differences are within the tolerance margin for the metrics we reported. Note that no server-side metrics like RAM and CPU were reported in the article.
Putting several months of work into an 8-page article requires compromises. We had to remove our mention of simulcast and SVC from Section VI, future work. For this test, we switched OFF simulcast for all the SFUs that supported it, because we could not find a way to properly assess the video quality with a mix of simulcast and non-simulcast streams. Also, our testing environment did not stress the features that would allow evaluating simulcast or SVC well, opening the door for bias and misinterpretation.
We are working with callstats.io, in the scope of the VERIFY project, on fully instrumenting the network layer, which will allow us to, well, verify that the values reported by getStats() are correct for callstats, but also to automatically assess ramp-up time, bandwidth estimation, simulcast and SVC layer switching mechanisms, and so on.
At that stage, we will be in an ideal situation to test simulcast and SVC implementation choices in different media servers. Think about the recent Jitsi vs. Zoom comparison made by the Jitsi team, but fully automated and across all the possible media servers and infrastructures. We’re almost done, and are trying really hard to generate results in time for IETF in Bangkok, Nov. 2 to 10.
Gustavo Garcia says
I fully understand that the test you did is really hard to do; I really appreciate the effort and I enjoyed reading this post.
But at the same time I think it is important to be very clear about what is measured and what is not. An SFU without SVC, BWE and retransmissions is a very bad SFU for videoconferencing in real network conditions, and in this specific (and useful) study it could get the highest score. Am I right?
I think if we want to fully assess SFUs for videoconferencing use cases we should include these two aspects:
1/ Environment: Tests should be conducted under non-ideal network conditions. The audio quality is not the same with all the SFUs under packet loss, and the framerate is not the same with all the SFUs under constrained bandwidth. Under ideal network conditions the one with just the best sockets-and-threads implementation could win.
2/ Metrics: We should include audio metrics (buffer sizes, PLC occurrences…) and video metrics (buffering, framerate, lipsync). Maybe callstats.io or rtcstats could help here like you say, although I don’t think it is easy to generate scores aggregating all those aspects (I tried in the past). Maybe testrtc guys can help with it.
Regarding “Whether Jitsi is the most feature-complete or not really depends on what you want to do with it”, I would at least mention explicitly in the report that there are very relevant features that are present in some SFUs and not others. For example, some SFUs don’t have active speaker detection, and that can be much more critical than supporting 100 users more or less in a box.
Looking forward to all that future work you plan to do. Great job!
Sam says
Nice job!
Just wondering why you didn’t include Licode in the test.
Alex [email protected] says
We designed the study as follows:
– we wanted more than 3 SFUs to make the comparison valuable,
– we wanted very active SFUs that would be able to fix the bugs if we found any,
– we wanted SFUs with peer-reviewed published benchmarks and evaluations,
– we wanted SFUs with their own testing tools in place so we could compare against individual SFU results.
Medooze, Jitsi, Kurento and Janus all met those targets. In terms of their own testing tools, Medooze was already using KITE, Jitsi had jitsi-hammer and a great benchmarking post, Kurento had the Kurento Testing Framework, which had evolved into the ElasTest European Union project with a lot of published results, and Janus had Jattack.
When presenting early results at CommCon UK in April 2018, the main maintainer of mediasoup volunteered to participate, so we added results for mediasoup in a second version of both the paper and the blog post, before the camera-ready versions. Other teams we reached out to either refused to participate or did not reply in time for the publication. We have not reached out to the Licode team.
While we have not added Licode at this stage, we want to keep the results updated on the CoSMo website, and if the Licode team wants to participate – or any team with a media server that supports WebRTC, for that matter – they are more than welcome; just contact us through cosmosoftware.io.
liming says
Dear Alex,
Thanks for your contribution on WebRTC SFU load testing.
” After a libnice patch was applied as advised by the Janus team, their results significantly improved”.
But, in this article, what is the libnice patch? And how does it improve the results?
Dr Alex says
Sorry for the late reply. There was a libnice patch, provided by RingCentral engineers, to address the fact that there was a single lock for all incoming packets, artificially creating a bottleneck. AFAIK this has been merged both into libnice upstream and into Janus as we speak. The results shown in our paper already integrate this optimisation.
rami r says
So what happened in the end with the OpenVidu test? Is it still dead at 150 streams?
Dr Alex says
I’m not sure what you mean by “at end”. At the time of testing, Kurento Media Server crashed at 150 streams and that was pretty much it. At the time of writing this answer, the team seems to have addressed most of the problems and wrote a dedicated blog post. We have not run the test with their new version. Here is the link to the blog post for further reading:
https://medium.com/@openvidu/openvidu-load-testing-a-systematic-study-of-openvidu-platform-performance-b1aa3c475ba9
Rami r says
Thank you. Because we use Kurento for our streaming app: a very small test server, 2 vCPU, 2 GB RAM, $15 on DigitalOcean. There are 3 RTP 500 kbit streams of H.264 video (1.5 Mbit total). It eats 54% of this computer. Is that OK?
I plan to have a source of 300 channels, each 500 kbit, and on the other side there will be around 1,000 web client recipients (WebRTC) that each receive one of the channels (like TV).
Tasks: 107 total, 1 running, 64 sleeping, 0 stopped, 0 zombie
%Cpu0 : 42.5 us, 0.0 sy, 0.0 ni, 57.2 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu1 : 17.1 us, 0.3 sy, 0.0 ni, 82.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 2041272 total, 162712 free, 810852 used, 1067708 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 1061232 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10724 kurento 20 0 1856628 385952 18716 S 54.6 18.9 5649:36 kurento-media-s
29766 kurento 20 0 3166748 325372 28380 S 0.3 15.9 15:45.15 java
30758 root 20 0 44544 3988 3368 R 0.3 0.2 0:00.34 top
1 root 20 0 159992 8676 6136 S 0.0 0.4 0:30.35 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.05 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
roy says
Am I reading it wrong, or are the load test numbers for Jitsi and Janus swapped in the original paper?
                      Jitsi   Janus   Medooze   OpenVidu
Number of rooms          70      35        70         20
Number of client VMs    490     245       490        140
The paper says Jitsi crashed at 245 users.
Alex Gouaillard says
This is correct; the revision of Jitsi we tested was crashing at exactly 245 users. It was not the use case for which Jitsi had been designed, and it was later fixed anyway. Thanks for catching this.
Filippos says
Very interesting! Is the code available somewhere for reproducing the results?
Alex Gouaillard says
Hi Filippos.
Thank you for your nice comment.
It’s an old study by now, and all the SFUs have made substantial progress and substantial non-backward-compatible changes to their UIs, preventing the original code from working, I presume.
However, if you are still interested, most of the original code can be found in the CoSMo Software GitHub: https://github.com/cosmosoftware
Just look for the repos whose names are prefixed with ‘kite’. There is actually much more there than what is needed for this study, and you may find other interesting projects.
For completeness, note that the OpenVidu team separately reproduced the results with a later version of their SFU. You can find their more recent blog post, with a link to their code, here:
https://medium.com/@openvidu/openvidu-load-testing-a-systematic-study-of-openvidu-platform-performance-b1aa3c475ba9
HTH
Andrey Novikov says
I can’t see any graphs or numbers for the CPU and RAM footprint.
Are there any benchmarks that show how much CPU and RAM is required for the various SFUs at different numbers of participants?
(Full paper and summary slides from IIT Real-Time Communications Conference are not available anymore)
rajneesh says
Very nice work. Is the NARVAL code open sourced? Where can I find it?
Laurent Denoue says
would be fun to reproduce the tests with Pion https://github.com/pion/webrtc for example using this audio only SFU https://github.com/MixinNetwork/kraken