6 comments on “Breaking Point: WebRTC SFU Load Testing (Alex Gouaillard)”

  1. Nice post, with surprisingly low numbers of connections; Red5 Pro does at least double on regular hardware, and on beefy servers we’re at almost 10x. Also, you mention Wowza and Ant Media among others and claim there is no bias or marketing; I’m not buying it, for the simple reason that you left us out, and you know who we are.

  2. Dear Paul,

    Thank you for your comment.

    As explained in the “what use case” section, the results presented here are for the Video Conferencing use case and not the streaming use case, which will be presented separately at Live Streaming West in a month or so.

    For a video conferencing use case, 500 streams is a lot, and none of the previous studies went that far. For streaming, it is not that many. The main reason is that in the video conferencing use case, the number of streams going through the server grows quadratically with the number of incoming streams: everyone receives everyone else’s stream from the server and sends one up, so the number of streams served by the server is n*(n-1). In streaming, if you have 500 viewers, you send 500 streams; in video conferencing, if you had 500 individuals in a single room, you would have close to 250,000 streams. In their previous post, the Jitsi team showed that they can saturate a 1,000-stream server with only around 30 individuals (30*30 gets you close to 1,000). In our study, we limit the configuration to rooms of 7 individuals, so the load is only quadratic per room; the formula giving the number of streams on the server for an audience of n is then:
    – number of full rooms: n/7 (integer division)
    – remaining participants: r = n%7 (always smaller than 7), forming one partial room
    – each full room of 7 generates 7*6 = 42 streams; the partial room generates r*(r-1)
    – total load on the server: (n/7)*42 + r*(r-1)
    simplified: roughly 6n.
    So in our case we have almost 3,000 streams on the server with 500 viewers in video conferencing mode with rooms of 7.
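
    For readers who want to play with those numbers, here is a minimal sketch of that arithmetic (the function name and parameters are just for illustration):

    ```typescript
    // Sketch of the arithmetic above: server-side stream count for an
    // audience of `n` split into rooms of `roomSize` (7 in this study).
    function serverStreams(n: number, roomSize = 7): number {
      const fullRooms = Math.floor(n / roomSize);
      const remainder = n % roomSize;                // participants in the last, partial room
      const perFullRoom = roomSize * (roomSize - 1); // 7 * 6 = 42
      const partialRoom = remainder > 1 ? remainder * (remainder - 1) : 0;
      return fullRooms * perFullRoom + partialRoom;
    }

    console.log(serverStreams(500));    // 71*42 + 3*2 = 2988, i.e. "almost 3,000"
    console.log(serverStreams(30, 30)); // 870, the "30*30 gets you close to 1,000" ballpark
    ```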

    As we indicated in our previous contact with your team, we would be happy to add Red5 to the result pool for the streaming use case.

    We usually request that each team set up their own server, to avoid mistakes on our side and bias in their results. We share results with the teams, and provide feedback and bug fixes. Most of the teams that have participated in this study got a better product out of it.

    Let me know what we can do to involve you.

    • Alex, I appreciate the response and I will concede on one detail which is conference (many-to-many) vs streaming (one-to-many). There is certainly a difference in overhead between the separation of pubs/subs and conference participants.

  3. Thx for the comparison @agouaillard, nice job!

    With the criteria being used, wouldn’t the simplest SFU (no SVC, no BWE, no RTCP, no NACK, no VAD…) win because of its lower processing overhead?

    Could that explain why the most feature-complete (Jitsi?) has worse results?

    • Thanks, Gustavo.

      The main goal of this study was to show that *comparative studies* are possible: a single test bed, all SFUs run under the same conditions, with the same use case, and so on and so forth. The goal was not to rank the SFUs per se, since the use case and the questions asked are by nature arbitrary. Similarly, whether Jitsi is the most feature-complete or not really depends on what you want to do with it. We wanted to provide an environment where people could ask their own questions and get answers by themselves. Since KITE separates the test infrastructure, the grid management, the tests and the reporting, we can now reuse all the settings of this study with a different test to ask a different question. Progress.

      I don’t think that Jitsi had worse results per se. The variations are big, and the differences are within the tolerance margin for the metrics we reported. Note that no server-side metrics like RAM and CPU were reported in the article.

      Putting several months of work into an 8-page article requires compromises. We had to remove our mention of simulcast and SVC from section VI, future work. For this test, we switched OFF simulcast for all the SFUs that supported it, because we could not find a way to properly assess video quality with a mix of simulcast and non-simulcast streams. Also, our testing environment did not stress the features that would allow simulcast or SVC to be evaluated properly, opening the door to bias and misinterpretation.
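
      For illustration only (a minimal sketch; the article does not say whether simulcast was turned off in each SFU’s configuration or on the client), this is how a browser sender can offer a single encoding instead of simulcast layers:

      ```typescript
      // Sketch: offering either three simulcast layers or a single encoding.
      // How each SFU in the study was actually configured is not specified here.
      async function createSender(simulcast: boolean): Promise<RTCPeerConnection> {
        const pc = new RTCPeerConnection();
        const stream = await navigator.mediaDevices.getUserMedia({ video: true });
        const [track] = stream.getVideoTracks();

        const init: RTCRtpTransceiverInit = simulcast
          ? {
              direction: 'sendonly',
              sendEncodings: [
                { rid: 'q', scaleResolutionDownBy: 4 },
                { rid: 'h', scaleResolutionDownBy: 2 },
                { rid: 'f' },
              ],
            }
          : { direction: 'sendonly' }; // single default encoding, no simulcast

        pc.addTransceiver(track, init);
        return pc;
      }
      ```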

      We are working with callstats.io, in the scope of the VERIFY project, on fully instrumenting the network layer, which will allow us to, well, verify that the values reported by getStats() are correct for callstats, but also to automatically assess ramp-up time, bandwidth estimation, simulcast and SVC layer switching mechanisms, and so on and so forth.
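
      As a rough illustration of the kind of client-reported values such instrumentation would cross-check, one can derive the received video bitrate from two getStats() snapshots (pc here is assumed to be an already-connected RTCPeerConnection):

      ```typescript
      // Sketch: received video bitrate derived from two getStats() snapshots.
      // These client-reported numbers are what network-layer instrumentation
      // could be compared against; `pc` is an already-connected RTCPeerConnection.
      async function receivedVideoKbps(pc: RTCPeerConnection, intervalMs = 1000): Promise<number> {
        const totalBytes = async (): Promise<number> => {
          let bytes = 0;
          (await pc.getStats()).forEach((stats) => {
            if (stats.type === 'inbound-rtp' && stats.kind === 'video') {
              bytes += stats.bytesReceived ?? 0;
            }
          });
          return bytes;
        };

        const before = await totalBytes();
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
        const after = await totalBytes();
        return ((after - before) * 8) / intervalMs; // bits per millisecond = kbit/s
      }
      ```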

      At that stage, we will be in an ideal situation to test simulcast and SVC implementation choices in different media servers. Think about the recent Jitsi vs Zoom comparison made by the Jitsi team, but fully automated and across all the possible media servers / infrastructures. We’re almost done, and are trying really hard to generate results in time for IETF in Bangkok, Nov. 2 to 10.

      • I fully understand that the test you did is really hard to do; I really appreciate that effort and I enjoyed reading this post.

        But at the same time, I think it is important to be very clear about what is measured and what is not. An SFU without SVC, BWE and retransmissions is a very bad SFU for videoconferencing in real network conditions, and yet in this specific (and useful) study it could get the highest score. Am I right?

        I think if we want to fully assess SFUs for videoconferencing use cases, we should include these two aspects:
        1/ Environment: Tests should be conducted in non-ideal network conditions. The audio quality is not the same with all the SFUs under packet loss, and the framerate is not the same with all the SFUs under constrained bandwidth. Under ideal network conditions, the one with simply the best sockets-and-threads implementation could win.
        2/ Metrics: We should include audio metrics (buffer sizes, PLC occurrences…) and video metrics (buffering, framerate, lip sync). Maybe callstats.io or rtcstats could help here, like you say, although I don’t think it is easy to generate scores aggregating all those aspects (I tried in the past). Maybe the testRTC guys can help with it.
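
        As an illustration of the kind of audio/video indicators already available from the browser, here is a minimal getStats() sketch (assuming an already-connected RTCPeerConnection named pc; exact field availability varies by browser):

        ```typescript
        // Sketch: a few audio/video quality indicators from getStats().
        // `pc` is assumed to be an already-connected RTCPeerConnection.
        interface QualitySnapshot {
          audioConcealedSamples: number;  // rough proxy for PLC activity
          audioJitterBufferDelay: number; // cumulative, in seconds
          videoFramesPerSecond: number;
          videoFramesDropped: number;
        }

        async function qualitySnapshot(pc: RTCPeerConnection): Promise<QualitySnapshot> {
          const snap: QualitySnapshot = {
            audioConcealedSamples: 0,
            audioJitterBufferDelay: 0,
            videoFramesPerSecond: 0,
            videoFramesDropped: 0,
          };
          (await pc.getStats()).forEach((stats) => {
            if (stats.type !== 'inbound-rtp') return;
            if (stats.kind === 'audio') {
              snap.audioConcealedSamples += stats.concealedSamples ?? 0;
              snap.audioJitterBufferDelay += stats.jitterBufferDelay ?? 0;
            } else if (stats.kind === 'video') {
              snap.videoFramesPerSecond = stats.framesPerSecond ?? 0;
              snap.videoFramesDropped += stats.framesDropped ?? 0;
            }
          });
          return snap;
        }
        ```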

        Regarding “whether Jitsi is the most feature-complete or not really depends on what you want to do with it”, I would at least mention explicitly in the report that there are very relevant features that are present in some SFUs and not in others. For example, some SFUs don’t have active speaker detection, and that can be much more critical than supporting 100 users more or less in a box.

        Looking forward to all the future work you plan to do. Great job!
