Local Jitsi recording hack with getDisplayMedia audio capture and MediaRecorder

I wanted to add local recording to my own Jitsi Meet instance. The feature wasn’t built in the way I wanted, so I set out on a hack to build something simple. That led me down the road to discovering that:

  1. getDisplayMedia for screen capture has many quirks,
  2. MediaRecorder for media recording has some of its own unexpected limitations, and
  3. Adding your own HTML/JavaScript to Jitsi Meet is pretty simple

Read on for plenty of details and some reference code. My result is located in this repo.


Want to keep up on our latest posts? Please click here to subscribe to our mailing list if you have not already. We only email post updates. You can also follow us on twitter at @webrtcHacks for blog updates.

A couple of weeks ago, the Chrome team announced an interesting Intent to Experiment on the blink-dev list about an API to do some custom processing on top of WebRTC. The intent comes with an explainer document written by Harald Alvestrand which shows the basic API usage. As I mentioned in my last post, this is the sort of thing that may be able to help add End-to-End Encryption (e2ee) in middlebox scenarios to WebRTC.

I had been watching the implementation progress with quite some interest when former webrtcHacks guest author Emil Ivov reached out to discuss collaborating on exploring what this API is capable of. Specifically, we wanted to see if WebRTC Insertable Streams could solve the problem of end-to-end encryption for middlebox devices outside of the user’s control, like the Selective Forwarding Units (SFUs) used for media routing.

The good news is that it looks like it can! Read below for details.

Before we get into the project, we should first recap how media encryption works with media server devices like SFUs.

Media Encryption in WebRTC

WebRTC mandates encryption. It uses DTLS-SRTP for encrypting the media. DTLS-SRTP works by using a DTLS handshake to derive keys for encrypting the payload of the RTP packets. It is authenticated by comparing the a=fingerprint lines in the SDP that are exchanged via the signaling server with the fingerprints of the self-signed certificates used in the handshake. This can be called end-to-end encryption since the negotiated keys do not leave the local device and the browser does not have access to them. However, without authentication it is still vulnerable to man-in-the-middle attacks.
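As a rough illustration of that fingerprint comparison, here is a hypothetical helper that pulls the a=fingerprint values out of an SDP blob (illustrative only; this is not a browser API, and the name is an assumption):

```javascript
// Extract the a=fingerprint values from an SDP blob. These are the values
// that get compared against the self-signed certificates used in the DTLS
// handshake.
function getFingerprints(sdp) {
  return sdp.split(/\r?\n/)
    .filter(line => line.startsWith('a=fingerprint:'))
    .map(line => line.slice('a=fingerprint:'.length).trim());
}
```

Verifying these values over a channel the signaling server cannot tamper with is what would turn this into authenticated end-to-end encryption.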

See our post about the mandatory use of DTLS for more background information on encryption and how WebRTC landed where it is today.

SFU Challenges

The predominant architecture for multiparty calling is the Selective Forwarding Unit (SFU). SFUs are basically packet routers that forward a single stream or a small set of streams from one user to many other users. The basics are explained in this post.

In terms of encryption, DTLS-SRTP negotiation happens between each peer endpoint and the SFU. This means that the SFU has access to the unencrypted payload and could listen in. This is necessary for features like server-side recording. On the downside, it means you need to trust the entity running the SFU and/or the client code to keep that stream private. Zero trust is always best for privacy.

Unlike a more traditional VoIP Multipoint Control Unit (MCU), which decodes and mixes media, an SFU only routes packets. It does not care much about the content (apart from a number of bytes in the header and whether a frame is a keyframe). So theoretically the SFU should not need to decode and decrypt anything. SFU developers have been quite aware of that problem since the early days of WebRTC. Similarly, Google’s library has included a “frame encryption” approach for a while, which was likely added for Google Duo but doesn’t exist in the browser. However, the “right” API to solve this problem universally only arrived now with WebRTC Insertable Streams.

Make it Work

Our initial game plan looked like the following:

  1. Make it work

End-to-End Encryption Sample

Fortunately, making it work was a bit easier since Harald Alvestrand had been working on a sample which simplified our job considerably. The approach taken in the sample is a very nice demonstration:

  1. opening two connections,
  2. applying the (intentionally weak, xor-ing the content with the key) encryption on both but
  3. only decryption on one of them.
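The XOR step can be sketched as a transform like the following (the chunk and controller shapes follow the Insertable Streams explainer; the function and parameter names here are assumptions, not the sample’s actual code):

```javascript
// Intentionally weak "encryption": XOR each payload byte with the key.
// Because XOR is its own inverse, running the same transform again with the
// same key decrypts the frame.
function makeXorTransform(key) {
  return (chunk, controller) => {
    const data = new Uint8Array(chunk.data); // view over the encoded frame
    for (let i = 0; i < data.length; i++) {
      data[i] ^= key[i % key.length];
    }
    chunk.data = data.buffer;
    controller.enqueue(chunk); // pass the modified frame on to the next step
  };
}
```

This is exactly why the demo can apply “encryption” on both connections but decryption on only one: the connection that applies the transform twice ends up with the original frames.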

You can test the sample here. Make sure you run the latest Chrome Canary (84.0.4112.0 or later) and that the experimental Web Platform Features flag is on.

The API is quite easy to use. A simple logging transform function looks like this:
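A sketch of such a transform, following the shape in the explainer (the exact names are assumptions and the sample’s code may differ):

```javascript
// Pass-through transform that just logs each encoded video frame:
function dumpTransform(chunk, controller) {
  // chunk.data is an ArrayBuffer holding the encoded frame payload
  console.log('frame:', chunk.data.byteLength, 'bytes');
  controller.enqueue(chunk); // hand the unmodified frame to the next step
}

// Hooking it up on the sender side (browser-only; API naming follows the
// explainer at the time and may have changed since):
// const senderStreams = sender.createEncodedVideoStreams();
// senderStreams.readableStream
//   .pipeThrough(new TransformStream({ transform: dumpTransform }))
//   .pipeTo(senderStreams.writableStream); // writableStream feeds the packetizer
```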

The transform function is then called for every video frame. It receives an encoded frame object (named chunk) and a controller object. The controller object provides a way to pass the modified frame to the next step. In our case this is defined by the pipeTo call above, which is the packetizer.

Iterating improvements

With a fully working sample (video-only at first because audio was not yet implemented), we iterated quickly on some improvements such as key changes and not encrypting the frame header. The latter turned out to be very interesting visually. Initially, upon receiving the encrypted frame, the decoder of the virtual “middlebox” would just throw an error and the picture would be frozen. Exploiting some properties of the VP8 codec and not encrypting the first couple of bytes now tricks the decoder into thinking the frame is valid VP8. Which looks … interesting:

Give the sample a try yourself here.

Insertable Streams iterates on frames, not packets

The Insertable Streams API operates between the encoder/decoder and the packetizer that splits the frames into RTP packets. While it is not useful for inserting your own encoder with WebAssembly ...  Continue reading

WebRTC has made getting and sending real time video streams (mostly) easy. The next step is doing something with them, and machine learning lets us have some fun with those streams. Last month I showed how to run Computer Vision (CV) locally in the browser. As I mentioned there, local is nice, but sometimes more performance is needed so you need to run your Machine Learning inference on a remote server. In this post I’ll review how to run OpenCV models server-side with hardware acceleration on Intel chipsets using Intel’s open source Open WebRTC Toolkit (OWT).

Note: Intel sponsored this post. I have been wanting to play around with the OWT server since they demoed some of its CV capabilities at Kranky Geek, and this gave me a chance to work with their development team to explore its capabilities. Below I share some background on OWT, how to install it locally for quick testing, and a look at some of the models.



Don’t touch your face! To prevent the spread of disease, health bodies recommend not touching your face with unwashed hands. This is easier said than done if you are sitting in front of a computer for hours.  I wondered, is this a problem that can be solved with a browser?

We have a number of computer vision + WebRTC experiments here. Experimenting with running computer vision locally in the browser using TensorFlow.js has been on my bucket list and this seemed like a good opportunity. A quick search revealed somebody already thought of this 2 weeks ago. That site used a model that requires some user training – which is interesting but can make it flaky. It also wasn’t open source for others to expand on, so I did some social distancing via coding isolation over the weekend to see what was possible.

Check it out and keep reading below for how it works. All the code is available; I share some highlights and an alternative approach here.



When most people think of WebRTC they think of video communications. Similarly, home surveillance is usually associated with video streaming. That’s why I was surprised to hear about a home security project that leverages WebRTC not for video streaming, but for the DataChannel. WebRTC’s DataChannel might not demo as well as a video call, but as you will see, it is a very convenient way to setup peer-to-peer information transfer. ...  Continue reading

WebRTC has a new browser – kind of. Yesterday Microsoft’s “new” Edge browser based on Chromium – commonly referred to as Edgium – went GA. This certainly will make life easier for WebRTC developers since the previous Edge had many differences from other implementations. The big question is: how different is Edgium from Chrome for WebRTC usage?

The short answer is there is no real difference, but you can read below for background details on the tests I ran. If you’re new to WebRTC, the rundown may give you some ideas for testing your own product.


Edge is a big deal because Windows 10 is a big deal – according to StatCounter, Windows has a 34% OS market share overall and 78% of the Desktop browser market. Windows 10 is 65% of all Windows deployments, so Edge is the default install browser that could potentially be used by more than 22% of pageviews and more than 50% of all Desktop pageviews. Now potential use is different from actual use and the original Edge just didn’t catch on in a big way, which is why Microsoft announced it would switch to using Chromium as its web engine.

Unlike the old Edge, which was limited to Windows 10 and Android, the new Chromium-based Edge is also available on OS X and iOS. Edgium follows Chromium’s Canary, Dev, Beta, and GA build process, and the pre-GA releases have been available for a while. Microsoft started updating “legacy” Edge users to the new version automatically yesterday.

WebRTC Support

At last year’s Kranky Geek, Greg Whitworth, Edge Product Manager, basically said the initial goal for WebRTC was to keep parity with Chrome with some added support for perfect negotiation, screen capture improvements (lots of people share their Office documents), and bug fixes.

You should check out that video here (Greg’s section is during the first 5 minutes or so):

He also talks a bit about roadmap and what’ll be next.

Testing to see What’s Different


I ran through a battery of tests, similar to what I did for Safari, to see what’s different about Edgium and Chrome. Most of my testing was done earlier this week on OS X using Edge Beta 80 and Edge Canary 81. I also ran a smaller number of checks with Edge on Windows 10 just to check for consistency.

Unless otherwise noted, I used the official WebRTC samples for various tests. They provide a simple way to check for WebRTC compatibility – just load a sample in Chrome and Edge and run it.

Results: nothing to see here – Chrome and Edge were both identical.

getUserMedia visualizations

There was no doubt the getUserMedia API would be supported, but I wondered if Microsoft chose to change the display or permissions behavior.

Whenever a camera or microphone source is active, you get a red notification dot to the left:

Chrome’s notification is a grey circle to the right that flashes briefly before going solid.

Within the URL bar, here are the Camera and Microphone in-use symbols:

The microphone icon appears when both the camera and microphone are in use. The Camera icon only appears if only the camera is being used.

Results: same concept as Chrome with slight visual differences

Media permissions UI

The permissions UI also looks the same:

Controlling individual site access to your camera and microphone is also the same – you can jump there quickly by going to edge://settings/content/camera or edge://settings/content/microphone

I also checked to make sure there weren’t any differences with secure origins for accessing user media – i.e. you must use HTTPS or localhost if you want to use getUserMedia. This was exactly the same.
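That secure-origin rule boils down to something like this (a rough illustration only; the browsers’ actual check covers more cases, such as file:// pages, and this helper is not a real API):

```javascript
// Rough gist of the secure-origin requirement for getUserMedia:
// HTTPS pages qualify, as do localhost origins even over plain HTTP.
function isLikelySecureOrigin(protocol, hostname) {
  return protocol === 'https:' ||
         hostname === 'localhost' ||
         hostname === '127.0.0.1';
}
```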

Results: Edge is same as Chrome

Screenshare / getDisplayMedia

Microsoft mentioned screen share improvements, so I wondered if that user flow would look any different.

The screen / app / tab picker is the same as Chrome:

Like the media in-use indicator, the tab notification icon is on the left instead of on the right as in Chrome. Also like Chrome, if you share a tab it gives you a tab sharing indicator. Unlike Chrome, it does not blink for a few seconds when first initiated.

And if you are sharing a tab you also get the same tab sharing notification warning:

The window and application sharing notice is also the same as Chrome:

Screenshare Performance improvements

Greg also mentioned ...  Continue reading

When running WebRTC at scale, you end up hitting issues and frequent regressions. Being able to quickly identify what exactly broke is key to either preventing a regression from landing in Chrome Stable or adapting your own code to avoid the problem. Chrome’s tool makes this process much easier than you would suspect. Arne from Whereby gives you an example of how he used this to workaround an issue that came up recently.
{“editor”, “Philipp Hancke“}

In this post I am going to provide a blow-by-blow account of how a change to Chrome triggered a bug in Whereby and how we went about determining exactly what that change was.

Building an application on the foundation of WebRTC comes with some benefits and some drawbacks. The big benefit is obviously that you’re able to leverage the tremendous amount of awesome work that has gone into making WebRTC what it is today. However, it does also have some drawbacks. The correct operation of your application will depend on the correct operation of the supporting WebRTC technologies (like the browser) as well as the correct interaction between your application code and those technologies. When that interaction develops faults, the nature of those faults can either be blindingly obvious, very obscure, or somewhere in between. This is a story of a situation where it was neither obvious nor somewhere in between.

The Problem

That should never happen

The story starts when our support specialist Ashley reaches out to discuss the problems of a customer who is experiencing something weird. They reported that recently they had been experiencing problems transmitting video out past their corporate firewall. Audio was fine, and video passed the firewall into their corporate network just fine. Participants outside the firewall, however, could only hear the participants inside the firewall, not see them. Even stranger, if a participant inside the firewall shares their screen, that will transmit out past the firewall. Outside participants can then observe the self-view of the inside participants via screen sharing, but the main video feed remains blank.

Knowing the architecture of WebRTC under the hood, this is an odd failure mode. If a firewall is letting audio packets past its gates, it is quite unlikely that it would stop the accompanying video packets. At the very least it would require some pretty invasive (D)TLS man-in-the-middle setup in order to know the difference between the two at all. So we started digging.

The first relevant clue is that this customer is using one of our paid-for SFU meeting rooms, meaning the video feed from the individual participants is sent as several different spatial resolution feeds simultaneously (aka “simulcast”). Our alumnus fippo is offering some commentary from the sidelines as we are investigating this, and early on suggested simulcast as a triggering condition. Quite correctly, it would turn out.

The twist

The second relevant clue is that the affected connections are transiting via our TURN network as well. Presumably this is because the firewalls in question are quite restrictive in what outbound ports they allow traffic towards. As we are hard at work trying to understand the exact network conditions the customer is operating under, fellow engineer Hans Christian demonstrates that he can replicate the behavior simply by forcing TURN to be used towards our SFUs. There are some other subtleties too, like the order in which the different participants join the room, but once it is possible to reproduce the problem these are easier to map out. Still, no flashes of insight based on these two clues. TURN is designed to pass traffic along indiscriminately, and with the amount of TURN traffic we process (a lot), the outcry would have been much wider if this was not operating as it should.

Narrowing it down

Being able to reproduce the issue, we quite quickly determine that:

  1. it doesn’t matter what equipment is used by the receiving participant, and
  2. that the sending participant whose video goes missing needs to run Whereby on Chrome 73 or later.

Our attention turns to the release notes and change log for Chrome 73, but still nothing stands out. Time to look at packet traces.

Our SFU is capable of dumping RTP packets in the clear in development mode to facilitate this kind of debugging. However, in this case we want to inspect the packets on both sides of the TURN server. Even if the SFU is able to provide a log of the packets it sees after stripping DTLS, these cannot immediately be compared to the packet traces towards the TURN server. Here, the somewhat obscure Chrome Canary command line flag --disable-webrtc-encryption comes in handy. After adding similar functionality to our SFU code base (again in development mode, of course), we are able to obtain clear text packet traces from both the inbound and outbound leg of Chrome 73’s TURN server session. It is then quick work to eliminate this as the source of error: the TURN server is transmitting data as faithfully as could be wished for.

With the clear text packet traces in hand, though, it is time to start delving into the actual contents of the RTP stream. It shortly becomes apparent that our SFU is not receiving the keyframes it is expecting from Chrome 73. This is very odd, as we are sending quite a lot of PLIs (“Picture Loss Indications”) in Chrome’s direction, asking for some.

At this point we feel we have something tangible to report to the Chrome team, and do so. However, the failure still seems pretty tied to our particular setup, so we continue investigations on our end as well.

Debugging Chrome with Bisect

Divide and conquer

Being able to reproduce a customer’s problem in a controlled environment makes all the difference in the world when investigating a tough issue. For one thing, it makes it easier to vary parameters to see exactly what contributes to triggering the problem. Additionally, it also increases the amount of parameters you can vary a lot. The precise version of the browser used to run your application is one thing you can vary in your debug environment that it is hard to get the customer to do for you. If you can try any browser version you want, this enables a powerful technique called “bisecting”.

When troubleshooting software, you often know you have one version of code that exhibits the problem you are trying to avoid and one that does not. If you want to determine which change to the software code base introduced the problem, one way to do that is to just replay all changes that were made to it on its way from the first version to the other, testing the software at each step to see if the problem has manifested yet. While this works, it can be a very tedious and time-consuming process if you only become aware of the problem after the software code base has significantly changed, since the number of steps to retrace can become excessive.

When bisecting, you assume that the problem was introduced at one particular point, and then remains for all subsequent versions. This allows you to optimize this process by iteratively selecting a version midway between your known “good” version and the known “bad” version, testing it to see if it exhibits the problem or not. This midway version then either becomes your new good or bad version, and you start over by selecting a new midway point. Eventually the good and bad versions are close enough together that you can reason about all the changes between them and hopefully spot the change that introduced the problem. (Side note: while the terminology “good”, “bad” and “problem” implies qualitative judgement, this process can be used to pinpoint any change in behavior, harmful or benign. In our case, at this point we didn’t know if we were looking for a problem in Chrome or just a change that triggered problematic behavior in our application.)
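The procedure described above can be sketched in a few lines (a generic illustration of the technique, not the actual tooling; isBad stands in for whatever manual reproduction test is run against each build):

```javascript
// Generic bisection sketch. versions is ordered oldest-to-newest, with
// versions[0] known good and the last entry known bad. isBad() is the
// reproduction test run against each candidate build.
function bisect(versions, isBad) {
  let good = 0;
  let bad = versions.length - 1;
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (isBad(versions[mid])) {
      bad = mid;   // problem already present: search the older half
    } else {
      good = mid;  // still fine: search the newer half
    }
  }
  return versions[bad]; // the first version that exhibits the problem
}
```

Because each iteration halves the range, the number of builds to test grows only logarithmically with the size of the search space.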

Bisection was made popular by git, which introduced a dedicated command called “git bisect” to automate this process. The difference between your good and bad versions in git is represented as the path of commits from one to the other. The midway point is picked as the commit closest to the midway point on this path. The process terminates when your most recently found good version is the parent of your most recently found bad version. It is then up to you to examine the changes introduced by the bad commit to determine what the actual problem is. To get the most out of this process, the commit path between your original good and bad versions should be mostly linear, all the commits on the path should represent runnable versions of the software, and the changes introduced by each commit should be small enough that they can be effectively analyzed once the bisection process terminates. (There are ways to cope if these requirements don’t hold all of the time, though.)

Play it again, Sam

This sounds nice in theory, but how do we apply this to Chrome and our application? Our premise is that under Chrome 72, our application doesn’t exhibit the behavior we are trying to eliminate, while under Chrome 73 it does. If our last known good version had been e.g. “Chrome 62”, it could have made sense to bisect over Chrome major versions, but in this case we already know that the issue we are investigating was introduced in major version 73. We want to go one level deeper. We need to break the difference between Chrome 72 and Chrome 73 into smaller steps, and then try to apply the bisection approach to that list of steps. Fortunately, as part of the Chromium build infrastructure, Google maintains a pretty nifty change set lookup tool called omahaproxy. We use this to look up Chrome versions “72.0.3626.121” and “73.0.3683.90”, which yields “base branch positions” 612437 and 625896, respectively. So there are over 13000 “base branch” steps (which correspond to (small chunks of) git commits to the main Chrome git repository) between Chrome versions 72 and 73, which is a lot. Compiling and testing these step by step to see if the problem has manifested yet is obviously untenable. However, since bisecting works by halving the search space for each iteration, we should be able to narrow this down to one base branch step in log2(13459) ≈ 14 steps. Still, this is a significant undertaking: compiling Chrome from scratch takes a lot of time, and the first iterations will yield steps far enough apart that the benefit from iterative compilations will be marginal.
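The arithmetic above, spelled out (the base branch positions come from the omahaproxy lookup):

```javascript
// Bisection halves the search space on every round:
const steps = 625896 - 612437;              // 13459 base branch positions
const rounds = Math.ceil(Math.log2(steps)); // 14 builds to test, worst case
```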

A hidden gem

At this point, we’re treated to a very pleasant surprise. It turns out (again, hat tip to the intrepid fippo) that Google actually provides precompiled binaries for all these base branch steps! They even provide a python script to run bisection using these binaries! There’s a good description of how to obtain and use this on the Chrome developer pages. Having installed this, and given the base branch steps we found above, we can now search through the full list of steps very easily. By giving the range we want to search, it will present us with different Chrome candidates to test. We’ll tell bisect-builds what arguments we want to run Chrome with, and it will start the candidates automatically for us, one by one. It is then up to us to perform the necessary steps to reproduce our problem, terminate the candidate Chrome, and report if the problem was present or not. The next candidate started will be selected based on our report. In practice this process looks like this:

This whole process is surprisingly efficient and on the whole takes around ten minutes.

Once more, with feeling

The bad news is that the identified change is a dependency update, bumping the version of the bundled WebRTC code base. The version bump encompasses 53 different commits to the WebRTC repository, from 74ba99062c to 71b5a7df77. Looking at the list of commits, we can speculate as to which ones are likely to be involved in our issue (did you notice the entry “2019-01-21 [email protected] Remove simulcast constraints in SimulcastEncoderAdapter” in that list..?). We can also start reading the code diffs to understand how behavior has changed. If we were luckier, the problematic interaction would have been obvious by now, and we could have called it a day.

However, while we are getting nearer to answering “what” has triggered our problem, the “why” is still eluding us. We conclude that we need to instrument the changed code paths to understand better how they interact with our application. Fortunately, the Chrome (or rather, Chromium) source code is publicly available, and Google provides good instructions for building it yourself. Following the steps described in the build instructions, we soon have the Chromium source code and related Google tooling installed on our local machine. At this point we deviate slightly from the official build instructions. Since we want to build a historic version of Chromium rather than the tip of master, we use git checkout 94cbf3f56 to position the tree at the correct revision. We then make sure all dependencies are synced to this Chromium version using gclient sync. From here on out, most of the action is in the subdirectory third_party/webrtc. This is a separate git repository from the main Chromium repo, and after the sync above gclient has positioned us at revision 71b5a7df77 here.

Having identified a likely offending commit, we could now speculatively back this out to see if the problem disappears. However, as long as we have to compile Chromium from scratch anyway, we choose to be thorough and bisect our way through these 53 commits as well. We do this by running a traditional git bisect process over the WebRTC repository change span that was indicated earlier. The fundamentals of this process are similar to the first bisect process we went through, but some of the bits that previously happened behind the curtain are now up to us to perform by hand. At each step, we now have to compile and start our Chromium candidates explicitly:

The first compile of Chromium in a fresh tree takes a significant amount of time (a double-digit number of hours), so we leave this overnight and return to the task the next morning. The subsequent builds take around 5 minutes to prepare at each step, and the rest of the process runs more or less interactively. This is much slower than bisecting with pre-built binaries, but tolerable for our remaining 5-6 steps. Had our steps been further apart, the time for each build would have increased significantly, approaching the time for a full build. This just highlights how efficient that first bisection process was.

At the rainbow’s end

Ultimately, as shown above, this process confirms the change that causes our app to misbehave. It is just 124 lines of added code, and while we are still in the dark as to what the actual problem is, at this point it is feasible to instrument all the changed code flows in Chromium to understand how they interact with our application code.

As we do so, it gradually becomes clear what the culprit is. The Chrome change ensures that simulcast layers are always sorted in the right order, as sometimes (apparently) the layers can be given in reverse order (highest bandwidth allocated to the bottom layer, rather than the top layer). This should be a no-op for Whereby – the layers are always ordered from lowest to highest bandwidth anyway. Or so we thought.

Looking at debug printouts reveals that Chrome has ordered our layers in the order [1 (640×360), 0 (320×180), 2 (1280×720)] – neither ascending nor descending order, but some sort of jumbled mess. Digging further, layer 0 had been set with a bandwidth cap of 768kbps, which is higher than the cap given for layer 1 and lower than the one given for layer 2. So, when there’s only bandwidth available for one layer, Chrome will only send video on layer 1, not layer 0 as our SFU expects. At this point we are able to make the connection to the Whereby application code.

As mentioned earlier, we push a significant amount of TURN data. In an effort to control the cost spent on relaying this data, when we determine that ICE has resorted to using a “relay” candidate, we use RTCRtpSender.setParameters to restrict the bandwidth target to, coincidentally, 768kbps. This logic predates our use of simulcast. The unfortunate consequence is that when SFU, TURN, and simulcast are all used, we effectively set the target bandwidth for layer 0 to a much higher value than the intended usual 150kbps. As a result, this layer is down-prioritized and left dry. Our SFU would receive a video stream on layer 1, but since it was expecting layer 0 to be the initially active one, it got stuck trying to obtain that first keyframe. This left the user with just audio and no video stream, exactly as reported.
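The interaction can be sketched like this (the three-layer layout and the 768kbps figure come from the account above; the per-layer cap values and everything else are mocked so the logic can run outside a browser, and do not reflect Whereby’s actual code):

```javascript
// Mocked RTCRtpSender parameters for a three-layer simulcast sender:
const params = {
  encodings: [
    { rid: '0', maxBitrate: 150000 },  // layer 0: the layer the SFU expects first
    { rid: '1', maxBitrate: 500000 },
    { rid: '2', maxBitrate: 1500000 }, // illustrative caps for layers 1 and 2
  ],
};

// The pre-simulcast relay logic effectively applied one cap to the first
// encoding once a "relay" ICE candidate was detected:
params.encodings[0].maxBitrate = 768000;

// Layer 0 now has a higher cap than layer 1, so with only enough bandwidth
// for one layer, Chrome 73's sorted-layer logic sends layer 1 while the SFU
// waits in vain for a keyframe on layer 0:
const firstActive = params.encodings.reduce(
  (lowest, enc) => (enc.maxBitrate < lowest.maxBitrate ? enc : lowest));
console.log(firstActive.rid); // '1' – not the '0' the SFU expects
```

In the browser this object would be applied with sender.setParameters(params); a simulcast-aware version of the cap would scale every encoding’s maxBitrate rather than overwriting just the first.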

The change to fix this was 7 or 8 lines of code.

Takeaways ...  Continue reading