We have a long tradition here of analyzing major services that use WebRTC. In a sign of WebRTC’s success, that list has been getting much longer and we’re not keeping up. Fortunately, one of our favorite authors, Gustavo Garcia Bernardo, recently found the time to review the new Microsoft Azure Communications Service. He found some interesting results that we are happy to present here.
Gustavo has a deep career in real time communications and has been intimately involved with WebRTC since its very early days. Make sure to check out some of his past webrtcHacks posts or check out his blog over at rtcbits.com where he provides many unique insights in shorter form.
If you got this far I assume you know what Microsoft Azure is. Whenever a 1.6 trillion dollar company does a product launch it is generally a big deal, and particularly so to those of us who deal with communications APIs on a regular basis. Microsoft has had a long and very unique history with WebRTC, so we were extra curious to see how WebRTC was used as part of this new offer. As you’ll see, this one has some interesting peculiarities too.
Note I did reach out to the Azure Communications Team at Microsoft to give this a brief review for technical accuracy where they could comment. Thanks also to Fippo for his help testing.
{“editor”, “chad hart“}
Some weeks ago Microsoft announced Azure Communications Services (ACS). This new product in their catalog of cloud services provides chat, SMS, PSTN calling and Video communications. It competes in the Communications Platform as a Service (CPaaS) category against dominant players like Vonage, Twilio, Agora and against video API offerings from Zoom or Amazon.
The Microsoft offering is not very different from its competitors. This post will focus on the voice and video parts, which are based on WebRTC. As the details later in this post show, it reuses a big chunk of existing Microsoft infrastructure (from Skype and/or Microsoft Teams).
At a high level, there are two APIs:
- Management API – this includes server-side SDKs for the creation of users and access tokens
- Client SDKs – available for web, Android and iOS, these connect endpoints to the communication servers to send and receive audio/video/screensharing as well as media from the PSTN and Microsoft Teams.
API and features provided
There are two basic primitives in the Client API:
- Calls and
- Rooms.
With the Call interface you can call any other user connected to the system. With the Rooms primitives you can join a generic room. The support for identity and calling is stronger than in other platforms, probably because of the infrastructure being reused and the integration with the Teams platform. The lack of access control for rooms is notable: every access token apparently has permission to join any room if the room ID is known.
On the client side, the basic call control operations (mute/unmute, hold/unhold, screen sharing) are present, in addition to some audio and video device management APIs that simplify configuration of the system.
WebRTC compliance
As a summary, let us compare where the Azure implementation differs from the WebRTC standards (either the W3C specification or the various IETF drafts):
Feature | WebRTC/RTCWeb Specifications | Azure |
---|---|---|
SDES | MUST NOT offer SDES | uses SDES |
ICE | RFC 5245 | RFC 5245, full ICE instead of ice-lite |
Audio codec | Opus or G.711 | G.722, which is optional and G.711 |
Video codec | VP8 or H.264 | H.264, VP8 supported on P2P |
Multiple streams | Unified Plan | Plan B |
Simulcast | standard, implementations vary | not supported |
Client SDKs
The Client SDK is provided for Web, iOS and Android. Browser support is limited at this point. It only includes Chrome, some limited support (receive only) for Safari, and the new Chromium-based Edge on Windows only.
Browser | Windows | macOS | Ubuntu | Linux | Android | iOS |
---|---|---|---|---|---|---|
Chrome | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | n/a |
New Edge | ✔️ | n/a | n/a | n/a | n/a | n/a |
Safari | n/a | Some | n/a | n/a | n/a | ✔️ |
While testing the Web and Android SDKs it was noticeable that they still need some polishing. For example, the browser console logs are very verbose and show frequent warnings related to statistics or failing requests, although that is expected for a first version.
Server-side Management SDKs
The administration SDKs to create users and tokens are provided by Microsoft for C#, Python, Java and Node.js. Those SDKs run in a trusted application and require an access key that is created in the Azure console. Microsoft gets bonus points for providing primary and secondary access keys to support access key rotation.
Other features
Some other advanced functionality:
- PSTN Calling: the Private Preview did not allow us to test this, but according to the documentation it provides Dialing-Out support for both 1:1 and group calls
- SMS and Chat: as above, we could not test these, but sending SMS and chat messages is also part of the Azure communications offering
- Teams Integration: this is also in Private Preview but looks like one of the use cases where this new comms platform could get initial traction given the popularity of the Teams product nowadays.
There is no mention of recording or broadcasting capabilities in the documentation or SDKs, nor of any integration with Azure stream processing capabilities such as the text-to-speech or vision APIs.
Signaling
The signaling is based on HTTP requests. One can see many references to Skype domains in the signaling, showing how this product has been built on top of other existing parts of the Microsoft ecosystem. Even the user identifiers inside the JWT tokens of Azure Comms Services are called skypeids.
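To look for such claims yourself, the payload of a JWT can be decoded without verifying its signature. The following is a minimal Python sketch of that inspection step, not ACS code: the token is built locally for illustration, the claim name `skypeid` is an assumption based on the naming mentioned above, and the value is invented.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the payload of a JWT as a dict WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data: dict) -> str:
    """Helper for this example only: encode a dict as an unpadded base64url segment."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token: claim name assumed from the article, value invented.
fake_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"skypeid": "example-user-id"}),
    "signature-placeholder",
])

claims = decode_jwt_payload(fake_token)
print(claims)
```

This is only useful for inspection. A server that accepts tokens must verify the signature before trusting any claim.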
Below is an example of the proprietary signaling based on custom JSON format over HTTP sent when you mute/unmute your microphone:
For the 1:1 calls the system uses direct P2P WebRTC connections. When in Rooms mode, ACS uses an SFU to forward the audio and video packets between the different participants. Those SFUs are located in different regions. In my case (being in Europe), I was assigned to one in Dublin during my tests.
SDP & Media
PeerConnection Plan
The Client SDK uses a single WebRTC PeerConnection for sending and receiving multiple streams. This is the most efficient and modern mechanism, but it is not used by all platforms. On the downside, it uses the original Plan-B semantics instead of the new Unified Plan semantics. That is not atypical given the incumbency of Plan-B: many of the largest multi-party applications are still on Plan-B today.
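One rough way to tell the two styles apart in a captured SDP is to count how many tracks each m= section announces: with Plan-B a single m= section can carry several tracks, while Unified Plan uses one m= section per track. The sketch below is a heuristic under that assumption, and the sample SDP is invented for illustration rather than taken from ACS.

```python
from typing import List


def tracks_per_media_section(sdp: str) -> List[int]:
    """Count distinct tracks (identified by msid) announced in each m= section."""
    counts = []
    tracks = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            if tracks is not None:
                counts.append(len(tracks))
            tracks = set()
        elif tracks is not None and line.startswith("a=ssrc:") and " msid:" in line:
            # Format: a=ssrc:<ssrc> msid:<stream id> <track id>
            tracks.add(line.split(" msid:", 1)[1])
    if tracks is not None:
        counts.append(len(tracks))
    return counts


def looks_like_plan_b(sdp: str) -> bool:
    """Heuristic: any m= section with more than one track suggests Plan-B."""
    return any(n > 1 for n in tracks_per_media_section(sdp))


# Invented sample, not captured from ACS.
sample_sdp = """v=0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=ssrc:1001 msid:streamA audioA
m=video 9 UDP/TLS/RTP/SAVPF 96
a=ssrc:2001 msid:streamA videoA
a=ssrc:2002 msid:streamB videoB
"""

print(tracks_per_media_section(sample_sdp), looks_like_plan_b(sample_sdp))
```

The heuristic cannot distinguish the two styles when every m= section carries a single track, so a negative result is inconclusive.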
Interactive Connectivity Establishment (ICE)
In terms of media connectivity, ACS uses both STUN and TURN TCP servers. Surprisingly, TURN TLS is not included, which could limit the ability of ACS to connect, especially in restrictive enterprise environments.
    http://localhost:5000/, { iceServers: [turn:52.158.34.11:3478?transport=udp, turn:52.158.34.11:443?transport=tcp], iceTransportPolicy: all, bundlePolicy: max-bundle, rtcpMuxPolicy: require, iceCandidatePoolSize: 0, sdpSemantics: "plan-b" }, {advanced: [{enableDtlsSrtp: {exact: false}}, {googCpuOveruseDetection: {exact: false}}]}
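The absence of TURN TLS can be checked mechanically: TLS servers use the `turns:` URI scheme, while both entries above use `turn:`. Here is a small Python sketch that parses the two URIs from the configuration above. The parser is a simplification written for this post (it expects an explicit port), and the fallback transports are the commonly assumed defaults rather than anything ACS-specific.

```python
from typing import NamedTuple


class TurnServer(NamedTuple):
    secure: bool      # True for the "turns" scheme (TURN over TLS)
    host: str
    port: int
    transport: str


def parse_turn_uri(uri: str) -> TurnServer:
    """Parse turn:/turns: URIs of the form scheme:host:port[?transport=x]."""
    scheme, rest = uri.split(":", 1)
    if scheme not in ("turn", "turns"):
        raise ValueError(f"not a TURN URI: {uri}")
    hostport, _, query = rest.partition("?")
    host, _, port = hostport.rpartition(":")
    # Assumed defaults when no transport is given: UDP for turn, TCP for turns.
    transport = "udp" if scheme == "turn" else "tcp"
    if query.startswith("transport="):
        transport = query[len("transport="):]
    return TurnServer(scheme == "turns", host, int(port), transport)


# The two iceServers entries shown in the configuration above.
servers = [parse_turn_uri(u) for u in [
    "turn:52.158.34.11:3478?transport=udp",
    "turn:52.158.34.11:443?transport=tcp",
]]
has_turn_tls = any(s.secure for s in servers)
print(servers, has_turn_tls)
```

Note that TCP on port 443 is not the same as TLS: a proxy that inspects traffic can still tell that the connection is not a TLS session.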
For the direct connection to the SFU, it uses typical ICE UDP candidates but also ICE TCP candidates on port 3478. The ICE support is not ice-lite but full ICE. That is not very common in SFUs with public IPs because it is harder to implement. Full ICE doesn’t provide many advantages but doesn’t have any negative impact either.
Encryption
The encryption is based on SRTP as required by WebRTC. However, in SFU/Rooms mode the key exchange uses SDES and not the standard DTLS protocol. That is simpler and provides faster establishment but is only supported in Chrome. It will likely be removed at some point, because the standard explicitly forbids SDES as less secure than the required DTLS key exchange.
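In SDP terms the difference is visible in the attributes: SDES carries the SRTP key in an `a=crypto` line, while DTLS-SRTP advertises a certificate hash in an `a=fingerprint` line. A minimal Python sketch of that check follows. The sample line is invented, with a placeholder instead of real key material, and is not taken from an ACS session.

```python
def key_exchange_methods(sdp: str) -> set:
    """Report which SRTP key exchange mechanisms an SDP advertises."""
    methods = set()
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("a=crypto:"):
            # SDES: the SRTP key travels inside the SDP itself.
            methods.add("sdes")
        elif line.startswith("a=fingerprint:"):
            # DTLS-SRTP: only a certificate fingerprint is signaled.
            methods.add("dtls-srtp")
    return methods


# Invented sample with placeholder key material.
sample_sdp = """m=audio 3480 RTP/SAVPF 9 0 8
a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:PLACEHOLDER_KEY_MATERIAL
"""

print(key_exchange_methods(sample_sdp))
```

Because the key is part of the signaling with SDES, anything that can read the signaling messages can decrypt the media, which is the main reason the specifications rule it out.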
Codecs
G.722 is used for the audio codec. That is really uncommon for WebRTC platforms but not that surprising given the need for PSTN interoperability and the reuse of existing Microsoft infrastructure. This is a fragment of the SDP answer with the audio channel information:
    m=audio 3480 RTP/SAVPF 9 0 8 13 101
    c=IN IP4 40.113.83.182
    a=rtpmap:9 G722/8000
    a=rtpmap:0 PCMU/8000
    a=rtpmap:8 PCMA/8000
    a=rtpmap:13 CN/8000
    a=rtpmap:101 telephone-event/8000
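The order of payload types on the m= line expresses preference, so G.722 (payload type 9) comes first, followed by the two G.711 variants. As a small illustration, this Python sketch maps the payload types of the fragment above to their rtpmap entries. The helper name is made up for this post.

```python
def codecs_in_order(media_section: str) -> list:
    """Return (payload type, encoding) pairs in m= line preference order."""
    lines = [l.strip() for l in media_section.strip().splitlines()]
    m_line = next(l for l in lines if l.startswith("m="))
    # Format: m=<media> <port> <proto> <payload type> ...
    payload_types = m_line.split()[3:]
    rtpmap = {}
    for l in lines:
        if l.startswith("a=rtpmap:"):
            pt, encoding = l[len("a=rtpmap:"):].split(" ", 1)
            rtpmap[pt] = encoding
    return [(int(pt), rtpmap.get(pt, "unknown")) for pt in payload_types]


# The audio fragment of the SDP answer shown above.
audio_section = """
m=audio 3480 RTP/SAVPF 9 0 8 13 101
c=IN IP4 40.113.83.182
a=rtpmap:9 G722/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:13 CN/8000
a=rtpmap:101 telephone-event/8000
"""

for pt, encoding in codecs_in_order(audio_section):
    print(pt, encoding)
```

Note that Opus does not appear anywhere in the list.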
The video codec selected is H.264. It uses RTX retransmissions for reliability. ACS doesn’t include simulcast support to adapt the video quality to the needs of different participants in the room. Also, at least in the examples I tested, the bitrate was pretty low: the sender parameters showed it configured to use H264 at 200kbps.
RTCP
Other details at the RTP/RTCP level include the use of bundle, rtcp-mux and rtcp-rsize, which are used in most platforms too. It also reserves 50 SSRCs for each stream (1501, 1551…), and during the initial establishment of the call 8 remote streams are pre-allocated in the remote SDP for future participants.
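If the SSRC blocks are contiguous and continue the pattern of the two values above, the block for a given stream slot is simple arithmetic. The following Python sketch rests entirely on that assumption: only the starting values 1501 and 1551 and the block size of 50 were observed, and the constant and function names are made up for illustration.

```python
SSRC_BASE = 1501        # first SSRC seen in the SDP
SSRCS_PER_STREAM = 50   # block size reserved per stream


def ssrc_block(stream_index: int) -> range:
    """SSRC range reserved for a stream slot, assuming contiguous blocks."""
    start = SSRC_BASE + stream_index * SSRCS_PER_STREAM
    return range(start, start + SSRCS_PER_STREAM)


def stream_index_for(ssrc: int) -> int:
    """Inverse mapping: which stream slot an SSRC would fall into."""
    if ssrc < SSRC_BASE:
        raise ValueError(f"SSRC {ssrc} is below the first reserved block")
    return (ssrc - SSRC_BASE) // SSRCS_PER_STREAM


print(ssrc_block(0), ssrc_block(1), stream_index_for(1551))
```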
Bandwidth Estimation (BWE)
For bandwidth estimation it uses receiver-side estimation (based on REMB) instead of the more modern and optimized sender-side bandwidth estimation (based on Transport Feedback).
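Which mechanism a session negotiates shows up in the `a=rtcp-fb` lines of the SDP: `goog-remb` for REMB and `transport-cc` for transport-wide feedback. A minimal Python sketch of that check is below. The sample lines and the payload type 107 are invented for illustration, not copied from an ACS session.

```python
def bwe_feedback(sdp: str) -> set:
    """Return the bandwidth-estimation feedback types offered in an SDP."""
    found = set()
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=rtcp-fb:"):
            continue
        # Format: a=rtcp-fb:<payload type or *> <feedback type> [parameters]
        parts = line.split()
        if len(parts) >= 2 and parts[1] in ("goog-remb", "transport-cc"):
            found.add(parts[1])
    return found


# Invented sample: payload type 107 is a placeholder.
sample_sdp = """m=video 3480 RTP/SAVPF 107
a=rtcp-fb:107 goog-remb
a=rtcp-fb:107 nack
a=rtcp-fb:107 nack pli
"""

print(bwe_feedback(sample_sdp))
```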
Other unidentified stuff
There are also non-standard extensions present in the SDP. I doubt these have any impact and are probably inherited from other applications. For example:
    a=x-mediabw:applicationsharing-video send=8000;recv=8100
    a=x-source-streamid:19
    a=x-signaling-fb:* x-message app send:dsh
    a=x-signaling-fb:* x-message app send:src,csrc,vc recv:src
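When comparing SDPs across platforms it can help to pull out every experimental `a=x-` attribute in one go. Here is a small Python sketch that does so for the lines above. It only groups the raw values and makes no attempt to interpret them, since their meaning is not documented.

```python
def nonstandard_attributes(sdp: str) -> dict:
    """Group the values of a=x-... attributes by attribute name."""
    result = {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("a=x-"):
            # Drop the leading "a=" and split name from value at the first colon.
            name, _, value = line[2:].partition(":")
            result.setdefault(name, []).append(value)
    return result


# The non-standard lines quoted above.
fragment = """
a=x-mediabw:applicationsharing-video send=8000;recv=8100
a=x-source-streamid:19
a=x-signaling-fb:* x-message app send:dsh
a=x-signaling-fb:* x-message app send:src,csrc,vc recv:src
"""

for name, values in nonstandard_attributes(fragment).items():
    print(name, values)
```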
Conclusions
Azure Communication Services has a simple API. Everything worked as expected with very little effort. The documentation was good and the interactive samples really helpful. It also promises an easy to understand and competitive pricing model.
On the other hand, this is still a Beta product. It is not going to be as mature as existing offerings from competitors that have been around for years. If ACS is going to be considered seriously, Microsoft will have to extend support to other browsers and clean up the existing web support. In addition, the lack of some video quality technologies (mostly simulcast) and of support for newer codecs (especially Opus) was unexpected and will hopefully be addressed in Microsoft’s upcoming releases. The lack of recording is also a big gap for many popular use cases.
The most promising part from my point of view is the potential integration with the Azure ecosystem for features like push notifications, text-to-speech, computing, pubsub… For example it would be very useful to have pubsub support for audio/video but that is only available for SMS right now. I’m also looking forward to what people can build using the Teams integration but I couldn’t evaluate that part during these tests.
{“authors”: ”Gustavo Garcia”}