As you may have heard, Whatsapp discovered a security issue in their client which was actively exploited in the wild. The exploit did not require the target to pick up the call which is really scary.
Since there are not many facts to go on, lets do some tea reading…
The security advisory issued by Facebook says
A buffer overflow vulnerability in WhatsApp VOIP stack allowed remote code execution via specially crafted series of SRTCP packets sent to a target phone number.
This is not much detail, investigations are probably still ongoing. I would very much like to hear a post-mortem how WhatsApp detected the abuse.
We know there is an issue with SRTCP, the control protocol used with media transmission. This can mean two things:
- there is an issue with how RTCP packets are decrypted, i.e. at the SRTCP layer
- there is an issue in the RTCP parser
SRTCP is quite straightforward so a bug in the RTCP parser is more likely. As I said last year, I was surprised Natalie Silvanovichs fuzzer (named “Fred” because why not?) did not find any issues in the webrtc.org RTCP parser.
We actually have a bit of hard facts provided by the binary diff Checkpoint Research wherein they analyzed how the patched version is different.
They found two interesting things:
- there is an additional size check in the RTCP module, ensuring less than 1480 bytes
- RTCP is processed before the call is answered
Lets talk about RTCP
RTCP, the Realtime Control Protocol, is a rather complicated protocol described in RFC 3550. It provides feedback about how the RTP media stream is doing such as packet loss. A UDP datagram can multiplex multiple individual RTCP packets into what is called a compound packet. When a RTCP compound packet is encrypted using SRTCP, all of the packets are encrypted together with a single authentication tag that is usually 12 bytes long.
To make demuxing compound packets possible, each individual RTCP packet specifies its length in a 16 bit field. For example a sender report packet starts like this:
1 2 3 4 |
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ header |V=2|P| RC | PT=SR=200 | length | |
The length field is defined as
length: 16 bits
The length of this RTCP packet in 32-bit words minus one,
including the header and any padding. (The offset of one makes
zero a valid length and avoids a possible infinite loop in
scanning a compound RTCP packet, while counting 32-bit words
avoids a validity check for a multiple of 4.)
which is… rather complicated. It particular this definition means that the RTCP parser MUST validate the length field against the length of the datagram and the remaining bytes in the packet. Some RTCP packets even have additional internal length fields.
For the first packet in a compound packet length validation is usually done by the library implementing SRTCP like libSRTP. Mind you that WhatsApp probably uses PJSIP and PJMEDIA, or at least they did in back in 2015 when I took a look.
The length check for the second packet needs to be done by the RTCP library. I would not be surprised if this is where things went south. Been there, done that. And it remains a bit unclear whether the length field is validated against the remaining bytes. 1480 seems like a very odd number to check for though. At first I thought this made sense since it was 1492 minus the 12 bytes for the SRTCP tag but the maximum payload size of UDP turned out to be 1472 bytes, not 1492. So now I end up being confused again…
Don’t process data from strangers
There is another issue here. As the New York Times article said it looks like the victims received calls they never answered. Checkpoint’s analysis confirms that the RTCP code is (still) triggered in the patched version. This is not good and made the attack easier at least.
WhatsApp had been spending quite some time on optimizing call setup time back in 2015 already. Establishing the media connection before the user picks up the actual call is a good way to optimize things in theory. This thinking is not entirely new, even a standard such as XMPPs Jingle advises against using this without some kind of trust relationship:
the responding client SHOULD NOT return a “session-accept” action to the initiator until the responder has explicitly agreed to proceed with the session (unless the initiator is on a list of entities whose sessions are automatically accepted).
Even in terms of privacy this is hard to get right as one must at least avoid exposing IP addresses. WhatsApp probably ensures this with their relay-first approach.
Summary
Such a bug in RTCP processing is bad on its own. In combination with processing data before the call was accepted this became even worse and was actively exploited. End-to-end fuzzing (even a simple fuzzer like Fred that was mentioned earlier) might have found the issue. The current patch looks like a hotfix though and I would expect further changes in that area of the code (after learning how to patch-diff!).
Not processing media from people not on the contact list is another change I would expect and that one is pretty easy to observe with wireshark.
Tim Panton found the right words:
I’m feeling deep sympathy for the coder who wrote the WhatsApp bug. RTCP/DTLS is a horrid protocol. I stared at the RFC for days puzzling out how to get the encrypted payload length.
{“author”: “Philipp Hancke“}
Alessandro Toppi says
1500 (common MTU) – 20 (IP header) = 1480 (UDP header + payload).
I think every check involving that value is meant to be considered on the whole UDP datagram. If your WhatsApp stack analysis still stands, their VoIP service does not use ICE, so it makes sense that the implementation directly handles UDP datagrams.
In general there is nothing wrong in using datagrams (and RTCP compound packets) bigger than 1480, given that you want to accept all the consequences of MTU segmentation. But by looking at the mentioned diffs my idea is that the checks are intended to limit the data given to the parsers on a size basis. Indeed there is a memcpy instruction and raw handling of buffer locations, suggesting the usage of static allocated, fixed size arrays as buffers.
Philipp Hancke says
Right but why would you use a 1480 byte buffer in the first place instead of 1472? The UDP header would typically be stored separetely…
DEAN says
UDP header can be exploited with string value to mirror packets already a very easy exploit to accept API as if from trusted resource