I was starting to work with a big dataset and was dreading the idea of bogging down my machine with MySQL or SQLServer, so I decided to give Google’s BigQuery a try. Before I got into my project, I was pleasantly surprised to a public GitHub dataset was readily available. Tsahi does a cursory analysis of WebRTC projects on GitHub by manually counting search results every month. I was also inspired by Billy Chia‘s great NoJitter post analyzing WebRTC topics on Stack Overflow.
I was curious to see if I could extract some details from GitHub to see:
- Has WebRTC activity been increasing?
- Who are the top contributors?
- How does WebRTC compare to other technologies?
More than that, I want to try to open this analysis up to others, so I have fully documented my methodology and queries.
Here goes.
GitHub Developers <3 WebRTC
If you have ADHD and can’t get through a lengthy methodology and analysis, this is the summary results section. If you want to see the long walk-through of how I did my methodology, analysis, and some of the more specific results, then please read the more detailed section after this one.
The short methodology synopsis is this – I used the publicly available github dataset on Google’s BigQuery to look for repository – aka repo – and user activity for repos that had WebRTC related terms – i.e. WebRTC, getUserMedia, PeerConnection, and DataChannel.
New repos and contributors over time
Some of the key developer interest metrics you can get from GitHub include:
- Total number of repos – individual code repositories
- Total contributors – GitHub users that made actual pushes of code that contained WebRTC
Things are looking good here for WebRTC with a fairly steady ramp across all of these metrics. Note there is some seasonality here, but Q4 2014 still added 791 repos and 556 new contributors.
Total repos and contributors
You can also tweak this to see how the total count has grown over time:
The GitHub dataset I pulled only starts in March of 2012, but you can see a big ramp of 30 repos that month to 5196 by the end of 2014. On the contributor side 2014 closed with 3977 contributors.
New repos and contributors are one thing, but are these mostly developers giving WebRTC a quick check and then abandoning it? To check this, I looked at the total repo and contributor activity by month across the available timespan:
This was reassuring. Again, WebRTC shows the same upward pattern with more than 500 active repos this past October and November and an average of more than 400 contributors per month in Q4 2014.
Is this a big deal? Comparing to WebGL
Perhaps this isn’t a big deal. The best way to check is to compare these results against another dataset. I chose WebGL because:
- WebGL is based on JavaScript. WebRTC is based on JavaScript.
- WebGL leverages existing DOM elements – the canvas. WebRTC leverages existing DOM elements – video and audio
- If you wanted to do WebGL like things on the web before WebGL, you needed Flash. If you wanted to do WebRTC like things on the web before WebRTC, you needed Flash.
- WebGL struggles with hardware support on mobile. WebGL struggles with hardware support on mobile.
- WebGL is the web version of OpenGL. WebRTC is the web version of VoIP.
The WebGL 1.0 specification was actually released a few months before Google Open Sourced the WebRTC project in 2011 where we are still waiting on a full, formal spec.
So here is how it compares in terms or new user and repository creation:
As noted above, WebGL effectively started much sooner than WebRTC, so factoring this in, WebRTC does not look too shabby. There is no doubt the WebGL has a larger amplitude, but is surprising that it follows the same pattern (r=.90 & r=.91 for stats nerds).
Activity shows the same:
So is WebRTC a big deal? It has not surpassed WebGL, but if you use WebGL as a guide then WebRTC should continue to grow on a similar trajectory.
WebRTC’s alright!
Here are some summary statistics:
Clearly WebRTC is doing just fine, more than doubling its size in 2014. While the WebRTC community here is smaller than WebGL, it actually showed slightly higher engagement in terms of new and active as percentage of total in 2014.
Note this is just considering open source WebRTC developers posting their code on Github. This does not show:
- Closed-source developers
- Developers with private repos on GitHub
- Developers using open source repositories other than GitHub
- Developers who cloned a repo to check it out, but never pushed any code
That’s it for the short summary. I have a lot more below including some additional datapoints…
Want more? Detailed Analysis & Methodology
If you are interested in doing this sort of thing yourself, here is how I did it, step-by-step. I am not going to go into the details of how to write an SQL query (see w3schools for a good start on that) or how to work pivot tables, but I’ll share the queries I wrote.
Using BigQuery
The first thing you need to do is sign-up for Google’s BigQuery here. If you are a new user like I was, they give you a free trial. Even after that, you get 1 TB of data processed per month at no charge. I never got close to this and I doubt you would unless you are running queries all of the time. If you are paranoid, BigQuery tells you how much data your query will use before you run it by clicking on the validator check mark:
Its not easy to find, but you can check your unbilled usage at https://console.developers.google.com/project/[your project name]/billing/unbilledinvoice – just put your project name in where I indicated in the URL.
As of today, you’ll see they use the github repository as an example right on the BigQuery homepage. To get started with this dataset, read through the instructions starting at the Analyzing event data with BigQuery on githubarchive.org.
Finding WebRTC data
Assuming you got through the above, now we need to extract WebRTC related data from one of the several tables available on BigQuery.
I started using the githubarchive:github.timeline table since it had all the relevant payload data is already extracted (vs. having to do this yourself) and it has a full history in a single table without having to aggregate multiple tables.
I wanted to see WebRTC related creation. To determine this I look for one of the following terms in a repo name, push payload, or repo description:
- WebRTC
- getUserMedia
- PeerConnection
- DataChannel
This is not perfect, but seemed to do a decent job of grabbing most of the interesting activity. A non-WebRTC repo with the word DataChannel will show up. I did a bunch of random scans through the dataset to see if this was happening often and I could only find one instance out of dozens where “datachannel” might not refer to a WebRTC DataChannel.
Likewise it is not going to pick up a WebRTC repo if no one ever uses any of the keywords in appropriate place.
Note I attempted to add “ORTC” into this list (using REGEXP_MATCH(LOWER(payload_commit_msg), r'\bortc\b') ), but got many false positives and the valid results had overlap with the above terms anyway.
From here we need to figure out what fields to extract and under what conditions we should grab this data. On the fields side, I scanned through the schema names and did some trial-and-error experimentation to determine what fields were interesting.
This was my query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
/* Pull interesting fields from pushes that have WebRTC related terms in the commit or repo description*/ SELECT repository_name, repository_url, repository_created_at AS created, repository_organization, repository_language, repository_description, repository_has_downloads, repository_forks, repository_fork, repository_pushed_at, repository_homepage, repository_watchers, payload_commit_msg, actor_attributes_login, actor_attributes_name, actor_attributes_company FROM [githubarchive:github.timeline] WHERE type="PushEvent" AND ( /*Check the repo name */ LOWER(repository_name) CONTAINS "webrtc" OR LOWER(repository_name) CONTAINS "getusermedia" OR LOWER(repository_name) CONTAINS "peerconnection" OR LOWER(repository_name) CONTAINS "datachannel" OR /*See if any keywords in the commit message */ LOWER(payload_commit_msg) CONTAINS "webrtc" OR LOWER(payload_commit_msg) CONTAINS "getusermedia" OR LOWER(payload_commit_msg) CONTAINS "peerconnection" OR LOWER(payload_commit_msg) CONTAINS "datachannel" OR /*And check the repo description */ LOWER(repository_description) CONTAINS "webrtc" OR LOWER(repository_description) CONTAINS "getusermedia" OR LOWER(repository_description) CONTAINS "peerconnection" OR LOWER(repository_description) CONTAINS "datachannel" ) AND YEAR(repository_pushed_at) < 2015 /*2015 data has different schema*/ |
You can find the resultant table in guthubWebrtc.webrtcPushes in a nice compact 15.7MB with 53,851 rows.This extract is not perfect.
When Schemas Change
I started this analysis in mid January and came back to write it all up in February. In checking things for blog post I saw this note on the githubarchive:github.timeline description:
Note: this table is DEPRECATED and is no longer updated with new activity. Please see githubarchive.org for instructions on how to use new day, month, year datasets.
Crap! This means a new and much more complex query. I looked into this and it would require: manually mapping the payload fields in new 2015 data back to the extracted payload format found in the timeline dataset and the prior-to-2015 data, hoping they match for the fields I care about, and them merging them. I started to do this, but it was tedious job matching the new JSON format with JSON_Extract command to the existing timeline data. I decided to save this for another day since it would only give me a extra month of data (you’ll see I summarize based on monthly results). As a result, you are only going to be able to use this query to see data up to the end of 2014.
Figuring out what this data means
Now I had a bunch of smaller, summary datasets I could easily throw into pivot tables and chart. I did this in this Excel sheet. For reference, you can find all the charts in this PowerPoint.
Repo Analysis
Building a repo timeline
One of the things I would like to see if the number of repos created with a WebRTC element over time. I am also interested to see if the repo is a fork, how many watchers it has, and how many different contributors there are. To do this I used the following query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
/*extracts unique repos and selected fields */ SELECT /*grab all the data in the subquery and perform some summary functions */ *, REGEXP_EXTRACT(repo, '(.*)/{1}') AS owner, /*get the owner/account/org/top-level repo*/ IF( (forks + watchers + contributors) > 2, FALSE, TRUE) AS solos, /*my measure of group activity*/ DATEDIFF(firstWebrtc, createDate) as daysB4webrtc, /*for data checking */ IF( MONTH(firstWebrtc) >= 10, CONCAT( STRING(YEAR(firstWebrtc)), '-', STRING(MONTH(firstWebrtc)) ), CONCAT( STRING(YEAR(firstWebrtc)), '-0', STRING(MONTH(firstWebrtc)) ) ) AS yearMonth /*to help with data summarization*/ FROM ( /*Grab the key data*/ SELECT SUBSTR(repository_url, 20) AS repo, MIN(created) AS createDate, MIN(repository_pushed_at) AS firstWebrtc, /*WebRTC 'startDate*/ MAX(repository_fork) AS fork, /*mark TRUE if it is ever true */ MAX(repository_forks) AS forks, MAX(repository_watchers) AS watchers, COUNT( distinct actor_attributes_login) AS contributors, LAST( repository_language) AS language, FROM githubWebrtc.webrtcPushData GROUP BY repo ORDER BY createDate, fork, forks, watchers, language, contributors ) |
From here I dumped into Excel to play with the data in a more interactive fashion. After determining what additional summary functions were useful, I went back in and created the summary functions on the outer query to make this more portable.
New Repos over Time
For my repo data above, I had the following fields:
- Url of the repo
- createDate – date the repo was created
- firstWebrtc – date the first WebRTC related push was indentified
- fork – was the repo a fork?
- forks – the number of times the repo was forked
- watchers – the most watchers a repo ever had
- contributors – the number of individuals who made a push
- language – the last language registered
Often times an individual will fork a repo to check it out and then never use it again. I wanted to see if this was occurring often, so I made a custom excel field that basically added the forks, watchers, and contributors fields together. If this added up to more than 2 it I assumed that it was more than just someone doing a quick fork. The formula marks these FALSE in the Individuals field. If it was 1 (a repo has to have at least 1 contributor) or 2 the Individuals field was identified as TRUE.
I also formatted the create date into a year-month format to help with the pivot table.
After fixing up my query, I extracted the data to Excel and made some pivot tables and pivot charts on the Timeline tab.
See the total of all repos, independent of Individuals and Community repo types in the summary section in the beginning of the post.
Some observations:
- There is clearly lots of growth since 2012 here
- There appears to be some seasonality – its not too surprising that December always trends slower than November
- Group activity seems to be trending slightly down overall while Solos according to my definition are trending upward
- As a percentage of total by period, group activity is declining in favor of many more solo repos (only 13% in 2014 from 76% in 2010).
- Removing Forks – those repos that were made from forks – does not change the trend data
- Aggregate activity is trending up despite a slowdown in November and December 2014
Languages
Since I had the data in the data_repos dataset, I also looked at the programming language distribution of these new repos. This data is a little suspect since it uses GitHub’s own language classification if none is explicitly specified. If your project has a lot of CSS associated with it, the project could be classified as CSS. As I point out in my methodology above, I am also grabbing the last identified language as opposed to the first one in the hope that GitHub’s classification has improved over time.
Nonetheless, the data is not all that surprising showing mostly JavaScript followed by C++:
New Owners
Github allows a hierarchy with an owner that can be an individual user or organization with repos under that. For example, in github.com/webrtcHacks/webrtc_baby_monitor – webrtcHacks is the owner, webrtc_baby_monitor is the repo. One group could organize 1000 lines of code in 10 repos while another with a million could have just a single repo. Is it really fair to weigh them the same? To find out I also setup a query to aggregate the data at the owner level. Here is my query for this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
/*extracts unique owners and selected fields */ SELECT /*outer query to help with data summarization*/ *, YEAR(createDate) AS startYear, MONTH(createDate) AS startMonth, IF( MONTH(createDate) >= 10, CONCAT( STRING(YEAR(createDate)), '-', STRING(MONTH(createDate)) ), CONCAT( STRING(YEAR(createDate)), '-0', STRING(MONTH(createDate)) ) ) AS yearMonth, FROM ( SELECT REGEXP_EXTRACT(repository_url, 'github.com/(.*)/{1}') AS owner, COUNT(DISTINCT repository_url) as repos, MIN(repository_pushed_at) AS createDate, /*WebRTC 'startDate*/ SUM(repository_forks) AS forks, COUNT( distinct actor_attributes_login) AS contributors, FROM githubWebrtc.webrtcPushes GROUP BY owner, repository_url ORDER BY repos DESC, createDate, forks, contributors ) |
I dumped this data into the data_owners tab in Excel and made some charts.
Here is some of what I noticed:
- Organizations like rtc-io, Mozilla, and Kurento maintain a lot of repos
- It turns out 750 organizations out of 3744 had more than one repo
- Just about 20% of orgs
- 42% of all WebRTC repos belonged to a owner with more than one repo, so these are significant
The moving average shows a steady rise in the “owners” using WebRTC in their repos.
This data ended up not being all that distinct or useful compared to the repo data, so I left it out of the summary analysis above to keep it less cluttered.
New Contributors
We can use a similar methodology to summarize unique contributors within these repos.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
/*extracts WebRTC contributors */ SELECT /*outer query to help with data summarization*/ *, YEAR(firstPush) AS startYear, MONTH(firstPush) AS startMonth, IF( MONTH(firstPush) >= 10, CONCAT( STRING(YEAR(firstPush)), '-', STRING(MONTH(firstPush)) ), CONCAT( STRING(YEAR(firstPush)), '-0', STRING(MONTH(firstPush)) ) ) AS yearMonth, FROM ( SELECT actor_attributes_login AS user, FIRST( actor_attributes_name) AS name, FIRST( actor_attributes_company) AS company, COUNT( distinct repository_url ) AS repos, MIN( repository_pushed_at) AS firstPush, MAX( repository_pushed_at) AS lastPush, COUNT(distinct DATE(repository_pushed_at)) AS daysPushed, DATEDIFF(CURRENT_DATE(), MAX( repository_pushed_at)) AS daysSincePush FROM githubWebrtc.webrtcPushes GROUP BY user ORDER BY daysPushed DESC, repos DESC, name, company, firstPush, lastPush, daysSincePush ) |
This went into the data_contributors tab in my spreadsheet.
Perhaps just as important as the number of organizations is the number of individuals working with WebRTC. This is what I made the data_contributors dataset for. This data shows a total of 3977 unique contributors – users who actually pushed some WebRTC code on github.
Like with the repos, this data shows some seasonal lumpiness but the 6-month average shows a steady, albeit slowing, upward trend.
I also wanted to see who was the most active individual on GitHub. To help get some insight on this I included an aggregate number of “days” where code was pushed by individual in the daysPushed field.
Here are the top 10 pushers:
user | company | Total |
muaz-khan | webrtc-experiment.com | 262 |
DamonOehlman | NICTA | 234 |
hwine | Mozilla | 222 |
svn2github | (this is a bot) | 215 |
fippo | &yet | 206 |
thebehera | pristine.io | 196 |
ehsan | Mozilla | 188 |
lodoyun | lynckia | 172 |
EricDavies | Priologic | 149 |
longsleep | struktur AG | 141 |
Congrats to webrtcHacks friend and guest Muaz Khan of webrtc-experiment.com for being the most active WebRTC contributor according to my methodology!
Monthly Activity
After doing the daysPushed analysis, I realized this metric is not very intuitive to understand when you start to use it in aggregate for repo analysis. An easier to understand and compare metric is just how many active users you have in a given month. After I made this it was easy to modify the query for repos and owners too:
Repos
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
/*Find how many repos have pushes in a given month */ SELECT PushDate, 'repos' as Type, COUNT(DISTINCT repo) as count FROM ( SELECT SUBSTR(repository_url, 20) AS repo, IF( MONTH(repository_pushed_at) >= 10, CONCAT( STRING(YEAR(repository_pushed_at)), '-', STRING(MONTH(repository_pushed_at)) ), CONCAT( STRING(YEAR(repository_pushed_at)), '-0', STRING(MONTH(repository_pushed_at)) ) ) AS PushDate, FROM githubWebrtc.webrtcPushes GROUP BY repo, PushDate) GROUP BY PushDate ORDER BY PushDate ASC |
Owners
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
/*Find how many owners have pushes each month */ SELECT PushDate, 'owners' as type, COUNT(DISTINCT owner) as count FROM ( SELECT REGEXP_EXTRACT(repository_url, 'com/(.*)/{1}') AS owner, IF( MONTH(repository_pushed_at) >= 10, CONCAT( STRING(YEAR(repository_pushed_at)), '-', STRING(MONTH(repository_pushed_at)) ), CONCAT( STRING(YEAR(repository_pushed_at)), '-0', STRING(MONTH(repository_pushed_at)) ) ) AS PushDate, FROM githubWebrtc.webrtcPushes GROUP BY owner, PushDate) GROUP BY PushDate ORDER BY PushDate ASC |
Contributors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
/*Find how many users make pushes each day */ SELECT PushDate, 'contributors' AS type, COUNT(DISTINCT user) as count FROM ( SELECT actor_attributes_login AS user, IF( MONTH(repository_pushed_at) >= 10, CONCAT( STRING(YEAR(repository_pushed_at)), '-', STRING(MONTH(repository_pushed_at)) ), CONCAT( STRING(YEAR(repository_pushed_at)), '-0', STRING(MONTH(repository_pushed_at)) ) ) AS PushDate, FROM githubWebrtc.webrtcPushes GROUP BY user, PushDate) GROUP BY PushDate ORDER BY PushDate ASC |
With a little formatting magic in Excel, I got this:
The 6-mo moving average shows a very linear trend in an upward direction, boding well for real WebRTC adoption and repeated use.
Comparing to WebGL
I was also interested in seeing how WebRTC usage compared to other HTML5 technologies. The closest analog to WebRTC I could think of is WebGL – WebGL is relatively new, but rapidly growing HTML5 technology and like WebRTC it replaces older Flash-based mechanisms.
To get this data, I modified my webrtcPushData query to grab commits or repos that contain “webgl”.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
/* Pull interesting fields from pushes that have WebRTC related terms in the commit or repo description*/ SELECT repository_name, repository_url, repository_created_at AS created, repository_organization, repository_language, repository_description, repository_has_downloads, repository_forks, repository_fork, repository_pushed_at, repository_homepage, repository_watchers, payload_commit_msg, actor_attributes_login, actor_attributes_name, actor_attributes_company FROM [githubarchive:github.timeline] WHERE type="PushEvent" AND ( /*Check the repo name */ LOWER(repository_name) CONTAINS "webgl" OR /*See if any keywords in the commit message */ LOWER(payload_commit_msg) CONTAINS "webgl" OR /*And check the repo description */ LOWER(repository_description) CONTAINS "webgl" ) AND YEAR(repository_pushed_at) < 2015 /*2015 data has different schema*/ |
I saved this into githubWebrtc.webGLPushes. From there I reran all the above queries, substituting webGLPushes in the FROM field for webrtcPushData and saved this into a separate spreadsheet here.
How does this data compare to similar HTML5 technologies like WebGL? Are these trends unique to WebRTC? To figure this out I took my WebGL timeline data and manually merged that with the WebRTC data_repos dataset in Excel to create several combined datasets. Using this I created some new pivot tables and charts:
Except for a surge of WebGL activity in the first half of 2012, and some random spikes in a given month on both sides, the patterns look the same. Both show some seasonal effects and the gradual decline in group activity, so WebRTC is not so special, at least compared to WebGL. Of note, WebRTC has roughly the same new “group” activity, but diverges more considerably on “solo” activity as I describe above.
Also note this is the same graph as shown in the initial summary section except with the Group owners data too. I actually ran a correlation analysis between these 2 datasets and got a r of .91 for the repos and .90 for contributors, meaning these graphs show very similar trends albeit with different magnitudes (as you can see).
What’s Next
If the methodology proves to be sound, I would like to automate this and have it post the results to a dedicated webrtcHacks page.
Beyond that, there is a bunch of analysis I would like to do with more time:
- Build the query to merge in the new 2015 data using the different scheme
- Compare WebRTC against other technologies (Tsahi suggested canvas, angular, SIP or Flash as better alternatives to WebGL)
- Compare WebRTC’s growth rates to GitHub’s overall growth rates – GitHub is growing, so it WebRTC doing better than the GitHub average growth?
- See if I can get any usable data for this analysis from prior to March 2012
- Filter any false positives
If you are adventurous I hope you will repeat and improve this analysis. Volunteers needed! I posted the queries above to the readme on this repo: https://github.com/webrtcHacks/gitHub-WebRTC-analysis. Looks like I just increased my repo counts by 1…
{“author”: “chad hart“}
Billy Chia says
Chad,
Thanks for the shout out! This is a terrific post. I’m impressed by the level of detail and the tutorial section at the end. Particularly interesting was when you checked active WebRTC repos and contributors by month and saw that active repos were growing, not just new repos. I think that says a lot about the health of the WebRTC ecosystem.
Kosso says
Great info . Thanks.
I’ve had my eye on WebRTC for many years now. Now many browser release versions are supporting it more, things are looking up.
Apple are dragging their heels though. Most likely down to them worrying more about their bottom line via apps, than the security of your device via the web.