Quick Introduction
gh-impact measures open source influence. It is based on the stars an open source project receives on GitHub: an account has a gh-impact score of n if it owns n projects with at least n stars each. Higher gh-impact scores correspond to accounts that have many well-used projects. See here for more information.
How is influence calculated?
One particular index of influence, known as the H-Index, combines the quantity and quality of an individual’s work into a single number that captures the idea of influence (Hirsch, 2005):
A scientist has index h if h of his/her Np papers have at least h citations each, and the other (Np − h) papers have no more than h citations each.
A more direct statement comes from the Wikipedia entry on the H-Index:
a scholar with an index of h has published h papers each of which has been cited in other papers at least h times.
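The definition above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the code used for gh-impact; the function name `h_index` and the sample citation counts are made up for the example.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    # Rank the items from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` items all have at least `rank` citations
        else:
            break  # counts only get smaller from here, so no larger h exists
    return h


# Hypothetical citation counts for five papers.
paper_citations = [10, 8, 5, 4, 3]
print(h_index(paper_citations))  # prints 4
```

Four papers have at least four citations each, but there are not five papers with at least five, so the index is 4.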
How is gh-impact novel?
Other scientists have used a similar function to estimate achievements. The most interesting we’ve come across is the Eddington number, proposed some time around 1940: “the number of days in your life on which you have cycled at least E miles.” (Jeffers & Swanson, 2005) In other words, it is the largest E such that you have cycled at least E miles on at least E days. The Eddington number follows the same rule as the H-Index, and therefore the same rule as gh-impact.
gh-impact is a novel extension of the fundamental principle shared by the H-Index and the Eddington number of bicycling. There has been a lot of activity within the academic community to advance the science of bibliometrics for the purpose of understanding science better. Comparatively little research has applied these findings to other domains, such as open source software.
We believe gh-impact is a unique line of inquiry into the influence of open source publishers and the projects they create.
What is the significance of gh-impact?
gh-impact is a novel data set with wide-ranging applications. The data provide industry indicators, rank companies and foundations, identify trends, and help individuals demonstrate the impact of their work.
If history is any indicator, the introduction of the h-index to academic audiences was a major event for performance metrics. For better and for worse, this single metric has become an important indicator of career success. There are many instances where it is useful to know how a body of work compares to another’s work output.
NB: use caution in performing comparisons without a solid understanding of gh-impact. We have identified a variety of nuances that must be taken into account, some of which are discussed in this report.
Why is gh-impact better than raw rankings?
gh-impact is a single metric that combines both quantity and quality. Raw rankings can be problematic for a variety of reasons, depending on the question being asked. For example, two accounts with the same total stars may have earned those stars differently. One account may have collected all of its stars from a single popular project, whereas another may have collected the same total across many projects that each have fewer stars. Raw rankings will not differentiate between these two accounts, whereas gh-impact will.
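To make the contrast concrete, here is a small Python sketch comparing two hypothetical accounts that have the same total number of stars. The star counts and the helper name `gh_impact` are assumptions made for illustration; they are not taken from the gh-impact data set.

```python
def gh_impact(stars_per_project):
    """Largest n such that n projects have at least n stars each."""
    ranked = sorted(stars_per_project, reverse=True)
    # With counts in descending order, `stars >= rank` holds for exactly the first n ranks.
    return sum(1 for rank, stars in enumerate(ranked, start=1) if stars >= rank)


# Two made-up accounts, each with 100 stars in total.
single_hit = [100]           # one project holding all 100 stars
many_projects = [10] * 10    # ten projects with 10 stars each

print(sum(single_hit), gh_impact(single_hit))        # prints 100 1
print(sum(many_projects), gh_impact(many_projects))  # prints 100 10
```

A ranking by total stars would place these two accounts side by side, while their gh-impact scores differ by a factor of ten.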
How do we know gh-impact is valid?
gh-impact correlates with many other metrics that are also indicators of success, including:
- total number of stars
- number of stars collected by the most popular project
- total number of followers
However, gh-impact is not merely a correlate of other indicators of success. We have identified several situations where gh-impact is a unique factor accounting for variance in important outcomes.
How is GitHub like Academia?
In Academia there are citations, which are direct references to another author’s work. On GitHub there are stars, watchers, followers, forks, and potentially other metrics that work like citations. I propose the following typical uses for those GitHub mechanisms:
- projects are starred by people who use those projects
- projects are starred by people who are likely to use those projects in the future
- projects tend to be watched by project owners and others who build the project
- projects tend to be forked by developers who are building part of a project
- user accounts on GitHub are followed for many reasons
GitHub stars are therefore the best measure of actual project use. On that basis, gh-impact currently substitutes project stars for citations in the H-Index computation.
What is the significance of knowing somebody’s gh-impact score?
gh-impact can provide a rough estimate of a GitHub account’s overall productivity and influence. There may be circumstances under which gh-impact is useful for making comparisons, but this is an area of ongoing research and it is generally not advised. You can read more about the project on this site.
What about users who star their own projects?
gh-impact will not be significantly impacted by users who star their own projects. Consider a user who stars all of their own projects. This user cannot attain a gh-impact score above 1 through this method alone, because a user can give any one project at most one star, so each of those projects ends up with exactly one star.
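The self-starring argument can be checked with a short Python sketch. It assumes stars are recorded as (user, project) pairs and that a repeated pair counts once; the account name, project names, and helper functions are hypothetical.

```python
def gh_impact(star_counts):
    """Largest n such that n projects have at least n stars each."""
    ranked = sorted(star_counts, reverse=True)
    return sum(1 for rank, stars in enumerate(ranked, start=1) if stars >= rank)


def star_counts(star_events):
    """Count stars per project from (user, project) pairs.

    Duplicate pairs collapse into one, since a user can star a project only once.
    """
    counts = {}
    for _user, project in set(star_events):
        counts[project] = counts.get(project, 0) + 1
    return list(counts.values())


owner = "alice"  # hypothetical account
own_projects = [f"alice/project-{i}" for i in range(50)]

# Alice stars every one of her 50 projects, and tries to star each one twice.
self_stars = [(owner, project) for project in own_projects] * 2

print(gh_impact(star_counts(self_stars)))  # prints 1
```

However many projects the account creates, each one holds a single star, so the score stays at 1.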
Is this robust against manipulation/distortion?
Under usage conditions that would not violate GitHub’s terms of service, gh-impact is probably robust. However, this is the Internet. On that basis it’s naive to presume that all users will adhere to GitHub’s terms.
The authors uncovered multiple accounts that appeared to have inflated gh-impact scores on the basis of some strange project usage behavior. In one situation we observed, a network of accounts appeared to be automatically generating and starring projects, presumably as part of research into git usage itself. These accounts have been removed from the results.
Why does this number not match my expectation?
If we consider GitHub’s data to be “perfect,” then in comparison gh-impact can be influenced by several sources of error.
An important influence is the data source used for this research (GHTorrent), which is itself an ambitious project that is still working to extend the time span it mirrors.
Another influence is that only project ownership is considered during the calculation. In practical terms, only one user account may be the owner of a project despite the possibility that several individuals may contribute to that project.
Organizations and Individuals
A project is owned by an account, and that account may belong to either an individual or to an organization. gh-impact will under-estimate influence in cases where individuals are most productive through their contributions to organization-owned projects.
Forks are not currently considered during gh-impact computation, so any authors of popular forks will not receive full credit for those contributions.
Are gh-impact scores comparable between two users or two industries?
Maybe, but be careful. H-Index, from which gh-impact is derived, has been found to vary systematically between academic fields. As a result, it is not possible to compare the productivity of academics in different fields on the basis of H-Index.
We suspect that cross-field comparisons are just as hazardous on GitHub as in academia. For example, some programming languages encourage the creation of many little packages, which may lead to slightly higher gh-impact scores for developers who use such languages. Whether gh-impact is stratified by industry or by another facet like language is a future research direction.
We can now say with confidence that Individual and Organization accounts behave differently, and are not directly comparable. As a result, these two account types are analyzed and presented differently throughout the gh-impact work.
Where do the data come from?
Researchers at GHTorrent provide MySQL database dumps on a regular basis (Gousios, 2013). The 2016-07-19 database dump (approximately 41.3 GB) was downloaded and the users (~13M rows), projects (~34M rows), and watchers (~49M rows) tables were extracted as CSV files. The watchers table is named for historical reasons, but it actually contains stars, not watchers.
How is gh-impact actually computed?
Data were imported into Postgres directly from the MySQL dumps. All subsequent computation was performed in Postgres using SQL, which provides powerful query planning tools. The basic idea for an H-Index SQL query (Linoff, 2013) was adapted for this project. A series of Postgres Views stored intermediate computations.
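The overall shape of the computation can be sketched in Python. This is not the SQL used in the project: it only mirrors the idea of counting stars per project from the watchers table and then grouping projects by owner. The tuple layouts `(project_id, owner_id)` and `(project_id, user_id)`, the function name, and the sample rows are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def gh_impact_by_owner(projects, watchers):
    """Compute a gh-impact score for every project owner.

    projects: iterable of (project_id, owner_id) rows   (assumed layout)
    watchers: iterable of (project_id, user_id) rows, one row per star
    """
    # Step 1: count stars per project.
    stars = Counter(project_id for project_id, _user_id in watchers)

    # Step 2: collect the star counts of each owner's projects.
    per_owner = defaultdict(list)
    for project_id, owner_id in projects:
        per_owner[owner_id].append(stars[project_id])  # 0 if never starred

    # Step 3: apply the H-Index rule to each owner's list of star counts.
    scores = {}
    for owner_id, counts in per_owner.items():
        counts.sort(reverse=True)
        scores[owner_id] = sum(
            1 for rank, n in enumerate(counts, start=1) if n >= rank
        )
    return scores


# Made-up rows standing in for the extracted CSV files.
projects = [(1, "a"), (2, "a"), (3, "a"), (4, "b")]
watchers = [(1, "u1"), (1, "u2"), (1, "u3"), (2, "u1"), (2, "u2"), (4, "u1")]

print(gh_impact_by_owner(projects, watchers))  # prints {'a': 2, 'b': 1}
```

Only ownership is used to attribute a project, which matches the limitation described elsewhere on this page: contributors who do not own a project receive no credit in this calculation.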
How is the web application constructed?
Flask-Diamond was used to model the databases and to export gh-impact computation results as JSON. This dynamic backend was adapted to work with GitHub Pages by batch-exporting all JSON results (~14 MB) to the filesystem for static operation. The search interface is then able to request a JSON file to fulfill a search request without the need for a persistent Python application process.
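The batch-export idea can be illustrated with a minimal Python sketch. It does not use Flask-Diamond and does not reproduce the site's actual file layout; the one-file-per-account naming scheme, the record fields, and the function name `export_static` are assumptions.

```python
import json
import tempfile
from pathlib import Path


def export_static(results, out_dir):
    """Write one JSON file per account so a static host can serve lookups.

    results: dict mapping an account login to a JSON-serializable record.
    The `<login>.json` naming scheme is an assumption for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for login, record in results.items():
        path = out / f"{login}.json"
        path.write_text(json.dumps(record, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written


# Made-up results; real records would come from the database export.
results = {"example-user": {"gh_impact": 3}, "example-org": {"gh_impact": 12}}

with tempfile.TemporaryDirectory() as site_dir:
    export_static(results, site_dir)
    # A client-side search only needs to fetch the matching file.
    found = json.loads(
        (Path(site_dir) / "example-user.json").read_text(encoding="utf-8")
    )
    print(found["gh_impact"])  # prints 3
```

Because every answer is precomputed, a search reduces to fetching a file by name, which is what allows the site to run without a server-side process.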
Will the authors help me with my project?
Academics: please get in touch. I am also looking for post-doctoral research opportunities starting Winter 2017.
Industry: Premium consulting services are available. DM @iandennismiller to set up an initial consultation.
Gousios, G. (2013). The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (pp. 233–236). Piscataway, NJ, USA: IEEE Press. Retrieved from http://dl.acm.org/citation.cfm?id=2487085.2487132
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46), 16569–16572.
Jeffers, D., & Swanson, J. (2005). How high is your E? Physics World, 18(10), 21. http://doi.org/10.1088/2058-7058/18/10/30
Linoff, G. (2013). Answer to SQL for computing … h-index. Retrieved from http://stackoverflow.com/a/18787390/1146681