Social Network Analysis of Windows C/C++ Contributors on GitHub
William La Cholter, Antonius Stalick, Matthew Elder, Tony Johnson, Kathleen Carley
Modern software development is often a collaborative and dispersed activity, particularly in open source, where source code repositories unite code and other project artifacts. Open source projects have many roles that can be mined from the repository, including owner, author and committer of code, follower, and outside contributor. By analyzing the collaboration patterns of software developers with different roles, we can understand the communities and social network of development within a given population. Underpinning these roles is identity. In this talk we present our work mining the GitHub community, along with the insights and challenges that arise from the complex manifestation of repository user identities.
GitHub is the most popular open source software website, with over 18 million public repositories and 40 million registered developers. It layers considerable community structure over software repositories, whose designs long predate modern social networks. Because different kinds of projects, target platforms, and development technologies can have radically different workflows, we focus on a long-established subpopulation of GitHub: Windows C/C++ software developers. We identify and investigate 1,835 repositories tagged with those terms in July 2019, extracting the committers, authors, and taggers from the commits within our repositories of interest. We identify the unique contributors across those roles, build the network of contributors according to role, and extract key nodes in this network.
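As a rough illustration of building a contributor network by role, the following Python sketch links two contributors whenever they appear in the same repository. The commit records, the contributor names, and the tie definition (co-occurrence in a repository) are all assumptions made for this example; the abstract does not specify how ties are defined.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical commit records: (repository, role, contributor id).
commits = [
    ("repoA", "author", "alice"),
    ("repoA", "committer", "bob"),
    ("repoA", "author", "carol"),
    ("repoB", "author", "alice"),
    ("repoB", "committer", "dave"),
    ("repoB", "tagger", "dave"),
]


def build_network(records, roles=None):
    """Link two contributors when they appear in the same repository.

    The edge weight counts the repositories a pair shares. If `roles`
    is given, only records with one of those roles are considered.
    This tie definition is an assumption for illustration only.
    """
    members = defaultdict(set)
    for repo, role, person in records:
        if roles is None or role in roles:
            members[repo].add(person)

    edges = defaultdict(int)
    for people in members.values():
        # Sort so each undirected pair has one canonical key.
        for a, b in combinations(sorted(people), 2):
            edges[(a, b)] += 1
    return dict(edges)


all_roles = build_network(commits)
authors_only = build_network(commits, roles={"author"})
```

Restricting the roles changes which ties exist: in this toy data the author-only network keeps a single tie, while the all-roles network has four.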
In our research, we identify a key challenge to social network analysis of this data: the unreliability of user identity, which includes ambiguity, duplication, and bogus information. For example, GitHub users commit using different names and/or e-mail addresses, specify only a partial name or e-mail address, or claim someone else’s identity. These issues affect the network in ways that depend on the rules used to aggregate multiple names and e-mail addresses into a likely single identity. Furthermore, users may make changes using common local or system account names, and the system itself complicates construction of the network by using system-wide identifiers for certain commits, creating network ties where none actually exist.
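To make the effect of aggregation rules concrete, here is a minimal sketch that groups commit signatures into identities with a union-find structure. It is not the authors' method: the two rules (merge on shared e-mail, optionally merge on shared name), the lists of generic names and addresses, and the sample signatures are assumptions chosen for illustration.

```python
# Assumed examples of generic account names and system-wide addresses
# that should never be used as evidence that two signatures match.
GENERIC_NAMES = {"root", "admin", "user", "unknown"}
GENERIC_EMAILS = {"noreply@github.com"}


def normalize(value):
    return value.strip().lower()


def resolve_identities(signatures, merge_on_name=True):
    """Return one identity label per (name, email) signature.

    Signatures sharing a non-generic e-mail are merged; if
    `merge_on_name` is true, so are signatures sharing a non-generic
    name. The label is the index of the group's earliest signature.
    """
    parent = list(range(len(signatures)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}
    for idx, (name, email) in enumerate(signatures):
        keys = []
        e = normalize(email)
        if e and e not in GENERIC_EMAILS:
            keys.append(("email", e))
        n = normalize(name)
        if merge_on_name and n and n not in GENERIC_NAMES:
            keys.append(("name", n))
        for key in keys:
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx
    return [find(i) for i in range(len(signatures))]


# Hypothetical signatures as they might appear in commit metadata.
signatures = [
    ("Jane Doe", "jane@example.com"),
    ("jdoe", "jane@example.com"),
    ("Jane Doe", "jd@work.example"),
    ("root", "root@localhost"),
    ("root", "build@server.example"),
    ("GitHub", "noreply@github.com"),
]

by_email_and_name = resolve_identities(signatures)
by_email_only = resolve_identities(signatures, merge_on_name=False)
```

With both rules the first three signatures collapse into one identity; with the e-mail rule alone the third stays separate. The two `root` signatures are never merged, because a shared generic account name is not treated as evidence of a shared person.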
We investigate multiple strategies for assigning user identity, including a novel approach based on a hybrid combination of user attributes and the network structure. We calculate agent-level characteristics and graph-level metrics such as density, betweenness centralization, and diameter to derive models to analyze the data. We explore an augmented naïve Bayesian network model to identify which attributes and metrics significantly contribute to determining which users are unique, and we build a model to classify ambiguous user identities for analysis. This analysis will be used not only for identification but also to inform network clusters and key nodal attributes. We examine measures of centrality (degree, closeness, eigenvector, and betweenness) and meta-network measures (e.g., task exclusivity and redundancy) to evaluate the impact of different identity strategies on the resulting networks. Lastly, we explore Social Cognitive Mapping to analyze and visualize the community structures of the GitHub users, and we use a Weighted Consensus Graph to derive the graph of points in the latent space and uncover the communities of collaboration.
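The sketch below shows how an identity strategy can change graph-level metrics. It computes only density, diameter, and degree centrality with the Python standard library; betweenness centralization and the other measures named above are omitted for brevity. The two toy networks and the contributor names are hypothetical.

```python
from collections import deque


def to_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def density(adj):
    """Fraction of possible undirected ties that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def eccentricity(adj, source):
    """Breadth-first search: (farthest distance, nodes reached)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values()), len(dist)


def diameter(adj):
    """Longest shortest path; raises if the graph is not connected."""
    best = 0
    for v in adj:
        ecc, reached = eccentricity(adj, v)
        if reached != len(adj):
            raise ValueError("graph is not connected")
        best = max(best, ecc)
    return best


# Strategy 1 (hypothetical): "jane" and "jdoe" are kept as two people.
split_adj = to_adjacency([("jane", "bob"), ("jdoe", "carol"), ("bob", "carol")])
# Strategy 2 (hypothetical): "jdoe" is resolved to "jane".
merged_adj = to_adjacency([("jane", "bob"), ("jane", "carol"), ("bob", "carol")])

split_metrics = (density(split_adj), diameter(split_adj))
merged_metrics = (density(merged_adj), diameter(merged_adj))
```

In this toy case, merging the two signatures turns a four-node path into a triangle, doubling the density and shrinking the diameter from 3 to 1, which is the kind of sensitivity to identity rules that the comparison of strategies is meant to expose.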