Diversity in software engineering research Meiyappan Nagappan Thomas Zimmermann Christian Bird Presented by Guilherme de Assis 1 Introduction - Researchers have worked towards maximizing the impact of software engineering research on practice. - They tried to provide techniques and results that are general as posible. - “General conclusions from empirical studies in software engineering are difficult because any process depends on a potentially large number of relevant context variables”, Basili et al. 2 1 Introduction - Selecting projects for a study is an important step of the study. - More is not necessary better. - Your sample must be representative of population. - The article aims to: - assess the quality of a sample - identify projects that could be added to further improve the quality of the sample. 3 1 Introduction - Diversity: a diverse sample contains members of every subgroup in the population and within the sample the subgroups have roughly equal size. - Representativeness: in a representative sample the size of each subgroup in the sample is proportional to the size of that subgroup in the population. 4 1 Introduction - Sample Covarage: defined as the percentage of projects in a population that are similar to a given sample. - Sample coverage allows us to assess the quality of a given sample; the higher the coverage, the better - Further, it allows prioritizing projects that could be added to further improve the quality of a given sample 5 2.1 Universe, Space, and Configuration - Universe: a large set of projects, that can vary for diferent research areas. Example: all open-source projects. - Dimensions: characterizes each projects. Example: total lines of code, number of developers. - Space: the set of dimensions that are relevant for the generality of a research. Example: total lines of code, number of developers, main programming language. 6 2.1 Universe, Space, and Configuration - For each dimension d, a similarity function that decides whether two projects p1 and p2 are similar with respect to that dimension was defined: - The list of the similarity functions for a given space is called the configuration. - Similarity functions can vary across research studies. Ex: C and C++. 7 2.1 Universe, Space, and Configuration - To identify similar projects within the universe, we require the projects to be similar to each other in all dimensions. - If no similarity function is definied: - For numeric dimensions(ex.: number of developers): - For categorical dimensions(ex.: main programming language): - two projects are similar in a dimension if the values are identical. 8 2.2 Example: Coverage and Project Selection - Total: 50 projects. - In total seven other projects are similar to A. Thus project A covers (7+1)/50=16% of the universe. - For individual dimensions: project A covers 13/50=26% for number of developers and 11/50=22% for lines of code. 9 2.2 Example: Coverage and Project Selection - Show how a second project increases the coverage. - Adding Project B: universe coverage increase to 18/50=36%. Covarege of the developer to 60% and lines of code to 56%. - Adding Project C: The universe covarage increases only to 18%. 10 2.2 Example: Coverage and Project Selection - To provide a good coverage of the universe, one should select projects that are diverse rather than similar to each other. 11 2.2 Example: Coverage and Project Selection 12 2.2 Example: Coverage and Project Selection 13 2.3 Computing Coverage - Sample coverage of a set of projects P for a given universe U, an n-dimensional space D, and a configuration ( 1, … , ) is computed as follows: - Research topics can have different parameters for universe, space, and configuration. - It is important to not just report the coverage but also the context 14 in which it was computed. 3.1 The Ohloh Universe - Universe: active projects that are monitored by the Ohloh platform. - Steps: - 1. Extract the identifiers of active projects using Project API of Ohloh to measure coverage for ongoing development. 2. Three diferent categories of data (each with one call to the API): Analysis: main programming language, source code size and contributors. Activity category: how much source code developers have changed each month Factoid category: basic observations about projects such as team size, project age, comment ratio, and license conflicts. 15 3.1 The Ohloh Universe - 3. XML from Ohloh to text files. Remove projects that don’t have main language or invalid data (negative number of lines of code). - The universe consists of a total of 20.028 projects. 16 3.2 The Ohloh Space The dimensions used for the space are: - Main language. The most common programming language in the project. Total lines of code. Blank lines and comment lines are excluded by Ohloh when counting lines of code. Number of contributors (12 months). Contributors with at least one commit in the last 12 months. Number of churn (12 months). Number of added and deleted lines of code, excluding comment lines and blank lines, in the last 12 months. Number of commits (12 months). Commits made in the last 12 months. 17 3.2 The Ohloh Space - Project age: - < 1 year: Young - between 1 and 3: Normal - between 3 and 5: Old - > 5: Very Old - Project activity. if during the last 12 calendar months, there were at least 25% fewer commits than in the prior 12 months, the activity is Decreasing; if there were 25% more commits, the activity is Increasing; otherwise the activity is Stable. 18 3.2 The Ohloh Space 19 3.3 Covering the Ohloh Universe 20 3.3 Covering the Ohloh Universe 21 3.4 Covering the Ohloh Universe with the ICSE and FSE Conferences - What are the most frequently used projects in the ICSE and FSE conferences? - Eclipse Java Development Tools (JDT) - 16 papers Apache HTTP Server - 12 papers gzip, jEdit, Apache Xalan C++, Apache Lucene - 8 papers Mozilla Firefox - 7 papers - How much of the Ohloh universe do the ICSE and FSE conferences cover? - The 207 Ohloh projects analyzed in the two years of the ICSE and FSE conferences covered 9.15% of the Ohloh population. There isn’t low, because values in all dimensions have to be similar for a project to be similar to another. 22 3.4 Covering the Ohloh Universe with the ICSE and FSE Conferences - What are showcases of research with high coverage? 23 4 Discussion - Coverage scores do not increase or decrease the importance of research, but rather enhance our ability to reason about it. - Results that differ can still have value, especially in a space that is highly covered. - Always report the universe, space, and configuration with any coverage score. 24 5 Conclusion - More isn’t necessary better. - Technique to assess how well a research study covers a population of software projects. - This helps researchers to make informed decisions about which projects to select for a study. - Extends to researchers analyzing closed source projects. - Describe the universe and space of their projects without revealing confidential information about the projects or their metrics and place their results in context. 25
© Copyright 2025