Generating the Blueprints of the Java Ecosystem

Vassilios Karakoidas, Dimitris Mitropoulos, Panos Louridas*, Georgios Gousios, Diomidis Spinellis
Athens University of Economics and Business
Department of Management Science and Technology
*louridas@aueb.gr
This work presents the dataset obtained by statically analysing 11,365 projects from the Maven Central Repository with three static analysis tools: the Cross-Language Metric Tool (CLMT),
the Chidamber and Kemerer Java Metrics Tool (CKJM), and JDepend. These tools cover four aspects of a software project: class design, method design, package design, and program size.
11,365 projects
22,730 Jars
74,565,772 LoC
65 Metrics
32,844,836 Measurements
Dataset Construction Process
[Diagram labels: Maven Repository; 446,749 Artifacts; Filtered Java projects with source and binary jars; Valid Project Collection; Github Repository; Workers #1, #2, #3]
Detecting Domain-specific Language Usage in Open Source Projects
DSLs: XML, SQL, RTF, HTML, Regular Expressions, XPath, XSLT
Detection was straightforward: the source code was statically analysed and uses of specific packages were detected, e.g.
java.util.regex for regular expressions, and java.sql and javax.sql for SQL.
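The package-based detection described above can be illustrated with a minimal sketch. It assumes detection works by scanning import statements, which the poster does not confirm; only the java.util.regex, java.sql and javax.sql packages come from the text, and the names DSL_PACKAGES and detect_dsls are hypothetical.

```python
import re

# Only these three packages are named on the poster. The packages used to
# detect the other DSLs (XML, XSLT, XPath, HTML, RTF) are not listed there,
# so they are deliberately left out of this table.
DSL_PACKAGES = {
    "Regex": ("java.util.regex",),
    "SQL": ("java.sql", "javax.sql"),
}

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.d;",
# capturing the dotted name without a trailing ".*".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)


def detect_dsls(java_source):
    """Return the set of DSL names whose packages are imported by the source."""
    found = set()
    for imported in IMPORT_RE.findall(java_source):
        for dsl, packages in DSL_PACKAGES.items():
            # Match the package itself or anything nested below it.
            if any(imported == p or imported.startswith(p + ".") for p in packages):
                found.add(dsl)
    return found
```

A scan of imports alone would miss fully qualified uses such as java.sql.Connection written inline, so a real analysis would likely need to inspect the parsed source or the bytecode as well.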
Workers (#1 to #N): download the jars, then execute clmt, ckjm, and jdepend.
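The worker step can be sketched as a small pool in which each worker downloads one project's jars and runs the three tools on them. This is only an illustration: the poster does not say how the workers are implemented, and the download and run_tool callables, along with process_project and run_workers, are hypothetical placeholders rather than real tool invocations.

```python
from concurrent.futures import ThreadPoolExecutor

# Tool names as they appear on the poster.
TOOLS = ("clmt", "ckjm", "jdepend")


def process_project(project, download, run_tool):
    """Download the jars of one project, then run every tool on them."""
    jars = download(project)  # hypothetical: returns the project's jar paths
    return {tool: run_tool(tool, jars) for tool in TOOLS}


def run_workers(projects, download, run_tool, workers=3):
    """Spread the projects over a fixed number of workers."""
    projects = list(projects)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input projects.
        results = pool.map(
            lambda p: process_project(p, download, run_tool), projects
        )
        return dict(zip(projects, results))
```

Passing the download and tool-running steps in as callables keeps the sketch free of assumptions about command lines or repository URLs.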
Analyse exported data (CLMT, CKJM, JDepend)
Measurements are analysed and stored in the MySQL database.
Metric Categories
Class: 17 metrics
Program Size: 17 metrics
Method: 3 metrics
Package: 6 metrics
Note: These are the unique metrics per category, since the three tools have several in common.
How many DSLs are used per project?
[Figure: number of projects by number of DSLs used]
#1 XML with 3094 uses
Regex, 1751
SQL, 1035
XSLT, 888
XPath, 190
HTML, 68
RTF, 7
Facts
~35% of the projects use at least one DSL
547 projects use four DSLs
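Figures of this kind (projects per number of DSLs, the share of projects using at least one DSL, and counts of DSL combinations) could be tallied from a per-project mapping of detected DSLs, as in the sketch below. The function names and the input shape are assumptions, and so is the choice to count each project's exact DSL set as its combination; the poster does not state how its combination counts were derived.

```python
from collections import Counter


def dsl_count_histogram(project_dsls):
    """Map 'number of DSLs used' to 'number of projects'."""
    return Counter(len(dsls) for dsls in project_dsls.values())


def share_with_any_dsl(project_dsls):
    """Fraction of projects that use at least one DSL."""
    if not project_dsls:
        return 0.0
    return sum(1 for dsls in project_dsls.values() if dsls) / len(project_dsls)


def combination_counts(project_dsls):
    """Count projects per exact combination of two or more DSLs."""
    combos = Counter()
    for dsls in project_dsls.values():
        if len(dsls) >= 2:
            # Sort so that {"XSLT", "XML"} and {"XML", "XSLT"} share one key.
            combos[", ".join(sorted(dsls))] += 1
    return combos
```

Here project_dsls maps a project name to the set of DSL names detected in it, for example {"demo": {"XML", "XSLT"}}.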
8 projects use 7 DSLs!
Popular DSL Combinations
XML, XSLT (475)
Regex, XML (303)
SQL, XML (162)
Regex, XML, XSLT (158)
Regex, SQL (116)
Regex, SQL, XML, XSLT (80)
Regex, SQL, XML (71)
SQL, XML, XSLT (54)
XML, XPath (50)
...
Database Schema
project
  prj_pk: int(11)
  prj_name: varchar(500)
category
  cat_pk: int(11)
  cat_name: varchar(500)
measurement_type
  mt_pk: int(11)
  mt_name: varchar(500)
identifiers
  ident_pk: int(11)
  ident_name: varchar(500)
measurement
  meas_pk: int(11)
  meas_value: varchar(500)
  meas_id: int(11)
  meas_filename: int(11)
  cat_pk: int(11)
  prj_pk: int(11)
  mt_pk: int(11)
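To show how the database schema listed on this poster might be queried, the sketch below recreates its tables in an in-memory SQLite database as a stand-in for MySQL. Table and column names are taken from the poster; the primary and foreign keys are assumptions inferred from the *_pk naming, the demo rows are invented, and build_demo_db and average_metric are hypothetical helpers.

```python
import sqlite3

# Column names follow the poster. Keys are assumed from the *_pk naming;
# the roles of meas_id and meas_filename are not stated, so they are
# left as plain integer columns.
SCHEMA = """
CREATE TABLE project (prj_pk INTEGER PRIMARY KEY, prj_name VARCHAR(500));
CREATE TABLE category (cat_pk INTEGER PRIMARY KEY, cat_name VARCHAR(500));
CREATE TABLE measurement_type (mt_pk INTEGER PRIMARY KEY, mt_name VARCHAR(500));
CREATE TABLE identifiers (ident_pk INTEGER PRIMARY KEY, ident_name VARCHAR(500));
CREATE TABLE measurement (
    meas_pk INTEGER PRIMARY KEY,
    meas_value VARCHAR(500),
    meas_id INT,
    meas_filename INT,
    cat_pk INT REFERENCES category(cat_pk),
    prj_pk INT REFERENCES project(prj_pk),
    mt_pk INT REFERENCES measurement_type(mt_pk)
);
"""


def build_demo_db():
    """Create the tables and insert a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO project VALUES (1, 'demo-project')")
    conn.execute("INSERT INTO category VALUES (1, 'Class')")
    conn.execute("INSERT INTO measurement_type VALUES (1, 'WMC')")
    conn.executemany(
        "INSERT INTO measurement (meas_value, cat_pk, prj_pk, mt_pk) "
        "VALUES (?, 1, 1, 1)",
        [("4",), ("10",)],
    )
    return conn


def average_metric(conn, project_name, metric_name):
    """Average one metric over all measurements of one project."""
    row = conn.execute(
        """
        SELECT AVG(CAST(m.meas_value AS REAL))
        FROM measurement m
        JOIN project p ON p.prj_pk = m.prj_pk
        JOIN measurement_type t ON t.mt_pk = m.mt_pk
        WHERE p.prj_name = ? AND t.mt_name = ?
        """,
        (project_name, metric_name),
    ).fetchone()
    return row[0]
```

Because meas_value is stored as varchar(500), the query casts it to a number before averaging; the same cast would be needed in MySQL with its own syntax.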
Research Opportunities
Researchers can use the dataset to test their models and theories against a large set of empirical data, e.g. to fine-tune
software quality models that are based on metrics.
Practitioners can test their tools and validate their calculations against CKJM and JDepend.
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program
“Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thalis - Athens University of Economics
and Business - Software Engineering Research Platform.
Contact Information
Vassilios Karakoidas
bkarak@aueb.gr