A Time Series Interaction Analysis Method for Building
Predictive Models of Learners using Log Data
Christopher Brooks
School of Information
University of Michigan
Ann Arbor, MI, USA
brooksch@umich.edu
Craig Thompson
Dept. of Computer Science
University of Saskatchewan
Saskatoon, SK, Canada
craig.thompson@usask.ca
Stephanie Teasley
School of Information
University of Michigan
Ann Arbor, MI, USA
ABSTRACT
As courses become bigger, move online, and are deployed to
the general public at low cost (e.g. through Massive Open
Online Courses, MOOCs), new methods of predicting student achievement are needed to support the learning process.
This paper presents a novel method for converting educational log data into features suitable for building predictive
models of student success. Unlike cognitive modelling or
content analysis approaches, these models are built from interactions between learners and resources, an approach that
requires no input from instructional or domain experts and
can be applied across courses or learning environments.
Categories and Subject Descriptors
K.3 [Computing Milieux]: Computers and Education;
I.2.1 [Computing Methodologies]: Artificial Intelligence—
Applications and Expert Systems
1. INTRODUCTION
Predictive models in education generally require intimate
knowledge of the domain being taught, the learning objectives, and the pedagogical circumstances under which the
instruction takes place. While there is work that focuses on
removing some of these constraints and focusing instead on
specific tools or pedagogies (e.g. analysis of discussion forum
communication), this limits techniques to only those courses
which use particular technologies or pedagogical approaches.
In this paper we present a more general method of building predictive models for educational data based on student
interactions with the learning environment. Unlike existing
work in the area (e.g. [3], [14]), we aim to build models
solely from coarse grained observations of interactions over
time between a student and course resources. Our goal is to
not only build an accurate predictive model for a particular
course, but to do so in a fashion that scales across many different courses and learning environments. We aim to enable
“one click modeling” of a large variety of educational data
systems without the need to burden instructors, pedagogical
experts, or learning technologists. These models can then be
used by these individuals to gain insight into activities that
have happened in a course, build early-warning systems for
student success, or characterize how courses relate to one
another. Of course, we do not claim that experts should
be removed entirely from the modelling process; rather, we
aim to augment their activities by making it easier to create
data-driven models.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
LAK ’15 March 16 - 20, 2015, Poughkeepsie, NY, USA
ACM 978-1-4503-3417-4/15/03 ...$15.00.
http://dx.doi.org/10.1145/2723576.2723581.
steasley@umich.edu
A strong motivation for this approach comes from the
growing list of educational software systems that collect “clickstream” data about learners. For instance, the BlackBoard
and Sakai learning management systems both collect data
on the interactions learners have with various tools and
content, the Opencast lecture capture system collects fine-grained data on access to lecture video and configuration of
the playback environment, and the Coursera massive open
online course platform collects web logs of how users have
navigated through the course website. All of these systems
do this educational data logging in addition to maintaining
traditional operations data based on the features available
to learners.
To narrow the focus of this paper, we specifically apply a
technique we refer to as time series interaction analysis to
predict student achievement in summative evaluations from
Massive Open Online Courses (MOOCs). A number of approaches have been used to predict student achievement and
the general consensus is that previous student evaluations
(either summative or formative) are the best predictor of
future success in higher education. For instance, using data
from a four year traditional public school, Jayaprakash et
al. [14] provide a description of a logistic regression model
in which partial grades for a course followed by cumulative
grade point average are the strongest predictors of a student’s final grade. Barber and Sharkey [3] provide a similar analysis of data collected from a four year private online
university, demonstrating that while the importance of prior
academic achievement decreases as formative assessment is
collected, both measures are more important than the data
collected about student behaviors in the learning environments used. With the increased availability and quality of
institution-wide student demographic and assessment data
through university data warehouses, it is not surprising that
summary measures of student activity in online courseware
have been most explored when building predictive models.
Massive open online courses are different from courses offered in traditional higher education settings in many ways.
In this work we focus on one particular difference between
MOOCs and traditional course delivery that impacts the
ability of institutions to build predictive models of student
achievement – namely, institutions that offer MOOCs tend
to have very little prior information about learners and the
learners’ previous academic achievements. For instance, students engaging in Coursera offerings at the University of
Michigan are not obligated to report demographic information, residency information, previous history with the content they are studying, or their goals for enrolling in the
MOOC. In these cases, interaction behavior with the learning platform is the only source of data that is available from
which to form a predictive model until course examinations
have been completed.
This paper proceeds as follows: In section 2 we provide
a discussion of our approach to modeling educational log
data, including details of a method of generating features
suitable for data mining based on the popular n-grams technique used in the field of text analysis. We detail the kind
of data available from the Coursera MOOC platform, and
the datasets which we have used to validate our approach.
In section 3 we describe how our approach can be used to
support three different course modeling activities: (1) understanding a single cohort of learners, (2) generalizing a
model across different sessions of a course, and (3) demonstrating the effectiveness of the model for predicting success
over the course. In this last question we consider specifically
how time series interaction analysis models change in form
and accuracy throughout a course, an important consideration when building automated early warning systems. We
conclude with a discussion about the generalizability of the
approach and avenues for further exploration.
2. APPROACH
In the field of Technology Enhanced Learning (TEL), much
attention has been paid to understanding how people learn
from a cognitive perspective. For instance, Anderson’s ACT-R theory of skill knowledge [2], which is used as a basis for
many intelligent tutoring systems (see [7]), suggests that
cognitive skills can be described as production rules: small
operations of data manipulation organized around atomic
goals. Firing of correct rules is done repeatedly with the
facts available to a learner, leading them to demonstrate a
particular higher level cognitive skill. Inability to fire correct
rules in such a way that a skill is demonstrated indicates
a lack of having the correct rules, and suggests a need for
educational intervention (learning) or that the rule matching
mechanism needs improvement.
Ohlsson’s theory of learning based on performance errors
provides an alternative to the ACT-R theory, where he argues that it is through making mistakes and correcting them
that we demonstrate learning [17]. Providing a correct answer does not signify the learner understands; instead, the
learner may just not yet have made a mistake and may have
inadvertently answered correctly. It is the times the learner
demonstrates mistakes that indicate learning is happening.
This approach is core to the constraint-based modeling family of intelligent tutoring systems such as [16].
Learner interactions with content and problems are not
the only focus of learning theories, as learning through communication with other individuals has been explored broadly
under the theory of social constructivism [11]. While the majority of work related to TEL in this area has been on peer-to-peer learning through chat or discussion forums, some
have also applied intelligent systems in the form of peer
matching [6] or tutors based on dialogue systems [12].
In this work we aim to enable the modeling of learners
based on data gathered from learning systems that log the
interactions learners have with resources. This is a data-driven approach to modelling learners versus a theoretical
approach, and it is meant to be complementary to the approaches described above. This approach has particular
benefits for scaling the creation of learner models, as no
interaction is required from human experts (instructors, instructional designers, or tutors) in order to generate the
models. The end results may then either be used in an
automated fashion as part of an early-warning system, or
may be used by pedagogical experts as a reflection on how
learner–environment interactions relate to student success.
2.1 General Model for Educational Log Data
We view the learning system as being made up of five
pieces: students, resources, interactions, events, and outcomes. The first of these, students, is a set of individuals
who interact with the learning environment. These individuals have characteristics that are known when they first begin interacting with the environment and, for simplification
of modelling purposes, these characteristics do not change.
For example, demographic variables (e.g. age, gender, ethnicity) as well as prior knowledge (e.g. previous grades or
other measures of evaluation) can be associated with an individual, and may be a direct influence on their outcomes.
In the results described in the next section we omit student
characteristics from our modeling, but we note here that
they may be useful (and readily accessible) when creating
predictive models.
Students interact with a learning system through resources.
These resources may be web content, discussion forums, lecture video, or even intelligent tutoring systems. Resources
may be described through different levels of generalization.
For instance, the coarse grain “lecture” resource may be
made up of individual “lectures”, each of which may be made
up of “segments”. An important distinction between this
view of resources and others is that we intentionally conflate
pedagogy, technology, and content into a single item, and do
not attempt to disambiguate resources by defining them to
be about concepts, methods, or delivery mechanisms.
An interaction denotes a singular circumstance in which
a student uses a resource, and represents a temporal relationship between the student and resource. For instance, an
interaction may be viewing a lecture, submitting a quiz, or
reading a discussion forum post. It is expected that individual interactions will be processed through aggregation,
summation, scaling, or other mathematical functions in order to describe different levels of granularity that may be
useful in the modelling process. This processing is to be
applied in an automated manner, and not require a priori
hypotheses based on the content, concepts, or individuals
involved.
Each interaction exists between two events. Events are
demarcations of the beginning and end of time-frames of interest. Conceptually, events can be hierarchically arranged,
and a given set of data might have a start and end time
which encompass other events such as assignment deadlines,
examinations, or course beginning and endings. In the investigation section to follow we will focus only on a single
set of events that note the beginning and end of a course,
but one can readily imagine how it may be useful to predict
outcomes for other pairs of events (e.g. the beginning of
the course and the first major exam, or the beginning of the
course and the first assignment deadline).
Educational outcomes can be measured in various ways including through taxonomies of skill acquisition (e.g. through
Bloom’s taxonomy [4] or the like), grades (which may be
content-based or a comparison between students in a cohort), or student satisfaction (which may be measured
through self-reports or through proxy variables such as retention in a program). In our characterization of educational
data modelling we make no attempt to link specific interactions to outcomes in a theoretical manner. Instead, we argue
that consistent and repeatable correlations found through
the data mining process will either support or not support
linkages between interaction patterns and educational theory. Thus, evidence for learning theory is an output of the
modelling process which can be reflected upon by practitioners, but theory is not necessarily an input to the process.
The only constraint we put on the educational outcome is
that it be well-defined and measurable so that it can be used
as the target (predicted) variable in the data mining process.
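The five pieces of this model map naturally onto simple record types. The following Python sketch is illustrative only: the class, field, and method names (Student, Resource, Event, Interaction, Outcome, path, between) are assumptions introduced for this example and are not taken from the tools used in this work.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Student:
    student_id: str
    # Characteristics are treated as fixed once the student enters the environment.
    characteristics: dict[str, str] = field(default_factory=dict)


@dataclass
class Resource:
    name: str
    parent: Optional["Resource"] = None  # e.g. segment -> lecture -> lectures

    def path(self) -> list[str]:
        """Return the names from the coarsest resource down to this one."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


@dataclass
class Event:
    name: str
    at: datetime


@dataclass
class Interaction:
    student: Student
    resource: Resource
    at: datetime

    def between(self, start: Event, end: Event) -> bool:
        """True if the interaction falls inside the time-frame [start, end)."""
        return start.at <= self.at < end.at


@dataclass
class Outcome:
    student: Student
    measure: str  # e.g. "final grade"
    value: float


lectures = Resource("lectures")
segment = Resource("segment-2", parent=Resource("lecture-3", parent=lectures))
print(segment.path())
```

Keeping the resource hierarchy as a parent link lets the same interaction be aggregated at any level of generalization without committing to one granularity in advance.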
2.2 Creating Time Series Features from Log Data
In data mining classification tasks, a feature is a key/value
pair associated with an instance in the dataset which describes it in some fashion. Features may be nominal, ordinal, or real values, and may be discrete or continuous.
In our approach to modeling interactions of learners within
MOOC platforms we generate a base set of binary features
(true or false) based on the timeframe in which a resource
was accessed by the learner.
2.2.1 Timeframes
We represent timeframes as relative offsets from the start
of the course. This allows for comparison across models
where courses are treated as being similar to one another
as one might do with consecutive offerings of a course. We
chose 4 different granularities of timeframes: accesses within
a calendar day, a three calendar day period, a calendar week,
and a calendar month. Thus, in a course that is offered over
60 days there will be sixty one-day features, twenty three-day features, roughly nine week features (depending when
the course started), and up to three month features. In addition to these binary features, we generate seven summative
features which hold counts of the number of calendar days
of the week a learner has accessed a given resource.
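As a rough illustration of this transformation, the Python sketch below turns one student's access timestamps for one resource into timeframe features. Several details are assumptions made for the example rather than a description of the released tools: three-day windows are counted from the first day of the course, weeks are calendar weeks starting on Sunday, months are calendar months, a missing key is read as false, and the day-of-week features count distinct calendar dates. The function names are hypothetical.

```python
from collections import Counter
from datetime import date, datetime


def sunday_based_weekday(day: date) -> int:
    """Day of the week with Sunday coded as 0."""
    return (day.weekday() + 1) % 7


def timeframe_features(course_start: date, accesses: list[datetime], resource: str) -> dict[str, int]:
    """Build sparse timeframe features for one student and one resource.

    Binary features are stored only when true (value 1); absent keys mean false.
    """
    features: dict[str, int] = {}
    active_days = {ts.date() for ts in accesses if ts.date() >= course_start}
    start_dow = sunday_based_weekday(course_start)
    for day in active_days:
        offset = (day - course_start).days
        week = (offset + start_dow) // 7  # assumption: calendar weeks start on Sunday
        month = (day.year - course_start.year) * 12 + (day.month - course_start.month)
        features[f"{offset}_1D_{resource}"] = 1
        features[f"{offset // 3}_3D_{resource}"] = 1  # assumption: windows counted from day 0
        features[f"{week}_1W_{resource}"] = 1
        features[f"{month}_1M_{resource}"] = 1
    # Summative features: number of distinct active dates falling on each weekday.
    dow_counts = Counter(sunday_based_weekday(day) for day in active_days)
    for dow in range(7):
        features[f"{dow}_DOTW_{resource}"] = dow_counts.get(dow, 0)
    return features


views = [datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 12, 8, 0), datetime(2014, 2, 3, 10, 0)]
for name, value in sorted(timeframe_features(date(2014, 1, 6), views, "LECTUREVIEW").items()):
    print(name, value)
```

Because every index is an offset from the course start, features produced for consecutive offerings of a course line up column by column even though the calendar dates differ.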
2.2.2 Resources
Students have a variety of resources available for learning
in MOOCs and, with the introduction of third party tools
through Learning Tool Interoperability (LTI) standards, the
list of these resources can be very broad. Further, one can
conceptualize resources as being hierarchically arranged – a
particular web page of content might belong to a collection
of pages which belong to a section in a course, or a particular
question on a quiz might be contained within a section of
a particular exam.
We decomposed the Coursera clickstream datafile into a
relational database.1 These tools parse access URLs and distinguish resources by the paths and parameters that have
been used to access them. As we were interested in looking at a longitudinal dataset collected over several years, we
restricted our investigation of interactions to three coarse
grained resources: lecture videos, discussion forum threads,
and quiz attempts. The choice to make this representation
at a coarse level (e.g. viewing any lecture video is considered an interaction with the lecture videos resource, instead
of making separate resources for each lecture video that exists) was somewhat arbitrary, and we leave discussion of the
potential effect this has on classification accuracy to our
conclusions.
2.2.3 Applying n-grams to Time Series Features
The co-occurrence of features based on the time series
data may represent patterns that correlate with outcomes of
interest. For instance, if all students who watch lectures on
the sixth, seventh, and eighth day of the course end up with
a passing grade in the course, while those who do not watch
lectures these days fail to get a passing grade, then this pattern of behavior is valuable (and would be captured by our
existing transformations). If, however, a pattern of interaction such as watching consecutive lectures on any three
days was correlative with an outcome of interest, this pattern would be missed by the features described thus far.
To capture these more general patterns of interaction, we
apply the n-gram technique from text mining to interactions. An n-gram is a sequence of n words, and n-gram
features are often used as counts of particular n-grams. For
instance, if the phrase “quick brown fox” occurs twice in a
given document, the n-gram (in this case a 3-gram) feature
quick brown fox would have a value of two. In our data we
are dealing with accesses to resources such as lecture videos,
so an n-gram with the pattern (false, true, false), the label
of week, and count of 2 would indicate that a student had
two occurrences of the pattern of not watching lectures in
one week, watching in the next week, and then not watching
again in the third week.
We generate the set of n-grams ranging from 2-grams to 5-grams, covering all sequences of false and true values from (false,
false) to (true, true, true, true, true). We repeat this process for all of the features described in section 2.2.1: single
days, 3-day lengths, weeks, and months.
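The counting step can be sketched in a few lines of Python. The sketch assumes that windows overlap (a window is started at every position of the sequence) and that every possible pattern is emitted with a count of zero when it never occurs, so that all students share the same feature set. The function name and the label argument are hypothetical.

```python
from itertools import product


def ngram_counts(sequence: list[int], label: str, n_min: int = 2, n_max: int = 5) -> dict[str, int]:
    """Count every 0/1 pattern of length n_min..n_max in a binary sequence."""
    counts: dict[str, int] = {}
    for n in range(n_min, n_max + 1):
        # Start every possible pattern at zero so all students share one feature set.
        for pattern in product((0, 1), repeat=n):
            counts[f"{list(pattern)}_{label}"] = 0
        # Slide a window of length n over the sequence (assumption: overlapping windows).
        for i in range(len(sequence) - n + 1):
            counts[f"{sequence[i:i + n]}_{label}"] += 1
    return counts


# Weekly lecture access for one student: 1 = watched at least once that week.
weekly = [0, 1, 0, 1, 0, 0]
features = ngram_counts(weekly, "1W_LECTUREVIEW")
print(features["[0, 1, 0]_1W_LECTUREVIEW"])
```

For the weekly sequence shown, the pattern (false, true, false) is found twice, which corresponds to the example given in the text.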
2.3 Massive Open Online Course (MOOC) Example
The University of Michigan has offered a number of
MOOCs through the Coursera platform since 2012. As an
example, one of these MOOCs was 104 days long and used
video lectures (which we coded as lectureview), discussion forums (coded as forumthread), and quizzes (coded as quizattempt) throughout. Using the method described in 2.2.1,
we coded single day accesses (1d), three day accesses (3d),
one week accesses (1w) and one month accesses (1m) over
the 104 day period, resulting in 480 boolean features in the
format shown in Figure 1.
We added 21 more features which were summative in nature describing accesses to resources over days of the week
(starting with Sunday coded as a 0) in the format shown in
Figure 2.
Finally, we added 717 more features representing the 2-,
3-, 4-, and 5-gram patterns described in section 2.2.3. Each
of these features is in the format shown in Figure 3. The
final datafile for this course was made up of a total of 1221
features for each instance (each student enrolled in the
course) in the dataset.
1 The tools for this process are open source and available at
https://bitbucket.org/umuselab/mooc-scripts
0_1D_FORUMTHREAD...103_1D_FORUMTHREAD
0_3D_FORUMTHREAD...35_3D_FORUMTHREAD
0_1W_FORUMTHREAD...14_1W_FORUMTHREAD
0_1M_FORUMTHREAD...4_1M_FORUMTHREAD
Figure 1: List of features representing student interactions with discussion forums. Separate features exist for single day, three day, one week, and one month accesses.
0_DOTW_FORUMTHREAD...6_DOTW_FORUMTHREAD
0_DOTW_LECTUREVIEW...6_DOTW_LECTUREVIEW
0_DOTW_QUIZATTEMPT...6_DOTW_QUIZATTEMPT
Figure 2: Days of the week counts as features for the three resources.
[0, 0]_1D_FORUMTHREAD...[1, 1, 1, 1, 1]_1D_FORUMTHREAD
[0, 0]_3D_FORUMTHREAD...[1, 1, 1, 1, 1]_3D_FORUMTHREAD
[0, 0]_1W_FORUMTHREAD...[1, 1, 1, 1, 1]_1W_FORUMTHREAD
[0, 0]_1M_FORUMTHREAD...[1, 1, 1, 1, 1]_1M_FORUMTHREAD
Figure 3: Temporal pattern features as n-grams for interactions with resources.
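The feature counts for this course can be cross-checked with a short calculation. The sketch below reads the highest index for each granularity off Figure 1 and assumes that every 2- to 5-gram pattern is generated for every granularity and every resource. Under that assumption the binary and day-of-week counts come to 480 and 21, and the overall total agrees with the 1221 features reported; the n-gram count comes to 720 rather than the 717 given in the text, a difference of three that the text does not explain.

```python
RESOURCES = ["FORUMTHREAD", "LECTUREVIEW", "QUIZATTEMPT"]
# Highest index per granularity, read off Figure 1 for the 104-day course.
LAST_INDEX = {"1D": 103, "3D": 35, "1W": 14, "1M": 4}

# One binary feature per window, per resource.
binary = sum(last + 1 for last in LAST_INDEX.values()) * len(RESOURCES)
# Seven day-of-week counts per resource.
day_of_week = 7 * len(RESOURCES)
# Number of distinct 0/1 patterns of length 2 to 5: 4 + 8 + 16 + 32.
patterns = sum(2 ** n for n in range(2, 6))
# Assumption: every pattern is generated for every granularity and resource.
ngram = patterns * len(LAST_INDEX) * len(RESOURCES)

print(binary, day_of_week, ngram, binary + day_of_week + ngram)
```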
3. EXPERIMENT
We investigated whether our approach of time series interaction analysis would be suitable for answering three predictive modeling questions about MOOCs:
R1 Can we create an accurate post-hoc explanatory model
that describes the patterns of interaction that lead to
learners achieving a passing grade (defined as a Coursera “normal” grade) for a given session of a course?
R2 How generalizable is the post-hoc model, and can it
be used to accurately describe new sessions of a course?
R3 How do model accuracy and explanation change over
time if the model is created while a course is ongoing?
The first of these questions is a reflective activity, aimed
at providing a summary to course designers or instructors
as to how interactions have affected achievement. These instructional experts can then modify the pedagogy, resources,
help methods, or structure of the course to target particular
groups of learners. If the accuracy of the model is low, it still
may be useful in describing how certain kinds of interaction
patterns relate to achievement.
The second of these questions is longitudinal in nature,
aimed at generalizing the model across sessions and characterizing how predictive it might be of a new offering. Such
models are powerful but generally require historical data
and an understanding of how the structure, content, and
pedagogy of course resources has changed over subsequent
offerings. In the case of Michigan MOOCs we have made
the assumption that these resources have undergone minimal change which may not be true in other situations.
The last of these questions is most relevant when building
early-warning systems for student success. Being able to
identify early on in a course which students will pass and
which will fail allows for targeted delivery of help resources
or other interventions to those who need it the most. There
may be explanatory power in early models as well if accuracy
is high, as it allows instructional experts to see patterns of
interaction that may be associated with success but not most
predictive once the course has finished. These explanations
might help tutors or instructors identify particular problems
with course design or student misconceptions.
To address these questions, we formed predictive models with J48 decision trees using the Weka toolkit [13] for
4 different Michigan MOOCs offered on the Coursera platform. We chose these four MOOCs based on how long they
had been running and whether the curriculum, platform,
and data formats behind the course were largely unchanged.
These MOOCs covered a variety of domains and anticipated
skill levels of learners. Each MOOC had its own criteria for
determining what a passing grade for the course was and,
similar to the results of others [18], the total number of learners achieving a passing grade is only a small portion of the
total number of learners who have enrolled in the course.
Summary statistics for each of the MOOC datasets we used
are given in Table 1.
3.1 Technical Parameters and Nomenclature
Educational data, especially data from MOOCs, is often
highly unbalanced. As shown in Table 1, the fraction of students who pass the course is between 1.46% and 11.79%. As
there is a tendency for machine learning algorithms to bias
towards the majority class, and training based on balanced
data has been shown to improve accuracy in real-world educational data mining activities [8], [14], we report the majority of our measures using balanced random data where
balancing is done through subsampling of the majority class.
The exception to this is for research question R3 described
in Section 3.4 as this question is particularly aimed at generating models for a course in situ, where the proportion of
students who pass is not already known.
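Balancing by subsampling the majority class can be sketched as follows. This is a generic random undersampling routine written for illustration, not the specific filter used to produce the results reported here; the function name, the (features, label) tuple layout, and the seed parameter are assumptions.

```python
import random


def subsample_majority(instances: list[tuple[dict, str]], seed: int = 0) -> list[tuple[dict, str]]:
    """Randomly drop instances until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_label: dict[str, list[tuple[dict, str]]] = {}
    for features, label in instances:
        by_label.setdefault(label, []).append((features, label))
    smallest = min(len(group) for group in by_label.values())
    balanced: list[tuple[dict, str]] = []
    for group in by_label.values():
        # Sample without replacement; the minority class is kept whole.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: 3 passing and 47 failing students.
data = [({"id": i}, "pass") for i in range(3)] + [({"id": i}, "fail") for i in range(3, 50)]
print(len(subsample_majority(data)))
```

Subsampling discards most of the majority class, which is acceptable here because the failing class is very large relative to the passing class.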
Our experiment does not address what the optimal data
mining technique or parameters are for each of the three research questions described. Decision trees were chosen for
their ease of use and clear interpretability for instructional designers [5], and as our contribution is primarily one
of feature engineering we anticipate various other kinds of
machine learning techniques (e.g. Bayesian models, support
vector machines) will work with similar, or better, results.
We do not make any claims here as to whether J48 is the
ideal method of learning that should be used in this domain.
Unless otherwise stated, all data processing was done using
the Weka toolkit version 3.7.10 with the J48 classifier, an
implementation of C4.5. The classifier was parameterized
with a confidence level of 0.25 and a minimum leaf node size
of 50.
Course and     Number of Students   Number of          Passing    Total Number of
Offering       with Grades          Students Passing   Criteria   Interactions in Course
Literature-1   14,628               668                ≥ 60       2,696,875
Literature-2   17,385               384                ≥ 60       1,670,370
Internet-1     34,120               2,909              ≥ 80       4,331,072
Internet-2     19,990               2,671              ≥ 80       3,910,375
Internet-3     23,790               1,395              ≥ 80       2,713,321
Philosophy-1   38,344               2,331              ≥ 75       8,719,079
Philosophy-2   48,521               2,559              ≥ 75       9,840,823
Philosophy-3   17,385               1,946              ≥ 75       9,255,691
Networks-1     62,826               1,409              ≥ 78       7,379,887
Networks-2     35,274               1,379              ≥ 78       4,764,195
Networks-3     58,006               930                ≥ 78       4,509,613
Table 1: Course statistics for each MOOC analyzed. Note that the criteria for passing a course can be a
complex calculation and that it is not uncommon for MOOCs to allow for greater than 100 points to be
achieved.
[0, 0, 0, 0]_1W_FORUMTHREAD <= 4: pass (590.0/19.0)
[0, 0, 0, 0]_1W_FORUMTHREAD > 4
|   [0, 0, 0, 0, 1]_1D_LECTUREVIEW <= 2: fail (647.0/21.0)
|   [0, 0, 0, 0, 1]_1D_LECTUREVIEW > 2: pass (99.0/23.0)
Figure 4: Example of end of course decision tree for Literature-1
In our results related to all three research questions we report the number of correct classifications, the number of incorrect classifications, and Fleiss’ κ (kappa) [9] as a measure
of agreement between observed data and predicted data.
κ is chance corrected, and ranges from -1 to 1 where 0 is
an indication of chance agreement and 1 indicates complete
agreement. A challenge in educational data mining is
determining how strong a measure of κ is needed – different authors provide widely different views on the issue (see
[15], [10], [1] as cited in [19]), though there is some agreement that values above 0.41 are fair or better, and values
above 0.81 are strong. The value of κ that is required in
practice is highly contextual and is determined by the end use of
the data – a value of κ = 0.4 might represent a strong
agreement if the cost and risk of intervention is low, while a
more conservative κ = 0.8 might be required for higher
risk or high cost interventions. We address the challenge of
interpreting measures like κ in the future work section of the
paper.
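To make the chance correction concrete, the following Python sketch computes κ from a confusion matrix of observed against predicted classes. It uses the two-rater form of the statistic, which is an assumption about the exact variant behind the reported values; the function name is hypothetical. The example matrix is derived from the leaf counts of the Networks-1 tree in Figure 5.

```python
def kappa(confusion: list[list[int]]) -> float:
    """Chance-corrected agreement for a square confusion matrix.

    Rows are observed classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Networks-1 (Figure 5): the fail leaf covers 1389 instances with 3 errors,
# the pass leaf covers 1431 instances with 24 errors.
networks_1 = [
    [1386, 24],  # observed fail: predicted fail, predicted pass
    [3, 1407],   # observed pass: predicted fail, predicted pass
]
print(f"{kappa(networks_1):.4f}")
```

With balanced classes the expected agreement is 0.5, so κ is simply twice the accuracy minus one; on unbalanced data the two measures diverge, which is why κ is reported alongside the raw classification counts.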
3.2 Accurate Post-hoc Explanatory Models (R1)
Research question R1 addresses the issue of whether accurate post-hoc explanatory models can be created for a
course. This leads to two related analyses: (a) are the models
created explanatory in nature, and (b) what is the accuracy
of the models?
3.2.1 Post-hoc Model Explanatory Power
To address the first of these issues, we rely on the decision
tree generated by the J48 implementation of C4.5. The tree
lists a set of rules where leaf nodes classify learners based
on the features used in training. Each leaf node has a misclassification rate given in parentheses after the leaf node,
where the first number is the total number of instances represented by the leaf and the second is the number of those
instances that are misclassified. Figure 4 gives an example
of the decision tree for one course, Literature-1. In it there
are only three different paths to classification:
• If the learner has four or fewer one week
[0, 0, 0, 0] counts they will pass the course (misclassification rate of 3%). This path suggests that reading
discussion forums is valuable in passing this course. An
instructional expert might find this description helpful and (with a belief that the activity is causal) may
try to pull in students who go long periods without
reading discussion forums.
• If the learner has not followed the first rule, but has
more than two [0, 0, 0, 0, 1] one day lecture views
then they will pass the course (misclassification rate of
23%). This suggests that watching lectures,
spaced broadly (five days apart), at least a couple of
times is valuable if reading of discussion forums over
time is not being done. While the relatively high misclassification rate suggests care should be taken not to over-rely on this path, an instructional expert (with the
belief that the activity is causal) may do a midterm
evaluation of how students are using lecture content,
and send out email invitations to students who have
disengaged.
• The last path suggests that students who do not fit
the other two descriptions are likely to fail the course
(misclassification rate of 3%).
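The per-leaf misclassification rates can be read directly off the tree text. The sketch below parses the count annotation that follows each leaf in Figure 4, where the first number is the number of instances covered and the second is the number misclassified. The regular expression and function name are assumptions made for this example; a leaf with no second number is treated as having no errors.

```python
import re

# Matches the end of a leaf line such as ": pass (590.0/19.0)" or ": fail (12.0)".
LEAF = re.compile(r":\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_error_rate(line: str) -> tuple[str, float, float, float]:
    """Return (label, covered, misclassified, error rate) for one leaf line."""
    match = LEAF.search(line)
    if match is None:
        raise ValueError(f"not a leaf line: {line!r}")
    label = match.group(1)
    covered = float(match.group(2))
    wrong = float(match.group(3) or 0.0)
    return label, covered, wrong, wrong / covered


figure_4_leaves = [
    "[0, 0, 0, 0]_1W_FORUMTHREAD <= 4: pass (590.0/19.0)",
    "|   [0, 0, 0, 0, 1]_1D_LECTUREVIEW <= 2: fail (647.0/21.0)",
    "|   [0, 0, 0, 0, 1]_1D_LECTUREVIEW > 2: pass (99.0/23.0)",
]
for leaf in figure_4_leaves:
    label, covered, wrong, rate = leaf_error_rate(leaf)
    print(f"{label}: {wrong:.0f} of {covered:.0f} misclassified ({rate:.1%})")
```

For Figure 4 this gives roughly 3% for each of the two large leaves and roughly 23% for the small lecture-view pass leaf.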
A second end-of-course decision tree, for Networks-1, is
given in Figure 5. This tree is even more limited and has
only two paths, one of which suggests that if learners attempt
an assessment in at least two consecutive months after the
first (e.g. that they have a pattern of one month quiz attempts of [0, 1, 1] or higher) they will pass the course. This
kind of pattern, which relies on interactions with assessment
mechanisms, is prevalent throughout the rest of the MOOC
courses we considered. It is important to note that a quiz attempt does not capture the grade a student achieved on the
quiz, nor reveal conceptual errors a student may have made
about particular questions. Instead, this model only looks
at interaction activity, and misclassification rates are low.
This tree may demonstrate the low assessment demands of
learners in MOOCs (e.g. that quizzes are easy enough that
just attempting a quiz will result in passing the course), or
the highly specialized backgrounds of learners in MOOCs
(e.g. that learners aiming to get certificates in the course
already have strong backgrounds in the subject).
[0, 1, 1]_1M_QUIZATTEMPT <= 0: fail (1389.0/3.0)
[0, 1, 1]_1M_QUIZATTEMPT > 0: pass (1431.0/24.0)
Figure 5: Example of end of course decision tree for Networks-1
While a more thorough understanding of the explanatory
power of these models would require user studies, it seems
reasonable to suggest that the end of course models have
minimal explanatory benefits to instructional experts. The
models presented here instead act as summaries of learner
activity which are broadly predictive. The J48 decision tree
does not capture all intermediate models of activity, and
prunes out features which may be somewhat (even significantly) predictive but not as predictive as the summary features discussed here. The post-hoc models created through
the time series interaction analysis lack strong explanatory
powers.
3.2.2 Post-hoc Model Accuracy
The second part of research question one (R1) investigates
the accuracy of the post-hoc time series interaction models.
If the models are inaccurate then further research on improving the explanatory power of the model is questionable.
However, if the models are accurate predictions of student
results then they may be applicable in automated situations.
The results here are positive: model accuracy as measured
by the κ statistic is extremely high, 0.90 or above in all cases,
with a misclassification rate below 5% in all cases (Table 2).
This, along with the low number of paths to leaf nodes in the
decision trees, suggests that the time series interaction analysis method captures features which are highly correlated
with learner achievement, in this case defined as receiving a
passing grade in the MOOC course.
Course        κ     correctly        incorrectly
                    classified (%)   classified (%)
Literature-1  0.90  1,273 (95.28)    63 (4.72)
Internet-1    0.96  5,306 (98.48)    82 (1.52)
Philosophy-1  0.92  1,660 (96.40)    62 (3.60)
Networks-1    0.98  2,793 (99.04)    27 (0.96)
Table 2: Accuracy of models for each of the courses
considered. A κ of 1 indicates a perfectly accurate
model, while a κ of 0 represents a model as good as
chance.
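As a worked check, the Networks-1 row of Table 2 can be recomputed from the leaf counts of the tree in Figure 5. This is a minimal sketch, assuming that κ is Cohen's kappa for a two-class confusion matrix and that a J48 leaf label (N/E) means N instances reached the leaf and E of them were misclassified. The function name is hypothetical.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix: (observed - chance) / (1 - chance)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginals: predicted-pass x actual-pass
    # plus predicted-fail x actual-fail.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected)

# Figure 5 leaves, with "pass" as the positive class.
tp = 1431 - 24   # predicted pass and passed
fp = 24          # predicted pass but failed
tn = 1389 - 3    # predicted fail and failed
fn = 3           # predicted fail but passed

accuracy = (tp + tn) / (tp + fp + fn + tn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(f"correctly classified: {tp + tn} ({100 * accuracy:.2f}%), kappa: {kappa:.2f}")
```

Under these assumptions the sketch gives 2,793 correctly classified learners (99.04%) and κ of about 0.98, matching the Networks-1 row.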
The post-hoc models created through the time series interaction
analysis are highly accurate. However, this does
not speak to whether the models are generalizable, as
this experiment did not use cross-validation or testing on a
held-out set. The next section (R2) will consider this issue
directly.
3.3
Post-hoc Model Generalizability (R2)
A post-hoc analysis for a single session describes the features
that most highly correlate with success within that
session. An important issue with predictive models is how
both accuracy and explainability change after several sessions
of a course have run. We trained daily models from
the combined (balanced) data of the first two offerings of
each course, and tested these models on a full dataset (unbalanced)
from the third offering of the course.² It is important
to note that each session of the course was made up
of different learners accessing resources in a different portion
of the calendar year, and that we did not combine
all three datasets and hold out a random percentage. Instead,
our interest was in observing the sensitivity this kind
of model might have to courses being run over a different
time period. Further, only minimal investigation was made
to ensure that each course continued to use quizzes, lecture
video, and discussion forums in a way similar to previous
offerings, and pedagogical approach or instructional technique
was not constrained in any way. Table 3 provides the
results of this analysis, showing a drop in accuracy but still
relatively high values of κ ≥ 0.50.
Course      κ     correctly         incorrectly
                  classified (%)    classified (%)
Internet    0.63  22,556 (94.37)    1,346 (5.63)
Philosophy  0.50  53,389 (93.68)    3,603 (6.32)
Networks    0.73  57,466 (98.99)    640 (1.10)
Table 3: Accuracy of models when trained on the
first two sessions of a course and applied to the third
session. Overall κ values drop significantly, yet remain well above the 0.4 threshold for fair-moderate
quality.
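The evaluation design behind Table 3 (balanced training data from earlier offerings, an untouched unbalanced later offering for testing) can be sketched as follows. This section does not say how the training data were balanced, so random undersampling of the majority class is an assumption made here for illustration. The toy offerings, the `passed` field, and the function name are hypothetical.

```python
import random

def undersample(rows, label_key="passed", seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    k = min(len(positives), len(negatives))
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy offerings: each row is one learner.
offering_1 = [{"id": i, "passed": i % 10 == 0} for i in range(100)]
offering_2 = [{"id": 100 + i, "passed": i % 20 == 0} for i in range(200)]
offering_3 = [{"id": 300 + i, "passed": i % 25 == 0} for i in range(150)]

train = undersample(offering_1 + offering_2)  # balanced, earlier offerings only
test = offering_3                             # later offering, left unbalanced

print(len(train), sum(r["passed"] for r in train), len(test))
```

Keeping the third offering whole, rather than holding out a random share of pooled data, is what lets the test measure sensitivity to a course being run in a different time period.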
While the accuracy of the models dropped, the size of
the decision trees increased significantly (see Figure 6), and
it is not clear without further user studies whether this
size leads to a higher level of explainability. For instance,
the final branch of the tree
suggests that even with a lack of two week quiz attempts
([0, 0]_1W_QUIZATTEMPT > 8) and not watching lectures
each month ([1, 1, 1]_1M_LECTUREVIEW <= 0) it is possible to
pass the course, and that this may depend on whether a student
has watched a lecture in the 18th three day period
(18_3D_LECTUREVIEW = True: pass (55.0/26.0)). While
this could be erroneous (the misclassification rate is quite
high, at 47%) it is also possible that this period represents
a pivotal point in the course that only the instructional experts
associated with the course would recognize.
In summary, models trained on the first two sessions of
a course are generalizable to a third session with moderate
accuracy, and the explanatory power of models may change
but making this determination requires further study.
3.4
Model Accuracy and Explanation Change
Over Time (R3)
Summative models such as those presented in the previous
sections may be useful for understanding a particular cohort
of learners as they display patterns of interaction with resources
that correlate with success in a given course. A central
topic in the emerging field of learning analytics, however,
is how practical predictive systems can be formed based on
educational data. These systems not only need to be able
to consider unseen data as described in the previous section,
but also need to work with it in situ while a course has only
been partially completed.
To investigate this issue we trained daily models from the
combined (balanced) data of the first two offerings of each
course, and tested these models on the full dataset (unbalanced)
from the third offering of the course. In our approach we
make the assumption that the resources, assessment criteria,
and instructional tempo (e.g. length of course, deadlines
for assignments, etc.) go largely unchanged between course
offerings at the particular grain size we are investigating. In
our experiment we look not at the specific details of each
resource (e.g. the particular lecture video or segment of a
video a learner may have watched), but only at the coarse
grained activity of learners. Thus we expect that our approach
will be valuable even if fine grained changes are made
to resources (e.g. videos are modified with new content) as
long as the macro patterns of interaction are unchanged.
² As there were only two offerings of the Literature course
we excluded it from this analysis.
[0, 0]_1W_QUIZATTEMPT <= 8
|   [1, 0, 0]_1M_LECTUREVIEW <= 0
|   |   2_1M_LECTUREVIEW = False
|   |   |   [1, 1, 0]_1M_QUIZATTEMPT <= 0
|   |   |   |   18_3D_LECTUREVIEW = False
|   |   |   |   |   [1, 1]_1W_QUIZATTEMPT <= 0
|   |   |   |   |   |   [1, 1, 0]_3D_QUIZATTEMPT <= 0
|   |   |   |   |   |   |   4_3D_QUIZATTEMPT = False
|   |   |   |   |   |   |   |   3_3D_QUIZATTEMPT = False: fail (56.0/20.0)
|   |   |   |   |   |   |   |   3_3D_QUIZATTEMPT = True: pass (55.0/23.0)
|   |   |   |   |   |   |   4_3D_QUIZATTEMPT = True: pass (70.0/21.0)
|   |   |   |   |   |   [1, 1, 0]_3D_QUIZATTEMPT > 0: pass (51.0/13.0)
|   |   |   |   |   [1, 1]_1W_QUIZATTEMPT > 0: pass (502.0/65.0)
|   |   |   |   18_3D_LECTUREVIEW = True: pass (221.0/14.0)
|   |   |   [1, 1, 0]_1M_QUIZATTEMPT > 0
|   |   |   |   [1, 0, 0, 0, 0]_1W_QUIZATTEMPT <= 0: pass (99.0/10.0)
|   |   |   |   [1, 0, 0, 0, 0]_1W_QUIZATTEMPT > 0: fail (55.0/3.0)
|   |   2_1M_LECTUREVIEW = True: pass (3775.0/90.0)
|   [1, 0, 0]_1M_LECTUREVIEW > 0
|   |   [1, 0, 0, 0]_1W_QUIZATTEMPT <= 0: pass (101.0/5.0)
|   |   [1, 0, 0, 0]_1W_QUIZATTEMPT > 0
|   |   |   [0, 1, 1]_1D_FORUMTHREAD <= 0
|   |   |   |   [0, 1, 0, 0, 0]_3D_QUIZATTEMPT <= 0: pass (57.0/21.0)
|   |   |   |   [0, 1, 0, 0, 0]_3D_QUIZATTEMPT > 0
|   |   |   |   |   [0, 0, 0]_1D_LECTUREVIEW <= 70: fail (164.0/37.0)
|   |   |   |   |   [0, 0, 0]_1D_LECTUREVIEW > 70
|   |   |   |   |   |   [0, 0, 0, 0, 0]_3D_QUIZATTEMPT <= 17: pass (86.0/34.0)
|   |   |   |   |   |   [0, 0, 0, 0, 0]_3D_QUIZATTEMPT > 17: fail (53.0/19.0)
|   |   |   [0, 1, 1]_1D_FORUMTHREAD > 0: pass (56.0/16.0)
[0, 0]_1W_QUIZATTEMPT > 8
|   [1, 1, 1]_1M_LECTUREVIEW <= 0
|   |   18_3D_LECTUREVIEW = False: fail (4736.0/122.0)
|   |   18_3D_LECTUREVIEW = True: pass (55.0/26.0)
|   [1, 1, 1]_1M_LECTUREVIEW > 0: pass (276.0/33.0)
Figure 6: The decision tree generated for the end of the Internet course when trained on two datasets.
Figure 7 shows the change in κ over time for the three
courses we investigated. The dashed blue upper line in each
subfigure represents the κ when evaluating the model against
the training data (the first two offerings of the course), while
the solid red lower line represents the κ of the test data (the
third offering of the course). With respect to the training
data, all three courses show similar trends of a rapidly increasing κ that starts between 0.25 and 0.3 and reaches a
more stable value between 0.8 and 0.9 roughly three weeks
(21 days) into the course. To have positive κ values after
one day of course delivery is encouraging, and that the values continue to rise quickly suggests that this approach may
be beneficial for automated early warning systems.
The first two subfigures of Figure 7 show a rise in the value
of the testing κ as well, coming to a value over 0.4 (Philosophy)
and 0.5 (Internet) within the first three weeks of the
course, and climbing to values above 0.5 and 0.6 respectively
by the end of the period. Such values are much larger than
chance agreement (κ = 0) and fit well within the fair or better
category suggested in the literature. In the third subfigure,
corresponding to the Networks course, the testing κ
appears more linear in its
growth. The first session of this course was one week longer
than the second and third sessions. While we did no analysis
of the differences between sessions, it is interesting to see
that the predictive model still retains power (although it takes
until day 24 to achieve a κ ≥ 0.4) despite being trained on
more heterogeneous data.
[Figure 7 panels: Kappa Over Time for the Internet, Philosophy, and Networks MOOCs. x-axis: Days; y-axis: Kappa; series: Training Data, Testing Data.]
Figure 7: Kappa (κ) Over Time
[Figure 8 panels: Confusion Matrix Over Time for the Internet, Philosophy, and Networks MOOCs. x-axis: Days; y-axis: Number of Students (total = 23,902, 56,992, and 58,006 respectively); series: False Positives, False Negatives, True Positives.]
Figure 8: Classification Rates Over Time. True
negatives omitted for readability (the vast majority
of MOOC users do not achieve academic
success).
To better understand the effect the change in accuracy
has over time, we graphed the change in confusion matrix
values for each of the courses (Figure 8).³ Each matrix is
made up of four values: the number of students who were
predicted to pass and did (true positives), the number of
students who were predicted to fail and did (true negatives),
the number of students who were predicted to pass and
did not (false positives) and the number of students who
were predicted to fail and passed (false negatives). The values
are reported in actual terms from the unbalanced third
dataset. For instance, if an early-warning system for the Internet
course was configured with these predictive models,
it would have identified 20,730 students as likely to fail,
incorrectly classifying 46 of these who end up passing the course
while at the same time missing 1,915 students who will end
up failing the course. The instructional expert (or systems
administrators or designers) must weigh the cost of the intervention
(e.g. fiscal cost to the institution) as well as the
detriment of delivering the intervention to the 46 students
(e.g. annoying or discouraging on-track students). This second
issue is one that is concerning, and it is positive to see
that in our models the false negatives (incorrectly predicting
students will fail) drop off rapidly in all situations (by the
21 day mark).
Applying time series interaction analysis models to new
unbalanced MOOC datasets based on similar balanced historical
data is moderately accurate by the third week of the
course (κ ≥ 0.4) with false negatives and true positives
reaching stable low and high levels respectively.
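The Internet course example can be turned into a full confusion matrix with the subtraction described for Figure 8 (true negatives are the total minus the three graphed values). This is a minimal sketch, assuming that the three quoted counts all describe the same point in time for the 23,902 learners of the third offering; the text does not state which day they correspond to, and the variable names are hypothetical.

```python
total = 23_902           # Internet MOOC, third offering (Figure 8 axis label)
flagged_fail = 20_730    # students the model predicts will fail
false_negatives = 46     # flagged as failing but actually pass
false_positives = 1_915  # predicted to pass but actually fail

# Everyone flagged as failing either fails (true negative) or passes (false negative).
true_negatives = flagged_fail - false_negatives
# The remaining learners were predicted to pass and did.
true_positives = total - true_negatives - false_positives - false_negatives

# Share of flagged students who would receive an unnecessary intervention.
unnecessary_rate = false_negatives / flagged_fail
# Share of eventually failing students the system would not flag.
missed_rate = false_positives / (false_positives + true_negatives)

print(true_positives, true_negatives)
print(f"unnecessary: {unnecessary_rate:.2%}, missed: {missed_rate:.2%}")
```

Read this way, very few flagged students are flagged wrongly, while a noticeably larger share of students who go on to fail are never flagged, which is the trade-off the instructional expert has to weigh.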
4.
CONCLUSIONS AND FUTURE WORK
4.1
Conclusions
While much work has been done in leveraging cognitive
modelling for predictive models, data-driven predictive modelling of achievement that scales across contexts (course offerings, instructors, and institutions) is in its infancy. The
technique we have described here is based on automatically
generating machine learning features from learner interactions with educational resources over time. Specifically, features are created as n-grams over different time periods from
log files created by the technology enhanced learning environment (in this case, Coursera). This technique can be
scaled widely and applied to educational datasets without
burdening a domain or instructional expert in the process
of model generation.
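The n-gram idea can be sketched in a few lines. The following is an illustration only: it assumes each time window is reduced to a binary flag (active or not) and that a feature counts how often a pattern of consecutive flags occurs, with labels styled after those in Figures 5 and 6. The exact construction used for the reported models may differ, and the function names and toy sequence are hypothetical.

```python
from collections import Counter

def ngram_counts(flags, n):
    """Count each run of n consecutive window flags in one learner's sequence."""
    return Counter(tuple(flags[i:i + n]) for i in range(len(flags) - n + 1))

def feature_name(pattern, window, event):
    """Build a label in the style of the figures, e.g. '[0, 1]_1W_QUIZATTEMPT'."""
    return f"{list(pattern)}_{window}_{event}"

# Hypothetical learner: 1 = attempted a quiz in that week, 0 = did not.
weekly_quiz = [0, 1, 1, 0, 0, 1]

features = {}
for n in (2, 3):
    for pattern, count in ngram_counts(weekly_quiz, n).items():
        features[feature_name(pattern, "1W", "QUIZATTEMPT")] = count

for name in sorted(features):
    print(name, features[name])
```

Because the features are derived mechanically from window flags, the same code would apply to any event type or window length without input from a domain expert.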
Using this approach we have shown that:
R1 Models are highly accurate (κ ≥ 0.9) when used post-hoc but lack strong explanatory power.
R2 Models trained on the first two sessions of a course are
generalizable to a third session with moderate accuracy
(κ ≥ 0.5), and the explanatory power of models may
change.
R3 Models are moderately accurate when applied to new
real-world data by the third week of the course (κ ≥
0.4) with false negatives and true positives reaching
stable low and high levels respectively. We have further characterized what this accuracy looks like at
a day-by-day level, an important consideration when
building predictive modelling solutions and an important issue for future research.
4.2
Future Work
Having demonstrated the success of this approach to feature generation, an important next step will be to analyze
the predictive nature of each of the features generated. In
this work, we performed no feature selection prior to model
generation. It may be the case that the predictive power
of the models is increased by selecting only those features
that strongly correlate with the predicted value (course outcome). Furthermore, the decision trees generated provide
little insight into the relative predictive power of each of the
features. Inspecting the trees, we can discern which features
are used in making predictions; however, we cannot determine
exactly how much more informative these features are
than the features not used by the decision trees. Finally, to
increase model interpretability, preference should be given
to shorter n-grams, as the concise nature of these features
leads to easier explanation.
³ Due to the massive level of true negatives, students who we
predict will fail and do, we have omitted these values from
the graphs. To determine the value at any given time take
the total sample size given on the y-axis label and subtract
from it the graphed values of False Positives, False Negatives,
and True Positives.
In order to make predictions about student outcomes in a
particular course, the approach used in this paper assumes
that prior course offerings are similar to the present offering. Thus, patterns of activity that lead to success in prior
offerings will again lead to success in the current offering. It
may be the case that some offerings of a course are taught
in one particular style, whereas other offerings are taught
in a different style (for instance, if there are two different instructors, or a single instructor who is testing alternate
teaching methodologies). In our results of section 3.4, we
pooled all data from prior course offerings to build a single
predictive model to apply to the final course offering. Instead, it might be more prudent to build multiple models,
one for each course offering and pool their predictions in a
voting model. Alternately, we could analyze the resource
usage of the current course and the prior courses to find
the most similar prior offering, in an attempt to apply the
single most relevant model. This final approach may also
generalize across course domains, allowing us to make predictions about students in new courses, rather than relying
on historical offerings of the same course for training.
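The voting alternative could look like the following minimal sketch. It is an assumption about one possible design, not something evaluated in this paper: each prior offering contributes its own model, and the pooled prediction is the majority label. The three stand-in models, their thresholds, and the feature names are hypothetical.

```python
from collections import Counter

def majority_vote(models, features):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]

# Stand-ins for models trained separately on three prior offerings.
def offering_1_model(f):
    return "pass" if f["quiz_attempts"] > 0 else "fail"

def offering_2_model(f):
    return "pass" if f["lecture_views"] > 2 else "fail"

def offering_3_model(f):
    return "pass" if f["quiz_attempts"] > 1 and f["forum_posts"] > 0 else "fail"

models = [offering_1_model, offering_2_model, offering_3_model]
learner = {"quiz_attempts": 1, "lecture_views": 5, "forum_posts": 0}
prediction = majority_vote(models, learner)
print(prediction)
```

With an odd number of models and two labels there are no ties; weighting each vote by the similarity between its offering and the current one would be a natural extension of the most-similar-offering idea.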
In the MOOC context, we have an abundance of data;
the enrollment numbers are significant, and the online-only
nature of the course allows for tracking many of the student-content interactions. Does this hold true for traditional
higher education blended courses? Will our approach work
for courses with as few as 30 students? How well will the
approach generalize when there is significant offline content,
such as face to face lectures, or assigned readings from textbooks, where student interaction cannot be tracked? Further analysis is required to determine the scope and magnitude of data that is needed in order to build a model of
student success in these situations.
While we have proposed that our technique may be integrated into an early warning system for students at risk of
course failure, determining an appropriate level of predictive
accuracy remains an open problem. There is the potential
for some degree of harm caused by erroneous predictions:
successful students who are advised that they are at risk
of failure may face unwarranted anxiety, or consider withdrawing from a course unnecessarily. Conversely, students
at risk of failure who are advised that they are on a path
to success may be falsely reassured and perhaps blindsided
by their eventual failure. A further analysis of these costs,
as well as the benefits of correct predictions, is required to
fully evaluate the merit of our predictive models as well as
the predictive models used by early warning systems.
5.
REFERENCES
[1] D. G. Altman. Practical statistics for medical research.
CRC Press, 1990.
[2] J. Anderson. Rules of the mind. 1993.
[3] R. Barber and M. Sharkey. Course correction: using
analytics to predict course success. In Proceedings of
the 2nd International Conference on Learning
Analytics and Knowledge, pages 259–262. ACM, 2012.
[4] B. S. Bloom, M. Engelhart, E. J. Furst, W. H. Hill,
and D. R. Krathwohl. Taxonomy of educational
objectives: Handbook i: Cognitive domain. New York:
David McKay, 19:56, 1956.
[5] C. A. Brooks and J. E. Greer. Explaining predictive
models to learning specialists using personas. In 4th
International Conference on Learning Analytics and
Knowledge 2012 (LAK’14), pages 26–30, 2014.
[6] S. Bull, J. Greer, G. McCalla, L. Kettel, and J. Bowes.
User modelling in i-help: What, why, when and how.
In User Modeling 2001, pages 117–126. Springer, 2001.
[7] Carnegie Learning. The Cognitive Tutor: Applying
Cognitive Science to Education. Technical report,
Carnegie Learning, Inc., Pittsburgh, PA, USA, 1998.
[8] N. V. Chawla. C4.5 and imbalanced data sets:
investigating the effect of sampling method,
probabilistic estimate, and decision tree structure. In
Proceedings of the ICML, volume 3, 2003.
[9] J. L. Fleiss. Measuring nominal scale agreement
among many raters. Psychological Bulletin,
76(5):378–382, 1971.
[10] J. L. Fleiss, B. Levin, and M. C. Paik. Statistical
methods for rates and proportions. John Wiley &
Sons, 2013.
[11] K. J. Gergen. The social constructionist movement in
modern psychology. American psychologist, 40(3):266,
1985.
[12] A. C. Graesser, P. Chipman, B. C. Haynes, and
A. Olney. Autotutor: An intelligent tutoring system
with mixed-initiative dialogue. Education, IEEE
Transactions on, 48(4):612–618, 2005.
[13] M. Hall, E. Frank, G. Holmes, B. Pfahringer,
P. Reutemann, and I. H. Witten. The weka data
mining software: an update. ACM SIGKDD
explorations newsletter, 11(1):10–18, 2009.
[14] S. M. Jayaprakash, E. W. Moody, E. J. Lauría, J. R.
Regan, and J. D. Baron. Early alert of academically
at-risk students: An open source analytics initiative.
Journal of Learning Analytics, 1(1):6–47, 2014.
[15] J. R. Landis, G. G. Koch, et al. The measurement of
observer agreement for categorical data. biometrics,
33(1):159–174, 1977.
[16] B. Martin. Constraint-based modelling: Representing
student knowledge. New Zealand Journal of
Computing, 7(2):30–38, 1999.
[17] S. Ohlsson. Learning from performance errors.
Psychological Review, 103(2):241–262, 1996.
[18] L. Perna, A. Ruby, R. Boruch, N. Wang, J. Scull,
C. Evans, and S. Ahmad. The life cycle of a million
mooc users. In Presentation at the MOOC Research
Initiative Conference, 2013.
[19] Q. Xie. Agree or disagree? a demonstration of an
alternative statistic to cohen’s kappa for measuring
the extent and reliability of agreement between
observers. 2013.