Song Similarity Classification
Using Music Information Retrieval on the Million Song Dataset

Authored by

Richard Nysäter
Nysater@kth.se
070-4229705
Cardellgatan 3
11436 Stockholm

Tobias Reinhammar
Tobrei@kth.se
070-6648678
Abrahamsbergsvägen 87
16830 Bromma
Supervisor
Anders Askenfelt
School of Computer Science and Communications
Royal Institute of Technology
Bachelor Degree Project in Computer Science, DD143X
May 24, 2013
Abstract
The purpose of this study was to investigate the possibility of automatically classifying the similarity of song pairs. The machine learning algorithm k-nearest neighbours, combined with both bootstrap aggregating and an attribute selection classifier, was first trained by combining the acoustic features of 45 song pairs extracted from the Million Song Dataset with user-submitted similarity ratings for each pair. The trained algorithm was then utilized to predict the similarity between 50 hand-picked and about 4000 randomly chosen pop and rock songs from the Million Song Dataset.
Finally, the algorithm was subjectively evaluated by asking users to identify which out of two randomly ordered songs, one with a low and one with
a high predicted similarity, they found most similar to a target song. The
users picked the same song as the algorithm 365 out of 514 times, giving the
algorithm an accuracy of 71%.
The results indicate that automatic and accurate classification of song similarity may be possible and could thus be used in music applications. Further research on improving the current algorithm, or on finding alternative algorithms, is warranted to draw further conclusions about the viability of using automatically classified song similarity in real-world applications.
Sammanfattning
The purpose of this study was to investigate whether it is possible to automatically compute how similar two songs are. The study used the machine learning algorithm k-nearest neighbours together with bootstrap aggregating and a classifier that filters out irrelevant features. The algorithm was first trained by combining a number of acoustic parameters with users' similarity ratings for 45 song pairs, created by combining 10 songs extracted from the Million Song Dataset with each other. The trained algorithm was then used to compute the similarity between 50 hand-picked and approximately 4000 randomly chosen pop and rock songs from the Million Song Dataset.
Finally, the results were evaluated through a second phase of user testing. Users were asked to listen to a target song, one of the 50 hand-picked songs, followed by one song that the algorithm had matched as very similar and one song that it had matched as very dissimilar, in random order. The user then chose which of the two songs seemed most similar to the target song. The algorithm and the user chose the same song in 365 out of 514 cases, giving the algorithm an accuracy of 71%.
The results suggest that it may be possible to develop an algorithm that can automatically classify the similarity between songs with high precision and that could therefore be used in music applications. Further development of the algorithm, or research into alternative algorithms, is necessary in order to draw further conclusions about how useful automatic estimation of song similarity is for real-world applications.
Statement of Collaboration
Tobias Reinhammar recruited the test subjects for the web application and
chose the tracks used in the user data gathering phase and the application
evaluation. Richard Nysäter wrote most of the code used in the application,
with critical input from Tobias regarding design, usability and functionality. This paper, the project specification and major project decisions and
evaluation were collaborative efforts where both parties contributed equally.
Contents

1 Introduction
  1.1 Background
    1.1.1 Terminology
    1.1.2 Related work
  1.2 Problem statement
    1.2.1 Hypothesis

2 Method
  2.1 Million Song Dataset
    2.1.1 Features
  2.2 Machine Learning
    2.2.1 Supervised Machine Learning algorithms
  2.3 Project phases
    2.3.1 Gathering input data
    2.3.2 Utilizing the user data
    2.3.3 Subjective evaluation

3 Results
  3.1 User ratings
  3.2 Automated similarity rating
  3.3 User evaluation

4 Discussion
  4.1 Confounding factors
    4.1.1 Million Song Dataset
    4.1.2 Feature usage
    4.1.3 User data
    4.1.4 Learning tracklist
    4.1.5 Machine learning algorithm

5 Conclusions and future work
  5.1 Future work

Acknowledgments

References

6 Appendix
  6.1 Appendix A - Million Song Dataset Field List
  6.2 Appendix B - Web Application
    6.2.1 User rating application
    6.2.2 User evaluation application
  6.3 Appendix C - The Evaluation Tracklist
Chapter 1
Introduction
1.1 Background

With the massive number of music tracks available today, ways to let users discover new music according to their personal preferences through automatic song analysis are in high demand. While recommending similar artists is a prominent feature in popular music applications, recommending similar songs is, as of now, quite uncommon.
Thanks to the distribution of the Million Song Dataset, a large amount of acoustic features and metadata is now freely available to researchers. One way of analyzing songs to discover related music is to define a similarity between songs and then recommend songs with a high similarity.
While groups of experts or users are employed today to manually tag related artists, manually rating song similarity is infeasible as the time requirement is immense. As such, utilizing an automated system to classify the similarity between songs is a far more sensible approach. Although such a system would be highly beneficial, measuring similarity is not a simple task due to the complexity of the many patterns that create the notion of similarity.
1.1.1 Terminology
Music information retrieval - MIR
MIR is the science of extracting information from music. The field of MIR
has become increasingly relevant with the current development of various
music services available over the internet. With the massive amount of songs
available, useful tools for discovering new music according to personal preferences are in high demand. Thus, effective methods of automatically analyzing
songs are required in order to be able to process large quantities of data.
Million Song Dataset - MSD
The purpose of the MSD is, as described by the creators LabROSA and The Echo Nest [1],
• to encourage research on algorithms that scale to commercial sizes.
• to provide a reference dataset for evaluating research.
• to act as a shortcut alternative to creating a large dataset with APIs.
• to help new researchers get started in the MIR field.
The MSD contains a large quantity of musical features and metadata
extracted from one million songs provided by The Echo Nest. The Echo
Nest is the largest repository of dynamic music data in the world, containing
data on over 34 million songs [2]. The full list of features and metadata
is available in Appendix A. This project does not use the full Million Song
Dataset due to time and computational constraints.
Machine Learning
Machine learning is a field that seeks to answer the question “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” [3].
A typical application of machine learning is when designing an algorithm manually would be too complex for a human, or when an application needs to adapt to new environments without human input. Furthermore, current knowledge may not be relevant in the future; when continuously redesigning a system is not feasible, machine learning may allow the system to adapt on its own [4].
Supervised Machine Learning
Supervised machine learning is an area of machine learning where algorithms are first trained on externally supplied data in order to make predictions about future data [5]. This is accomplished by first associating vectors of observations, the training data, with class labels and then creating a mathematical function. The function can then be used to predict the value of missing labels in future data [6].
Training set
The externally supplied data, which contains known values for the class labels that will later be predicted by the supervised machine learning algorithm, is called a training set.
Test set
The data with unknown class labels which are predicted by the machine
learning algorithm is called the test set.
Waikato Environment for Knowledge Analysis - WEKA
WEKA is a suite of machine learning software created by the University of Waikato in New Zealand. WEKA contains machine learning tools for data preprocessing, classification, regression, clustering, association rules, and visualization [7]. Since WEKA is easily accessible and contains many of the popular machine learning algorithms used within Music Information Retrieval, it was chosen as the main tool to create the similarity algorithm.
1.1.2 Related work
Music Similarity
Aucouturier and Pachet performed similar research on automatically classifying the similarity between songs [8]. Their study used a set of 17,075 songs, in which the timbre of different pairs of songs was compared to determine their similarity. They performed a subjective
evaluation with 10 users in which the users were presented with a target
song, followed by two test songs. The two test songs were chosen so that
one was measured similar to the target song, while the other was dissimilar.
Afterwards, users were asked to decide which of the two test songs they found
most similar to the target song. In this experiment the similar test song was
picked by the users 80% of the time.
The success of Aucouturier and Pachet's work inspired this project in a number of ways, but mainly in two aspects. Their research into the relevance of the timbre feature was influential when selecting features for this study. Furthermore, the evaluation method conducted by the authors was deemed suitable for this study's evaluation.
Music Title Identification
A related work in this domain is the identification of music titles [9, 10], where the artist and title of a music source are identified by matching the audio fingerprint of the music to known fingerprints of a large number of songs. An identification rate above 98% has been achieved for the MPEG-7 standard [9]. While music title identification works well for finding the title of a music source, it does not address the problem of finding similar music, since it does not consider the factors that make humans perceive two songs as similar [11]. However, it is clear that many of the features used in music title identification, such as timbre and loudness, can be used when determining music similarity.
Music Genre Classification
Automatic classification of the genre of a song is one of the more commonly researched topics in the field of music information retrieval today. Several studies have managed to classify genres with a reasonable degree of accuracy: Tzanetakis and Cook achieved an accuracy of 60% [12], and other studies have achieved accuracies of at least 80% [13, 14].
Genre classification is closely related to music similarity, with similar
predictive machine learning algorithms and musical features used for both
tasks.
1.2 Problem statement
The main purpose of this project was to investigate the possibility of using the
data contained in the MSD in combination with user-supplied song similarity
ratings to create a machine learning algorithm able to classify the similarity
between songs belonging to the pop and rock genres with reasonable accuracy.
The project was limited to two closely related genres, as including more
genres would require a quantity of user ratings and computation time beyond
the scope of this study.
1.2.1 Hypothesis
Our hypothesis was that the developed application would be able to determine whether songs are very different, but might not accurately select the most similar songs. This hypothesis was based on several confounding factors encountered in the initial stages of the project (see chapter 4, Discussion).
Chapter 2
Method
This section describes the approach taken when the algorithm was created and evaluated, and gives a detailed explanation of the machine learning algorithm and the features used from the MSD.
2.1 Million Song Dataset
The MSD was used as the dataset of choice since it is both freely available and contains a vast number of songs and relevant musical information. The features utilized to create the similarity rating are described below.
2.1.1 Features
The following is a detailed description of the MSD features which were examined and employed in this research. Some features were excluded because they were not considered relevant to the perceived similarity or because their potential applications were deemed too complex, and some were excluded due to their low availability in the MSD.
Tempo
Tempo is the speed of a musical composition; in the MSD it is measured in beats per minute (BPM), the number of beats that are played per minute. This feature was chosen due to its significant impact on a musical piece: for example, a happy and a sad song often differ significantly in tempo, with happy songs generally played at a faster, more energizing pace than sad songs. The tempo of a pair of songs was given as a ratio, calculated by dividing the tempo of the quicker song by the tempo of the slower song.
Loudness
The overall loudness of a song is derived from a weighted combination of the individual loudness values of notes (start and maximum loudness). The loudness values are mapped to a human auditory model and measured in decibels (dB) [15]. This feature was chosen as the difference in loudness between two songs is likely a contributing factor to the perceived similarity of songs. The loudness for a pair of songs was calculated as a ratio value using the same method as tempo.
Key
In the context of music theory, the key refers to the tonic note and chord. Keys range from C to B (C, C#, D, D#, ..., B), and in the MSD, C corresponds to the value 0 and B to 11. This feature was chosen because the relation between the keys may influence the perceived similarity of songs. A pair of songs was assigned a key value equal to the distance between the songs' keys in the chromatic circle.
Mode
The mode indicates the modality (major or minor) of a track, i.e. the type of scale from which its melodic content is derived [15]. This feature was chosen because the mode greatly affects the overall mood and feeling of a song. A pair of songs was assigned a mode value of 0 if the two songs were of the same mode and a value of 1 otherwise.
Timbre
Timbre is the quality of a musical note or sound that distinguishes different types of musical instruments or voices. It is also referred to as sound color, texture, or tone quality, and is derived independently of pitch and loudness. In the MSD the timbre feature is represented as a 12-dimensional array. The twelve elements emphasize different musical qualities such as brightness, flatness and loudness, and are ordered by importance [15]. This feature was chosen because it is likely one of the most important aspects, as it describes the musical instruments and vocal performance, both of which are of utmost importance to the perceived similarity of two songs. The timbre values for a pair of songs were calculated as the absolute value of the difference between the elements that emphasize the same quality in each song's array.
Timbre Confidence
Because the data in the MSD is automatically generated from songs, there is a varying degree of uncertainty in the timbre values. The confidence value is between 0 and 1, and a low confidence means the value should be considered speculative. This feature was chosen because it allows the algorithm to take the confidence into consideration when estimating the similarity based on the timbre. The timbre confidence of a pair of songs was calculated as the sum of the two songs' individual confidences.
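Taken together, the comparisons above reduce each song pair to a single numeric vector. The listing below is a minimal sketch of how such a pair vector might be computed; the SongFeatures record, the class and method names, and the handling of the MSD's negative loudness values (comparing magnitudes) are illustrative assumptions rather than the project's actual code.

public final class PairFeatures {

    /** Per-song features as extracted from the MSD (illustrative names). */
    public record SongFeatures(double tempo, double loudness, int key,
                               int mode, double[] timbre, double timbreConfidence) {}

    /** Ratio of the larger value to the smaller, as used for tempo and loudness. */
    static double ratio(double a, double b) {
        return Math.max(a, b) / Math.min(a, b);
    }

    /** Distance between two keys on the chromatic circle, in the range 0..6. */
    static int keyDistance(int keyA, int keyB) {
        int d = Math.abs(keyA - keyB);
        return Math.min(d, 12 - d);
    }

    /** Builds the attribute vector for one song pair. */
    public static double[] pairFeatures(SongFeatures a, SongFeatures b) {
        int n = a.timbre().length;                    // 12 timbre elements in the MSD
        double[] v = new double[4 + n + 1];
        v[0] = ratio(a.tempo(), b.tempo());           // tempo ratio, quicker / slower
        v[1] = ratio(Math.abs(a.loudness()),          // loudness ratio; MSD loudness is
                     Math.abs(b.loudness()));         // negative dB, magnitudes assumed
        v[2] = keyDistance(a.key(), b.key());         // chromatic-circle distance
        v[3] = a.mode() == b.mode() ? 0 : 1;          // 0 if same modality, else 1
        for (int i = 0; i < n; i++)                   // element-wise absolute difference
            v[4 + i] = Math.abs(a.timbre()[i] - b.timbre()[i]);
        v[4 + n] = a.timbreConfidence() + b.timbreConfidence(); // summed confidence
        return v;
    }
}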
2.2 Machine Learning
Machine learning is a powerful tool both for handling large quantities of data and for creating algorithms which would be too complex for humans to design. It was therefore deemed the most convenient and efficient way to automatically classify songs, as considering each timbre value manually is a very complex task. In this study the WEKA suite is used to create and apply the various machine learning algorithms.
2.2.1 Supervised Machine Learning algorithms
Supervised machine learning is utilized in this study because users are likely good at determining the similarity of song pairs, and their judgments can then be used to train a classifying algorithm. The supervised machine learning algorithm will use the training set to weigh the importance of the relations between the similarity rating and the other musical features. The trained algorithm will then be used to predict a similarity for the other song pairs in the test set, for which the similarity is unknown.
K-nearest neighbours (k-NN)
k-NN is an instance-based classifier [16] that works by creating a vector of N values for each known instance, where the N values are every attribute except the one being predicted, and placing each of these instances in an N-dimensional space. The algorithm then places the instances being predicted in the same N-dimensional space and assigns the unknown value of each instance according to the average of the known values of its k nearest neighbours [17]. The algorithm used in this project uses the Euclidean distance between instances, which means each included parameter is equally important, and it also weights the nearest neighbours by the inverse of their distance to the instance being classified. This causes closer neighbours to be more important when predicting the unknown instance, which is particularly helpful when the amount of known values is low. The k-value used in this project was 7; it was chosen iteratively by minimizing the root mean square error while still trying to maximize the correlation coefficient when cross-validating the training data over 10 folds. This algorithm was chosen because it is fast and simple and still achieved better results than a few other algorithms, such as support vector machines and decision trees, in a small preliminary test.
Bagging
Bagging, also known as bootstrap aggregating, was chosen to enhance the k-NN algorithm because it minimizes prediction error by reducing overfitting. This is accomplished by training the classifier on resampled versions of the learning set and aggregating their predictions [17]. In machine learning, overfitting occurs when an algorithm overvalues features that increase the accuracy on the training set but are irrelevant when predicting values for a test set [18].
Attribute selection
Attribute selection tries to eliminate redundant or irrelevant features from the feature set, which reduces overfitting [19]. Because the k-NN algorithm uses Euclidean distance, it is very important to only include important features; since the actual importance of the features was unknown, attribute selection was applied.
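The listing below sketches how the stack described above (attribute selection around bagging around distance-weighted k-NN with k = 7) can be assembled and cross-validated with the WEKA API. The training file name is a placeholder, and the choice of CFS subset evaluation with best-first search is WEKA's default attribute selection, assumed here since this study does not specify which evaluator was used.

import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.Bagging;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SimilarityModel {
    public static void main(String[] args) throws Exception {
        // Load the training pairs; "pairs.arff" is a placeholder file name.
        Instances data = new DataSource("pairs.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // similarity is the class

        // k-NN with k = 7 and inverse-distance weighting, as described above.
        IBk knn = new IBk(7);
        knn.setDistanceWeighting(
            new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));

        // Bagging around the k-NN learner to reduce overfitting.
        Bagging bagging = new Bagging();
        bagging.setClassifier(knn);

        // Attribute selection around everything; CFS + best-first is an
        // assumption (WEKA's default), not a documented project choice.
        AttributeSelectedClassifier model = new AttributeSelectedClassifier();
        model.setEvaluator(new CfsSubsetEval());
        model.setSearch(new BestFirst());
        model.setClassifier(bagging);

        // 10-fold cross-validation, reporting the two measures used to pick k:
        // root mean square error and the correlation coefficient.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.printf("RMSE: %.3f, correlation: %.3f%n",
            eval.rootMeanSquaredError(), eval.correlationCoefficient());
    }
}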
2.3 Project phases
The project was conducted in three phases. First, data was gathered in the
form of user-submitted similarity ratings. The user data was then used to
teach the machine learning algorithm which parameters carry the most significance when determining how similar two songs are. Finally, the algorithm
was applied to a larger set of songs and the results were evaluated by a final
phase of user testing.
In the evaluation, the users were first presented with a song from the
evaluation tracklist followed by two songs from the subset; one of these songs
was determined to be one of the most similar to the target song, and the
other was one of the least similar. The users were then asked to pick which
one of the two songs from the subset they considered to be most similar to
the target song.
2.3.1 Gathering input data
User data was gathered by hosting a web application, detailed in Appendix B - Web Application, which enabled the users to listen to pairs of song samples extracted from the 7digital API [20]. These songs, the learning tracklist, were composed of a selected set of 10 unique songs. Every song in the set was matched against every other song, adding up to a total of 45 unique pairs for the user to rate on a scale from 0 to 100.
The songs selected for this phase were all chosen from the pop and rock genres. These genres were selected because they share many similarities and are quite familiar to most users, which is likely to improve the quality of the user-submitted data. The 10 songs were chosen to provide a reasonable coverage of the pop and rock genres through variations in speed, mood, vocal performance and instrumental composition. Table 2.1 lists the artists and tracks which constitute the learning tracklist.
Track                 Artist
The Unforgiven II     Metallica
The Trooper           Iron Maiden
White Flag            Dido
A New Day Has Come    Céline Dion
About You Now         Timo Räisänen
Basket Case           Green Day
Wind Of Change        Scorpions
Here I Go Again       Whitesnake
Smoke                 Natalie Imbruglia
Wonderwall            Oasis

Table 2.1: The learning tracklist
The user is first introduced to the application by three sample pairs which display a roughly estimated rating, in order to give the user some insight into what kind of songs are present in the set. Subsequently, the real sample set is introduced and the ratings are saved. The rating session is matched to the user's IP address in order to limit the number of ratings supplied by each user.
2.3.2 Utilizing the user data
Firstly, the training set for the machine learning algorithm was created by extracting the differences between the two songs in every pair, as described earlier in the method. Secondly, the average of the user-submitted similarity ratings was added to every pair. Additionally, every song was matched with itself and given a similarity rating of 100, in order to supply the algorithm with a few perfect matches. Lastly, the training set was used to automatically classify the similarity between pairs composed of the evaluation songs and the songs in the subset (a short sketch of this step is given at the end of this section). In order to limit the size of the subset and keep the research within the pop and rock genres, the evaluation tracklist was only compared against songs by artists who featured a pop or rock tag, both supplied by users at MusicBrainz.org [21] and present among the Echo Nest genre tags. Furthermore, some tags were excluded from the search, due to being at the very edges of the genres and therefore not well represented in the learning set. The excluded subgenres were the following: Grindcore, Deathgrind, Black metal, Doom metal, Sludge metal, Noise, Screamo, Glitch, Glitchcore, Aggrotech, Metalcore and Death metal.
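A minimal sketch of this classification step is given below: the trained model (for instance the stack from the earlier listing) predicts a numeric similarity for each unrated pair via classifyInstance. The .arff file names and the method name are placeholders, not the project's actual files.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BatchPrediction {
    // Trains the given model on the rated pairs (including the self-pairs
    // at similarity 100) and prints a prediction for every unrated pair.
    public static void predictAll(Classifier model) throws Exception {
        Instances train = new DataSource("training-pairs.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);     // similarity is the class

        Instances unrated = new DataSource("evaluation-pairs.arff").getDataSet();
        unrated.setClassIndex(unrated.numAttributes() - 1); // class values missing here

        model.buildClassifier(train);
        for (int i = 0; i < unrated.numInstances(); i++) {
            // For a numeric class, classifyInstance returns the predicted value.
            double similarity = model.classifyInstance(unrated.instance(i));
            System.out.printf("pair %d: predicted similarity %.1f%n", i, similarity);
        }
    }
}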
2.3.3 Subjective evaluation
In addition to the 10 songs used in the first phase, another 40 songs from the pop and rock genres were added for the evaluation, as presented in Appendix C - The Evaluation Tracklist. In the same manner as in the first phase, the songs were chosen to provide a reasonably good coverage of the genres. They were also chosen to be fairly well known and popular, in order to improve the user experience and therefore make users more likely to continue rating.
The user evaluations were gathered through a slightly modified version of the web application from the first phase. The users were presented with a target song followed by two test songs from a subset of approximately 4000 songs extracted from the Million Song Dataset. One of the test songs was randomly chosen from the 10 songs with the highest similarity to the target song, and the other from the 10 songs with the lowest similarity.
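The listing below sketches how the two test songs might be drawn for one target song: the subset is ranked by predicted similarity, and one song is drawn at random from the top 10 and one from the bottom 10. The Prediction record and all names are illustrative assumptions, not the project's actual code.

import java.util.Comparator;
import java.util.List;
import java.util.Random;

record Prediction(String trackId, double similarity) {}

final class EvaluationPairPicker {
    private static final Random RNG = new Random();

    // Returns {similar, dissimilar}; assumes the subset holds at least 20 songs.
    static String[] pickTestSongs(List<Prediction> predictions) {
        List<Prediction> ranked = predictions.stream()
            .sorted(Comparator.comparingDouble(Prediction::similarity).reversed())
            .toList();
        Prediction similar = ranked.get(RNG.nextInt(10));                    // top 10
        Prediction dissimilar = ranked.get(ranked.size() - 1 - RNG.nextInt(10)); // bottom 10
        return new String[] { similar.trackId(), dissimilar.trackId() };
    }
}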
Chapter 3
Results
In this section the results of the study are presented in three parts, one for each of the three phases of the research.
3.1 User ratings
In total, 28 users submitted 965 similarity ratings of the 45 song pairs. The lowest number of ratings on a pair was 18 and the highest 26. The average similarity rating of all song pairs was 37.3 and the average standard deviation was 11.9. The histogram below, figure 3.1, illustrates the distribution of the average user-submitted similarity ratings of the song pairs in the training set. Most song pairs received a rather low similarity rating.
The pair “White Flag - Dido” and “Smoke - Natalie Imbruglia” was rated 77, which was the highest similarity rating. The pair “A New Day Has Come - Céline Dion” and “The Trooper - Iron Maiden” was rated with the lowest similarity rating, 7.
Both “White Flag - Dido” and “Smoke - Natalie Imbruglia” are performed
by female pop artists, and are quite close in terms of tempo. Both songs are
rather mellow and the instrumental compositions both feature strings and
drums.
“A New Day Has Come - Céline Dion” is a slow-paced pop ballad with female vocals. The song is mainly accompanied by piano and a background of strings. “The Trooper - Iron Maiden”, on the other hand, is a fast-paced hard rock song with male vocals. The instrumental composition is that of electric guitar, drums and bass guitar.
Figure 3.1: Histogram detailing the distribution of user ratings for the song pairs
3.2 Automated similarity rating
In total, 196,000 ratings were predicted by the algorithm: 3920 songs were compared to each of the 50 songs in the evaluation tracklist. The attributes selected by the attribute selection classifier as the most important were tempo, loudness, mode, and 6 of the 12 timbre elements, together with the similarity value. The predicted similarity ratings were distributed evenly, with the lowest ratings near 17 and the highest ratings near 98.
3.3 User evaluation
For the evaluation we gathered 514 user comparisons from 21 unique users. The song the algorithm determined as the most similar to the target song was picked by the users 365 times, giving the algorithm a success rate of 71%. In Appendix C - The Evaluation Tracklist, the prediction accuracy for each of the 50 target songs is listed. The histogram in figure 3.2 illustrates the distribution of the accuracy of the algorithm for the song pairs in the evaluation set.
Figure 3.2: Histogram detailing the distribution of accuracy for the song
pairs
Chapter 4
Discussion
The purpose of the study was to find out whether it is possible to automatically classify the similarity between songs with reasonable accuracy. While the model utilized in this study did not conclusively prove that it is possible to succeed in this endeavor, as a few songs received an accuracy below 50%, which cannot be considered reasonably accurate, the results indicate that it may be feasible in future work.
The data gathered from the user ratings suggest that users tend to share a common opinion on which pairs of songs they deem to be similar, indicating that perceived song similarity is not solely an individual notion.
Although this study strived to limit the tracks to the pop and rock genres, several tracks that could be considered neither pop nor rock were included. Because the genre tags were only associated with the artist, tracks which would not be considered actual songs, and which are therefore not relevant to an application for rating song similarity, were sometimes included. An example of this problem is a track which consists of an interview with a rock artist.
In comparison with the previous work of Aucouturier and Pachet [8], the algorithm of this study performed slightly worse, 71% compared to their 80% accuracy. However, they analyzed the similarity using only the timbre of the songs, unlike this study, which took many additional variables into consideration.
Finally, the hypothesis stated in the initial stages of this project proved to be mostly correct. While the algorithm still often left a lot to be desired when matching a target song against a supposedly similar song, it was quite efficient at finding songs which deviated a great deal from the target song. Only 12 of the 50 target songs had an accuracy of 50% or worse, which suggests that there were certain elements in these songs that caused them to be greatly mismatched. In fact, the number of songs that had an accuracy of 50% or below was smaller than the number of songs that had an accuracy of 90% or above.
Decreasing the variance by improving the accuracy of the worst performing
songs would greatly increase the overall precision of the algorithm, likely to
the point where it would have a respectable accuracy.
4.1 Confounding factors

4.1.1 Million Song Dataset
Most features included in the MSD are automatically extracted from audio provided by The Echo Nest. As such, many fields are approximated, which can compromise the accuracy of the data. An example of a bad approximation is a song pair consisting of the same song recorded on two different occasions, “Whitesnake - Here I Go Again (2008 Digital Remaster)” and “Whitesnake - Here I Go Again ‘87 (2007 Digital Remaster)”, which has a BPM ratio of 3 according to the MSD. Furthermore, the target song which had the worst accuracy in the evaluation phase was “Take Me Out - Franz Ferdinand”, which has a tempo of 210 BPM according to the data in the MSD. However, the general consensus among public sources [22, 23] and our peers is that the song has a BPM of 105. In fact, Rönnow and Twetman encountered similar issues [24] when evaluating genre classification.
This indicates that miscalculated BPM values in the MSD may cause the algorithm to perform poorly. Unfortunately there is no BPM confidence value which could allow the algorithm to place less weight on potentially erroneous values.
4.1.2 Feature usage
It is possible that additional features present in the MSD could be utilized to increase the accuracy of the algorithm. Additionally, the comparisons between the features of two songs may not be optimal in this study. For example, a possibly better way to compare the loudness of two songs, which was calculated as a ratio in this study, would be to calculate the absolute difference instead, since loudness is measured on a logarithmic dB scale.
4.1.3 User data
The user data analyzed in this study had a standard deviation of up to 19, which means that the perceived similarity may vary a great deal between different users. This spread could potentially corrupt the training set and therefore cause the algorithm to incorrectly classify the importance of certain features.
4.1.4 Learning tracklist
The learning tracklist used in this project was limited to only 10 songs, which seemed to be insufficient to cover the entire spectrum of the pop and rock genres. Songs which were significantly different from all the songs in the training set were often incorrectly classified. This occasionally resulted in pop ballads and grindcore metal songs being matched as similar songs in the initial stages of the study.
4.1.5 Machine learning algorithm
The machine learning algorithms used in this research were selected through a small set of empirical tests. Therefore, the chosen algorithm may not be the most effective one available. In addition, the parameters of this study's algorithms could possibly be further tuned to increase predictive accuracy.
Chapter 5
Conclusions and future work
The k-NN algorithm created in this study successfully distinguished between similar and dissimilar songs 71% of the time, with 28% of the evaluated song pairs receiving an accuracy of 90% or above. However, 24% of the pairs received an accuracy of 50% or below, which means that the algorithm cannot be considered accurate. On the other hand, the results achieved by this study are a strong indication that creating an algorithm that very accurately predicts song similarity is possible.
While the user-submitted similarity ratings had a rather high standard deviation, the average rating seemed to be a good indicator of perceived song similarity. Therefore, using user ratings to train an algorithm is likely a viable method when a large number of users is available.
5.1 Future work
If the factors that caused the predictions for a small subset of the songs to be greatly inaccurate were identified, the algorithm presented in this study could be much improved. Additionally, further improvements could be achieved by future studies if the confounding factors encountered in this research were addressed. This could be accomplished by expanding and improving upon the learning tracklist, gathering a larger quantity of user input and validating the data used from the MSD, especially for the learning phase. Furthermore, other compositions of machine learning algorithms may be more suitable than k-NN for predicting similarity.
Acknowledgments
We would like to give special thanks to the many Anders at the Department of Speech, Music and Hearing (TMH) at the Royal Institute of Technology, Stockholm, Sweden. To Anders Askenfelt for giving us a head start and providing valuable insight and feedback. To Anders Friberg and Anders Elowsson for their invaluable input regarding both machine learning and adapting the data in the Million Song Dataset, which aided us greatly in putting it to proper use.
Furthermore, we give our sincerest thanks to everyone who provided this project with valuable user data. Thanks to their diligence through the, at times, tedious task of rating song similarity, the project got a solid foundation to start from and a useful evaluation.
References
[1] Bertin-Mahieux, Thierry & Ellis, Daniel P.W. & Whitman, Brian & Lamere, Paul, The Million Song Dataset, LabROSA, Electrical Engineering Department, Columbia University, New York, USA & The Echo Nest, Somerville, USA, 2011. http://www.columbia.edu/~tb2332/Papers/ismir11.pdf (2013-04-10)

[2] The Echo Nest, the source of the data used in the Million Song Dataset. http://echonest.com/company/ (2013-04-11)

[3] Mitchell, Tom M., The Discipline of Machine Learning, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA, 2006. https://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf (2013-04-10)

[4] Nilsson, Nils J., Introduction to Machine Learning, Robotics Laboratory, Department of Computer Science, Stanford University, Stanford, USA, p. 1-5, 1998. http://robotics.stanford.edu/~nilsson/MLBOOK.pdf (2013-04-10)

[5] Mohri, Mehryar, Lecture on: Foundations of Machine Learning: Lecture 1, Courant Institute & Google Research, 2013. http://www.cs.nyu.edu/~mohri/mls/lecture_1.pdf (2013-04-10)

[6] Gentleman, R. & Huber, W. & Carey, V. J., Bioconductor Case Studies - Supervised Machine Learning, Springer Science+Business Media LLC, 2008, p. 121-123. http://link.springer.com/chapter/10.1007%2F978-0-387-77240-0_9 (2013-04-10)

[7] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka (2013-04-10)

[8] Aucouturier, Jean-Julien & Pachet, Francois, Music Similarity Measures: What's the Use?, SONY Computer Science Laboratory, Paris, France, 2002. http://web.cs.swarthmore.edu/~turnbull/cs97/f09/paper/Aucouturier02.pdf (2013-04-10)

[9] Allamanche, Eric & Herre, Jürgen & Hellmuth, Oliver & Fröba, Bernhard & Kastner, Thorsten & Cremer, Markus, Content-based Identification of Audio Material Using MPEG-7 Low Level Description, Fraunhofer Institute for Integrated Circuits, Erlangen, Germany, 2001. http://www.cs.brandeis.edu/~dilant/cs175/%5BAlexander-Friedlander%5D.pdf (2013-04-10)

[10] Cano, Pedro & Batlle, Eloi & Kalker, Ton & Haitsma, Jaap, A Review of Algorithms for Audio Fingerprinting, Universitat Pompeu Fabra, Barcelona, Spain & Philips Research Eindhoven, Eindhoven, The Netherlands, 2002. http://ucbmgm.googlecode.com/svn-history/r7/trunk/Documentos/Fingerprint-Cano.pdf (2013-04-10)

[11] Cano, Pedro & Koppenberger, Markus & Wack, Nicolas, Content-based Music Audio Recommendation, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, 2005. http://dl.acm.org/citation.cfm?id=1101181 (2013-04-10)

[12] Tzanetakis, George & Cook, Perry, Musical Genre Classification of Audio Signals, IEEE Transactions on Speech and Audio Processing, 2002. http://dspace.library.uvic.ca:8080/bitstream/handle/1828/1344/tsap02gtzan.pdf?sequence=1 (2013-04-10)

[13] Li, Tao & Ogihara, Mitsunori & Li, Qi, A Comparative Study on Content-Based Music Genre Classification, Computer Science Department, University of Rochester, Rochester, USA & Department of CIS, University of Delaware, Newark, USA, 2003. http://dl.acm.org/citation.cfm?id=860487&bnc=1 (2013-04-10)

[14] Soltau, Hagen & Schultz, Tanja & Westphal, Martin & Waibel, Alex, Recognition of Music Types, Interactive Systems Laboratories, University of Karlsruhe, Germany & Carnegie Mellon University, USA, 1998. http://www.ri.cmu.edu/pub_files/pub1/soltau_hagen_1998_2/soltau_hagen_1998_2.pdf (2013-04-10)

[15] Documentation for the Analyzer used to create the MSD. http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf (2013-04-10)

[16] Kotsiantis, S. B., Supervised Machine Learning: A Review of Classification Techniques, Department of Computer Science and Technology, University of Peloponnese, Greece, 2007. http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf (2013-04-10)

[17] Steele, Brian M., Exact bootstrap k-nearest neighbor learners, Springer Science+Business Media LLC, 2008. http://link.springer.com/content/pdf/10.1007%2Fs10994-008-5096-0 (2013-04-10)

[18] Singh, Aarti, Lecture on: Practical Issues in Machine Learning - Overfitting and Model Selection, Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA, 2010. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture8.pdf (2013-04-10)

[19] Guyon, Isabelle & Elisseeff, André, An Introduction to Variable and Feature Selection, Clopinet, Berkeley, USA & Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2003. http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf (2013-04-10)

[20] API utilized for track previews. http://developer.7digital.net/ (2013-04-10)

[21] MusicBrainz.org, a community-maintained open source encyclopedia of music information. http://musicbrainz.org/ (2013-04-11)

[22] A BPM database. http://songbpm.com/franz-ferdinand/take-me-out (2013-04-10)

[23] A BPM database. http://www.bpmdatabase.com/search.php?begin=0&num=1&numBegin=1&artist=franz+ferdinand&title=take+me+out (2013-04-11)

[24] Rönnow, Daniel & Twetman, Theodor, Automatic Genre Classification From Acoustic Features, Royal Institute of Technology, Stockholm, Sweden, 2012. http://www.csc.kth.se/utbildning/kth/kurser/DD143X/dkand12/Group7Anders/final/Ronnow_Twetman_grp7_final.pdf (2013-04-10)
Chapter 6
Appendix
6.1 Appendix A - Million Song Dataset Field List
Field name                  Type            Description
analysis sample rate        float           sample rate of the audio used
artist 7digitalid           int             ID from 7digital.com or -1
artist familiarity          float           algorithmic estimation
artist hotttnesss           float           algorithmic estimation
artist id                   string          Echo Nest ID
artist latitude             float           latitude
artist location             string          location name
artist longitude            float           longitude
artist mbid                 string          ID from musicbrainz.org
artist mbtags               array string    tags from musicbrainz.org
artist mbtags count         array int       tag counts for musicbrainz tags
artist name                 string          artist name
artist playmeid             int             ID from playme.com, or -1
artist terms                array string    Echo Nest tags
artist terms freq           array float     Echo Nest tags freqs
artist terms weight         array float     Echo Nest tags weight
audio md5                   string          audio hash code
bars confidence             array float     confidence measure
bars start                  array float     beginning of bars, usually on a beat
beats confidence            array float     confidence measure
beats start                 array float     result of beat tracking
danceability                float           algorithmic estimation
duration                    float           in seconds
end of fade in              float           seconds at the beginning of the song
energy                      float           energy from listener point of view
key                         int             key the song is in
key confidence              float           confidence measure
loudness                    float           overall loudness in dB
mode                        int             major or minor
mode confidence             float           confidence measure
release                     string          album name
release 7digitalid          int             ID from 7digital.com or -1
sections confidence         array float     confidence measure
sections start              array float     largest grouping in a song, e.g. verse
segments confidence         array float     confidence measure
segments loudness max       array float     max dB value
segments loudness time      array float     time of max dB value
segments loudness start     array float     dB value at onset
segments pitches            2D array float  chroma feature, one value per note
segments start              array float     musical events, note onsets
segments timbre             2D array float  texture features (MFCC+PCA-like)
similar artists             array string    Echo Nest artist IDs
song hotttnesss             float           algorithmic estimation
song id                     string          Echo Nest song ID
start of fade out           float           time in sec
tatums confidence           array float     confidence measure
tatums start                array float     smallest rhythmic element
tempo                       float           estimated tempo in BPM
time signature              int             estimate of number of beats per bar
time signature confidence   float           confidence measure
title                       string          song title
track id                    string          Echo Nest track ID
track 7digitalid            int             ID from 7digital.com or -1
year                        int             release year from MusicBrainz
6.2 Appendix B - Web Application

6.2.1 User rating application

6.2.2 User evaluation application
6.3 Appendix C - The Evaluation Tracklist

Track                              Artist                           Accuracy
Carrie                             Europe                           100%
Dr. Feelgood                       Mötley Crüe                      100%
Fast Car                           Tracy Chapman                    100%
Highway Star                       Deep Purple                      100%
It Takes A Fool To Remain Sane     The Ark                          100%
More Than A Feeling                Boston                           100%
This Love (Will be your downfall)  Ellie Goulding                   100%
White Flag                         Dido                             100%
Whenever, Wherever                 Shakira                          90.91%
My Immortal                        Evanescence                      90.91%
Shoreline                          Anna Ternheim                    90.91%
A New Day Has Come                 Céline Dion                      90%
A Thousand Miles                   Vanessa Carlton                  90%
About You Now                      Timo Räisänen                    90%
Black Velvet                       Alannah Myles                    88.89%
Destiny Calling                    Melody Club                      88.89%
Misery Business                    Paramore                         88.89%
The Downeaster “Alexa”             Billy Joel                       88.89%
Flux                               Bloc Party                       87.5%
Only You                           Joshua Radin                     83.33%
Here I Go Again                    Whitesnake                       83.33%
Tom’s Diner                        Suzanne Vega, DNA                81.82%
Wonderwall                         Oasis                            81.82%
Angels                             Within Temptation                81.82%
Crazy On You                       Heart                            80%
It’s My Life                       Bon Jovi                         76.92%
Erase / Rewind                     The Cardigans                    75%
The Trooper                        Iron Maiden                      75%
Africa                             Toto                             75%
Cats In The Cradle                 Ugly Kid Joe                     70%
Basket Case                        Green Day                        70%
4 In The Morning                   Gwen Stefani                     66.67%
Learning To Fly                    Tom Petty And The Heartbreakers  66.67%
Scarborough Fair/Canticle          Simon & Garfunkel                62.5%
Glory To The Brave                 Hammerfall                       60%
Bark At the Moon                   Ozzy Osbourne                    58.33%
I Want To Know What Love Is        Foreigner                        55.56%
Slow Dancing In A Burning Room     John Mayer                       55.56%
Good Riddance                      Green Day                        50%
Chariot                            Gavin DeGraw                     50%
The Unforgiven II                  Metallica                        44.44%
Unwritten                          Natasha Bedingfield              44.44%
18 And Life                        Skid Row                         36.36%
Hero                               Enrique Iglesias                 33.33%
Smoke                              Natalie Imbruglia                30%
You’ll Be In My Heart              Phil Collins                     27.27%
Wind Of Change                     Scorpions                        20%
Make Your Own Kind Of Music        Mama Cass                        20%
Rooftops                           lostprophets                     20%
Take Me Out                        Franz Ferdinand                  18.18%