Music Information Retrieval: a literature review
by Priscilla Jane Smith
- What is MIR?
- Why is MIR different and difficult?
- Stephen Downie's MIR Challenges
- Representations of Music
- What are the types of MIR systems?
- Human music-seeking activities
- MIR System Implications
- Current State of the Field
- The Future
- Works Cited
Music information retrieval (MIR) is a small subfield of traditional information retrieval (IR) that treats music itself, rather than text about music, as the object of retrieval. For the purposes of this literature review, the author selected several publications which are overviews of MIR as a field, and several publications having to do with user discovery of new music. "Musical Works and Information Retrieval" is by Richard Smiraglia, a notable information scientist and figure in the field of Knowledge Organization. Stephen Downie, author of "Music Information Retrieval," "A Sample of Music Information Retrieval Approaches," and "The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future," is a professor and associate dean at the iSchool at the University of Illinois at Urbana-Champaign. He is also an officer and founding member of ISMIR, the International Society for Music Information Retrieval. Information Retrieval for Music and Motion is a textbook covering IR concepts and techniques for music. The author, Meinard Müller, is a member of the Multimedia Signal Processing Group at Bonn University.
Researchers define MIR in slightly different ways. According to Richard Smiraglia, MIR is "the activity of automating the retrieval of musical works, or parts of musical works." This encompasses everything from the basic querying of bibliographic databases for information about a piece of music, to advanced music recommendation systems, to query-by-humming systems. Meinard Müller claims that MIR research teams are motivated by the challenges that music poses for IR systems, and that efforts are directed towards the development of technologies that allow users to "access and explore music in all of its many facets." Many researchers are drawn to the field because of an interest in the subject matter or because of its inherent challenges, what Stephen Downie calls the "intellectual need to overcome the myriad difficulties posed by the inherent complexity of music."
MIR is a very narrow specialty within IR, and it demands different approaches than other subjects in the field. One might ask: what makes MIR so difficult? There are many answers to this question. Before the rise of the internet and more technologically advanced systems, musical works--for the purposes of libraries--were organized using alphabetico-classified systems. In other words, they were described according to their physical characteristics. Traditionally, systems for bibliographic IR were designed with the physical document in mind. While text-based retrieval of music documents using the composer's name, an opus number, or lyrics can be handled using conventional IR techniques, this text-based approach is insufficient for retrieving music in all of its forms. Smiraglia makes the case that instead of conceptualizing music as a physical document--be it a score or a recording--the idea of a musical 'work' should be the "key entity" upon which MIR is based.
Smiraglia defines a musical work as an "intellectual sonic conception." It is an idea created by a composer, and all instances of that idea are just versions of the work. In terms of the Functional Requirements for Bibliographic Records, or FRBR, versions of a work are designated as 'manifestations.' This is significant for the purposes of MIR, according to Downie, because a work can be represented in any number of ways. Take, for example, Bach's familiar Minuet in G Major. This piece, as conceived of by Bach, is a 'work.' According to Smiraglia, any manuscripts, printed scores, recordings, or bibliographic representations would simply be instantiations of Bach's 'work.' We must consider, though, that music is more complicated than this. The melody of A Lover's Concerto, first recorded in 1965 by the American pop girl group The Toys, is taken directly from Bach's keyboard piece. Would these two pieces of music be considered the same 'work'? How would a MIR system categorize these two pieces, which simultaneously sound so similar and yet so different? The point to be made here is that notions of similarity in music are problematic. To add more complexity to this situation, a work performed by two different performers may sound very different. Because musical interpretation is highly subjective, the ways in which a performer may perform a piece are infinite. Making MIR systems recognize this concept is a very sophisticated problem. Smiraglia draws the conclusion that to be useful a MIR system needs to be able to "differentiate among instances in order to allow searchers to make the best possible choice among alternatives." Müller adds a further complication, stating that there are "various digital manifestations of a work, differing in format and content." Some examples of these would be MP3, CD, MIDI, and the like. The variety of data representations available for describing music makes MIR a serious challenge.
The overarching difficulty of MIR is that researchers need to have a fundamental understanding of music information. Downie has formulated an extensive list of what he calls the 'challenges of MIR.' These different challenges, or problems, include the multi-faceted, multicultural, multidisciplinary, multi-experiential, and multi-representational aspects of music.
In describing the first challenge, Downie states that music information is a "multifaceted amalgam" which includes the following elements: pitch, temporal, harmonic, timbral, editorial, textual, and bibliographic. The interaction between these facets creates a complex problem for the MIR community.
The pitch facet encompasses the qualities of music having to do with pitch (the perceived highness or lowness of a sound), the intervals between pitches, and key. The key of a piece of music is a designation of its harmonic center. Information concerning the duration of events within music is categorized within the temporal facet. This includes tempo, meter, pitch duration, and rubato. Temporal information can create access problems within MIR because it can be notated as absolute values (as in the case of strict metronome markings, e.g. MM=80 beats per minute), general values (as in the case of tempo markings like adagio, andante, or langsam), or relative values (as in langsamer, schneller, or stringendo). The harmonic facet has to do with what is known, in Western music, as polyphony. Polyphony, literally meaning "many voices," is a musical event in which two or more pitches align vertically in a piece of music, forming a chord.
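As a rough illustration of why the intervals between pitches matter for retrieval, the following minimal sketch compares a melody with a transposed copy of itself. It assumes pitches are encoded as semitone numbers, which is an assumption made for this example rather than an encoding described in the sources reviewed here. The function name and the eight-note melody are hypothetical.

```python
def intervals(pitches):
    """Return the semitone distance between each pair of consecutive pitches."""
    return [later - earlier for earlier, later in zip(pitches, pitches[1:])]


# An illustrative eight-note melody written as semitone numbers (assumed encoding).
melody = [74, 67, 69, 71, 72, 74, 67, 67]

# The same melody shifted up two semitones, i.e. played in a different key.
transposed = [pitch + 2 for pitch in melody]

# The absolute pitches differ, but the interval sequences are identical.
print(melody == transposed)
print(intervals(melody) == intervals(transposed))
```

The first comparison fails and the second succeeds, which suggests one reason a system might index intervals rather than absolute pitches: a listener who recalls a tune in the "wrong" key can still be matched.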
Timbre, the tone quality, color, or aural distinction of a pitch, is the subject matter of the timbral facet. This encompasses information about orchestration, or the instruments which are being played in a given piece of music. Also included are performance methods, or playing instructions for specific instruments, including pizzicati, mutings, pedalings, and so on. The editorial facet deals with performance instructions. These may include realizations of cadenzas, textual instructions like crescendo or diminuendo, or any other notations the composer has added to the score which are not part of the actual music. Editorial items may be in the form of icons (!, ^), or text (piano, forte), or both. A work may also lack editorial information entirely. Discrepancies of editorial information between different editions of the same work make the choice of an authoritative or definitive version of a work quite problematic. Another important point is that the border between the editorial and the timbral facet can be quite blurred. For example, the act of designating the instrumentation of a work (specifying that a work is for flute and violin) is encompassed within the editorial facet, but the aural effect of the sound of a flute and a violin playing that work is encompassed within the timbral facet.
The textual facet, as may seem evident, deals with lyrics in a work. This also includes libretti, which are texts used in extended works of music, such as operas, oratorios, cantatas, or musicals. In some instances, a fragment of a lyric in a work is enough to identify and retrieve a melody. The strong tradition of free interchange of lyrics between pieces of music in Western culture must be mentioned because it may impact the accuracy of retrieved works if there are multiple sets of lyrics or multiple melodies for one set of lyrics. The final facet of music is the bibliographic facet. Music metadata includes information concerning a piece's title, composer, editor, lyricist, publisher, and the like. The information included in this facet is what is traditionally used to describe music in retrieval systems. The bibliographic facet is the only facet which is not derived from the actual content of the work.
The next of Downie's challenges is called the multicultural challenge. Because music is an ancient and worldwide form of expression, there are countless ways that it has been represented throughout all cultures and historical eras. In MIR, however, there is currently a bias towards Western or 'common practice' music. Common practice music can be described as music of the world cultures whose history is strongly influenced by European immigration and settlement. According to Downie, there are three main reasons for this bias. First, there are many styles of music for which symbolic and audio representations are unavailable. Because it is simpler to build MIR systems based on easily accessible and manageable common practice music, these unavailable styles of music have been largely ignored by the MIR community. Second, the current MIR community members are most familiar with common practice music, and are more likely to use this in their research. Third, since common practice music has the largest worldwide audience, developers have taken advantage of this large potential user base.
The next challenge that Downie points out is the multi-experiential challenge. Downie states that the "perception, appreciation, and experience of music will vary not only across the multitudes of minds that apprehend it, but will also vary within each mind as the individual's mood, situation, and circumstances change." Simply put, every individual's preference for music is unique, and the everyday situations in which we experience music will affect the types of music we prefer to hear. Additionally, an individual's preferences in music change over time, and this development may be very difficult to track or predict. People seek out and listen to music for many reasons, not simply to experience pleasure. Music can be used to create a nostalgic feeling, re-invoking a past experience. It can also be used to reaffirm familiar traditions, like singing Christmas carols or standing to sing the National Anthem during baseball games in the United States. Music may be used as a means of religious expression as well. This wide variety of musical experiences can be problematic for MIR purposes, according to Downie. The nature of similarity and relevance can change based on the situations in which music is used or the environment of the user.
Many questions arise from the concept of the multi-experiential challenge. What music is similar to other music? How can MIR systems decide what is relevant to a user? How do we compare a user's reaction to different pieces? How does 'mood' factor into musical similarity and relevance? These are all questions that MIR researchers should take into consideration when building their systems.
Another important factor to mention when discussing the difficulties of MIR research is the topic of Downie's next 'challenge.' The multidisciplinary challenge states that judging the contributions of various MIR research projects is difficult because teams originate from various disciplines. This means that the evaluation methods used for MIR research are inconsistent. In fact, the number of published MIR papers drawing on formal IR techniques is very low. This is due to the interdisciplinary nature of the field and the lack of familiarity with IR literature among its members.
The final element of Downie's challenges of MIR is the multi-representational challenge. One of the distinguishing characteristics of music, and one of the reasons why MIR differs from other kinds of IR, is that all facets of music can be represented in multiple ways. Downie gives two categorizations for these: the symbolic representation and the audio representation. The symbolic representation takes the form of printed notes, scores, text, and the like. This representation tends to require few computational resources, so it is simpler to study and test. Audio representation, which includes live performances and analog and digital recordings, is more computationally costly. However, since the majority of the population of music users understand music only in its audio format, this representation is necessary for widespread MIR purposes. The audio representation of music raises serious intellectual property issues for MIR researchers. Should researchers be required to pay for the use of copyrighted recordings in order to test their systems? Should they be required to use music recordings which are in the public domain? In order for MIR systems to be accurate, they must be tested on real, current music. Restricting access to this music would prevent MIR researchers from making progress in the field.
The representational completeness of a work, according to Downie, is the number of music information facets included in its 'full' representation. In contrast, representational incompleteness means that some facets are missing from the representation. Counterintuitively, this incompleteness can actually make MIR systems more effective. Representational completeness is not a requirement for an MIR system, and most (if not all) current systems use an incomplete model. Because many individuals seeking music are musically naive, giving few representational choices allows for less opportunity for user errors. In MIR, there has typically been a bias towards using the pitch, textual, and bibliographic facets. This may occur because they are the most 'memorable' features of musical works, according to Downie.
Meinard Müller takes Downie's two music representations (audio and symbolic) and breaks them down further. He claims that there are three music representations: score format (similar to Downie's symbolic), audio format, and MIDI format. Müller claims that the MIDI format is a hybrid of the score and the audio representations. Musical Instrument Digital Interface, or MIDI, is an electronic music instrument specification that allows digital musical instruments made by different manufacturers to work and play together. MIDI does not represent musical sounds specifically, but represents event messages about pitch, velocity, volume, and the like. One major problem with MIDI is its inability to encode timbral information. Additionally, it cannot distinguish between enharmonics and does not represent rests, or periods of time in which no music is being played by a particular instrument.
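The enharmonic limitation can be made concrete with a small sketch. The helper below is hypothetical and is not part of any MIDI library. It assumes the common convention that middle C (written C4) corresponds to MIDI note number 60; some software labels octaves differently.

```python
# Semitone offsets of the natural note letters within one octave.
NATURAL_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
# '#' raises a note by one semitone, 'b' lowers it by one semitone.
ACCIDENTALS = {"#": 1, "b": -1}


def note_to_midi(name):
    """Convert a spelled pitch such as 'C#4' or 'Db4' to a MIDI note number.

    Assumes middle C is spelled 'C4' and maps to note number 60.
    """
    letter, rest = name[0], name[1:]
    shift = 0
    # Consume any accidentals that follow the note letter.
    while rest and rest[0] in ACCIDENTALS:
        shift += ACCIDENTALS[rest[0]]
        rest = rest[1:]
    octave = int(rest)
    return 12 * (octave + 1) + NATURAL_OFFSETS[letter] + shift


# Two different spellings collapse to the same note number.
print(note_to_midi("C#4"), note_to_midi("Db4"))
```

Both spellings produce the same number, so once a score has been converted to note numbers the original spelling cannot be recovered from the number alone.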
As there are multiple ways to represent music, there are also multiple types of MIR systems. Different authors categorize these systems in different ways. Richard Smiraglia has built a concept map of MIR, based on the papers submitted to the first and second meetings of ISMIR, which can be seen in Figure 1. He breaks the aspects of MIR down into ten high-level categories: score processing, audio-information retrieval, classification, metadata, queries, recognition of parts of music, automatic transcription, digital libraries, intellectual property, and systems design. He states that in MIR "score processing predominates...aural queries will retrieve musical works, classifications and metadata describe musical works; intellectual property inheres in musical works, and...digital libraries and systems designers collaborate to produce virtual collections of musical works." As the field has matured, its sub-categories have been refined and reorganized.
Stephen Downie breaks MIR systems down into two categories: analytic/production systems and locating systems. Downie defines analytic/production systems as having more complete representational information on music in a database. These systems can be thought of as music research toolkits. Each of these computer tools is designed to address one of the many processes of an MIR system. These systems are designed for use by musicologists, sound engineers, and other highly skilled professionals who have specific information needs. Some examples of this type of system given by Downie are Humdrum Toolkit, Themefinder, and MAPPET. Because these systems use a more complete musical representation, they cater more towards sophisticated users who "need analytical power over syntactic simplicity."
Downie's locating MIR systems are designed to "assist in the identification, location and retrieval of musical works." The intended users of these systems range from the musical novice to the expert musicologist. Usually, users of these systems wish to make use of the music retrieved, for performance or listening. Examples of this type of system include Online Public Access Catalogs (OPACs), Web search engines, Shazam, RISM, SoundHound, and Meldex. Some examples of queries to these types of systems are:
- Return a list of all compositions by a given composer
- Return a list of all recordings of a given performer
- Identify a song title given a line in that song
- Given a melody, identify the musical work of origin
Some more sophisticated queries are:
- What compositions "sound like" or are in the same style as this piece?
- What compositions will induce happiness in a user?
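The query "given a melody, identify the musical work of origin" can be sketched in a few lines. This is a toy example under stated assumptions, not the method used by any system named above: melodies are assumed to be available as lists of semitone numbers, matching is exact, and the catalog, titles, and function names are all hypothetical.

```python
def intervals(pitches):
    """Return the semitone steps between consecutive pitches."""
    return [later - earlier for earlier, later in zip(pitches, pitches[1:])]


def contains(sequence, pattern):
    """Return True if pattern occurs as a contiguous run inside sequence."""
    size = len(pattern)
    return any(
        sequence[start:start + size] == pattern
        for start in range(len(sequence) - size + 1)
    )


def find_works(query_pitches, catalog):
    """Return titles whose melody contains the query's interval pattern."""
    pattern = intervals(query_pitches)
    return [
        title
        for title, pitches in catalog.items()
        if contains(intervals(pitches), pattern)
    ]


# A hypothetical two-item catalog of melodies stored as semitone numbers.
catalog = {
    "Work A": [74, 67, 69, 71, 72, 74, 67, 67],
    "Work B": [60, 60, 67, 67, 69, 69, 67],
}

# A four-note fragment recalled in a different key from the stored melody.
query = [62, 64, 66, 67]
print(find_works(query, catalog))
```

Exact matching of this kind would fail on a sung or hummed query containing wrong notes, so a practical system would need some form of approximate matching. The more sophisticated queries listed above, such as finding music that "sounds like" a given piece, have no comparably simple formulation.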
Although much of the research in the field of MIR is designed to help users find the music they desire, little research is focused on user behavior in real life settings. In their article "The Utilitarian and Hedonic Outcomes of Music Information-Seeking in Everyday Life," Laplante and Downie claim that "most studies aiming at evaluating the performance of MIR systems have adopted a quantitative approach and focused exclusively on external behavior." For this reason, the tasks that participants are assigned may not be representative of MIR tasks people do on their own. This results in a "lack of knowledge of MIR behavior in context." For this literature review two studies on this subject were examined. The first, by Laplante and Downie, was conducted in 2006-2007 on young adults in Montreal and used an interview study. The second, by Cunningham et al., entitled "Finding New Music: A Diary Study of Everyday Encounters with Novel Songs," was conducted in 2006-2007 on young adults in Hamilton, New Zealand and used a diary study model.
The main goal of the study by Laplante and Downie was to examine the complete music information-seeking experience from the user perspective. They mentioned several specific questions: What makes music information seeking a satisfying experience? What are the utilitarian (i.e. useful) and hedonic (i.e. pleasurable) outcomes that contribute to making a music information-seeking experience satisfying? What makes a music information-seeking experience unsatisfying? The authors used an interviewing approach to question fifteen individuals, aged 18-29, about their "perceptions of real life music information-seeking experiences with various information systems," including digital libraries, existing IR systems, libraries, music stores and other sources of music information, such as other people and media. The interviews were divided into five sections, described as follows. In the first section, the participants were asked about their musical tastes and attitudes towards music. Second, the participants were asked to recall the last music artist or genre they had discovered and liked and to recall how it happened. Third, they were asked what music information sources were used and how they interacted with them. Fourth, they were asked about the hedonic and utilitarian outcomes that made the music information-seeking activity satisfying or not. Fifth, they gathered background information about participants.
Cunningham et al. used a very different approach in their study. During a three-day period, participants selected from an undergraduate Human-Computer Interaction class were to fill out a diary detailing everyday experiences they had with music and the thoughts and feelings that accompanied them. Cunningham et al. state that this approach may be more effective because it provides a record of events as they occur, rather than retrospectively. Participants in this study recorded each incident in which they encountered unfamiliar music.
Laplante and Downie found that participants considered both hedonic and utilitarian outcomes 'satisfying.' Positive utilitarian outcomes included the acquisition of music and of information about that music. Participants were pleased with their discovery of information about music for many reasons, including increased cultural knowledge, an enriched listening experience, and knowledge gained in order to discover newer, better music in the future. Hedonic outcomes included pleasure and a feeling of engagement.
Cunningham et al. report a number of findings about how new music is encountered in everyday life. Encounters were reported throughout the day and were sometimes linked to availability of leisure time. Background music allowed the opportunity for music discovery during work or study. Some of the most popular locations for discovery were private residences, en route from location to location, and clubs or retail environments. As can be expected, participants discovered music through a wide range of web and computer-related sources, on the radio, and in DVDs and movies. Additionally, Cunningham et al. distinguish between discovery that was planned (active) and discovery that was encountered by chance (passive). Live performance discovery was both active and passive, and the discovery of music in retail environments was largely passive.
Laplante and Downie make the observation that many current IR systems assume or require that users can quantify their information needs. Their research, however, finds that "users searching for music for everyday life purposes are often motivated by a vague or ill-defined need." They see browsing as a specific requirement for future MIR systems: "systems should be designed to allow users to navigate among music items using a variety of facets and techniques so that those who have no specific need can browse the collection without having to enter an initial query into the system." Cunningham et al. examine the current MIR approaches from the perspective of 'active' and 'passive' music discovery, stating that although active music discovery is well-supported, passive encounters with new music are not. They suggest that the use of a mobile platform lends itself to capturing music encountered passively. The authors propose a new technique for these purposes called the 'laid back' searcher. This proposed tool would be a member of Downie's locating MIR system category. This technique would allow users who are not online to record web queries in the moment, on their mobile devices. The interface provided would then allow users to browse search results once a network connection is re-established. The tool would also allow for recording of snippets of new music as it is heard by the users in real time. As with the web queries, these snippets would be analyzed by an audio fingerprinting tool once the device is back online, and users would be able to browse results for both audio and symbolic MIR queries. Although they do not propose a new tool in their paper, Laplante and Downie make suggestions for the development of future MIR systems.
Because users correlate the utilitarian and hedonic success of music information retrieval with how satisfying it is, they state, future MIR systems should be designed in a way that will be enjoyable so that they will capture and maintain the attention of the user.
Both articles mention the importance of social networking for the purpose of MIR. Laplante and Downie mention increased cultural knowledge as a positive utilitarian outcome of music information-seeking activities. Given the use of music preference and knowledge as a social badge, they identify a strong need to increase the social networking aspects of MIR systems. Cunningham et al. suggest that their tool could monitor what the users' friends were listening to, and that users could share or recommend music to each other. Finally, they recommend that a future MIR system allow users to pass from active to passive participation (or vice versa) in music searching, which would increase the flexibility of the tool.
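The deferred behavior of the 'laid back' search tool proposed by Cunningham et al. can be outlined as a simple queue. The sketch below is only one possible reading of the idea under stated assumptions: the class, its method names, and the stand-in search function are hypothetical and do not come from their paper.

```python
from collections import deque


class DeferredSearchQueue:
    """Hold queries recorded offline and run them once a connection exists."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # called once per query when online
        self._pending = deque()      # queries recorded but not yet run
        self.results = {}            # query -> results, for later browsing

    def record(self, query):
        """Store a query in the moment, whether or not the device is online."""
        self._pending.append(query)

    def sync(self, online):
        """Run all pending queries if online; return how many were processed."""
        if not online:
            return 0
        processed = 0
        while self._pending:
            query = self._pending.popleft()
            self.results[query] = self._search_fn(query)
            processed += 1
        return processed


def fake_search(query):
    """Stand-in for a real search or audio fingerprinting service."""
    return [f"result for {query}"]


queue = DeferredSearchQueue(fake_search)
queue.record("song heard in a shop")
print(queue.sync(online=False))  # nothing is processed while offline
print(queue.sync(online=True))   # the stored query runs once back online
print(queue.results)
```

Recorded audio snippets could be handled the same way by passing a different function as `search_fn`, which keeps the queuing logic independent of the kind of query being deferred.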
Prior to 2000, no society existed for researchers in the field of MIR. Since 2000, the International Society for Music Information Retrieval (ISMIR) has met annually as a forum for research in the field. Several years later, researchers in the field desired a way to scientifically compare and contrast the many proposed MIR systems. This led to the development of the Music Information Retrieval Evaluation eXchange (MIREX), an annual evaluation campaign for MIR algorithms, and its evaluation lab, the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL). IMIRSEL is modeled partially on the TREC model, and is a lab established to provide teams with access to a unique collection of music materials in different formats. These collections, over 30,000 digital audio recordings and music metadata, are taken from Naxos Music Library and All Media Guide.
In their research, Richard Smiraglia and Stephen Downie comment on the future of MIR. In the concluding sections of his paper, Smiraglia reiterates his point that the musical 'work' stands as the linking point between all instantiations of that work, and this needs to be considered in all MIR systems. In this digital age, the opportunity for mutation of musical works is rampant. The sampling of other artists' works and 'mash-up' compositions (made up entirely of snippets from other artists' works) are only a few examples of this fact. Smiraglia states that since displays and searches for music will become more complex, an adequately ordered structure for these works must be employed in order to avoid confusion or mistakes.
Stephen Downie looks to the future of MIR by conceptualizing future systems. He states that these future systems will have many social and commercial implications, including the potential to generate vast revenue, allowing access to the large amount of underused music currently existing in the world's libraries and on the Web. These systems also have the potential to benefit musicians, scholars, students, and the general public who consume music.
Although MIR researchers originate from a wide range of fields and backgrounds, they all share an interest in music and the retrieval of music information. Two of the common, far-reaching goals of MIR researchers appear to be the creation of a comprehensive, Google-like MIR system and the development of computer-generated music. These two types of systems do not currently exist in any practical fashion.
MIR has the potential to play an even more important role in our daily lives than it already does. Recommendation systems such as iTunes Genius, Grooveshark, Pandora, and Spotify already play a large part in people's everyday lives. As a shift from the purchase of hard copies of music to streaming music live on the Web occurs, Web-based MIR systems have an opportunity to gain a vast market share in the music industry. Additionally, more people are using mobile devices that have music applications built in or downloadable. Through the Web and mobile device technologies, almost every individual in the world has access to MIR systems. MIR developers must take advantage of this, and build fast and accurate systems that will allow all types of users--from the skilled musician to the musically naive child--to meet their music information needs.
 For a more detailed description of these musical terms, please refer to Gauldin, Robert. Harmonic Practice in Tonal Music. London: W. W. Norton & Co., 1997. Print.
 In modern music notation, an enharmonic is a note which is equivalent to some other note, interval, or key signature but is "spelled" or named differently.
Cunningham, Sally Jo, David Bainbridge, and Dana McKay. "Finding New Music: A Diary Study of Everyday Encounters with Novel Songs." ISMIR 2007: Proceedings of the 8th International Conference on Music Information Retrieval (2007): 83-88. Print.
Downie, J. Stephen. "Music Information Retrieval." Annual Review of Information Science and Technology 37.1 (2005): 295-340. Wiley Online Library. Web. 28 Jan. 2012.
Downie, J. Stephen. "A Sample of Music Information Retrieval Approaches." Journal of the American Society for Information Science and Technology 55.12 (2004a): 1033-1036. Wiley Online Library. Web. 28 Jan. 2012.
Downie, J. Stephen. "The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future." Computer Music Journal 28.2 (2004b): 12-23. JSTOR. Web. 28 Jan. 2012.
Gauldin, Robert. Harmonic Practice in Tonal Music. London: W. W. Norton & Co., 1997. Print.
Laplante, Audrey, and J. Stephen Downie. "The Utilitarian and Hedonic Outcomes of Music Information-seeking in Everyday Life." Library & Information Science Research 33.3 (2011): 202-210. Wiley Online Library. Web. 28 Jan. 2012.
Müller, Meinard. "Part 1: Analysis and Retrieval Techniques for Music Data." Information Retrieval for Music and Motion. Ed. Michael Clausen. Berlin: Springer, 2007. 17-180. Print.
Smiraglia, Richard P. "Musical Works and Information Retrieval." Notes 58.4 (2002): 747-764. JSTOR. Web. 28 Jan. 2012.