
Using Audio Fingerprinting to Identify Music

As an early pioneer in CD recognition, Gracenote today is probably best known for its MusicID software which can automatically and rapidly identify songs and return metadata such as artist name, track name, and album cover art. Running in hundreds of millions of car infotainment systems, sound systems, laptops, smartphones, and other devices throughout the world, MusicID resolves over 20 billion queries every month using its 200 million+ track reference database. To identify music, MusicID uses audio fingerprints — compact and unique digital song identifiers. Even with static, noise, and other audio interference, fingerprints allow for fast and accurate music recognition.

Here is how Gracenote’s audio fingerprinting system works:

  1. Audio fingerprints are computed from millions of known songs and stored in a reference database, along with song metadata.
  2. A fingerprint is computed from an unknown query and compared against the reference database to identify a match and return its corresponding metadata.
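
To make these two steps concrete, here is a minimal sketch in Python of a generic fingerprint index and lookup. The “spectral peak pair” hash used here is a deliberately crude stand-in fingerprint; it illustrates the pipeline only and does not reflect MusicID’s actual fingerprints or matching logic.

```python
# Toy fingerprinting pipeline: step 1 builds a hash index over known
# tracks, step 2 matches a query by voting on consistent time offsets.
# Illustrative only -- not Gracenote's MusicID.
import numpy as np

def toy_fingerprints(audio, frame=4096, hop=2048):
    """One integer hash per frame: the strongest FFT bin, paired with
    the next frame's strongest bin for a little temporal context."""
    n = (len(audio) - frame) // hop
    peaks = [int(np.argmax(np.abs(np.fft.rfft(audio[i*hop:i*hop+frame]))))
             for i in range(n)]
    return [p1 * 100000 + p2 for p1, p2 in zip(peaks, peaks[1:])]

def build_reference_db(tracks):
    """Step 1: index fingerprints of known songs.
    tracks: dict of track_id -> audio samples (1-D numpy array)."""
    db = {}
    for track_id, audio in tracks.items():
        for pos, h in enumerate(toy_fingerprints(audio)):
            db.setdefault(h, []).append((track_id, pos))
    return db

def identify(query, db):
    """Step 2: match a query by voting for a (track, time-offset) pair."""
    votes = {}
    for qpos, h in enumerate(toy_fingerprints(query)):
        for track_id, pos in db.get(h, []):
            key = (track_id, pos - qpos)  # consistent alignment
            votes[key] = votes.get(key, 0) + 1
    return max(votes, key=votes.get)[0] if votes else None
```

The key idea is that each matching hash votes for a (track, time-offset) pair: a genuine match accumulates many votes at one consistent offset, while chance hash collisions scatter across offsets.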

While MusicID is based on the well-known Philips algorithm (one of the earliest audio fingerprinting systems), a more efficient system dubbed StreamFP is currently in development.
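
For context, the heart of the Philips (Haitsma and Kalkman) algorithm can be sketched compactly: audio is downsampled to 5 kHz, cut into heavily overlapping 0.37-second frames, and each frame yields a 32-bit sub-fingerprint whose bits are the signs of energy differences across 33 log-spaced frequency bands and consecutive frames. The following is a simplified reading of the published algorithm, not Gracenote’s production code.

```python
# Philips-style sub-fingerprints: 32 bits per frame from the signs of
# band-energy differences across 33 log-spaced bands (300-2000 Hz) and
# consecutive frames. Parameters follow the original paper.
import numpy as np

def philips_subfingerprints(audio, sr=5000, frame_s=0.37, overlap=31/32):
    frame = int(sr * frame_s)
    hop = max(1, int(frame * (1 - overlap)))
    window = np.hanning(frame)
    edges = np.geomspace(300, 2000, 34)        # 33 log-spaced bands
    bins = np.fft.rfftfreq(frame, 1 / sr)
    band_of = np.digitize(bins, edges) - 1     # map FFT bin -> band index
    energies = []
    for start in range(0, len(audio) - frame + 1, hop):
        spec = np.abs(np.fft.rfft(audio[start:start + frame] * window)) ** 2
        e = np.zeros(33)
        valid = (band_of >= 0) & (band_of < 33)
        np.add.at(e, band_of[valid], spec[valid])  # sum energy per band
        energies.append(e)
    energies = np.array(energies)
    # Bit(n, m) = sign of (E[n,m] - E[n,m+1]) - (E[n-1,m] - E[n-1,m+1])
    band_diff = -np.diff(energies, axis=1)     # E[:, m] - E[:, m+1]
    bits = np.diff(band_diff, axis=0) > 0      # difference over time
    return [int("".join("1" if b else "0" for b in row), 2) for row in bits]
```

In the full Philips scheme, a query is matched by comparing blocks of 256 consecutive sub-fingerprints against the database using the bit error rate, which is what makes the system robust to static and noise.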


Figure 1: Overview of an audio fingerprinting system.

Identifying Live Music

While MusicID and StreamFP are fast and accurate audio fingerprinting systems, they can only identify known recordings and will not work with alternate versions such as live song recordings. Even from a known artist, a live performance typically exhibits audio variations such as changes in key (e.g., the artist can no longer sing as high as they used to), tempo (e.g., the band plays faster than usual), or instrumentation (e.g., an acoustic guitar replaces an electric one).

To solve this problem, the music team within the Applied Research group at Gracenote developed a new recognition system which not only compensates for audio interference but can also handle audio variations such as the ones described above. Dubbed LiveID, this new recognition system was initially proposed for a scenario in which a user attending a known artist’s live performance wanted to quickly identify a song using a smartphone.

The sample, in this case, would be compared against the artist’s existing recordings stored in a database, similar to how a traditional audio fingerprinting system works. Early tests on live queries extracted from live albums and smartphone videos showed that the system can achieve high accuracy, even in the presence of large tempo variations (e.g., up to 20% for Bonobo in live album queries) and key variations (e.g., up to 5 semitones for Foreigner). Poor results are typically due to considerable audio variations (e.g., Jefferson Airplane’s extensive improvisations) or audio interference (e.g., heavy noise in the Suprême NTM queries from smartphone videos). For more details about the system and its evaluation, I refer the reader to the following article:

Zafar Rafii, Bob Coover, and Jinyu Han, “An Audio Fingerprinting System for Live Version Identification using Image Processing Techniques,” in 39th IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, May 4-9, 2014.
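
The paper describes binarizing a constant-Q spectrogram with an adaptive threshold, turning each recording into a binary image in which a key change becomes a simple vertical shift. A rough sketch of that idea follows, assuming librosa for the constant-Q transform and SciPy for the local median; the parameters, helper names, and the simple shift-search matcher are illustrative, not LiveID’s actual implementation.

```python
# Binary constant-Q fingerprint via adaptive (local median) thresholding,
# matched with a search over vertical shifts to tolerate key changes.
import numpy as np
import librosa
from scipy.ndimage import median_filter

def liveid_style_fingerprint(audio, sr=22050, bins_per_octave=12, n_octaves=6):
    cqt = np.abs(librosa.cqt(audio, sr=sr, hop_length=512,
                             n_bins=bins_per_octave * n_octaves,
                             bins_per_octave=bins_per_octave))
    # A cell is "on" if it is louder than the median of its local
    # time-frequency neighborhood (adaptive thresholding).
    local_median = median_filter(cqt, size=(bins_per_octave + 1, 25))
    return cqt > local_median                  # binary image

def match_score(query_fp, ref_fp, max_shift=5):
    """Best Hamming similarity over +/- max_shift semitone shifts.
    (np.roll wraps around at the edges; acceptable for a sketch.)"""
    frames = min(query_fp.shape[1], ref_fp.shape[1])
    q, r = query_fp[:, :frames], ref_fp[:, :frames]
    best = 0.0
    for shift in range(-max_shift, max_shift + 1):
        best = max(best, float(np.mean(np.roll(q, shift, axis=0) == r)))
    return best
```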

| artist | live albums (6 s) | live albums (9 s) | smartphone videos (6 s) | smartphone videos (9 s) |
| --- | --- | --- | --- | --- |
| AC/DC | 0.82 | 0.92 | 0.65 | 0.70 |
| Arcade Fire | 0.70 | 0.84 | 0.75 | 0.79 |
| Bonobo | 0.75 | 0.83 | 0.49 | 0.60 |
| Eagles | 0.88 | 0.93 | 0.62 | 0.70 |
| Foreigner | 0.71 | 0.88 | 0.50 | 0.68 |
| Jefferson Airplane | 0.60 | 0.60 | 0.23 | 0.40 |
| Led Zeppelin | 0.61 | 0.74 | 0.24 | 0.28 |
| Phoenix | 0.84 | 0.88 | 0.57 | 0.67 |
| Portishead | 0.78 | 0.92 | 0.64 | 0.80 |
| Suprême NTM | 0.84 | 0.87 | 0.23 | 0.30 |
| all | 0.77 | 0.86 | 0.51 | 0.61 |

Table 1: Top-1 match accuracy for 6- and 9-second queries taken from live albums and from smartphone videos.

New Developments in Recognition

While LiveID was originally developed to rapidly identify short, noisy song excerpts captured on a smartphone at a known artist’s live performance, the system can also be used to identify full recordings of live or cover versions of songs from sources such as YouTube or SoundCloud. The ability to monitor these sources is particularly useful to artists and labels for rights management.

This “cover music identification system” uses the LiveID algorithm to compute audio fingerprints from successive fixed-duration segments of a full recording (e.g., a cover downloaded from YouTube) and compares them against a given artist’s reference database to identify the song (or songs) being played. A post-processing step then removes unlikely candidates (for example, isolated or inconsistent matches), resulting in more accurate identification.
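
As a hedged illustration of that post-processing idea, the sketch below keeps only matches that recur over several consecutive segments and discards isolated hits; the minimum-run rule and function name are hypothetical, not Gracenote’s actual logic.

```python
# Drop per-segment matches that do not recur in neighboring segments,
# since an isolated hit is likely spurious.
def clean_segment_matches(matches, min_run=3):
    """matches: list of track ids (or None), one per segment, in order."""
    cleaned = [None] * len(matches)
    i = 0
    while i < len(matches):
        j = i
        while j < len(matches) and matches[j] == matches[i]:
            j += 1                       # extend the run of identical matches
        if matches[i] is not None and j - i >= min_run:
            cleaned[i:j] = matches[i:j]  # keep only sufficiently long runs
        i = j
    return cleaned

# Example: an isolated "song_b" hit between "song_a" runs is discarded.
segments = ["song_a"] * 4 + ["song_b"] + ["song_a"] * 3 + [None, "song_c"]
print(clean_segment_matches(segments))
```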

Tests on audio recordings extracted from YouTube videos showed that Gracenote’s system can accurately identify live and cover versions by artists as diverse as Eminem, Katy Perry, Maroon 5, and Taylor Swift, even when the live recording is of poor quality (e.g., a cellphone capture) or the cover departs noticeably from the original (e.g., an acoustic rendition). Because the system produces an identification for every segment, it can also identify multiple references within the same recording (e.g., a full concert). Additionally, a separate cover music recognition system based on the same LiveID audio fingerprint was developed and tested, showing state-of-the-art results on a recent cover song dataset. For more on this topic, I refer the reader to this soon-to-be-published article:

Prem Seetharaman and Zafar Rafii, “Cover Song Identification with 2D Fourier Transform Sequences,” in 42nd IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, March 5-9, 2017.
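
The core idea in that paper is to take overlapping patches of a constant-Q spectrogram and keep only the magnitude of each patch’s 2D Fourier transform; since the 2D Fourier magnitude is invariant to translations, a key change (a vertical shift) or a small timing offset (a horizontal shift) barely alters the fingerprint. A minimal sketch, again assuming librosa, with illustrative parameters and a deliberately simple aligned-patch comparison:

```python
# 2D Fourier transform sequences: the magnitude of the 2D FFT of each
# constant-Q patch is invariant to shifts in time and log-frequency.
import numpy as np
import librosa

def twodft_sequence(audio, sr=22050, patch_frames=64, hop_frames=32):
    cqt = np.abs(librosa.cqt(audio, sr=sr, hop_length=512,
                             n_bins=84, bins_per_octave=12))
    patches = []
    for start in range(0, cqt.shape[1] - patch_frames + 1, hop_frames):
        patch = cqt[:, start:start + patch_frames]
        patches.append(np.abs(np.fft.fft2(patch)))  # translation-invariant
    return patches

def sequence_distance(seq_a, seq_b):
    """Mean cosine distance between aligned patches (a simplification)."""
    n = min(len(seq_a), len(seq_b))
    dists = [1 - np.dot(a.ravel(), b.ravel()) /
             (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
             for a, b in zip(seq_a[:n], seq_b[:n])]
    return float(np.mean(dists))
```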


Figure 2: LiveID demo shown at CES 2017. A YouTube search for music videos related to Taylor Swift returned the punk cover shown above. The system analyzed the corresponding audio track by fingerprinting successive segments, comparing them against the reference database, and returning a match to a reference song for any segment in which it had a high confidence level. In this case, the system successfully identified most of the segments as coming from the song “I Knew You Were Trouble.”

At Gracenote, we thrive on solving big challenges involving digital media by developing new technology- and data-based solutions. LiveID, which did not exist at this time last year, is just the latest example of an algorithm we’ve created with immediate practical applications. If you have anything to say on this topic, sound off below. Otherwise, keep your eyes on this blog for more from our tech team.

by Zafar Rafii | February 23, 2017
