Update: All code for this article is free and open source. It can be downloaded at https://tandonp.wordpress.com/2010/08/02/intelligent-guitar-software-suite-now-open-source/
Melody Recognition Software – Towards An Advanced Improv System
Imagine a computational music system that allows a human guitar player to jam with their computer. As the guitar player plays, the software would generate accompaniment for their playing. The computer agent and guitar player’s produced sounds could be piped to a central speaker which would allow the overall “jam session” to be heard.
There are several technical subproblems to be solved in constructing such a software system. The system must accomplish both Melody Recognition (figuring out what the human player is playing) and Accompaniment Generation (determining what music must be generated to accompany the human player). There is a lot of complexity here, as each subproblem can be broken down into further subproblems, and many of these sub-subproblems have their own research community, literature base, and associated conference.
The creation of a Melodic Classification component is itself a fascinating endeavor. A melodic classification component would take as input a recording of a human guitar player playing a melody, along with a known library of melodies. It would output the label of the melody in the database that is the “closest match” to what was played. The melodic classification problem, then, is to determine which melody a person has played, given a known library of melodies. This is the first major piece of functionality I hope to provide with my system.
An associated problem is populating the melody database that classification runs against. There are two feasible ways to do this: a human could prerecord melodies for the database, or the database could be generated from some other source of musical data.
One dominant data source is guitar tablature. For those inexperienced with the magic of guitar tabs, guitar tablature is a common way of writing down guitar notes; it’s essentially the sheet-music equivalent for guitar that guitar players like to use. There are giant databases of guitar tabs online at sites such as http://www.ultimate-guitar.com/ or guitartabs.net, where users create their own tabs and post them. One can find free tabs for virtually any major guitar composition. Guitar Pro is popular software for writing tabs, and its *.GTP format is widely used.

The second key functionality that I hope to demonstrate is the automated discovery and extraction of key melodies in a Guitar Pro tab file.
I. Melody Classification Approach
Let’s start with the solution to problem 1 – recognizing a melody that a guitar player has played.
A melody, fundamentally, is a sequence of notes and rests, where notes and rests have variable durations. A simplifying assumption is that the melody is monophonic (composed of single notes rather than chords). This isn’t a gigantic assumption, as many melodies are monophonic. Polyphonic music is much harder to process up front, but even so, I believe my system would still work on polyphonic music even though polyphonic support isn’t built in.
My approach is to record an audio clip, use various signal processing techniques to extract the sequence of played notes from the clip, and then apply pattern-matching algorithms to compare the result with melodies in the melody database.
A. Signal Processing Methods
The base-level components are Note Transcribers – components that take in a sound recording and extract the fundamental pitches or notes from it. My implementation consists of four transcribers based on four different signal processing techniques.
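Before walking through the four techniques below, here is a rough sketch of the common shape such a transcriber component could take. This is purely illustrative; the interface and type names are my own invention for this post, not the actual classes in my code.

```java
import java.util.List;

// Illustrative only: a shared interface lets a later fusion stage treat all
// four transcribers uniformly.
public interface NoteTranscriber {

    /** One transcribed note: MIDI pitch plus onset time and duration in seconds. */
    record NoteEvent(int midiPitch, double onsetSec, double durationSec) {}

    /** Extracts note events from a mono audio buffer sampled at sampleRateHz. */
    List<NoteEvent> transcribe(double[] samples, double sampleRateHz);
}
```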
1. Simple FFT Transcriber with band-pass filtering – The idea is to take the FFT over the whole audio file and apply band-pass filtering to the signal, allowing only certain frequencies to pass. While the algorithm does tend to extract the correct fundamental notes, there is no time information in this approach. Thus, while it does spit out notes, it provides no information as to when those notes are played.
2. Spectrogram Peaks with band-pass filtering – Spectrograms use the short-time Fourier transform (STFT) to generate a time-frequency graph. Taking the peaks of the spectrogram gives the most strongly concentrated frequencies per unit time, and these key frequencies are then mapped onto notes. This is an improvement over #1, since time information for pitches is now available. Unfortunately, it turns out that spectrogram peaks are not always the fundamental frequencies of a note. Thus, while spectrogram peaks might give the general vicinity of a note, they aren’t very precise with octave information.
3. Pitch-Tracking F0 Estimation with Note Model – Notes played by instruments don’t necessarily have a constant frequency. Rather, a note tends to follow a “lump” shape: it starts off at a lower frequency, rises to the frequency it settles at (generally the frequency of the note), and then falls again once the note is finished. Given such a note model, various signal processing algorithms can track the pitch of the note. I’ve found an autocorrelation-based algorithm works best: autocorrelation compares parts of a signal to itself to extract periodicities in the time domain. This algorithm gives both octave and fundamental note information (a simplified sketch appears after this list).
4. Chroma Signal Processing – The idea behind chroma processing is to map pitches onto pitch classes (the 12 note names, ignoring octave). Doing this creates algorithms that are highly robust to noise and distortion, and chroma-based methods underlie most commercial music information retrieval systems. However, because chroma-based methods map everything into only 12 pitch-class bins, the approach works extremely well for determining which note was played but provides no octave information.
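To make the autocorrelation idea from #3 concrete, here is a minimal sketch of an autocorrelation-based F0 estimator over a single analysis frame. This is a simplified illustration, not the actual MATLAB code my system uses; the class and parameter names are made up for this post.

```java
// Simplified autocorrelation pitch estimator: the lag at which a frame best
// correlates with itself corresponds to the period of the fundamental.
public final class AutocorrelationPitch {

    public static double estimateF0(double[] frame, double sampleRateHz,
                                    double minHz, double maxHz) {
        int minLag = (int) (sampleRateHz / maxHz);                 // highest pitch -> smallest lag
        int maxLag = Math.min((int) (sampleRateHz / minHz), frame.length - 1);
        int bestLag = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = minLag; lag <= maxLag; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < frame.length; i++) {
                sum += frame[i] * frame[i + lag];                  // correlation at this lag
            }
            if (sum > bestScore) {
                bestScore = sum;
                bestLag = lag;
            }
        }
        return bestLag > 0 ? sampleRateHz / bestLag : 0.0;         // period -> frequency
    }
}
```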
Each of the four algorithms has pros and cons. How can one get the best of all worlds? This can be cast as a sensor fusion problem!
Sensor fusion is the combining of sensory data, or data derived from sensory data, from disparate sources such that the resulting information is in some sense better than would be possible if these sources were used individually.
The sensor fusion algorithm is a work in progress, but the idea is to weight the output of each transcriber according to its strengths. For example, the chroma-based transcriber is really good at extracting which note was played but has no octave information, so the chroma-based transcriber can identify the note while my autocorrelation F0 estimator provides the octave.
The output of the sensor fusion algorithm is a transcription of the notes present in the audio clip.
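As an illustration of that chroma-plus-F0 idea, here is one way the two estimates could be combined into a single note. This is only a sketch of the concept; the weighting scheme I am actually experimenting with is more involved, and the names here are invented.

```java
// Sketch of the fusion idea: take the pitch class (0..11, C = 0) from the
// chroma transcriber and pick the octave implied by the autocorrelation F0.
public final class NoteFusion {

    public static int fuseToMidi(int pitchClass, double f0Hz) {
        // MIDI note implied by the F0 estimate alone (A4 = 440 Hz = MIDI 69).
        double midiFromF0 = 69 + 12 * (Math.log(f0Hz / 440.0) / Math.log(2));
        int octave = (int) Math.round(midiFromF0) / 12;
        // Snap to the chroma pitch class, choosing the candidate octave whose
        // MIDI number lies closest to the raw F0 estimate.
        int best = -1;
        double bestDiff = Double.MAX_VALUE;
        for (int oct = octave - 1; oct <= octave + 1; oct++) {
            int candidate = oct * 12 + pitchClass;
            double diff = Math.abs(candidate - midiFromF0);
            if (diff < bestDiff) {
                bestDiff = diff;
                best = candidate;
            }
        }
        return best;
    }
}
```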
B. Melody Classification Methods
Melody Classification is the problem of identifying a melody based on a known library of melodies. A melody, once again, is a sequence of notes and rests.
Humans can recognize key riffs in a piece even if the musician plays them a little differently. Melody notes can have slight perturbations. Notes can have different lengths as can the rests. Furthermore, humans can identify melodies in a scale-invariant sense. The same melody can be played in the key of G as in the key of C and a human could generally classify it correctly.
My melodic classification algorithm aims to classify a newly played melody despite these kinds of variation. For this task, a distance-based classifier approach has been useful, and two distance metrics – Levenshtein distance and melodic transposition distance – have been especially useful.
1. Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
A Levenshtein distance (let’s call it L-distance for short) algorithm takes as input two strings S1 and S2 and computes the minimum number of operations necessary to convert one string into the other, using only three operations: inserting a character, deleting a character, and substituting a character. There is an efficient dynamic programming algorithm to solve this problem.
For example, the L-distance between “ABC” and “ABD” is 1 because it takes one substitution to convert “C” to “D.” The L-distance between “ABCD” and “ABE” is 2 because it requires deleting one character (either C or D) and substituting the remaining character with E.
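For reference, here is the standard dynamic programming formulation of L-distance, shown over plain strings. My classifier applies the same recurrence to note sequences; this snippet is just the textbook version, not code lifted from my system.

```java
// Classic dynamic-programming Levenshtein distance.
public final class Levenshtein {

    public static int distance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;       // all deletions
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;       // all insertions
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int sub = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // delete
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + sub);        // substitute
            }
        }
        return d[s1.length()][s2.length()];
    }
}
```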
L-distance makes sense intuitively and performs well: two different melodies tend to have a high Levenshtein distance, while similar melodies have a low one. However, the L-distance isn’t very musical. For example, “ABC” and “ABF” have an L-distance of only 1, yet there is a huge auditory difference between the “B-C” interval and the “B-F” interval. One is a musical second whereas the other is a musical fifth.
The second metric I’ve been experimenting with is the transposition distance between melodies. The transposition distance between two notes N1 and N2 is the number of musical half steps between them, and the transposition distance between two melodies is the sum of these differences over all corresponding notes.
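A minimal sketch of that metric, assuming the two melodies are already aligned note-for-note and represented as MIDI pitch numbers (both assumptions of mine for this illustration):

```java
// Transposition distance: sum of absolute half-step differences between
// corresponding notes of two aligned melodies.
public final class TranspositionDistance {

    public static int distance(int[] midiMelodyA, int[] midiMelodyB) {
        int n = Math.min(midiMelodyA.length, midiMelodyB.length);
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += Math.abs(midiMelodyA[i] - midiMelodyB[i]);
        }
        return total;
    }
}
```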
So far I’ve had mixed results with each metric and the performance is very melody dependent. I am working on a fusion technique for this as well. I’ve found L-distance works as a good baseline.
The above method does a good job of providing time and note invariance in melody recognition. For example, if the musician elongates a note, plays a different note here and there, and otherwise makes not-too-terrible mistakes, the classifier still applies the correct melody tag.
Extending the distance classifier to accommodate pitch invariance is surprisingly easy. Transposition formulas can be used to transpose a melody from one key to another, so the melody database can simply be populated with all transpositions of a melody. Thus, the distance classifier can recognize a melody in any key!
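In sketch form, assuming the same MIDI-number representation as above, populating the database with every transposition is just a matter of shifting each note by k half steps for k = 0..11:

```java
import java.util.ArrayList;
import java.util.List;

// Generate all 12 transpositions of a melody (shift every note by k half steps).
public final class Transpositions {

    public static List<int[]> allKeys(int[] midiMelody) {
        List<int[]> result = new ArrayList<>();
        for (int k = 0; k < 12; k++) {
            int[] shifted = new int[midiMelody.length];
            for (int i = 0; i < midiMelody.length; i++) {
                shifted[i] = midiMelody[i] + k;
            }
            result.add(shifted);
        }
        return result;
    }
}
```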
II. Melody Discovery and Extraction
The Melody Discovery and Extraction problem is to recognize and segment out the key melodies or “riffs” in a guitar tab. Riffs tend to be the “catchy” parts of a song that people like to hum. Depending on the song, riffs tend to be 3- to 15-note melodies (on average) that are repeated. A piece can thus be segmented into several note sequences that are the “riffs” in the piece.
I have been experimenting with two key approaches to melody extraction: Rest-Based segmentation (segmenting based on the presence of rests) and Sequence-Based segmentation (segmenting based on note sequences).
A. Rest-Based Segmentation
Rest-Based segmentation involves breaking a piece into note sequences based on the rests in the piece. This is a rather simple criterion, but it works surprisingly well for many pieces because it accords with how many musicians write music. The underlying design concept is to play a certain theme, give the listener some time to process and reflect upon it, and then present the next musical idea.
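A minimal sketch of the idea, under the assumption (mine, for illustration) that the note stream is a list of MIDI pitches with rests encoded as negative values:

```java
import java.util.ArrayList;
import java.util.List;

// Split the note stream into segments wherever a rest appears.
public final class RestSegmenter {

    public static List<List<Integer>> segment(List<Integer> notesWithRests) {
        List<List<Integer>> segments = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int pitch : notesWithRests) {
            if (pitch < 0) {                         // rest: close the current segment
                if (!current.isEmpty()) segments.add(current);
                current = new ArrayList<>();
            } else {
                current.add(pitch);
            }
        }
        if (!current.isEmpty()) segments.add(current);
        return segments;
    }
}
```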
The performance of the Rest-Based segmentation algorithm is hit and miss. If the piece is designed with the above design concept in mind, the algorithm works. Otherwise, it overshoots or undershoots the length of a riff in a piece.
An extension to improve this algorithm is to utilize note sequence content.
B. Sequence-Based Segmentation
Sequence-Based segmentation looks at the sequence of notes in a piece. The idea is that there are patterns in the underlying sequences of notes that have been designed in by the musician; sequences can be repeated several times for emphasis.
Indeed, much of rock or jazz music is well structured, revolving around a couple of key chord changes and phrasing over those chord changes. Musicians tend to describe a piece as having structure “AABA” if there is a chord or melody that is played three times and another chord or melody that is played once.
To extract riffs, n-gram techniques can be used on the notes of melodies.
An n-gram is a list of n consecutive elements from a source list or string. Thus, the 2-grams of “ABCD” are {“AB”, “BC”, “CD”}, the 3-grams are {“ABC”, “BCD”}, and so forth.
The 3- through 15-grams of the notes of a melody can be extracted. That can be a LOT of n-grams, depending on the piece. However, riffs tend to be repeated and can occur several times in a piece, so high-count n-grams should reflect some intended structure. We can thus find the key melodies in a piece by looking at the highest-count n-grams.
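In sketch form (again over MIDI pitch lists, an assumption of this post rather than my actual data structures), the counting step looks like this:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count every n-gram of the note sequence for n in [minN, maxN].
public final class NgramCounter {

    public static Map<List<Integer>, Integer> countNgrams(List<Integer> notes,
                                                          int minN, int maxN) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= notes.size(); i++) {
                List<Integer> gram = List.copyOf(notes.subList(i, i + n));
                counts.merge(gram, 1, Integer::sum);   // increment this gram's count
            }
        }
        return counts;
    }
}
```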
This method, unfortunately, isn’t without its problems. One problem is that n-grams overlap at different scales. Let’s say we have “ABCABCD”. A high-ranking 2-gram is “AB”, which appears twice. A high-ranking 3-gram is “ABC”, which also occurs twice. So is the riff “ABC” or “AB” globally?
I’ve seen this problem of hierarchical tagging several times before in other computational areas that require tagging sequences of characters. It’s actually a fundamental problem in several fields. Natural Language Processing deals with the problem of extracting collocations, sequences of words that form a phrase that is more than the sum of its parts (for example, a “hot dog” is neither “hot” nor a “dog”). DNA sequencing deals with the problem of extracting key coding sequences.
After looking at some problem-solving techniques in these areas, a possible heuristic came to mind: if a lower-order n-gram occurs with exactly the same frequency as a higher-order n-gram that contains it, there is reason to believe the higher-order n-gram is the true design of the musician. For example, if “A” occurs 37 times and “BA” occurs 37 times, then every time an A appears it is preceded by a B. It’s likely the musician designed it that way because “BA” is their real musical unit. If “BACDEF” occurs 37 times too, then maybe that’s the real underlying “riff” in the piece.
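Here is a rough sketch of that pruning heuristic over the n-gram counts from the previous snippet: an n-gram is dropped whenever some strictly longer n-gram contains it and occurs with exactly the same count. This is only a naive O(n²) illustration of the idea, not my tuned implementation.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Keep only n-grams that are not subsumed by an equally frequent longer n-gram.
public final class RiffPruner {

    public static Map<List<Integer>, Integer> prune(Map<List<Integer>, Integer> counts) {
        return counts.entrySet().stream()
                .filter(e -> counts.entrySet().stream().noneMatch(longer ->
                        longer.getKey().size() > e.getKey().size()          // strictly longer
                                && longer.getValue().equals(e.getValue())   // same count
                                && Collections.indexOfSubList(longer.getKey(), e.getKey()) >= 0))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```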
The results of this heuristic are fantastic. For some of the pieces I’ve tried, I’ve seen the n-gram count drop from 235,234 to around 67.
III. Current Implementation and Future Plan
My system is currently entirely software-based. It is mostly written in Java, with MATLAB used as a giant calculator for efficiently computing the signal processing math. There is a GUI component that allows a user to import tabs and add extracted melodies from a tab to their own customized melody database. The melody database file can then be read and used by a real-time recognizer component that records 5-10 second clips of guitar playing and outputs melody labels for those clips based on the database.
The cool thing to do would be to build an actual hardware pedal using my recognition algorithm. In this implementation, the user could construct their melody database on their computer, use either USB or Wi-Fi to send the melody database to the pedal, and then use the pedal in a real-time performance setup. The pedal could be used with a DJ or other accompaniment board to generate automated accompaniment based on the recognized melody.
This could be quite an endeavor depending on the chosen hardware. One simple implementation could be built with Java Sun SPOTs (http://sunspotworld.com/). This would involve converting the MATLAB signal processing algorithms to Java, which isn’t too hard given that many of the toolboxes involved are open source. I believe the algorithms are industrial strength, so most of the work would be in the embedded programming.
—
I shall have screenshots and a downloadable demo here soon.