Extracting information from MIDI files¶

In our first lesson we want to get to know how to work with datasets so we can

parse a dataset
analyse the dataset and make sure it matches our expectations
check if errors occured during parsing

After this is done we want to find out how we can generate new drum patterns from the existing one. But in order to do this we need to take a look at the quantisation (?) of our patterns.

In such experiments, the cleaning of the dataset and setting the data up properly takes most of the time. But if we make mistakes here those mistakes will propagate through our system – so it’s a good idea to spend some time with this task.

import os
import glob

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # makes the randomness deterministic

%matplotlib inline
# todo: try %matplotlib widget
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['axes.grid'] = True

Getting the dataset¶

As our methods in this machine learning based workshop rely on data, we need a way to obtain such data. This will be on of our first endeavours. Thankfully there are search engines which help us to find data in the internet easily. When searching for “midi dataset”, one search result is https://colinraffel.com/projects/lmd/.

The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. […]

Of course the cultural bias of such a dataset is a thing we need to be aware of. What kind of music is transcribable into the MIDI format and what kind of music is transcribed as MIDI at all? Maybe we can shed some light on the last question by inspecting the dataset. But before we can do this, we need to download and understand the data.

We use a function which will check if the files are already downloaded and if not will download and extract the directory.

import urllib.request
import subprocess

def download_dataset(download_path: str = "../datasets/lmd"):
    os.makedirs(download_path, exist_ok=True)
    archive_dir = os.path.join(download_path, "lmd_full")
    
    dl_files = {
        "midi": {
            "path": os.path.join(download_path, "lmd_full.tap.gz"),
            "url": "http://hog.ee.columbia.edu/craffel/lmd/lmd_full.tar.gz"
        },
        "json": {
            "path": os.path.join(download_path, "md5_to_path.json"),
            "url": "http://hog.ee.columbia.edu/craffel/lmd/md5_to_paths.json"
        },
    }
    
    for dl_name, dl in dl_files.items():
        if os.path.isfile(dl["path"]) or (dl_name == "midi" and os.path.isdir(archive_dir)):
            print(f"{dl_name} already downloaded to {dl['path']}")
            continue
        print(f"Start downloading {dl_name} to {dl['path']} - this can take multiple minutes!")
        urllib.request.urlretrieve(dl['url'], dl['path'])
        print(f"Finished downloading")
    
    if not os.path.isdir(archive_dir):
        print("Start extracting the files of archive - this will take some minutes")
        # todo: windows has no tar
        subprocess.check_output([
            'tar', '-xzf', dl_files["midi"]["path"],
            '-C', os.path.join(download_path)
        ])
        print("Finished extracting")
    
    if os.path.isfile(dl_files["midi"]["path"]):
        print("Remove archive")
        os.remove(dl_files["midi"]["path"])

download_dataset()

midi already downloaded to ../datasets/lmd/lmd_full.tap.gz
json already downloaded to ../datasets/lmd/md5_to_path.json

Parsing the dataset¶

When working with large sets of files the unix utility glob comes in handy as we can describe the pattern of the file paths we want to match instead of listing all files.

When we take a quick look at the pattern it seems they all follow a structure like

../datasets/lmd/lmd_full/2/4a0cbb3f083d14d57858c87b26f85873.mid

Soon we will understand why the filename has this cryptic format, but for now we simply want to parse all available files into an array.

midi_files = glob.glob('../datasets/lmd/lmd_full/*/*.mid')
print(f'Found {len(midi_files)} midi files in dataset')

Found 178561 midi files in dataset

# select 5 random files
np.random.choice(midi_files, 5)

array(['../datasets/lmd/lmd_full/4/4a69b1a10e4dbe409ac922de4f6256d8.mid',
       '../datasets/lmd/lmd_full/b/b102261b4c27ea58bad4777b5df5be5e.mid',
       '../datasets/lmd/lmd_full/3/3a8ccaab480919c35f37fdf08c238e29.mid',
       '../datasets/lmd/lmd_full/d/dc6416232e05b3d44fb62007ca84b474.mid',
       '../datasets/lmd/lmd_full/4/4fb744a04c1afbb0309b581410b33363.mid'],
      dtype='<U63')

But how are sure that we matched every midi file in our folder? We can simply match every file in our dataset directory and show us the differences by using sets.

missed_files = set(glob.glob('../datasets/lmd/lmd_full/**/*.*', recursive=True)) - set(midi_files)
print(f'Did not match {len(missed_files)} files')

Did not match 0 files

Thankfully we have a rather clean dataset here which does not force us to do file acrobatics.

But we did not just download MIDI files, we also downloaded a JSON file which we will now take a look into.

import json

with open("../datasets/lmd/md5_to_path.json") as f:
    md5_filenames = json.load(f)

example_midi_file = '../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid'
print(f"Choosen {example_midi_file} as example MIDI file")
# with split we remove the folder name and the file extension
md5_filenames[example_midi_file.split('/')[-1].split('.')[0]]

Choosen ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid as example MIDI file

['QUALITY MIDI/EClapton-Promises.mid']

So - why do we have an additional file which tells us the original filename? This is actually a good trick in order to avoid duplicates in the dataset - we use a so called hash function which will return a cryptic string depending on the data that we hash. Two files will result in the same hash output - so if we simple rename each file to its hash output, we will automatically delete and detect all duplicates in our dataset. This is a common practice if the data was obtained by scraping multiple sources for data.

Let’s just take a look if we can verify this claim.

import hashlib

with open(example_midi_file, 'rb') as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
        
print(f"Hash for {example_midi_file} is {file_hash.hexdigest()}")

Hash for ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid is ed05335f8f6c273f506a997137b2805b

This also gives us a nice additional information - a common MIDI file will probably have lots of different names as it was scraped from multiple files which probably changed the filenames a bit.

Loading MIDI files in Python¶

Now as we have all file paths available it is a good practice to take a first look at the data. We need to understand how the information we are interested in is accessible. Also it is probably not clean and standardized.

As there is no build in functionality in Python to work with MIDI data we will need to rely on a library. There are a numerous libraries for working with MIDI data in Python, but we will rely on pretty_midi to inspect the MIDI files.

As we have quite a lot of data to process, it is worth some research on how we can reduce the amount of data before we process the whole dataset and bring it into a good format we can work with. Here, we will concentrate on percussion tracks. From the wikipedia article on General MIDI (GM) we can see that they should be on channel 10 - so for now we will simply ignore the other tracks.

In a former version of this document we used the library music21 for parsing and loading of our MIDI files but it turns out that music21 is too slow for our big dataset.

We also start annotating our dataset using pandas, which is a library for data analysis and manipulation. We can use it to keep track of any metadata the files. It will also help us on the analysis of the metadafiles.

Additionally we will use caching because the meta-analyis of our MIDI files takes about 2 hours on a recent MacBook. Caching means that we will store the results on the hard drive so once we calculated them we do not need to re-calculate them because we will just load the files from the hard drive. It is a common practice in data science to write a function which either loads the data from a file or generates the file we are using as generating or calculating the data is a step which takes often a long time.

We will start by inspecting how we can access MIDI information from within pretty_midi.

import pretty_midi as pm

midi_stream = pm.PrettyMIDI(example_midi_file)
midi_stream

<pretty_midi.pretty_midi.PrettyMIDI at 0x159b7d760>

After we have now loaded a MIDI file with pretty_midi we now take a look at how we can access the data. Therefore the documentation of pretty_midi will help us to know how to access the MIDI information.

midi_stream.get_end_time()

189.431068

will give us the time in seconds of the MIDI file

midi_stream.get_beats()[0:50]  # limit to 50 entries for printing

array([0.0000000e+00, 1.2500000e-02, 4.4107100e-01, 8.6964200e-01,
       1.2982130e+00, 1.7267840e+00, 2.1553550e+00, 2.5839260e+00,
       3.0124970e+00, 3.4410680e+00, 3.8410680e+00, 4.2410680e+00,
       4.6410680e+00, 5.0410680e+00, 5.4410680e+00, 5.8410680e+00,
       6.2410680e+00, 6.6410680e+00, 7.0410680e+00, 7.4410680e+00,
       7.8410680e+00, 8.2410680e+00, 8.6410680e+00, 9.0410680e+00,
       9.4410680e+00, 9.8410680e+00, 1.0241068e+01, 1.0641068e+01,
       1.1041068e+01, 1.1441068e+01, 1.1841068e+01, 1.2241068e+01,
       1.2641068e+01, 1.3041068e+01, 1.3441068e+01, 1.3841068e+01,
       1.4241068e+01, 1.4641068e+01, 1.5041068e+01, 1.5441068e+01,
       1.5841068e+01, 1.6241068e+01, 1.6641068e+01, 1.7041068e+01,
       1.7441068e+01, 1.7841068e+01, 1.8241068e+01, 1.8641068e+01,
       1.9041068e+01, 1.9441068e+01])

will give us the position in secounds of all beats (so any action) in a MIDI file

midi_stream.get_tempo_changes()

(array([0.      , 0.0125  , 3.441068]),
 array([120.     , 140.00014, 150.     ]))

will tell us any changes in the tempo in bpm. Keep in mind that we do not have two tempo changes here but that the function returns two arrays - the first one with the mark in seconds where the tempo gets changed and another one to what bpm it changes - for now we are only interested in the second array.

We should also prepare for the case that the MIDI file does not provide any tempo information.

midi_stream.time_signature_changes

[TimeSignature(numerator=4, denominator=4, time=0.0125),
 TimeSignature(numerator=2, denominator=4, time=21.041068000000003),
 TimeSignature(numerator=4, denominator=4, time=21.841068000000003),
 TimeSignature(numerator=2, denominator=4, time=36.241068000000006),
 TimeSignature(numerator=4, denominator=4, time=37.041068),
 TimeSignature(numerator=2, denominator=4, time=51.441068),
 TimeSignature(numerator=4, denominator=4, time=52.241068000000006),
 TimeSignature(numerator=2, denominator=4, time=66.641068),
 TimeSignature(numerator=4, denominator=4, time=67.441068),
 TimeSignature(numerator=2, denominator=4, time=107.441068),
 TimeSignature(numerator=4, denominator=4, time=108.24106800000001),
 TimeSignature(numerator=2, denominator=4, time=122.641068),
 TimeSignature(numerator=4, denominator=4, time=123.44106800000002)]

tells us the time signatures of the MIDI file.

time_signature = midi_stream.time_signature_changes[0]
print(f'{time_signature.numerator}/{time_signature.denominator}')

4/4

As with the tempo we should also prepare for the case that no time signature information is given on the MIDI file.

midi_stream.instruments

[Instrument(program=62, is_drum=False, name=""),
 Instrument(program=34, is_drum=False, name=""),
 Instrument(program=25, is_drum=False, name=""),
 Instrument(program=28, is_drum=False, name=""),
 Instrument(program=25, is_drum=False, name=""),
 Instrument(program=30, is_drum=False, name=""),
 Instrument(program=26, is_drum=False, name=""),
 Instrument(program=18, is_drum=False, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name="")]

gives us access to all instruments with its names and if this is a drum track or not.

We can also take a look at the notes of each instrument

midi_stream.instruments[0].notes[0:10]

[Note(start=9.834401, end=9.937735, pitch=62, velocity=109),
 Note(start=10.037735, end=10.311068, pitch=62, velocity=107),
 Note(start=10.437735, end=10.847735, pitch=62, velocity=113),
 Note(start=10.834401, end=11.057735, pitch=60, velocity=109),
 Note(start=11.034401, end=11.237735, pitch=59, velocity=106),
 Note(start=11.237735, end=11.594401, pitch=62, velocity=108),
 Note(start=11.627735, end=11.834401, pitch=60, velocity=101),
 Note(start=11.827735, end=12.247735, pitch=59, velocity=108),
 Note(start=12.224401, end=12.544401, pitch=62, velocity=106),
 Note(start=13.024401, end=13.131068, pitch=64, velocity=102)]

Performance¶

Now we know how we can use pretty_midi to load a MIDI file and access the information of a single MIDI file. But the problem is that a loading of a MIDI file is pretty slow in Python - we will do some measurements now to have a quick way to access the information we are interested in.

We can use a built in tool of Jupyter which is %%timeit which will do multiple runs of a cell and measure the time it takes for each run.

%%timeit

pm.PrettyMIDI(example_midi_file)

186 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Of course this time to parse the MIDI file depends on the complexity of our MIDI file. But now its time to do some maths. For the sake of simplicity we will assume that it takes around 100ms to load a MIDI file.

\[ 0.1 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 17800 \text{ sec} \approx 296 \text{ minutes} \approx 4 \text{ hours}\]

Note that this assumes that we do not process multiple files in paralell, which is possible but also tricky. Also this is only for the loading of the data, we not have acessed yet any data.

Maybe there is a file format to which we can store the MIDI files and load them quicker whenever we need them. note_seq from the magenata project provide such a format called note sequence which nicely interacts with pretty_midi.

import note_seq

%%timeit

notes = note_seq.midi_to_note_sequence(midi_stream)

14.4 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the conversion of this file is fast compared to the parsing of our MIDI file. The problem is that we still need the MIDI file parsed and we have not yet taken a look at the saving and loading of a note sequence.

note_seq is using a special kind of format for this which is called protobuf - a binary file format to store and exchange files developed by Google. The advantage is that this is really fast and efficient as it is a binary format and we can exchange the files to other programming languages - think of it like a binary version of JSON but with type safety included.

We start by saving the note sequence into a protobuf and try loading it again from our harddrive.

notes = note_seq.midi_to_note_sequence(midi_stream)

with open('note_seq_test.protobuf', 'wb') as f:
    f.write(notes.SerializeToString())

%%timeit

with open('note_seq_test.protobuf', 'rb') as f:
    notes_loaded = note_seq.NoteSequence.FromString(f.read())

3.47 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Lets say it takes about 4 ms to load a protobuf file - now we will do the calculation from before again.

\[ 0.004 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 712 \text{ sec} \approx 11 \text{ minutes}\]

which is much more acceptable. And we have not yet considered any parallelisation of the code.

But before we convert everything to a note sequence as a protobuf we should compare the file size of the two file formats.

for f in [example_midi_file, 'note_seq_test.protobuf']:
    print(f'File size of {f}: {os.path.getsize(f)/1024} kbytes')

File size of ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid: 46.6513671875 kbytes
File size of note_seq_test.protobuf: 195.958984375 kbytes

It turns out that we are trading storage in favour of computation power.

Before we can decide if we really want to trade, we may need to take a look at the file size of our MIDI dataset. For this we can use a functionality of Jupyter to execute commands in a shell from within Jupyter by appending a ! in front of the command. In order to get the size of a dictionary we will use the unix tool du with the arguments s (sum everything within the path) and h (convert number of bytes to a human readable format).

!du -sh ../datasets/lmd/lmd_full

5.9G	../datasets/lmd/lmd_full

We have around 6 gigabytes of files here - when we convert them to protobuf we will blow them up by a factor of \(\approx 4\) so we will have about 24 GB of protobuf files. This is still acceptable somehow.

Processing all files¶

So after we now know how to load, access and convert the files to a nicer format, we still need to take a look at how we can do this on our 180k files and maybe save some time here by a little bit of programming effort.

First we will start by programming a function in which we combine everything we did before and return to us all necessary information as a dictionary.

PROTO_SAVE_PATH = '../datasets/lmd/proto/'

# make sure that the folder we want to save to actually exists
os.makedirs(PROTO_SAVE_PATH, exist_ok=True)

from typing import Dict

from mido import KeySignatureError

def parse_midi_file(midi_file_path: str, proto_save_path: str=PROTO_SAVE_PATH) -> Dict:
    r = {
        'midi_path': midi_file_path,
        'midi_error': True,
    }
    midi_name = os.path.splitext(midi_file_path.split(os.sep)[-1])[0]
    proto_path = os.path.join(PROTO_SAVE_PATH, f'{midi_name}.protobuf')
    
    r['original_names'] = md5_filenames[midi_name]
    
    try:
        stream = pm.PrettyMIDI(midi_file_path)
        r['midi_error'] = False
    except (OSError, ValueError, IndexError, KeySignatureError, EOFError, ZeroDivisionError):
        return r
    
    try:
        # r['beat_start'] = stream.estimate_beat_start() # omitted b/c adds 200ms to parsing
        r['estimate_tempo'] = stream.estimate_tempo()
        r['tempi_sec'], r['tempi'] = stream.get_tempo_changes() 
        r['end_time'] = stream.get_end_time()
        r['drums'] = any([i.is_drum for i in stream.instruments])
        r['resolution'] = stream.resolution
        r['instrument_names'] = [i.name.strip() for i in stream.instruments]
        r['num_time_signature_changes'] = len(stream.time_signature_changes)
    except ValueError as e:
        # ValueError: Can't estimate beat start when there are no notes.
        # ValueError: Can't provide a global tempo estimate when there are fewer than two notes.
        print(f"Could not parse MIDI file {midi_file_path}: {e}")
    
    notes = note_seq.midi_to_note_sequence(stream)
    with open(proto_path, 'wb') as f:
        f.write(notes.SerializeToString())
    r['proto_path'] = proto_path
    
    return r

Now we want to try out the function on a single MIDI file.

parse_midi_file(example_midi_file)

{'midi_path': '../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid',
 'midi_error': False,
 'original_names': ['QUALITY MIDI/EClapton-Promises.mid'],
 'estimate_tempo': 159.6553471870248,
 'tempi_sec': array([0.      , 0.0125  , 3.441068]),
 'tempi': array([120.     , 140.00014, 150.     ]),
 'end_time': 189.431068,
 'drums': True,
 'resolution': 120,
 'instrument_names': ['', '', '', '', '', '', '', '', '', '', '', '', ''],
 'num_time_signature_changes': 13,
 'proto_path': '../datasets/lmd/proto/ed05335f8f6c273f506a997137b2805b.protobuf'}

So this works as expected - time to time it as well.

%%timeit

parse_midi_file(example_midi_file)

233 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It seems that accessing the MIDI data via pretty_midi does not come for free. Repeating the calculation from before:

\[ 0.2 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 35600 \text{ sec} \approx 593 \text{ minutes} \approx 10 \text{ hours}\]

This is not unheard of but we can reduce the time by parallelizing it for which we need to do some tricks into which we will not go detail. On a recent i9 processor the next step will take around 90 minutes if executed for the first time.

import math
import multiprocessing
from typing import Optional

from helpers.helpers import *

def load_midi_files(parquet_path: str = 'midi_files_{}.parquet', limit: Optional[int] = None) -> pd.DataFrame:
    parquet_path = parquet_path.format(limit if limit else 'full')
    
    if os.path.isfile(parquet_path):
        print(f'Use cached parquet file {parquet_path}')
        return pd.read_parquet(parquet_path)
    
    midi_files = glob.glob('../datasets/lmd/lmd_full/*/*.mid')
    if limit:
        print(f'Limit to {limit} files')
        midi_files = np.random.choice(midi_files, limit)
    print(f'Parse {len(midi_files)} midi files')
    
    cpu_count = math.ceil(multiprocessing.cpu_count()*3/4)
    with multiprocessing.Pool(cpu_count, maxtasksperchild=5) as p:
        midi_meta = p.map(parse_midi_file_async, midi_files, chunksize=255)
    print(f'Finished parsing midi files')
    
    midi_meta = pd.DataFrame(midi_meta)
    midi_meta.to_parquet(parquet_path)
    
    return midi_meta

# if you want to work with a random subset set the limit argument to num of files as a limit
midi_df = load_midi_files(limit=None)

midi_df.sample(15)

Use cached parquet file midi_files_full.parquet

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path
81923	../datasets/lmd/lmd_full/f/f4b72836981feedb9e1...	False	175.232953	[0.0]	[82.00003553334874]	128.780432	True	384.0	[harmnica, accordn2, drum, acoubass, pianogrd,...	1.0	../datasets/lmd/proto/f4b72836981feedb9e1046f1...
171833	../datasets/lmd/lmd_full/5/5f3eaf75cbc1b2554df...	False	213.043443	[0.0, 297.04290665, 297.1153704, 297.188899816...	[69.99998833333528, 69.00001725000433, 67.9999...	302.411045	True	120.0	[YO TE AMO;CHAYANE, YO TE AMO;CHAYANE, YO TE A...	1.0	../datasets/lmd/proto/5f3eaf75cbc1b2554dfa60bd...
161501	../datasets/lmd/lmd_full/2/2f02103cfb1764e661d...	False	170.691458	[0.0, 0.004166666666666667, 2.0015625]	[75.0, 120.0, 75.0]	174.718229	True	384.0	[, , , , , , , , , , , , , , , ]	1.0	../datasets/lmd/proto/2f02103cfb1764e661d54151...
39942	../datasets/lmd/lmd_full/6/609b19dde84d2374b7d...	False	237.529957	[0.0, 0.0026041666666666665]	[120.0, 110.00091667430561]	254.776049	True	384.0	[S-File, S-File, S-File, S-File, S-File, S-Fil...	1.0	../datasets/lmd/proto/609b19dde84d2374b7d19124...
100593	../datasets/lmd/lmd_full/d/de4dde0f1e960e41360...	False	170.384180	[0.0, 3.0, 183.33574275, 183.66386775]	[120.0, 67.00002903334591, 60.0, 49.0000318500...	190.137078	False	192.0	[, , , , , ]	0.0	../datasets/lmd/proto/de4dde0f1e960e413602c493...
44272	../datasets/lmd/lmd_full/6/6f74186f4c163863630...	False	242.119770	[0.0]	[122.00006913337249]	228.161756	True	480.0	[drummix, bass, ac piano, Vocal, back voc, ac ...	1.0	../datasets/lmd/proto/6f74186f4c163863630e8baa...
15131	../datasets/lmd/lmd_full/0/0d8e9c4eb761a3e632b...	False	245.724188	[0.0, 3.9344240000000004, 14.843497999999999, ...	[122.00006913337249, 121.00018755029072, 123.0...	251.814031	True	96.0	[TAKE ME HOME Bextor/Aller/...	1.0	../datasets/lmd/proto/0d8e9c4eb761a3e632ba4f10...
34735	../datasets/lmd/lmd_full/6/6d3149b41d088c63627...	False	84.000025	[0.0, 89.2856875, 89.88351320833333, 90.410473...	[42.00001260000378, 46.00002913335179, 51.0000...	340.282369	False	96.0	[Voice (Alto), Voice (Alto), Voice (Alto), Pia...	6.0	../datasets/lmd/proto/6d3149b41d088c6362792faf...
45905	../datasets/lmd/lmd_full/1/1a931072e21340f256b...	False	207.446994	[0.0]	[120.0]	280.114583	True	384.0	[A.PIANO 1, FINGERDBAS, A.PIANO 1, PAN FLUTE, ...	1.0	../datasets/lmd/proto/1a931072e21340f256b96104...
12844	../datasets/lmd/lmd_full/0/0a7c38887ba4882921e...	False	224.582071	[0.0]	[127.96996971377385]	193.497545	True	480.0	[CANDOMBE, PARA, GARDEL, http://fberni.tripod....	1.0	../datasets/lmd/proto/0a7c38887ba4882921ee2c82...
114123	../datasets/lmd/lmd_full/4/455fcb4f4c5ad4d10da...	False	153.020774	[0.0, 7.15475625, 7.3302190000000005, 7.506771...	[85.95680670463092, 85.48823040787859, 84.9608...	100.750126	False	1024.0	[]	1.0	../datasets/lmd/proto/455fcb4f4c5ad4d10da384b6...
14339	../datasets/lmd/lmd_full/0/029c4baa3089eca3817...	False	199.999987	[0.0, 6.461539, 7.961539, 9.790110733333334, 9...	[64.99999458333379, 40.0, 69.99998833333528, 6...	330.377789	True	120.0	[, Bass, Bass, Strings, Melodia, Slowstrings, ...	5.0	../datasets/lmd/proto/029c4baa3089eca381772e2b...
131099	../datasets/lmd/lmd_full/3/353116ea8d2eb4041c9...	False	260.000260	[0.0]	[130.00013000013]	51.692256	True	96.0	[MIDI Ch. 1, MIDI Ch. 2, MIDI Drums]	0.0	../datasets/lmd/proto/353116ea8d2eb4041c9f4622...
124606	../datasets/lmd/lmd_full/3/3574baf9f9cdfa4c3e1...	False	207.262310	[0.0, 0.16129033333333334, 0.9425403333333333,...	[30.999997933333468, 32.0, 32.99999670000033, ...	273.626126	True	120.0	[Rhodes Piano, Synth Bass H, Synth Bass L, Aco...	1.0	../datasets/lmd/proto/3574baf9f9cdfa4c3e1da456...
136078	../datasets/lmd/lmd_full/e/eb59c9f6854d897a03a...	False	193.775528	[0.0, 3.5416666666666665]	[120.0, 109.99990833340973]	217.860027	True	240.0	[Bass / Acoustic bass, Piano / Rhodes, Melody ...	1.0	../datasets/lmd/proto/eb59c9f6854d897a03a3ed78...

We now also add the original file names from the md5 json file to our dataframe.

original_files = []
for _, midi_path in midi_df['midi_path'].iteritems():
    midi_name = midi_path.split(os.sep)[-1].split('.')[0]
    original_files.append(md5_filenames[midi_name])

midi_df['original_files'] = original_files

midi_df.sample(15)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
31900	../datasets/lmd/lmd_full/7/745f66980de9ca27fe0...	False	169.981663	[0.0]	[151.00037750094376]	141.733917	True	96.0	[Acc Bass, Piano, Melody, Sax, Solo, E.Guitar,...	1.0	../datasets/lmd/proto/745f66980de9ca27fe055cc6...	[BonJovi/Runaway3.mid, MikeDoyle/MyLittleRunAw...
33585	../datasets/lmd/lmd_full/6/68b145d7ffdf4c43e1f...	False	204.472843	[0.0]	[99.99999999999999]	163.656250	True	96.0	[Remelexo, Remelexo, Remelexo, Remelexo, Remel...	2.0	../datasets/lmd/proto/68b145d7ffdf4c43e1f5f59c...	[C/Cesar E Paulinho - Remelexo.mid, Midis Româ...
67436	../datasets/lmd/lmd_full/a/a373ed7cd11fb80fac3...	False	159.432272	[0.0]	[159.0002067002687]	154.150743	True	192.0	[Crash Cymbal, Ride Cymbal, Open HiHat, Closed...	1.0	../datasets/lmd/proto/a373ed7cd11fb80fac3dca06...	[t/trust_me.mid, T/trust_me.mid]
142153	../datasets/lmd/lmd_full/e/ec2fff2c295943aedf4...	False	194.483086	[0.0]	[88.70005632453577]	259.977287	True	120.0	[You Can Leave Your Hat On, You Can Leave Your...	1.0	../datasets/lmd/proto/ec2fff2c295943aedf4797c3...	[Y/You Can Leave Your Hat on Pm L.mid, Y/You C...
127493	../datasets/lmd/lmd_full/3/3fae950c18f739286ec...	False	160.190137	[0.0]	[160.0]	100.425000	True	480.0	[Sax, Sax, French Horn, Piano, Bass, Drums]	1.0	../datasets/lmd/proto/3fae950c18f739286ec42107...	[2009 MIDI/barbra_Ann2-Regents-F160.mid]
138301	../datasets/lmd/lmd_full/e/ec16ec22a846e4d4d91...	False	164.382517	[0.0, 186.66648]	[162.000162000162, 120.0]	186.666480	True	120.0	[Vocal, ElecGtr 1, ElecGtr 2, PickedBass, Orga...	1.0	../datasets/lmd/proto/ec16ec22a846e4d4d91959cc...	[c/chaingm.mid, C/chaingm.mid]
83006	../datasets/lmd/lmd_full/f/fd8a6fcf912a7ddc602...	False	160.171123	[0.0]	[80.0]	274.525000	True	120.0	[Corazon Partio, Corazon Partio, Corazon Parti...	7.0	../datasets/lmd/proto/fd8a6fcf912a7ddc60242146...	[C/Corazon Partio Pm L.mid, Midis Latinas/Cora...
13575	../datasets/lmd/lmd_full/0/0dff276c070ccc16e2b...	False	203.014003	[0.0]	[190.0002850004275]	48.976900	True	96.0	[Bop on the Rocks, Bop on the Rocks, Bop on th...	3.0	../datasets/lmd/proto/0dff276c070ccc16e2bd6fa1...	[Jazz/Bop on.mid, b/bop-on.mid, Jazz/BOP_ON.MI...
134536	../datasets/lmd/lmd_full/e/ed793dadf58b913ea24...	False	192.000192	[0.0]	[140.00014000014]	6.861600	False	96.0	[z3ta+ (MIDI)]	0.0	../datasets/lmd/proto/ed793dadf58b913ea2460ad1...	[M/Machinehead - Headwave (Zatox Mix).mid, M/M...
116286	../datasets/lmd/lmd_full/4/4c4d48e046e6f5a2fc2...	False	255.499940	[0.0]	[145.9999659333413]	6.575344	False	96.0	[MIDI out]	0.0	../datasets/lmd/proto/4c4d48e046e6f5a2fc2f89cb...	[J/jan_johnston__flesh__tilt_remix__bamford.mi...
126553	../datasets/lmd/lmd_full/3/312b4e4b6f9bbd3abe3...	False	192.617363	[0.0]	[96.0]	87.500000	True	384.0	[ACOU BASS, A.PIANO 1, A.PIANO 1, JAZZ GTR, SL...	2.0	../datasets/lmd/proto/312b4e4b6f9bbd3abe3af2bb...	[Diversen/Strike-Up-The-Band.mid, DIVERSEN/STR...
150997	../datasets/lmd/lmd_full/b/bf3f61d4bf0c7a99c7c...	False	196.007331	[0.0]	[98.0036653370836]	364.744011	True	384.0	[SHOUT, SHOUT, SHOUT, SHOUT, SHOUT, SHOUT, SHO...	1.0	../datasets/lmd/proto/bf3f61d4bf0c7a99c7c5dded...	[080/Shout.mid, 080/Shout.mid]
66108	../datasets/lmd/lmd_full/8/80fd25fb1ff70896ed8...	False	180.717774	[0.0, 0.008333333333333333, 11.500865666666666...	[60.0, 67.00002903334591, 65.99999340000066, 6...	109.561609	False	120.0	[English (open lyrics window), French (open ly...	16.0	../datasets/lmd/proto/80fd25fb1ff70896ed8b1115...	[R/Ravel Maurice Chanson Des Cueilleuses De Le...
5427	../datasets/lmd/lmd_full/9/91100d1888462c59031...	False	246.623850	[0.0]	[120.0]	32.005208	False	480.0	[Nylon-Str. Gt., Bass, *Piano, Warm Pad, Strings]	1.0	../datasets/lmd/proto/91100d1888462c5903132be8...	[COREL COLLECTION/MOOD_06.MID]
78360	../datasets/lmd/lmd_full/f/f577f833d678672ef3b...	False	194.594595	[0.0]	[120.0]	200.916667	True	384.0	[Snare / Tom/cymbal, Snare / Tom/cymbal, Snare...	3.0	../datasets/lmd/proto/f577f833d678672ef3b2787c...	[D/Double Vision.mid, D/Double Vision.mid, D/D...

Analysis of metadata¶

Let’s examine the extracted metadata of those files. Before we start plotting it is always a good idea to take a look a the dataframe and its description.

midi_df.sample(15)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
160813	../datasets/lmd/lmd_full/2/2b03576c2eac6b20c55...	False	220.504202	[0.0]	[99.99999999999999]	200.200000	True	384.0	[Lead, Bass, Guitar, Rhythm guitar, Voice oohs...	1.0	../datasets/lmd/proto/2b03576c2eac6b20c5537514...	[M/Monday Monday.mid]
83189	../datasets/lmd/lmd_full/f/fc7280752baa136165c...	False	149.380190	[0.0]	[160.01024065540196]	150.326316	True	48.0	[SYNTH, SYNTH, SYNTH, SYNTH, SYNTH, SYNTH, SYN...	1.0	../datasets/lmd/proto/fc7280752baa136165c911e0...	[e/entreaty.mid]
96858	../datasets/lmd/lmd_full/c/cb226a46053507ae8a4...	False	170.000085	[0.0]	[170.0000850000425]	11.294112	False	96.0	[Neo Cortex - Elements (Styles & Breeze Remix)]	0.0	../datasets/lmd/proto/cb226a46053507ae8a47a16c...	[N/Neo Cortex - Elements (Breeze & Styles Remi...
34038	../datasets/lmd/lmd_full/6/6d12ce4d832f9c39ca6...	False	231.914925	[0.0, 2.368422, 3.493422, 11.1907935, 11.19735...	[151.99993920002433, 80.0, 151.99993920002433,...	503.284151	False	120.0	[Primo RH, Primo LH G, Primo LH F, Secondo RH ...	1.0	../datasets/lmd/proto/6d12ce4d832f9c39ca683d53...	[Mendelsonn/Allegro Brillant for Piano 4-hands...
171000	../datasets/lmd/lmd_full/5/586c46e8f1c65beff59...	False	236.159582	[0.0]	[130.0023833770286]	91.378132	True	384.0	[ACCORDION, A.PIANO 1, FINGERDBAS, CRYSTAL, DR...	1.0	../datasets/lmd/proto/586c46e8f1c65beff59afe0b...	[Byrd, Charlie/The-Girl-From-Ipanema.mid, BYRD...
103720	../datasets/lmd/lmd_full/d/dabbeaf50e6bad0b54d...	True	NaN	None	None	NaN	None	NaN	None	NaN	None	[Sure.Polyphone.Midi/Entertainer.mid, Various ...
71588	../datasets/lmd/lmd_full/a/aaf2e0398a707b34bbe...	False	174.928646	[0.0]	[124.000248000496]	131.612640	True	384.0	[ESTCE PR HASARD, ESTCE PR HASARD, ESTCE PR HA...	1.0	../datasets/lmd/proto/aaf2e0398a707b34bbeaefb9...	[E/Est Ce Par Hasard Dave.mid]
139917	../datasets/lmd/lmd_full/e/ebc5edd3c533092d324...	False	196.431974	[0.0]	[85.00028333427778]	276.574446	True	384.0	[TORN, TORN, TORN, TORN, TORN, TORN, TORN, TOR...	1.0	../datasets/lmd/proto/ebc5edd3c533092d32461eee...	[X/Xgtorn.mid, 097/XGtorn.mid, 097/XGtorn.mid]
45209	../datasets/lmd/lmd_full/1/1450ab4b1e52e25438d...	False	195.096618	[0.0]	[107.9999136000691]	224.166846	False	96.0	[Soprano, Soprano, Soprano, Soprano, Soprano, ...	38.0	../datasets/lmd/proto/1450ab4b1e52e25438de8e7e...	[civilwar2/61sfttobfa.mid]
74319	../datasets/lmd/lmd_full/a/aae666057efdbc7268d...	False	240.000000	[0.0]	[120.0]	66.010417	False	96.0	[3xOsc (MIDI), 3xOsc #2 (MIDI), 3xOsc #3 (MIDI...	1.0	../datasets/lmd/proto/aae666057efdbc7268d62b8f...	[K/Kara_Sun_-_Into_The_Sun_(Airbase_Dub_Mix)__...
90199	../datasets/lmd/lmd_full/c/c750e5a017155f06e63...	False	256.000000	[0.0]	[128.0]	15.117188	False	96.0	[Sylenth1 4, Sylenth1 3, Sylenth1 1]	1.0	../datasets/lmd/proto/c750e5a017155f06e63ae69c...	[C/Cascada_-_BadBoy__DjMixdOut_20130211014527....
99470	../datasets/lmd/lmd_full/c/cc6b412aeeb834e8fec...	False	275.408685	[0.0, 182.14267500000003]	[140.00014000014, 108.99994368336242]	185.995888	True	120.0	[, , , , , , , , , , ]	1.0	../datasets/lmd/proto/cc6b412aeeb834e8fecfd308...	[VerucaSalt/Seether.mid, VerucaSalt/Seether.mi...
175912	../datasets/lmd/lmd_full/5/500a1212bae787f7ef8...	False	240.000000	[0.0]	[120.0]	98.494792	False	384.0	[The Flowers, Of The Heath, Sequenced By, Barr...	1.0	../datasets/lmd/proto/500a1212bae787f7ef8fb099...	[h/heath.mid, H/heath.mid]
30403	../datasets/lmd/lmd_full/7/7dc2e8000f630ab4e57...	False	200.644292	[0.0, 9.69696, 38.49696, 48.193920000000006, 5...	[198.000198000198, 199.99999999999997, 198.000...	144.326112	True	384.0	[Drums - Travis, Bass - Mark, Guitar 1 - Tom, ...	1.0	../datasets/lmd/proto/7dc2e8000f630ab4e57eebd5...	[B/blink_182-dumpweed.mid]
108557	../datasets/lmd/lmd_full/d/d864a2adc13ad8b9797...	False	251.560269	[0.0]	[126.99998518333504]	205.980339	True	120.0	[The Corrs, "Breathless", , ****************, ...	1.0	../datasets/lmd/proto/d864a2adc13ad8b9797d6be2...	[Corrs/The Corrs - Breathless.mid, Various/The...

midi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178561 entries, 0 to 178560
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   midi_path                   178561 non-null  object 
 1   midi_error                  178561 non-null  bool   
 2   estimate_tempo              173984 non-null  float64
 3   tempi_sec                   173984 non-null  object 
 4   tempi                       173984 non-null  object 
 5   end_time                    173984 non-null  float64
 6   drums                       173984 non-null  object 
 7   resolution                  173984 non-null  float64
 8   instrument_names            173984 non-null  object 
 9   num_time_signature_changes  173984 non-null  float64
 10  proto_path                  174476 non-null  object 
 11  original_files              178561 non-null  object 
dtypes: bool(1), float64(4), object(7)
memory usage: 15.2+ MB

With a pandas dataframe we can simply aggregate and plot data into different formats. Let’s start by taking a look at our success rate of parsing the MIDI files.

midi_df['midi_error'].value_counts().plot.pie()
plt.title('Faulty MIDI files');

midi_df.midi_error.value_counts().plot.bar()
plt.title("Number of faulty MIDI files");

After we know the amount of failed MIDI files we could inspect those further to see if they are indeed corrupted or if our library has a bug.

midi_df[midi_df['midi_error'] == True]

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
3	../datasets/lmd/lmd_full/9/911cd08fa1fae36e5e0...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[s/sou.mid, S/sou.mid]
24	../datasets/lmd/lmd_full/9/94862530febd2b295b9...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[e/ELVIS_PRESLEY__Dont_Be_Cruel.mid, ElvisPres...
25	../datasets/lmd/lmd_full/9/906e72809900e01f919...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[dream theater/uagm1.mid]
98	../datasets/lmd/lmd_full/9/99e40264f321a4bfc5d...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[V/vong_tay_cau_hon.mid]
106	../datasets/lmd/lmd_full/9/9f22ccab9572cafafce...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[W/Wippenberg - Pong (Tocadisco Remix).mid, W/...
...	...	...	...	...	...	...	...	...	...	...	...	...
178315	../datasets/lmd/lmd_full/5/5063a2b400597a372e7...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[f/faded.mid, 2009 MIDI/faded_love1-D120patsy_...
178377	../datasets/lmd/lmd_full/5/523ce5adc656c872a69...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[l/laputa.mid, L/laputa.mid]
178419	../datasets/lmd/lmd_full/5/546df09d78a32141369...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[h/heraldic.mid, H/heraldic.mid]
178484	../datasets/lmd/lmd_full/5/5a4e49112c6cf832341...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[w/WESTERN.MID, 2009 MIDI/good_bad_and_ugly4-C...
178525	../datasets/lmd/lmd_full/5/5cae276ae00fb1fd464...	True	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[soad/Chic_N_Stu.mid]

4084 rows × 12 columns

For now, we will simply ignore those files and focus on the ones that we could sucessfully parse.

For our experiments it is important that the MIDI files contain a drum track - only those files that we could parse succssfully and that contain a drum track can be used for training of our model.

midi_df.drums.value_counts().plot.pie()
plt.title("MIDI files that contain drums");

midi_df.drums.value_counts().plot.bar();
plt.title("Number of MIDI files that contain drums");

We should verify that the files in which we did not detect any drums have indeed no drums by listening to a random subset of them.

midi_df[midi_df['drums']==False].sample(10)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
170937	../datasets/lmd/lmd_full/5/5c7f00cfbfb1c7a4c8d...	False	219.517230	[0.0, 216.0]	[80.0, 82.00003553334875]	422.539545	False	480.0	[Piano]	1.0	../datasets/lmd/proto/5c7f00cfbfb1c7a4c8df5cc0...	[C/Casablan.MID, C/CASABLAN.MID, C/CASABLAN.MID]
173537	../datasets/lmd/lmd_full/5/5ccabab69a29eb49782...	False	134.749985	[0.0, 0.4225350000000001, 2.9225340000000006, ...	[71.00003550001773, 72.0000288000115, 68.00007...	271.779287	False	192.0	[Flute, Celeste, Harp, Vibes]	10.0	../datasets/lmd/proto/5ccabab69a29eb49782f7ac2...	[A/AS_OP19.MID, A/AS_OP19.MID]
160054	../datasets/lmd/lmd_full/2/248de65be7464e72e13...	False	288.075739	[0.0]	[143.99988480009216]	13.333344	False	96.0	[MIDI out]	0.0	../datasets/lmd/proto/248de65be7464e72e13026ee...	[H/Heaven's Cry - I Dont Need This No More.mid...
104986	../datasets/lmd/lmd_full/d/d62040c47cc8ccc1256...	False	230.565249	[0.0, 1.1940279999999999, 1.6062414999999999, ...	[201.00031155048293, 131.00007641671124, 18.00...	418.198189	False	240.0	[Diskant]	1.0	../datasets/lmd/proto/d62040c47cc8ccc12566150b...	[r/raffsonatacl.mid, R/raffsonatacl.mid]
177415	../datasets/lmd/lmd_full/5/5d3a4dcffc36dc21507...	False	183.485497	[0.0, 26.052641999999995, 26.252641999999994, ...	[75.99996960001218, 75.0, 73.99998273333736, 7...	355.954855	False	96.0	[PRIMI MANDOLINI, SECONDI MANDOLINI, MANDOLE, ...	3.0	../datasets/lmd/proto/5d3a4dcffc36dc21507ef683...	[p/primiracconti.mid, P/primiracconti.mid]
25978	../datasets/lmd/lmd_full/7/7a6107b2ebe71033dd6...	False	133.333333	[0.0]	[100.0]	46.500000	False	120.0	[, , , ]	1.0	../datasets/lmd/proto/7a6107b2ebe71033dd6d7a5b...	[h/himno442.mid, H/himno442.mid]
17961	../datasets/lmd/lmd_full/0/09634cc18ee4b3e3d8e...	False	224.868701	[0.0]	[132.000132000132]	198.167415	False	192.0	[Clarinet, Trumpet, Tuba, Strings, Strings]	1.0	../datasets/lmd/proto/09634cc18ee4b3e3d8e84072...	[a/Airship_Remix.mid, A/Airship_Remix.mid]
34639	../datasets/lmd/lmd_full/6/603e957bc609b539948...	False	100.000000	[0.0, 62.5, 65.5, 86.5, 91.5, 112.5, 113.5, 11...	[60.0, 30.0, 60.0, 30.0, 60.0, 30.0, 60.0, 30....	191.763885	False	96.0	[Soprano, Alto, Tenor, Bass, Piano (hi), Piano...	5.0	../datasets/lmd/proto/603e957bc609b5399481f26f...	[bliss/ppb74mhamt.mid]
161340	../datasets/lmd/lmd_full/2/24c247fe99ec2cfb909...	False	141.750142	[0.0]	[126.00012600012599]	63.805492	False	240.0	[Piano, Piano]	1.0	../datasets/lmd/proto/24c247fe99ec2cfb909c4aa4...	[W/waltz_05.mid, brahms/waltz_05.mid]
29071	../datasets/lmd/lmd_full/7/7d3cac11d3db19406bb...	False	106.219130	[0.0]	[91.04496862742124]	42.176960	False	96.0	[, , , ]	1.0	../datasets/lmd/proto/7d3cac11d3db19406bb568a5...	[Christian/Inmyhart.mid, Various Artists/inmyh...

Lets take also a look at the other metadata we have extracted.

midi_df.explode('tempi')['tempi'].plot.box(vert=False);
plt.title('Boxplot of tempi');

midi_df.explode('tempi')['tempi'].plot.hist(bins=5000, xlim=(0, 300));
plt.title('Histogram of tempi');

We notice a big spork at around 120 bpm which is rather interesting and should be inspected what the exact casue of this spike is.

It is important to get an understanding and feeling for the data. We have an expectation of our data which can help us to see if it makes sense and if we parsed it correctly. But that can also be problematic because we project certain simplifying expectations on our data.

Let’s count the occurences of tempo changes in each song.

midi_df.explode('tempi').groupby('midi_path').size().value_counts().plot.box(vert=False);
plt.title('Boxplot of number of time chages per song');

midi_df.explode('tempi').groupby('midi_path').size().sort_values(ascending=False).head(10)

midi_path
../datasets/lmd/lmd_full/9/95cf26881a51f376a54e4e3f049f2d48.mid    6597
../datasets/lmd/lmd_full/3/34c04cd56b1087851f79780360f48225.mid    6096
../datasets/lmd/lmd_full/2/2677cf785791e1cfca13a0566aafbc85.mid    6096
../datasets/lmd/lmd_full/3/34a3df2a3a1e2267cf63657984f608cf.mid    5912
../datasets/lmd/lmd_full/c/c615436a609bf4e82c102503ea01d855.mid    5758
../datasets/lmd/lmd_full/1/1e40fd0edc293c8733a9c1b66517890b.mid    5758
../datasets/lmd/lmd_full/6/6662d98153d0c84d93e794d0c2b3940f.mid    5758
../datasets/lmd/lmd_full/8/8d956a0e5ee409a19adda9db287f7108.mid    4301
../datasets/lmd/lmd_full/0/001549d8bc6ba6dc62ade443cf01a51e.mid    4058
../datasets/lmd/lmd_full/0/0b0fdfe28eb0318ea3ac86046053724b.mid    3922
dtype: int64

Its worth to inspect those files in e.g. MuseScore and try to figure out what happens in them.

midi_df['end_time'].plot.box(vert=False);
plt.title('Boxplot of end_time');

We see that the end_time has some extreme values that are probably wrong - let’s inspect those examples closer to see if we can see a pattern.

midi_df.sort_values('end_time', ascending=False).head(5)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
84737	../datasets/lmd/lmd_full/f/f7b41341a20201f860c...	False	225.041860	[0.0]	[128.0]	31055.625000	False	120.0	[Melody, On the Sun Road, Vivian Lai, Joseph L...	2.0	../datasets/lmd/proto/f7b41341a20201f860cb375e...	[s/sunroad.mid, S/sunroad.mid]
28204	../datasets/lmd/lmd_full/7/7abab3566b77073f827...	False	220.109820	[0.0, 1.411764, 67.095876, 129.0154569375, 129...	[170.0000850000425, 95.00014250021376, 70.0000...	25714.187790	True	96.0	[, , , , , , , ]	9.0	../datasets/lmd/proto/7abab3566b77073f8274c097...	[FErnszt/Olala.mid]
69901	../datasets/lmd/lmd_full/a/af6b689cfbb13c302bd...	False	229.041171	[0.0, 5.052624, 40.420992, 94.420992, 118.4209...	[95.00014250021376, 190.0002850004275, 80.0, 1...	20384.791244	True	96.0	[Acoustic Bass, Brass, Grand piano, Rock Organ...	7.0	../datasets/lmd/proto/af6b689cfbb13c302bd41b86...	[FErnszt/Swingin.mid]
26973	../datasets/lmd/lmd_full/7/73f4f536d3d42293bd7...	False	144.311366	[0.0]	[122.00006913337249]	19652.447880	True	192.0	[tk1, tk2, tk3, tk4, tk10, tk11]	1.0	../datasets/lmd/proto/73f4f536d3d42293bd789b34...	[Clapton Eric/Bellbottom Blues.mid, Various Ar...
24515	../datasets/lmd/lmd_full/7/7486b07d9a4060fa614...	False	181.059483	[0.0, 0.4829544166666666, 1.9996389166666664, ...	[88.00002346667293, 89.00994241056726, 90.0099...	19623.803343	True	192.0	[Steel Str.Guitar, Fretless Bass, Piano, Violi...	13.0	../datasets/lmd/proto/7486b07d9a4060fa6142ac46...	[K/Kevin Parent Les Doigts 2.mid]

Regarding this, it seems we should filter out those really long pieces as those are most probably wrong files.

Those errors could be caused by our parsing or by the MIDI files itself - but as we have a big enough corpora it is justifiable by not including those outliers.

Resolution corresponds to the PPQN of the MIDI file. Everything above 1000 should be suspicious, so let’s take a look at those examples.

midi_df.sort_values('resolution', ascending=False).head(5)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
33871	../datasets/lmd/lmd_full/6/610987ebaf0a326db40...	False	239.221606	[0.0]	[120.0]	135.368840	False	25000.0	[, , , ]	1.0	../datasets/lmd/proto/610987ebaf0a326db4089867...	[b/beethovenalladanza.mid, B/beethovenalladanz...
45187	../datasets/lmd/lmd_full/1/1283eddbc4ab9042dcc...	False	225.608072	[0.0]	[112.00005973336519]	79.280092	True	24576.0	[, , ]	1.0	../datasets/lmd/proto/1283eddbc4ab9042dccaea09...	[Pop_and_Top40/2 Pac - Changes.mid, Pop_and_To...
67351	../datasets/lmd/lmd_full/a/a5c6eb45c94d84474cb...	False	235.213491	[0.0]	[120.0]	26.987488	False	16384.0	[, , ]	2.0	../datasets/lmd/proto/a5c6eb45c94d84474cbe00a3...	[a/amigo3.mid, A/amigo3.mid]
159033	../datasets/lmd/lmd_full/2/2076f0fa330be484c1c...	False	236.923314	[0.0]	[140.00014000014]	26.999973	False	15360.0	[Main Lead]	1.0	../datasets/lmd/proto/2076f0fa330be484c1c60242...	[M/M.I.D.O.R. - Far East.mid, M/m.i.d.o.r.__fa...
15472	../datasets/lmd/lmd_full/0/0c3f038f4eaed7d7f8c...	False	272.000290	[0.0]	[136.0001450668214]	28.014676	False	15360.0	[Melody]	1.0	../datasets/lmd/proto/0c3f038f4eaed7d7f8c90478...	[W/whiteroom__someday_intro__ambia.mid, W/Whit...

Another important metric is the number of time signatures.

If we use a step-sequencer like approach, we will need our data to have one time signature only.

midi_df['num_time_signature_changes'].plot.box(vert=False);

midi_df.sort_values('num_time_signature_changes', ascending=False).head(5)

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
72284	../datasets/lmd/lmd_full/a/a85612fb0992100bded...	False	222.143725	[0.0, 1.54624, 1.6513706666666668, 1.779370666...	[194.0190397350993, 190.23944805194805, 195.31...	204.310165	True	48.0	[STRAUSS, STRAUSS, STRAUSS, STRAUSS, STRAUSS, ...	959.0	../datasets/lmd/proto/a85612fb0992100bded76798...	[S/STRAUSS.MID, S/STRAUSS.MID]
26776	../datasets/lmd/lmd_full/7/7d42a95293cc47bcacf...	False	199.481296	[0.0, 1.6240640000000002, 2.071552, 2.22173866...	[160.0922131147541, 134.08180778032036, 133.16...	229.441685	True	48.0	[NACHT, NACHT, NACHT, NACHT, NACHT, NACHT, NACHT]	782.0	../datasets/lmd/proto/7d42a95293cc47bcacfaa568...	[Hollands/NACHT.MID]
168045	../datasets/lmd/lmd_full/5/5693291bc82548fea37...	False	242.258977	[0.0]	[136.0001450668214]	255.266272	True	96.0	[VIBRA SLAP, VIBRA SLAP, VIBRA SLAP, VIBRA SLA...	479.0	../datasets/lmd/proto/5693291bc82548fea376ec42...	[S/Striving.MID, Midis Diversas/STRIVING.MID, ...
60293	../datasets/lmd/lmd_full/8/85b2d4e8637008f6b4e...	False	218.486679	[0.0, 2.5, 73.5, 75.586952, 89.884816, 131.884...	[96.0, 240.0, 230.00049833441304, 235.00013708...	283.476487	False	1024.0	[, ]	424.0	../datasets/lmd/proto/85b2d4e8637008f6b4e54e5c...	[B/beevar2.mid]
44034	../datasets/lmd/lmd_full/6/69df67cd25dd33c1217...	False	228.828269	[0.0, 1.25, 3.65, 102.78041, 105.1804100000000...	[96.0, 100.0, 115.0000287500072, 100.0, 115.00...	407.079384	False	1024.0	[, ]	382.0	../datasets/lmd/proto/69df67cd25dd33c12177e159...	[Schumann/Schuman Toccata op7.mid, Schumann/Sc...

Let’s check the amount of files we would loose if we limit our files to exactly one time signature.

(midi_df.num_time_signature_changes <= 1).value_counts().plot.pie();
plt.title('Tracks without time signature changes');

We can also take a look at the most common instrument names in our dataset.

midi_df.explode('instrument_names').instrument_names.value_counts().head(20)

                  354676
Bass               29532
Drums              19438
untitled           18471
Piano              16344
Strings            10922
Soprano            10478
Guitar             10126
Alto                9368
Tenor               8987
DRUMS               8013
Melody              6693
WinJammer Demo      6488
Voice               6270
Piano (hi)          4905
Piano (lo)          4900
Italian             4679
STRINGS             4003
bass                3785
MELODY              3775
Name: instrument_names, dtype: int64

And also take a look at the most common names of our original filenames.

midi_df.explode('original_files').original_files.str.replace('[^a-zA-Z]', ' ').str.lower().str.split(' ').explode().value_counts().head(50)

<ipython-input-48-4f9f4be492b3>:1: FutureWarning: The default value of regex will change from True to False in a future version.
  midi_df.explode('original_files').original_files.str.replace('[^a-zA-Z]', ' ').str.lower().str.split(' ').explode().value_counts().head(50)

              1091324
mid            570786
midi            56765
various         42479
artists         36610
a               34883
the             34148
l               29181
s               28479
polyphone       24771
sure            23089
i               22765
m               20958
e               20792
t               19164
midis           19014
c               18162
b               17950
d               17712
n               17012
h               15452
o               15425
g               13303
you             13228
p               12267
k               12032
of              11890
poly            11836
in              11073
f               10433
beatles          9696
love             9246
r                8911
me               8901
j                8816
diversen         8643
my               8172
to               7585
divers           7025
w                6894
analisadas       6320
no               6073
de               5601
on               5373
and              5349
midirip          5017
it               4853
bach             4692
classical        4542
unsorted         4419
Name: original_files, dtype: int64

An interesting aspect of this analysis reveals that the most common words in a text corpora do not often contain much information what the text is about. This is because the words that belong to the grammatical structure of the text introduce noise to the dataset.

There are basic algorithms such as tf-idf which allows us to filter out the noise by asserting that a word which is common in many documents has little no meaning (e.g. the or in our case midi) and therefore give it a low score. Although if a significant subset of our file names contain a word that is not so common in the other file names (e.g. beatles) this token will receive a high rating.

For now we don’t want to inspect this further, but its always good to know such kind of algorithms as it helps to filter out the relevant data.

Its also good to make some sanity checks on our data. One assumption would be that every file in which we did not detect a drum track also does not contain an instrument with the name drums.

explodeded_df = midi_df.explode('instrument_names')
explodeded_df[
    (explodeded_df['instrument_names'].str.lower().isin(['drums', 'drum']))&
    (explodeded_df['drums'] == False)
].groupby('midi_path').first().reset_index()

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
0	../datasets/lmd/lmd_full/0/0093ffc81428aabfa14...	False	190.000285	[0.0]	[190.0002850004275]	137.151110	False	192.0	Drums	1.0	../datasets/lmd/proto/0093ffc81428aabfa14aaa13...	[M/Martha My Dear 3.mid, Beatles +GeorgeJohnPa...
1	../datasets/lmd/lmd_full/0/032e298d9560fdbe954...	False	120.000180	[0.0]	[190.0002850004275]	114.947196	False	120.0	Drums	1.0	../datasets/lmd/proto/032e298d9560fdbe954d7551...	[s/smetmoth.mid, S/smetmoth.mid]
2	../datasets/lmd/lmd_full/0/043e1c01e58a4bb4499...	False	139.999977	[0.0, 0.4285715, 7.071422000000001, 39.071422]	[69.99998833333528, 140.00014000014, 60.0, 69....	196.778591	False	120.0	Drums	1.0	../datasets/lmd/proto/043e1c01e58a4bb44995ee38...	[Songs+300/Tunnel_o.mid]
3	../datasets/lmd/lmd_full/0/06f86377b1dbad0ebf0...	False	226.482465	[0.0, 1.0, 15.608691999999998]	[120.0, 115.0000287500072, 120.0]	59.483692	False	192.0	DRUMS	1.0	../datasets/lmd/proto/06f86377b1dbad0ebf08fdcd...	[s/seaquest2.mid, S/seaquest2.mid]
4	../datasets/lmd/lmd_full/0/078e2cc7c46eb5de726...	False	204.052854	[0.0]	[98.00014373354414]	281.211322	False	192.0	Drums	1.0	../datasets/lmd/proto/078e2cc7c46eb5de72621329...	[Pop/SLEGDEHM.MID, PeterGabriel/Sledgehammer6....
...	...	...	...	...	...	...	...	...	...	...	...	...
148	../datasets/lmd/lmd_full/f/f2a97c0f7fa26244a77...	False	218.459582	[0.0]	[70.00007000007]	213.553358	False	192.0	drums	1.0	../datasets/lmd/proto/f2a97c0f7fa26244a77a8d0f...	[u/unchainedmelody3.mid, U/UnchainedMelody3.mid]
149	../datasets/lmd/lmd_full/f/f6702b2445a6ce9aa8e...	False	211.143695	[0.0]	[120.0]	21.241667	False	480.0	Drums	1.0	../datasets/lmd/proto/f6702b2445a6ce9aa8e6cad8...	[e/ElectroJam.Mid, E/ElectroJam.Mid]
150	../datasets/lmd/lmd_full/f/f8a56c3ebc00bfef70d...	False	157.793103	[0.0]	[80.0]	263.847656	False	192.0	Drums	1.0	../datasets/lmd/proto/f8a56c3ebc00bfef70d033a8...	[t/tearinhand.mid, T/tearinhand.mid]
151	../datasets/lmd/lmd_full/f/fa06799e6825c6a4095...	False	284.928064	[0.0]	[148.99983858350816]	383.315852	False	240.0	drums	16.0	../datasets/lmd/proto/fa06799e6825c6a4095e2934...	[l/louder.mid, L/louder.mid]
152	../datasets/lmd/lmd_full/f/fd37e5d1258ba97fc0e...	False	235.685752	[0.0]	[100.0]	191.795000	False	120.0	Drums	1.0	../datasets/lmd/proto/fd37e5d1258ba97fc0ef9779...	[Christian/Easyway.mid, Various Artists/easywa...

153 rows × 12 columns

Listening to those examples reveal that some of them use another instrument than drums for their drums - either by artistic choice or by mistake. As this only applies to 153 files we can live with this error and ignore the files were we did not detect the drum track.

Filter and extract the data¶

After inspecting all availabe files we should filter out some noisy files that are either not interesting to us (they lack a drum track) or the parsing of the MIDI files did not work as expected. With the analysis above we can verify borders of data we want to allow.

filtered_midi_df = midi_df[
    (midi_df.drums == True)
    & (midi_df.midi_error == False)
    & (midi_df.end_time.between(30, 800))
    & (midi_df.num_time_signature_changes<=1)
]
filtered_midi_df

	midi_path	midi_error	estimate_tempo	tempi_sec	tempi	end_time	drums	resolution	instrument_names	num_time_signature_changes	proto_path	original_files
5	../datasets/lmd/lmd_full/9/9d30679480d0e55a07b...	False	224.502048	[0.0]	[115.00002875000717]	296.347752	True	120.0	[Piano, Organ, Big Brass, Horns, Horns2, Horns...	1.0	../datasets/lmd/proto/9d30679480d0e55a07be1414...	[S/Smooth .mid]
9	../datasets/lmd/lmd_full/9/974f5b2466a5d5670df...	False	184.213982	[0.0, 2.714283, 170.86248999999998, 171.960050...	[210.00021000021, 80.99997165000993, 82.000035...	225.896259	True	120.0	[, , , , , , , , , ]	1.0	../datasets/lmd/proto/974f5b2466a5d5670dfddee0...	[Sure.Polyphone.Midi/A Winter's Tale.mid, w/wi...
10	../datasets/lmd/lmd_full/9/952df587b9fff5adbeb...	False	265.636110	[0.0]	[144.0023040368646]	216.460946	True	480.0	[Track 2, Track 3, Track 4, Track 5, Track 6, ...	1.0	../datasets/lmd/proto/952df587b9fff5adbeb5f40b...	[E/Enola Gay 63.mid, divers midi 3/OMD_-_Enola...
11	../datasets/lmd/lmd_full/9/93e2cb3a0c36af361bb...	False	165.000165	[0.0]	[165.000165000165]	310.029993	True	480.0	[Chorus Guitar, Overdrive Guita, Overdrive Gui...	1.0	../datasets/lmd/proto/93e2cb3a0c36af361bb49c5e...	[z/zombie05.mid, Z/zombie05.mid]
12	../datasets/lmd/lmd_full/9/97c1f795ffb676dfaa5...	False	212.877687	[0.0]	[110.00011000011]	222.766823	True	192.0	[BASS, E.PIANO, GUITAR, STRINGS, PIANO, New Tr...	1.0	../datasets/lmd/proto/97c1f795ffb676dfaa5a0202...	[J/J. Rivers Poor Side of Town.mid, divers mid...
...	...	...	...	...	...	...	...	...	...	...	...	...
178555	../datasets/lmd/lmd_full/5/5bd52f314b616bf50c4...	False	141.332152	[0.0, 183.428388, 186.85695800000002, 241.9283...	[70.00007000007, 35.00001458333941, 70.0000700...	250.221166	True	480.0	[Track 1, Track 2, Track 3, Track 4, Track 6, ...	1.0	../datasets/lmd/proto/5bd52f314b616bf50c40104c...	[T/taylor_swift-tim_mcgraw.mid]
178556	../datasets/lmd/lmd_full/5/5eea68ee369af3ee5cc...	False	184.252715	[0.0, 210.73867815, 210.75516165, 210.77763355...	[90.00009000009, 91.00009100009099, 89.0000400...	299.296908	True	240.0	[You've Made Me So Very Happy, You've Made Me ...	1.0	../datasets/lmd/proto/5eea68ee369af3ee5cc577ba...	[Y/Youvemad L.mid, Y/YOUVEMAD L.mid, Y/YOUVEMA...
178557	../datasets/lmd/lmd_full/5/5309fb3f62bf73780e6...	False	183.172080	[0.0]	[81.01003309259853]	139.390142	True	120.0	[, , , , , , , , , ]	1.0	../datasets/lmd/proto/5309fb3f62bf73780e61f85b...	[R/Roberto Carlos - Propuesta L.mid, Midis Jov...
178559	../datasets/lmd/lmd_full/5/5538b0174111bcef760...	False	274.353398	[0.0]	[142.00007100003546]	172.389879	True	96.0	[Std Drums, Jazz Gtr, Muted Gtr (Lead Dbl), Mu...	1.0	../datasets/lmd/proto/5538b0174111bcef760a5f29...	[B/Boys 2.mid, b/boys_2.mid, Beatles +GeorgeJo...
178560	../datasets/lmd/lmd_full/5/51e756897765aed30af...	False	229.821409	[0.0]	[64.99999458333379]	293.317332	True	96.0	[, , , , , ]	1.0	../datasets/lmd/proto/51e756897765aed30af232df...	[G/Gerry Boulet Deadline.mid]

97848 rows × 12 columns

len(filtered_midi_df)/len(midi_df)

0.5479696014247232

We have now filtered half of our available examples, just by looking at the metadata of the MIDI files. The next step is to transform the drum track of each MIDI file into a mathematical representation.

The notatation within the MIDI file is bound to time, but in music we often represent the time domain in relation to a tempo which remaps the time measured in seconds. A common unit for this is beats per minute (bpm) where 60 bpm maps each second to a beat - if we want to shorten the time between two beats (commonly refered to as faster) we increase the bpm and vice versa.

Another instance for the segmentation of events in time is the time signature of a track.

One notation which tries to simplify this notation is in the form of a step seqencer in which we snap notes to the nearest step on a grid.

For this we can use note_seq with the function midi_file_to_drum_track.

Conclusion¶

We filtered out all noisy examples of our dataset and transformed the drum track into a mathematical, simplified notation.

In the next step we will start to analyse the drum patterns that we extracted.

Musikinformatik SoSe2021