Extracting information from MIDI files

In our first lesson we want to get to know how to work with datasets so we can

  • parse a dataset

  • analyse the dataset and make sure it matches our expectations

  • check if errors occurred during parsing

After this is done, we want to find out how we can generate new drum patterns from the existing ones. But in order to do this, we first need to take a look at the quantisation of our patterns.

In such experiments, the cleaning of the dataset and setting the data up properly takes most of the time. But if we make mistakes here those mistakes will propagate through our system – so it’s a good idea to spend some time with this task.

import os
import glob

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # makes the randomness deterministic

%matplotlib inline
# todo: try %matplotlib widget
plt.rcParams['figure.figsize'] = (15, 5)
plt.rcParams['axes.grid'] = True

Getting the dataset

As our methods in this machine-learning-based workshop rely on data, we need a way to obtain such data. This will be one of our first endeavours. Thankfully, there are search engines which help us to find data on the internet easily. When searching for “midi dataset”, one search result is https://colinraffel.com/projects/lmd/.

The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. […]

Of course the cultural bias of such a dataset is a thing we need to be aware of. What kind of music is transcribable into the MIDI format and what kind of music is transcribed as MIDI at all? Maybe we can shed some light on the last question by inspecting the dataset. But before we can do this, we need to download and understand the data.

We use a function that checks whether the files have already been downloaded and, if not, downloads the archive and extracts it.

import urllib.request
import subprocess

def download_dataset(download_path: str = "../datasets/lmd"):
    os.makedirs(download_path, exist_ok=True)
    archive_dir = os.path.join(download_path, "lmd_full")
    
    dl_files = {
        "midi": {
            "path": os.path.join(download_path, "lmd_full.tap.gz"),
            "url": "http://hog.ee.columbia.edu/craffel/lmd/lmd_full.tar.gz"
        },
        "json": {
            "path": os.path.join(download_path, "md5_to_path.json"),
            "url": "http://hog.ee.columbia.edu/craffel/lmd/md5_to_paths.json"
        },
    }
    
    for dl_name, dl in dl_files.items():
        if os.path.isfile(dl["path"]) or (dl_name == "midi" and os.path.isdir(archive_dir)):
            print(f"{dl_name} already downloaded to {dl['path']}")
            continue
        print(f"Start downloading {dl_name} to {dl['path']} - this can take multiple minutes!")
        urllib.request.urlretrieve(dl['url'], dl['path'])
        print(f"Finished downloading")
    
    if not os.path.isdir(archive_dir):
        print("Start extracting the files of archive - this will take some minutes")
        # todo: windows has no tar
        subprocess.check_output([
            'tar', '-xzf', dl_files["midi"]["path"],
            '-C', os.path.join(download_path)
        ])
        print("Finished extracting")
    
    if os.path.isfile(dl_files["midi"]["path"]):
        print("Remove archive")
        os.remove(dl_files["midi"]["path"])

download_dataset()
midi already downloaded to ../datasets/lmd/lmd_full.tap.gz
json already downloaded to ../datasets/lmd/md5_to_path.json
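
The extraction step above calls the external tar program, which may be missing on some systems (see the todo in the code). As a possible alternative, here is a minimal sketch using Python's built-in tarfile module. The helper name extract_archive is made up for this illustration and is not part of the original notebook. Archives should only be extracted this way if they come from a trusted source.

```python
import os
import tarfile


def extract_archive(archive_path: str, target_dir: str) -> None:
    """Extract a gzip-compressed tar archive using only the standard library."""
    os.makedirs(target_dir, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)
```

Inside download_dataset, a call such as extract_archive(dl_files["midi"]["path"], download_path) could take the place of the subprocess call.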

Parsing the dataset

When working with large sets of files, the Python module glob, which is modelled on Unix shell pattern matching, comes in handy: we can describe the pattern of the file paths we want to match instead of listing all files.

When we take a quick look at the file paths, they all seem to follow a structure like

../datasets/lmd/lmd_full/2/4a0cbb3f083d14d57858c87b26f85873.mid

Soon we will understand why the filename has this cryptic format, but for now we simply want to parse all available files into an array.

midi_files = glob.glob('../datasets/lmd/lmd_full/*/*.mid')
print(f'Found {len(midi_files)} midi files in dataset')
Found 178561 midi files in dataset
# select 5 random files
np.random.choice(midi_files, 5)
array(['../datasets/lmd/lmd_full/4/4a69b1a10e4dbe409ac922de4f6256d8.mid',
       '../datasets/lmd/lmd_full/b/b102261b4c27ea58bad4777b5df5be5e.mid',
       '../datasets/lmd/lmd_full/3/3a8ccaab480919c35f37fdf08c238e29.mid',
       '../datasets/lmd/lmd_full/d/dc6416232e05b3d44fb62007ca84b474.mid',
       '../datasets/lmd/lmd_full/4/4fb744a04c1afbb0309b581410b33363.mid'],
      dtype='<U63')

But how can we be sure that we matched every MIDI file in our folder? We can simply match every file in our dataset directory and display the difference between the two results by using sets.

missed_files = set(glob.glob('../datasets/lmd/lmd_full/**/*.*', recursive=True)) - set(midi_files)
print(f'Did not match {len(missed_files)} files')
Did not match 0 files

Thankfully we have a rather clean dataset here which does not force us to do file acrobatics.
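
To see what this check would report on a less clean dataset, here is a small self-contained sketch that builds a throwaway directory and runs the same set difference on it. The function name find_unmatched and the file names are invented for this illustration.

```python
import glob
import os
import tempfile


def find_unmatched(root: str, pattern: str = "*/*.mid") -> set:
    """Return all files below root that the given glob pattern did not match."""
    matched = set(glob.glob(os.path.join(root, pattern)))
    # '**' together with recursive=True descends into all subdirectories
    everything = set(glob.glob(os.path.join(root, "**", "*.*"), recursive=True))
    return everything - matched


with tempfile.TemporaryDirectory() as root:
    # tiny fake dataset: one regular file, one nested file, one stray text file
    for rel in ["a/one.mid", "a/deeper/two.mid", "b/notes.txt"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    unmatched = sorted(os.path.relpath(p, root) for p in find_unmatched(root))
    print(unmatched)
```

Here the nested MIDI file and the text file are reported, because neither of them fits the one-folder-deep pattern.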

But we did not just download MIDI files, we also downloaded a JSON file which we will now take a look into.

import json

with open("../datasets/lmd/md5_to_path.json") as f:
    md5_filenames = json.load(f)

example_midi_file = '../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid'
print(f"Chosen {example_midi_file} as example MIDI file")
# with split we remove the folder name and the file extension
md5_filenames[example_midi_file.split('/')[-1].split('.')[0]]
Chosen ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid as example MIDI file
['QUALITY MIDI/EClapton-Promises.mid']

So why do we have an additional file which tells us the original filenames? This is actually a good trick to avoid duplicates in the dataset: we use a so-called hash function, which returns a cryptic string that depends only on the data we hash. Two files with identical content result in the same hash output, so if we simply rename each file to its hash output, duplicates end up under the same name and are detected and removed automatically. This is common practice if the data was obtained by scraping multiple sources.

Let’s check whether we can verify this claim.

import hashlib

with open(example_midi_file, 'rb') as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)
        
print(f"Hash for {example_midi_file} is {file_hash.hexdigest()}")
Hash for ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid is ed05335f8f6c273f506a997137b2805b

This also gives us a useful piece of additional information: a popular MIDI file will probably have many different names, as it was scraped from multiple sources which probably changed the filename a bit.
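
The following sketch shows the deduplication idea on a few temporary files. It is only an illustration of the principle under made-up file names, not the procedure that was actually used to build the dataset. Files with identical bytes end up under the same hash, and the mapping from hash to original names resembles the structure of the JSON file we loaded above.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def md5_of_file(path: str, chunk_size: int = 8192) -> str:
    """Hash a file in chunks so large files need not fit into memory."""
    file_hash = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            file_hash.update(chunk)
    return file_hash.hexdigest()


def group_by_hash(paths):
    """Map each content hash to the list of file names that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(os.path.basename(path))
    return dict(groups)


with tempfile.TemporaryDirectory() as tmp:
    # hypothetical file names; two of them carry identical bytes
    contents = {
        "song.mid": b"same bytes",
        "SONG_copy.mid": b"same bytes",
        "other.mid": b"different",
    }
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    groups = group_by_hash(paths)
    print(groups)
```

Three files go in, but only two hashes come out, and one of them lists both names of the duplicated content.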

Loading MIDI files in Python

Now that we have all file paths available, it is good practice to take a first look at the data. We need to understand how the information we are interested in can be accessed. We should also expect that the data is not clean and standardized.

As there is no built-in functionality in Python to work with MIDI data, we need to rely on a library. There are numerous libraries for working with MIDI data in Python, but we will rely on pretty_midi to inspect the MIDI files.

As we have quite a lot of data to process, it is worth some research on how we can reduce the amount of data before we process the whole dataset and bring it into a format we can work with. Here, we will concentrate on percussion tracks. From the Wikipedia article on General MIDI (GM) we can see that they should be on channel 10, so for now we will simply ignore the other tracks.

In a former version of this document we used the library music21 for parsing and loading of our MIDI files but it turns out that music21 is too slow for our big dataset.

We also start annotating our dataset using pandas, a library for data analysis and manipulation. We can use it to keep track of the metadata of the files, and it will also help us with the analysis of that metadata.

Additionally, we will use caching because the meta-analysis of our MIDI files takes about 2 hours on a recent MacBook. Caching means that we store the results on the hard drive, so once we have calculated them we do not need to recalculate them and can simply load them from disk. It is common practice in data science to write a function which either loads the data from a file or generates that file, as generating or calculating the data is often a slow step.
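
The load-or-compute pattern can be sketched in a few lines. This is a simplified illustration that uses JSON as a stand-in storage format. The names load_or_compute and slow_analysis are hypothetical.

```python
import json
import os
import tempfile


def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists, otherwise compute and store it."""
    if os.path.isfile(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    result = compute()
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def slow_analysis():
    # stand-in for an expensive computation; records that it was executed
    calls.append(1)
    return {"num_files": 3}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "meta.json")
    first = load_or_compute(cache_file, slow_analysis)   # computes and writes the cache
    second = load_or_compute(cache_file, slow_analysis)  # only reads the cache
    print(first, second, f"computed {len(calls)} time(s)")
```

One caveat of this simple approach is that the cache is never invalidated: if the computation changes, the cache file has to be deleted by hand.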

We will start by inspecting how we can access MIDI information from within pretty_midi.

import pretty_midi as pm

midi_stream = pm.PrettyMIDI(example_midi_file)
midi_stream
<pretty_midi.pretty_midi.PrettyMIDI at 0x159b7d760>

Now that we have loaded a MIDI file with pretty_midi, we take a look at how we can access its data. The documentation of pretty_midi helps us to find out how to access the MIDI information.

midi_stream.get_end_time()
189.431068

gives us the length of the MIDI file in seconds

midi_stream.get_beats()[0:50]  # limit to 50 entries for printing
array([0.0000000e+00, 1.2500000e-02, 4.4107100e-01, 8.6964200e-01,
       1.2982130e+00, 1.7267840e+00, 2.1553550e+00, 2.5839260e+00,
       3.0124970e+00, 3.4410680e+00, 3.8410680e+00, 4.2410680e+00,
       4.6410680e+00, 5.0410680e+00, 5.4410680e+00, 5.8410680e+00,
       6.2410680e+00, 6.6410680e+00, 7.0410680e+00, 7.4410680e+00,
       7.8410680e+00, 8.2410680e+00, 8.6410680e+00, 9.0410680e+00,
       9.4410680e+00, 9.8410680e+00, 1.0241068e+01, 1.0641068e+01,
       1.1041068e+01, 1.1441068e+01, 1.1841068e+01, 1.2241068e+01,
       1.2641068e+01, 1.3041068e+01, 1.3441068e+01, 1.3841068e+01,
       1.4241068e+01, 1.4641068e+01, 1.5041068e+01, 1.5441068e+01,
       1.5841068e+01, 1.6241068e+01, 1.6641068e+01, 1.7041068e+01,
       1.7441068e+01, 1.7841068e+01, 1.8241068e+01, 1.8641068e+01,
       1.9041068e+01, 1.9441068e+01])

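gives us the position in seconds of every beat in the MIDI file. The beats are derived from the tempo changes of the file, so they form a time grid and are not the positions of the notes themselves.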

midi_stream.get_tempo_changes()
(array([0.      , 0.0125  , 3.441068]),
 array([120.     , 140.00014, 150.     ]))

tells us about any changes of the tempo in bpm. Keep in mind that the result is not a list of two tempo changes: the function returns two arrays, the first with the times in seconds at which the tempo changes and the second with the tempo in bpm that it changes to. For now we are only interested in the second array.

We should also prepare for the case that the MIDI file does not provide any tempo information.
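
To make the structure of the two arrays more tangible, the following sketch pairs each change time with its tempo and computes how long one beat lasts at that tempo (60 seconds divided by the bpm value). It also falls back to 120 bpm, the default tempo of a MIDI file, when no tempo information is present. The function name tempo_segments is made up for this example, and plain lists are used instead of the arrays returned by pretty_midi.

```python
def tempo_segments(change_times, tempi, default_bpm=120.0):
    """Pair each tempo change time with its tempo and the beat length in seconds."""
    if len(tempi) == 0:
        # no tempo information available: assume the MIDI default tempo
        change_times, tempi = [0.0], [default_bpm]
    return [
        (float(t), float(bpm), 60.0 / float(bpm))
        for t, bpm in zip(change_times, tempi)
    ]


# values copied from the get_tempo_changes() output above
segments = tempo_segments([0.0, 0.0125, 3.441068], [120.0, 140.00014, 150.0])
for start, bpm, beat_length in segments:
    print(f"from {start:.4f} s: {bpm:.2f} bpm, one beat lasts {beat_length:.6f} s")
```

The beat lengths match the spacing of the beat positions printed earlier: about 0.4286 s between the beats before 3.44 s and 0.4 s after that.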

midi_stream.time_signature_changes
[TimeSignature(numerator=4, denominator=4, time=0.0125),
 TimeSignature(numerator=2, denominator=4, time=21.041068000000003),
 TimeSignature(numerator=4, denominator=4, time=21.841068000000003),
 TimeSignature(numerator=2, denominator=4, time=36.241068000000006),
 TimeSignature(numerator=4, denominator=4, time=37.041068),
 TimeSignature(numerator=2, denominator=4, time=51.441068),
 TimeSignature(numerator=4, denominator=4, time=52.241068000000006),
 TimeSignature(numerator=2, denominator=4, time=66.641068),
 TimeSignature(numerator=4, denominator=4, time=67.441068),
 TimeSignature(numerator=2, denominator=4, time=107.441068),
 TimeSignature(numerator=4, denominator=4, time=108.24106800000001),
 TimeSignature(numerator=2, denominator=4, time=122.641068),
 TimeSignature(numerator=4, denominator=4, time=123.44106800000002)]

lists the time signature changes of the MIDI file, each with the time in seconds at which it takes effect.

time_signature = midi_stream.time_signature_changes[0]
print(f'{time_signature.numerator}/{time_signature.denominator}')
4/4

As with the tempo we should also prepare for the case that no time signature information is given on the MIDI file.

midi_stream.instruments
[Instrument(program=62, is_drum=False, name=""),
 Instrument(program=34, is_drum=False, name=""),
 Instrument(program=25, is_drum=False, name=""),
 Instrument(program=28, is_drum=False, name=""),
 Instrument(program=25, is_drum=False, name=""),
 Instrument(program=30, is_drum=False, name=""),
 Instrument(program=26, is_drum=False, name=""),
 Instrument(program=18, is_drum=False, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name=""),
 Instrument(program=0, is_drum=True, name="")]

gives us access to all instruments with their names and tells us whether each one is a drum track or not.

We can also take a look at the notes of each instrument

midi_stream.instruments[0].notes[0:10]
[Note(start=9.834401, end=9.937735, pitch=62, velocity=109),
 Note(start=10.037735, end=10.311068, pitch=62, velocity=107),
 Note(start=10.437735, end=10.847735, pitch=62, velocity=113),
 Note(start=10.834401, end=11.057735, pitch=60, velocity=109),
 Note(start=11.034401, end=11.237735, pitch=59, velocity=106),
 Note(start=11.237735, end=11.594401, pitch=62, velocity=108),
 Note(start=11.627735, end=11.834401, pitch=60, velocity=101),
 Note(start=11.827735, end=12.247735, pitch=59, velocity=108),
 Note(start=12.224401, end=12.544401, pitch=62, velocity=106),
 Note(start=13.024401, end=13.131068, pitch=64, velocity=102)]
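
Since we want to concentrate on percussion, the is_drum flag is what lets us pick out the relevant instruments. The following sketch uses small stand-in classes that only mirror the fields printed above, so that it runs without a MIDI file. The data is made up, and collect_drum_notes is a hypothetical helper rather than a pretty_midi function.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Note:  # stand-in mirroring the fields of the notes printed above
    start: float
    end: float
    pitch: int
    velocity: int


@dataclass
class Instrument:  # stand-in mirroring the fields of the instruments printed above
    program: int
    is_drum: bool
    name: str = ""
    notes: List[Note] = field(default_factory=list)


def collect_drum_notes(instruments):
    """Gather the notes of all drum instruments, ordered by start time."""
    notes = [note for inst in instruments if inst.is_drum for note in inst.notes]
    return sorted(notes, key=lambda note: note.start)


# made-up example data, not taken from the dataset
instruments = [
    Instrument(program=62, is_drum=False, notes=[Note(9.83, 9.94, 62, 109)]),
    Instrument(program=0, is_drum=True, notes=[Note(0.5, 0.6, 38, 100)]),
    Instrument(program=0, is_drum=True, notes=[Note(0.0, 0.1, 36, 110)]),
]
drum_pitches = [note.pitch for note in collect_drum_notes(instruments)]
print(drum_pitches)
```

In the General MIDI percussion map, pitch 36 is a bass drum and pitch 38 an acoustic snare. As the helper only relies on the attributes is_drum, notes and start, it should in principle also accept midi_stream.instruments.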

Performance

Now we know how to use pretty_midi to load a single MIDI file and access its information. The problem is that loading a MIDI file is pretty slow in Python, so we will now take some measurements and look for a quicker way to access the information we are interested in.

We can use %%timeit, a built-in tool of Jupyter, which runs a cell multiple times and measures the time each run takes.

%%timeit

pm.PrettyMIDI(example_midi_file)
186 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Of course the time it takes to parse a MIDI file depends on the complexity of the file. But now it's time to do some maths. For the sake of simplicity we will assume, somewhat optimistically given the measurement above, that it takes around 100 ms to load a MIDI file.

\[ 0.1 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 17800 \text{ sec} \approx 297 \text{ minutes} \approx 5 \text{ hours}\]

Note that this assumes that we do not process multiple files in parallel, which is possible but also tricky. Also, this only covers loading the data; we have not yet accessed any of it.
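
Such back-of-the-envelope estimates are easy to wrap in a small helper, so that we can repeat them for other timings. The sketch below assumes perfect scaling across workers, which real parallel code will not reach. The name estimate_hours is made up.

```python
def estimate_hours(seconds_per_file: float, num_files: int, workers: int = 1) -> float:
    """Rough wall-clock estimate in hours, assuming perfect scaling across workers."""
    total_seconds = seconds_per_file * num_files / workers
    return total_seconds / 3600


print(f"{estimate_hours(0.1, 178_000):.1f} h sequentially")
print(f"{estimate_hours(0.1, 178_000, workers=8):.1f} h with 8 ideal workers")
```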

Maybe there is a file format in which we can store the MIDI files and load them more quickly whenever we need them. note_seq from the Magenta project provides such a format, called note sequence, which interacts nicely with pretty_midi.

import note_seq
%%timeit

notes = note_seq.midi_to_note_sequence(midi_stream)
14.4 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So the conversion of this file is fast compared to the parsing of our MIDI file. The problem is that we still need the MIDI file parsed and we have not yet taken a look at the saving and loading of a note sequence.

note_seq uses a special format for this called protobuf (Protocol Buffers), a binary format for storing and exchanging data developed by Google. The advantage is that it is fast and efficient, and that the files can be read from other programming languages. Think of it as a binary version of JSON with type safety included.

We start by saving the note sequence as a protobuf file and then try loading it again from our hard drive.

notes = note_seq.midi_to_note_sequence(midi_stream)

with open('note_seq_test.protobuf', 'wb') as f:
    f.write(notes.SerializeToString())
%%timeit

with open('note_seq_test.protobuf', 'rb') as f:
    notes_loaded = note_seq.NoteSequence.FromString(f.read())
3.47 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let's say it takes about 4 ms to load a protobuf file. Now we repeat the calculation from before.

\[ 0.004 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 712 \text{ sec} \approx 12 \text{ minutes}\]

which is much more acceptable. And we have not yet considered any parallelisation of the code.

But before we convert everything to a note sequence as a protobuf we should compare the file size of the two file formats.

for f in [example_midi_file, 'note_seq_test.protobuf']:
    print(f'File size of {f}: {os.path.getsize(f)/1024} kbytes')
File size of ../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid: 46.6513671875 kbytes
File size of note_seq_test.protobuf: 195.958984375 kbytes

It turns out that we are trading storage space for computation time.

Before we can decide if we really want to make this trade, we need to take a look at the total size of our MIDI dataset. For this we can use a feature of Jupyter that executes a command in a shell when we put a ! in front of it. In order to get the size of a directory we will use the Unix tool du with the options -s (sum everything within the path) and -h (convert the number of bytes to a human-readable format).

!du -sh ../datasets/lmd/lmd_full
5.9G	../datasets/lmd/lmd_full

We have around 6 gigabytes of files here. When we convert them to protobuf we will blow them up by a factor of \(\approx 4\), so we will end up with about 24 GB of protobuf files. This is still acceptable.
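
On systems without du, the directory size can also be determined from within Python. The following sketch sums the file sizes below a directory and recomputes the growth factor from the two file sizes measured above. The function name directory_size_bytes is made up, and the result is the apparent size of the files, which can differ slightly from the disk usage that du reports.

```python
import os
import tempfile


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below root (apparent size, not disk blocks)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


with tempfile.TemporaryDirectory() as tmp:
    # small demo directory with two files of known size
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"\0" * 1000)
    with open(os.path.join(tmp, "sub", "b.bin"), "wb") as f:
        f.write(b"\0" * 24)
    demo_size = directory_size_bytes(tmp)

# growth factor measured on the example file above (sizes in kbytes)
factor = 195.958984375 / 46.6513671875
print(f"demo directory: {demo_size} bytes, protobuf/MIDI size factor: {factor:.1f}")
```

Keep in mind that the factor comes from a single example file, so the real size of the converted dataset may deviate from this estimate.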

Processing all files

Now that we know how to load, access and convert the files to a nicer format, we still need to take a look at how we can do this for our 180k files, and how a little bit of programming effort may save us some time.

We start by writing a function which combines everything we did before and returns all necessary information as a dictionary.

PROTO_SAVE_PATH = '../datasets/lmd/proto/'

# make sure that the folder we want to save to actually exists
os.makedirs(PROTO_SAVE_PATH, exist_ok=True)
from typing import Dict

from mido import KeySignatureError

def parse_midi_file(midi_file_path: str, proto_save_path: str=PROTO_SAVE_PATH) -> Dict:
    r = {
        'midi_path': midi_file_path,
        'midi_error': True,
    }
    midi_name = os.path.splitext(midi_file_path.split(os.sep)[-1])[0]
    proto_path = os.path.join(proto_save_path, f'{midi_name}.protobuf')
    
    r['original_names'] = md5_filenames[midi_name]
    
    try:
        stream = pm.PrettyMIDI(midi_file_path)
        r['midi_error'] = False
    except (OSError, ValueError, IndexError, KeySignatureError, EOFError, ZeroDivisionError):
        return r
    
    try:
        # r['beat_start'] = stream.estimate_beat_start() # omitted b/c adds 200ms to parsing
        r['estimate_tempo'] = stream.estimate_tempo()
        r['tempi_sec'], r['tempi'] = stream.get_tempo_changes() 
        r['end_time'] = stream.get_end_time()
        r['drums'] = any([i.is_drum for i in stream.instruments])
        r['resolution'] = stream.resolution
        r['instrument_names'] = [i.name.strip() for i in stream.instruments]
        r['num_time_signature_changes'] = len(stream.time_signature_changes)
    except ValueError as e:
        # ValueError: Can't estimate beat start when there are no notes.
        # ValueError: Can't provide a global tempo estimate when there are fewer than two notes.
        print(f"Could not parse MIDI file {midi_file_path}: {e}")
    
    notes = note_seq.midi_to_note_sequence(stream)
    with open(proto_path, 'wb') as f:
        f.write(notes.SerializeToString())
    r['proto_path'] = proto_path
    
    return r

Now we want to try out the function on a single MIDI file.

parse_midi_file(example_midi_file)
{'midi_path': '../datasets/lmd/lmd_full/e/ed05335f8f6c273f506a997137b2805b.mid',
 'midi_error': False,
 'original_names': ['QUALITY MIDI/EClapton-Promises.mid'],
 'estimate_tempo': 159.6553471870248,
 'tempi_sec': array([0.      , 0.0125  , 3.441068]),
 'tempi': array([120.     , 140.00014, 150.     ]),
 'end_time': 189.431068,
 'drums': True,
 'resolution': 120,
 'instrument_names': ['', '', '', '', '', '', '', '', '', '', '', '', ''],
 'num_time_signature_changes': 13,
 'proto_path': '../datasets/lmd/proto/ed05335f8f6c273f506a997137b2805b.protobuf'}

So this works as expected - time to time it as well.

%%timeit

parse_midi_file(example_midi_file)
233 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It seems that accessing the MIDI data via pretty_midi does not come for free. Repeating the calculation from before:

\[ 0.2 \frac{\text{sec}}{\text{file}} * 178000 \text{ files} = 35600 \text{ sec} \approx 593 \text{ minutes} \approx 10 \text{ hours}\]

This is not unheard of, but we can reduce the time by parallelizing the work. This requires some tricks which we will not go into in detail. On a recent i9 processor the next step takes around 90 minutes when executed for the first time.

import math
import multiprocessing
from typing import Optional

from helpers.helpers import *

def load_midi_files(parquet_path: str = 'midi_files_{}.parquet', limit: Optional[int] = None) -> pd.DataFrame:
    parquet_path = parquet_path.format(limit if limit else 'full')
    
    if os.path.isfile(parquet_path):
        print(f'Use cached parquet file {parquet_path}')
        return pd.read_parquet(parquet_path)
    
    midi_files = glob.glob('../datasets/lmd/lmd_full/*/*.mid')
    if limit:
        print(f'Limit to {limit} files')
        midi_files = np.random.choice(midi_files, limit, replace=False)  # sample without duplicates
    print(f'Parse {len(midi_files)} midi files')
    
    cpu_count = math.ceil(multiprocessing.cpu_count()*3/4)
    with multiprocessing.Pool(cpu_count, maxtasksperchild=5) as p:
        midi_meta = p.map(parse_midi_file_async, midi_files, chunksize=255)
    print(f'Finished parsing midi files')
    
    midi_meta = pd.DataFrame(midi_meta)
    midi_meta.to_parquet(parquet_path)
    
    return midi_meta

# to work with a random subset, set the limit argument to the number of files to sample
midi_df = load_midi_files(limit=None)

midi_df.sample(15)
Use cached parquet file midi_files_full.parquet
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path
81923 ../datasets/lmd/lmd_full/f/f4b72836981feedb9e1... False 175.232953 [0.0] [82.00003553334874] 128.780432 True 384.0 [harmnica, accordn2, drum, acoubass, pianogrd,... 1.0 ../datasets/lmd/proto/f4b72836981feedb9e1046f1...
171833 ../datasets/lmd/lmd_full/5/5f3eaf75cbc1b2554df... False 213.043443 [0.0, 297.04290665, 297.1153704, 297.188899816... [69.99998833333528, 69.00001725000433, 67.9999... 302.411045 True 120.0 [YO TE AMO;CHAYANE, YO TE AMO;CHAYANE, YO TE A... 1.0 ../datasets/lmd/proto/5f3eaf75cbc1b2554dfa60bd...
161501 ../datasets/lmd/lmd_full/2/2f02103cfb1764e661d... False 170.691458 [0.0, 0.004166666666666667, 2.0015625] [75.0, 120.0, 75.0] 174.718229 True 384.0 [, , , , , , , , , , , , , , , ] 1.0 ../datasets/lmd/proto/2f02103cfb1764e661d54151...
39942 ../datasets/lmd/lmd_full/6/609b19dde84d2374b7d... False 237.529957 [0.0, 0.0026041666666666665] [120.0, 110.00091667430561] 254.776049 True 384.0 [S-File, S-File, S-File, S-File, S-File, S-Fil... 1.0 ../datasets/lmd/proto/609b19dde84d2374b7d19124...
100593 ../datasets/lmd/lmd_full/d/de4dde0f1e960e41360... False 170.384180 [0.0, 3.0, 183.33574275, 183.66386775] [120.0, 67.00002903334591, 60.0, 49.0000318500... 190.137078 False 192.0 [, , , , , ] 0.0 ../datasets/lmd/proto/de4dde0f1e960e413602c493...
44272 ../datasets/lmd/lmd_full/6/6f74186f4c163863630... False 242.119770 [0.0] [122.00006913337249] 228.161756 True 480.0 [drummix, bass, ac piano, Vocal, back voc, ac ... 1.0 ../datasets/lmd/proto/6f74186f4c163863630e8baa...
15131 ../datasets/lmd/lmd_full/0/0d8e9c4eb761a3e632b... False 245.724188 [0.0, 3.9344240000000004, 14.843497999999999, ... [122.00006913337249, 121.00018755029072, 123.0... 251.814031 True 96.0 [TAKE ME HOME Bextor/Aller/... 1.0 ../datasets/lmd/proto/0d8e9c4eb761a3e632ba4f10...
34735 ../datasets/lmd/lmd_full/6/6d3149b41d088c63627... False 84.000025 [0.0, 89.2856875, 89.88351320833333, 90.410473... [42.00001260000378, 46.00002913335179, 51.0000... 340.282369 False 96.0 [Voice (Alto), Voice (Alto), Voice (Alto), Pia... 6.0 ../datasets/lmd/proto/6d3149b41d088c6362792faf...
45905 ../datasets/lmd/lmd_full/1/1a931072e21340f256b... False 207.446994 [0.0] [120.0] 280.114583 True 384.0 [A.PIANO 1, FINGERDBAS, A.PIANO 1, PAN FLUTE, ... 1.0 ../datasets/lmd/proto/1a931072e21340f256b96104...
12844 ../datasets/lmd/lmd_full/0/0a7c38887ba4882921e... False 224.582071 [0.0] [127.96996971377385] 193.497545 True 480.0 [CANDOMBE, PARA, GARDEL, http://fberni.tripod.... 1.0 ../datasets/lmd/proto/0a7c38887ba4882921ee2c82...
114123 ../datasets/lmd/lmd_full/4/455fcb4f4c5ad4d10da... False 153.020774 [0.0, 7.15475625, 7.3302190000000005, 7.506771... [85.95680670463092, 85.48823040787859, 84.9608... 100.750126 False 1024.0 [] 1.0 ../datasets/lmd/proto/455fcb4f4c5ad4d10da384b6...
14339 ../datasets/lmd/lmd_full/0/029c4baa3089eca3817... False 199.999987 [0.0, 6.461539, 7.961539, 9.790110733333334, 9... [64.99999458333379, 40.0, 69.99998833333528, 6... 330.377789 True 120.0 [, Bass, Bass, Strings, Melodia, Slowstrings, ... 5.0 ../datasets/lmd/proto/029c4baa3089eca381772e2b...
131099 ../datasets/lmd/lmd_full/3/353116ea8d2eb4041c9... False 260.000260 [0.0] [130.00013000013] 51.692256 True 96.0 [MIDI Ch. 1, MIDI Ch. 2, MIDI Drums] 0.0 ../datasets/lmd/proto/353116ea8d2eb4041c9f4622...
124606 ../datasets/lmd/lmd_full/3/3574baf9f9cdfa4c3e1... False 207.262310 [0.0, 0.16129033333333334, 0.9425403333333333,... [30.999997933333468, 32.0, 32.99999670000033, ... 273.626126 True 120.0 [Rhodes Piano, Synth Bass H, Synth Bass L, Aco... 1.0 ../datasets/lmd/proto/3574baf9f9cdfa4c3e1da456...
136078 ../datasets/lmd/lmd_full/e/eb59c9f6854d897a03a... False 193.775528 [0.0, 3.5416666666666665] [120.0, 109.99990833340973] 217.860027 True 240.0 [Bass / Acoustic bass, Piano / Rhodes, Melody ... 1.0 ../datasets/lmd/proto/eb59c9f6854d897a03a3ed78...
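
The helper parse_midi_file_async used above comes from the helpers module, which is not shown here, so we cannot say what exactly it does. One common pattern for this kind of parallel job is a wrapper that never raises, so that a single broken file cannot abort the whole pool. The following sketch illustrates that pattern with a stand-in parser. All names in it are hypothetical.

```python
import math
import multiprocessing


def parse_one(path: str) -> dict:
    """Stand-in for an expensive parser; raises for paths ending in '.broken'."""
    if path.endswith(".broken"):
        raise ValueError("cannot parse file")
    return {"path": path, "length": len(path)}


def parse_safely(path: str) -> dict:
    """Never raise: report failures as data so one bad file cannot abort the pool."""
    try:
        return {**parse_one(path), "error": None}
    except Exception as exc:  # deliberately broad for this sketch
        return {"path": path, "error": repr(exc)}


def parse_all(paths, workers=None):
    """Distribute paths over worker processes and collect one dict per path."""
    # like above, use roughly three quarters of the available cores by default
    workers = workers or max(1, math.ceil(multiprocessing.cpu_count() * 3 / 4))
    with multiprocessing.Pool(workers) as pool:
        return pool.map(parse_safely, paths, chunksize=64)
```

When run as a script, parse_all should be called from within an `if __name__ == "__main__":` block, because on platforms that start worker processes by spawning, the main module is imported again in each worker.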

We now also add the original file names from the md5 json file to our dataframe.

original_files = []
for _, midi_path in midi_df['midi_path'].items():  # iteritems() was removed in pandas 2.0
    midi_name = midi_path.split(os.sep)[-1].split('.')[0]
    original_files.append(md5_filenames[midi_name])

midi_df['original_files'] = original_files

midi_df.sample(15)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
31900 ../datasets/lmd/lmd_full/7/745f66980de9ca27fe0... False 169.981663 [0.0] [151.00037750094376] 141.733917 True 96.0 [Acc Bass, Piano, Melody, Sax, Solo, E.Guitar,... 1.0 ../datasets/lmd/proto/745f66980de9ca27fe055cc6... [BonJovi/Runaway3.mid, MikeDoyle/MyLittleRunAw...
33585 ../datasets/lmd/lmd_full/6/68b145d7ffdf4c43e1f... False 204.472843 [0.0] [99.99999999999999] 163.656250 True 96.0 [Remelexo, Remelexo, Remelexo, Remelexo, Remel... 2.0 ../datasets/lmd/proto/68b145d7ffdf4c43e1f5f59c... [C/Cesar E Paulinho - Remelexo.mid, Midis Româ...
67436 ../datasets/lmd/lmd_full/a/a373ed7cd11fb80fac3... False 159.432272 [0.0] [159.0002067002687] 154.150743 True 192.0 [Crash Cymbal, Ride Cymbal, Open HiHat, Closed... 1.0 ../datasets/lmd/proto/a373ed7cd11fb80fac3dca06... [t/trust_me.mid, T/trust_me.mid]
142153 ../datasets/lmd/lmd_full/e/ec2fff2c295943aedf4... False 194.483086 [0.0] [88.70005632453577] 259.977287 True 120.0 [You Can Leave Your Hat On, You Can Leave Your... 1.0 ../datasets/lmd/proto/ec2fff2c295943aedf4797c3... [Y/You Can Leave Your Hat on Pm L.mid, Y/You C...
127493 ../datasets/lmd/lmd_full/3/3fae950c18f739286ec... False 160.190137 [0.0] [160.0] 100.425000 True 480.0 [Sax, Sax, French Horn, Piano, Bass, Drums] 1.0 ../datasets/lmd/proto/3fae950c18f739286ec42107... [2009 MIDI/barbra_Ann2-Regents-F160.mid]
138301 ../datasets/lmd/lmd_full/e/ec16ec22a846e4d4d91... False 164.382517 [0.0, 186.66648] [162.000162000162, 120.0] 186.666480 True 120.0 [Vocal, ElecGtr 1, ElecGtr 2, PickedBass, Orga... 1.0 ../datasets/lmd/proto/ec16ec22a846e4d4d91959cc... [c/chaingm.mid, C/chaingm.mid]
83006 ../datasets/lmd/lmd_full/f/fd8a6fcf912a7ddc602... False 160.171123 [0.0] [80.0] 274.525000 True 120.0 [Corazon Partio, Corazon Partio, Corazon Parti... 7.0 ../datasets/lmd/proto/fd8a6fcf912a7ddc60242146... [C/Corazon Partio Pm L.mid, Midis Latinas/Cora...
13575 ../datasets/lmd/lmd_full/0/0dff276c070ccc16e2b... False 203.014003 [0.0] [190.0002850004275] 48.976900 True 96.0 [Bop on the Rocks, Bop on the Rocks, Bop on th... 3.0 ../datasets/lmd/proto/0dff276c070ccc16e2bd6fa1... [Jazz/Bop on.mid, b/bop-on.mid, Jazz/BOP_ON.MI...
134536 ../datasets/lmd/lmd_full/e/ed793dadf58b913ea24... False 192.000192 [0.0] [140.00014000014] 6.861600 False 96.0 [z3ta+ (MIDI)] 0.0 ../datasets/lmd/proto/ed793dadf58b913ea2460ad1... [M/Machinehead - Headwave (Zatox Mix).mid, M/M...
116286 ../datasets/lmd/lmd_full/4/4c4d48e046e6f5a2fc2... False 255.499940 [0.0] [145.9999659333413] 6.575344 False 96.0 [MIDI out] 0.0 ../datasets/lmd/proto/4c4d48e046e6f5a2fc2f89cb... [J/jan_johnston__flesh__tilt_remix__bamford.mi...
126553 ../datasets/lmd/lmd_full/3/312b4e4b6f9bbd3abe3... False 192.617363 [0.0] [96.0] 87.500000 True 384.0 [ACOU BASS, A.PIANO 1, A.PIANO 1, JAZZ GTR, SL... 2.0 ../datasets/lmd/proto/312b4e4b6f9bbd3abe3af2bb... [Diversen/Strike-Up-The-Band.mid, DIVERSEN/STR...
150997 ../datasets/lmd/lmd_full/b/bf3f61d4bf0c7a99c7c... False 196.007331 [0.0] [98.0036653370836] 364.744011 True 384.0 [SHOUT, SHOUT, SHOUT, SHOUT, SHOUT, SHOUT, SHO... 1.0 ../datasets/lmd/proto/bf3f61d4bf0c7a99c7c5dded... [080/Shout.mid, 080/Shout.mid]
66108 ../datasets/lmd/lmd_full/8/80fd25fb1ff70896ed8... False 180.717774 [0.0, 0.008333333333333333, 11.500865666666666... [60.0, 67.00002903334591, 65.99999340000066, 6... 109.561609 False 120.0 [English (open lyrics window), French (open ly... 16.0 ../datasets/lmd/proto/80fd25fb1ff70896ed8b1115... [R/Ravel Maurice Chanson Des Cueilleuses De Le...
5427 ../datasets/lmd/lmd_full/9/91100d1888462c59031... False 246.623850 [0.0] [120.0] 32.005208 False 480.0 [Nylon-Str. Gt., Bass, *Piano, Warm Pad, Strings] 1.0 ../datasets/lmd/proto/91100d1888462c5903132be8... [COREL COLLECTION/MOOD_06.MID]
78360 ../datasets/lmd/lmd_full/f/f577f833d678672ef3b... False 194.594595 [0.0] [120.0] 200.916667 True 384.0 [Snare / Tom/cymbal, Snare / Tom/cymbal, Snare... 3.0 ../datasets/lmd/proto/f577f833d678672ef3b2787c... [D/Double Vision.mid, D/Double Vision.mid, D/D...

Analysis of metadata

Let’s examine the extracted metadata of those files. Before we start plotting, it is always a good idea to take a look at the dataframe and its description.

midi_df.sample(15)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
160813 ../datasets/lmd/lmd_full/2/2b03576c2eac6b20c55... False 220.504202 [0.0] [99.99999999999999] 200.200000 True 384.0 [Lead, Bass, Guitar, Rhythm guitar, Voice oohs... 1.0 ../datasets/lmd/proto/2b03576c2eac6b20c5537514... [M/Monday Monday.mid]
83189 ../datasets/lmd/lmd_full/f/fc7280752baa136165c... False 149.380190 [0.0] [160.01024065540196] 150.326316 True 48.0 [SYNTH, SYNTH, SYNTH, SYNTH, SYNTH, SYNTH, SYN... 1.0 ../datasets/lmd/proto/fc7280752baa136165c911e0... [e/entreaty.mid]
96858 ../datasets/lmd/lmd_full/c/cb226a46053507ae8a4... False 170.000085 [0.0] [170.0000850000425] 11.294112 False 96.0 [Neo Cortex - Elements (Styles & Breeze Remix)] 0.0 ../datasets/lmd/proto/cb226a46053507ae8a47a16c... [N/Neo Cortex - Elements (Breeze & Styles Remi...
34038 ../datasets/lmd/lmd_full/6/6d12ce4d832f9c39ca6... False 231.914925 [0.0, 2.368422, 3.493422, 11.1907935, 11.19735... [151.99993920002433, 80.0, 151.99993920002433,... 503.284151 False 120.0 [Primo RH, Primo LH G, Primo LH F, Secondo RH ... 1.0 ../datasets/lmd/proto/6d12ce4d832f9c39ca683d53... [Mendelsonn/Allegro Brillant for Piano 4-hands...
171000 ../datasets/lmd/lmd_full/5/586c46e8f1c65beff59... False 236.159582 [0.0] [130.0023833770286] 91.378132 True 384.0 [ACCORDION, A.PIANO 1, FINGERDBAS, CRYSTAL, DR... 1.0 ../datasets/lmd/proto/586c46e8f1c65beff59afe0b... [Byrd, Charlie/The-Girl-From-Ipanema.mid, BYRD...
103720 ../datasets/lmd/lmd_full/d/dabbeaf50e6bad0b54d... True NaN None None NaN None NaN None NaN None [Sure.Polyphone.Midi/Entertainer.mid, Various ...
71588 ../datasets/lmd/lmd_full/a/aaf2e0398a707b34bbe... False 174.928646 [0.0] [124.000248000496] 131.612640 True 384.0 [ESTCE PR HASARD, ESTCE PR HASARD, ESTCE PR HA... 1.0 ../datasets/lmd/proto/aaf2e0398a707b34bbeaefb9... [E/Est Ce Par Hasard Dave.mid]
139917 ../datasets/lmd/lmd_full/e/ebc5edd3c533092d324... False 196.431974 [0.0] [85.00028333427778] 276.574446 True 384.0 [TORN, TORN, TORN, TORN, TORN, TORN, TORN, TOR... 1.0 ../datasets/lmd/proto/ebc5edd3c533092d32461eee... [X/Xgtorn.mid, 097/XGtorn.mid, 097/XGtorn.mid]
45209 ../datasets/lmd/lmd_full/1/1450ab4b1e52e25438d... False 195.096618 [0.0] [107.9999136000691] 224.166846 False 96.0 [Soprano, Soprano, Soprano, Soprano, Soprano, ... 38.0 ../datasets/lmd/proto/1450ab4b1e52e25438de8e7e... [civilwar2/61sfttobfa.mid]
74319 ../datasets/lmd/lmd_full/a/aae666057efdbc7268d... False 240.000000 [0.0] [120.0] 66.010417 False 96.0 [3xOsc (MIDI), 3xOsc #2 (MIDI), 3xOsc #3 (MIDI... 1.0 ../datasets/lmd/proto/aae666057efdbc7268d62b8f... [K/Kara_Sun_-_Into_The_Sun_(Airbase_Dub_Mix)__...
90199 ../datasets/lmd/lmd_full/c/c750e5a017155f06e63... False 256.000000 [0.0] [128.0] 15.117188 False 96.0 [Sylenth1 4, Sylenth1 3, Sylenth1 1] 1.0 ../datasets/lmd/proto/c750e5a017155f06e63ae69c... [C/Cascada_-_BadBoy__DjMixdOut_20130211014527....
99470 ../datasets/lmd/lmd_full/c/cc6b412aeeb834e8fec... False 275.408685 [0.0, 182.14267500000003] [140.00014000014, 108.99994368336242] 185.995888 True 120.0 [, , , , , , , , , , ] 1.0 ../datasets/lmd/proto/cc6b412aeeb834e8fecfd308... [VerucaSalt/Seether.mid, VerucaSalt/Seether.mi...
175912 ../datasets/lmd/lmd_full/5/500a1212bae787f7ef8... False 240.000000 [0.0] [120.0] 98.494792 False 384.0 [The Flowers, Of The Heath, Sequenced By, Barr... 1.0 ../datasets/lmd/proto/500a1212bae787f7ef8fb099... [h/heath.mid, H/heath.mid]
30403 ../datasets/lmd/lmd_full/7/7dc2e8000f630ab4e57... False 200.644292 [0.0, 9.69696, 38.49696, 48.193920000000006, 5... [198.000198000198, 199.99999999999997, 198.000... 144.326112 True 384.0 [Drums - Travis, Bass - Mark, Guitar 1 - Tom, ... 1.0 ../datasets/lmd/proto/7dc2e8000f630ab4e57eebd5... [B/blink_182-dumpweed.mid]
108557 ../datasets/lmd/lmd_full/d/d864a2adc13ad8b9797... False 251.560269 [0.0] [126.99998518333504] 205.980339 True 120.0 [The Corrs, "Breathless", , ****************, ... 1.0 ../datasets/lmd/proto/d864a2adc13ad8b9797d6be2... [Corrs/The Corrs - Breathless.mid, Various/The...
midi_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178561 entries, 0 to 178560
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   midi_path                   178561 non-null  object 
 1   midi_error                  178561 non-null  bool   
 2   estimate_tempo              173984 non-null  float64
 3   tempi_sec                   173984 non-null  object 
 4   tempi                       173984 non-null  object 
 5   end_time                    173984 non-null  float64
 6   drums                       173984 non-null  object 
 7   resolution                  173984 non-null  float64
 8   instrument_names            173984 non-null  object 
 9   num_time_signature_changes  173984 non-null  float64
 10  proto_path                  174476 non-null  object 
 11  original_files              178561 non-null  object 
dtypes: bool(1), float64(4), object(7)
memory usage: 15.2+ MB

With a pandas dataframe we can easily aggregate the data and plot it in different ways. Let’s start by taking a look at our success rate of parsing the MIDI files.

midi_df['midi_error'].value_counts().plot.pie()
plt.title('Faulty MIDI files');
../_images/01_midi_drums_60_0.png
midi_df.midi_error.value_counts().plot.bar()
plt.title("Number of faulty MIDI files");
../_images/01_midi_drums_61_0.png

Now that we know the number of MIDI files that failed to parse, we could inspect them further to see whether they are indeed corrupted or whether our library has a bug.

midi_df[midi_df['midi_error'] == True]
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
3 ../datasets/lmd/lmd_full/9/911cd08fa1fae36e5e0... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [s/sou.mid, S/sou.mid]
24 ../datasets/lmd/lmd_full/9/94862530febd2b295b9... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [e/ELVIS_PRESLEY__Dont_Be_Cruel.mid, ElvisPres...
25 ../datasets/lmd/lmd_full/9/906e72809900e01f919... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [dream theater/uagm1.mid]
98 ../datasets/lmd/lmd_full/9/99e40264f321a4bfc5d... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [V/vong_tay_cau_hon.mid]
106 ../datasets/lmd/lmd_full/9/9f22ccab9572cafafce... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [W/Wippenberg - Pong (Tocadisco Remix).mid, W/...
... ... ... ... ... ... ... ... ... ... ... ... ...
178315 ../datasets/lmd/lmd_full/5/5063a2b400597a372e7... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [f/faded.mid, 2009 MIDI/faded_love1-D120patsy_...
178377 ../datasets/lmd/lmd_full/5/523ce5adc656c872a69... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [l/laputa.mid, L/laputa.mid]
178419 ../datasets/lmd/lmd_full/5/546df09d78a32141369... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [h/heraldic.mid, H/heraldic.mid]
178484 ../datasets/lmd/lmd_full/5/5a4e49112c6cf832341... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [w/WESTERN.MID, 2009 MIDI/good_bad_and_ugly4-C...
178525 ../datasets/lmd/lmd_full/5/5cae276ae00fb1fd464... True NaN NaN NaN NaN NaN NaN NaN NaN NaN [soad/Chic_N_Stu.mid]

4084 rows × 12 columns

For now, we will simply ignore those files and focus on the ones that we could successfully parse.

For our experiments it is important that the MIDI files contain a drum track. Only files that we could parse successfully and that contain a drum track can be used for training our model.

midi_df.drums.value_counts().plot.pie()
plt.title("MIDI files that contain drums");
../_images/01_midi_drums_65_0.png
midi_df.drums.value_counts().plot.bar();
plt.title("Number of MIDI files that contain drums");
../_images/01_midi_drums_66_0.png

We should verify that the files in which we did not detect any drums have indeed no drums by listening to a random subset of them.

midi_df[midi_df['drums']==False].sample(10)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
170937 ../datasets/lmd/lmd_full/5/5c7f00cfbfb1c7a4c8d... False 219.517230 [0.0, 216.0] [80.0, 82.00003553334875] 422.539545 False 480.0 [Piano] 1.0 ../datasets/lmd/proto/5c7f00cfbfb1c7a4c8df5cc0... [C/Casablan.MID, C/CASABLAN.MID, C/CASABLAN.MID]
173537 ../datasets/lmd/lmd_full/5/5ccabab69a29eb49782... False 134.749985 [0.0, 0.4225350000000001, 2.9225340000000006, ... [71.00003550001773, 72.0000288000115, 68.00007... 271.779287 False 192.0 [Flute, Celeste, Harp, Vibes] 10.0 ../datasets/lmd/proto/5ccabab69a29eb49782f7ac2... [A/AS_OP19.MID, A/AS_OP19.MID]
160054 ../datasets/lmd/lmd_full/2/248de65be7464e72e13... False 288.075739 [0.0] [143.99988480009216] 13.333344 False 96.0 [MIDI out] 0.0 ../datasets/lmd/proto/248de65be7464e72e13026ee... [H/Heaven's Cry - I Dont Need This No More.mid...
104986 ../datasets/lmd/lmd_full/d/d62040c47cc8ccc1256... False 230.565249 [0.0, 1.1940279999999999, 1.6062414999999999, ... [201.00031155048293, 131.00007641671124, 18.00... 418.198189 False 240.0 [Diskant] 1.0 ../datasets/lmd/proto/d62040c47cc8ccc12566150b... [r/raffsonatacl.mid, R/raffsonatacl.mid]
177415 ../datasets/lmd/lmd_full/5/5d3a4dcffc36dc21507... False 183.485497 [0.0, 26.052641999999995, 26.252641999999994, ... [75.99996960001218, 75.0, 73.99998273333736, 7... 355.954855 False 96.0 [PRIMI MANDOLINI, SECONDI MANDOLINI, MANDOLE, ... 3.0 ../datasets/lmd/proto/5d3a4dcffc36dc21507ef683... [p/primiracconti.mid, P/primiracconti.mid]
25978 ../datasets/lmd/lmd_full/7/7a6107b2ebe71033dd6... False 133.333333 [0.0] [100.0] 46.500000 False 120.0 [, , , ] 1.0 ../datasets/lmd/proto/7a6107b2ebe71033dd6d7a5b... [h/himno442.mid, H/himno442.mid]
17961 ../datasets/lmd/lmd_full/0/09634cc18ee4b3e3d8e... False 224.868701 [0.0] [132.000132000132] 198.167415 False 192.0 [Clarinet, Trumpet, Tuba, Strings, Strings] 1.0 ../datasets/lmd/proto/09634cc18ee4b3e3d8e84072... [a/Airship_Remix.mid, A/Airship_Remix.mid]
34639 ../datasets/lmd/lmd_full/6/603e957bc609b539948... False 100.000000 [0.0, 62.5, 65.5, 86.5, 91.5, 112.5, 113.5, 11... [60.0, 30.0, 60.0, 30.0, 60.0, 30.0, 60.0, 30.... 191.763885 False 96.0 [Soprano, Alto, Tenor, Bass, Piano (hi), Piano... 5.0 ../datasets/lmd/proto/603e957bc609b5399481f26f... [bliss/ppb74mhamt.mid]
161340 ../datasets/lmd/lmd_full/2/24c247fe99ec2cfb909... False 141.750142 [0.0] [126.00012600012599] 63.805492 False 240.0 [Piano, Piano] 1.0 ../datasets/lmd/proto/24c247fe99ec2cfb909c4aa4... [W/waltz_05.mid, brahms/waltz_05.mid]
29071 ../datasets/lmd/lmd_full/7/7d3cac11d3db19406bb... False 106.219130 [0.0] [91.04496862742124] 42.176960 False 96.0 [, , , ] 1.0 ../datasets/lmd/proto/7d3cac11d3db19406bb568a5... [Christian/Inmyhart.mid, Various Artists/inmyh...

Let’s also take a look at the other metadata we have extracted.

midi_df.explode('tempi')['tempi'].plot.box(vert=False);
plt.title('Boxplot of tempi');
../_images/01_midi_drums_70_0.png
midi_df.explode('tempi')['tempi'].plot.hist(bins=5000, xlim=(0, 300));
plt.title('Histogram of tempi');
../_images/01_midi_drums_71_0.png

We notice a big spike at around 120 bpm, which is rather interesting. We should investigate what exactly causes this spike.
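One plausible explanation, which is an assumption to verify and not something the data has confirmed yet, is that a MIDI file without any tempo event is played back at the format's default of 120 bpm, so every file that never sets a tempo would land on this value. A minimal sketch to test this counts the files whose only tempo is exactly 120.0. The helper `share_with_single_default_tempo` is hypothetical and the example values are toy data.

```python
from typing import Iterable, Sequence

DEFAULT_MIDI_BPM = 120.0  # tempo assumed by the MIDI file format when none is set


def share_with_single_default_tempo(tempi_per_file: Iterable[Sequence[float]]) -> float:
    """Return the fraction of files whose only tempo entry is exactly 120 bpm."""
    total = 0
    hits = 0
    for tempi in tempi_per_file:
        total += 1
        if len(tempi) == 1 and tempi[0] == DEFAULT_MIDI_BPM:
            hits += 1
    return hits / total if total else 0.0


# Toy data in the same shape as the `tempi` column: one list of bpm values per file.
example_tempi = [[120.0], [120.0], [99.99999999999999], [140.0, 109.0]]
print(share_with_single_default_tempo(example_tempi))  # 0.5
```

On the real data the function could be called with `midi_df['tempi'].dropna()`. A large share would support the default-tempo explanation, while a small one would point to another cause.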

It is important to get an understanding and feeling for the data. We have an expectation of our data which can help us to see if it makes sense and if we parsed it correctly. But that can also be problematic because we project certain simplifying expectations on our data.

Let’s count the occurrences of tempo changes in each song.

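# Box plot over the per-song counts themselves (value_counts() here would plot how often each count occurs instead)
midi_df.explode('tempi').groupby('midi_path').size().plot.box(vert=False);
plt.title('Boxplot of number of tempo changes per song');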
../_images/01_midi_drums_73_0.png
midi_df.explode('tempi').groupby('midi_path').size().sort_values(ascending=False).head(10)
midi_path
../datasets/lmd/lmd_full/9/95cf26881a51f376a54e4e3f049f2d48.mid    6597
../datasets/lmd/lmd_full/3/34c04cd56b1087851f79780360f48225.mid    6096
../datasets/lmd/lmd_full/2/2677cf785791e1cfca13a0566aafbc85.mid    6096
../datasets/lmd/lmd_full/3/34a3df2a3a1e2267cf63657984f608cf.mid    5912
../datasets/lmd/lmd_full/c/c615436a609bf4e82c102503ea01d855.mid    5758
../datasets/lmd/lmd_full/1/1e40fd0edc293c8733a9c1b66517890b.mid    5758
../datasets/lmd/lmd_full/6/6662d98153d0c84d93e794d0c2b3940f.mid    5758
../datasets/lmd/lmd_full/8/8d956a0e5ee409a19adda9db287f7108.mid    4301
../datasets/lmd/lmd_full/0/001549d8bc6ba6dc62ade443cf01a51e.mid    4058
../datasets/lmd/lmd_full/0/0b0fdfe28eb0318ea3ac86046053724b.mid    3922
dtype: int64

It’s worth inspecting those files in, e.g., MuseScore to figure out what happens in them.

midi_df['end_time'].plot.box(vert=False);
plt.title('Boxplot of end_time');
../_images/01_midi_drums_76_0.png

We see that end_time has some extreme values that are probably wrong, so let’s inspect those examples more closely to see if we can spot a pattern.

midi_df.sort_values('end_time', ascending=False).head(5)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
84737 ../datasets/lmd/lmd_full/f/f7b41341a20201f860c... False 225.041860 [0.0] [128.0] 31055.625000 False 120.0 [Melody, On the Sun Road, Vivian Lai, Joseph L... 2.0 ../datasets/lmd/proto/f7b41341a20201f860cb375e... [s/sunroad.mid, S/sunroad.mid]
28204 ../datasets/lmd/lmd_full/7/7abab3566b77073f827... False 220.109820 [0.0, 1.411764, 67.095876, 129.0154569375, 129... [170.0000850000425, 95.00014250021376, 70.0000... 25714.187790 True 96.0 [, , , , , , , ] 9.0 ../datasets/lmd/proto/7abab3566b77073f8274c097... [FErnszt/Olala.mid]
69901 ../datasets/lmd/lmd_full/a/af6b689cfbb13c302bd... False 229.041171 [0.0, 5.052624, 40.420992, 94.420992, 118.4209... [95.00014250021376, 190.0002850004275, 80.0, 1... 20384.791244 True 96.0 [Acoustic Bass, Brass, Grand piano, Rock Organ... 7.0 ../datasets/lmd/proto/af6b689cfbb13c302bd41b86... [FErnszt/Swingin.mid]
26973 ../datasets/lmd/lmd_full/7/73f4f536d3d42293bd7... False 144.311366 [0.0] [122.00006913337249] 19652.447880 True 192.0 [tk1, tk2, tk3, tk4, tk10, tk11] 1.0 ../datasets/lmd/proto/73f4f536d3d42293bd789b34... [Clapton Eric/Bellbottom Blues.mid, Various Ar...
24515 ../datasets/lmd/lmd_full/7/7486b07d9a4060fa614... False 181.059483 [0.0, 0.4829544166666666, 1.9996389166666664, ... [88.00002346667293, 89.00994241056726, 90.0099... 19623.803343 True 192.0 [Steel Str.Guitar, Fretless Bass, Piano, Violi... 13.0 ../datasets/lmd/proto/7486b07d9a4060fa6142ac46... [K/Kevin Parent Les Doigts 2.mid]

Given these values (the longest file supposedly lasts more than eight hours), it seems we should filter out those really long pieces, as they are most probably faulty files.

Those errors could be caused by our parsing or by the MIDI files themselves, but as we have a big enough corpus, excluding those outliers is justifiable.

Resolution corresponds to the PPQN (pulses per quarter note) of the MIDI file. Anything above 1000 is suspicious, so let’s take a look at the files with the highest values.
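To get a feeling for what a resolution value means, it helps to translate it into the duration of a single tick. The sketch below assumes a constant tempo and that one beat equals one quarter note. The helper `seconds_per_tick` is hypothetical and only serves as an illustration.

```python
def seconds_per_tick(bpm: float, ppqn: int) -> float:
    """Duration of one MIDI tick in seconds at a constant tempo."""
    seconds_per_quarter = 60.0 / bpm  # one beat is taken to be one quarter note
    return seconds_per_quarter / ppqn


for ppqn in (96, 480, 25000):
    microseconds = seconds_per_tick(120.0, ppqn) * 1e6
    print(f"PPQN {ppqn:>5}: {microseconds:8.1f} microseconds per tick at 120 bpm")
```

At 120 bpm a resolution of 25000 corresponds to ticks of about 20 microseconds, which is arguably far finer than any musical timing requires.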

midi_df.sort_values('resolution', ascending=False).head(5)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
33871 ../datasets/lmd/lmd_full/6/610987ebaf0a326db40... False 239.221606 [0.0] [120.0] 135.368840 False 25000.0 [, , , ] 1.0 ../datasets/lmd/proto/610987ebaf0a326db4089867... [b/beethovenalladanza.mid, B/beethovenalladanz...
45187 ../datasets/lmd/lmd_full/1/1283eddbc4ab9042dcc... False 225.608072 [0.0] [112.00005973336519] 79.280092 True 24576.0 [, , ] 1.0 ../datasets/lmd/proto/1283eddbc4ab9042dccaea09... [Pop_and_Top40/2 Pac - Changes.mid, Pop_and_To...
67351 ../datasets/lmd/lmd_full/a/a5c6eb45c94d84474cb... False 235.213491 [0.0] [120.0] 26.987488 False 16384.0 [, , ] 2.0 ../datasets/lmd/proto/a5c6eb45c94d84474cbe00a3... [a/amigo3.mid, A/amigo3.mid]
159033 ../datasets/lmd/lmd_full/2/2076f0fa330be484c1c... False 236.923314 [0.0] [140.00014000014] 26.999973 False 15360.0 [Main Lead] 1.0 ../datasets/lmd/proto/2076f0fa330be484c1c60242... [M/M.I.D.O.R. - Far East.mid, M/m.i.d.o.r.__fa...
15472 ../datasets/lmd/lmd_full/0/0c3f038f4eaed7d7f8c... False 272.000290 [0.0] [136.0001450668214] 28.014676 False 15360.0 [Melody] 1.0 ../datasets/lmd/proto/0c3f038f4eaed7d7f8c90478... [W/whiteroom__someday_intro__ambia.mid, W/Whit...

Another important metric is the number of time signatures.

If we use a step-sequencer-like approach, we will need our data to have one time signature only.
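The reason is that the length of a bar, measured in grid steps, depends on the time signature. A small sketch, assuming a grid of four steps per quarter note (sixteenth notes), makes this concrete. The helper `steps_per_bar` is hypothetical.

```python
from fractions import Fraction


def steps_per_bar(numerator: int, denominator: int, steps_per_quarter: int = 4) -> Fraction:
    """Number of grid steps in one bar of the given time signature."""
    quarters_per_bar = Fraction(4 * numerator, denominator)
    return quarters_per_bar * steps_per_quarter


for signature in [(4, 4), (3, 4), (6, 8), (7, 8)]:
    print(signature, steps_per_bar(*signature))
```

A pattern of 16 steps only lines up with the bars of a piece if every bar is in 4/4. A piece that switches between time signatures does not fit a single fixed grid length.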

midi_df['num_time_signature_changes'].plot.box(vert=False);
../_images/01_midi_drums_83_0.png
midi_df.sort_values('num_time_signature_changes', ascending=False).head(5)
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
72284 ../datasets/lmd/lmd_full/a/a85612fb0992100bded... False 222.143725 [0.0, 1.54624, 1.6513706666666668, 1.779370666... [194.0190397350993, 190.23944805194805, 195.31... 204.310165 True 48.0 [STRAUSS, STRAUSS, STRAUSS, STRAUSS, STRAUSS, ... 959.0 ../datasets/lmd/proto/a85612fb0992100bded76798... [S/STRAUSS.MID, S/STRAUSS.MID]
26776 ../datasets/lmd/lmd_full/7/7d42a95293cc47bcacf... False 199.481296 [0.0, 1.6240640000000002, 2.071552, 2.22173866... [160.0922131147541, 134.08180778032036, 133.16... 229.441685 True 48.0 [NACHT, NACHT, NACHT, NACHT, NACHT, NACHT, NACHT] 782.0 ../datasets/lmd/proto/7d42a95293cc47bcacfaa568... [Hollands/NACHT.MID]
168045 ../datasets/lmd/lmd_full/5/5693291bc82548fea37... False 242.258977 [0.0] [136.0001450668214] 255.266272 True 96.0 [VIBRA SLAP, VIBRA SLAP, VIBRA SLAP, VIBRA SLA... 479.0 ../datasets/lmd/proto/5693291bc82548fea376ec42... [S/Striving.MID, Midis Diversas/STRIVING.MID, ...
60293 ../datasets/lmd/lmd_full/8/85b2d4e8637008f6b4e... False 218.486679 [0.0, 2.5, 73.5, 75.586952, 89.884816, 131.884... [96.0, 240.0, 230.00049833441304, 235.00013708... 283.476487 False 1024.0 [, ] 424.0 ../datasets/lmd/proto/85b2d4e8637008f6b4e54e5c... [B/beevar2.mid]
44034 ../datasets/lmd/lmd_full/6/69df67cd25dd33c1217... False 228.828269 [0.0, 1.25, 3.65, 102.78041, 105.1804100000000... [96.0, 100.0, 115.0000287500072, 100.0, 115.00... 407.079384 False 1024.0 [, ] 382.0 ../datasets/lmd/proto/69df67cd25dd33c12177e159... [Schumann/Schuman Toccata op7.mid, Schumann/Sc...

Let’s check the number of files we would lose if we limit our files to at most one time signature.

(midi_df.num_time_signature_changes <= 1).value_counts().plot.pie();
plt.title('Tracks without time signature changes');
../_images/01_midi_drums_86_0.png

We can also take a look at the most common instrument names in our dataset.

midi_df.explode('instrument_names').instrument_names.value_counts().head(20)
                  354676
Bass               29532
Drums              19438
untitled           18471
Piano              16344
Strings            10922
Soprano            10478
Guitar             10126
Alto                9368
Tenor               8987
DRUMS               8013
Melody              6693
WinJammer Demo      6488
Voice               6270
Piano (hi)          4905
Piano (lo)          4900
Italian             4679
STRINGS             4003
bass                3785
MELODY              3775
Name: instrument_names, dtype: int64

We can also take a look at the most common words in our original file names.

midi_df.explode('original_files').original_files.str.replace('[^a-zA-Z]', ' ', regex=True).str.lower().str.split(' ').explode().value_counts().head(50)
              1091324
mid            570786
midi            56765
various         42479
artists         36610
a               34883
the             34148
l               29181
s               28479
polyphone       24771
sure            23089
i               22765
m               20958
e               20792
t               19164
midis           19014
c               18162
b               17950
d               17712
n               17012
h               15452
o               15425
g               13303
you             13228
p               12267
k               12032
of              11890
poly            11836
in              11073
f               10433
beatles          9696
love             9246
r                8911
me               8901
j                8816
diversen         8643
my               8172
to               7585
divers           7025
w                6894
analisadas       6320
no               6073
de               5601
on               5373
and              5349
midirip          5017
it               4853
bach             4692
classical        4542
unsorted         4419
Name: original_files, dtype: int64

An interesting insight from this analysis is that the most common words in a text corpus often carry little information about what the text is about. This is because words that belong to the grammatical structure of the text introduce noise into the dataset.

There are basic algorithms such as tf-idf which allow us to filter out this noise. The idea is that a word which is common across many documents carries little meaning (e.g. the or, in our case, midi) and therefore receives a low score. If, however, a significant subset of our file names contains a word that is not so common in the other file names (e.g. beatles), this token will receive a high rating.

For now we don’t want to inspect this further, but it’s always good to know such algorithms, as they help to filter out the relevant data.
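To make the idea more tangible, here is a minimal sketch of tf-idf in plain Python. It uses the textbook variant without smoothing (term frequency times the logarithm of the inverse document frequency); library implementations often differ in detail. The function name `tf_idf` and the file names are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score each token per document: term frequency times inverse document frequency."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each token occur at least once?
    doc_freq = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


# Toy file names, already split into lowercase tokens.
file_names = [
    ["beatles", "yesterday", "mid"],
    ["beatles", "help", "mid"],
    ["bach", "fugue", "mid"],
    ["various", "artists", "mid"],
]
for name_scores in tf_idf(file_names):
    print({token: round(score, 3) for token, score in name_scores.items()})
```

The token mid occurs in every file name and therefore scores exactly zero, while tokens that appear in only some of the names receive a positive score.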

It’s also good to run some sanity checks on our data. One assumption would be that every file in which we did not detect a drum track also does not contain an instrument with the name drums.

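# One row per (file, instrument name) pair
exploded_df = midi_df.explode('instrument_names')
# Files that have an instrument called "drum(s)" but no detected drum track
exploded_df[
    (exploded_df['instrument_names'].str.lower().isin(['drums', 'drum']))
    & (exploded_df['drums'] == False)
].groupby('midi_path').first().reset_index()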
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
0 ../datasets/lmd/lmd_full/0/0093ffc81428aabfa14... False 190.000285 [0.0] [190.0002850004275] 137.151110 False 192.0 Drums 1.0 ../datasets/lmd/proto/0093ffc81428aabfa14aaa13... [M/Martha My Dear 3.mid, Beatles +GeorgeJohnPa...
1 ../datasets/lmd/lmd_full/0/032e298d9560fdbe954... False 120.000180 [0.0] [190.0002850004275] 114.947196 False 120.0 Drums 1.0 ../datasets/lmd/proto/032e298d9560fdbe954d7551... [s/smetmoth.mid, S/smetmoth.mid]
2 ../datasets/lmd/lmd_full/0/043e1c01e58a4bb4499... False 139.999977 [0.0, 0.4285715, 7.071422000000001, 39.071422] [69.99998833333528, 140.00014000014, 60.0, 69.... 196.778591 False 120.0 Drums 1.0 ../datasets/lmd/proto/043e1c01e58a4bb44995ee38... [Songs+300/Tunnel_o.mid]
3 ../datasets/lmd/lmd_full/0/06f86377b1dbad0ebf0... False 226.482465 [0.0, 1.0, 15.608691999999998] [120.0, 115.0000287500072, 120.0] 59.483692 False 192.0 DRUMS 1.0 ../datasets/lmd/proto/06f86377b1dbad0ebf08fdcd... [s/seaquest2.mid, S/seaquest2.mid]
4 ../datasets/lmd/lmd_full/0/078e2cc7c46eb5de726... False 204.052854 [0.0] [98.00014373354414] 281.211322 False 192.0 Drums 1.0 ../datasets/lmd/proto/078e2cc7c46eb5de72621329... [Pop/SLEGDEHM.MID, PeterGabriel/Sledgehammer6....
... ... ... ... ... ... ... ... ... ... ... ... ...
148 ../datasets/lmd/lmd_full/f/f2a97c0f7fa26244a77... False 218.459582 [0.0] [70.00007000007] 213.553358 False 192.0 drums 1.0 ../datasets/lmd/proto/f2a97c0f7fa26244a77a8d0f... [u/unchainedmelody3.mid, U/UnchainedMelody3.mid]
149 ../datasets/lmd/lmd_full/f/f6702b2445a6ce9aa8e... False 211.143695 [0.0] [120.0] 21.241667 False 480.0 Drums 1.0 ../datasets/lmd/proto/f6702b2445a6ce9aa8e6cad8... [e/ElectroJam.Mid, E/ElectroJam.Mid]
150 ../datasets/lmd/lmd_full/f/f8a56c3ebc00bfef70d... False 157.793103 [0.0] [80.0] 263.847656 False 192.0 Drums 1.0 ../datasets/lmd/proto/f8a56c3ebc00bfef70d033a8... [t/tearinhand.mid, T/tearinhand.mid]
151 ../datasets/lmd/lmd_full/f/fa06799e6825c6a4095... False 284.928064 [0.0] [148.99983858350816] 383.315852 False 240.0 drums 16.0 ../datasets/lmd/proto/fa06799e6825c6a4095e2934... [l/louder.mid, L/louder.mid]
152 ../datasets/lmd/lmd_full/f/fd37e5d1258ba97fc0e... False 235.685752 [0.0] [100.0] 191.795000 False 120.0 Drums 1.0 ../datasets/lmd/proto/fd37e5d1258ba97fc0ef9779... [Christian/Easyway.mid, Various Artists/easywa...

153 rows × 12 columns

Listening to those examples reveals that some of them use an instrument other than drums for their drum part, either by artistic choice or by mistake. As this only applies to 153 files, we can live with this error and ignore the files where we did not detect the drum track.

Filter and extract the data

After inspecting all available files we should filter out the noisy ones: files that are not interesting to us (they lack a drum track) and files for which the parsing did not work as expected. With the analysis above we can choose sensible bounds for the data we want to keep.

filtered_midi_df = midi_df[
    (midi_df.drums == True)
    & (midi_df.midi_error == False)
    & (midi_df.end_time.between(30, 800))
    & (midi_df.num_time_signature_changes<=1)
]
filtered_midi_df
midi_path midi_error estimate_tempo tempi_sec tempi end_time drums resolution instrument_names num_time_signature_changes proto_path original_files
5 ../datasets/lmd/lmd_full/9/9d30679480d0e55a07b... False 224.502048 [0.0] [115.00002875000717] 296.347752 True 120.0 [Piano, Organ, Big Brass, Horns, Horns2, Horns... 1.0 ../datasets/lmd/proto/9d30679480d0e55a07be1414... [S/Smooth .mid]
9 ../datasets/lmd/lmd_full/9/974f5b2466a5d5670df... False 184.213982 [0.0, 2.714283, 170.86248999999998, 171.960050... [210.00021000021, 80.99997165000993, 82.000035... 225.896259 True 120.0 [, , , , , , , , , ] 1.0 ../datasets/lmd/proto/974f5b2466a5d5670dfddee0... [Sure.Polyphone.Midi/A Winter's Tale.mid, w/wi...
10 ../datasets/lmd/lmd_full/9/952df587b9fff5adbeb... False 265.636110 [0.0] [144.0023040368646] 216.460946 True 480.0 [Track 2, Track 3, Track 4, Track 5, Track 6, ... 1.0 ../datasets/lmd/proto/952df587b9fff5adbeb5f40b... [E/Enola Gay 63.mid, divers midi 3/OMD_-_Enola...
11 ../datasets/lmd/lmd_full/9/93e2cb3a0c36af361bb... False 165.000165 [0.0] [165.000165000165] 310.029993 True 480.0 [Chorus Guitar, Overdrive Guita, Overdrive Gui... 1.0 ../datasets/lmd/proto/93e2cb3a0c36af361bb49c5e... [z/zombie05.mid, Z/zombie05.mid]
12 ../datasets/lmd/lmd_full/9/97c1f795ffb676dfaa5... False 212.877687 [0.0] [110.00011000011] 222.766823 True 192.0 [BASS, E.PIANO, GUITAR, STRINGS, PIANO, New Tr... 1.0 ../datasets/lmd/proto/97c1f795ffb676dfaa5a0202... [J/J. Rivers Poor Side of Town.mid, divers mid...
... ... ... ... ... ... ... ... ... ... ... ... ...
178555 ../datasets/lmd/lmd_full/5/5bd52f314b616bf50c4... False 141.332152 [0.0, 183.428388, 186.85695800000002, 241.9283... [70.00007000007, 35.00001458333941, 70.0000700... 250.221166 True 480.0 [Track 1, Track 2, Track 3, Track 4, Track 6, ... 1.0 ../datasets/lmd/proto/5bd52f314b616bf50c40104c... [T/taylor_swift-tim_mcgraw.mid]
178556 ../datasets/lmd/lmd_full/5/5eea68ee369af3ee5cc... False 184.252715 [0.0, 210.73867815, 210.75516165, 210.77763355... [90.00009000009, 91.00009100009099, 89.0000400... 299.296908 True 240.0 [You've Made Me So Very Happy, You've Made Me ... 1.0 ../datasets/lmd/proto/5eea68ee369af3ee5cc577ba... [Y/Youvemad L.mid, Y/YOUVEMAD L.mid, Y/YOUVEMA...
178557 ../datasets/lmd/lmd_full/5/5309fb3f62bf73780e6... False 183.172080 [0.0] [81.01003309259853] 139.390142 True 120.0 [, , , , , , , , , ] 1.0 ../datasets/lmd/proto/5309fb3f62bf73780e61f85b... [R/Roberto Carlos - Propuesta L.mid, Midis Jov...
178559 ../datasets/lmd/lmd_full/5/5538b0174111bcef760... False 274.353398 [0.0] [142.00007100003546] 172.389879 True 96.0 [Std Drums, Jazz Gtr, Muted Gtr (Lead Dbl), Mu... 1.0 ../datasets/lmd/proto/5538b0174111bcef760a5f29... [B/Boys 2.mid, b/boys_2.mid, Beatles +GeorgeJo...
178560 ../datasets/lmd/lmd_full/5/51e756897765aed30af... False 229.821409 [0.0] [64.99999458333379] 293.317332 True 96.0 [, , , , , ] 1.0 ../datasets/lmd/proto/51e756897765aed30af232df... [G/Gerry Boulet Deadline.mid]

97848 rows × 12 columns

len(filtered_midi_df)/len(midi_df)
0.5479696014247232
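To see which criterion is responsible for most of the reduction, each condition can also be evaluated on its own. The sketch below uses a small hand-made dataframe with the same column names as `midi_df`. The rows are invented for illustration, and `toy_df`, `conditions` and `kept_share` are hypothetical names.

```python
import pandas as pd

# Invented toy rows that mimic the columns used by the filter above.
# The last row imitates a file that failed to parse (missing metadata).
toy_df = pd.DataFrame({
    "drums": [True, True, False, None],
    "midi_error": [False, False, False, True],
    "end_time": [200.0, 20000.0, 120.0, float("nan")],
    "num_time_signature_changes": [1.0, 1.0, 3.0, float("nan")],
})

conditions = {
    "has drums": toy_df["drums"] == True,
    "parsed without error": toy_df["midi_error"] == False,
    "30 s <= end_time <= 800 s": toy_df["end_time"].between(30, 800),
    "at most one time signature": toy_df["num_time_signature_changes"] <= 1,
}

# Share of files that each condition would keep if it were applied alone.
kept_share = pd.Series({name: mask.mean() for name, mask in conditions.items()})
print(kept_share)
```

Replacing `toy_df` with `midi_df` would give the same breakdown for the real dataset.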

We have now filtered out almost half of our available examples (about 55 % remain), just by looking at the metadata of the MIDI files. The next step is to transform the drum track of each MIDI file into a mathematical representation.

The notation within the MIDI file is bound to time, but in music we often represent the time domain in relation to a tempo, which remaps the time measured in seconds. A common unit for this is beats per minute (bpm), where 60 bpm maps each second to a beat. If we want to shorten the time between two beats (commonly referred to as playing faster), we increase the bpm, and vice versa.
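This relation between tempo and time can be written down directly. The following is a minimal sketch that assumes a constant tempo; both helper names are hypothetical.

```python
def seconds_per_beat(bpm: float) -> float:
    """Time between two consecutive beats in seconds."""
    return 60.0 / bpm


def seconds_to_beats(seconds: float, bpm: float) -> float:
    """Position of a point in time, measured in beats, at a constant tempo."""
    return seconds / seconds_per_beat(bpm)


print(seconds_per_beat(60))        # 1.0
print(seconds_per_beat(120))       # 0.5
print(seconds_to_beats(3.0, 120))  # 6.0
```

Doubling the bpm halves the time between two beats, so the same three seconds cover twice as many beats.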

Another way of segmenting events in time is the time signature of a track.

One representation that simplifies this notation is the step sequencer, in which we snap notes to the nearest step on a grid.

For this we can use note_seq with the function midi_file_to_drum_track.
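To illustrate what snapping to a grid means, here is a minimal sketch that does not use note_seq and makes no claim about how note_seq implements it. It assumes a constant tempo and a grid of four steps per quarter note. The helper `quantize_onsets` is hypothetical.

```python
def quantize_onsets(onsets_sec: list[float], bpm: float, steps_per_quarter: int = 4) -> list[int]:
    """Snap note onsets given in seconds to the index of the nearest grid step."""
    seconds_per_step = 60.0 / bpm / steps_per_quarter
    # int(x + 0.5) rounds half up for non-negative x; Python's round() would round half to even.
    return [int(onset / seconds_per_step + 0.5) for onset in onsets_sec]


# At 120 bpm one sixteenth-note step lasts 0.125 seconds.
onsets = [0.0, 0.13, 0.49, 1.02]
print(quantize_onsets(onsets, bpm=120))  # [0, 1, 4, 8]
```

Small timing deviations disappear in this representation: an onset at 0.49 seconds and one at exactly 0.5 seconds both end up on step 4.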

Conclusion

We filtered out all noisy examples of our dataset and transformed the drum track into a mathematical, simplified notation.

In the next step we will start to analyse the drum patterns that we extracted.