Critical Machine Learning

project website

Every NYTimes front page

During a conversation with a classmate, I noticed that the New York Times offers a scanned image of every front page they have published. I had a vague sense that scraping all of them could lead to something cool, and indeed someone did come up with something cool: Every NYT front page since 1852 by Josh Begley.

It seemed fun, so I decided to give it a go as well. NYT hosts all of their front page images using a predictable URL format (visible in the script below). The images are fairly low-res, although since Jul 6, 2012 they also provide higher-res PDF scans (same URL, but with a .pdf extension).

Since NYT organizes their images in such a neat format, downloading all of them is quite simple. Using Python 3:

import urllib.request, datetime

d =, 9, 18)  # date of the first NYT publication
missing = []  # list where errors and missing dates go, just in case
while d !=
    # base URL omitted here; prepend it to the path format below
    url = "{}/{}/{}/nytfrontpage/scan.jpg".format(d.strftime("%Y"), d.strftime("%m"), d.strftime("%d"))
        urllib.request.urlretrieve(url, d.strftime("%Y%m%d") + ".jpg")
    except Exception:
        print(d.strftime("%Y%m%d") + " missing")
    d += datetime.timedelta(days=1)

That’s it. It should run for several hours, downloading roughly 10GB of images. I tried to shorten the process by dividing the dates into chunks of 10,000 days and running 6 scripts simultaneously.
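
For reference, here is a minimal sketch of that chunked approach using multiprocessing. The download_range helper is a hypothetical stand-in for the urlretrieve loop above, not part of the original script.

import datetime
from multiprocessing import Pool

def download_range(start, n_days):
    # stand-in for the download loop above, applied to one chunk of dates
    d = start
    for _ in range(n_days):
        # fetch the scan for date d here (same logic as the main script)
        d += datetime.timedelta(days=1)

if __name__ == "__main__":
    first =, 9, 18)
    total = ( - first).days
    chunk = 10000
    starts = [first + datetime.timedelta(days=i) for i in range(0, total, chunk)]
    with Pool(6) as p:  # six workers, matching the 6 simultaneous scripts
        p.starmap(download_range, [(s, min(chunk, total - (s - first).days)) for s in starts])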

In the process, I learned that some dates are not available because of strikes at the NYT.

Next step: making an animation out of the images.

Analyzing MIDI files in Python

The mid-term project for Machine Learning (Prof. Haralick) is MIDI music recognition. This is a casual log of my process so far.

To read MIDI files I am using the music21 Python package, as suggested. Reading MIDI files directly with music21.converter.parse() seems to produce unreliable results. For example, I tried reading a file of Satie’s Gymnopedie 1:

python: from music21 import converter
python: g = converter.parse('satie_gymnopedie_1_(c)dery.mid')

However, by parsing the file directly I lose the tempo, rests, and key signature. I also get an “incorrect MusicXML” warning.

I can preserve this information by first converting the MIDI file into MusicXML with MuseScore and then parsing the XML:

> /Applications/MuseScore\ satie_gymnopedie_1_\(c\)dery.mid --export-to satie_gymnopedie_1_\(c\)dery.xml
python: x = converter.parse('satie_gymnopedie_1_(c)dery.xml')
python: x.measures(1,4).show()
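
As a quick sanity check that the key signature and tempo actually survive the round trip, something along these lines should work in the same session (standard music21 calls, shown only as a sketch):

python: x.analyze('key')
python: x.recurse().getElementsByClass('MetronomeMark')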

Therefore, the next step: batch-convert every MIDI file into MusicXML and work from there.
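
A minimal sketch of that batch conversion, assuming all the MIDI files sit in one directory. The mscore command name and the midi/ folder are placeholders (adjust to the actual MuseScore binary path); only the --export-to flag comes from the command above.

import glob, os, subprocess

MSCORE = "mscore"  # placeholder: full path to the MuseScore binary used above
for mid_path in glob.glob("midi/*.mid"):  # hypothetical folder of MIDI files
    xml_path = os.path.splitext(mid_path)[0] + ".xml"[MSCORE, mid_path, "--export-to", xml_path], check=True)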

Feb 23: The assignment was further specified: build a classifier that distinguishes between two composers/genres, instead of across all composers/genres. That prompted me to do a basic overview of the data. Here I wanted to find how many scores I have per composer name. (I excluded some folders, mainly to avoid having every member of the Bach family.)

MIDIFILEDIR> find . -type f -name "*.mid" -exec mv {} TARGETDIR \;
ls > index.txt

import re
from collections import Counter

with open("index.txt", "r") as f:
    filenames = f.readlines()
regex = r"[a-zA-Z]+"  # leading alphabetic run of each filename = composer name
names = [re.findall(regex, fn)[0] for fn in filenames]
print(Counter(names).most_common(20))  # top 20 composers by file count

results in: [('bach', 2276), ('haydn', 744), ('mozart', 728), ('beethoven', 673), ('scarlatti', 598), ('handel', 535), ('victoria', 333), ('schubert', 287), ('chopin', 277), ('tchaikovsky', 243), ('alkan', 238), ('dandrieu', 211), ('debussy', 199), ('pachelbel', 185), ('liszt', 170), ('brahms', 162), ('dvorak', 148), ('lully', 119), ('schumann', 118), ('couperin', 117)]

I feel inclined to work with composers that have similar amounts of data, so it is going to be Haydn/Mozart classification. (Maybe Beethoven as well.)
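
A quick sketch of pulling those composers' files out of the index built above, reusing the filenames and names lists from the counting script (purely illustrative):

targets = {"haydn", "mozart"}  # add "beethoven" if it joins the comparison
selected = [fn.strip() for fn, name in zip(filenames, names) if name in targets]
print(len(selected), "files selected")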