I mentioned a while ago how
I was slowly migrating away from Mixcloud as their business model is starting to impinge on people listening to my mixes and I’m not so keen on that. I already mentioned trying to get
Funkwhale working and using cue files.
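For anyone unfamiliar, a cue file is just a plain-text index of track start times within a single long audio file. A minimal one for a mix might look something like this (the filename, artists, and titles here are made up for illustration):

```
FILE "follow-me-into-the-fading-moonlight.mp3" MP3
  TRACK 01 AUDIO
    TITLE "Opening Track"
    PERFORMER "Some Artist"
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    TITLE "Second Track"
    PERFORMER "Another Artist"
    INDEX 01 05:32:00
```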
While looking through my mixes I noticed I really didn't do a great job with some of the metadata; I had spent most of my time adding metadata in Mixcloud itself (not ideal).
In Mixcloud (once logged in) there are some human-friendly URLs which I was able to grab images from. The key one is the upload edit page – https://www.mixcloud.com/upload/{username}/{mixname}/edit/ for example https://www.mixcloud.com/upload/cubicgarden/follow-me-into-the-fading-moonlight/edit/
My plan was to manually copy the times into my newly written cue files, but while talking to Jon about it, he said to give him five minutes and he could knock up a script to pull the values out of the HTML page. I had thought about doing it before with XSLT, but noticed there is a lot of JavaScript rendering, which makes things difficult.
Jon's quickly written script was just what I needed.
#!/usr/bin/env python3
import csv
import sys
from collections import namedtuple
from typing import List

import bs4
from bs4 import Tag

SongInfo = namedtuple('SongInfo', ['number', 'artist', 'title', 'time'])


def load_html(filename: str):
    with open(filename, 'r', encoding='utf-8') as fo:
        return fo.read()


def extract_song_info(song: Tag):
    try:
        number = song.find(class_='section-number').text
        artist = song.find(class_='section-artist').text
        title = song.find(class_='section-title').text
        time = song.find(class_='section-time')['value']
        result = SongInfo(number, artist, title, time)
        print(f'Extracted {result}')
        return result
    except (AttributeError, TypeError):
        # A find() that matches nothing returns None, so a missing
        # element raises AttributeError (.text) or TypeError (['value'])
        print(f'Error with item {song}')
        return None


def parse_table(input_html: str):
    soup = bs4.BeautifulSoup(input_html, features="html5lib")
    songs = soup.find_all(class_="section-row")
    return [x for x in (extract_song_info(song) for song in songs) if x is not None]


def save_to_csv(file_name: str, songs: List[SongInfo]):
    # newline='' avoids blank rows on Windows when using csv.writer
    with open(file_name, 'w', encoding='utf-8', newline='') as fo:
        writer = csv.writer(fo)
        for song in songs:
            writer.writerow(song)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print('Usage: extractor.py [input_html_file] [output_csv_file]')
        sys.exit(1)
    html = load_html(sys.argv[1])
    songs = parse_table(html)
    save_to_csv(sys.argv[2], songs)
    print(f'Saved to {sys.argv[2]} successfully - Done!')
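Judging by the class names the script looks for, the tracklist markup on the edit page presumably resembles something like this (a reconstruction for illustration, not Mixcloud's actual HTML – note the time is an input's value attribute rather than element text):

```
<div class="section-row">
  <span class="section-number">1</span>
  <span class="section-artist">Some Artist</span>
  <span class="section-title">Some Title</span>
  <input class="section-time" value="5:32" />
</div>
```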
With the script and the HTML pages (which I almost managed to fetch with Chromedriver, again thanks to Jon, but I couldn't be bothered to sort out the cookies, etc.), I wrote a quick and dirty bash script and fired up a terminal.
#!/bin/bash
./extractor.py "$1.html" "$1.csv"
# Verify
echo "Details for $1"
I thought about modifying Jon's script to generate the cue files directly, bypassing the CSV file, but decided I should just get them all done, because I still need to get Funkwhale going.
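If I ever do go back to it, the CSV-to-cue step could be sketched roughly like this (the M:SS time format and the audio filename are assumptions on my part; cue INDEX lines want MM:SS:FF, with frames):

```python
import csv


def time_to_cue_index(time_str: str) -> str:
    """Convert an M:SS or H:MM:SS time string to cue-style MM:SS:FF."""
    parts = [int(p) for p in time_str.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours/minutes with zero
    hours, minutes, seconds = parts
    total_minutes = hours * 60 + minutes  # cue sheets use total minutes
    return f'{total_minutes:02d}:{seconds:02d}:00'


def csv_to_cue(csv_path: str, cue_path: str, audio_file: str) -> None:
    """Turn rows of (number, artist, title, time) into a cue file."""
    with open(csv_path, newline='', encoding='utf-8') as fo:
        rows = list(csv.reader(fo))
    lines = [f'FILE "{audio_file}" MP3']
    for number, artist, title, time in rows:
        lines.append(f'  TRACK {int(number):02d} AUDIO')
        lines.append(f'    TITLE "{title}"')
        lines.append(f'    PERFORMER "{artist}"')
        lines.append(f'    INDEX 01 {time_to_cue_index(time)}')
    with open(cue_path, 'w', encoding='utf-8') as fo:
        fo.write('\n'.join(lines) + '\n')
```
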
I did notice the edit page doesn't include the genre or the year of the mix, but I can live with that, for now… Scraping web pages is certainly a throwback, but it's a better solution than what I was originally thinking.
This will teach me to sort out my own house of data!