While looking through my mixes I noticed I really didn’t do a great job with some of the metadata; I had spent most of my time adding the metadata to Mixcloud itself (not ideal).
In Mixcloud (once logged in) there are some human-friendly URLs which I was able to grab images from. The key one being the upload edit page: https://www.mixcloud.com/upload/{username}/{mixname}/edit/ – for example https://www.mixcloud.com/upload/cubicgarden/follow-me-into-the-fading-moonlight/edit/.
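Because the pattern is predictable, generating the edit URL for every mix is trivial. A tiny sketch, assuming you have a list of your own mix slugs (the one below is just the example from above):

# A minimal sketch of the edit-page URL pattern; the slug list is an
# assumption - fill it with your own mix names.
EDIT_URL = 'https://www.mixcloud.com/upload/{username}/{mixname}/edit/'

for mixname in ['follow-me-into-the-fading-moonlight']:
    print(EDIT_URL.format(username='cubicgarden', mixname=mixname))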
My plan was to manually copy the times into my newly written cue files, but while talking to Jon about it, he said give him five minutes and he could knock up a script to pull the values out of the HTML page. I had thought about doing it before using XSLT, but noticed there is a lot of JavaScript rendering, which makes things difficult.
Jon’s quick script was just what I needed.
#!/usr/bin/env python3
import csv
import sys
from collections import namedtuple
from typing import List

import bs4
from bs4 import Tag

# One row of the tracklist: track number, artist, title and start time.
SongInfo = namedtuple('SongInfo', ['number', 'artist', 'title', 'time'])


def load_html(filename: str):
    with open(filename, 'r', encoding='utf-8') as fo:
        return fo.read()


def extract_song_info(song: Tag):
    try:
        number = song.find(class_='section-number').text
        artist = song.find(class_='section-artist').text
        title = song.find(class_='section-title').text
        # The start time lives in an input field, so read its value attribute.
        time = song.find(class_='section-time')['value']
        result = SongInfo(number, artist, title, time)
        print(f'Extracted {result}')
        return result
    except (AttributeError, TypeError):
        # A missing element makes .text or ['value'] blow up; skip the row.
        print(f'Error with item {song}')
        return None


def parse_table(input_html: str):
    soup = bs4.BeautifulSoup(input_html, features="html5lib")
    songs = soup.find_all(class_="section-row")
    # Drop any rows that failed to parse.
    return [x for x in (extract_song_info(song) for song in songs) if x is not None]


def save_to_csv(file_name: str, songs: List[SongInfo]):
    with open(file_name, 'w', encoding='utf-8', newline='') as fo:
        writer = csv.writer(fo)
        for song in songs:
            writer.writerow(song)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print('Usage: extractor.py [input_html_file] [output_csv_file]')
        sys.exit(1)  # bail out instead of crashing on missing arguments
    html = load_html(sys.argv[1])
    songs = parse_table(html)
    save_to_csv(sys.argv[2], songs)
    print(f'Saved to {sys.argv[2]} successfully - Done!')
With Jon’s script in hand, the remaining job was getting the HTML pages themselves. I almost fetched them automatically with Chromedriver, again thanks to Jon, but I couldn’t be bothered to sort out the cookies, etc., so I ended up saving the pages by hand.
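For the curious, something like the following is roughly what the Chromedriver route would have looked like. A minimal, untested sketch, assuming Selenium and a Chrome profile already logged in to Mixcloud; the profile path and filename are examples, not from the original setup.

# A minimal sketch of fetching the edit page with Chromedriver,
# assuming Selenium and a Chrome profile already logged in to Mixcloud.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
# Reuse an existing profile so the Mixcloud login cookies come along
# (this path is an example; yours will differ).
options.add_argument('--user-data-dir=/home/me/.config/google-chrome')

driver = webdriver.Chrome(options=options)
driver.get('https://www.mixcloud.com/upload/cubicgarden/follow-me-into-the-fading-moonlight/edit/')

# Wait for the JavaScript-rendered tracklist rows before grabbing the DOM.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'section-row')))

with open('follow-me-into-the-fading-moonlight.html', 'w', encoding='utf-8') as fo:
    fo.write(driver.page_source)

driver.quit()

With the pages saved, I knocked together a quick and dirty bash script and fired up a terminal.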
#!/bin/bash
# Quick and dirty wrapper: takes a mix name and expects mixname.html to exist.
./extractor.py "$1.html" "$1.csv"
# Verify
echo "Details for $1"
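Saved as, say, extract.sh (my name for it here), each mix was then a one-liner:

./extract.sh follow-me-into-the-fading-moonlight

which reads follow-me-into-the-fading-moonlight.html and writes follow-me-into-the-fading-moonlight.csv alongside it.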
I thought about modifying Jon’s script to generate the cue files directly, bypassing the CSV file, but decided I should just get them all done first, because I still need to get Funkwhale going.
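If I do circle back to it, the CSV to cue step is small. Here’s a rough sketch of what it might look like, assuming the four columns Jon’s script writes (number, artist, title, time), times in MM:SS or H:MM:SS form, and an audio filename passed on the command line:

#!/usr/bin/env python3
# A rough sketch of the cue generation I skipped, assuming the CSV columns
# from Jon's script (number, artist, title, time) and MM:SS or H:MM:SS times.
import csv
import sys


def to_cue_index(time_str: str) -> str:
    # Cue INDEX points are MM:SS:FF (75 frames per second); fold hours into minutes.
    parts = [int(p) for p in time_str.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return f'{hours * 60 + minutes:02d}:{seconds:02d}:00'


def csv_to_cue(csv_file: str, cue_file: str, audio_file: str):
    with open(csv_file, encoding='utf-8') as fi, \
            open(cue_file, 'w', encoding='utf-8') as fo:
        fo.write(f'FILE "{audio_file}" MP3\n')
        for row in csv.reader(fi):
            if not row:
                continue  # skip any blank lines in the CSV
            number, artist, title, time = row
            fo.write(f'  TRACK {int(number):02d} AUDIO\n')
            fo.write(f'    TITLE "{title}"\n')
            fo.write(f'    PERFORMER "{artist}"\n')
            fo.write(f'    INDEX 01 {to_cue_index(time)}\n')


if __name__ == '__main__':
    # e.g. ./csv2cue.py mixname.csv mixname.cue mixname.mp3
    csv_to_cue(sys.argv[1], sys.argv[2], sys.argv[3])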
I did notice the edit page doesn’t include the genre or the year of the mix, but I can live with that, for now… Scraping web pages is certainly a throwback, but it’s a better solution than what I was originally thinking.
This will teach me to sort out my own house of data!