Music discovery

Finding new music one would enjoy is a common desire, and it's a problem for which computers and the Internet are well-suited. The major obstacles are copyright laws restricting the sharing of music itself, and centralized music recommender systems that neither share their data nor do a great job themselves. For a long time I imagined that distributed systems and increased user collaboration would help to replace the latter, making the data available for anyone to analyze, but that hasn't happened so far. Here I will focus on the analysis itself, even though there's not much freely and legally available data to analyze.

Metadata analysis

Metadata, including musical properties (possibly tagged by listeners) and presence in users' playlists, is relatively easy to acquire and handle: human listeners do the hard work themselves, they are likely to analyze, classify, and cluster the music better than neural networks or other statistical methods would, and copyright restrictions on the music itself don't apply to it. Elements of music, types of ornaments, voice type, and many other parameters can be part of the metadata, in addition to genres and other common tags.
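To illustrate how playlist membership alone could drive recommendations, here is a minimal sketch that counts how often tracks share a playlist and ranks candidates by that count. The playlists and track names are made up, and the function names (cooccurrence, recommend) are mine; a real system would likely normalize for track popularity and use much more data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(playlists):
    """Count how often each pair of tracks appears in the same playlist."""
    counts = defaultdict(Counter)
    for playlist in playlists:
        # Deduplicate so a track repeated in one playlist counts once.
        for a, b in combinations(sorted(set(playlist)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, liked, limit=3):
    """Rank tracks that co-occur with the liked ones, skipping the liked ones."""
    scores = Counter()
    for track in liked:
        for other, n in counts.get(track, {}).items():
            if other not in liked:
                scores[other] += n
    return [track for track, _ in scores.most_common(limit)]


# Made-up playlists; the track names are placeholders, not real data.
playlists = [
    ["track_a", "track_b", "track_c"],
    ["track_a", "track_b", "track_d"],
    ["track_b", "track_c"],
    ["track_e", "track_d"],
]
counts = cooccurrence(playlists)
suggestions = recommend(counts, {"track_a"})
print(suggestions)
```

Listener-assigned tags could be handled the same way, by treating each tag as a "playlist" of the tracks that carry it.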

The hard problem here is perhaps getting users to publish their playlists and tags freely, instead of giving them to centralized proprietary systems that won't give even the raw data back.

For now, there's the Spotify Million Playlist Dataset Challenge, which requires registration and acceptance of a few long user agreements, and is then apparently free for research and non-commercial use.

Audio analysis

As with other statistical analysis, there's usually a preprocessing step, where the mel-frequency cepstrum is often employed to simplify the sound and split it into timbre and timbre-less features, followed by the actual statistical model, which is often a neural network for larger inputs. One may also try feeding the model raw input.
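The preprocessing step can be sketched as follows: a rough, NumPy-only computation of mel-frequency cepstral coefficients (MFCCs). The frame sizes, filter count, and helper names are my assumptions, chosen to resemble common defaults, and the input is a synthetic sine tone rather than real music. In practice one would probably use an audio library instead. The low-order coefficients roughly describe the spectral envelope, which is what is usually associated with timbre.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):    # rising slope
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope, peak of 1 at center
            bank[i, k] = (right - k) / (right - center)
    return bank


def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T


def mfcc(signal, sample_rate, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    """Return an array of shape (frames, n_coeffs); needs len(signal) >= frame_len."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)             # windowed frames
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft  # power spectrum
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # Small constant avoids log(0) for silent bands.
    return dct_ii(np.log(energies + 1e-10), n_coeffs)


# Synthetic input: one second of a 440 Hz sine at 16 kHz (not real music).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
features = mfcc(np.sin(2 * np.pi * 440.0 * t), sample_rate)
print(features.shape)
```

The resulting per-frame vectors are what would then be fed to the statistical model, for instance averaged per track and compared by distance, or passed frame by frame to a neural network.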

Manual discovery

While automation is potentially useful, the data isn't readily available. One can still rely on discovery methods that would be archaic if it weren't for copyright: Internet "radios", for instance. The Icecast directory lists a bunch of those.