Machine learning tools and stuff

Machine learning is quite interesting. As a functional programmer, I tend to apply algebra and logic more often than other maths, yet applied maths, and statistics in particular, do pop up here and there from time to time. For years I have been putting off learning it more or less properly, and I used to choose a language first and libraries after that, which often led to a lack of machine learning/statistics-related libraries, a lack of practice, and boredom. So this time I have checked what is out there, ignoring languages, and just poked the fun bits.

Well, I also tried to get a bit further into statistics; a part of that is covered here. This is partly a note for myself, so that later it will be easy to pick up from where I stopped, with the links collected here.

1 Theory

Linear algebra, mathematical analysis, probability theory, and set theory, of course, are all used in statistics/ML. Yet it doesn't look like much of that is used directly when it comes to application; well, maybe it is, once one is past the very newbie level, but otherwise even the papers seem to be of the form "let's try a few approaches and see which one works better", or "we've used this and it worked well", somewhat like recipes. The process indeed reminds me of cooking: some knowledge, some practice and intuition, a bunch of techniques.

For an overview of what's used in practice, there are papers such as "An Empirical Comparison of Machine Learning Models for Time Series Forecasting", and each time I open the HN front page, there is something on the topic. The "Deep Learning" book contains a nice introduction and plenty of references; songrotek/Deep-Learning-Papers-Reading-Roadmap and A Guide to Deep Learning list free books and papers. The OpenIntro Statistics textbook seems to be a nice (and free) introductory book, too.

Actually, when it comes to materials, not only are there plenty of them, but there are also materials for pretty much any level and background. Statistics is intended to be used in different sciences, so there are materials for pretty much anyone (though many of them lack proofs, and sometimes they even lack theory/explanations: just lists of rules with constants to apply, not useful for understanding). Machine learning is quite hyped nowadays, so there are plenty of materials on that specifically. The "Machine Learning is Fun!" articles, described as "The world’s easiest introduction to Machine Learning", may also serve as an easy-to-read overview.

Books on statistics seem to be helpful, of course. I've lost the one I started with, but there are plenty of them.

The Probability and Statistics Cookbook may serve as a cheatsheet, though there's also Wikipedia for that.

As in any field, there is a lot to learn, and one should learn a lot to avoid doing something stupid. At the same time, mistakes may be less harmful in statistics than in, say, cryptography, and at some point one should get to applying what they've learned.

2 Tools

Though it seems that one could use something like Octave, the most popular languages for ML/statistics seem to be Python and R. Julia looks nicer, but the tools are probably less mature there. There are also lists of libraries for different languages, such as awesome-machine-learning and awesome-rnn, but as mentioned above, the options are rather limited if you choose a language first. There's also a list on Wikipedia.

So, here are just the first impressions, after a couple of days of active poking.

2.1 Python

There is a nice library called scikit-learn. It has good documentation, plenty of helpful functions, etc. There are no fancy RNNs, but it is still nice to poke, and may be useful in some cases. matplotlib is there for plotting.

Then there are Theano and TensorFlow, and a bunch of libraries implemented on top of those. Both are pretty much like languages, with their own type systems, functions, variables, conditionals, loops, semantics; well, everything except for a syntax. It's really awkward to work with them that way: since Python doesn't facilitate DSLs in any way, one just composes an AST using a Python API. Even high-level libraries (e.g., Lasagne) depend on that. I was rather surprised to find that, and decided that if it involves learning a new language, it had better be a sane one, and moved on to R.
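To show what "composing an AST using a Python API" looks like, here is a minimal pure-Python sketch. It is not Theano or TensorFlow code: the names Node, var, const, and evaluate are made up for illustration, and real libraries do far more (types, shapes, gradients, compilation). The point is only that the arithmetic-looking line builds a description of a computation, which is evaluated separately.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """One node of a hand-rolled expression tree (hypothetical, not a library type)."""
    op: str            # "var", "const", "add", or "mul"
    args: Tuple = ()   # child nodes, a variable name, or a constant value

    # Operator overloading only builds new nodes; nothing is computed here.
    def __add__(self, other: "Node") -> "Node":
        return Node("add", (self, other))

    def __mul__(self, other: "Node") -> "Node":
        return Node("mul", (self, other))


def var(name: str) -> Node:
    """Create a named placeholder whose value is supplied at evaluation time."""
    return Node("var", (name,))


def const(value: float) -> Node:
    """Wrap a plain number so that it can take part in the tree."""
    return Node("const", (value,))


def evaluate(node: Node, env: Dict[str, float]) -> float:
    """Walk the tree and compute its value, given values for the variables."""
    if node.op == "var":
        return env[node.args[0]]
    if node.op == "const":
        return node.args[0]
    left, right = (evaluate(arg, env) for arg in node.args)
    return left + right if node.op == "add" else left * right


# Looks like arithmetic, but only describes the computation y = x * x + 3.
x = var("x")
y = x * x + const(3.0)

# The actual computation happens here, as a separate step.
result = evaluate(y, {"x": 2.0})
print(y.op, result)
```

Conditionals and loops have to be expressed the same way, as nodes created through function calls rather than Python's own if and for, which is where the awkwardness comes from.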

Installation of those on CentOS 7 was rather awkward, too: scikit-learn only required installing atlas-devel, but the others had dependencies in third-party repositories, required reconfiguring Python with the --enable-shared option, etc. IIRC their installation instructions also suggested to curl | sh things.

2.2 R

Installation is relatively nice (./configure --disable-java && make && sudo make install install-info, if you don't have/want Java), there is an Emacs mode (M-x package-install RET ess RET), and the intro is also available in info, so one can poke it all in Emacs.

While the language is not particularly consistent or nice as a general-purpose language, it seems to be well-suited for the task: type conversions are implicit, but relatively sane (well, sometimes you get bin2int(int2bin(1)) == 16777216, but not everything in statistics is intuitively obvious), and handy for trying things out; the primary data structures are vectors, matrices, arrays, lists, and data frames, the ones you'll need often, and there are plenty of functions to work with those.

There are a lot of R packages, too; not all are statistics-related, but many are. The ones I've tried were quite easy to use, too.

An interesting picture popped up while I was poking it: languages like Octave and R could be associated with applied maths, the ones like Haskell and Idris with pure maths, and most of the mainstream ones with basic arithmetic. Well, some languages are actually designed with particular theories in mind and/or based on those, but the picture I've imagined became better balanced after adding R to it.

3 Application and practice

This is the somewhat sad part: there are data sets to play with, and there are techniques to apply, but it's not that fun without an actual application, since it's not very exciting to just observe that the things which should work do in fact work. I've been thinking of log analysis, but apparently it won't be very useful without more structured logs than the ones I have; and of wake hours prediction, but I don't have reliable observations for that.

Not long ago I needed to fit some lines to given sets of points, but even then it was desirable to guess the actual human-defined formula behind those (since there was one, and the known points were rough approximations), not just something that fits.
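For the plain "something that fits" part, an ordinary least-squares line takes only a few lines of code. The following is a minimal sketch in Python (not R, and not the code used back then), with made-up points and a hypothetical helper named fit_line; it does nothing about recovering a human-defined formula.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are equal; the line is vertical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    return slope, mean_y - slope * mean_x


# Made-up points, roughly following y = 2x + 1 with some noise.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
slope, intercept = fit_line(points)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Looking at how close the fitted coefficients are to "nice" values might hint at the formula a human would have picked, but that is only a guess, not a method.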

Finding music that I may like would be useful, but there is no huge music database around (although if there were a distributed network for that, with users applying a given classifier to what they have, that could work; alas, I haven't heard of anything of that kind). Actually, it could be generalized to a fancy file search, but that's not exactly on the topic of statistics.

Apparently most of the fun deep learning things are done with pictures nowadays (well, same as before, or maybe I'm just finding them more fun), including GANs, but for some reason I'm not that excited about it now. Though it might be fun to, say, draw cartoons based on photos, and try to train a network to reverse that process.

I guess the approach of XKCD #208 may do for now: just learning the basics, to apply them someday, maybe. Besides, R may serve as a calculator with plotting – the task for which I've occasionally used Octave before.