The machine learning algo zoo

One of the wonderful, but also baffling, things about machine learning is that there are so many ways to do it. At some very high level, most of them do something like this (highlighting some jargon):

  1. The human settles on a task (“Predict lithology”) and finds a bunch of data relevant to that task (say, some well logs A, B, and C). Then the human has to come up with some known instances or examples where these well log data go with those lithology labels.

  2. Stuff the logs into an equation. Not an equation like A + B + C, because there’s nothing to tweak in that equation. The equation needs parameters or coefficients, like \(\alpha A + \beta B + \gamma C\). The machine can tweak those Greek letters to change the output. At first, they’ll be random guesses.

  3. See how the output of that equation, which is the machine’s prediction, compares to the known labels. Come up with another equation whose output is a good measure of how far away the predictions are from the known labels. This distance is called the cost, and the equation that computes it is called the cost function.

  4. Now that the machine has something to guess (the Greek parameters) and a way to know how well it’s doing (the cost function), it just needs a way to minimize the cost, or to put it another way, optimize the parameters. This optimization process is called learning.

Together, these steps constitute a learning algorithm. An algorithm with a set of optimized weights is usually referred to simply as a model.
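
To make that concrete, here’s a minimal sketch of the whole loop in Python. Everything in it is invented for illustration (the ‘logs’ are just random numbers, and the optimizer is plain gradient descent), but all four steps are there:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                  # step 1: data (columns A, B, C)
true_params = np.array([2.0, -1.0, 0.5])
y = X @ true_params + rng.normal(scale=0.1, size=100)  # step 1: known labels

params = rng.normal(size=3)                    # step 2: random initial alpha, beta, gamma

learning_rate = 0.1
for _ in range(500):                           # step 4: optimization, aka learning
    y_pred = X @ params                        # step 2: alpha*A + beta*B + gamma*C
    cost = np.mean((y_pred - y) ** 2)          # step 3: cost function (mean squared error)
    grad = 2 * X.T @ (y_pred - y) / len(y)     # gradient of the cost w.r.t. the params
    params -= learning_rate * grad             # tweak the Greek letters downhill

print(params)  # close to true_params: the optimized weights are the model
```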

All the algorithms

Every piece of this story is worth a whole blog post on its own, but for today let’s stay high-level.

The problem is that the algorithm zoo can be overwhelming. My post last week was an attempt to compare a lot of regression algorithms, in terms of how they make sense of three synthetic datasets.

Today I’m sharing a Big Giant Spreadsheet™ that attempts to compare some of the most popular ‘shallow’ learning algorithms in terms of their most important characteristics. For example, can they predict probabilities? Are they deterministic? What are the key hyperparameters? And so on.
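
To give you the flavour, you can answer the probabilities question for any scikit-learn classifier programmatically. This handful of models is just an example, not the spreadsheet’s full list:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Which of these classifiers can predict probabilities? Note that SVC
# only exposes predict_proba when constructed with probability=True.
classifiers = [LogisticRegression(), SVC(), SVC(probability=True), RandomForestClassifier()]
for clf in classifiers:
    print(type(clf).__name__, hasattr(clf, "predict_proba"))
```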

Here’s a small version of the table (see the links below for other versions):

There’s a PDF version here — and here’s the original spreadsheet.

Eventually, I envisage this becoming a poster for the wall. I think it would be nice to have some equations on there. Maybe the plots from the various comparisons too (see last week’s post!). And even more advice, like which ones break when you have too many features. What else would you like to see on there?

Comparing regressors

There are several really nice comparisons between various algorithms in the Scikit-Learn documentation. The most famous, and useful, one is probably the classifier comparison:

A comparison of classification algorithms. Each row is a different dataset; each column (except the first) is a different classifier, each trying to separate the blue and red points. The accuracy score of each classifier is shown in the lower right corner of each plot. There’s so much to look at in this one plot!
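
The nice thing is that the core of such a comparison is only a few lines of scikit-learn. Here’s a stripped-down sketch, with one synthetic dataset and three classifiers standing in for the full grid:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One synthetic dataset, three classifiers, one accuracy score each.
# The real scikit-learn example adds more datasets, more classifiers,
# and the decision-surface plots that make the figure so interesting.
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for clf in [KNeighborsClassifier(3), SVC(gamma=2, C=1), DecisionTreeClassifier(max_depth=5)]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, round(clf.score(X_test, y_test), 2))
```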

There’s also a very nice clustering algorithm comparison, and this anomaly detection comparison. As usual with awesome open source software packages like Scikit-Learn, the really wonderful thing is that all the source code is right there so you can hack these things to show your own data.

What about regression?

Regression problems are the other major kind of machine learning task. If the thing you’re trying to predict is not a category (like ‘blue’ or ‘red’, as above) but a continuous property (like porosity, say), then you’re looking at a regression problem.

I wondered what a comparison plot for the various regressors in Scikit-Learn would look like. I couldn’t find one, so I made one. I made up three one-dimensional datasets — one linear, one polynomial, and one periodic. Then I tried predicting each one with various model types, from linear regression to a deep neural network. Here’s version 1 (well, 0.1 really) of my script; feel free to adapt and improve it!
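
The gist of the experiment looks something like the sketch below. To be clear, this is not my actual script: the datasets, noise levels, and model choices here are simplified stand-ins, and there’s no validation split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Three synthetic 1D datasets: linear, polynomial, and periodic.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
noise = rng.normal(scale=0.5, size=100)
datasets = {
    "linear": 2 * x.ravel() + 1 + noise,
    "polynomial": 0.2 * (x.ravel() - 5) ** 2 + noise,
    "periodic": np.sin(x.ravel()) + 0.1 * noise,
}

# Fit a few regressors to each dataset and report RMS error.
regressors = [LinearRegression(), Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=5), SVR()]
for name, y in datasets.items():
    for reg in regressors:
        y_pred = reg.fit(x, y).predict(x)
        rmse = mean_squared_error(y, y_pred) ** 0.5
        print(f"{name:10s} {type(reg).__name__:22s} RMS error: {rmse:.2f}")
```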

Here’s the plot the full script produces:

A comparison of most of the regressors in scikit-learn, made with this script. The red lines are unregularized models; the blue have regularization. The pale points are the validation data. The small numbers in each plot are RMS error (lower is better!).

I think this plot repays careful study. Notice the smoothing effect of regularization. See how tree-based methods result in discretized predictions, and kernel-based ones are pretty horrible at extrapolation.
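
The extrapolation problem is easy to demonstrate yourself. Here’s a tiny made-up example with a decision tree (any 1D function would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A tree fit on x in [0, 5] predicts a constant for any x beyond the
# training range, because predictions can only come from existing leaves.
x_train = np.linspace(0, 5, 50).reshape(-1, 1)
y_train = np.sin(x_train).ravel()
tree = DecisionTreeRegressor(max_depth=4).fit(x_train, y_train)

print(tree.predict([[2.0], [6.0], [100.0]]))  # the last two are identical
```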

I’m 100% open to feedback on ways to improve this plot… or please improve it and show me how it goes!

Take one, make one

There’s a teaching method originating in medicine known as “see one, do one, teach one”. I like it because it underscores hands-on practice and knowledge sharing as essential steps in developing a craft — and it works. Today, I want to urge you to take a challenge, then make one for others.

First, what’s the challenge?

A couple of years ago, inspired by the annual Advent of Code challenges, we introduced the kata, a set of coding challenges especially for geoscientists. For a long time we sent them to students in our Geocomputing class, to encourage them to keep coding. Now we just tell everyone about them.

At the time we announced the kata, there were five puzzles. Today, there are 11: four beginner-friendly challenges, four intermediate ones, and three quite hard ones. Topics range from data munging to map indexing, and from digital elevation models to fractures.

💡 If you want to try one, this Colab is the easiest way to get started: https://ageo.co/kata-live

Now make one!

Once you’ve got an idea of how these things work, you might want to try your hand at making one. When you have an idea for a short task, you need a way to generate a random dataset. For example, for the sample-names challenge, I have a function that generates a random set of sample names, composed of several parts (a number, a basin, a formation, a date, etc).
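
Here’s a hypothetical version of such a generator. None of these basins, formations, or formats come from the actual kata; they just show the pattern of building names from seeded random parts, so that every player with the same key gets the same, checkable dataset:

```python
import random

BASINS = ["Permian", "Williston", "Anadarko"]        # invented part lists
FORMATIONS = ["Wolfcamp", "Bakken", "Atoka"]

def random_sample_names(n, seed=None):
    """Generate n made-up names like '042_Permian_Wolfcamp_2013-05-07'."""
    rng = random.Random(seed)  # seeding makes the dataset reproducible
    names = []
    for _ in range(n):
        number = f"{rng.randint(1, 999):03d}"
        basin = rng.choice(BASINS)
        formation = rng.choice(FORMATIONS)
        date = f"{rng.randint(2001, 2020)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"
        names.append("_".join([number, basin, formation, date]))
    return names

print(random_sample_names(3, seed=42))
```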

When you have a dataset, you can ask some questions about it. Start with an easy one, and build from there. The last question (there can be 3 or 4) should be a somewhat realistic challenge for this kind of data. Each question needs a hint, and each question must have only one possible answer (this is the tricky bit!).

If you fancy trying your hand at it, check out our new kata-dev repository on GitHub. There is a demo challenge there, which is also live on the kata server, so you can see how it all works. Good luck!


Whether or not you try making a challenge for your peers, let us know how you get on in the #kata-challenges channel on the Software Underground. We’re always ready to answer questions about them.