Openness is a two-way street

Last week the Data Analysis Study Group of the SPE Gulf Coast Section announced a new machine learning contest (I’m afraid registration is now closed, even though the contest has not started yet). The task is to predict shear-wave sonic from other logs, similar to the SPWLA PDDA contest last year. This is a valuable problem in the subsurface, because the shear sonic log is essential for computing the elastic properties of rocks, and therefore for predicting rock and fluid properties or processing seismic. Indeed, TGS have built a business on predicted logs with their ARLAS product. There’s money in log prediction!
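
To make the link between shear sonic and elastic properties concrete, here is a minimal sketch using the standard relations for an isotropic elastic rock. The function name and the example values are made up for illustration. Real workflows start from sonic slowness logs and a density log, and this is not the contest's own code.

```python
def elastic_moduli(vp, vs, rho):
    """Shear modulus and bulk modulus (Pa) and Poisson's ratio.

    vp, vs are P- and S-wave velocities in m/s; rho is density in kg/m^3.
    Assumes an isotropic, linearly elastic rock.
    """
    mu = rho * vs**2                                  # shear modulus needs Vs
    k = rho * (vp**2 - (4.0 / 3.0) * vs**2)           # bulk modulus needs Vp and Vs
    nu = (vp**2 - 2 * vs**2) / (2 * (vp**2 - vs**2))  # Poisson's ratio
    return mu, k, nu


# Hypothetical values, roughly in the range of a consolidated sandstone.
mu, k, nu = elastic_moduli(vp=3000.0, vs=1500.0, rho=2400.0)
print(f"mu = {mu / 1e9:.2f} GPa, K = {k / 1e9:.2f} GPa, nu = {nu:.3f}")
```

Every one of these quantities depends on Vs, which is why a missing shear log is such a nuisance and a good prediction is worth having.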

The task looks great, but there’s one big problem: the dataset is not open.

Why is this a problem?

Before answering that, let’s look at some context.

What’s a machine learning contest?

Good question. Typically, an organization releases a dataset (financial timeseries, Netflix viewer data, medical images, or whatever). They invite people to predict some valuable property (when to sell, which show to recommend, how to treat the illness, or whatever). And they pick the best, measured against known labels on a hidden dataset.
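
As a rough illustration of the scoring step, here is a toy sketch of how an organizer might rank submissions against labels that only they can see. The team names, labels and the choice of plain accuracy as the metric are all invented for this example; real contests pick their own metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the hidden labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("Submission has the wrong number of predictions.")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Known only to the organizer (hypothetical labels).
hidden_labels = ["sand", "shale", "shale", "sand", "silt"]

# What each (hypothetical) team sent in.
submissions = {
    "team_a": ["sand", "shale", "sand", "sand", "silt"],
    "team_b": ["sand", "shale", "shale", "sand", "silt"],
    "team_c": ["shale", "shale", "shale", "shale", "shale"],
}

scores = {team: accuracy(hidden_labels, pred) for team, pred in submissions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for place, team in enumerate(ranking, start=1):
    print(place, team, scores[team])
```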

Kaggle is one of the largest platforms hosting such challenges, and they often attract thousands of participants — competing for large prizes. TGS ran a seismic salt-picking contest on the platform, attracting almost 74,000 submissions from 3220 teams with a $100k prize purse. Other contests are more grass-roots, like the one I ran with Brendon Hall in 2016 on lithology prediction, and like this SPE contest. It’s being run by a team of enthusiasts without a lot of resources from SPE, and the prize purse is only $1000 — representing about 3 hours of the fully loaded G&A of an oil industry professional.

What has this got to do with reproducibility?

Contests that award a large prize in return for solving a hard problem are essentially just a kind of RFP-combined-with-consulting-job. It’s brutally inefficient: hundreds or even thousands of people spend hours on the problem for free, and a handful are financially rewarded. These contests attract a lot of attention, but I’m not that interested in them.

Community-oriented events like this SPE contest — and the recent FORCE one that Xeek hosted — are more interesting and I believe they are more impactful. They have lots of great outcomes:

  • Lots of people have fun working on a hard problem and connecting with each other.

  • Solutions are often shared after, or even during, the contest, so that everyone learns and grows their toolbox.

  • A new open dataset is created, which might even become a much-needed benchmark for the task in hand.

  • Researchers can publish what they did, or what they do later with the data. (The SEG ML contest tutorial and results article have 136 citations between them, largely from people revisiting the dataset to show new solutions.)

A lot of new open-source machine learning code is always exciting, but if the data is not open then the work is by definition not reproducible. It seems especially unfair — cheeky, even — to ask participants to open-source their code, but to keep the data proprietary. For sure TGS is interested in how these free solutions compare to their own product.

Well, life’s not fair. Why is this a problem?

The data is being shared with the contest participants on the condition that they may not share it. In other words it’s proprietary. That means:

  • Participants are encumbered with the liability of a proprietary dataset. Sure, TGS is sharing this data in good faith today, but who knows how future TGS lawyers will see it after someone accidentally commits it to their GitHub repo? TGS is a billion-dollar company; they will win a legal argument with you. (Having said that, there’s no NDA or anything, just a checkbox in a form. I don’t know how binding it really is… but I don’t want to be the one that finds out.)

  • Participants can’t publish reproducible papers on their own work. They can publish classic oil-industry, non-reproducible work (“I did this thing but no-one can check it because I can’t give you the data”), but do we really need more of that? (In the contest introductory Zoom, someone asked about publishing plots of the data. The answer: “It should be fine.” Are we really still this naive about data?)

If anyone from TGS is reading this and thinking, “Come on, we’re not going to sue anyone — we’re not GSI! — it’s fine :)” then my response is: Wonderful! In that case, why not just formalize everything by releasing the data under an open licence — preferably Creative Commons Attribution 4.0? (Unmodified! Don’t make the licensing mistakes that Equinor and NAM have made recently.) That way, everyone knows their rights, everyone can safely download the data, and the community can advance. And TGS looks pretty great for contributing an awesome dataset to the subsurface machine learning community.

I hope TGS decides to release the data with an open licence. If they don’t, it feels like a rather one-sided deal to me. And with the arrangement as it stands, there’s no way I would enter this contest.

No secret codes: announcing the winners

The SEG / Agile / Enthought Machine Learning Contest ended on Tuesday at midnight UTC. We set readers of The Leading Edge the challenge of beating the lithology prediction in October's tutorial by Brendon Hall. Forty teams, mostly of 1 or 2 people, entered the contest, submitting several hundred entries between them. Deadlines are so interesting: it took a month to get the first entry, and I received 4 in the second month. Then I got 83 in the last twenty-four hours of the contest.

How it ended

Rank  Team                              F1      Algorithm      Language  Solution
1     LA_Team (Mosser, de la Fuente)    0.6388  Boosted trees  Python    Notebook
2     PA Team (PetroAnalytix)           0.6250  Boosted trees  Python    Notebook
3     ispl (Bestagini, Tuparo, Lipari)  0.6231  Boosted trees  Python    Notebook
4     esaTeam (Earth Analytics)         0.6225  Boosted trees  Python    Notebook

The winners are a pair of graduate petroleum engineers, Lukas Mosser (Imperial College, London) and Alfredo de la Fuente (Wolfram Research, Peru). Not coincidentally, they were also one of the more, er, energetic teams; it's safe to say that they explored a good deal of the solution space. They were also very much part of the discussion about the contest on GitHub.com and on the Software Underground Slack chat group, aka Swung (you're in there, right?).

I will be sending Raspberry Shakes to the winners, along with some other swag from Enthought and Agile. The second-place team will receive books from SEG (thank you SEG Book Mart!), and the third-place team will have to content themselves with swag. That team, led by Paolo Bestagini of the Politecnico di Milano, deserves special mention — their feature engineering approach was very influential, being used by most of the top-ranking teams.

Coincidentally Gram and I talked to Lukas on Undersampled Radio this week:

Back up a sec, what the heck is a machine learning contest?

To enter, a team had to predict the lithologies in two wells, given wireline logs and other data. They had complete data, including lithologies, in nine other wells — the 'training' data. Teams trained a wide variety of models — from simple nearest neighbour models and support vector machines, to sophisticated deep neural networks and random forests. These met with varying success, with accuracies ranging between about 0.4 and 0.65 (i.e., error rates from 60% to 35%). Here's one of the best realizations from the winning model:

One twist that made the contest especially interesting was that teams could not just submit their predictions — they had to submit the code that made the prediction, in the open, for all their fellow competitors to see. As a result, others were quickly able to adopt successful strategies, and I'm certain the final result was better than it would have been with secret code.

I spent most of yesterday scoring the top entries by generating 100 realizations of each model, an approach the competitors themselves suggested as a way to deal with model variance. The job was made a little easier by the fact that all of the top-ranked teams used the same language (Python) and the same type of model: extreme gradient boosted trees. (It's possible that the homogeneity of the top entries was a negative consequence of the open format of the contest... or maybe it just worked better than anything else.)
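
For the curious, here is a minimal sketch of the idea of scoring over many realizations. It is not the scoring code used in the contest: the toy data, the bootstrapped nearest-centroid stand-in for a stochastic learner, and the choice to summarize with the median are all my assumptions for illustration. The metric shown is micro-averaged F1, which for a single-label problem like this one is the same number as accuracy.

```python
import random
import statistics


def micro_f1(y_true, y_pred):
    """Micro-averaged F1; equals accuracy for single-label multiclass data."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def noisy_model(X_train, y_train, X_test, seed):
    """Hypothetical stochastic learner: a nearest-centroid classifier
    fitted on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(X_train)) for _ in X_train]
    sums, counts = {}, {}
    for i in idx:
        label = y_train[i]
        sums[label] = sums.get(label, 0.0) + X_train[i]
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return [min(centroids, key=lambda c: abs(centroids[c] - x)) for x in X_test]


# Toy one-feature data standing in for the training and blind wells.
X_train = [0.1, 0.3, 0.2, 0.9, 1.1, 1.0, 2.0, 2.2, 1.9]
y_train = ["sand"] * 3 + ["silt"] * 3 + ["shale"] * 3
X_blind = [0.0, 0.6, 1.05, 1.6, 2.1]
y_blind = ["sand", "silt", "silt", "shale", "shale"]

# One score per realization; each realization differs only in its random seed.
scores = [
    micro_f1(y_blind, noisy_model(X_train, y_train, X_blind, seed))
    for seed in range(100)
]
print(f"median F1 over {len(scores)} realizations: {statistics.median(scores):.3f}")
print(f"range: {min(scores):.3f} to {max(scores):.3f}")
```

A single run of a stochastic model can land anywhere in that range, so ranking teams on one lucky (or unlucky) seed would not be very fair.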

What now?

There will be more like this. It will have something to do with seismic data. I hope I have something to announce soon.

I (or, preferably, someone else) could write an entire thesis on learnings from this contest. I am busy writing a short article for next month's Leading Edge, so if you're interested in reading more, stay tuned for that. And I'm sure there will be others.

If you took part in the contest, please leave a comment telling us about your experience of it or, better yet, write a blog post somewhere and point us to it.