October 31, 2018

Reproducibility Zoo

October 31, 2018/ Matt Hall

The Repro Zoo was a new kind of event at the SEG Annual Meeting this year. The goal: to reproduce the results from well-known or important papers in GEOPHYSICS or The Leading Edge. By reproduce, we meant that the code and data should be open and accessible. By results, we meant equations, figures, and other scientific outcomes.

And some of the results are scary enough for Hallowe’en :)

What we did

All the work went straight into GitHub, mostly as Jupyter Notebooks. I had a vague goal of hitting 10 papers at the event, and we achieved this (just!). I’ve since added a couple of other papers, since the inspiration for the work came from the Zoo… and I haven’t been able to resist continuing.

The scene at the Repro Zoo. An air of quiet productivity hung over the booth. Yes, that is Sergey Fomel and Jon Claerbout. Thank you to David Holmes of Dell EMC for the picture.

Here’s what the Repro Zoo team got up to, in alphabetical order:

Aldridge (1990). The Berlage wavelet. GEOPHYSICS 55 (11). The wavelet itself, which has also been added to bruges.
Batzle & Wang (1992). Seismic properties of pore fluids. GEOPHYSICS 57 (11). The water properties, now added to bruges.
Claerbout et al. (2018). Data fitting with nonstationary statistics, Stanford. Translating code from FORTRAN to Python.
Claerbout (1975). Kolmogoroff spectral factorization. Thanks to Stewart Levin for this one.
Connolly (1999). Elastic impedance. The Leading Edge 18 (4). Using equations from bruges to reproduce figures.
Liner (2014). Long-wave elastic attentuation produced by horizontal layering. The Leading Edge 33 (6). This is the stuff about Backus averaging and negative Q.
Luo et al. (2002). Edge preserving smoothing and applications. The Leading Edge 21 (2).
Yilmaz (1987). Seismic data analysis, SEG. Okay, not the whole thing, but Sergey Fomel coded up a figure in Madagascar.
Partyka et al. (1999). Interpretational aspects of spectral decomposition in reservoir characterization.
Röth & Tarantola (1994). Neural networks and inversion of seismic data. Kudos to Brendon Hall for this implementation of a shallow neural net.
Taner et al. (1979). Complex trace analysis. GEOPHYSICS 44. Sarah Greer worked on this one.
Thomsen (1986). Weak elastic anisotropy. GEOPHYSICS 51 (10). Reproducing figures, again using equations from bruges.

As an example of what we got up to, here’s Figure 14 from Batzle & Wang’s landmark 1992 paper on the seismic properties of pore fluids. My version (middle, and in red on the right) is slightly different from that of Batzle and Wang. They don’t give a numerical example in their paper, so it’s hard to know where the error is. Of course, my first assumption is that it’s my error, but this is the problem with research that does not include code or reference numerical examples.

Figure 14 from Batzle & Wang (1992). Left: the original figure. Middle: My attempt to reproduce it. Right: My attempt in red, overlain on the original.

This was certainly not the only discrepancy. Most papers don’t provide the code or data to reproduce their figures, and this is a well-known problem that the SEG is starting to address. But most also don’t provide worked examples, so the reader is left to guess the parameters that were used, or to eyeball results from a figure. Are we really OK with assuming the results from all the thousands of papers in GEOPHYSICS and The Leading Edge are correct? There’s a long conversation to have here.

What next?

One thing we struggled with was capturing all the ideas. Some are on our events portal. The GitHub repo also points to some other sources of ideas. And there was the Big Giant Whiteboard (below). Either way, there’s plenty to do (there are thousands of papers!) and I hope the zoo continues in spirit. I will take pull requests until the end of the year, and I don’t see why we can’t add more papers until then. At that point, we can start a 2019 repo, or move the project to the SEG Wiki, or consider our other options. Ideas welcome!

Thank you!

The following people and organizations deserve accolades for their dedication to the idea and hard work making it a reality. Please give them a hug or a high five when you see them.

David Holmes (Dell EMC) and Chance Sanger worked their tails off on the booth over the weekend, as well as having the neighbouring Dell EMC booth to worry about. David also sourced the amazing Dell tech we had at the booth, just in case anyone needed 128GB of RAM and an NVIDIA P5200 graphics card for their Jupyter Notebook. (The lights in the convention centre actually dimmed when we powered up our booths in the morning.)
Luke Decker (UT Austin) organized a corps of volunteer Zookeepers to help manage the booth, and provided enthusiasm and coding skills. Karl Schleicher (UT Austin), Sarah Greer (MIT), and several others were part of this effort.
Andrew Geary (SEG) for keeping things moving along when I became delinquent over the summer. Lots of others at SEG also helped, mainly with the booth: Trisha DeLozier, Rebecca Hayes, and Beth Donica all contributed.
Diego Castañeda got the events site in shape to support the Repro Zoo, with a dashboard showing the latest commits and contributors.

October 18, 2018

Café con leche

October 18, 2018/ Matt Hall

At the weekend, 28 digital geoscientists gathered at MAZ Café in Santa Ana, California, to sprint on some open geophysics software projects. Teams and individuals pushed pull requests — code contributions to open source projects — left, right, and centre. Meanwhile, Senah and her team at MAZ kept us plied with coffee and horchata, with fantastic food on the side.

Because people were helping each other and contributing where they could, I found it a bit hard to stay on top of what everyone was working on. But here are some of the things I heard at the project breakdown on Sunday afternoon:

Gerard Gorman, Navjot Kukreja, Fabio Luporini, Mathias Louboutin, and Philipp Witte, all from the devito project, continued their work to bring Kubernetes cluster management to devito. Trying to balance ease of use and unlimited compute turns out to be A Hard Problem! They also supported the other teams hacking on devito.

Thibaut Astic (UBC) worked on implementing DC resistivity models in devito. He said he enjoyed the expressiveness of devito’s symbolic equation definitions, but that there were some challenges with implementing the grad, div, and curl operator matrices for EM.

Vitor Mickus and Lucas Cavalcante (Campinas) continued their work implementing a CUDA framework for devito. Again, all part of the devito project trying to give scientists easy ways to scale to production-scale datasets.

That wasn’t all for devito. Alongside all these projects, Stephen Alwon worked on adapting segyio to read shot records, Robert Walker worked on poro-elastic models for devito, and Mohammed Yadecuri and Justin Clark (California Resources) contributed too. On the second day, the devito team was joined by Felix Hermann (now Georgia Tech), with Mengmeng Yang, and Ali Siakoohi (both UBC). Clearly there’s something to this technology!

Brendon Hall and Ben Lasscock (Enthought) hacked on an open data portal concept, based on the UCI Machine Learning Repository, coincidentally based just down the road from our location. The team successfully got some examples of open data and code snippets working.

Jesper Dramsch (Heriot-Watt), Matteo Niccoli (MyCarta), Yuriy Ivanov (NTNU) and Adriana Gordon and Volodymyr Vragov (U Calgary), hacked on bruges for the weekend, mostly on its documentation and the example notebooks in the in-bruges project. Yuriy got started on a ray-tracing code for us.

Nathan Jones (California Resources) and Vegard Hagen (NTNU) did some great hacking on an interactive plotting framework for geoscience data, based on Altair. What they did looked really polished and will definitely come in useful at future hackathons.

All in all, an amazing array of projects!

This event was low-key compared to recent hackathons, and I enjoyed the slightly more relaxed atmosphere. The venue was also incredibly supportive, making my life very easy.

A big thank you as always to our sponsors, Dell EMC and Enthought. The presence of the irrepressible David Holmes and Chris Lenzsch (both Dell EMC), and Enthought’s new VP of Energy, Charlie Cosad, was greatly appreciated.

We will definitely be revisiting the sprint concept in the future — einmal ist keinmal, as they say. Devito and bruges both got a boost from the weekend, and I think all the developers did too. So stay tuned for the next edition!

October 16, 2018

Volve: not open after all

October 16, 2018/ Matt Hall

Back in June, Equinor made the bold and exciting decision to release all its data from the decommissioned Volve oil field in the North Sea. Although the intent of the release seemed clear, the dataset did not carry a license of any kind. Since you cannot use unlicensed content without permission, this was a problem. I wrote about this at the time.

To its credit, Equinor listened to the concerns from me and others, and considered its options. Sensibly, it chose an off-the-shelf license. It announced its decision a few days ago, and the dataset now carries a Creative Commons Attribution-NonCommercial-ShareAlike license.

Unfortunately, this license is not ‘open’ by any reasonable definition. The non-commercial stipulation means that a lot of people, perhaps most people, will not be able to legally use the data (which is why non-commercial licenses are not open licenses). And the ShareAlike part means that we’re in for some interesting discussion about what derived products are, because any work based on Volve will have to carry the CC BY-NC-SA license too.

Non-commercial licenses are not open

Here are some of the problems with the non-commercial clause:

NC licenses are not 'open'. They do not meet the Open Definition, because open data must be available for anyone to use, for any purpose. For example, here’s a quote from Hagedorn et al (2011):

NC licenses come at a high societal cost: they provide a broad protection for the copyright owner, but strongly limit the potential for re-use, collaboration and sharing in ways unexpected by many users

NC licenses are incompatible with CC-BY-SA. This means that the data cannot be used on Wikipedia, SEG Wiki, or AAPG Wiki, or in any openly licensed work carrying that license.
NC-licensed data cannot be used commercially. This is obvious, but far-reaching. It means, for example, that nobody can use the data in a course or event for which they charge a fee. It means nobody can use the data as a demo or training data in commercial software. It means nobody can use the data in a book that they sell.
The boundaries of the license are unclear. It's arguable whether any business can use the data for any purpose at all, because many of the boundaries of the scope have not been tested legally. What about a course run by AAPG or SEG? What about a private university? What about a government, if it stands to realize monetary gain from, say, a land sale? All of these uses would be illiegal, because it’s the use that matters, not the commercial status of the user.

Now, it seems likely, given the language around the release, that Equinor will not sue people for most of these use cases. They may even say this. Goodness knows, we have enough nudge-nudge-wink-wink agreements like that already in the world of subsurface data. But these arrangements just shift the onus onto the end use and, as we’ve seen with GSI, things can change and one day you wake up with lawsuits.

ShareAlike means you must share too

Creative Commons licenses are, as the name suggests, intended for works of creativity. Indeed, the whole concept of copyright, depends on creativity: copyright protects works of creative expression. If there’s no creativity, there’s no basis for copyright. So for example, a gamma-ray log is unlikely to be copyrightable, but seismic data is (follow the GSI link above to find out why). Non-copyrightable works are not covered by Creative Commons licenses.

All of which is just to help explain some of the language in the CC BY-NC-SA license agreement, which you should read. But the key part is in paragraph 4(b):

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License

What’s a ‘derivative work’? It’s anything ‘based upon’ the licensed material, which is pretty vague and therefore all-encompassing. In short, if you use or show Volve data in your work, no matter how non-commercial it is, then you must attach a CC BY-NC-SA license to your work. This is why SA licenses are sometimes called ‘viral’.

By the way, the much-loved F3 and Penobscot datasets also carry the ShareAlike clause, so any work (e.g. a scientific paper) that uses them is open-access and carries the CC BY-SA license, whether the author of that work likes it or not. I’m pretty sure no-one in academic publishing knows this.

By the way again, everything in Wikipedia is CC BY-SA too. Maybe go and check your papers and presentations now :)

What should Equinor do?

My impression is that Equinor is trying to satisfy some business partner or legal edge case, but they are forgetting that they have orders of magnitude more capacity to deal with edge cases than the potential users of the dataset do. The principle at work here should be “Don’t solve problems you don’t have”.

Encumbering this amazing dataset with such tight restrictions effectively kills it. It more or less guarantees it cannot have the impact I assume they were looking for. I hope they reconsider their options. The best choice for any open data is CC-BY.

October 09, 2018

Reproduce this!

October 09, 2018/ Matt Hall

There’s a saying in programming: untested code is broken code. Is unreproducible science broken science?

I hope not, because geophysical research is — in general — not reproducible. In other words, we have no way of checking the results. Some of it, hopefully not a lot of it, could be broken. We have no way of knowing.

Next week, at the SEG Annual Meeting, we plan to change that. Well, start changing it… it’s going to take a while to get to all of it. For now we’ll be content with starting.

We’re going to make geophysical research reproducible again!

Welcome to the Repro Zoo!

If you’re coming to SEG in Anaheim next week, you are hereby invited to join us in Exposition Hall A, Booth #749.

We’ll be finding papers and figures to reproduce, equations to implement, and data tables to digitize. We’ll be hunting down datasets, recreating plots, and dissecting derivations. All of it will be done in the open, and all the results will be public and free for the community to use.

You can help

There are thousands of unreproducible papers in the geophysical literature, so we are going to need your help. If you’ll be in Anaheim, and even if you’re not, here some things you can do:

Vote on the papers and figures to reproduce, or propose new ones. Click here!
Show up at Booth #749 to take part. Bring your laptop, or use one of ours (kindly provided by Dell EMC, our amazing booth neighbours). We’ll mostly be coding in Python and Julia, but any open language is welcome.
Tell people about the Repro Zoo and bring them along too.

That’s all there is to it! Whether you’re a coder or an interpreter, whether you have half an hour or half a day, come along to the Repro Zoo and we’ll get you started.

Figure 1 from Connolly’s classic paper on elastic impedance. This is the kind of thing we’ll be reproducing.

Blog