Hard things that look easy

After working on a few data science (aka data analytics aka machine learning) problems with geoscientific data, I think we've figured out the 10-step workflow. I'm happy to share it with you now:

  1. Look at all these cool problems, machine learning can solve all of these! I just need to figure out which model to use, parameterize it, and IT'S GONNA BE AWESOME, WE'LL BE RICH. Let's just have a quick look at the data...
  2. Oh, there's no data.
  3. Three months later: we have data! Oh, the data's a bit messy.
  4. Six months later: wow, cleaning the data is gross and/or impossible. I hate my life.
  5. Finally, nice clean data. Now, which model do I choose? How do I set parameters? At least you expected these problems. These are well-known problems.
  6. Wait, maybe there are physical laws governing this natural system... oh well, the model will learn them.
  7. Hmm, the results are so-so. I guess it's harder to make predictions than I thought it would be.
  8. Six months later: OK, this sort of works. And people think it sounds cool. They just need a quick explanation.
  9. No-one understands what I've done.
  10. Where is everybody?

I'm being facetious of course, but only a bit. Modeling natural systems is really hard. Much harder for the earth than for, say, the human body, which is extremely well-known and readily available for inspection. Even the weather is comparatively easy.

Coupled with the extreme difficulty of the problem, we have a challenging data environment. Proprietary, heterogeneous, poor quality, lost, non-digital... There are lots of ways the data goblins can poop on the playground of machine learning.

If the machine learning lark is so hard, why not just leave it to the non-artificial intelligence of humans? We already learned how to interpret data, right? We know the model takes years to train. Of course, but I don't accept that we couldn't use some of the features of intelligently applied big data analytics: objectivity, transparency, repeatability (by me), reproducibility (by others), massive scale, high speed... maybe even error tolerance and improved decisions, but those seem far off right now.

I also believe that AI models, like any software, can encode the wisdom of professionals — before they retire. This seems urgent, as the long-touted Great Crew Change is finally underway.

What will we work on?

There are lots of fascinating and tractable problems for machine learning to attack in geoscience (I hope many of them get attacked at the hackathon in June), and the next 2 to 3 years are going to be very exciting. There will be the usual marketing mêlée to wade through, but it's up to the community of scientists and data analysts to push their way through that with real results based on open data and, ideally, open code.

To be sure, this is happening already: over 25 entrants have published their solutions to the SEG machine learning contest, and there will be more like this. It's the only way to build transparent problem-solving systems that we can all participate in and, ultimately, trust.

What machine learning problems are most pressing in geoscience?
I'm collecting ideas for projects to tackle in the hackathon. Please visit this Tricider question and contribute your comments, opinions, or ideas of your own. Help the community work on the problems you care about.

Coding to tell stories

Last week, I was in Calgary on family business, but I took an afternoon to host a 'private beta' for a short course that I am creating for geoscience computing. I invited about twelve familiar faces who would provide gentle and constructive feedback. In the end, thirteen geophysicists turned up, seven of whom I hadn't met before. So much for familiarity.

I spent about two and a half hours stepping through the basics of the Python programming language, which I consider essential material: getting set up with Python via Enthought Canopy, basic syntax, and so on. In the last hour of the afternoon, I steamed through a number of geoscientific examples to showcase exercises for this would-be course.

Here are three that went over well. Next week, I'll reveal the code for making these images. I might even have a go at converting some of my teaching materials from IPython Notebook to HTML:

To plot a wavelet

The Ricker wavelet is a simple analytic function that is used throughout seismology. This curvaceous waveform is easily described by a single variable, the dominant frequency of its many constituent frequencies. Every geophysicist and their cat should know how to plot one.
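As a rough sketch of what such a plot might involve (not the code from the course), the snippet below evaluates the standard Ricker expression with NumPy and hands it to Matplotlib. The function names `ricker` and `plot_wavelet`, the 25 Hz dominant frequency, the 128 ms duration, and the 1 ms sample interval are all illustrative assumptions.

```python
import numpy as np


def ricker(f, duration=0.128, dt=0.001):
    """Return (t, w) for a Ricker wavelet of dominant frequency f (Hz).

    w(t) = (1 - 2 pi^2 f^2 t^2) * exp(-pi^2 f^2 t^2), centred on t = 0.
    """
    n = int(round(duration / dt))
    # n + 1 samples, so there is a sample at (or extremely close to) t = 0.
    t = np.linspace(-duration / 2, duration / 2, n + 1)
    a = (np.pi * f * t) ** 2
    w = (1.0 - 2.0 * a) * np.exp(-a)
    return t, w


def plot_wavelet(t, w):
    """Plot amplitude against time; Matplotlib is imported only when needed."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(t, w)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Amplitude")
    ax.set_title("Ricker wavelet")
    plt.show()


# Assumed example value: a 25 Hz wavelet.
t, w = ricker(25.0)
```

Calling `plot_wavelet(t, w)` should then draw the familiar central peak with two side lobes. Changing the single argument `f` is all it takes to squeeze or stretch the waveform.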

To make a wedge

Once you can build a wavelet, the next step is to make that wavelet interact with the earth. The convolution of the wavelet with this 3-layer impedance model yields a synthetic seismogram suitable for calibrating seismic signals to subtle stratigraphic geometries. Every interpreter should know how to build a wedge, with site-specific estimates of wavelet shape and impedance contrasts. Wedge models are important in all instances of dipping and truncated layers at or below the limit of seismic resolution. So basically they are useful all of the time. 
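The recipe above (impedance model, then reflection coefficients, then convolution with the wavelet) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course code: the function names, the impedance values, the model dimensions, and the 25 Hz wavelet are all made up for the example, and a real wedge would use site-specific estimates.

```python
import numpy as np


def ricker(f, duration=0.128, dt=0.001):
    """Ricker wavelet of dominant frequency f (Hz), centred on t = 0."""
    n = int(round(duration / dt))
    t = np.linspace(-duration / 2, duration / 2, n + 1)
    a = (np.pi * f * t) ** 2
    return t, (1.0 - 2.0 * a) * np.exp(-a)


def wedge_model(n_samples=200, n_traces=50, top=60, max_thickness=50,
                impedances=(5.0e6, 6.5e6, 5.5e6)):
    """3-layer impedance array of shape (n_samples, n_traces).

    The middle layer thickens linearly from 0 to max_thickness samples
    across the traces. The impedance values are hypothetical.
    """
    z_above, z_wedge, z_below = impedances
    model = np.full((n_samples, n_traces), z_below, dtype=float)
    model[:top, :] = z_above
    thickness = np.linspace(0, max_thickness, n_traces).round().astype(int)
    for j, th in enumerate(thickness):
        model[top:top + th, j] = z_wedge
    return model


def reflectivity(impedance):
    """Normal-incidence reflection coefficients down each trace."""
    upper, lower = impedance[:-1], impedance[1:]
    return (lower - upper) / (lower + upper)


def synthetic(rc, wavelet):
    """Convolve every trace (column) of rc with the wavelet."""
    return np.apply_along_axis(
        lambda trace: np.convolve(trace, wavelet, mode="same"), 0, rc)


impedance = wedge_model()
rc = reflectivity(impedance)
_, wavelet = ricker(25.0)
synth = synthetic(rc, wavelet)
```

Displaying `synth` as an image (for example with Matplotlib's `imshow`) should show the top and base reflections merging and interfering as the wedge thins towards zero thickness.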

To make a 3D viewer

The capacity of Python to create stunning graphical displays with merely a few (thoughtful) lines of code seemed to resonate with people. But make no mistake, it is not easy to wade through the hundreds of function arguments to access this power and richness. It takes practice. It appears to me that practicing and training to search for and then read documentation is the bridge that carries people from the mundane to the empowered.
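For a flavour of those few lines, here is a hedged sketch that uses Matplotlib's 3D axes, which is not necessarily the viewer shown in the session. The surface is entirely made up (a dipping, gently folded horizon), and `synthetic_horizon` and `show_horizon` are hypothetical names.

```python
import numpy as np


def synthetic_horizon(n=60):
    """A made-up horizon: depth (m) on a 1 km by 1 km grid."""
    x = np.linspace(0, 1000, n)
    y = np.linspace(0, 1000, n)
    X, Y = np.meshgrid(x, y)
    # Regional dip in x plus one gentle fold in y.
    Z = 1500 + 0.1 * X + 40 * np.sin(2 * np.pi * Y / 1000)
    return X, Y, Z


def show_horizon(X, Y, Z):
    """Render the surface in 3D with depth increasing downwards."""
    import matplotlib.pyplot as plt

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    surf = ax.plot_surface(X, Y, Z, cmap="viridis")
    ax.invert_zaxis()
    ax.set_xlabel("x (m)")
    ax.set_ylabel("y (m)")
    ax.set_zlabel("Depth (m)")
    fig.colorbar(surf, ax=ax, shrink=0.6, label="Depth (m)")
    plt.show()


X, Y, Z = synthetic_horizon()
```

Calling `show_horizon(X, Y, Z)` opens an interactive window that can be rotated with the mouse. Most of the effort in a real display goes into choosing among the many optional arguments of calls like `plot_surface`, which is exactly where reading the documentation pays off.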

This dry-run suggested to me that there are at least two markets for training here. One is a place for showing what's possible: "Here's what we can do, now let’s go and build it". The other, more arduous path is the coaching, support, and resources to motivate students through the hard graft that follows. The former is centered on problem solving; the latter is centered on problem finding, which is where the work, creativity, and sweat are.

Would you take this course? What would you want to learn? What problem would you bring to solve?

Unsolved problems

One of the recurring dreams I've had this year is about unsolved problems. I've always loved these lists, the best known of which is perhaps the German mathematician David Hilbert's 1900 list of 23 unsolved problems in mathematics. There are several published versions of the list; take a look at a later manuscript describing some of the problems.

A year or two ago, I read this meta-list in Wikipedia. Natch, I immediately wanted to create a list of unsolved problems in geoscience. It could help researchers find big, interesting problems. It could help software developers focus their talents. It might just be a bit of fun. However, articles in Wikipedia need something to reference[citation needed], so even if I were capable of such a thing, one can't just sit down and hack one out.

But you can try. Earlier this year, I drafted a proto-list for geophysics, drawn mostly from chats with friends. Please feel free to vote on the list, or add problems of your own. It is, I admit, a bit biased towards problems in seismology in pursuit of hydrocarbons. The list should be much broader, but I'm not yet the polymath I strive to be and quickly get out of my depth!

Here are the top five (as of today) from my Google Moderator list of unsolved problems in geophysics:

  • How can we represent and quantify error and uncertainty from acquisition, through processing and interpretation, to analysis?
  • What useful signal or information can we extract from what we usually call 'noise' (multiples, refractions, reverberations, etc)?
  • How can we exploit the full spectrum in acquisition, processing, interpretation, and analysis?
  • Is there a 'best practice' for tying wells; if so, what is it?
  • What exactly is AVO-friendly processing?

What might a list of unsolved problems in geology look like? My likely-ignorant outlook suggests some:

  • Is it possible to predict the location, severity, and/or timing of earthquakes?
  • Do mantle plumes exist?
  • How do magnetic reversals happen?
  • Are mass extinctions cyclic?
  • Do the earth's physico-chemical systems mostly drive, or mostly respond to, changes in climate?
  • Does eustatic (global, synchronous, uniform) sea-level change happen, or does the ubiquity of local tectonism obviate the concept?
  • What exactly was the sequence of events that resulted in the end-Permian extinction? The end-Cretaceous?

I am proposing a workshop on the topic of unsolved problems in exploration and development geophysics at SEG next year in San Antonio. Ideas welcome.