The norm and simple solutions

Last time I wrote about different ways of calculating distance in a vector space — say, a two-dimensional Euclidean plane like the streets of Portland, Oregon. I showed three ways to reckon the distance, or norm, between two points (i.e. vectors). As a reminder, using the distance between points u and v on the map below this time:

$$ \|\mathbf{u} - \mathbf{v}\|_1 = |u_x - v_x| + |u_y - v_y| $$

$$ \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{(u_x - v_x)^2 + (u_y - v_y)^2} $$

$$ \|\mathbf{u} - \mathbf{v}\|_\infty = \mathrm{max}(|u_x - v_x|, |u_y - v_y|) $$

Let's think about all the other points on Portland's streets that are the same distance away from u as v is. Again, we have to think about what we mean by distance. If we're walking, or taking a cab, we'll need to think about \(\ell_1\) — the sum of the distances in x and y. This is shown on the left-most map, below.

For simplicity, imagine u is the origin, or (0, 0) in Cartesian coordinates. Then v is (0, 4). The sum of the distances is 4. Looking for points with the same sum, we find the pink points on the map.

If we're thinking about how the crow flies, or \(\ell_2\) norm, then the middle map sums up the situation: the pink points are all equidistant from u. All good: this is what we usually think of as 'distance'.

norms_equidistant_L0.png

The \(\ell_\infty\) norm, on the other hand, only cares about the maximum distance in any direction, or the largest absolute element in the vector. So all points whose largest absolute coordinate is 4 meet the criterion: (1, 4), (2, 4), (4, 3) and (4, 0) all work.

You might remember there was also a weird definition for the \(\ell_0\) norm, which basically just counts the non-zero elements of the vector. So, again treating u as the origin for simplicity, we're looking for all the points that, like v, have only one non-zero Cartesian coordinate. These points form an upright cross, like a + sign (right).
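If you'd like to check a few of these points yourself, here is a minimal sketch using NumPy. The \(\ell_\infty\) points are the ones quoted above; the others are my own illustrative picks, and `same_distance` is just a hypothetical helper name.

```python
import numpy as np

u = np.array([0, 0])   # the origin
v = np.array([0, 4])   # the reference point from the text

# Candidate points for each norm. The l_inf points are quoted from the text;
# the rest are illustrative picks, not taken from the maps.
candidates = {
    1: [(1, 3), (2, 2), (-4, 0)],
    2: [(4, 0), (0, -4), (2.4, 3.2)],
    np.inf: [(1, 4), (2, 4), (4, 3), (4, 0)],
    0: [(3, 0), (0, -2), (-7, 0)],
}

def same_distance(p, point):
    """True if `point` is as far from u as v is, measured with the l_p norm."""
    d_point = np.linalg.norm(np.array(point) - u, p)
    d_v = np.linalg.norm(v - u, p)
    return bool(np.isclose(d_point, d_v))

for p, points in candidates.items():
    print(p, [same_distance(p, pt) for pt in points])
```

Every entry should come back `True`; try a point like (1, 4) with p = 2 to see one that doesn't.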

So there you have it: four ways to draw a circle.

Wait, what?

A circle is just a set of points that are equidistant from the centre. So, depending on how you define distance, the shapes above are all 'circles'. In particular, if we normalize the (u, v) distance as 1, we have the following unit circles:

It turns out we can define any number of norms (if you like the sound of \(\ell_{2.4}\) or \(\ell_{240}\) or \(\ell_{0.024}\)...) but most of the time, these will suffice. You can probably imagine the shapes of the unit circles defined by these other norms.
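If you can't quite imagine them, one way to generate points on the unit circle of any p is to take a direction and rescale it so its p-norm is 1. This is only a sketch: `p_norm` and `unit_circle` are names I made up for illustration, and for p < 1 the function is not a true norm, although the rescaling trick still works.

```python
import numpy as np

def p_norm(points, p):
    """l_p 'norm' of each row of `points` (a true norm only for p >= 1)."""
    return np.sum(np.abs(points) ** p, axis=1) ** (1.0 / p)

def unit_circle(p, n=360):
    """Return n points whose l_p norm is 1, one per direction around the origin."""
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    directions = np.column_stack([np.cos(theta), np.sin(theta)])
    # Dividing a vector by its norm gives it unit norm (absolute homogeneity).
    return directions / p_norm(directions, p)[:, None]

for p in (0.5, 1, 2, 2.4, 240):
    circle = unit_circle(p)
    # Report the largest coordinate and confirm every point has unit p-norm.
    print(p, np.abs(circle).max(), np.allclose(p_norm(circle, p), 1.0))
```

Plotting `circle[:, 0]` against `circle[:, 1]` for each p should show the shape morphing from a pinched star (p < 1) through the diamond and the round circle towards the square.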

What can we do with this stuff?

Let's think about solving equations. Think about solving this:

$$ x + 2y = 8 $$

norms_line.png

I'm sure you can come up with a solution in your head, x = 6 and y = 1 maybe. But one equation and two unknowns means that this problem is underdetermined, and consequently has an infinite number of solutions. The solutions can be visualized geometrically as a line in the Euclidean plane (right).

But let's say I don't want solutions like (3.141590, 2.429205) or (2742, –1367). Let's say I want the simplest solution. What's the simplest solution?

norms_line_l2.png

This is a reasonable question, but how we answer it depends on how we define 'simple'. One way is to ask for the nearest solution to the origin. Also reasonable... but remember that we have a few different ways to define 'nearest'. Let's start with the everyday definition: the shortest crow-flies distance from the origin. The crow-flies, \(\ell_2\) distances all lie on a circle, so you can imagine starting with a tiny circle at the origin, and 'inflating' it until it touches the line \(x + 2y - 8 = 0\). This is usually called the minimum norm solution, minimized on \(\ell_2\). We can find it in Python like so:

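    import numpy.linalg as la

    A = [[1, 2]]  # coefficients of x + 2y = 8
    b = [8]       # right-hand side
    # lstsq returns (solution, residuals, rank, singular values)
    x, *_ = la.lstsq(A, b, rcond=None)
    print(x)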

The result is the vector (1.6, 3.2). You could almost have worked that out in your head, but imagine having 1000 equations to solve and you start to appreciate numpy.linalg. Admittedly, it's even easier in Octave (or MATLAB if you must) and Julia:

    A = [1 2]
    b = [8]
    A \ b

norms_line_all.png
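For a small system like this one, the \(\ell_2\) minimum norm solution can also be written in closed form as \(\mathbf{x} = A^\mathsf{T}(AA^\mathsf{T})^{-1}\mathbf{b}\), assuming the rows of A are linearly independent. A minimal sketch, with variable names of my own choosing:

```python
import numpy as np

A = np.array([[1.0, 2.0]])  # one equation: x + 2y = 8
b = np.array([8.0])

# Minimum l2-norm solution of an underdetermined, full-row-rank system:
# x = A^T (A A^T)^{-1} b, using solve() rather than an explicit inverse.
x_closed_form = A.T @ np.linalg.solve(A @ A.T, b)

# The Moore-Penrose pseudoinverse should give the same answer.
x_pinv = np.linalg.pinv(A) @ b

print(x_closed_form, x_pinv)
```

Both should agree with the (1.6, 3.2) found above, up to floating-point noise.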

But remember we have lots of norms. It turns out that minimizing other norms can be really useful. For example, minimizing the \(\ell_1\) norm — growing a diamond out from the origin — results in (0, 4). The \(\ell_0\) norm gives the same sparse* result. Minimizing the \(\ell_\infty\) norm leads to \( x = y = 8/3 \approx 2.67\).
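One crude way to check these numbers is to walk along the line and measure every point with each norm. The sketch below is brute force, not how you'd do it with 1000 equations (linear programming is the usual tool for \(\ell_1\) and \(\ell_\infty\)); the parameterization and the grid limits are my own assumptions.

```python
import numpy as np

# Parameterize the line x + 2y = 8 as (x, y) = (8 - 2t, t).
t = np.linspace(-10, 10, 20001)            # grid step of 0.001
points = np.column_stack([8 - 2 * t, t])   # every row satisfies x + 2y = 8

solutions = {}
for p in (1, 2, np.inf):
    norms = np.linalg.norm(points, ord=p, axis=1)   # l_p norm of each row
    solutions[p] = points[np.argmin(norms)]         # smallest-norm point
    print(p, solutions[p].round(3), norms.min().round(3))
```

To within the grid spacing, this should land on (0, 4), (1.6, 3.2) and (2.67, 2.67) respectively.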

This was the diagram I wanted to get to when I started with the 'how far away is the supermarket' business. So I think I'll stop now... have fun with Norm!


* I won't get into sparsity now, but it's a big deal. People doing big computations are always looking for sparse representations of things. They use less memory, are less expensive to compute with, and are conceptually 'neater'. Sparsity is really important in compressed sensing, which has been a bit of a buzzword in geophysics lately.

The norm: kings, crows and taxicabs

How far away is the supermarket from your house? There are lots of ways of answering this question:

  • As the crow flies. This is the green line from \(\mathbf{a}\) to \(\mathbf{b}\) on the map below.

  • The 'city block' driving distance. If you live on a grid of streets, all possible routes are the same length — represented by the orange lines on the map below.

  • In time, not distance. This is usually a more useful answer... but not one we're going to discuss today.

Don't worry about the mathematical notation on this map just yet. The point is that there's more than one way to think about the distance between two points, or indeed any measure of 'size'.

norms.png

Higher dimensions

The map is obviously two-dimensional, but it's fairly easy to conceive of 'size' in any number of dimensions. This is important, because we often deal with more than the 2 dimensions on a map, or even the 3 dimensions of a seismic stack. For example, we think of raw so-called 3D seismic data as having 5 dimensions (x position, y position, offset, time, and azimuth). We might even formulate a machine learning task with a hundred or more dimensions (or 'features').

Why do we care about measuring distances in high dimensions? When we're dealing with data in these high-dimensional spaces, 'distance' is a useful way to measure the similarity between two points. For example, I might want to select those samples that are close to a particular point of interest. Or, from among the points satisfying some constraint, select the one that's closest to the origin.

Definitions and nomenclature

We'll define norms in the context of linear algebra, which is the study of vector spaces (think of multi-dimensional 'data spaces' like the 5D space of seismic data). A norm is a function that assigns a positive scalar size to a vector \(\mathbf{v}\) , with a size of zero reserved for the zero vector (in the Cartesian plane, the zero vector has coordinates (0, 0) and is usually called the origin). Any norm \(\|\mathbf{v}\|\) of this vector satisfies the following conditions:

  1. Absolutely homogeneous. The norm of \(\alpha\mathbf{v}\) is equal to \(|\alpha|\) times the norm of \(\mathbf{v}\).

  2. Subadditive. The norm of \( (\mathbf{u} + \mathbf{v}) \) is less than or equal to the norm of \(\mathbf{u}\) plus the norm of \(\mathbf{v}\). In other words, the norm satisfies the triangle inequality.

  3. Positive. The first two conditions imply that the norm is non-negative.

  4. Definite. Only the zero vector has a norm of 0.
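These conditions are easy to spot-check numerically. The sketch below tests them on a pair of random vectors, which is a sanity check under my own choice of test data and not a proof. It also shows why the count of non-zero elements, which comes up later as the \(\ell_0\) "norm", fails the first condition.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 5))   # two random 5-D vectors (illustrative data)
alpha = -3.0

checks = {}
for p in (1, 2, np.inf):
    def norm(x, p=p):
        return np.linalg.norm(x, p)
    homogeneous = bool(np.isclose(norm(alpha * u), abs(alpha) * norm(u)))
    # Small tolerance so floating-point rounding can't cause a false failure.
    subadditive = bool(norm(u + v) <= norm(u) + norm(v) + 1e-12)
    definite = bool(norm(np.zeros(5)) == 0 and norm(u) > 0)
    checks[p] = (homogeneous, subadditive, definite)
    print(p, checks[p])

# Counting non-zero elements fails absolute homogeneity: scaling by -3
# leaves the count unchanged instead of tripling it.
l0_scaled = np.linalg.norm(alpha * u, 0)
l0_original = np.linalg.norm(u, 0)
print(l0_scaled, abs(alpha) * l0_original)
```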

Kings, crows and taxicabs

Let's return to the point about lots of ways to define distance. We'll start with the most familiar definition of distance on a map: the Euclidean distance, aka the \(\ell_2\) or \(L_2\) norm (confusingly, sometimes the two is written as a superscript), the 2-norm, or sometimes just 'the norm' (who says maths has too much jargon?). This is the 'as-the-crow-flies distance' on the map above, and we can calculate it using Pythagoras:

$$ \|\mathbf{v}\|_2 = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2} $$

You can extend this to an arbitrary number of dimensions: just keep adding the squared elementwise differences. We can also calculate the norm of a single vector in n-space, which is really just the distance between the origin and the vector:

$$ \|\mathbf{u}\|_2 = \sqrt{u_1^2 + u_2^2 + \ldots + u_n^2}  = \sqrt{\mathbf{u} \cdot \mathbf{u}} $$

As shown here, the 2-norm of a vector is the square root of its dot product with itself.

So the crow-flies distance is fairly intuitive... what about that awkward city block distance? This is usually referred to as the Manhattan distance, the taxicab distance, the \(\ell_1\) or \(L_1\) norm, or the 1-norm. As you can see on the map, it's just the sum of the absolute distances in each dimension, x and y in our case:

$$ \|\mathbf{v}\|_1 = |a_x - b_x| + |a_y - b_y| $$

What's this magic number 1 all about? It turns out that the distance metric can be generalized as the so-called p-norm, where p can take any positive value up to infinity. The definition of the p-norm is consistent with the two norms we just met:

$$ \| \mathbf{u} \|_p = \left( \sum_{i=1}^n | u_i | ^p \right)^{1/p} $$

[EDIT, May 2021: This generalized version is sometimes called the Minkowski distance, e.g. in the scipy documentation.]
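The formula translates almost symbol for symbol into NumPy. Here's a minimal sketch; `p_norm` is my own name for it, and the vector (3, −4) is just a convenient example.

```python
import numpy as np

def p_norm(u, p):
    """Direct transcription of the p-norm formula for a 1-D vector."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.abs(u) ** p) ** (1.0 / p)

u = np.array([3, -4])
for p in (1, 2, 3.5):
    # Compare the hand-rolled version with NumPy's built-in vector norm.
    print(p, p_norm(u, p), np.linalg.norm(u, p))
```

For p = 1 and p = 2 this should reproduce the taxicab and crow-flies lengths of 7 and 5.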

In practice, I've only ever seen p = 1, 2, or infinity (and 0, but we'll get to that). Let's look at the meaning of the \(\infty\)-norm, aka the \(\ell_\infty\) or \(L_\infty\) norm, which is sometimes called the Chebyshev distance or chessboard distance (because it defines the minimum number of moves for a king to any given square):

$$ \|\mathbf{v}\|_\infty = \mathrm{max}(|a_x - b_x|, |a_y - b_y|) $$

In other words, the Chebyshev distance is simply the largest absolute value among the elements of a given vector. In a nutshell, the infinitieth root of the sum of a bunch of numbers raised to the infinitieth power is the same as the infinitieth root of the largest of those numbers raised to the infinitieth power, because infinity is weird like that.
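You can watch this happen by cranking up p. A small sketch, using an arbitrary example vector of my own:

```python
import numpy as np

v = np.array([-5.0, -4.0])   # an arbitrary example vector

values = {}
for p in (1, 2, 10, 100):
    values[p] = np.linalg.norm(v, p)   # the p-norm shrinks as p grows
    print(p, values[p])

largest = np.max(np.abs(v))            # the value the p-norm is heading for
print("max", largest)
```

By p = 100 the contribution of the smaller element has all but vanished, and the p-norm is indistinguishable from the largest absolute element.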

What about p = 0?

Infinity is weird, but so is zero sometimes. Taking the zeroth root of a lot of ones doesn't make a lot of sense, so mathematicians often redefine the \(\ell_0\) or \(L_0\) "norm" (not a true norm) as a simple count of the number of non-zero elements in a vector. In other words, we toss out the 0th root, define \(0^0 := 0 \) and do:

$$ \| \mathbf{u} \|_0 = |u_1|^0 + |u_2|^0 + \cdots + |u_n|^0 $$

(Or, if we're thinking about the points \(\mathbf{a}\) and \(\mathbf{b}\) again, just remember that \(\mathbf{v}\) = \(\mathbf{a}\) - \(\mathbf{b}\).)
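One practical wrinkle: NumPy, like most languages, evaluates \(0^0\) as 1, so typing the formula in literally counts every element. A short sketch with a made-up vector shows the difference:

```python
import numpy as np

u = np.array([0.0, 3.0, 0.0, -2.0])   # illustrative vector with two zeros

# Literal formula: 0.0 ** 0 evaluates to 1.0, so every element is counted.
naive = np.sum(np.abs(u) ** 0)

# Counting the non-zero elements applies the 0**0 := 0 convention.
l0 = np.count_nonzero(u)

print(naive, l0)
```

So in code it's safer to count non-zeros directly, or to use `np.linalg.norm(u, 0)`, which does the same thing for vectors.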

Computing norms

Let's take a quick look at computing the norm of some vectors in Python:

 
>>> import numpy as np

>>> a = np.array([1, 1])
>>> b = np.array([6, 5])

>>> L_0 = np.count_nonzero(a - b)
>>> print(L_0)
2

>>> L_1 = np.sum(np.abs(a - b))
>>> print(L_1)
9

>>> L_2 = np.sqrt((a - b) @ (a - b))
>>> print(L_2)
6.4031242374328485

>>> L_inf = np.max(np.abs(a - b))
>>> print(L_inf)
5

>>> # Using NumPy's `linalg` module:
>>> import numpy.linalg as la
>>> for p in (0, 1, 2, np.inf):
...     print("L_{} norm = {}".format(p, la.norm(a - b, p)))
...
L_0 norm = 2.0
L_1 norm = 9.0
L_2 norm = 6.4031242374328485
L_inf norm = 5.0

What can we do with all this?

So far, so good. But what's the point of these metrics? How can we use them to solve problems? We'll get into that in a future post, so don't go too far!

For now I'll leave you to play with this little interactive demo of the effect of changing p-norms on a Voronoi triangle tiling — it's by Sarah Greer, a geophysics student at UT Austin. 


UPDATE — The next post is The norm and simple solutions, which looks at how these different norms can be used to solve real-world problems.

Hacking in Houston

geohack_2017_banner.png

Houston 2013
Houston 2014
Denver 2014
Calgary 2015
New Orleans 2015
Vienna 2016
Paris 2017
Houston 2017... The eighth geoscience hackathon landed last weekend!

We spent last weekend in hot, humid Houston, hacking away with a crowd of geoscience and technology enthusiasts. Thirty-eight hackers joined us on the top-floor coworking space, Station Houston, for fun and games and code. And tacos.

Here's a rundown of the teams and what they worked on.

Seismic Imagers

Jingbo Liu (CGG), Zohreh Souri (University of Houston).

Tech — DCGAN in Tensorflow, Amazon AWS EC2 compute.

The team looked for patterns that make seismic data different from other images, using a deep convolutional generative adversarial network (DCGAN). Using a seismic volume and a set of 2D lines, they made 121,000 sub-images (tiles) for their training set.

The Young And The RasLAS

William Sanger (Schlumberger), Chance Sanger (Museum of Fine Arts, Houston), Diego Castañeda (Agile), Suman Gautam (Schlumberger), Lanre Aboaba (University of Arkansas).

State of the art text detection by Google Cloud Vision API

Tech — Google Cloud Vision API, Python flask web app, Scatteract (sort of). Repo on GitHub.

Digitizing well logs is a common industry task, and current methods require a lot of manual intervention. The team's automated pipeline: convert PDF files to images, perform OCR with Google Cloud Vision API to extract headers and log track labels, pick curves using a CNN in TensorFlow. The team implemented the workflow in a Python flask front-end. Check out their slides.

Hutton Rocks

Kamal Hami-Eddine (Paradigm), Didi Ooi (University of Bristol), James Lowell (GeoTeric), Vikram Sen (Anadarko), Dawn Jobe (Aramco).

hutton.png

Tech — Amazon Echo Dot, Amazon AWS (RDS, Lambda).

The team built Hutton, a cloud-based cognitive assistant for gaining more efficient, better insights from geologic data. The project includes an integrated cloud-hosted database, an interactive web application for uploading new data, and a cognitive assistant for voice queries. Hutton builds upon existing Amazon Alexa skills. Check out their GitHub repo, and slides.

Big data > Big Lore 

Licheng Zhang (CGG), Zhenzhen Zhong (CGG), Justin Gosses (Valador/NASA), Jonathan Parker (Marathon)

The team used machine learning to predict formation tops on wireline logs, which would allow for rapid generation of structure maps for exploration play evaluation, save man hours and assist in difficult formation-top correlations. The team used the AER Athabasca open dataset of 2193 wells (yay, open data!).

Tech — Jupyter Notebooks, SciPy, scikit-learn. Repo on GitHub.

Free near surface

free_surface.png

Tien-Huei Wang, Jing Wu, Clement Zhang (Schlumberger).

Multiples are a kind of undesired seismic signal and take expensive modeling to remove. The project used machine learning to identify multiples in seismic images. They attempted to use GAN frameworks, but found it difficult to formulate their problem, turning instead to the simpler problem of binary classification. Check out their slides.

Tech — CNN... I don't know the framework.

The Cowboyz

Mingliang Liu, Mohit Ayani, Xiaozheng Lang, Wei Wang (University of Wyoming), Vidal Gonzalez (Universidad Simón Bolívar, Venezuela).

A tight group of researchers joined us from the University of Wyoming at Laramie, and snagged one of the most enthusiastic hackers at the event, a student from Venezuela called Vidal. The team attempted acceleration of geostatistical seismic inversion using TensorFlow, a central theme in Mingliang's research.

Tech — TensorFlow.

Augur.ai

Altay Sensal (Geokinetics), Yan Zaretskiy (Aramco), Ben Lasscock (Geokinetics), Colin Sturm (Apache), Brendon Hall (Enthought).

augur.ai.JPG

Electrical submersible pumps (ESPs) are critical components for oil production. When they fail, they can cause significant downtime. Augur.ai provides tools to analyze pump sensor data to predict when pumps are behaving irregularly. Check out their presentation!

Tech — Amazon AWS EC2 and EFS, Plotly Dash, SigOpt, scikit-learn. Repo on GitHub.

disaster_input.png

The Disaster Masters

Joe Kington (Planet), Brendan Sullivan (Chevron), Matthew Bauer (CSM), Michael Harty (Oxy), Johnathan Fry (Chevron)

Hydrologic models predict floodplain flooding, but not local street flooding. Can we predict street flooding from LiDAR elevation data, conditioned with citizen-reported street and house flooding from U-Flood? Maybe! Check out their slides.

Tech — Python geospatial and machine learning stacks: rasterio, shapely, scipy.ndimage, scikit-learn. Repo on GitHub.

The structure does WHAT?!

Chris Ennen (White Oak), Nanne Hemstra (dGB Earth Sciences), Nate Suurmeyer (Shell), Jacob Foshee (Durwella).

Inspired by the concept of an iPhone 'face ageing' app, Nate recruited a team to poke at applying the concept to maps of the subsurface. Think of a simple map of a structural field early in its life, compared to how it looks after years of interpretation and drilling. Maybe we can preview the 'aged' appearance to help plan where best to drill next to reduce uncertainty!

Tech — OpendTect, Azure ML Studio, C#, self-boosting forest cluster. Repo on GitHub.


Thank you!

Massive thanks to our sponsors — including Pioneer Natural Resources — for their part in bringing the event to life! 

sponsors_tight.png

More thank-yous

Apart from the participants themselves, Evan and I benefitted from a team of technical support, mentors, and judges — huge thanks to all these folks:

  • The indefatigable David Holmes from Dell EMC. The man is a legend.
  • Andrea Cortis from Pioneer Natural Resources.
  • Francois Courteille and Issam Said of NVIDIA.
  • Carlos Castro, Sunny Sunkara, Dennis Cherian, Mike Lapidakis, Jit Biswas, and Rohan Mathews of Amazon AWS.
  • Maneesh Bhide and Steven Tartakovsky of SigOpt.
  • Dave Nichols and Aria Abubakar of Schlumberger.
  • Eric Jones from Enthought.
  • Emmanuel Gringarten from Paradigm.
  • Frances Buhay and Brendon Hall for help with catering and logistics.
  • The team at Station for accommodating us.
  • Frank's Pizza, Tacos-a-Go-Go, Cali Sandwich (banh mi), Abby's Cafe (bagels), and Freebird (burritos) for feeding us.

Finally, megathanks to Gram Ganssle, my Undersampled Radio co-host. Stalwart hack supporter and uber-fixer, Gram came over all the way from New Orleans to help teams make sense of deep learning architectures and generally smooth things over. We recorded an episode of UR at the hackathon, talking to Dawn Jobe, Joe Kington, and Colin Sturm about their respective projects. Check it out!


[Update, 29 Sep & 3 Nov] Some statistics from the event:

  • 39 participants, including 7 women (way too few, but better than 4 out of 63 in Paris)
  • 9 students (and 0 professors!).
  • 12 people from petroleum companies.
  • 18 people from service and technology companies, including 5 from Schlumberger!
  • 13 no-shows, not including folk who cancelled ahead of time; a bit frustrating because we had a long wait list.
  • Furthest travelled: James Lowell from Newcastle, UK — 7560 km!
  • 98 tacos, 67 burritos, 96 slices of pizza, 55 kolaches, and an untold number of banh mi.

Newsflash: the Geophysics Hackathon is back!

Mark your calendar: 22–24 September (right before SEG), at a downtown Houston location to be confirmed.

We're filling the room with 50 geoscientists of all stripes. Interpreters, programmers, students, professionals... everyone is welcome. The plan: to imagine, design, and prototype some new tools in geophysics — all around the theme of machine learning. It's going to be awesome. 

The schedule: we'll get started at 6 pm on Friday 22 September, and go till 10 pm. Then we pick it up again on Saturday morning, and go till 6 pm, and the same again on Sunday. Teams will present a demo to everyone on Sunday after 3 pm. There will be a few prizes, a few drinks, lots of food, and a lot of new geophysical tools and widgets. 

If you want to know more about what a hackathon is, read my summary from the last one: Le grand hack! Or check out the project round-up posts, part 1 and part 2.

If you're not sure you belong, I promise that you do. One of the prize-winning teams in Paris had no coding experience! And every team needs help with brainstorming, design, testing, and presentation. Absolutely anyone can contribute, and absolutely everyone will learn something.

If you have some like-minded friends, bring them along! We need teams of 5 people, so if there are already 5 of you, you can start coding as soon as you walk in the door!

If you can't be there yourself, please share this post with someone you know.

When you're ready, click here to buy a ticket.


Thank you as always to our sponsors so far: Dell EMC and Amazon AWS. If you'd like to sponsor the Houston event, please check this page out, or just get in touch.

Fear and loathing in oil & gas

Sometimes you have to swallow your fear. This is one of those times.

The proliferation of 3D seismic in the 1980s was a major step forward for the petroleum industry. However, it took more than a decade for the 3D seismic method to become popular. During that decade, seismic equipment continued to evolve, particularly with the advent of telemetry recording systems that were needed for doing 3D surveys offshore.

Things were never the same again. New businesses sprouted up to support it, and established service companies and tech companies exploded in size in order to keep up with the demand and all the new work.

Not so coincidentally, another major shift happened in the late 1980s and early 1990s with the industry-wide shift to Sun workstations in order to cope with crunching and rendering the overwhelming influx of all these digits. UNIX workstations with hilariously large cathode-ray tube monitors became commonplace. This industry helped make Sun and many other IT companies very wealthy, and once again everything was good. At least until Sun's picnic was trampled on by Linux workstations in the early 2000s, but that's another story...

I think the advent of 3D seismic is one of many examples of the upstream oil and gas industry thriving on technological change. 3D seismic changed everything, facilitating progress in the full sense of the word and we never looked back. As an early career geoscientist, I don't know what the world was like before 3D seismic, but I have interpreted 2D data and I know it's an awful experience — even on a computer.

Debilitating skepticism?

Today, in 2017, we find ourselves in the middle of the next major transformation. Like 3D seismic before it, machine learning will alter yesterday's landscape beyond all recognition. We've been through all of this before, but this time, for some reason it feels different. Many people are cautious, unconvinced about whether this next thing will live up to the hype. Other people are vibrating with excitement viewing the whole thing with rose-coloured glasses. Still others truly believe that it will fail — assertively rejecting hopes and over-excited claims that yes, artificial intelligence will catapult us into a better world, a world beyond our wildest dreams.

A little skepticism is healthy, but I meet a lot of people who are so skeptical about this next period of change that they are ignoring it. It feels to me like an unfair level of dismissal, a too-rigid stance. And it has left me rather perplexed: Why is there so much resistance and denial this time around? Why the apprehension?

I'll wager the reason it is different this time is because this change is happening to us, in spite of us, whether we like it or not. We're not in the driving seat. Most of us aren't even in the passenger seat. Unlike seismic technology and UNIX|Linux workstations, our sector has had little to do with this revolution. We haven't been pushing for it; instead, it is dragging us along with it. Worse, it's happening fast; even the people who are trying to keep up with it can barely hold on. 

We need you

This is the opportunity of a lifetime. It's happening. High time to crank up the excitement, get involved, be a part of it. I for one want you to be part of it. Come along with us. We need you, whether you like it or not. 


This post was provoked by a conversation on LinkedIn.

Subsurface Hackathon project round-up, part 2

Following on from Part 1 yesterday, here are the other seven team projects from the hackathon:


Interactive visualization of Water Table heights over many years.

Water, water everywhere

Water Underground: Martin Bentley (NMMU), Joseph Barraud (Rolls Royce), Rabah Cheknoun (UPPA)

The team built readers for the groundwater data available from dinoloket.nl, both the groundwater levels and the hydrochemistry. They clustered the data by aggregating by month and then looking for similarities in levels in the boreholes, and built an open Jupyter notebook.


  

 

 

Seismic from noise

OBSNoise: Fernando Villanueva-Robles (IPGP), Yann Huet (Setec-Lerm), Ngoc Huyen Luu (Ecole Polytechnique), Dorian Bagur (Telecom ParisTech), Jonathan Grandjean (Independent)

The OBSNoise project investigated the application of machine learning to coherently stack ambient noise records collected from ocean bottom seismic (OBS) arrays in order to extract reservoir information. The team's results from synthetic data showed promise. If fully developed, this technology could be a virtually real-time monitoring system of dynamic reservoir properties.


The Killers. Killing It. 

Global geochemical data analytics

The Killers: Alexandre Sache, Violaine Delahaye, Karl Sache (all from Institute Polytechnique UniLaSalle), Côme Arvis, Guillaume Ligner (Ecole Polytechnique)

Two geoscience undergrads and one automotive design student (I know right?) from UniLaSalle hooked up with two data science students from Ecole Polytechnique to interrogate the massive GeoRoc database using some clever data analytics tricks and did some novel many-dimensional geochemical classifications.


Team LogFix.

Fixing broken well data

LogFix: Guillaume Coffin (Telecom Evolution), Florian Napierala (EISTI), Camille Gimenez (Université Paris-Saclay), Tristan Siméon (Université de Montpellier), Robert Leckenby (Independent)

Truly pristine, calibrated, and corrected petrophysical data is so rare it has a sort of mythical status. Team LogFix used machine learning to identify bad-data zones, repair, QC, and fill in missing sections. They made impressive headway on the problem, using a dataset from the Athabasca region of Canada.


Between the hand-drawn lines

Automagical: Louis Poirier (Independent), Maggie Baber (Independent), Georg Semmler (GiGa infosystems), Björn Wieczoreck (GiGa infosystems), Jonas Kopcsek (GiGa infosystems)

Automagical_Paris_Hackathon.png

You don't need to believe in magic. Team Automagical used machine learning to create 3D geological models from 2D cross-sections. They trained a predictive model using a collection of standardized hand-drawn cross-sections from human geoscientists. The model learns how to propagate rocks throughout a 3D scene. Their goal is to be able to generate cross-sections along any direction through the model. The AI learned how to do geologically realistic interpolation on simple structures. What kind of geologic complexity is possible with more input from more cross-sections?


The document on the left contains a log display with a lithology column. It's a 'hit'. The one on the right has no lithologies and is a 'miss'.

 

There's rocks in them hills! Hills of paper, that is

Logs on the Rocks: Daniel Stanton (Leeds University), Jack Woolam (Leeds University), Adam Goddard (Leeds University), Henri Blondelle (AgileDD)

If the oil and gas industry is to get more efficient, we'd better get really good at finding lithology and fluid information in the mountains of paper we've collectively built. Team Logs on the Rocks used CNNs to identify graphical depictions of rock types in a sea of unstructured PDFs and TIFFs. They introduced themselves as a team of non-coders, but these guys were doing cloud computing on AWS and using NVIDIA's GPUs before the end of the weekend. 


Robot vision for seismic interpretation

It's not our FAULT! Claire Birnie (Leeds University), Carlos Alberto da Costa Filho (Edinburgh University), Matteo Ravasi (Statoil), Filippo Broggini (ETHZ), Gijs Straathof (SGS)

Geologic feature recognition using machine learning. The goal was to assist seismic interpreters in detecting geologic features – faults, folds, traps, etc. – in seismic data. They used Haar cascade classifiers, which are routinely used for identifying faces or kittens or beer bottles in photographs and video streams, specially trained to work on seismic data. They used the awesome OpenCV library to build this technology. At the time of writing, their website appears to be maxed out for the month, so if you're dying to see it, leave them a comment on LinkedIn asking them to increase their capacity. And in the meantime, you can check out their project's repo on GitHub.

Kudos for the open source repo, team!


It was thrilling to see such a large range of data and applications. Digital thin-sections, ground water maps, seismic data, well logs, cross-sections, information in unstructured documents, and so on. Thanks to each and every individual that showed up with their expertise and enthusiasm. We're all better off because of it.

A quick reminder that our sponsors are awesome! Please high-five them next time you meet them...

Subsurface Hackathon project round-up, part 1

The dust has settled from the Hackathon in Paris two weeks ago. Been there, done that, came home with the T-shirt.

In the same random order they presented their 4-minute demos to our panel of esteemed judges, I present a (very) abbreviated round-up of what the teams made together over the course of the weekend. With the exception of a few teams who managed to spontaneously nucleate before the hackathon, most of these teams were comprised of people who had never met each other before the event.

Just let that sink in for a second: teams of mostly mutual strangers built 13 legit machine-learning-based geoscience applications in one weekend. 


Log Healer  

 

 

An automated well log management system

Team Un-well Loggers: James Wanstall (Glencore), Niket Doshi (Teradata), Joseph Taylor (Teradata), Duncan Irving (Teradata), Jane McConnell (Teradata).

Tech: Kylo (NiFi, HDFS, Hive, Spark)

If you're working with well logs, and if you've got lots of them, you've almost certainly got gaps or inaccuracies from curve to curve and from well to well. The team's scalable, automated well-log file management system Log Healer computes missing logs and heals broken ones. Amazing.


An early result from Team Janus. The image on the left is ground truth, that on the right is predicted. Many of the features are present. Not bad for v0.1!

Meaningful cross sections from well logs

Team Janus: Daniel Buse, Johannes Camin, Paul Gabriel, Powei Huang, Fabian Kampe (all from GiGa Infosystems)

The team built an elegant machine learning workflow to attack the very hard problem of creating geologically realistic cross-sections from well logs. The validation algorithm compares pixels to score the result. 


Think Section's mindblowing photomicrograph labeling tool can also make novel camouflage patterns.

Paint-by-numbers on digital thin sections

Team Think Section: Diego Castaneda (Agile*), Brendon Hall (Enthought), Roeland Nieboer (Fugro), Jan Niederau (RWTH Aachen), Simon Virgo (RWTH Aachen)

Tech: Python (Scikit Learn, Scikit Image, Flask, NumPy, SciPy, Pandas), AWS for hosting app & Jupyter server.

Description: Mineral classification and point-counting on thin sections can be an incredibly tedious and time-consuming task. Team Think Section trained a model to segregate, classify, and label mineral grains in 200 GB of high-resolution multi-polarization-angle photomicrographs.


Team Classy's super-impressive shot gather seismic event detection technology. Left: synthetic gather. Middle: predicted labels. Right: truth.

Event detection on seismic shot gathers

Team Classy: Princy Ikotoko Ndong (EOST), Anna Lim (NTNU), Yuriy Ivanov (NTNU), Song Hou (CGG), Justin Gosses (Valador).

Tech: Python (NumPy, Matplotlib), Jupyter notebooks.

The team created an AI which identifies and labels different events on a shot gather image. It can find direct waves, reflections, multiples or coherent noise. It uses a support vector machine for classification, and is simple and fast. 


model2seismic: An entirely new way to do modeling and inversion. Take note: the neural network that made this image knows no physics.

Forward and inverse modeling without the physics

Team GANsters: Lukas Mosser (Imperial), Wouter Kimman (Meridian), Jesper Dramsch (Copenhagen), Alfredo de la Fuente (Wolfram), Steve Purves (Euclidity)

Tech: PyNoddy, homegrown Python ML tools.

The GANsters created a deep-learning image-translation-based seismic inversion and forward modeling system. I urge you to go and look at their project, model2seismic. If it doesn't give you goosebumps, you are geophysically inert.


Team Pick Pick Log

Machine learning for stratigraphic interpretation

Team Pick Pick Log: Antoine Vanbesien (EOST), Fidèle Degni (Mines St-Étienne), Massinissa Mesbahi (Pau), Natsuki Gunji (Mines St-Étienne), Cédric Menut (EOST).

This team of data science and geoscience undergrads attacked an automated stratigraphic interpretation task. They used supervised learning to determine lithology from well logs in Alberta's Athabasca play, then attempted to teach their AI to pick stratigraphic tops. Impressive!


Pretty amazing, huh? The power of the hackathon to bring a project from barely-even-an-idea to actual-working-code is remarkable! And we're not even halfway through the teams: tomorrow I'll describe the other seven projects. 

Machine learning meets seismic interpretation

Agile has been reverberating inside the machine learning echo chamber this past week at EAGE. The hackathon's theme was machine learning, and Monday's workshop was all about machine learning. Matt was also supposed to be co-chairing the session on Applications of machine learning for seismic interpretation with Victor Aarre of Schlumberger, but thanks to a power cut and subsequent rescheduling he found himself double-booked. So, lucky me, he invited me to sit in his stead. Here are my highlights, from the best seat in the house.

Before I begin, I must mention the ambivalence I feel towards the fact that 5 of the 7 talks featured the open-access F3 dataset. A round of applause is certainly due to dGB Earth Sciences for their long-time stewardship of open data. On the other hand, in the sardonic words of my co-chair Victor Aarre, it would have been quite valid if the session were renamed The F3 machine learning session. Is it really the only quality attribute research dataset our industry can muster? Let's do better.

Using seismic texture attributes for salt classification

Ghassan AlRegib ruled the stage throughout the session with not one, not two, but three great talks on behalf of himself and his grad students at Georgia Institute of Technology (rather than being a show of bravado, this was a result of problems with visas). He showed some exciting developments in shallow learning methods for predicting facies in seismic data. In addition to GLCM (grey-level co-occurrence matrix) attributes, he also introduced a couple of new (to me anyway) attributes for salt classification: textural gradient and a thing he called seismic saliency, a metric modeled after the human visual system describing the 'reaction' between relative objects in a 3D scene.

Twelve seismic attributes used for multi-attribute salt-boundary classification. (a) is RMS amplitude, (b) to (m) are textural attributes. See abstract for details. This figure is copyright of Ghassan AlRegib and licensed CC-BY-SA by virtue of being generated from the F3 dataset of dGB and TNO.
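
For anyone who hasn't met GLCM attributes before: a grey-level co-occurrence matrix counts how often each pair of grey levels occurs at a fixed offset in an image, and texture attributes such as contrast are statistics of that matrix. Here is a minimal NumPy sketch on a tiny made-up patch. The function names and the 4-level quantization are my own choices for illustration, not anything taken from the talk.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1)):
    """Count grey-level pairs separated by offset (rows, columns).

    Entry [i, j] is the number of pixels with level i whose neighbour
    at the given offset has level j. The matrix is not symmetrized.
    """
    image = np.asarray(image)
    dr, dc = offset
    rows, cols = image.shape
    matrix = np.zeros((levels, levels), dtype=int)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r, c], image[r2, c2]] += 1
    return matrix


def contrast(matrix):
    """GLCM contrast: the expected squared difference between paired levels."""
    p = matrix / matrix.sum()
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))


# A made-up 4 x 4 patch quantized to 4 grey levels.
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
m = glcm(patch, levels=4)
print(m)
print(contrast(m))
```

On seismic data you would compute this in a sliding window, usually for several offsets, after quantizing amplitudes to a modest number of levels.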

Ghassan also won the speakers' lottery, in a way. Due to the previous day's power outage and subsequent reshuffle, the next speaker in the schedule was a no-show. As a result, Ghassan had an extra 20 minutes to answer questions. Now for most speakers that would be a public-speaking nightmare, but Ghassan hosted the onslaught of inquiring minds beautifully. If we hadn't had to move on to the next talk, I'm sure he could have entertained questions all afternoon. I find it fascinating how unpredictable events like power outages can actually create the conditions for really effective engagement.

Salt classification without using attributes (using deep learning)

Matt reported on Anders Waldeland's work a year ago, and it was interesting to see how his research has progressed, as he nears the completion of his thesis. 

Anders successfully demonstrated how convolutional neural networks (CNNs) can classify salt bodies in seismic datasets. So, is this a big deal? I think it is. Indeed, Anders's work seems like a breakthrough in seismic interpretation, at least of salt bodies. To be clear, I don't think this means that it is time for seismic interpreters to pack up and go home. But maybe we can start looking forward to spending our time doing less tedious things than picking complex salt bodies.

One slice of a 3D seismic volume with two class labels: salt (red) and not salt (green). This is the training data. On the right: extracted 3D salt body in the same dataset, coloured by elevation. Copyright of A Waldeland, used with permission.

He trained a CNN on one manually labeled slice of a 3D cube and used the network to automatically classify the full 3D salt body (on the right in the figure). Conventional algorithms for salt picking, such as that used by AlRegib (see above), typically rely on seismic attributes to define a feature space. This requires professional insight and judgment, and is prone to error and bias. Nicolas Audebert mentioned the same shortcoming in his talk in the workshop Matt wrote about last week. In contrast, the CNN algorithm works directly on the seismic data, learning the most discriminative filters on its own, no attributes needed.

Intuition training

Machine learning isn't just useful for computing in the inverse direction such as with inversion, seismic interpretation, and so on. Johannes Amtmann showed us how machine learning can be useful for ranking the performance of different clustering methods using forward models. It was exciting to see: we need to get back into the habit of forward modeling, each and every one of us. Interpreters build synthetics to hone their seismic intuition. It's time to get insanely good at building forward models for machines, to help them hone theirs. 

There were so many fascinating problems being worked on in this session. It was one of the best half-day sessions of technical content I've ever witnessed at a subsurface conference. Thanks and well done to everyone who presented.


Machine learning and analytics in geoscience

We're at EAGE in Paris. I'm sitting in a corner of the exhibition because the power is out in the main hall, so all the talks for the afternoon have been postponed. The poor EAGE team must be beside themselves; I feel for them. (Note to future event organizers: white boards!)

Yesterday Diego, Evan, and I — along with lots of hackathon participants — were at the Data Science for Geosciences workshop, an all-day machine learning fest. The session was chaired by Cyril Agut (Total), Marianne Cuif-Sjostrand (Total), Florence Delprat-Jannaud (IFPEN), and Noalwenn Dubos-Sallée (IFPEN), and they had assembled a good programme, with quite a bit of variety.

Michel Lutz, Group Data Officer at Total, and adjunct at École des Mines de Saint-Étienne, gave a talk entitled, Data science & application to geosciences: an introduction. It was high-level but thoughtful, and such glimpses into large companies are always interesting. The company seems to have a mature data science strategy, and a well-developed technology stack. Henri Blondelle (AgileDD) asked about open data at the end, and Michel somewhat sidestepped on specifics, but at least conceded that the company could do more in open source code, if not data.

Infrastructure, big data, and IoT

Next we heard a set of talks about the infrastructure aspect of big (really big) data.

Alan Smith of Luchelan told the group about some negative experiences with Hadoop and seismic data (though it didn't seem to me that his problems were insoluble since I know of several projects that use it), and the realization that sometimes you just need fast infrastructure and custom software.

Hadi Jamali-Rad of Shell followed with an IoT story from the field. He had deployed a large number of wireless seismic sensors around a village in Holland, then tested various aspects of the communication system to answer questions like, what's the packet loss rate when you collect data from the nodes? What about from a balloon stationed over the site?

Duncan Irving of Teradata asked, Why aren't we [in geoscience] doing live analytics on 100PB of live data like eBay? His hypothesis is that IT organizations in oil and gas failed to keep up with key developments in data analytics, so now there's a crisis of sorts and we need to change how we handle our processes and culture around big data. 

Machine learning

We shifted gears a bit after lunch. I started with a characteristically meta talk about how I think our community can help ensure that our research and practice in this domain leads to good places as soon as possible. I'll record it and post it soon.

Nicolas Audebert of ONERA/IRISA presented a nice application of a 3D convolutional neural network (CNN) to the segmentation and classification of hyperspectral aerial photography. His images have between about 100 and 400 channels, and he finds that CNNs reduce error rates by up to about 50% (compared to an SVM) on noisy or complex images. 

Henri Blondelle of Agile Data Decisions talked about his experience of the CDA's unstructured data challenge of 2016. About 80% of the dataset is unstructured (e.g. folders of PDFs and TIFFs), and Henri's vision is to transform 80% of that into structured data, using tools like AgileDD's IQC to do OCR and heuristic labeling. 

Irina Emelyanova of CSIRO provided another case study: unsupervised e-facies prediction using various types of clustering, from K-means to some interesting variants of self-organizing maps. It was refreshing to see someone revealing a lot of the details of their implementation.
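
Since implementation details are so rarely shown, here is roughly what the plainest of those methods, K-means, looks like in NumPy. This is a bare-bones sketch on made-up gamma-ray and density values, not Irina's implementation. The function name, the data, and the choice of two clusters are all assumptions for the example.

```python
import numpy as np


def kmeans(data, k, n_iter=20, seed=0):
    """Plain K-means: assign samples to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Start from k distinct samples chosen at random (fancy indexing copies).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every sample to every centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Made-up samples: [gamma ray (API), bulk density (g/cc)].
logs = np.array([[30.0, 2.65], [35.0, 2.63], [32.0, 2.66],
                 [110.0, 2.45], [120.0, 2.40], [115.0, 2.42]])
labels, centroids = kmeans(logs, k=2)
print(labels)
```

In practice you would standardize each log first, otherwise the curve with the largest numeric range (gamma ray here) dominates the distance calculation.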

Jan Limbeck, a research scientist at Shell, wrapped up the session with an overview of Shell's activities around big data and machine learning, as they prepare for exabytes. He mentioned the Mauricio Araya-Polo et al. paper on deep learning in seismic shot gathers in the special March issue of The Leading Edge (clearly it's easiest to talk about things they've already published). He also listed a lot of Shell's machine learning projects (frac optimization, knowledge graphs, reservoir simulation, etc.), but there's no way to know what state they are in or what their chances of success are.

As well as the 9 talks, there were 13 posters, about a third of which were on infrastructure stuff, with the rest providing more case studies. Unfortunately, I didn't get the chance to look at them in any detail, but I appreciated the organizers making time for discussion around the posters. If they'd also allowed more physical space for the discussion it could have been awesome.

Analytics!

After hearing about Mentimeter from Chris Jackson I took the opportunity to try it out on the audience. Here are the results, which I think are fairly self-explanatory...

I also threw in the mindmap I drew at the end as a sort of summary. The vertical axis represents something like 'abstraction' or 'time' (in a workflow sense) and I think each layer depends somewhat on those beneath it. It probably makes sense to no-one but me.

Breakout!

It seems clear that 2017 is the breakout year for machine learning in petroleum geoscience, and in petroleum in general. If your company or institution has not yet gone beyond "watching" or "thinking about" data science and machine learning, then it is falling behind by a little more every day, and it has been for at least a year. Now's the time to choose if you want to be part of what happens next, or a victim of it.

Le grand hack!

It happened! The Subsurface Hackathon drew to a magnificent close on Sunday, in an intoxicating cloud of code, creativity, coffee, and collaboration. It will take some beating.

Nine months in gestation, the hackathon was on a scale we have not attempted before. Total E&P joined us as co-organizers and made this new reach possible. They also let us use their amazing Booster — a sort of intrapreneurship centre — which was perfect for the event. Their team (thanks especially to Marine and Caroline!) did an amazing job of hosting, as well as providing several professionals from their subsurface software (thanks Jonathan and Yannick!) and data science teams (thanks Victor and David!). Arnaud Rodde and Frédéric Broust, who had to do some organization hacking of their own to make something as weird as a hackathon happen, should be proud of their teams.

Instead of trying to describe the indescribable, here are some photos:

BY THE NUMBERS

16 hours of code
13 teams
62 hackers
44 students
4 robots
568 croissants
0 lost-time incidents

I won't say much about the projects for now. The diversity was high — there were projects in thin section photography, 3D geological modeling, document processing, well log prediction, seismic modeling and inversion, and fault detection. All of the projects included some kind of machine learning, and again there was diversity there, including several deep learning applications. Neural networks are back!

Feel the buzz!

If you are curious, Gram and I recorded a quick podcast and interviewed a few of the teams:

It's going to take a few days to decompress and come down from the high. In a couple of weeks I'll tell you more about the projects themselves, and we'll edit the photos and post the best ones to Flickr (and in the meantime there are a few more pics there already). 

Thank you to the sponsors!

Last thing: we couldn't have done any of this without the support of Dell EMC. David Holmes has been a rock for the hackathon project over the last couple of years, and we appreciate his love of community and code! Thank you too to Duncan and Jane at Teradata, Francois at NVIDIA, Peter and Jon at Amazon AWS, and Gram at Sandstone for all your support. Dear reader: please support these organizations!