Looking forward to EAGE

Evan, Diego and I are flying to Paris today for the EAGE Conference and Exhibition. It's exciting. We're excited. 

But the excitement starts before the conference. The Subsurface Hackathon is this weekend!

My diary

Even the hackathon excitement starts before the weekend, because tomorrow, Friday, we're running the hacker's bootcamp — a sort of short course appetizer for the hackathon. We have about 25 geoscientists coming to the Booster TOTAL (an event space at TOTAL's La Défense offices) to get some hands-on practice with Python and the latest in machine learning tools. It's especially exciting because we'll also have engineers from NVIDIA on hand to help with the coaching. The idea is to help people hit the ground running when the hackathon starts on Saturday.

After that, on Saturday and Sunday, it's the hackathon itself. We have no fewer than 60 geoscientists and engineers registered for this breakout event. They're coming to the Booster to work on a wide array of machine learning ideas for the subsurface. It's going to be epic. Next week you can read all about what happened, I promise.

Then on Monday it's the Data Science for Geoscience workshop, at which I'm giving a keynote. Since I'm far from possessing expertise, I'm using it as a chance to get people jazzed about helping make the coming AI revolution in geoscience a positive experience. I'm really looking forward to it.

The conference itself starts on Tuesday. In the afternoon I'm co-chairing a session on machine learning (have you spotted the theme yet?) in seismic interpretation, along with Victor Aare of Schlumberger. It will be awesome to see what kind of progress our community is making in this field — it's fun to imagine what seismic interpretation might be like in a few years. There are so many fascinating problems to work on! Here are the talks in that session:

On Wednesday we'll be taking in some more talks and posters, then in the afternoon I'm reprising my keynote talk at IFPEN, a subsurface research institute in the Bois de Boulogne. I've never been there, although I have met a few IFP scientists before. I'm looking forward to it very much.

It all ends for us on Thursday. Evan and Diego fly home and I'm off to Cambridge (the old one in the fens, not the one in Massachusetts) for a few days with family (and bookshops). Until then, expect much blogging!


Going to EAGE?

If you're reading this and would like to meet up with us at Agile or some of the Software Underground crowd — the friendliest bunch of coding geoscientists you could hope for — let's plan to meet at the end of the workshop, at the workshop location. Look for the Software Underground shirts.

GeoConvention highlights

We were in Calgary last week at the Canada GeoConvention 2017. The quality of the talks seemed more variable than usual but, as usual, there were some gems in there too. Here are our highlights from the technical talks...

Filling in gaps

Mauricio Sacchi (University of Alberta) outlined a new reconstruction method for vector field data. In other words, filling in gaps in multi-component seismic records. I've got a soft spot for Mauricio's relaxed speaking style and the simplicity with which he presents linear algebra, but there are two other reasons that make this talk worthy of a shout out:

  1. He didn't just show equations in his talk, he used pseudocode to show the algorithm.
  2. He linked to his lab's seismic processing toolkit, SeismicJulia, on GitHub.

I am sure he'd be the first to admit that it is early days for this library and it is very much under construction. But what isn't? All the more reason to showcase it openly. We all need a lot more of that.

Update on 2017-06-07 13:45 by Evan Bianco: Mauricio has posted the slides from his talk.

Learning about errors

Anton Biryukov (University of Calgary & graduate intern at Nexen) gave a great talk in the induced seismicity session. It was a lovely mashing-together of three of our favourite topics: seismology, machine learning, and uncertainty. Anton is researching how to improve microseismic and earthquake event detection by framing it as a machine-learning classification problem. He's using Monte Carlo methods to compute myriad synthetic seismic events by making small velocity variations, and then using those synthetic events to teach a model how to be more accurate about locating earthquakes.

Figure 2 from Anton Biryukov's abstract. An illustration of the signal classification concept. The signals originating from the locations on the grid (a) are then transformed into a feature space and labeled by the class containing the event origin. From Biryukov (2017). Event origin depth uncertainty - estimation and mitigation using waveform similarity. Canada GeoConvention, May 2017.
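To give a flavour of the Monte Carlo idea, here is a toy sketch. It is not Anton's method: it assumes a constant-velocity medium, straight rays, and made-up source and receiver positions, and every name in it is hypothetical. It just shows how small random velocity variations can turn a few candidate source locations into many labelled synthetic events.

```python
import math
import random

def synthetic_travel_times(source, receivers, velocity):
    """Straight-ray travel times (s) from a source to each receiver,
    assuming a constant-velocity medium (a deliberate simplification)."""
    return [math.dist(source, r) / velocity for r in receivers]

def monte_carlo_events(sources, receivers, v_mean=3000.0, v_std=100.0,
                       n_draws=50, seed=0):
    """Draw a perturbed velocity for each synthetic event and record its
    travel times (features) with the index of its source location (label)."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, source in enumerate(sources):
        for _ in range(n_draws):
            v = rng.gauss(v_mean, v_std)  # small velocity variation, m/s
            features.append(synthetic_travel_times(source, receivers, v))
            labels.append(label)
    return features, labels

# Hypothetical geometry in metres: (x, depth).
receivers = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 0.0)]
sources = [(500.0, 1000.0), (1500.0, 2000.0)]

X_syn, y_syn = monte_carlo_events(sources, receivers)
print(len(X_syn), "synthetic events,", len(X_syn[0]), "travel times each")
```

The feature lists and labels could then be handed to any classifier; the spread in travel times for each label is what carries the velocity uncertainty.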

The bright lights of geothermal energy
Matt Hall

Two interesting sessions clashed on Wednesday afternoon. I started off in the Value of Geophysics panel discussion, but left after James Lamb's report from the mysterious Chief Geophysicists' Forum. I had long wondered what went on in that secretive organization; it turns out they mostly worry about how to make important people like your CEO think geophysics is awesome. But the large room was a little dark, and — in keeping with the conference in general — so was the mood.

Feeling a little down, I went along to the Diversification of the Energy Industry session instead. The contrast was abrupt and profound. The bright room was totally packed with a conspicuously young audience numbering well over 100. The mood was hopeful, exuberant even. People were laughing, but not wistfully or ironically. I think I saw a rainbow over the stage.

If you missed this uplifting session but are interested in contributing to Canada's geothermal energy scene, which will certainly need geoscientists and reservoir engineers if it's going to get anywhere, there are plenty of ways to find out more or get involved. Start at cangea.ca and follow your nose.

We'll be writing more about the geothermal scene — and some of the other themes in this post — so stay tuned. 



Hard things that look easy

After working on a few data science (aka data analytics aka machine learning) problems with geoscientific data, I think we've figured out the 10-step workflow. I'm happy to share it with you now:

  1. Look at all these cool problems, machine learning can solve all of these! I just need to figure out which model to use, parameterize it, and IT'S GONNA BE AWESOME, WE'LL BE RICH. Let's just have a quick look at the data...
  2. Oh, there's no data.
  3. Three months later: we have data! Oh, the data's a bit messy.
  4. Six months later: wow, cleaning the data is gross and/or impossible. I hate my life.
  5. Finally, nice clean data. Now, which model do I choose? How do I set parameters? At least you expected these problems. These are well-known problems.
  6. Wait, maybe there are physical laws governing this natural system... oh well, the model will learn them.
  7. Hmm, the results are so-so. I guess it's harder to make predictions than I thought it would be.
  8. Six months later: OK, this sort of works. And people think it sounds cool. They just need a quick explanation.
  9. No-one understands what I've done.
  10. Where is everybody?

I'm being facetious of course, but only a bit. Modeling natural systems is really hard. Much harder for the earth than for, say, the human body, which is extremely well-known and readily available for inspection. Even the weather is comparatively easy.

Coupled with the extreme difficulty of the problem, we have a challenging data environment. Proprietary, heterogeneous, poor quality, lost, non-digital... There are lots of ways the data goblins can poop on the playground of machine learning.

If the machine learning lark is so hard, why not just leave it to non-artificial intelligence, in other words humans? We already learned how to interpret data, right? We know the model takes years to train. Of course, but I don't accept that we couldn't use some of the features of intelligently applied big data analytics: objectivity, transparency, repeatability (by me), reproducibility (by others), massive scale, high speed... maybe even error tolerance and improved decisions, but those seem far off right now.

I also believe that AI models, like any software, can encode the wisdom of professionals — before they retire. This seems urgent, as the long-touted Great Crew Change is finally underway.

What will we work on?

There are lots of fascinating and tractable problems for machine learning to attack in geoscience (I hope many of them get attacked at the hackathon in June) and the next 2 to 3 years are going to be very exciting. There will be the usual marketing mêlée to wade through, but it's up to the community of scientists and data analysts to push their way through that with real results based on open data and, ideally, with open code.

To be sure, this is already happening: we've had over 25 entrants publishing their solutions to the SEG machine learning contest, and there will be more like this. It's the only way to build transparent problem-solving systems that we can all participate in and, ultimately, trust.

What machine learning problems are most pressing in geoscience?
I'm collecting ideas for projects to tackle in the hackathon. Please visit this Tricider question and contribute your comments, opinions, or ideas of your own. Help the community work on the problems you care about.

x lines of Python: machine learning

You might have noticed that our web address has changed to agilescientific.com, reflecting our continuing journey as a company. Links and emails to agilegeoscience.com will redirect for the foreseeable future, but if you have bookmarks or other links, you might want to change them. If you find anything that's broken, we'd love it if you could let us know.


Artificial intelligence in 10 lines of Python? Is this really the world we live in? Yes. Yes it is.

After reminding you about the SEG machine learning contest just before Christmas, I thought I could show you how you train a model in a supervised learning problem, then use it to make predictions on unseen data. So we'll just break a simple contest entry down into ten easy steps (note that you could do this with any dataset; it doesn't have to be this problem).

A machine learning primer

Before we start, let's review quickly what a machine learning problem looks like, and introduce a bit of jargon. To begin, we have a dataset (e.g. the 'Old' well in the diagram below). This consists of records, called instances. In this problem, each instance is a depth location. Each instance is a feature vector: a row vector comprising attributes or features, which in our case are wireline log values for GR, ILD, and so on. Each feature vector is a row in a matrix we conventionally call \(X\). Associated with each instance is some target label (the thing we want to predict), which is a continuous quantity in a regression problem, discrete in a classification problem. The vector of labels is usually called \(y\). In the problem below, the labels are integers representing 9 different facies.

You can read much more about the dataset I'm using in Brendon Hall's tutorial (The Leading Edge, October 2016).

The ten steps to glory

Well, maybe not glory, but something. A prediction of facies at two wells, based on measurements made at 10 other wells. You can follow along in the notebook, but all the highlights are included here. We start by loading the data into a 'dataframe', which you can think of like a spreadsheet:

Now we specify the features we want to use, and make the matrix \(X\) and label vector \(y\):

  features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']
  X = df[features].values
  y = df.Facies.values

Since this dataset is all we have, we'd like to set aside some data to test our model on. The library we're using, scikit-learn, has functions to do this sort of thing; by default, it'll split \(X\) and \(y\) into train and test datasets, with 25% of the data going into the test part:

  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y)

Now we're ready to choose a model, instantiate it (with some parameters if we want), and train the model (i.e. 'fit' the data). I am calling the trained model augur, because I like that word.

  from sklearn.ensemble import ExtraTreesClassifier
  model = ExtraTreesClassifier()
  augur = model.fit(X_train, y_train)

Now we're ready to take the part of the dataset we reserved for validation, X_test, and predict its labels. Then we can compare those with the known labels, y_test, to see how well we did:

  y_pred = augur.predict(X_test)

We can get a quick idea of the quality of prediction with sklearn.metrics.accuracy_score(y_test, y_pred), but it's more interesting to look at the classification report, which shows us the precision and recall for each class, along with their harmonic mean, the F1 score:

  from sklearn.metrics import classification_report
  print(classification_report(y_test, y_pred))
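To make the columns of the report concrete, here is a minimal sketch of how precision, recall, F1 and support could be computed by hand for a single class. The function name and the tiny label lists are made up for illustration; scikit-learn does all of this for you in the report.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
    fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of true instances of this class
    return precision, recall, f1, support

# Hypothetical facies labels, purely for illustration.
y_true_demo = [1, 1, 1, 2, 2, 3]
y_pred_demo = [1, 1, 2, 2, 2, 3]

p_demo, r_demo, f1_demo, n_demo = precision_recall_f1(y_true_demo, y_pred_demo, label=2)
print(p_demo, r_demo, f1_demo, n_demo)
```

For facies 2 in this toy example, every true instance is found (recall of 1) but one of the three predictions is wrong (precision of 2/3), so the F1 score lands between the two.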

Each row is a facies (facies 1, facies 2, etc.). The support is the number of instances representing that label. The key number here is 0.63 — we can regard this as an expression of the accuracy of our prediction. If that sounds low to you, I encourage you to enter the machine learning contest! If it sounds high, that's because it is — it's much too high. In fact, the instances of our dataset are not independent: they are spatially correlated (in depth). It would be smarter not to remove some random samples for validation, but to reserve entire wells. After all, this is how we typically collect subsurface data: one well at a time.
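Reserving entire wells might look something like the sketch below. It uses plain Python on a handful of made-up records; the `well` key, the record values and the `split_by_well` helper are assumptions for illustration, not part of the contest notebook. scikit-learn also has group-aware splitters (for example `GroupShuffleSplit` and `LeaveOneGroupOut` in `sklearn.model_selection`) that do a similar job on arrays.

```python
def split_by_well(rows, test_wells):
    """Split records so that every instance from a held-out well goes to the
    test set, instead of scattering neighbouring depths across both sets."""
    train = [row for row in rows if row["well"] not in test_wells]
    test = [row for row in rows if row["well"] in test_wells]
    return train, test

# Hypothetical records: one dict per depth sample.
rows = [
    {"well": "A", "depth": 1000.0, "GR": 65.0, "facies": 3},
    {"well": "A", "depth": 1000.5, "GR": 66.2, "facies": 3},
    {"well": "B", "depth": 980.0, "GR": 40.1, "facies": 5},
    {"well": "B", "depth": 980.5, "GR": 41.0, "facies": 5},
    {"well": "C", "depth": 1010.0, "GR": 88.3, "facies": 2},
    {"well": "C", "depth": 1010.5, "GR": 87.9, "facies": 2},
]

train_rows, test_rows = split_by_well(rows, test_wells={"C"})
print(len(train_rows), "training instances,", len(test_rows), "test instances")
```

Because no depth sample from well C is seen during training, the score on the test rows is a fairer guess at how the model would do on a genuinely new well.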

But now we're getting into the weeds of data science. I'll let you venture in there on your own...

SEG machine learning contest: there's still time

Have you been looking for an excuse to find out what machine learning is all about? Or maybe learn a bit of the Python programming language? If so, you need to check out Brendon Hall's tutorial in the October issue of The Leading Edge. Entitled "Facies classification using machine learning", it's a walk-through of a basic statistical learning workflow, applied to a small dataset from the Hugoton gas field in Kansas, USA.

But it was also the launch of a strictly fun contest to see who can get the best prediction from the available data. The rules are spelled out in the contest's README, but in a nutshell, you can use any reproducible workflow you like in Python, R, Julia or Lua, and you must disclose the complete workflow. The idea is that contestants can learn from each other.

Left: crossplots and histograms of wireline log data, coloured by facies — the idea is to highlight possible data issues, such as highly correlated features. Right: true facies (left) and predicted facies (right) in a validation plot. See the rest of the paper for details.

What's it all about?

The task at hand is to predict sedimentological facies from well logs. Such log-derived facies are sometimes called e-facies. This is a familiar task to many development geoscientists, and there are many, many ways to go about it. In the article, Brendon trains a support vector machine to discriminate between facies. It does a fair job, but the accuracy of the result is less than 50%. The challenge of the contest is to do better.

Indeed, people have already done better; here are the current standings:

Rank  Team         F1     Algorithm      Language  Solution
1     gccrowther   0.580  Random forest  Python    Notebook
2     LA_Team      0.568  DNN            Python    Notebook
3     gganssle     0.561  DNN            Lua       Notebook
4     MandMs       0.552  SVM            Python    Notebook
5     thanish      0.551  Random forest  R         Notebook
6     geoLEARN     0.530  Random forest  Python    Notebook
7     CannedGeo    0.512  SVM            Python    Notebook
8     BrendonHall  0.412  SVM            Python    Initial score in article

As you can see, DNNs (deep neural networks) are, in keeping with the amazing recent advances in the problem-solving capability of this technology, doing very well on this task. Of the 'shallow' methods, random forests are quite prominent, and indeed are a great first-stop for classification problems as they tend to do quite well with little tuning.

How do I enter?

There are still over 6 weeks to enter: you have until 31 January. There is a little overhead (you need to learn a bit about git and GitHub, there's some programming, and of course machine learning is a massive field to get up to speed on) but don't be discouraged. The very first entry was from Bryan Page, a self-described non-programmer who dusted off some basic skills to improve on Brendon's notebook. And you can run the notebook right here in mybinder.org (if it's up today; it's been a bit flaky lately) and play around with a few parameters yourself.

The contest aspect is definitely low-key. There's no money on the line — just a goody bag of fun prizes and a shedload of kudos that will surely get the winners into some awesome geophysics parties. My hope is that it will encourage you (yes, you) to have fun playing with data and code, trying to do that magical thing: predict geology from geophysical data.


Reference

Hall, B (2016). Facies classification using machine learning. The Leading Edge 35 (10), 906–909. doi: 10.1190/tle35100906.1. (This paper is open access: you don't have to be an SEG member to read it.)

Le meilleur hackathon du monde


Hackathons are short bursts of creative energy, making things that may or may not turn out to be useful. In general, people work in small teams on new projects with no prior planning. The goal is to find a great idea, then manifest that idea as something that (barely) works, but might not do very much, then show it to other people.

Hackathons are intellectually and professionally invigorating. In my opinion, there's no better team-building, networking, or learning event.

The next event will be 10 & 11 June 2017, right before the EAGE Conference & Exhibition in Paris. I hope you can come.

The theme for this event will be machine learning. We had the same theme in New Orleans in 2015, but suffered a bit from a lack of data. This time we will have a collection of open datasets for participants to build off, and we'll prime hackers with a data-and-skills bootcamp on Friday 9 June. We did this once before in Calgary – it was a lot of fun. 

Can you help?

It's my goal to get 52 participants to this edition of the event. But I'll need your help to get there. Please share this post with any friends or colleagues you think might be up for a weekend of messing about with geoscience data and ideas. 

Other than participants, the other thing we always need is sponsors. So far we have three organizations sponsoring the event — Dell EMC is stepping up once again, thanks to the unstoppable David Holmes and his team. And we welcome Sandstone — thank you to Graham Ganssle, my Undersampled Radio co-host, who I did not coerce in any way.


If your organization might be awesome enough to help make amazing things happen in our community, I'd love to hear from you. There's info for sponsors here.

If you're still unsure what a hackathon is, or what's so great about them, check out my November article in the Recorder (Hall 2015, CSEG Recorder, vol 40, no 9).

Tune in to Undersampled Radio

Back in the summer I mentioned Undersampled Radio, the world's newest podcast about geoscience. Well, geoscience and computers. OK, machine learning and geoscience. And conferences.

We're now 25 shows in, having started with Episode 0 on 28 January. The show is hosted by Graham 'Gram' Ganssle, a consulting and research geophysicist based in New Orleans, and me. Appropriately enough, I met Gram at the machine-learning-themed hackathon we did at SEG in 2015. He was also a big help with the local knowledge.

I broadcast from one of the phone rooms at The HUB South Shore. Gram has the luxury of a substantial book-lined office, which I imagine has ample views of paddle-steamers lolling on the Mississippi (but I actually have no idea where it is). 

To get an idea of what we chat about, check out the guests on some recent episodes:

Better than cable

The podcast is more than just a podcast: it's really a live TV show, broadcasting on YouTube Live. You can catch the action while it's happening on the Undersampled Radio channel. However, it's not easy to catch live because the episodes are not that predictable: they are announced about 24 hours in advance on the Software Underground Slack group (you are in there, right?). We should try to put them out on the @undrsmpldrdio Twitter feed too...

So, go ahead and watch the very latest episode, recorded last Thursday. We spoke to Tim Hopper, a data scientist in Raleigh, NC, who works at Distil Networks, a cybersecurity firm. It turns out that using machine learning to filter web traffic has some features in common with computational geophysics...

You can subscribe to the show in iTunes or Google Play, or anywhere else good podcasts are served. Grab the RSS Feed from the UndersampledRad.io website.

Of course, we take guest requests. Who would you like to hear us talk to? 

Seismic inception

A month ago, some engineers at Google blogged about how they had turned a deep learning network in on itself and produced some fascinating and/or disturbing images:

One of the images produced by the team at Google. CC-BY.

The basic recipe, which Google later open sourced, involves training a deep learning network (basically a multi-layer neural network) on some labeled images, animals maybe, then searching for matching patterns in a target image, like these clouds. If it finds something, it emphasizes it — given the data, it tries to construct an animal. Then do it again.

Or, here's how a Google programmer puts it (one of my favourite sentences ever)...

Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer. 
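In miniature, that gradient ascent might look like the sketch below. It is a toy, not Google's code: the "network" is a single made-up ReLU layer with hypothetical weights, the "image" is a three-element list, and all the function names are mine. The point is only to show the input being nudged, step by step, in the direction that increases the L2 norm of the layer's activations.

```python
def relu_layer(weights, x):
    """Activations of one toy layer: a = max(0, W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def l2_objective(activations):
    """Half the squared L2 norm of the activations."""
    return 0.5 * sum(a * a for a in activations)

def gradient_wrt_input(weights, x):
    """Gradient of the objective with respect to the input: W^T a.
    Units that are switched off have a = 0, so they contribute nothing."""
    a = relu_layer(weights, x)
    return [sum(weights[i][j] * a[i] for i in range(len(weights)))
            for j in range(len(x))]

def dream(weights, x, step=0.1, n_steps=20):
    """Plain gradient ascent on the input, leaving the weights untouched."""
    x = list(x)
    for _ in range(n_steps):
        grad = gradient_wrt_input(weights, x)
        x = [xi + step * gi for xi, gi in zip(x, grad)]
    return x

# Hypothetical weights and starting "image".
toy_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
start = [0.2, 0.1, -0.1]

dreamed = dream(toy_weights, start)
print(l2_objective(relu_layer(toy_weights, start)))
print(l2_objective(relu_layer(toy_weights, dreamed)))
```

The second number printed is larger than the first: the input has been changed to excite the layer more, which is the whole trick. The real thing does this on a deep network and a full image, with various extra steps to keep the result pretty.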

That's all! Anyway, the point is that you get utter weirdness:

OK, cool... what happens if you feed it seismic?

That was my first thought, I'm sure it was yours too. The second thing I thought, and the third, and the fourth, was: wow, this software is hard to compile. I spent an unreasonable amount of time getting caffe, the Berkeley Vision & Learning Centre's deep learning software, working. But on Friday I cracked it, so today I got to satisfy my curiosity.

The short answer is: reptiles. These weirdos were 8 levels down, which takes about 20 minutes to reach on my iMac.

Seismic data from the Virtual Seismic Atlas, courtesy of Fugro. 

THE DEEPDREAM TREATMENT. Mostly reptiles.

Er, right... what's the point in all this?

That's a good question. It's just a bit of fun really. But it makes you wonder:

  • What if we train the network on seismic facies? I think this could be very interesting.
  • Better yet, what if we train it on geology? Probably spurious: seismic is not geology.
  • Does this mean learning networks are just dumb machines, or can they see more than us? Tough one — human vision is highly fallible. There are endless illusions to prove this. But computers only do what we tell them, at least for now. I think if we're careful what we ask for, we can use these highly non-linear data-crunching algorithms for good.
  • Are we out of a job? Definitely not. How do you think machines will know what to learn? The challenge here is to make this work, and then figure out how it can help change, or at least accelerate, our understanding of the subsurface.

This deep learning stuff — of which the University of Toronto was a major pioneer during its emergence in about 2010 — is part of the machine learning revolution that you are, like it or not, experiencing. It will take time, and it will make awful mistakes, but the indications are that machine learning will eat every analytical method for breakfast. Customer behaviour prediction, computer vision, natural language processing, all this stuff is reeling from the relatively sudden and widespread availability of inexpensive computer intelligence. 

So what are we going to do with that?

Okay, one more. From Paige Bailey's Twitter feed.