October 31, 2017

Abstract horror

October 31, 2017/ Matt Hall

This isn't really a horror story, more of a Grimm fairy tale. Still, I thought it worthy of a Hallowe'eny title.

I've been reviewing abstracts for the 2018 AAPG annual convention. It's fun, because you get to read about new research months ahead of the rest of the world. But it's also not fun because... well, most abstracts aren't that great. I have no idea what proportion of abstracts the conference accepts, but I hope it's not too far above about 50%. (There was some speculation at SEG that there are so many talks now — 18 parallel sessions! — because giving a talk is the only way for many people to get permission to travel to it. I hope this isn't true.)

Some of the abstracts were great; at least 1 in 4 was better than 'good'. So what's wrong with the others? Here are the three main issues I saw:

Lots of abstracts were uninteresting.
Even more of them were vague.
Almost all of them were about unreproducible research.

Let's look at each of these in turn and ask what we can do about it.

Uninteresting

Let's face it, not all research is interesting research. That's OK — it might still be useful or otherwise important. I think you can still write an interesting abstract about it. Here are some tips:

Don't be vague! Details are interesting. See the next section.
Break things up a bit. Use at least 2 paragraphs, maybe 3 or 4. Maybe a list or two.
Use natural, everyday language. Try reading your abstract aloud.
In the first sentence, tell me why I should come to your talk or visit your poster.

Vague

I scribbled 'Vague' on nearly every abstract. In almost every case, either the method or the results, and usually both, were described in woolly language. For example (this is not a direct quote, but paraphrased):

Machine learning was used to predict the reservoir quality in most of the wells in the area, using millions of training examples and getting good results. The inputs were wireline log data from nearby wells.

This is useless information — which algorithm? How did you optimize it? How much training data did you have, and how many data instances did you validate against? How many features did you use? What kind of validation did you do, and what scores did you achieve? Which competing methods did you compare with? Use numbers, be specific:

We used a 9-dimensional support vector machine, implemented in scikit-learn, to model the permeability. With over 3 million training examples from logs in 150 nearby wells in the training set, and 1 million in cross-validation, we achieved an F1 score of 0.75 or more in 18 of the 20 wells.

A roughly 50% increase in the number of words, but an ∞% increase in the information content.

Unreproducible

Maybe I'm being unfair on this one, because I can't really tell if something is going to be reproducible or not from an abstract... or can I?

I'd venture to say that, if the formations are called A, B, C, and D, and the wells are called 1, 2, 3, and 4, then I'm pretty sure I'm not going to find out much about your research. (I had a long debate with someone in Houston recently about whether this sort of thing even qualifies as science.)

So what can you do to make a more useful abstract?

Name your methods and algorithms. Where did they come from? Which other work did you build on?
Name the dataset and tell me where it came from. Don't obfuscate the details — they're what make you interesting! Share as much of the data as you can.
Name the software you're using. If it's open source, it's the least you can do. If it's not open source, it's not reproducible, but I'd still like to know how you're doing what you do.

I realize not everyone is in a position to do 100% reproducible research, but you can aim for something over 50%. If your work really is top secret (<50% reproducible), then you might think twice about sharing your work at conferences, since no-one can really learn anything from you. Ask yourself if your paper is really just an advertisement.

So what does a good abstract look like?

Well, I do like this one-word abstract from Gardner & Knopoff (1974), from the Bulletin of the Seismological Society of America:

Is the sequence of earthquakes in Southern California, with aftershocks removed, Poissonian?

Yes.

A classic, but I'm not sure it would get your paper accepted at a conference. I don't collect awesome abstracts — maybe I should — but here are some papers with great abstracts that caught my interest recently:

Dean, T (2017). The seismic signature of rain. Geophysics 82 (5). The title is great too; what curious person could resist this paper?
Durkin, P et al. (2017) on their beautiful McMurray Fm interpretation in JSR 27 (10). It could arguably be improved by a snappier first sentence that gives the punchline of the paper.
Doughty-Jones, G, et al (2017) in AAPG Bulletin 101 (11). There's maybe a bit of an assumption that the reader cares about intraslope minibasins, but the abstract has meat.

Becoming a better abstracter

The number one thing to improve as a writer is probably asking other people — friendly but critical ones — for honest feedback. So start there.

As I mentioned in my post More on brevity way back in March 2011, you should probably read Landes (1966) once every couple of years:

Landes, K (1966). A scrutiny of the abstract II. AAPG Bulletin 50 (9). Available online. (An update to his original 1951 piece, A scrutiny of the abstract, AAPG Bulletin 35, no 7.)

There's also this plea from geophysicist Paul Lowman, to stop turning abstracts into introductions:

Lowman, Paul (1988). The abstract rescrutinized. Geology 16 (12). Available online.

Give those a read — they are very short — and maybe pay extra attention to the next dozen or so abstracts you read. Do they tell you what you need to know? Are they either useful or interesting? Do they paint a vivid picture? Or are they too... abstract?

October 25, 2017

EarthArXiv wants your preprints

October 25, 2017/ Matt Hall

If you're into science, and especially physics, you've heard of arXiv, which has revolutionized how research in physics is shared. bioRxiv, SocArXiv and PaleorXiv followed, among others*.

Well get excited, because today, at last, there is an open preprint server especially for earth science — EarthArXiv has landed!

I could write a long essay about how great this news is, but the best way to get the full story is to listen to two of the founders — Chris Jackson (Imperial College London and fellow University of Manchester alum) and Tom Narock (University of Maryland, Baltimore) — on Undersampled Radio this morning:

Congratulations to Chris and Tom, and everyone involved in EarthArXiv!

Friedrich Hawemann, ETH Zurich, Switzerland
Daniel Ibarra, Earth System Science, Stanford University, USA
Sabine Lengger, University of Plymouth, UK
Angelo Pio Rossi, Jacobs University Bremen, Germany
Divyesh Varade, Indian Institute of Technology Kanpur, India
Chris Waigl, University of Alaska Fairbanks, USA

Sara Bosshart, International Water Association, UK
Alodie Bubeck, University of Leicester, UK
Allison Enright, Rutgers - Newark, USA
Jamie Farquharson, Université de Strasbourg, France
Alfonso Fernandez, Universidad de Concepcion, Chile
Stéphane Girardclos, University of Geneva, Switzerland
Surabhi Gupta, UGC, India

Don't underestimate how important this is for earth science. Indeed, there's another new preprint server coming to the earth sciences in 2018, as the AGU — with Wiley! — prepare to launch ESSOAr. Not as a competitor for EarthArXiv (I hope), but as another piece in the rich open-access ecosystem of reproducible geoscience that's developing. (By the way, AAPG, SEG, SPE: you need to support these initiatives. They want to make your content more relevant and accessible!)

It's very, very exciting to see this new piece of infrastructure for open access publishing. I urge you to join in! You can submit all your published work to EarthArXiv — as long as the journal's policy allows it — so you should make sure your research gets into the hands of the people who need it.

I hope every conference from now on has an EarthArXiv Your Papers party.

* Including snarXiv. Don't miss that one!

October 23, 2017

x lines of Python: load curves from LAS

October 23, 2017/ Matt Hall

Welcome to the latest x lines of Python post, in which we have a crack at some fundamental subsurface workflows... in as few lines of code as possible. Ideally, x < 10.

We've met curves once before in the series — in the machine learning edition, in which we cheated by loading the data from a CSV file. Today, we're going to get it from an LAS file — the popular standard for wireline log data.

Just as we previously used the pandas library to load CSVs, we're going to save ourselves a lot of bother by using an existing library — lasio by Kent Inverarity. Indeed, we'll go even further by also using Agile's library welly, which uses lasio behind the scenes.

The actual data loading is only 1 line of Python, so we have plenty of extra lines to try something more ambitious. Here's what I go over in the Jupyter notebook that goes with this post:

Load an LAS file with lasio.
Look at its header.
Look at its curve data.
Inspect the curves as a pandas DataFrame.
Load the LAS file with welly.
Look at welly's Curve objects.
Plot part of a curve.
Smooth a curve.
Export a set of curves as a matrix.
BONUS: fix some broken things in the file header.

Each one of those steps is a single line of Python. Together, I think they cover many of the things we'd like to do with well data once we get our hands on it. Have a play with the notebook and explore what you can do.

Next time we'll take things a step further and dive into some seismic petrophysics.

October 19, 2017

The norm and simple solutions

October 19, 2017/ Matt Hall

Last time I wrote about different ways of calculating distance in a vector space — say, a two-dimensional Euclidean plane like the streets of Portland, Oregon. I showed three ways to reckon the distance, or norm, between two points (i.e. vectors). As a reminder, using the distance between points u and v on the map below this time:

$$ \|\mathbf{u} - \mathbf{v}\|_1 = |u_x - v_x| + |u_y - v_y| $$

$$ \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{(u_x - v_x)^2 + (u_y - v_y)^2} $$

$$ \|\mathbf{u} - \mathbf{v}\|_\infty = \mathrm{max}(|u_x - v_x|, |u_y - v_y|) $$

Let's think about all the other points on Portland's streets that are the same distance away from u as v is. Again, we have to think about what we mean by distance. If we're walking, or taking a cab, we'll need to think about $\ell_1$ — the sum of the distances in x and y. This is shown on the left-most map, below.

For simplicity, imagine u is the origin, or (0, 0) in Cartesian coordinates. Then v is (0, 4). The sum of the distances is 4. Looking for points with the same sum, we find the pink points on the map.

If we're thinking about how the crow flies, or $\ell_2$ norm, then the middle map sums up the situation: the pink points are all equidistant from u. All good: this is what we usually think of as 'distance'.

The $\ell_\infty$ norm, on the other hand, only cares about the maximum distance in any direction, or the largest absolute element in the vector. So all points whose largest absolute coordinate is 4 meet the criterion: (1, 4), (2, 4), (4, 3) and (4, 0) all work.

You might remember there was also a weird definition for the $\ell_0$ norm, which basically just counts the non-zero elements of the vector. So, again treating u as the origin for simplicity, we're looking for all the points that, like v, have only one non-zero Cartesian coordinate. These points form an upright cross, like a + sign (right).
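
If you'd like to check the four pictures numerically, here's a minimal sketch that lists the street corners at the same distance from u as v is. It assumes u is the origin, v is (0, 4), and a 9 × 9 grid of integer points; the grid size and the helper name `same_distance` are made up for this example.

```python
import numpy as np

# Integer grid of 'street corners' from -4 to 4 in x and y (an assumed grid size).
xs, ys = np.meshgrid(np.arange(-4, 5), np.arange(-4, 5))
points = np.column_stack([xs.ravel(), ys.ravel()])

v = np.array([0, 4])  # u is taken to be the origin

def same_distance(points, v, p):
    """Return the grid points whose p-norm matches the p-norm of v."""
    target = np.linalg.norm(v, p)
    dists = np.linalg.norm(points, p, axis=1)
    return points[np.isclose(dists, target)]

# Number of matching grid points for each way of measuring distance.
counts = {p: len(same_distance(points, v, p)) for p in (0, 1, 2, np.inf)}
print(counts)
```

On a grid this coarse, only the four compass points lie exactly on the crow-flies circle, so the middle map needs a finer grid (or a continuous curve) to look round.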

So there you have it: four ways to draw a circle.

Wait, what?

A circle is just a set of points that are equidistant from the centre. So, depending on how you define distance, the shapes above are all 'circles'. In particular, if we normalize the (u, v) distance as 1, we have the following unit circles:

It turns out we can define any number of norms (if you like the sound of $\ell_{2.4}$ or $\ell_{240}$ or $\ell_{0.024}$...) but most of the time, these will suffice. You can probably imagine the shapes of the unit circles defined by these other norms.
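
If you can't quite imagine them, a rough way to trace the unit circle for any p is to take a direction for every angle and rescale it so that its p-norm is exactly 1. This is only a sketch: the function name `unit_circle` and the one-point-per-degree sampling are my own choices here.

```python
import numpy as np

def unit_circle(p, n=360):
    """Points on the unit 'circle' of the p-norm, one per direction angle."""
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
    directions = np.column_stack([np.cos(theta), np.sin(theta)])
    # Rescale each direction so that its p-norm is exactly 1.
    norms = np.linalg.norm(directions, p, axis=1)
    return directions / norms[:, None]

# With n=360, row 45 is the point along the 45-degree diagonal.
diagonal = {p: unit_circle(p)[45] for p in (0.5, 1, 2, 2.4, np.inf)}
for p, xy in diagonal.items():
    print(p, xy)
```

The diagonal point moves outwards as p grows: it sits inside the diamond for p below 1, on the diamond for p = 1, on the round circle for p = 2, and at the corner of the square for infinite p. For p below 1 the shape is no longer convex, which is one way to see that those cases are not true norms.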

What can we do with this stuff?

Let's think about solving equations. Think about solving this:

$$ x + 2y = 8 $$

I'm sure you can come up with a solution in your head, x = 6 and y = 1 maybe. But one equation and two unknowns means that this problem is underdetermined, and consequently has an infinite number of solutions. The solutions can be visualized geometrically as a line in the Euclidean plane (right).

But let's say I don't want solutions like (3.141590, 2.429205) or (2742, –1367). Let's say I want the simplest solution. What's the simplest solution?

This is a reasonable question, but how we answer it depends how we define 'simple'. One way is to ask for the nearest solution to the origin. Also reasonable... but remember that we have a few different ways to define 'nearest'. Let's start with the everyday definition: the shortest crow-flies distance from the origin. The crow-flies, $\ell_2$ distances all lie on a circle, so you can imagine starting with a tiny circle at the origin, and 'inflating' it until it touches the line $x + 2y - 8 = 0$. This is usually called the minimum norm solution, minimized on $\ell_2$. We can find it in Python like so:

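    import numpy as np
    import numpy.linalg as la

    A = np.array([[1, 2]])  # coefficients of x + 2y = 8
    b = np.array([8])       # right-hand side
    x, *_ = la.lstsq(A, b, rcond=None)  # x is the minimum-norm solution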

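The solution is the vector (1.6, 3.2). You could almost have worked that out in your head, but imagine having 1000 equations to solve and you start to appreciate numpy.linalg. Admittedly, it's even easier in Octave and Julia (the same line runs in MATLAB too, if you must, but for an underdetermined system like this one its backslash returns a basic solution rather than the minimum-norm one):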

    A = [1 2]
    b = [8]
    A \ b
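
For an underdetermined system like this one, where the rows of A are independent, the minimum-norm solution also has a closed form, $\mathbf{x} = A^\mathsf{T}(AA^\mathsf{T})^{-1}\mathbf{b}$, which is what the pseudoinverse computes. Here's a small sketch that checks both against the answer above; the variable names are just for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0]])  # coefficients of x + 2y = 8
b = np.array([8.0])

# Closed form for a full-row-rank, underdetermined system.
x_formula = A.T @ np.linalg.solve(A @ A.T, b)

# The same thing via the Moore-Penrose pseudoinverse.
x_pinv = np.linalg.pinv(A) @ b

print(x_formula, x_pinv)
```

Both routes should agree with the (1.6, 3.2) from `lstsq`, and both satisfy the original equation.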

But remember we have lots of norms. It turns out that minimizing other norms can be really useful. For example, minimizing the $\ell_1$ norm — growing a diamond out from the origin — results in (0, 4). The $\ell_0$ norm gives the same sparse* result. Minimizing the $\ell_\infty$ norm leads to $ x = y = 8/3 \approx 2.67$.
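
You don't need a fancy solver to check these. Here's a brute-force sketch that walks along the line of solutions and picks the one with the smallest norm. The search range and step for t are assumptions made for this example, and I've left out $\ell_0$ because counting exact zeros on a floating-point grid is fragile.

```python
import numpy as np

# Every solution of x + 2y = 8 can be written as (8 - 2t, t).
# The search range and step for t are assumptions made for this sketch.
t = np.linspace(-2, 6, 80001)
solutions = np.column_stack([8 - 2 * t, t])

simplest = {}
for p in (1, 2, np.inf):
    norms = np.linalg.norm(solutions, p, axis=1)
    simplest[p] = solutions[np.argmin(norms)]  # solution with the smallest p-norm

for p, xy in simplest.items():
    print("L_{}: x = {:.3f}, y = {:.3f}".format(p, *xy))
```

A grid search like this only gets you to within the step size, and it won't scale beyond a couple of unknowns, but it makes the 'inflate the shape until it touches the line' picture concrete.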

This was the diagram I wanted to get to when I started with the 'how far away is the supermarket' business. So I think I'll stop now... have fun with Norm!

* I won't get into sparsity now, but it's a big deal. People doing big computations are always looking for sparse representations of things. They use less memory, are less expensive to compute with, and are conceptually 'neater'. Sparsity is really important in compressed sensing, which has been a bit of a buzzword in geophysics lately.

October 05, 2017

The norm: kings, crows and taxicabs

October 05, 2017/ Matt Hall

How far away is the supermarket from your house? There are lots of ways of answering this question:

As the crow flies. This is the green line from $\mathbf{a}$ to $\mathbf{b}$ on the map below.
The 'city block' driving distance. If you live on a grid of streets, all possible routes are the same length — represented by the orange lines on the map below.
In time, not distance. This is usually a more useful answer... but not one we're going to discuss today.

Don't worry about the mathematical notation on this map just yet. The point is that there's more than one way to think about the distance between two points, or indeed any measure of 'size'.

Higher dimensions

The map is obviously two-dimensional, but it's fairly easy to conceive of 'size' in any number of dimensions. This is important, because we often deal with more than the 2 dimensions on a map, or even the 3 dimensions of a seismic stack. For example, we think of raw so-called 3D seismic data as having 5 dimensions (x position, y position, offset, time, and azimuth). We might even formulate a machine learning task with a hundred or more dimensions (or 'features').

Why do we care about measuring distances in high dimensions? When we're dealing with data in these high-dimensional spaces, 'distance' is a useful way to measure the similarity between two points. For example, I might want to select those samples that are close to a particular point of interest. Or, from among the points satisfying some constraint, select the one that's closest to the origin.
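
As a quick illustration of the 'select the close ones' idea, here's a minimal sketch using made-up data: 1000 random samples with 5 features, and a hypothetical point of interest at the origin. It uses the ordinary crow-flies distance, which we'll define properly below.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 5))  # 1000 made-up samples with 5 features
target = np.zeros(5)                  # hypothetical point of interest

# Crow-flies distance from every sample to the target.
dist = np.linalg.norm(samples - target, axis=1)

# Indices of the 10 closest samples, nearest first.
nearest = np.argsort(dist)[:10]

# Alternatively, keep everything within some radius (1.0 is arbitrary).
close = samples[dist < 1.0]
```

Swapping in a different norm changes which samples count as 'close', which is really the point of this post.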

Definitions and nomenclature

We'll define norms in the context of linear algebra, which is the study of vector spaces (think of multi-dimensional 'data spaces' like the 5D space of seismic data). A norm is a function that assigns a positive scalar size to a vector $\mathbf{v}$ , with a size of zero reserved for the zero vector (in the Cartesian plane, the zero vector has coordinates (0, 0) and is usually called the origin). Any norm $\|\mathbf{v}\|$ of this vector satisfies the following conditions:

Absolutely homogeneous. The norm of $\alpha\mathbf{v}$ is equal to $|\alpha|$ times the norm of $\mathbf{v}$.
Subadditive. The norm of $ (\mathbf{u} + \mathbf{v}) $ is less than or equal to the norm of $\mathbf{u}$ plus the norm of $\mathbf{v}$. In other words, the norm satisfies the triangle inequality.
Positive. The first two conditions imply that the norm is non-negative.
Definite. Only the zero vector has a norm of 0.
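
These conditions are easy to spot-check numerically. The sketch below tries them on a pair of random vectors for three common norms; the helper `check_axioms` is made up for this example, and passing on one pair is a sanity check, not a proof.

```python
import numpy as np

def check_axioms(p, u, v, alpha):
    """Spot-check the norm conditions for one pair of vectors."""
    def norm(x):
        return np.linalg.norm(x, p)
    homogeneous = np.isclose(norm(alpha * v), abs(alpha) * norm(v))
    subadditive = norm(u + v) <= norm(u) + norm(v) + 1e-12  # tolerance for rounding
    definite = norm(np.zeros_like(v)) == 0 and norm(v) > 0
    return bool(homogeneous), bool(subadditive), bool(definite)

rng = np.random.default_rng(42)
u, v = rng.normal(size=5), rng.normal(size=5)
results = {p: check_axioms(p, u, v, alpha=-3.5) for p in (1, 2, np.inf)}

# With p = 0.5 the triangle inequality fails for this pair of vectors.
counterexample = check_axioms(0.5, np.array([1.0, 0.0]), np.array([0.0, 1.0]), alpha=2.0)
print(results, counterexample)
```

The last line is there as a contrast: the same formula with p = 0.5 breaks the triangle inequality, so not every 'size' function you can write down is a norm.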

Kings, crows and taxicabs

Let's return to the point about lots of ways to define distance. We'll start with the most familiar definition of distance on a map: the Euclidean distance, aka the $\ell_2$ or $L_2$ norm (confusingly, sometimes the two is written as a superscript), the 2-norm, or sometimes just 'the norm' (who says maths has too much jargon?). This is the 'as-the-crow-flies distance' on the map above, and we can calculate it using Pythagoras:

$$ \|\mathbf{v}\|_2 = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2} $$

You can extend this to an arbitrary number of dimensions: just keep adding the squared elementwise differences. We can also calculate the norm of a single vector in n-space, which is really just the distance between the origin and the vector:

$$ \|\mathbf{u}\|_2 = \sqrt{u_1^2 + u_2^2 + \ldots + u_n^2} = \sqrt{\mathbf{u} \cdot \mathbf{u}} $$

As shown here, the 2-norm of a vector is the square root of its dot product with itself.

So the crow-flies distance is fairly intuitive... what about that awkward city block distance? This is usually referred to as the Manhattan distance, the taxicab distance, the $\ell_1$ or $L_1$ norm, or the 1-norm. As you can see on the map, it's just the sum of the absolute distances in each dimension, x and y in our case:

$$ \|\mathbf{v}\|_1 = |a_x - b_x| + |a_y - b_y| $$

What's this magic number 1 all about? It turns out that the distance metric can be generalized as the so-called p-norm, where p can take any positive value up to infinity. The definition of the p-norm is consistent with the two norms we just met:

$$ \| \mathbf{u} \|_p = \left( \sum_{i=1}^n | u_i | ^p \right)^{1/p} $$

[EDIT, May 2021: This generalized version is sometimes called the Minkowski distance, e.g. in the scipy documentation.]
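
The definition translates almost word for word into NumPy. This is a sketch for finite, positive p only; the function name `p_norm` and the example vector are my own choices.

```python
import numpy as np

def p_norm(u, p):
    """p-norm of a 1D vector, straight from the definition (finite p > 0)."""
    u = np.asarray(u, dtype=float)
    return np.sum(np.abs(u) ** p) ** (1 / p)

v = np.array([-5, -4])  # an illustrative difference vector

print(p_norm(v, 1))  # sum of absolute values
print(p_norm(v, 2))  # Pythagoras
print(p_norm(v, 3))  # something in between 2 and infinity
```

For p = 1 and p = 2 this reduces to the taxicab and crow-flies formulas above, and for any p it should agree with `np.linalg.norm(v, p)`.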

In practice, I've only ever seen p = 1, 2, or infinity (and 0, but we'll get to that). Let's look at the meaning of the $\infty$-norm, aka the $\ell_\infty$ or $L_\infty$ norm, which is sometimes called the Chebyshev distance or chessboard distance (because it defines the minimum number of moves for a king to any given square):

$$ \|\mathbf{v}\|_\infty = \mathrm{max}(|a_x - b_x|, |a_y - b_y|) $$

In other words, the Chebyshev distance is simply the largest absolute element in a given vector. In a nutshell, the infinitieth root of the sum of a bunch of numbers raised to the infinitieth power is the same as the infinitieth root of the largest of those numbers raised to the infinitieth power (infinity is weird like that).
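
You can watch this happen by cranking up p. Here's a small sketch with an illustrative vector whose largest absolute element is 5; I've stopped at p = 64 because very large powers overflow floating-point numbers.

```python
import numpy as np

v = np.array([-5.0, -4.0])  # illustrative vector; largest absolute element is 5

ps = [1, 2, 4, 8, 16, 32, 64]
p_norms = [np.sum(np.abs(v) ** p) ** (1 / p) for p in ps]

for p, value in zip(ps, p_norms):
    print("p = {:>2}: {:.6f}".format(p, value))
print("max |v_i| =", np.max(np.abs(v)))
```

The values shrink steadily towards 5 as p grows: the largest element dominates the sum more and more, until the others stop mattering.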

What about p = 0?

Infinity is weird, but so is zero sometimes. Taking the zeroeth root of a lot of ones doesn't make a lot of sense, so mathematicians often redefine the $\ell_0$ or $L_0$ "norm" (not a true norm) as a simple count of the number of non-zero elements in a vector. In other words, we toss out the 0th root, define $0^0 := 0 $ and do:

$$ \| \mathbf{u} \|_0 = |u_1|^0 + |u_2|^0 + \cdots + |u_n|^0 $$

(Or, if we're thinking about the points $\mathbf{a}$ and $\mathbf{b}$ again, just remember that $\mathbf{v}$ = $\mathbf{a}$ - $\mathbf{b}$.)
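
To see why this one isn't a true norm, here's a tiny sketch with an illustrative vector. Doubling the vector should double a real norm, but the count of non-zero elements doesn't change.

```python
import numpy as np

u = np.array([0.0, 3.0, 0.0, -2.0])  # illustrative vector with two non-zero elements

print(np.count_nonzero(u))      # 2
print(np.count_nonzero(2 * u))  # 2 again: doubling the vector does not double the count
print(np.sum(np.abs(u) ** 0))   # 4.0, because NumPy evaluates 0.0 ** 0 as 1
```

The last line is a reminder that the formula above relies on the special definition of $0^0$; taken literally in NumPy, it counts every element, so use `np.count_nonzero` instead.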

Computing norms

Let's take a quick look at computing the norm of some vectors in Python:

>>> import numpy as np

>>> a = np.array([1, 1])
>>> b = np.array([6, 5])

>>> L_0 = np.count_nonzero(a - b)
>>> print(L_0)
2

>>> L_1 = np.sum(np.abs(a - b))
>>> print(L_1)
9

>>> L_2 = np.sqrt((a - b) @ (a - b))
>>> print(L_2)
6.4031242374328485

>>> L_inf = np.max(np.abs(a - b))
>>> print(L_inf)
5

>>> # Using NumPy's `linalg` module:
>>> import numpy.linalg as la
>>> for p in (0, 1, 2, np.inf):
...     print("L_{} norm = {}".format(p, la.norm(a - b, p)))
...
L_0 norm = 2.0
L_1 norm = 9.0
L_2 norm = 6.4031242374328485
L_inf norm = 5.0

What can we do with all this?

So far, so good. But what's the point of these metrics? How can we use them to solve problems? We'll get into that in a future post, so don't go too far!

For now I'll leave you to play with this little interactive demo of the effect of changing p-norms on a Voronoi triangle tiling — it's by Sarah Greer, a geophysics student at UT Austin.
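
If you'd rather poke at the idea in code, here's a rough sketch of the same kind of experiment (not Sarah's implementation): it labels each cell of a grid with its nearest seed point, using whichever p-norm you like. The seed positions, grid size and function name `voronoi_labels` are all made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
seeds = rng.uniform(0, 1, size=(5, 2))  # 5 made-up seed points in the unit square

# Centres of a 50 x 50 grid of cells covering the unit square.
g = (np.arange(50) + 0.5) / 50
cells = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

def voronoi_labels(cells, seeds, p):
    """Label each cell with the index of its nearest seed under the p-norm."""
    diffs = cells[:, None, :] - seeds[None, :, :]  # shape (n_cells, n_seeds, 2)
    dists = np.linalg.norm(diffs, ord=p, axis=-1)  # shape (n_cells, n_seeds)
    return np.argmin(dists, axis=1)

labels = {p: voronoi_labels(cells, seeds, p) for p in (1, 2, np.inf)}
```

Reshaping any of the label arrays to (50, 50) gives an image you can plot, and comparing the images for different p shows how the cell boundaries shift with the definition of distance.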

UPDATE — The next post is The norm and simple solutions, which looks at how these different norms can be used to solve real-world problems.
