August 31, 2017

Attribution is not permission

August 31, 2017/ Matt Hall

This morning a friend of mine, Fernando Enrique Ziegler, a pore pressure researcher and practitioner in Houston, let me know about an "interesting" new book from Elsevier: Practical Solutions to Integrated Oil and Gas Reservoir Analysis, by Enwenode Onajite, a geophysicist in Nigeria... And about 350 other people.

What's interesting about the book is that the majority of the content was not written by Onajite, but was copy-and-pasted from discussions on LinkedIn. A novel way to produce a book, certainly, but is it... legal?

Who owns the content?

Before you read on, you might want to take a quick look at the way the book presents the LinkedIn material. Check it out, then come back here. By the way, if LinkedIn wasn't so damn difficult to search, or if the book included a link or some kind of proper citation of the discussion, I'd show you a conversation in LinkedIn too. But everything is completely untraceable, so I'll leave it as an exercise to the reader.

LinkedIn's User Agreement is crystal clear about the ownership of content its users post there:

[...] you own the content and information that you submit or post to the Services and you are only granting LinkedIn and our affiliates the following non-exclusive license: A worldwide, transferable and sublicensable right to use, copy, modify, distribute, publish, and process, information and content that you provide through our Services [...]

This is a good user agreement [Edit: see UPDATE, below]. It means everything you write on LinkedIn is © You — unless you choose to license it to others, e.g. under the terms of Creative Commons (please do!).

Fernando — whose material was used in the book — tells me that none of the several other authors he has asked gave, or were even asked for, permission to re-use their work. So I think we can say that this book represents a comprehensive infringement of copyright of the respective authors of the discussions on LinkedIn.

Roles and responsibilities

Given the scale of this infringement, I think there's a clear lack of due diligence here on the part of the publisher and the editors. Having said that, while publishers are quick to establish their copyright on the material they publish, I would say that this lack of diligence is fairly normal. Publishers tend to leave this sort of thing to the author, hence the standard "Every effort has been made..." disclaimer you often find in non-fiction books... though not, apparently, in this book (perhaps because zero effort has been made!).

But this defence doesn't wash: Elsevier is the copyright holder here (Onajite signed it over to them, as most authors do), so I think the buck stops with them. Indeed, you can be sure that the company will make most of the money from the sale of this book — the author will be lucky to get 5% of gross sales, so the buck is both figurative and literal.

Incidentally, in Agile's publishing house, Agile Libre, authors retain copyright, but we take on the responsibility (and cost!) of seeking permissions for re-use. We do this because I consider it to be our reputation at stake, as much as the author's.

OK, so we should blame Elsevier for this book. Could Elsevier argue that it's really no different from quoting from a published research paper, say? Few researchers ask publishers or authors if they can do this — especially in the classroom, "for educational purposes", as if it is somehow exempt from copyright rules (it isn't). It's just part of the culture — an extension of the uneducated (uninterested?) attitude towards copyright that prevails in academia and industry. Until someone infringes your copyright, at least.

Seek permission not forgiveness

I notice that in the Acknowledgments section of the book, Onajite does what many people do: he gives acknowledgement ("for their contributions", he doesn't say they were unwitting) to some of the authors of the content. Asking for forgiveness, as it were (but not really). He lists the rest at the back. It's normal to see this sort of casual hat tip in presentations at conferences: someone shows an unlicensed image they got from Google Images, slaps "Courtesy of A Scientist" or a URL at the bottom, and calls it a day. It isn't good enough: attribution is not permission. The word "courtesy" implies that you had some.

Indeed, most of the figures in Onajite's book seem to have been procured from elsewhere, with "Courtesy ExxonMobil" or whatever passing as a pseudolicense. If I was a gambler, I would bet that the large majority were used without permission.

OK, you're thinking, where's this going? Is it just a rant? Here's the bottom line:

The only courteous, professional and, yes, legal way to re-use copyrighted material — which is "anything someone created", more or less — is to seek written permission. It's that simple.

A bit of a hassle? Indeed it is. Time-consuming? Yep. The good news is that you'll usually get a "Sure! Thanks for asking". I can count on one hand the number of times I've been refused.

The only exceptions to the rule are when:

The copyrighted material already carries a license for re-use (as Agile does — read the footer on this page).
The copyright owner explicitly allows re-use in their terms and conditions (for example, allowing the re-publication of single figures, as some journals do).
The law allows for some kind of fair use, e.g. for the purposes of criticism.

In these cases, you do not need to ask, just be sure to attribute everything diligently.

A new low in scientific publishing?

What now? I believe Elsevier should retract this potentially useful book and begin the long process of asking the 350 authors for permission to re-use the content. But I'm not holding my breath.

By a very rough count of the preview of this $130 volume in Google Books, it looks like the ratio of LinkedIn chat to original text is about 2:1. Whatever the copyright situation, the book is definitely an uninspiring turn for scientific publishing. I hope we don't see more like it, but let's face it: if a massive publishing conglomerate can make $87 (roughly two-thirds of the $130 cover price) from comments on LinkedIn, it's gonna happen.

What do you think about all this? Does it matter? Should Elsevier do something about it? Let us know in the comments.

UPDATE Friday 1 September

Since this is a rather delicate issue, and events are still unfolding, I thought I'd post some updates from Twitter and the comments on this post:

Elsevier is aware of these questions and is looking into it.
Re-read the user agreement quote carefully. As Ronald points out below, I was too hasty — it's really not a good user agreement, LinkedIn have a lot of scope to re-use what you post there.
It turns out that some people were asked for permission, though it seems it was unclear what they were agreeing to. So the author knew that seeking permission was a good idea.
It also turns out that at least one SPE paper was reproduced in the book, in a rather inconspicuous way. I don't know if SPE granted rights for this, but the author at least was not identified.
Some people are throwing the word 'plagiarism' around, which is rather a serious word. I'm personally willing to ascribe it to 'normal industry practices' and sloppy editing and reviewing (the book was apparently reviewed by no fewer than 5 people!). And, at least in the case of the LinkedIn content, proper attribution was made. For me, this is more about honesty, quality, and value in scientific publishing than about misconduct per se.
It's worth reading the comments on this post. People are raising good points.

Part of the thumbnail image was created by Jannoon028 — Freepik.com — and licensed CC-BY.

August 23, 2017

x lines of Python: read and write CSV

August 23, 2017/ Matt Hall

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV is the de facto standard way to store data in text files. CSV files are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.

How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.
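
As a minimal sketch of the two built-in csv approaches in the list above, the snippet below parses a small in-memory CSV. The column names and values are made up for illustration, and io.StringIO simply stands in for an open file.

```python
import csv
import io

# Hypothetical data, standing in for the contents of a CSV file.
text = "Depth [m],Porosity [v/v]\n500,0.21\n510,0.19\n"

# csv.reader yields each row as a list of strings.
rows = list(csv.reader(io.StringIO(text)))
header, data = rows[0], rows[1:]

# csv.DictReader uses the first row as keys and yields one dict per row.
records = list(csv.DictReader(io.StringIO(text)))

print(header)
print(data[0])
print(records[0]["Porosity [v/v]"])
```

With a real file you would pass the object returned by open("myfile.csv", newline="") instead of the StringIO object.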

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Doc is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.
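
Because csv.DictReader returns every value as a string, some conversion is usually needed before doing arithmetic. The sketch below shows one way to do that with only the standard library, again on made-up data. The helper name to_number is hypothetical, and treating an empty cell as None (no data) rather than zero is a choice made for this example.

```python
import csv
import io

def to_number(value):
    """Convert a CSV cell to float; empty cells become None, other text is kept."""
    if value.strip() == "":
        return None
    try:
        return float(value)
    except ValueError:
        return value

# Hypothetical data with one empty porosity cell.
text = "Depth [m],Lithology,Porosity [v/v]\n500,Sandstone,0.21\n510,Shale,\n"

records = [
    {key: to_number(val) for key, val in row.items()}
    for row in csv.DictReader(io.StringIO(text))
]

# Write the records back out; None is written as an empty cell.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(records[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

pandas does this kind of type inference for you when it reads a file, which is part of why it is the easier route when it's available.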

That's all there is to CSV files. Go forth and wield data like a pro!

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)

The thumbnail image is based on the possibly apocryphal Banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

August 15, 2017

Organizing spreadsheets

August 15, 2017/ Matt Hall

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

This spreadsheet has several problems. Among them:

The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, will be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

Raw data. This is important. See below.
Computed columns. There may be good reasons to keep these with the data.
Charts.
'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

No computed fields or plots in the data sheet.
No hidden columns.
No semantic meaning in formatting (e.g. highlighting cells or bolding values).
Headers in the first row, only data in all the other rows.
The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
Zero means zero, empty cell means no data.
Only one unit per column. (You only use SI units, right?)
Attribution! Include a citation or citations for every record.
If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
Check that it turns into a valid CSV so you can use this awesome format.
After all that, here's what we have (click here for the entire sheet):

The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.
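
A few of the data-sheet rules above lend themselves to an automatic check once the sheet has been exported to CSV. The sketch below is one possible way to do it, not a complete validator: the function name check_data_sheet, the messages, and the test data are all invented for illustration. It checks only for unique column names, one type of data per column, and [units] in the header of numeric columns.

```python
import csv
import io
import re

UNITS = re.compile(r"\[[^\[\]]+\]$")  # header ends with "[units]", e.g. "Depth [m]"

def is_number(cell):
    """True if the cell text can be read as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def check_data_sheet(text):
    """Return a list of problems found in CSV text (hypothetical rule subset)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    problems = []
    if len(set(header)) != len(header):
        problems.append("column names are not unique")
    for i, name in enumerate(header):
        # Empty cells mean 'no data', so they are skipped.
        cells = [row[i] for row in data if i < len(row) and row[i].strip()]
        kinds = {is_number(cell) for cell in cells}
        if len(kinds) > 1:
            problems.append(f"column {i} ({name}) mixes text and numbers")
        elif kinds == {True} and not UNITS.search(name):
            problems.append(f"column {i} ({name}) is numeric but has no [units]")
    return problems

good = "Depth [m],Lithology\n500,Sandstone\n510,\n"
bad = "Depth,Depth\n500 m,1\n510,2\n"
print(check_data_sheet(good))
print(check_data_sheet(bad))
```

Note that a cell like 500 m shows up as a mixed column, which is the same symptom you would hit when trying to plot it.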

The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!

August 10, 2017

x lines of Python: read and write a shapefile

August 10, 2017/ Matt Hall

Shapefiles are a sort-of-open format for geospatial vector data. They can encode points, lines, and polygons, plus attributes of those objects, optionally bundled into groups. I say 'sort-of-open' because the format is well-known and widely used, but it is maintained and policed, so to speak, by ESRI, the company behind ArcGIS. It's a slightly weird (annoying) format because 'a shapefile' is actually a collection of files, only one of which is the eponymous SHP file.

Today we're going to read a SHP file, change its Coordinate Reference System (CRS), add a new attribute, and save a new file in two different formats. All in x lines of Python, where x is a small number. To do all this, we need to add a new toolbox to our xlines virtual environment: geopandas, which is a geospatial flavour of the popular data management tool pandas.

Here's the full rundown of the workflow, where each item is a line of Python:

1. Open the shapefile with fiona (i.e. not using geopandas yet).
2. Inspect its contents.
3. Open the shapefile again, this time with geopandas.
4. Inspect the resulting GeoDataFrame in various ways.
5. Check the CRS of the data.
6. Change the CRS of the GeoDataFrame.
7. Compute a new attribute.
8. Write the new shapefile.
9. Write the GeoDataFrame as a GeoJSON file too.

By the way, if you have not come across EPSG codes yet for CRS descriptions, they are the only way to go. This dataset is initially in EPSG 4267 (NAD27 geographic coordinates) but we change it to EPSG 26920 (NAD83 UTM20N projection).

Several bits of our workflow are optional. The core part of the code, items 3, 6, 7, and 8, is just a few lines of Python:

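    import geopandas as gpd
    # Item 3: read the shapefile into a GeoDataFrame.
    gdf = gpd.read_file('data_in.shp')
    # Item 6: reproject to EPSG 26920 (NAD83 UTM20N).
    gdf = gdf.to_crs(epsg=26920)
    # Item 7: compute a new attribute from the water depth column.
    gdf['seafl_twt'] = 2 * 1000 * gdf.Water_Dept / 1485
    # Item 8: write the new shapefile.
    gdf.to_file('data_out.shp')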

That's it!
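
The new attribute in the snippet above looks like a two-way travel time to the seafloor. The post doesn't spell out the units, so the sketch below rests on assumptions: that Water_Dept is in metres, that 1485 is the speed of sound in water in m/s, and that the factor of 1000 converts seconds to milliseconds. The function name seafloor_twt_ms is made up for illustration.

```python
WATER_VELOCITY = 1485.0  # m/s; assumed meaning of the constant in the snippet above

def seafloor_twt_ms(water_depth_m, velocity=WATER_VELOCITY):
    """Two-way travel time to the seafloor, in milliseconds (assumed units)."""
    # Down and back is 2 * depth; dividing by velocity gives seconds;
    # multiplying by 1000 gives milliseconds.
    return 2 * 1000 * water_depth_m / velocity

print(seafloor_twt_ms(1485.0))  # 2000.0
```

Under those assumptions, 1485 m of water corresponds to 2000 ms of two-way time, which is a handy sanity check on the new column.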

As in all these posts, you can follow along with the code in the Jupyter Notebook.

August 01, 2017

Murphy's Law for Excel

August 01, 2017/ Matt Hall

Where would scientists and engineers be without Excel? Far, far behind where they are now, I reckon. Whether it's a quick calculation, or making charts for a thesis, or building elaborate numerical models, Microsoft Excel is there for you. And it has been there for 32 years, since Douglas Klunder — now a lawyer at ACLU — gave it to us (well, some of us: the first version was Mac only!).

We can speculate about reasons for its popularity:

It's relatively easy to use, and most people started long enough ago that they don't have to think too hard about it.
You have access to it, and you know that your collaborators (boss, colleagues, future self) have access to it.
It's flexible enough that it can do almost anything.

Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

For instance, all the computation and graphics for my two 2006 articles on signal processing were done in Excel (plus the FFT add-on). I've seen reservoir simulators, complete with elaborate user interfaces, in Excel. An infinity of business-critical documents are stored in Excel (I just filled out a vendor registration form for a gigantic multinational in an Excel spreadsheet). John Nelson at ESRI made a heatmap in Excel. You can even play Pac Man.

Maybe it's gone too far:

Murphy's law for Excel: what people can make in Excel, people will make in Excel
— Lukas Mosser (@porestar) July 29, 2017

So what's wrong with Excel?

Nothing is wrong with it, but it's not the best tool for every number-crunching task. Why?

Excel files are just that — files. Sometimes you want to do analysis across datasets, and a pool of data (a database) becomes more useful. And sometimes you wish nine different people didn't have nine different versions of your spreadsheet, each emailing their version to nine other people...
The charts are rather clunky and static. They don't do well with large datasets, or with data you'd like to filter or slice dynamically.
In large datasets, scrolling around a spreadsheet gets old pretty quickly.
The tool is so flexible that people get carried away with pretty tables, annotating their sheets in ways that make the printed page look nice, but analysis impossible.

What are the alternatives?

Excel is a wonder-tool, but it's not the only tool. There are alternatives, and you should at least know about them.

For everyday spreadsheeting needs, I now use Google Sheets. Collaboration is built-in. Being able to view and edit a sheet at the same time as someone else is a must-have (probably Office 365 does this now too, so if you're stuck with Excel I urge you to check). Version control, another thing I'm not sure I can live without, is built in. For real nerds, there's even a complete API. I also really like the native 'webbiness' of Google Docs, such as being able to use web API calls natively, for example getting the current CAD–USD exchange rate with GoogleFinance("CURRENCY:CADUSD").

If it's graphical analysis you want, try Tableau or Spotfire. I'm especially looking at you, reservoir engineers — you are seriously missing out if you're stuck in Excel, especially if you have a lot of columns of different types (time series, categories and continuous variables for example). The good news is that the fastest way to get data into Spotfire is... Excel. So it's easy to get started.

If you're gathering information from people, like registering the financial details of vendors for instance, then a web form is your best bet. You can set one up in Google Forms in minutes, and there are lots of similar services. If you want to use your own servers, no problem: any dev worth their wages can throw one together in a few hours.

If you're doing geoscience in Excel, like my 2006 self — filtering logs, or generating synthetics, or computing spectrums — your mind will be blown by spending a few hours learning a programming language. Your first day in Python (or Julia or Octave or R) will change your quantitative life forever.

Excel is great at some things, but for most things, there's a better way. Take some time to explore them the next time you have some slack in your schedule.

References

Hall, M (2006). Resolution and uncertainty in spectral decomposition. First Break 24, December 2006, p 43–47.

Hall, M (2006). Predicting bed thickness with cepstral decomposition. The Leading Edge 25 (2, Special Issue on Spectral Decomposition). doi:10.1190/1.2172313

UPDATE

As a follow-up example, I couldn't resist sharing this recent story about an artist who draws anime characters in Excel.
