What should national data repositories do?

Right now there's a conference happening in Stavanger, Norway: National Data Repository 2017. My friend David Holmes of Dell EMC, a long-time supporter of Agile's recent hackathons and general geocomputing infrastructure superhero, is there. He's giving a talk, I think, and chairing at least one session. He asked a question today on Software Underground:

If anyone has any thoughts or ideas as to what the regulators should be doing differently now is a good time to speak up :)

My response

For me it's about raising their aspirations. Collectively, they are sitting on one of the most valuable (or invaluable) datasets in the world, comparable to Hubble, or the LHC. Better yet, the data are (in most cases) already open and they actually want to share them. And the community (us) is better tooled than ever, and perhaps also more motivated, to get cracking. So the possibility is there to see a revolution in subsurface science and exploration (in the broadest sense of the word) and my challenge to them is:

Can they now create the conditions for this revolution in earth science?

Some things I think they can do right now:

  • Properly fund the development of an open data platform. I'll expand on this topic below.
  • Don't get too twisted off on formats (go primitive), platforms (pick one), licenses (go generic), and other busy work that committees love to fret over. Articulate some principles (e.g. public first, open source, small footprint, no lock-in, componentize, no single provider, let-users-choose, or what have you), and stay agile. 
  • Lobby NOCs and IOCs hard to embrace integrated and high-quality open data as an advantage that society, as well as industry, can share in. It's an important piece in the challenge we face to modernize the industry. Not so that it can survive for survival's sake, but so that it can serve society for as long as it's needed. 
  • Get involved in the community: open up their processes and collaborate a lot more with the technical societies, for example by showing up and talking about their programs. (How did I not hear about the CDA's unstructured data challenge, a subject I'm very much into, till it was over? How many other potential participants just didn't know about it?)

An open data platform

The key piece here is the open data platform. Here are the features I'd like to see in such a platform:

  • Optimized for users, not the data provider, hosting provider, or system administrator.
  • Clear rights: well-known, documented, obvious, clearly expressed open licenses for re-use.
  • Meaningful levels of access that are free of charge for most users and most use cases.
  • Access for humans (a nice mappy web interface) with no awkward or slow registration processes.
  • Access for machines (a nice API, perhaps even a couple of libraries expressing it).
  • Tools for query, discovery, and retrieval; ideally with user feedback paths ('more like this, less like that').
  • Ways to report, or even fix, problems in the data. This relieves you of "the data's not ready" procrastination.
  • Good documentation of all of this, ideally in a wiki or something that people can improve.
  • Support for a community of users and developers that want to do things with the data.
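
To make the 'access for machines' and 'report problems' items a bit more concrete, here's a minimal sketch of the kind of thing a small client library might let you do. Everything in it is an assumption for illustration only: the `Record` fields, the `search` and `report_issue` functions, the license string, and the made-up records do not describe any existing repository or API, and a real platform would sit behind a web service rather than an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One item in a hypothetical open subsurface data catalogue."""
    uid: str
    kind: str       # e.g. 'well' or 'seismic'
    license: str    # a generic open license identifier
    lon: float
    lat: float
    issues: list = field(default_factory=list)  # user-reported problems


def search(catalogue, kind=None, bbox=None):
    """Return records matching an optional kind and bounding box.

    bbox is (min_lon, min_lat, max_lon, max_lat); None means no filter.
    """
    hits = []
    for rec in catalogue:
        if kind is not None and rec.kind != kind:
            continue
        if bbox is not None:
            min_lon, min_lat, max_lon, max_lat = bbox
            inside = min_lon <= rec.lon <= max_lon and min_lat <= rec.lat <= max_lat
            if not inside:
                continue
        hits.append(rec)
    return hits


def report_issue(record, message):
    """Attach a user-reported problem to a record (the feedback path)."""
    record.issues.append(message)


# Made-up records, purely for illustration.
catalogue = [
    Record('W-001', 'well', 'CC-BY-4.0', 2.5, 58.9),
    Record('W-002', 'well', 'CC-BY-4.0', 5.7, 61.2),
    Record('S-001', 'seismic', 'CC-BY-4.0', 2.8, 59.1),
]

# Find wells inside a bounding box, then flag a problem on the first hit.
wells = search(catalogue, kind='well', bbox=(2.0, 58.0, 3.0, 60.0))
report_issue(wells[0], 'Datum elevation looks wrong')
```

The point of the sketch is only that each record carries its license with it, discovery is a single call, and reporting a problem is as easy as finding the data in the first place.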

Building this platform is not trivial. There's massive file storage, a database back end, a web front end, licensing, and so on. Then there's the community of developers and users to engage and support. It will take years, and never be finished. It sounds hard... but people are doing it. Prototypes for seismic data exist already, and there are countless models in other verticals (just check out the Registry of Research Data Repositories, or look at the list on PLOS).

The contract to build data infrastructure is often awarded to the likes of Schlumberger, Halliburton or CGG. In theory, these companies have the engineering depth to pull it off (though this too is debatable, especially in today's web-first, native-never world). But they completely lack the culture required: there's no corporate understanding of what 'open' means. So the model is broken in subtle but fatal ways and the whole experiment fails.

I'm excited to hear what comes out of this conference. If you're there, please tell!