10 ways to improve your data store

When I look at the industry's struggle with the data mess, I see a parallel with science's struggle with open data. I've written lots about that before, but the basic idea is simple: scientists need discoverable, accessible, documented, usable data. Does that sound familiar?

I wrote yesterday that I think we have to get away from the idea that we can manage data like we might manage a production line. Instead, we need to think about more organic, flexible strategies that cope with and even thrive on chaos. I like, or liked until yesterday, the word 'curation', because it implies ongoing care and a focus on the future. But my friend Eric Marchand was right in his comment yesterday — the dusty connotation is too strong, and dust is bad for data. I like his supermarket analogy: packaged, categorized items, each with a cost of production and a price. A more lively, slightly chaotic market might match my vision better — multiple vendors maintaining their own realms. One can get carried away with analogies, but I like this better than a library or museum.

The good news is that lots of energetic and cunning people have been working on this idea of open data markets. So there are plenty of strategies we can try, alongside the current strategy of giving giant service companies millions of dollars for their TechCloud® Integrated ProSIGHT™ Data Management Solutions.

Serve your customer:

  • Above all else, build what people need. It's amazing that this needs to be said, but ask almost anyone what they think of IT at their company and you will learn that this is not how it works today. Everything you build should be in response to the business pulling.
  • This means you have to get out of the building and talk to your customers. In person, one-on-one. Watch them use your systems. Listen to them. Respond to them.

Knock down the data walls:

  • Learn and implement open data practices inside the organization. Focus on discoverability, accessibility, and documentation of good-enough data, not on building The One True Database.
  • Encourage and fund open data practices among providers of public data. There is a big role here for our technical societies, I believe, but I don't think they have seen it yet.

I've said it before: hire loads of geeks:

  • The web (well, intranet) is your pipeline. Build and maintain proper machine interfaces (APIs and web APIs) for data. What, you don't know how to do this? I know; it means hiring web-savvy data-obsessed programmers.
  • Bring back the hacker technologists that I think I remember from the nineties. Maybe it's a myth memory, but sprinkled around big companies there used to be super-geeks with degrees in astrophysics, mad UNIX skills, and the Oracle admin password. Now it's all data managers with Petroleum Technology certificates who couldn't write an awk script if your data depended on it (it does). 
  • Institute proper data wrangling and analysis training for scientists. I think this is pretty urgent. Anecdotal evidence: the top data integration tool in our business is PowerPoint... or an Excel chart with two y-axes if we're talking about engineers. (What does E&P mean?)
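
To make 'machine interface' a bit less abstract, here is a minimal sketch of a read-only web API in Python, using only the standard library. Everything specific in it is an assumption made up for illustration: the well names, the fields, the /wells route, and the port. It is not a recommendation of a particular stack, just the smallest thing that lets another program ask for data instead of asking a person.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical records for illustration; in practice these would be read
# from wherever the data already lives.
WELLS = {
    "A-01": {"name": "A-01", "td_m": 2450.0, "status": "producing"},
    "B-07": {"name": "B-07", "td_m": 3120.5, "status": "abandoned"},
}


def route(path):
    """Map a URL path to (HTTP status, JSON-serializable payload)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["wells"]:
        return 200, sorted(WELLS)          # list of well identifiers
    if len(parts) == 2 and parts[0] == "wells" and parts[1] in WELLS:
        return 200, WELLS[parts[1]]        # one well record
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    """Answer GET requests with JSON produced by route()."""

    def do_GET(self):
        status, payload = route(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling serve() and then requesting /wells or /wells/A-01 from a browser or script returns JSON. Keeping the routing in a plain function, separate from the HTTP plumbing, means it can be checked without starting a server.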

Three more things:

  • Let data live where it wants to live (databases, spreadsheets, wikis, SharePoint if you must). Focus on connecting data with APIs and data translators. It's pointless trying to move data to where you want it to be — you're just making it worse. ("Oh, you moved my spreadsheet? Fine, I will copy my spreadsheet.")
  • Get out of the company and find out what other people are doing. Not the other industry people struggling with data — they are just as clueless as we are. Find out what the people who are doing amazing things with data are doing: Google, Twitter, Facebook, data.gov, Wikipedia, Digital Science, The New York Times, The Guardian,... there are so many to choose from. We should invite these people to our conferences; they can help us.
  • If you only do one thing, fix search in your company. Stop tinkering with semantic ontologies and smart catalogs, just buy Google Search Appliance and fix it. You can get this one done by Christmas.
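
As a rough illustration of the 'data translators' idea, here is a small Python sketch that leaves two sources where they are and translates each into one common record shape on the way out. The sources, column names, and field names are all hypothetical stand-ins: a CSV export of someone's spreadsheet with depths in feet, and some database rows with depths in metres.

```python
import csv
import io

FT_TO_M = 0.3048  # exact, by definition of the international foot

# Stand-in for someone's spreadsheet, exported as CSV; depths in feet.
SPREADSHEET = "Well,TD (ft)\nA-01,8038\nB-07,10240\n"

# Stand-in for rows fetched from a database; depths already in metres.
DATABASE_ROWS = [{"well_id": "C-03", "total_depth_m": 1980.0}]


def from_spreadsheet(text):
    """Translate spreadsheet rows to the common record, converting ft to m."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "well": row["Well"],
            "td_m": round(float(row["TD (ft)"]) * FT_TO_M, 1),
            "source": "spreadsheet",
        }


def from_database(rows):
    """Translate database rows to the common record; units already match."""
    for row in rows:
        yield {
            "well": row["well_id"],
            "td_m": row["total_depth_m"],
            "source": "database",
        }


def all_wells():
    """One view over both sources; nothing is moved or copied to disk."""
    yield from from_spreadsheet(SPREADSHEET)
    yield from from_database(DATABASE_ROWS)
```

Iterating over all_wells() gives records with the same keys and units regardless of origin, and each record keeps a note of where it came from. The spreadsheet's owner never has to move it.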

Last thing. If there's one mindset that will really get in the way, it's the project mindset. If we want to go beyond coping with the data mess, far beyond it to thriving on it, then we have to get comfortable with the idea that this is not a project. The word is banned, along with 'initiative', 'governance', and Gantt charts. The requirements you write on the back of a napkin with three colleagues will be just as useful as the ones you get back from three months of focus groups.

No, this is the rest of your career. This is never done: next year there will be better ideas, more flexible libraries, faster hardware, and new needs. It's like getting fit: this ain't an 8-week get-fit program, it's an eternity of crunches.

The photograph of Covent Market in London, Ontario is from Boris Kasimov on Flickr.

Data management fairy tales

On Tuesday I read this refreshing post on LinkedIn by Jeffrey Maskell of Westheimer Energy Consultants. It's a pretty damning assessment of the current state of data management in the petroleum industry:

The fact is that no major technology advances have been seen in the DM sector for some 20 years. The [data management] gap between acquisition and processing/interpretation is now a void and is impacting the industry across the board...

I agree with him. But I don't think he goes far enough on the subject of what we should do about it. Maskell is, I believe, advocating more effort (and more budget) developing what the data management crowd have been pushing for years. In a nutshell:

I agree that standards, process, procedures, workflows, data models are all important; I also agree that DM certification is a long term desirable outcome. 

These words make me sad. I'd go so far as to say that it's the pursuit of these mythical ideas that's brought about today's pitiful scene. If you need proof, just look around you. Go look at your shared drive. Go ask someone for a well file. Go and (a) find then (b) read your IT policies and workflow documents — they're all fairy tales.

Maskell acknowledges at least that these are not enough; he goes on:

However I believe the KEY to achieving a breakthrough is to prove positively that data management can make a difference and that the cost of good quality data management is but a very small price to pay...

No, the key to achieving a breakthrough is a change of plan. Another value of information study just adds to the misery.

Here's what I think: 'data management' is an impossible fiction. A fairy tale.

You can't manage data

I'm talking to you, big-company-data-management-person.

Data is a mess, and it's distributed across your organization (and your partners, and your government, and your data vendors), and it's full of inconsistencies, and people store local copies of everything because of your broken SharePoint permissions, and... data is a mess.

The terrible truth you're fighting is that subsurface data wants to be a mess. Subsurface geoscience is not accounting. It's multi-dimensional. It's interdependent. Some of it is analog. There are dozens, maybe hundreds of formats, many of which are proprietary. Every single thing is unquantifiably uncertain. There are dozens of units. Interpretation generates more data, often iteratively. Your organization won't fund anything to do with IT properly — "We're an oil company, not a technology company!" — but it's OK because VPs only last 2 years. Well, subsurface facts last for decades.

You can't manage data. Try something else.

The principle here is: cope, don't fix.

People earnestly trying to manage data remind me of Yahoo trying to catalog the Internet in 1995. Bizarrely, they're still doing it... for 3 more months anyway. But we all know there's only one way to find things on the web today: search. Search transcends the catalog.

So what transcends data management? I've got my own ideas, but first I really, really want to know what you think. What's one thing we could do — or stop doing — to make things a bit better?

Creeping inefficiency

Dear CIO of a major oil and gas company,

Search—something you take for granted on the Internet—is broken in your company. Ask anyone.

You don't notice, because you don't count the cost of the seconds or minutes lost finding things. And you can't count the cost of the opportunities missed because someone gave up looking. This happens thousands of times a day, by the way.

Here's what people do when they want to find something on your intranet: 

  1. Ask people if they know where it is. (Nobody does.)
  2. Give up.

The good news is that there is a relatively easy way to fix this immediately and forever. Here's how:

  1. Buy Google Search Appliance.

If you don't already have one of these in your server room, then your luck is in. Soon everyone will think you're a hero. At least, they will until they realize there are 31 versions of every file in your organization. At least you'll know where they all are though, right?

You're welcome,
Matt