The Real job of a data scientist

On the surface a data scientist does many things. You will be many things at different times. An engineer, an evengelist, a web dev and a database manager. Sometime you will be a SQL monkey, other times a PM. You will build data products and (semi)intelligent systems. You will make machines able to learn and then teach them. You will clean data (so much cleaning).

More and more Data Science roles are split between the Insights Data Scientist who deals with helping stakeholders make rational decisions on real data and the Machine Learning Engineers.

There is an open question about what a data scientist is. Much argument probably comes from the fact that so much of this seems disparate and unrelated.

let me tell you what I think defines the role data scientist.

But I want to talk about Donald Rumsfeld and WMDs

Donald Rumsfeld

Rumsfeld got a lot of crap for this line, but I actually think it is quite profound.

Uncertain Systems Warning System

(adapted from the excellent paper Warning: Physics envy may be hazardous to your wealth)

Low: complete certainty

The system is deterministic. Given initial conditions you can predict the results with perfect certainty for all time. Newtonian Physics is an example. For most levels of approximation, classical physics operates on this level.

You are at a casino full of slot machines. If you insert a coin you always win a prize. Oh wait, that is the coke machine.

Guarded: risk without uncertainty

the System is not completely predictable, but the randomness in the system obeys a known probability distribution.

“This is life in a hypothetical honest casino” where every slot machine tells you the exact odds of a payout. You don’t know that you will win, but you know exactly your risk.

Elevated: fully reducible uncertainty

The system is governed by some statistical distrubution, but we do not get to know the parameters of the model (or even the type of model generating our data). But if we measure the system long enough, we can start to use the data to make statistical inferences about.

This casino is still honest, but they don’t post the true odds on the slot machines.

High: partially reducible uncertainty

The system is more complex than a simple statistical distribution. Either there are hidden variables that you don’t know about/can’t measure, or the system evolves in time, faster than you can get to statistical significance.

Maybe (unknown to you) the odds of the slot machines vary by the age of the user, the number of ciggarette smoked in the area, the phase of the moon, the primeness of the 1000 millisecond columns of the time since 1970 or the price of butter in Munich.

The maddening thing about this level is that you will never get to know exactly what the system is that generates the data. There are methods that can help with making predictions at this level, but there will always be some uncertainty left hanging over the system, only to bite you in the ass when the price of butter in munich collapses.

Severe: irreducible uncertainty

Polite term for total ignorance. This is an extreme class of system, representing unknowable systems.

The slot machines in this casino are able to change thier behavior in response to the model you are currently testing in your mind. There might be other explainations for the behavior of the slot machine, but most are, by definition, unthinkable.

To assign a system to this category is to admit intellectual defeat. The Terrorists have won.

Risk vs Uncertainty

note that in previous examples risk and uncertainty are not the same. They are distinct types of ignorance.

Risk

Uncertainty

Weapons of Math Destruction

I thought that the book Weapons of Math Destruction was over-sensationalistic and reaked of neo-ludditeism, but was a decent collection of case studies about what not to do as a data scientist.

Recidivism Models

Credit models

What the book showed, were not cases of data science run amok, but cases where the ‘data scientists’ involved failed at thier fundemental job. These models are not bad, but the creators of these models are mis-representing the systems they model, to be more certain than is warrented.

What do Data Scientists Own?

In the ‘agile’ business world everyone needs to have ownership of something to ensure it gets the attention it deserves.

  • Marketing Managers own the message sent to a certain group
  • Product Managers own the creation of a product to accomplish a goal
  • UX designers own the user experience
  • Operations owns server uptime and the stability of the infrastructure

What do data scientists own.

Data Scientists own Uncertainty

Depending on how you look at it, I am either fantastic at giving things unsexy names or terrible at giving things good name. My old team at Linkedin worked on projects like LAME, LIAR, LOSER, and Taxman. It is a good thing data scientists have master hype artists like DJ Patel to make our field ‘the sexiest job of the 21st century’ because I am always tempted to do the opposite. So like Marketing Managers or Product Managers, what do Data Scientists manage?

**I think we are Ignorance Managers **

A more politic way of saying that would be ‘we manage uncertainty’

The job of a data scientist

Our job as data scientists is to own uncertainty and risk. We should try to decrease both, but we should never delude ourselves or others that there is less uncertainty than there actually is.

Data scientists need to understand the system they are working with, and be able to realisticly assign it to its proper category, and know the appropriate tools for dealing with a problem in that category, all the while, remembering that unknown unknowns, could elevate the category at any time.I

in roman times, Triumphent Generals were paraded through the streets. It was said that on that day they were gods. But so as to not get too upity, they would have a slave ride along in the chariot and whisper ‘Momento Mori’: Remember you are mortal (good story, probably not true)

Data scientists, we are the modern version of that slave. Momento ignorantia!

Tools for dealing with uncertainty

I won’t be able make a comprehensive list of all the tools a data scientist could use to manage uncertainty. Such a list would certainly not fit in a single post. But I will list a few. In the future I will attempt to frame any method or tool in the context of how it interacts with the various categories of systems.

Written on March 7, 2017