
data jobs, tautologies, bullshit, $$$


(c) Tom Gauld (2014)
When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math. If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics... You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

Karl Broman


what makes data science special and distinct from statistics is that this data product gets incorporated back into the real world, and users interact with that product, and that generates more data: a feedback loop. This is very different from predicting the weather...

– Cathy O'Neil / Rachel Schutt


"Data science" is the latest name for an old pursuit: the attempt to make computers give us new knowledge. * In computing's short history, there have already been about 10 words for this activity (and god knows how many derived job titles). So: here's an anti-bullshit exercise, a genealogy of some very expensive buzzwords.

The following are ordered by the year the hype peaked (as estimated by maximum mentions in books). You can play with proper data here.




  • "Expert systems"
    The original, GOFAI craft. Painstaking, manually-built rule stacks. 69% accuracy in certain medical tasks, which beat out human experts.


  • "Business intelligence"
    The most transparently hokum. I include the rigorous but dead-ended world of MDX and OLAP in here, perhaps unfairly: they're certainly still in use by some organisations who you'd expect to know better.


  • "Data mining". Originally a pejorative among actual statisticians, meaning "looking for fake patterns to proclaim". Now reclaimed in industry and academia.** Compared to ML, data mining carries a lot of corporate dilution, proprietary gremlins and C20th crannies, from what I can tell. (Basically the same as "Knowledge discovery"?)


  • "Predictive analytics". See machine learning, but subtract most of it.


  • "Big data". Somewhat meaningful as a concept, extremely tangible as an engineering challenge, and tied to genuinely new results. But still highly repugnant. Has captured much of the present job market, but the hype train has headed off well and truly.


  • "Machine learning". Applied statistics, but recast by computer scientists into algorithms. Goal: Getting systems that work fast rather than inferring the calibrated convergent truth. Along with stats, ML is the heart of the actual single phenomenon underlying all this money and hype.


  • "Data science". Recent high-profile successes in the AI/ML/DS space are largely due to the data explosion - not to new approaches or smarter protagonists. So this is at least half job title inflation. Still, it is handy to have a job title with enough elbow-room to be statistician and developer and machine teacher at once.
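
For the curious, the ordering above comes down to an argmax per term, then a sort. Here is a minimal sketch in Python, assuming you already have yearly mention frequencies for each term. The function names and the toy numbers are made up for illustration; they are not real book counts and not necessarily how the linked data was produced.

```python
from typing import Dict, List


def peak_year(freq_by_year: Dict[int, float]) -> int:
    """Year with the highest mention frequency (earliest year wins ties)."""
    return min(freq_by_year, key=lambda year: (-freq_by_year[year], year))


def order_by_peak(series: Dict[str, Dict[int, float]]) -> List[str]:
    """Sort terms by the year their mentions peaked."""
    return sorted(series, key=lambda term: peak_year(series[term]))


# Placeholder frequencies, invented purely to exercise the functions.
toy = {
    "term A": {1980: 1.0, 1990: 5.0, 2000: 2.0},
    "term B": {1980: 0.5, 1990: 1.0, 2000: 4.0},
    "term C": {1980: 3.0, 1990: 2.0, 2000: 1.0},
}

print(order_by_peak(toy))
```

Peak mentions is a crude proxy for hype: it ignores how long a term lingered, and terms still climbing will show their "peak" at the last year in the data.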


You might have hoped that nominally scientific minds would shun the proliferation of tautologous or meaningless terms. But stronger pressures prevail - chiefly the need for job security, via bamboozling clients or upper management or tech conference attendees.



##################################################################################################

* As always, we settle for optimal guesses instead of 'knowledge'.

If I'd said "the attempt to get knowledge from data" then of course I would just be describing statistics.^ This near miss doesn't bother me - despite the fact that statisticians computerised before any other non-engineering profession or field - and despite their building much of the theory and even the implementations described in this piece (besides expert systems and GOFAI). Their gigantic century of work is a superset of what I'm talking about.
^ Obviously my initial definition is pretty close to "narrow artificial intelligence" too: at the limit, AI is "building a system for automatically getting knowledge from arbitrary input". Many of the successes described above also belong to them (particularly expert systems and GOFAI). "Data jobs", as I blandly put it in the header, are "jobs dealing with the fact that we don't quite have AI". There are a lot of terrible data jobs, and I'm not talking about them either. The full specification, if you bloody insist, is: "cultures, largely applied or industrial ones, which use cool data processing methods which are not really A.I. in the wide or strong sense, but which aren't standard 70s drone analyst work either. Nor have they anything to do with the very similar work of information physicists or electronic engineers or anything."

(But then all work in applied maths and stats shares a lot, since it's all based on the same world using the same concepts and logics. Only the goals and technologies really vary.)

I'm speaking as generally as I am - that is, almost speaking nonsense - so I can cut through the mire of terms, the effluent of the academic-industrial complex. In intellectual terms, it is pretty easy to refer to all the things I am trying to refer to: they are 'the formal sciences'. But I'm trying to tease out the practitioners, and the way-downstream economics.


** I had been dismissing "data mining" as just a 90s business way of saying "machine learning", but the distinction is actually fairly well-defined:
Data mining: Direct algorithm design for already well-defined goals - where you know what features to use. (e.g. "What kind of language do CVs use?")

Machine learning: Indirect algorithm design, via automated feature engineering, for an ill-defined goal. (e.g. "How do we distinguish a picture of a cat from a picture of a faraway lynx?")


