4 Exciting Data Science Developments at the 2022 Open Data Science Conference

I published this article in Towards Data Science here

Last week, I attended the Open Data Science Conference (ODSC) in Boston. As an early career data scientist, this was an opportunity for me to connect with the community, and I gained from it a palpable sense of excitement and momentum.

ODSC lasted for 4 days and ran both in person and online. Talks were divided into streams including Machine Learning, MLOps, NLP, Biotech and more.

The sheer breadth of topics means that it would be impossible to summarise ODSC completely, so I’m going to cherry-pick some of the most exciting perspectives.

These were:

  1. “Generative models are having a moment” (Hilary Mason)
  2. Large companies are investing in open-source MLOps tools which solve ubiquitous engineering challenges
  3. Data Scientists are interested in communicating to maximise their value in organisations
  4. Productive human-AI collaboration will be possible in the near future (and now!)

1. “Generative models are having a moment” (Hilary Mason)

Hilary Mason from Hidden Door, a video game startup built around AI storytellers, made a strong case for this in her talk. It would be hard not to be aware of this trend, but Hilary connected the dots for me on why, and why now.

Why generative models? Sampling from them lets you create text, images, protein sequences and other useful output. Most famously, GPT-3 came out a couple of years ago and initiated many interesting discussions about machine intelligence. GPT-3 already produces human-like text, and models are only getting better.

Further work on such models is creating new capabilities and opening the door to different kinds of products, including AI-generated art from models such as DALL-E, video games using AI storytellers and even more powerful language models such as PaLM.

DALL-E images generated from the prompt “an ai in the shape of a campfire telling stories to an audience of enthralled forest animals”, created by Dave Orr and shared here

Why now? Training and advancing these models has felt deeply inaccessible to the uninitiated for a long time. It would be easy to think that for the foreseeable future, if you wanted to work on models such as GPT-3, you would need to work at DeepMind or OpenAI.

However, Hilary pointed out that it’s getting easier than ever before to interact with this technology:

  • Platforms like Hugging Face are making it easier than ever to share data and models.
  • Colab and other compute platforms (such as the start-up Saturn Cloud) are making it increasingly easy to access powerful GPUs and TPUs as needed.

If you’d like to experience what Hilary is talking about quickly, go to OpenAI’s website and use the API playground. By providing generative models on demand, OpenAI has demonstrated that these models are established enough to be sold as a utility, just like internet access or electricity!

2. Large companies are investing in open-source MLOps tools which solve ubiquitous engineering challenges

In my experience, MLOps is a huge component of life as a data scientist. At the same time, it is completely under-taught in academic contexts.

MLOps and Data Engineering was a focus area of ODSC. One talk that I attended was especially good: Robert Crowe from Google/Searchlight presenting on TensorFlow Extended (TFX).

TFX is an end-to-end platform for deploying ML pipelines which is used by companies like Spotify, Google (Maps/Gmail), and OpenX.

Robert argued that the motivation for building TFX was much like the motivation for building any software tool. Everyone is facing the same tasks and rewriting the same boilerplate code again and again. This tracked with my own experience and the experience of many attendees I spoke to. Whether we want to admit it or not, lots of repetitive work isn’t automated or abstracted when it should be, so it’s nice that TFX exists!

Robert went on to describe in detail how to develop and use an ML pipeline with TFX; you can find that kind of detail here.

Now, whether TFX is the right MLOps solution can be investigated in much greater detail (such as in this article). I’m just going to say that I find it really exciting to know that many people are facing similar challenges and that large companies are collaborating on developing powerful open-source solutions.

As a final note, Robert suggested the DeepLearning.AI TFX course on Coursera (on which he is an instructor) for those interested in training up in this domain.

3. Data Scientists are interested in communicating to maximise their value in organisations

It would be very easy for a meeting of so many technical people working in such a technical domain to discuss mostly technical topics. This was not the case at ODSC, where a fair number of talks addressed the social and business context of machine learning.

In particular, Mona Khalil’s talk ‘Leveling Up Your Organization’s Capacity for Data-informed Decisions’ was both inspiring and practical. Mona is a data science manager at Greenhouse Software.

Mona started by encouraging attendees to consider the broader context of data within their organisations. My understanding of their thesis was that developing an effective communication strategy while considering pathways to creating value can lead to maximising that value.

Throughout Mona’s presentation, I couldn’t help but think of the Three Ways of DevOps, which include systems thinking (i.e. pathways to value) and an effective communication strategy (amplifying feedback cycles) as key components.

Specific suggestions Mona made that I’d like to highlight include:

  • Auditing your data assets. Know which stakeholders need access to which data.
  • Creating a monthly newsletter to keep your organisation aware of key data points and developments relevant to your teams.
  • Using dashboarding tools to give people across your organisation low-cost visibility of valuable data.

The suggestion that resonated with me the most, however, was to empower learning across your organisation (similar to the Third Way in DevOps). The more people learn about the data that can inform their decisions, the less pressure will exist on data science professionals to be on call for data support.

Mona shared some valuable resources for further detail, including this article on self-service analytics, Shopify’s “Data Science and Engineering Foundations” article and this article on enabling other teams in your organisation with data as a service.

4. Productive human-AI collaboration will be possible in the near future (and now!)

Earlier this year I started using GitHub Copilot, an AI pair programmer whose functions go far beyond sophisticated autocompletion: it converts comments into relatively sophisticated code, writes my unit tests and suggests alternative solutions to complete tasks. If you haven’t already tried it, I highly recommend it.

So when I saw that Padhraic Smyth was giving a talk titled “Overconfidence in Machine Learning: Do Our Models Know What They Don’t Know?”, I was intrigued, but I wasn’t expecting such an in-depth and fascinating presentation on human-AI collaboration.

Smyth began by showing that SOTA (state-of-the-art) models for tasks like image classification can be wrong, and confidently so. He provided examples where genuinely powerful, well-trained models assigned high probabilities to incorrect classes or predictions.

Interestingly, he noted that the literature suggests shallow models tend to be better calibrated, meaning that their stated confidence more closely matches how often they are actually correct. Many attempts have been made to fix overconfidence in deeper models, such as ensembling, Bayesian approaches and label smoothing, but these have achieved only varying degrees of success.
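
To make “calibration” a little more concrete, here is a minimal sketch of one common way to measure it, expected calibration error (ECE). This is my own illustration rather than anything shown in the talk; the function name, the choice of 10 bins and the toy numbers are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences: the model's probability for its predicted class, in [0, 1].
    correct: 1 if that prediction was right, 0 otherwise.
    """
    n = len(confidences)
    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Made-up data: both hypothetical models are right 3 times out of 4.
correct = [1, 1, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99]  # claims 99% but achieves 75%
calibrated = [0.75, 0.75, 0.75, 0.75]     # claims 75% and achieves 75%

ece_overconfident = expected_calibration_error(overconfident, correct)
ece_calibrated = expected_calibration_error(calibrated, correct)
```

Both toy models have the same accuracy, but only the overconfident one ends up with a large ECE, which is the kind of gap Smyth was pointing at.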

At this point, Smyth changed direction, focussing on human-AI complementarity as a solution. The basic idea (I’m probably being overly reductive) is that humans and AI make different kinds of mistakes. That is to say, we might be able to exploit the orthogonality of our predictions and use Bayesian methods to combine human and AI predictions to do better than either alone.

Figure 3 of Smyth’s paper “Bayesian modeling of human–AI complementarity” (PNAS) shows this effect nicely. I’d have included it here but wanted to avoid copyright issues. Have a look at it, and notice that the accuracy gains from the human-AI hybrids were largest where the correlation between human and model predictions was low.

On this image classification task, where human and neural network errors correlated the least, it was possible to use Bayesian modelling to combine predictions and improve results.
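
As a rough illustration of the idea (and emphatically not the model from Smyth’s paper), here is a minimal sketch that treats the classifier’s probabilities as a prior and a single human label as evidence, then applies Bayes’ rule. It assumes human and model errors are independent given the true class; the class names, the human confusion table and every number are made up.

```python
def combine_predictions(model_probs, human_label, human_confusion):
    """Posterior over classes given model probabilities and one human label.

    human_confusion[true_class][label] is the assumed probability that the
    human answers `label` when the truth is `true_class`.
    """
    # Bayes' rule: posterior is proportional to prior times likelihood.
    unnormalised = {
        cls: prob * human_confusion[cls][human_label]
        for cls, prob in model_probs.items()
    }
    total = sum(unnormalised.values())
    return {cls: value / total for cls, value in unnormalised.items()}


# Hypothetical classifier output for one image: it leans towards "cat".
model_probs = {"cat": 0.6, "dog": 0.3, "fox": 0.1}

# Hypothetical human reliability, e.g. estimated from past labelled data.
human_confusion = {
    "cat": {"cat": 0.90, "dog": 0.05, "fox": 0.05},
    "dog": {"cat": 0.10, "dog": 0.80, "fox": 0.10},
    "fox": {"cat": 0.05, "dog": 0.15, "fox": 0.80},
}

# The human says "dog", disagreeing with the model's top choice.
posterior = combine_predictions(model_probs, "dog", human_confusion)
best_guess = max(posterior, key=posterior.get)
```

With these numbers the combined answer flips to “dog”, because this human rarely says “dog” when looking at a cat. The less the two sources’ mistakes overlap, the more a combination like this has to gain.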

Smyth’s insights are very consistent with my own experience. Yes, GitHub Copilot can often be confidently wrong, but it’s wrong in ways that are obvious to me. It’s still immensely useful, and between the two of us we’re writing code faster and with more comprehensive tests than before.

A similar tool I’m eager to use more is the AI research assistant Elicit, which uses fine-tuned GPT-3 language models to assist researchers in evaluating evidence.

Personally, I’m excited to have my own productivity boosted by such tools, and I look forward to seeing further progress in this space and possibly contributing to it.

Final Notes

Attending ODSC East 2022 was an awesome experience and I highly recommend it to anyone working in tech or science.

The bottom line is simple.

Machine Learning and AI capabilities are reaching new heights, augmenting human capabilities, enabling us to democratize information in our organisations and automate more of the boring tasks than ever before.

Hopefully I will see you all at ODSC 2023!

2022-04-29 00:00 +0000