Data scientists should build reproducible pipelines

Image by macrovector_official on Freepik (https://www.freepik.com/free-vector/industry-background-print_4547715.htm#query=pipeline&position=10&from_view=keyword)

Most ML courses, blogs, articles, tutorials, etc. describe an ML project as consisting of (something like) the following steps:

  1. Extract data
  2. Explore data
  3. Clean data
  4. Engineer features
  5. Train model
  6. Validate model
  7. Release model

Steps 4-6 are repeated until the model is 'good enough' (a business decision, which means it's a product person's call to make).

Once the team has decided that the model is good enough, you hand over the model to the engineering team, who put it into production.

Job done.

What's wrong with this model?

This model of a machine learning project looks eerily similar to the process of releasing a book.

  1. Research book
  2. Write book
  3. Release book

The data-preparation part of the ML workflow (steps 1-3) maps to researching the book, the model-building part (steps 4-6) maps to writing the book (including feedback from an editor), and releasing the model maps to releasing the book.

The target of this process is the book itself. The book is an artefact. Copies of it can be made and distributed. As long as you have intellectual property rights over the book, you can keep profiting from it.

So why is it different for machine learning?

Well, in general, books don't need to be rewritten. At least they don't need to be rewritten much. But machine learning models do need to be rewritten. New data might become available. New techniques might be invented.

Or, the model could suffer from data drift...

What is data drift?

Data drift occurs when something about the underlying data fed into the model no longer resembles the data the model was trained on.

Suppose we're trying to predict the number of ice cream sales a van will make as a function of the weather, location and time of year. Imagine we had five years' worth of data to build the model. Now imagine there's a recession in the sixth year and behaviour changes. The model isn't prepared for this, and predictions will be bad.

The model needs retraining on new data.
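
To make that concrete, here's a minimal sketch of one common way to check for drift: comparing the distribution of a feature in the training data against recent production data with a two-sample Kolmogorov-Smirnov test. (The temperature feature, the numbers and the 0.05 threshold are all illustrative, and scipy is assumed to be installed.)

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(42)

    # Five years of historical temperatures the model was trained on...
    train_temps = rng.normal(loc=22.0, scale=4.0, size=5_000)
    # ...versus recent data where behaviour has shifted.
    live_temps = rng.normal(loc=25.0, scale=4.0, size=500)

    # A small p-value suggests the two samples come from different
    # distributions: a hint that the model may need retraining.
    statistic, p_value = ks_2samp(train_temps, live_temps)
    if p_value < 0.05:
        print(f"Possible drift (KS={statistic:.3f}, p={p_value:.2g}); consider retraining.")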

This is how machine learning is different from writing a book. A book is (relatively) unchanging. With a machine learning model, change is expected.

This makes me want to digress into all kinds of philosophical discussions about the nature of permanence and change...

(resist urge)

Why does this matter?

This matters because, at a later date, somebody will have to retrain the model. This will involve going through the same process you went through the first time, and possibly adding new steps.

That somebody might be you.

Or, it might be somebody else.

Either way, that person must be able to reproduce your original work.

Target the pipeline, not the artefact

This means that the target of your work is not the artefacts. Instead, the target is the ability to reproduce the artefacts. That also includes understanding how those artefacts are created.
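
One low-tech way to move in that direction (a sketch, not a complete solution; the file path and parameters here are hypothetical) is to store, alongside every artefact, enough metadata to recreate it: a hash of the input data, the exact code version and the parameters used.

    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def run_fingerprint(data_path: str, params: dict) -> dict:
        """Capture enough metadata to recreate a training run later."""
        data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
        git_commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        return {"data_sha256": data_hash, "git_commit": git_commit, "params": params}

    # Save this next to the model artefact every time you train.
    print(json.dumps(run_fingerprint("data/raw_sales.csv", {"n_trees": 100}), indent=2))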

What happens when you don't build reproducible pipelines?

Imagine you're working on a super-complicated deep-learning model inherited from a predecessor. As you work, new data comes along, which you use to update the model. You perform more and more analysis of the model, looking for weaknesses. In the process, you retrain the model ten times.

Then you deploy the model to production.

Job done.

Except that, six months later, somebody else comes along to improve the model further.

They have access to the model and all the data (the artefacts). They also have access to all the code. But there's no way to know in which order the code was run, which data was present at the time the code was run, and what the purpose of the run was.

They can't reproduce the model.

How, then, are they supposed to improve it?

How to do it

Okay, I'm sold on building reproducible pipelines. How do I do it? What tools should I use?

If you're starting out as a data scientist or machine learning engineer, you might get advice such as 'write modular code that can be reused.' The problem is, you might not know what that looks like. I didn't when I started. I'm still not 100% sure I know now (but I'm more sure than I was before).
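
For what it's worth, here's roughly what that advice looks like in practice: a toy sketch, with made-up column names, where each step is a small function with explicit inputs and outputs that can be tested and reused, rather than one long script that can only be rerun by hand.

    import pandas as pd

    def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
        # Placeholder cleaning step: drop rows missing the columns we need.
        return raw.dropna(subset=["date", "sales"])

    def engineer_features(clean: pd.DataFrame) -> pd.DataFrame:
        # Placeholder feature: the month, extracted from the date column.
        return clean.assign(month=pd.to_datetime(clean["date"]).dt.month)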

I recently started using a tool called kedro to help me write data pipelines. I chose this tool because it lets me see the pipelines. It's designed around good software engineering principles. By forcing myself to use the tool, I started defining functions differently. By visualizing bad bits of the data pipeline, I could strip away layers of redundancy.
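
As a minimal sketch (reusing the toy functions above; the dataset names are hypothetical and would be declared in the project's Data Catalog), wiring those steps into a kedro pipeline looks something like this:

    from kedro.pipeline import Pipeline, node

    # clean_data and engineer_features are the functions from the sketch above.
    data_pipeline = Pipeline(
        [
            node(clean_data, inputs="raw_sales", outputs="clean_sales", name="clean"),
            node(engineer_features, inputs="clean_sales", outputs="model_input", name="engineer"),
        ]
    )

Because every node declares what it reads and what it writes, the whole graph can be drawn for you (with the kedro-viz plugin, via kedro viz), which is the 'seeing the pipelines' part.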

My software engineering friends will probably chuckle to themselves at how obvious all of this is. Still, it's a revelation to people coming at software engineering tangentially through maths and computer science.

How not to do it

There's a catch.

Let's remind ourselves of what data scientists do:

  1. Extract data
  2. Explore data
  3. Clean data
  4. Engineer features
  5. Train model
  6. Validate model
  7. Release model

Much of what we do is exploratory. Speed is of the essence. Not all experiments need to be repeatable. That means that not all code needs this treatment.

Jupyter notebooks (the default choice for most data scientists) are still an amazing tool.

Use them.

But when it does come time to write something more permanent, learn from the software engineers who have been through all this before.