Creating Pipelines with Scikit Learn
A quick introduction to creating simple (yet powerful) data pipelines with sklearn in Python.
Introduction
A pipeline in data science is a sequence of steps applied to some data to transform it in the desired way.
In general, when we work with data, we want to do something with it, such as extracting insights or building a model that predicts the outcome for inputs similar to those we have already seen.
With that in mind, Scikit Learn’s team developed the Pipeline class. With it, you can create an ordered sequence of steps that will be applied to your data, making it easier to reproduce and scale your data wrangling or modeling.
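As a rough illustration of what such an ordered sequence can look like, here is a minimal sketch. The choice of StandardScaler and LogisticRegression, and the step names "scaler" and "model", are assumptions made for this example only; any transformers followed by a final estimator would work the same way.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the names are arbitrary labels
# chosen for this example.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),     # transformer: standardizes the features
    ("model", LogisticRegression()),  # final estimator: fits on the scaled data
])

# Steps run in the order they are listed and can be looked up by name.
step_names = [name for name, _ in pipe.steps]
print(step_names)
print(pipe.named_steps["scaler"])
```

Nothing is fitted at this point. The object only records which steps to run, and in which order, once fit or predict is called on it.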
A good example is applying the same sequence of tasks for training and test purposes.
Let’s say we have a dataset that we preprocessed with data scaling and then used to train a model. When we get the test set, we have to apply the very same transformations to it before presenting the new data to the trained model.
That’s when the Pipeline comes in handy.
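The train and test scenario described above can be sketched as follows. The synthetic data from make_classification, the LogisticRegression model, and all variable names are assumptions for illustration, not part of the original example. The sketch compares the manual route, where the fitted scaler is reapplied to the test set by hand, with a Pipeline that does the same thing in one object.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Manual route: fit the scaler on the training data, then remember to
# apply that same fitted scaler to the test data before predicting.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

# Pipeline route: fit() fits the scaler and the model on the training data;
# predict() reuses the already fitted scaler on the new data automatically.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both routes apply identical transformations, so the predictions match.
same = np.array_equal(manual_pred, pipe_pred)
print(same)
```

With the pipeline, there is a single object to fit and to call on new data, so the test set cannot accidentally skip the scaling step or be scaled with statistics computed from the wrong data.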