Why Learn Machine Learning Pipelines?
When you start out in machine learning, it’s tempting to jump right into training models. But in real projects, model training is just one piece of the puzzle.
A machine learning pipeline is the step-by-step process that transforms raw data into something valuable: a trained model that can make predictions in the real world.
In this guide, we’ll break down:
- What a machine learning pipeline is
- The essential stages with code examples
- How to automate your pipeline with scikit-learn
- Best practices to structure pipelines like a pro
What Is a Machine Learning Pipeline?
A machine learning pipeline is a sequence of steps that:
- Ingests data
- Cleans it
- Engineers features
- Trains a model
- Evaluates performance
- (Optionally) Deploys the model
This helps you:
- Avoid repeating code
- Automate workflows
- Make experiments reproducible
- Transition smoothly to production
Think of it like a factory line: messy input goes in, intelligent predictions come out.
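To make that concrete before we dig in, here is the whole idea in miniature on synthetic data. This is a preview sketch only; we build the real churn pipeline step by step below.
# Preview: the full journey in a dozen lines, on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=42)  # stand-in for "ingest"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),     # clean / engineer
    ('model', LogisticRegression())   # train
])
pipe.fit(X_train, y_train)
print("Held-out accuracy:", pipe.score(X_test, y_test))  # evaluate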
Step-by-Step Machine Learning Pipeline

We’ll go through each stage and build a pipeline for a real problem: predicting customer churn.
We’ll use:
pandas, scikit-learn, joblib, and (optionally) Streamlit
Let’s load a dataset:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/Customer.csv")
df.head()
This dataset includes customer demographics, services, and whether they churned.
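Before cleaning anything, it helps to check the shape, dtypes, and class balance. A quick sketch (the Churn column holds Yes/No values, as used throughout this guide):
df.shape                                   # (rows, columns)
df.info()                                  # dtypes and non-null counts per column
df['Churn'].value_counts(normalize=True)   # how imbalanced is the target?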
Step 1: Data Cleaning
Handle Missing Values
df.isnull().sum() # check for nulls
df.dropna(inplace=True)
Or use imputation:
df['MonthlyCharges'] = df['MonthlyCharges'].fillna(df['MonthlyCharges'].median())  # assignment avoids the deprecated chained-inplace pattern
Convert Data Types
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
📌 Tip: Always inspect your data types; object columns may hide numbers stored as strings.
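A quick sketch of that inspection, plus the follow-up it implies: errors='coerce' turns unparseable strings into NaN, so impute (or drop) again afterwards:
print(df.dtypes)                          # object columns can hide numeric data
print(df.select_dtypes('object').head())  # eyeball the remaining string columns
# the coercion above may have introduced fresh NaNs in TotalCharges
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())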
Step 2: Feature Engineering
🔹 Encode Categorical Features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])  # alphabetical encoding: Female=0, Male=1
Or use one-hot encoding, which avoids implying an order between categories (we exclude the Churn target here, because the split below still expects its Yes/No values):
df = pd.get_dummies(df, columns=[c for c in df.select_dtypes('object').columns if c != 'Churn'], drop_first=True)
🔹 Create Interaction Features
df['TenurePerService'] = df['tenure'] / (df['TotalCharges'] + 1)  # ratio feature; +1 guards against division by zero
Step 3: Splitting Data
from sklearn.model_selection import train_test_split
X = df.drop('Churn', axis=1)
y = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)  # stratify keeps the churn ratio identical in both splits
Step 4: Build Pipeline with scikit-learn
Rather than writing separate preprocessing code, we can automate it using Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
Why use Pipeline?
- Keeps preprocessing + model tied together
- Prevents data leakage
- Easier to tune hyperparameters (see the sketch below)
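For example, hyperparameter tuning with GridSearchCV stays leak-free because the scaler is refit inside every cross-validation fold. Parameters are addressed as step__parameter, so model__C below targets the LogisticRegression step defined above:
from sklearn.model_selection import GridSearchCV

param_grid = {'model__C': [0.01, 0.1, 1, 10]}  # 'model' is the step name in pipe
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)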
Step 5: Model Evaluation
from sklearn.metrics import accuracy_score, classification_report
y_pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
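Churn data is usually imbalanced, so accuracy alone can be misleading; ROC-AUC on predicted probabilities is a useful complement. A quick sketch using the fitted pipeline:
from sklearn.metrics import roc_auc_score

y_proba = pipe.predict_proba(X_test)[:, 1]  # probability of the positive (churn) class
print("ROC-AUC:", roc_auc_score(y_test, y_proba))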
Try other models too:
from sklearn.ensemble import RandomForestClassifier
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),  # scaling is unnecessary for trees, but harmless here
    ('model', RandomForestClassifier())
])
pipe_rf.fit(X_train, y_train)
Step 6: Save the Model
Once you have a good model, you can save and reload it.
import joblib
joblib.dump(pipe, "churn_model.pkl")
To load:
loaded_model = joblib.load("churn_model.pkl")
loaded_model.predict(X_test)
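Note that the loaded pipeline expects exactly the columns it was trained on. To score a single new customer, pass a one-row DataFrame with the training schema (here we reuse a test row as a stand-in for fresh data):
new_customer = X_test.iloc[[0]]            # one-row DataFrame, training schema intact
print(loaded_model.predict(new_customer))  # e.g. array([0]) -> no churn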
Step 7: (Optional) Deploy with Flask or Streamlit
To deploy locally using Streamlit:
# Save this in app.py
import streamlit as st
import joblib
model = joblib.load("churn_model.pkl")
st.title("Churn Predictor")
tenure = st.number_input("Tenure")
monthly = st.number_input("Monthly Charges")
# Add other inputs...
if st.button("Predict"):
    input_data = [[tenure, monthly]]  # illustrative only: a real app must supply every training feature, in order
    prediction = model.predict(input_data)
    st.write("Prediction:", "Churn" if prediction[0] == 1 else "No Churn")
Run the app:
streamlit run app.py
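The heading also mentions Flask, so here is a minimal sketch of the same model behind a JSON endpoint. The /predict route and payload shape are illustrative, and the incoming JSON keys must match the training columns:
# Save this in flask_app.py (illustrative sketch, not a production server)
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("churn_model.pkl")

@app.route("/predict", methods=["POST"])
def predict():
    row = pd.DataFrame([request.get_json()])  # one JSON object -> one-row frame
    return jsonify({"churn": int(model.predict(row)[0])})

if __name__ == "__main__":
    app.run(debug=True)
Once the server is running, you can test it with curl or any HTTP client.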
Best Practices for Pipelines
- Use Pipeline for reproducibility
- Preprocess with ColumnTransformer when using mixed data types (sketched below)
- Always split your data early (before feature scaling)
- Use cross-validation (GridSearchCV) for tuning
- Save both pipeline and model
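As a sketch of that ColumnTransformer advice, assuming the raw churn frame before get_dummies (the column names here are assumptions based on the dataset described above):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']  # assumed column names
categorical_cols = ['Gender', 'Contract']                    # assumed column names

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

pipe_ct = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])
# pipe_ct.fit(X_train, y_train) then behaves exactly like the pipelines above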
Additional Resources:
- scikit-learn Pipeline Docs
- joblib Documentation
- Streamlit Docs
- PyCaret Automated Machine Learning Pipelines