Machine Learning Pipeline in Python: From Raw Data to Deployed Model

Why Learn Machine Learning Pipelines?

When you start out in machine learning, it’s tempting to jump right into training models. But in real projects, model training is just one piece of the puzzle.

A machine learning pipeline is the step-by-step process that transforms raw data into something valuable: a trained model that can make predictions in the real world.

In this guide, we’ll break down:

  • What a machine learning pipeline is
  • The essential stages with code examples
  • How to automate your pipeline with scikit-learn
  • Best practices to structure pipelines like a pro

What Is a Machine Learning Pipeline?

A machine learning pipeline is a sequence of steps that:

  1. Ingests data
  2. Cleans it
  3. Engineers features
  4. Trains a model
  5. Evaluates performance
  6. (Optionally) Deploys the model

This helps you:

  • Avoid repeating code
  • Automate workflows
  • Make experiments reproducible
  • Transition smoothly to production

Think of it like a factory line: messy input goes in, intelligent predictions come out.


Step-by-Step Machine Learning Pipeline

[Figure: code steps in a typical machine learning pipeline]

We’ll go through each stage and build a pipeline for a real problem: predicting customer churn.

We’ll use Python with pandas, NumPy, seaborn, and scikit-learn.

Let’s load a dataset:

Python
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/Customer.csv")
df.head()

This dataset includes customer demographics, services, and whether they churned.
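
Before cleaning anything, it helps to take a quick look at the size and target balance. This assumes the CSV loads with a 'Churn' column, as used later in this guide:

Python
print(df.shape)                    # rows, columns
print(df['Churn'].value_counts())  # how imbalanced is the target?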


Step 1: Data Cleaning

Handle Missing Values

Python
df.isnull().sum()        # count missing values per column
df.dropna(inplace=True)  # drop any row with a missing value

Or use imputation:

Python
# Assign back instead of inplace=True: inplace fillna on a single column is deprecated in newer pandas
df['MonthlyCharges'] = df['MonthlyCharges'].fillna(df['MonthlyCharges'].median())

Convert Data Types

Python
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')  # invalid strings become NaN

📌 Tip: Always inspect your data types; object columns may hide numbers stored as strings.
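
A quick way to run that check (select_dtypes lists the columns pandas read as strings):

Python
print(df.dtypes)                                  # column types at a glance
print(df.select_dtypes(include='object').head())  # 'object' columns often hide numeric data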


Step 2: Feature Engineering

🔹 Encode Categorical Features

Python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])  # Male=1, Female=0

Or use one-hot encoding:

Python
# Encode the target first so get_dummies doesn't turn it into a 'Churn_Yes' column
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, drop_first=True)

🔹 Create Interaction Features

Python
# Ratio feature: tenure relative to total spend (+1 avoids division by zero)
df['TenurePerService'] = df['tenure'] / (df['TotalCharges'] + 1)

Step 3: Splitting Data

Python
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']  # already mapped to 1/0 during encoding above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps the churn ratio consistent in both splits
)

Step 4: Build Pipeline with scikit-learn

Rather than writing separate preprocessing code, we can automate it using Pipeline.

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000))  # a higher max_iter avoids convergence warnings
])

pipe.fit(X_train, y_train)

Why use Pipeline?

  • Keeps preprocessing + model tied together
  • Prevents data leakage
  • Easier to tune hyperparameters (see the sketch below)
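
For example, GridSearchCV can tune parameters straight through the pipeline. This is a minimal sketch: the 'model__C' name reaches into the step we called 'model', and the grid values are just illustrative.

Python
from sklearn.model_selection import GridSearchCV

# '<step name>__<parameter>' addresses a parameter inside a pipeline step,
# so 'model__C' tunes the LogisticRegression regularization strength
param_grid = {'model__C': [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)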

Step 5: Model Evaluation

Python
from sklearn.metrics import accuracy_score, classification_report

y_pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Try other models too:

Python
from sklearn.ensemble import RandomForestClassifier

pipe_rf = Pipeline([
    ('scaler', StandardScaler()),  # trees don't need scaling, but it keeps pipelines interchangeable
    ('model', RandomForestClassifier(random_state=42))  # fixed seed for reproducibility
])
pipe_rf.fit(X_train, y_train)

Step 6: Save the Model

Once you have a good model, you can save and reload it.

Python
import joblib
joblib.dump(pipe, "churn_model.pkl")

To load:

Python
loaded_model = joblib.load("churn_model.pkl")
loaded_model.predict(X_test)
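
Because the saved object is the whole pipeline, scaling happens automatically at prediction time. For instance, to score a single customer (here a row of X_test stands in for new data):

Python
one_customer = X_test.iloc[[0]]  # double brackets keep a 2-D DataFrame, as predict expects
print(loaded_model.predict(one_customer))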

Step 7: (Optional) Deploy with Flask or Streamlit

To deploy locally using Streamlit:

Python
# Save this in app.py
import streamlit as st
import joblib

model = joblib.load("churn_model.pkl")

st.title("Churn Predictor")
tenure = st.number_input("Tenure")
monthly = st.number_input("Monthly Charges")

# Add the remaining inputs here...

if st.button("Predict"):
    # NOTE: the pipeline expects every feature it was trained on, in the same
    # order; a two-value input like this will fail until you collect them all
    input_data = [[tenure, monthly]]
    prediction = model.predict(input_data)
    st.write("Prediction:", "Churn" if prediction[0] == 1 else "No Churn")

Run the app:

Bash
streamlit run app.py

Best Practices for Pipelines

  • Use Pipeline for reproducibility
  • Preprocess with ColumnTransformer when using mixed data types (see the sketch below)
  • Always split your data early, before any scaling or encoding is fit
  • Use cross-validation (e.g. GridSearchCV) for tuning
  • Save the whole fitted pipeline so preprocessing travels with the model
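
Here’s a minimal sketch of the ColumnTransformer pattern, assuming hypothetical numeric and categorical column lists; swap in your dataset’s actual columns:

Python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical column lists: replace with your own
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_cols = ['Gender', 'Contract']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),              # scale numeric columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),  # one-hot encode categoricals
])

pipe_ct = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])
# pipe_ct.fit(X_train, y_train) then works on the raw, unencoded DataFrame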
