Overfitting vs Underfitting in Machine Learning – Complete Guide with Real Examples

When we build machine learning models, we want them to do one thing well: make accurate predictions on new, unseen data.

But there are two common pitfalls that ruin that goal:

  • Overfitting – your model learns the training data too well, including its noise and quirks.
  • Underfitting – your model is too simple to capture the underlying pattern in the data.

In this guide, I’ll break down both concepts without heavy math, using visual explanations, relatable analogies, and real-world Python code examples.

By the end, you’ll understand:

  • What overfitting and underfitting look like in real datasets
  • Why these problems occur
  • How to detect them using learning curves and validation scores
  • Practical ways to fix them

What Are Overfitting and Underfitting?

Imagine you’re preparing for an exam:

  • If you memorize every question from a practice test but don’t understand the concepts, you’ll fail as soon as the questions change. That’s overfitting.
  • If you just skim the book and barely learn anything, you’ll fail because you don’t know enough. That’s underfitting.

ML models behave the same way.


Definitions and Key Differences

Aspect             | Overfitting               | Underfitting
Model Behavior     | Too complex, learns noise | Too simple, misses patterns
Training Accuracy  | Very high                 | Low
Test Accuracy      | Poor                      | Poor
Generalization     | Weak                      | Weak
Fix Strategy       | Simplify, regularize      | Add complexity, better features
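
A quick way to spot these symptoms in code is to compare training and test accuracy directly. The snippet below is a minimal sketch rather than the guide's main example: the synthetic make_classification data and the unconstrained decision tree are illustrative choices, but the pattern they show (near-perfect training accuracy, clearly lower test accuracy) is the classic overfitting signature from the table above.

Python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree keeps splitting until it memorizes the training set
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # typically close to 1.0
print("Test accuracy: ", tree.score(X_test, y_test))    # noticeably lower → overfitting symptom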

Real-Life Analogy

Let’s say you’re learning to drive:

  • Underfitting: You only read the rules but never practice, so you fail even in familiar settings.
  • Overfitting: You only drive on your home street and memorize every bump; the moment you go to a new area, you panic.

A good driver (or ML model) should generalize to new roads (data).


Visual Explanation with Python Code

Let’s simulate this using Scikit-learn:

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Generate fake data
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Try different model complexities
degrees = [1, 4, 15]

plt.figure(figsize=(18, 4))

for i, d in enumerate(degrees):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)

    plt.subplot(1, 3, i+1)
    plt.scatter(X, y, color='gray', label="Actual data")
    plt.plot(X, model.predict(X), label=f"Degree {d}", linewidth=2)
    plt.title(f"Model with Degree {d}")
    plt.legend()

plt.show()

Interpretation:

  • Degree 1: underfits (can’t capture sine wave)
  • Degree 15: overfits (wavy, unnatural fit)
  • Degree 4: best generalization
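
To back the visual impression with numbers, you can compute the error on the held-out test set for each degree. This is a small sketch reusing the X_train/X_test split from the code above; the exact values will vary with the random split, but degree 1 and degree 15 should both show a worse test error than degree 4.

Python
from sklearn.metrics import mean_squared_error

# Compare training error vs test error for each polynomial degree
for d in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {d:2d} | train MSE: {train_mse:.4f} | test MSE: {test_mse:.4f}")

# Typical pattern:
# degree 1  → both errors high (underfitting)
# degree 15 → very low train error, higher test error (overfitting)
# degree 4  → both errors low (good generalization)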

Learning Curves – How to Detect Overfitting/Underfitting


Plot training and validation scores over increasing dataset sizes:

Python
from sklearn.model_selection import learning_curve
from sklearn.metrics import mean_squared_error

train_sizes, train_scores, val_scores = learning_curve(
    make_pipeline(PolynomialFeatures(15), LinearRegression()),
    X, y, cv=5, scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5)
)

train_mean = -np.mean(train_scores, axis=1)
val_mean = -np.mean(val_scores, axis=1)

plt.plot(train_sizes, train_mean, 'o-', label='Training Error')
plt.plot(train_sizes, val_mean, 'o-', label='Validation Error')
plt.title("Learning Curve (Overfitting Example)")
plt.xlabel("Training Set Size")
plt.ylabel("Mean Squared Error")
plt.legend()
plt.grid()
plt.show()

What to Look For:

  • Overfitting: Large gap between training and validation error.
  • Underfitting: Both errors are high and close together.
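
For contrast, you can produce the same plot for the underfitting case. This is a minimal sketch reusing the imports and the sine data from above, simply swapping the degree-15 pipeline for a degree-1 one; both curves should sit high and close together, which is the underfitting signature.

Python
# Same learning-curve plot, but for the too-simple (degree-1) model
train_sizes, train_scores, val_scores = learning_curve(
    make_pipeline(PolynomialFeatures(1), LinearRegression()),
    X, y, cv=5, scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 5)
)

plt.plot(train_sizes, -np.mean(train_scores, axis=1), 'o-', label='Training Error')
plt.plot(train_sizes, -np.mean(val_scores, axis=1), 'o-', label='Validation Error')
plt.title("Learning Curve (Underfitting Example)")
plt.xlabel("Training Set Size")
plt.ylabel("Mean Squared Error")
plt.legend()
plt.grid()
plt.show()

# Adding more data barely helps here: the model is too simple
# to capture the sine pattern no matter how many points it sees.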

How to Fix Overfitting

  1. Simplify the model
    Use fewer features or reduce polynomial degree.
  2. Regularization
    Add penalties to large weights. Use Ridge, Lasso, or ElasticNet (see the sketch after this list).
  3. More data
    The more diverse data you feed, the better the model learns general patterns.
  4. Early stopping
    For iterative algorithms (like neural networks), stop before it overfits.
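
For point 2, regularization is often a one-line change. The sketch below is an illustrative assumption rather than this guide's original code: it reuses the degree-15 polynomial setup from earlier, scales the features, and swaps LinearRegression for Ridge; alpha=1.0 is only a starting value and is normally tuned, for example with cross-validation.

Python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Degree-15 polynomial as before, but with scaled features and an L2 penalty on the weights
ridge_model = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X_train, y_train)

plain_model = make_pipeline(PolynomialFeatures(15), LinearRegression())
plain_model.fit(X_train, y_train)

# The regularized pipeline usually generalizes noticeably better on the test split
print("Plain degree-15 test R²:", plain_model.score(X_test, y_test))
print("Ridge degree-15 test R²:", ridge_model.score(X_test, y_test))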

How to Fix Underfitting

  1. Add more features
    If your model is too simple, give it richer information.
  2. Increase model complexity
    Switch from linear to polynomial features, or try a more flexible algorithm (see the example after this list).
  3. Decrease regularization
    Too much penalty? It can make the model too constrained.
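
For point 2, one route is to keep a linear model but give it richer (for example polynomial) features; another is to switch to a more flexible algorithm. The sketch below reuses the sine-data split from earlier; the RandomForestRegressor is only an illustrative choice of a more flexible model, not something prescribed by this guide.

Python
from sklearn.ensemble import RandomForestRegressor

# Underfitting baseline: a straight line fit to a sine-shaped relationship
linear = LinearRegression().fit(X_train, y_train)

# Two ways to add capacity: engineered polynomial features, or a more flexible model
poly4 = make_pipeline(PolynomialFeatures(4), LinearRegression()).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Linear R²:       ", linear.score(X_test, y_test))
print("Degree-4 R²:     ", poly4.score(X_test, y_test))
print("Random forest R²:", forest.score(X_test, y_test))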

Real-World Example: Predicting Housing Prices

Let’s compare two models on a dataset:

Python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y)

model_under = LinearRegression()
model_under.fit(X_train[:, :2], y_train)  # only 2 features

model_over = Ridge(alpha=0.1)
model_over.fit(X_train, y_train)

print("Underfit R²:", r2_score(y_test, model_under.predict(X_test[:, :2])))
print("Overfit R²:", r2_score(y_test, model_over.predict(X_test)))

Result:

  • Too few features → underfitting: the model can’t explain much of the variance, so its R² stays low.
  • All features with only weak regularization → far more capacity; push that further (more engineered features, a lower alpha) and the model starts to overfit.
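
A single train/test split can be lucky or unlucky, so a more robust comparison uses cross-validation. This is a small sketch on the same California housing data, reporting the mean R² across 5 folds for the two-feature model and the full-feature Ridge model.

Python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² for each setup
underfit_scores = cross_val_score(LinearRegression(), X[:, :2], y, cv=5, scoring='r2')
ridge_scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=5, scoring='r2')

print("Two-feature model mean R²:", underfit_scores.mean().round(3))
print("All-feature Ridge mean R²:", ridge_scores.mean().round(3))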

Common Mistakes to Avoid

Mistake                         | Fix
Using test data for tuning      | Always keep test/validation data separate
Ignoring validation performance | Monitor both train and val metrics
Blindly increasing model depth  | More isn’t always better
No regularization               | Use L2/L1 to avoid overfitting
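
To make the first row concrete: one simple way to keep tuning and final evaluation separate is to split twice, producing distinct train, validation, and test sets. The 60/20/20 proportions below are only an illustrative choice; the point is that the test set is touched exactly once, at the very end.

Python
from sklearn.model_selection import train_test_split

# First hold out a test set, then carve a validation set out of what remains
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42  # 0.25 of the remaining 80% = 20% overall
)

# Tune hyperparameters against X_val / y_val; report final performance on X_test / y_test only once.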


💌 Stay Updated with PyUniverse

Want Python and AI explained simply, straight to your inbox?

Join hundreds of curious learners who get:

  • ✅ Practical Python tips & mini tutorials
  • ✅ New blog posts before anyone else
  • ✅ Downloadable cheat sheets & quick guides
  • ✅ Behind-the-scenes updates from PyUniverse

No spam. No noise. Just useful stuff that helps you grow, one email at a time.

🛡️ I respect your privacy. You can unsubscribe anytime.
