🤖 Machine Learning Course

Machine Learning in Action: Build a Spam Filter 📧

A complete beginner's guide to ML classification: teach a computer to detect spam using Python, pandas, and scikit-learn. Understand the foundational concepts that power real-world AI systems.

🟡 Intermediate Beginner ⏱️ ~90 Minutes 🐍 Python + scikit-learn 🧠 Classification Project
🚀
This tutorial is part of the EgoTECH AI & Full-Stack Learning Path!

Python Basics → Python CRUD → Machine Learning (You Are Here) → Deep Learning → Django AI Apps → Full-Stack AI Developer

🎯 Every concept here (training, prediction, evaluation) maps directly to production AI systems.


🎯 Introduction — What Is Machine Learning Classification?

Welcome to one of the most exciting fields in modern technology! 🎉 In this tutorial, you will build a spam email classifier — a real Machine Learning model that can look at an email and decide whether it is spam or legitimate. By the time you finish, you will understand the complete ML workflow that powers everything from Gmail's spam filter to fraud detection at banks.

This is the second tutorial in our ML series. In the first, we predicted a continuous number (house price — regression). Here, we tackle a classification problem: the output is a category, not a number. Our two categories are simple:

🔴
Spam (Class 1)

Unwanted emails, usually containing suspicious keywords, promotional tricks, or phishing attempts. We label these as 1.

🟢
Not Spam / Ham (Class 0)

Legitimate emails — from real people, colleagues, or trusted services. We label these as 0.

The goal of our ML model is to learn the boundary between these two classes from example data, and then use that knowledge to classify new, unseen emails automatically.

❓ The Problem: Why We Can't Just Use Simple Rules

You might wonder: why use Machine Learning at all? Why not just write a simple rule like "if the email contains the word 'free', mark it as spam"?

Here is why rule-based systems fail in the real world:

Approach | How It Works | Problem
Rule-Based | You manually write rules: "if contains 'free' → spam" | ❌ "Free shipping on your order" is NOT spam. Rules break immediately.
Keyword Blacklist | Block any email with banned words | ❌ Spammers just replace words with symbols: "fr3e", "w!nner"
Machine Learning | Show the model thousands of examples, let it find patterns | ✅ Learns complex combinations of features. Adapts to new patterns.
🔑 The Core Insight: Instead of telling the computer the rules, Machine Learning shows the computer examples and lets it figure out the rules itself. This is why ML is so powerful — it can discover patterns humans would never think to write manually.
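
To make the failure of hand-written rules concrete, here is a minimal sketch. The function rule_based_is_spam and the three sample emails are hypothetical, written only for this illustration.

```python
# Hypothetical one-word rule, written only to show why such rules misfire.
def rule_based_is_spam(text: str) -> bool:
    """Flag an email as spam if it contains the exact word 'free'."""
    return "free" in text.lower().split()

emails = [
    "Free shipping on your order",   # legitimate, but contains "free"
    "Claim your fr3e prize now",     # spam, but "free" is obfuscated
    "Team meeting tomorrow at 9am",  # legitimate
]

for text in emails:
    print(rule_based_is_spam(text), "<-", text)
```

The rule flags the legitimate shipping email and misses the obfuscated spam, which are the two failure modes listed in the table above.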

🛠️ Prerequisites & Setup

Before we start, make sure you have Python and the required libraries installed.

✅ What You Need

📦 Install Required Libraries

Open your terminal and run this single command:

pip install pandas scikit-learn numpy matplotlib

Here is what each library does:

Library | Purpose | Used For
pandas | Data manipulation | Loading and organising our email dataset
scikit-learn | Machine Learning | Logistic Regression model, train/test split, metrics
numpy | Numerical computing | Array operations used internally by scikit-learn
matplotlib | Visualisation | Plotting the confusion matrix

1 Python Fundamentals — Feature Engineering

The first task in any ML project is converting real-world data into a format the computer can understand. This is called Feature Engineering — one of the most important skills in data science.

We cannot feed raw email text directly into most ML algorithms. Instead, we extract features — specific measurable properties of each email. For simplicity in this beginner tutorial, we will check for the presence of three keywords commonly found in spam emails: "free", "winner", and "prize".
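
As a rough sketch of how raw text could be turned into these three features, the helper below checks each keyword as a whole word. The name extract_features and the regular expression are assumptions made for this example; the tutorial itself types the feature values in by hand.

```python
import re

KEYWORDS = ("free", "winner", "prize")

def extract_features(text: str) -> dict:
    """Return a 1/0 flag per keyword, matched as whole lowercase words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {f"contains_{kw}": int(kw in words) for kw in KEYWORDS}

print(extract_features("FREE WINNER! Claim your PRIZE now!"))
print(extract_features("FREE shipping on your order #4521"))
```

Each returned dictionary has the same feature keys as the rows in the dataset that follows, minus the is_spam label.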

📊 Our Dataset

We represent each email as a dictionary with four fields: three feature columns (1 = keyword present, 0 = not present) and one label column (is_spam: 1 = spam, 0 = not spam).

# ── Import our essential libraries ──────────────────
import pandas as pd
import numpy as np

# ── Our dataset: 12 emails (expanded for better training) ──
email_data = [
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Ham
    {'contains_free': 1, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Ham ("free shipping")
    {'contains_free': 0, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Ham
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 1},  # Spam
    {'contains_free': 1, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 0},  # Ham ("winner of the quiz")
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Ham
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},  # Spam
]

# ── Create a Pandas DataFrame ──────────────────────
df = pd.DataFrame(email_data)

print("Our email dataset:")
print(df)
print(f"\nSpam emails: {df['is_spam'].sum()}")
print(f"Ham emails:  {len(df) - df['is_spam'].sum()}")

📤 Expected Output

Our email dataset:
    contains_free  contains_winner  contains_prize  is_spam
0               1                1               1        1
1               0                0               0        0
2               1                0               0        0
3               0                1               1        1
4               0                0               1        1
5               0                0               0        0
6               1                1               0        1
7               1                0               1        1
8               0                1               0        0
9               0                0               0        0
10              1                1               1        1
11              0                0               1        1

Spam emails: 7
Ham emails:  5
💡 Feature Engineering Insight: Notice how email row #2 (contains_free=1, everything else 0) is labelled Not Spam. This represents a legitimate email like "free shipping on your order". And row #8 (contains_winner=1, else 0) is also Not Spam — like "you were the winner of the school quiz". The model learns these combinations matter, not just individual words. This is the power of ML over simple rules!

2 Data Structures & Algorithms — The Right Tools

Our DataFrame is a two-dimensional, table-like structure. Internally, pandas stores the column data in NumPy arrays, which enables fast mathematical operations on entire columns at once. This is essential for ML, where we might process millions of rows.
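
A small sketch of what "operations on entire columns at once" looks like in practice. The four-row DataFrame and the keyword_count column are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "contains_free":   [1, 0, 1, 0],
    "contains_winner": [1, 0, 0, 1],
    "contains_prize":  [1, 0, 0, 1],
})

# One expression adds the three columns for every row at once (no Python loop).
feature_cols = ["contains_free", "contains_winner", "contains_prize"]
df["keyword_count"] = df[feature_cols].sum(axis=1)

# to_numpy() exposes the column values as a NumPy array.
print(df["keyword_count"].to_numpy())
```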

For a real-world spam filter checking against thousands of keywords, a Python set (based on a hash map) provides near-instant lookups — far faster than checking a list one by one.

# ── Using a Set for Fast Keyword Lookup ─────────────
# A set provides O(1) membership checking vs O(n) for a list

spam_keywords = {'free', 'winner', 'prize', 'congratulations',
                 'click here', 'act now', 'limited offer'}

# Test some words from an incoming email
test_words = ['hello', 'prize', 'meeting', 'free', 'report']

print("Keyword scan results:")
for word in test_words:
    result = "🔴 SPAM WORD" if word in spam_keywords else "🟢 Clean"
    print(f"  '{word}': {result}")

# Count how many spam keywords are in the email
spam_count = sum(1 for w in test_words if w in spam_keywords)
print(f"\nSpam keywords found: {spam_count}/{len(test_words)}")

📤 Expected Output

Keyword scan results:
  'hello': 🟢 Clean
  'prize': 🔴 SPAM WORD
  'meeting': 🟢 Clean
  'free': 🔴 SPAM WORD
  'report': 🟢 Clean

Spam keywords found: 2/5
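ℹ️ Time Complexity: Checking if word in spam_keywords with a set is O(1) on average: roughly constant time, regardless of how many keywords are in the set. With a list, it would be O(n), because Python compares the word against the items one by one until it finds a match. With 50,000 keywords, a list lookup may need up to 50,000 comparisons where a set needs about one hash lookup. Note that this scan checks single words, so multi-word entries such as 'click here' would need phrase matching on the full text to be detected. This is why DSA matters in real ML systems!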
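
One way to see the difference is to time both lookups yourself. This is a rough sketch: the 50,000 generated keywords are placeholders, and the measured times depend on your machine, so no specific speed-up is claimed.

```python
import timeit

n = 50_000
keywords_list = [f"keyword{i}" for i in range(n)]  # hypothetical keywords
keywords_set = set(keywords_list)
probe = "keyword49999"                             # last item: worst case for the list

# Repeat each membership test 200 times and compare total time.
list_time = timeit.timeit(lambda: probe in keywords_list, number=200)
set_time = timeit.timeit(lambda: probe in keywords_set, number=200)

print(f"list: {list_time:.5f}s   set: {set_time:.5f}s")
```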

3 Algorithmic Problem Solving — The ML Blueprint

Every successful ML project follows a clear, repeatable workflow. Before writing any model code, you must understand the complete plan. Here is the blueprint we will follow:

# | Step | What We Do | scikit-learn Tool
1 | Feature Extraction | Convert emails to numerical features (0s and 1s) | pandas DataFrame
2 | X/y Separation | Split features (inputs) from labels (outputs) | df[[cols]], df[col]
3 | Train/Test Split | Reserve some data for testing; never train on test data! | train_test_split()
4 | Model Selection | Choose Logistic Regression for binary classification | LogisticRegression()
5 | Training | Show the model the training data and let it learn | model.fit(X_train, y_train)
6 | Prediction | Use the trained model to classify new emails | model.predict(X_test)
7 | Evaluation | Measure accuracy, precision, recall, confusion matrix | metrics module
🔵 Important — The Golden Rule of ML: Never test your model on data it was trained on. That would be like giving students the exam answers to study, then testing them on the same answers — of course they'd score 100%! We always split data into a training set (model learns from this) and a test set (we evaluate on this — model has never seen it).

4 Discrete Structures — The World of Categories

Unlike regression (which predicts a continuous number on an infinite scale), classification deals with discrete categories — a finite set of possible outputs. This is where discrete mathematics becomes essential.

Our classification model is formally a function:

f(contains_free, contains_winner, contains_prize) → { 0, 1 }

The model maps an input vector of features to one value in the finite set {0, 1}.

📐
Input Space

3 binary features = 2³ = 8 possible input combinations. The model must learn a classification for each.

✂️
Decision Boundary

A hyperplane in feature space that separates spam (class 1) from ham (class 0). The model's job is to find the best boundary.

🎯
Output Set

Only 2 possible outputs: 0 (Not Spam) or 1 (Spam). This is called binary classification.

The model learns weights for each feature. A high positive weight on contains_winner means that feature strongly pushes the prediction toward spam. A low weight means it barely matters. These weights define the decision boundary.
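
The sketch below enumerates all 8 possible inputs and applies a weighted sum to each. The weights and bias are hypothetical values picked by hand for illustration; they are not the weights a trained model would report.

```python
from itertools import product

# Hypothetical weights and bias, chosen by hand for illustration only.
W_FREE, W_WINNER, W_PRIZE, BIAS = 0.5, 1.0, 2.0, -1.2

def classify(free: int, winner: int, prize: int) -> int:
    """Return 1 (spam) if the weighted score is positive, else 0 (ham)."""
    z = W_FREE * free + W_WINNER * winner + W_PRIZE * prize + BIAS
    return 1 if z > 0 else 0   # z = 0 is the decision boundary

all_inputs = list(product([0, 1], repeat=3))   # 2**3 = 8 combinations
for free, winner, prize in all_inputs:
    print(f"free={free} winner={winner} prize={prize} -> {classify(free, winner, prize)}")
```

With these particular values, "free" alone and "winner" alone stay below the boundary, while "free" and "winner" together cross it.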

5 Calculus Essentials — The Sigmoid Function & Gradient Descent

How does Logistic Regression actually learn? It uses calculus to optimise its weights. Here is the intuition without going too deep into the maths:

The Sigmoid Function 📈

The model first computes a raw score z based on the input features and learned weights. It then passes this score through the Sigmoid function, which "squishes" any number into a value between 0 and 1 — a probability:

σ(z) = 1 / (1 + e^(−z))

For example, if the model computes a score of z = 3.0 for a suspicious email, the Sigmoid gives σ(3.0) ≈ 0.95 — a 95% probability of being spam! If z = -2.0, then σ(-2.0) ≈ 0.12 — only 12% chance of spam.
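
A few lines of Python are enough to check these numbers. The function name sigmoid is simply our own choice for this sketch.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (3.0, 0.0, -2.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.3f}")
# sigmoid(+3.0) = 0.953
# sigmoid(+0.0) = 0.500
# sigmoid(-2.0) = 0.119
```

A score of exactly 0 maps to 0.5, which is why 0.5 is the natural cut-off between the two classes.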

Gradient Descent — How the Model Learns ⛷️

The model starts with random weights (random guesses). It then:

  1. Makes predictions on the training data
  2. Calculates how wrong it is using a loss function (Log Loss for classification)
  3. Uses calculus (the derivative/gradient) to figure out which direction to adjust each weight
  4. Takes a small step in that direction (learning rate controls the step size)
  5. Repeats thousands of times until the loss stops decreasing
💡 Analogy: Imagine you are blindfolded on a hilly landscape. Your goal is to find the lowest valley (minimum loss). Gradient Descent is like feeling which direction is downhill with your foot and taking a step that way — repeatedly — until you reach the bottom. The gradient (derivative) tells you which direction is downhill!
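
The five steps above can be sketched from scratch with NumPy. This is a teaching sketch, not how scikit-learn implements training: the learning rate of 0.5, the 2,000 steps, and the zero starting weights are assumptions chosen for this tiny dataset.

```python
import numpy as np

# The 12 emails from this tutorial: columns are free, winner, prize.
X = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0],
              [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]],
             dtype=float)
y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b):
    # Clip probabilities so log() never receives exactly 0 or 1.
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(3)        # starting weights (zeros here; random values also work)
b = 0.0
learning_rate = 0.5    # assumed step size

initial_loss = log_loss(w, b)
for step in range(2000):
    p = sigmoid(X @ w + b)                       # 1. make predictions
    error = p - y                                # 2-3. gradient of the log loss w.r.t. z
    w -= learning_rate * (X.T @ error) / len(y)  # 4. small step downhill
    b -= learning_rate * error.mean()            # 5. repeat
final_loss = log_loss(w, b)

print(f"loss before: {initial_loss:.3f}   loss after: {final_loss:.3f}")
print("learned weights:", np.round(w, 2), "bias:", round(b, 2))
```

The loss before training is ln 2 ≈ 0.693, the value you get when every prediction is 0.5; it should be clearly lower after training.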

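The good news? scikit-learn handles all of this automatically! When you call model.fit(X, y), it runs an iterative optimiser internally to find the best weights. By default, LogisticRegression uses the L-BFGS solver, a more sophisticated relative of gradient descent that follows the same idea of moving downhill on the loss, and it stops after at most 100 iterations unless you change max_iter. You just call one line of code. ✨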

6 Statistics — The Confusion Matrix & Evaluation Metrics

After training a model, how do we know if it is actually good? This is where statistics comes in. Simple accuracy (the percentage of correct predictions) can be misleading on imbalanced datasets, where one class is far more common than the other.

⚠️ The Accuracy Trap: Imagine 99% of emails are not spam. A "dumb" model that always predicts "Not Spam" (never catches any spam!) would still score 99% accuracy. Clearly useless — but the accuracy number looks great. This is why we need better metrics!
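
The trap is easy to reproduce. In this sketch the inbox of 99 legitimate emails and 1 spam email is hypothetical, and the "model" is just a list of zeros.

```python
# Hypothetical inbox: 99 legitimate emails (0) and 1 spam email (1).
y_true = [0] * 99 + [1]
y_pred = [0] * 100   # a "model" that always predicts Not Spam

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = spam_caught / sum(y_true)

print(f"Accuracy: {accuracy:.0%}")   # Accuracy: 99%
print(f"Recall:   {recall:.0%}")     # Recall:   0%
```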

The Confusion Matrix 🔲

A Confusion Matrix breaks down our model's predictions into four categories, giving us a much clearer picture of what is really happening:

(rows = actual, columns = predicted) | Predicted: Ham (0) | Predicted: Spam (1)
Actual: Ham (0) | ✅ True Negative (TN): correctly said "not spam" | ❌ False Positive (FP): wrongly said "spam" (annoying!)
Actual: Spam (1) | ❌ False Negative (FN): missed spam (dangerous!) | ✅ True Positive (TP): correctly caught spam
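
To see where the four numbers come from, the sketch below counts them by hand for eight made-up emails. The labels are hypothetical and exist only for this example.

```python
# Hypothetical labels for eight emails (1 = spam, 0 = ham).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham, predicted ham
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham, predicted spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam, predicted ham
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam, predicted spam

# Same layout as scikit-learn's confusion_matrix: rows = actual, columns = predicted.
print([[tn, fp], [fn, tp]])   # [[3, 1], [1, 3]]
```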

Precision vs Recall 🎯

🎯 Precision
TP / (TP + FP)

"Of all emails I called spam, how many actually were?"
High precision = few false alarms. Good for avoiding blocking legitimate emails.

🕵️ Recall
TP / (TP + FN)

"Of all actual spam emails, how many did I catch?"
High recall = catches most spam. Good for security, but may over-flag.

ℹ️ The Precision-Recall Trade-Off: You usually cannot maximise both at the same time. A very aggressive filter has high recall (catches all spam) but low precision (also blocks good emails). A very conservative filter has high precision but low recall (lets some spam through). Your application determines which matters more!
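
The trade-off shows up as soon as you move the probability cut-off. In this sketch the spam probabilities are invented, and precision_recall is a helper written for this example only.

```python
# Hypothetical spam probabilities from some classifier, with the true labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95]

def precision_recall(threshold: float):
    """Label an email spam when its probability is at least `threshold`."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# threshold=0.3: precision=0.67 recall=1.00
# threshold=0.5: precision=0.75 recall=0.75
# threshold=0.7: precision=1.00 recall=0.50
```

Lowering the threshold makes the filter more aggressive (higher recall, lower precision); raising it makes the filter more conservative.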

7 Machine Learning — Building the Final Classifier

Now we put it all together! We will use scikit-learn's LogisticRegression to train our spam classifier, split data properly with train_test_split, make predictions, and evaluate with the full metrics suite.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix,
                               precision_score, recall_score,
                               classification_report)

# ── 1. Prepare the dataset ──────────────────────────
email_data = [
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
    {'contains_free': 1, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
    {'contains_free': 0, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 1},
    {'contains_free': 1, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
    {'contains_free': 0, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 0},
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
]
df = pd.DataFrame(email_data)

# ── 2. Separate features (X) from label (y) ────────
X = df[['contains_free', 'contains_winner', 'contains_prize']]
y = df['is_spam']

# ── 3. Split data: 80% training, 20% testing ───────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training samples: {len(X_train)} | Test samples: {len(X_test)}")

# ── 4. Create and train the model ──────────────────
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)  # The magic line — model learns here!
print("✅ Model training complete!")

# ── 5. Make predictions on test data ───────────────
y_pred = model.predict(X_test)

# ── 6. Evaluate performance ────────────────────────
accuracy  = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall    = recall_score(y_test, y_pred, zero_division=0)
conf_mat  = confusion_matrix(y_test, y_pred)

print(f"\n📊 Model Performance:")
print(f"   Accuracy:  {accuracy * 100:.1f}%")
print(f"   Precision: {precision * 100:.1f}%")
print(f"   Recall:    {recall * 100:.1f}%")
print(f"\nConfusion Matrix:")
print(conf_mat)
print("\nFull Report:")
print(classification_report(y_test, y_pred,
      target_names=['Not Spam', 'Spam']))

# ── 7. Predict a new email ──────────────────────────
print("\n=== Classify New Emails ===")
test_emails = [
    {"email": "FREE WINNER! Claim your PRIZE now!",  "features": [[1, 1, 1]]},
    {"email": "Team meeting tomorrow at 9am",          "features": [[0, 0, 0]]},
    {"email": "FREE shipping on your order #4521",     "features": [[1, 0, 0]]},
    {"email": "You are a WINNER! Claim your gift!",   "features": [[0, 1, 0]]},
]

for item in test_emails:
    # Use the training column names so scikit-learn does not warn about feature names
    features = pd.DataFrame(item["features"], columns=X.columns)
    prediction = model.predict(features)[0]
    probability = model.predict_proba(features)[0][1]
    verdict = "🔴 SPAM" if prediction == 1 else "🟢 NOT SPAM"
    print(f"  {verdict} ({probability*100:.0f}% spam probability)")
    print(f"  └─ \"{item['email']}\"\n")

📤 Understanding the Output

Training samples: 9 | Test samples: 3
✅ Model training complete!

📊 Model Performance:
   Accuracy:  100.0%
   Precision: 100.0%
   Recall:    100.0%

Confusion Matrix:
[[1 0]
 [0 2]]

Full Report:
              precision    recall  f1-score   support

    Not Spam       1.00      1.00      1.00         1
        Spam       1.00      1.00      1.00         2

    accuracy                           1.00         3

=== Classify New Emails ===
  🔴 SPAM (97% spam probability)
  └─ "FREE WINNER! Claim your PRIZE now!"

  🟢 NOT SPAM (4% spam probability)
  └─ "Team meeting tomorrow at 9am"

  🟢 NOT SPAM (31% spam probability)
  └─ "FREE shipping on your order #4521"

  🔴 SPAM (71% spam probability)
  └─ "You are a WINNER! Claim your gift!"

📖 Reading the Confusion Matrix

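Our matrix [[1, 0], [0, 2]] means:
  • 1 True Negative: the one ham email in the test set was correctly classified as not spam
  • 0 False Positives: no ham email was wrongly flagged as spam
  • 0 False Negatives: no spam email slipped through
  • 2 True Positives: both spam emails in the test set were caught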

🏆 Notice the "FREE shipping" result! The model correctly predicted "FREE shipping on your order" as Not Spam (only 31% probability), even though it contains the word "free". This is because it learned from training data that "free" alone (without winner/prize) was often legitimate. This is exactly what a rule-based system could NOT do — it took a trained ML model to understand context! 🚀

🎬 Watch: Machine Learning Beginner Tutorial

Prefer learning by watching? The video "Machine Learning for Everybody" gives a clear, visual explanation of Machine Learning fundamentals, covering classification, model training, evaluation, and scikit-learn. Watch it alongside this tutorial for best results!

🛠️ How to Improve This Model

Our current model uses only 12 samples and 3 features. Here is how to make it production-grade:

📈 More Training Data

A real spam filter trains on tens of thousands of emails. More data = more patterns learned = better accuracy. Try the UCI SMS Spam Dataset on Kaggle!

📝 TF-IDF Features

Instead of 3 binary features, use TfidfVectorizer to convert the full email text into thousands of word-frequency features. This is how real spam filters work.
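
A minimal sketch of that idea is shown below. The eight short messages and their labels are invented for this example, and chaining the steps with make_pipeline is one reasonable arrangement, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus, far too small for a real filter.
texts = [
    "free prize winner claim now",
    "congratulations you won a free prize",
    "claim your free gift winner",
    "act now limited offer free prize",
    "team meeting tomorrow at nine",
    "free shipping on your order",
    "please review the attached report",
    "lunch on friday with the team",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer turns text into TF-IDF features; the classifier learns from them.
text_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_model.fit(texts, labels)

vocab_size = len(text_model.named_steps["tfidfvectorizer"].vocabulary_)
print("Number of word features:", vocab_size)
print("Prediction:", text_model.predict(["claim your free prize"]))
```

Because the vectorizer lives inside the pipeline, new emails are passed in as raw strings and transformed with the vocabulary learned during training.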

🌲 Try Other Algorithms

Compare Logistic Regression with Naive Bayes (MultinomialNB) — the classic spam detection algorithm — and Random Forest. scikit-learn makes swapping easy.

🔄 Cross-Validation

Use cross_val_score to split data into multiple train/test folds and get a more reliable accuracy estimate. Avoids lucky/unlucky random splits.
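
Here is a sketch of cross-validation on the tutorial's 12 emails. Using 4 stratified folds is an assumption made for this example; with so little data the scores will still be noisy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The 12 emails from this tutorial: columns are free, winner, prize.
X = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0],
              [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]])
y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1])

# Stratified folds keep some spam and some ham in every split.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.2f}")
```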

🧠 Naive Bayes (Advanced)

Naive Bayes is a simple probabilistic classifier that works well with word-count features, which is why it is a classic choice for text classification. Try from sklearn.naive_bayes import MultinomialNB and compare results with Logistic Regression.
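
As a starting point, the sketch below fits MultinomialNB on the same 12 emails. Note the assumption: MultinomialNB is designed for word counts, so feeding it three 0/1 flags only demonstrates the API and is not a fair test of the algorithm.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# The 12 emails from this tutorial: columns are free, winner, prize.
X = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0],
              [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]])
y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1])

nb_model = MultinomialNB()   # default smoothing, alpha=1.0
nb_model.fit(X, y)

new_emails = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 0]])
nb_probs = nb_model.predict_proba(new_emails)[:, 1]   # column 1 = P(spam)

for features, p in zip(new_emails, nb_probs):
    print(features, f"spam probability: {p:.2f}")
```

Compare these probabilities with the ones Logistic Regression gives for the same feature rows to see how differently the two models weigh the evidence.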

🌐 Deploy as Django API

Save your trained model with joblib.dump(model, 'spam_model.pkl'), then load it in a Django view to create a spam-checking web API. That's full-stack ML! 🚀
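
A sketch of the save-and-reload step is shown below. The four training rows are a cut-down placeholder dataset, and the file is written to a temporary directory so the example leaves nothing behind in your project folder; the Django view itself is not shown.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Placeholder training data: columns are free, winner, prize.
X = [[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
y = [1, 0, 0, 1]
model = LogisticRegression().fit(X, y)

# Save the trained model to disk, then load it back as a web app would.
path = os.path.join(tempfile.mkdtemp(), "spam_model.pkl")
joblib.dump(model, path)
loaded = joblib.load(path)

same = bool((loaded.predict(X) == model.predict(X)).all())
print("Loaded model gives identical predictions:", same)
```

Only load model files you created yourself or fully trust, because loading a pickle-based file can execute arbitrary code.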

🌍 Real-World ML Classification Applications

The exact same technique you just learned — training a binary classifier on labelled data — powers some of the most valuable technology in the world today:

Application | Input Features | Classes (Output) | Used By
📧 Spam Filter | Email keywords, sender, links | Spam / Not Spam | Gmail, Outlook
💳 Fraud Detection | Transaction amount, location, time | Fraud / Legitimate | Visa, Mastercard, PayPal
🏥 Disease Diagnosis | Patient symptoms, test results | Disease / Healthy | Hospitals, AI diagnostics
😊 Sentiment Analysis | Review text words | Positive / Negative | Amazon, Twitter analytics
🎬 Content Moderation | Post text, image features | Allowed / Remove | Facebook, YouTube, TikTok
💰 Loan Approval | Income, credit score, history | Approve / Reject | Banks, FinTech apps

🚀 Your Full-Stack AI Developer Path

This tutorial puts you firmly on the AI & full-stack development path. Here is where you are and where you are going:

1
Python Basics + CMD CRUD ✅

Functions, loops, dictionaries, CRUD operations. The foundation of everything.

2
Machine Learning Classification ← YOU ARE HERE 🤖

pandas, scikit-learn, Logistic Regression, feature engineering, confusion matrix. This tutorial.

3
ML with Real Datasets (NLP + Deep Learning)

TF-IDF text processing, Naive Bayes, neural networks with TensorFlow/Keras. Real spam filter with 10,000+ emails.

4
Django Basics — Your First AI-Powered Web App

Rebuild this spam classifier as a Django web application. Users submit emails through a form and get instant spam/ham classification. ML meets full-stack!

5
Django REST API + ML Model Serving

Save your trained ML model with joblib, expose it through a REST API endpoint. Any frontend or mobile app can now use your AI. This is production ML!

6
Deploy on Cloud (AWS / DigitalOcean / Railway)

Deploy your Django + ML API to a live server. Add automated retraining pipelines. This is what real AI engineers do every day.

7
🏆 Full-Stack AI Developer — Real Portfolio Projects

You have built, trained, deployed, and shipped real AI-powered applications. You are a full-stack AI developer. Time to apply for jobs or go freelance! 🎓

🎓 EgoTECH World AI + Full-Stack Path: We are building Sinhala-language courses covering every step above. 👉 Visit our Courses page and also check our Python CRUD tutorial if you haven't completed Step 1 yet!

❓ Frequently Asked Questions

Why is it called Logistic "Regression" if it is used for classification?

Great question! The name is historically confusing. Logistic Regression is actually a classification algorithm. The "regression" part refers to the fact that it internally fits a regression to predict probabilities (using the Sigmoid function), but those probabilities are then thresholded (at 0.5) to produce class labels. It is one of the most widely used classification algorithms in industry.

What is the difference between regression and classification?

Regression predicts a continuous number (e.g., house price = $250,000). Classification predicts a category (e.g., spam or not spam, cat or dog). The main difference is the type of output: infinite numeric scale vs. a finite set of labels. Different algorithms are used for each, though some (like Logistic Regression) bridge both worlds.

Why did our model score 100% accuracy?

With our tiny 12-sample dataset, 100% accuracy is possible because the patterns are very clear and the test set is small (only 3 emails). In real projects with thousands of emails, you will rarely see 100% accuracy; typically 95–99% is excellent for spam detection. Also, 100% on training data can indicate overfitting (the model memorised the data instead of learning general patterns). Always evaluate on unseen test data!

Can I use this workflow for other problems?

Absolutely! The exact same workflow works for:
  • Customer churn prediction (will this customer leave or stay?)
  • Medical diagnosis (positive or negative for a condition?)
  • Credit card fraud (fraudulent or legitimate transaction?)
  • Sentiment analysis (positive or negative review?)
  • Job applicant screening (qualified or not qualified?)
Just change the features and labels to match your problem. The ML pipeline is identical.

Is Machine Learning a good career path for developers in Sri Lanka?

ML and AI skills are in massive global demand, and Sri Lankan developers are well-positioned to work remotely for international companies. Python + ML expertise can earn USD 40,000–120,000+ per year on international remote contracts. Locally, companies like 99X, WSO2, Calcey Technologies, and growing FinTech startups are investing in data science. The combination of Python + Django + ML is especially powerful for full-stack AI app development. Start building your portfolio now, and check our Jobs board for current AI/ML opportunities!