A complete beginner guide to ML classification — teach a computer to detect spam using Python, pandas, and scikit-learn. Understand the foundational concepts that power real-world AI systems.
Every concept here, from training to prediction to evaluation, maps directly to production AI systems.
Welcome to one of the most exciting fields in modern technology! 🎉 In this tutorial, you will build a spam email classifier — a real Machine Learning model that can look at an email and decide whether it is spam or legitimate. By the time you finish, you will understand the complete ML workflow that powers everything from Gmail's spam filter to fraud detection at banks.
This is the second tutorial in our ML series. In the first, we predicted a continuous number (house price — regression). Here, we tackle a classification problem: the output is a category, not a number. Our two categories are simple:
Spam: unwanted emails that usually contain suspicious keywords, promotional tricks, or phishing attempts. We label these as 1.
Ham (not spam): legitimate emails from real people, colleagues, or trusted services. We label these as 0.
The goal of our ML model is to learn the boundary between these two classes from example data, and then use that knowledge to classify new, unseen emails automatically.
You might wonder: why use Machine Learning at all? Why not just write a simple rule like "if the email contains the word 'free', mark it as spam"?
Here is why rule-based systems fail in the real world:
| Approach | How It Works | Problem |
|---|---|---|
| Rule-Based | You manually write rules: "if contains 'free' → spam" | ❌ "Free shipping on your order" is NOT spam. Rules break immediately. |
| Keyword Blacklist | Block any email with banned words | ❌ Spammers just replace words with symbols: "fr3e", "w!nner" |
| Machine Learning | Show the model thousands of examples, let it find patterns | ✅ Learns complex combinations of features. Adapts to new patterns. |
Before we start, make sure you have Python and the required libraries installed.
Open your terminal and run this single command:
pip install pandas scikit-learn numpy matplotlib
Here is what each library does:
| Library | Purpose | Used For |
|---|---|---|
| pandas | Data manipulation | Loading and organising our email dataset |
| scikit-learn | Machine Learning | Logistic Regression model, train/test split, metrics |
| numpy | Numerical computing | Array operations used internally by scikit-learn |
| matplotlib | Visualisation | Plotting the confusion matrix |
The first task in any ML project is converting real-world data into a format the computer can understand. This is called Feature Engineering — one of the most important skills in data science.
We cannot feed raw email text directly into most ML algorithms. Instead, we extract features — specific measurable properties of each email. For simplicity in this beginner tutorial, we will check for the presence of three keywords commonly found in spam emails: "free", "winner", and "prize".
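As a rough sketch of how the keyword check could be automated, the function below turns a raw email string into the three 0/1 features. The name `extract_features` and the simple letters-only tokenisation are assumptions made for illustration; they are not part of the tutorial's own code.

```python
import re

# Keywords from the tutorial; the function name and tokenisation rule are
# illustrative assumptions, not part of the original code.
KEYWORDS = ("free", "winner", "prize")


def extract_features(email_text):
    """Return a dict of 0/1 flags, one per keyword, for a raw email string."""
    # Lowercase and keep alphabetic words only, so "FREE!" matches "free".
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return {f"contains_{kw}": int(kw in words) for kw in KEYWORDS}


row = extract_features("FREE shipping on your order #4521")
print(row)
```

Each returned dictionary has the same feature keys as the rows in the dataset, so a label could be attached by adding an `is_spam` entry.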
We represent each email as a dictionary with four fields: three feature columns (1 = keyword present, 0 = not present) and one label column (is_spam: 1 = spam, 0 = not spam).
# ── Import our essential libraries ──────────────────
import pandas as pd
import numpy as np
# ── Our dataset: 12 emails (expanded for better training) ──
email_data = [
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1}, # Spam
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0}, # Ham
{'contains_free': 1, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0}, # Ham ("free shipping")
{'contains_free': 0, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1}, # Spam
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1}, # Spam
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0}, # Ham
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 1}, # Spam
{'contains_free': 1, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1}, # Spam
{'contains_free': 0, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 0}, # Ham ("winner of the quiz")
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0}, # Ham
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1}, # Spam
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1}, # Spam
]
# ── Create a Pandas DataFrame ──────────────────────
df = pd.DataFrame(email_data)
print("Our email dataset:")
print(df)
print(f"\nSpam emails: {df['is_spam'].sum()}")
print(f"Ham emails: {len(df) - df['is_spam'].sum()}")
Our email dataset:
contains_free contains_winner contains_prize is_spam
0 1 1 1 1
1 0 0 0 0
2 1 0 0 0
3 0 1 1 1
4 0 0 1 1
5 0 0 0 0
6 1 1 0 1
7 1 0 1 1
8 0 1 0 0
9 0 0 0 0
10 1 1 1 1
11 0 0 1 1
Spam emails: 7
Ham emails: 5
Notice that row #2 (contains_free=1, everything else 0) is labelled Not Spam. This represents a legitimate email like "free shipping on your order". And row #8 (contains_winner=1, everything else 0) is also Not Spam, like "you were the winner of the school quiz". The model learns that combinations matter, not just individual words. This is the power of ML over simple rules!
Our DataFrame is a two-dimensional structure, like a table. Internally, pandas stores numeric columns such as ours in NumPy arrays, which enables fast mathematical operations on entire columns at once. This is essential for ML, where we might process millions of rows.
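To illustrate what "operations on entire columns at once" can look like, here is a small sketch on the same 12 rows. The variable names (`feature_cols`, `keyword_counts`, and so on) are chosen for this example only.

```python
import pandas as pd

cols = ["contains_free", "contains_winner", "contains_prize", "is_spam"]
# The tutorial's 12 emails as (free, winner, prize, is_spam) tuples.
rows = [
    (1, 1, 1, 1), (0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 1, 1),
    (0, 0, 1, 1), (0, 0, 0, 0), (1, 1, 0, 1), (1, 0, 1, 1),
    (0, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1),
]
df = pd.DataFrame(rows, columns=cols)
feature_cols = cols[:3]

# Sum each keyword column in one vectorised call (no explicit Python loop).
keyword_counts = df[feature_cols].sum()
print(keyword_counts)

# Row-wise sum: how many of the three keywords each email contains.
keywords_per_email = df[feature_cols].sum(axis=1)
print(keywords_per_email.tolist())

# The underlying values as a plain NumPy array.
features_array = df[feature_cols].to_numpy()
print(features_array.shape)
```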
For a real-world spam filter checking against thousands of keywords, a Python set (based on a hash map) provides near-instant lookups — far faster than checking a list one by one.
# ── Using a Set for Fast Keyword Lookup ─────────────
# A set provides O(1) membership checking vs O(n) for a list
spam_keywords = {'free', 'winner', 'prize', 'congratulations',
'click here', 'act now', 'limited offer'}
# Test some words from an incoming email
test_words = ['hello', 'prize', 'meeting', 'free', 'report']
print("Keyword scan results:")
for word in test_words:
    result = "🔴 SPAM WORD" if word in spam_keywords else "🟢 Clean"
    print(f" '{word}': {result}")
# Count how many spam keywords are in the email
spam_count = sum(1 for w in test_words if w in spam_keywords)
print(f"\nSpam keywords found: {spam_count}/{len(test_words)}")
Keyword scan results:
'hello': 🟢 Clean
'prize': 🔴 SPAM WORD
'meeting': 🟢 Clean
'free': 🔴 SPAM WORD
'report': 🟢 Clean
Spam keywords found: 2/5
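Checking `word in spam_keywords` with a set is O(1) on average: roughly constant time, regardless of how many keywords are in the set.
With a list, it would be O(n): Python compares against the items one by one until it finds a match.
With 50,000 keywords, a list lookup may need up to 50,000 comparisons where a set needs about one hash lookup. This is why data structures and algorithms (DSA) matter in real ML systems!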
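The difference can be measured with the standard-library `timeit` module. This is a minimal sketch with 50,000 placeholder keywords (`word0`, `word1`, ...), which are invented for the benchmark. Exact timings depend on your machine, so no expected numbers are shown.

```python
import timeit

# 50,000 made-up keywords; the names are placeholders for illustration.
keywords_list = [f"word{i}" for i in range(50_000)]
keywords_set = set(keywords_list)

# Worst case for the list: the word we look for sits at the very end.
target = "word49999"

list_time = timeit.timeit(lambda: target in keywords_list, number=200)
set_time = timeit.timeit(lambda: target in keywords_set, number=200)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```

Both lookups return the same answer; only the amount of work behind them differs.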
Every successful ML project follows a clear, repeatable workflow. Before writing any model code, you must understand the complete plan. Here is the blueprint we will follow:
| # | Step | What We Do | scikit-learn Tool |
|---|---|---|---|
| 1 | Feature Extraction | Convert emails to numerical features (0s and 1s) | pandas DataFrame |
| 2 | X/y Separation | Split features (inputs) from labels (outputs) | df[[cols]], df[col] |
| 3 | Train/Test Split | Reserve some data for testing — never train on test data! | train_test_split() |
| 4 | Model Selection | Choose Logistic Regression for binary classification | LogisticRegression() |
| 5 | Training | Show the model the training data and let it learn | model.fit(X_train, y_train) |
| 6 | Prediction | Use the trained model to classify new emails | model.predict(X_test) |
| 7 | Evaluation | Measure accuracy, precision, recall, confusion matrix | metrics module |
Unlike regression (which predicts a continuous number on an infinite scale), classification deals with discrete categories — a finite set of possible outputs. This is where discrete mathematics becomes essential.
Our classification model is formally a function:
$$f : \{0, 1\}^3 \to \{0, 1\}$$
The model maps an input vector of three binary features to one value in the finite set {0, 1}.
Input space: 3 binary features = 2³ = 8 possible input combinations. The model must learn a classification for each.
Decision boundary: a hyperplane in feature space that separates spam (class 1) from ham (class 0). The model's job is to find the best boundary.
Output space: only 2 possible outputs, 0 (Not Spam) or 1 (Spam). This is called binary classification.
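To make the input space concrete, the eight possible feature vectors can be enumerated with `itertools.product`. This is only an illustration of the counting argument, not a step the classifier itself needs.

```python
from itertools import product

feature_names = ["contains_free", "contains_winner", "contains_prize"]

# Every possible 0/1 assignment to the three features.
all_inputs = list(product([0, 1], repeat=len(feature_names)))

for combo in all_inputs:
    print(dict(zip(feature_names, combo)))

print(f"Total combinations: {len(all_inputs)}")
```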
The model learns a weight for each feature. A high positive weight on contains_winner means that feature strongly pushes the prediction toward spam. A weight near zero means the feature barely matters, and a negative weight pushes the prediction toward ham. Together, these weights define the decision boundary.
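After fitting, scikit-learn exposes the learned weights as `coef_` and the bias as `intercept_`. The sketch below fits on all 12 tutorial rows (no train/test split, purely to inspect the weights) and prints one weight per feature. The exact values depend on the scikit-learn version and its default regularisation, so none are quoted here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

cols = ["contains_free", "contains_winner", "contains_prize", "is_spam"]
# The tutorial's 12 emails as (free, winner, prize, is_spam) tuples.
rows = [
    (1, 1, 1, 1), (0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 1, 1),
    (0, 0, 1, 1), (0, 0, 0, 0), (1, 1, 0, 1), (1, 0, 1, 1),
    (0, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1),
]
df = pd.DataFrame(rows, columns=cols)

X = df[cols[:3]]
y = df["is_spam"]

model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; intercept_ is the bias.
weights = dict(zip(X.columns, model.coef_[0]))
for name, w in weights.items():
    print(f"{name:16s} weight = {w:+.3f}")
print(f"{'intercept':16s}        = {model.intercept_[0]:+.3f}")
```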
How does Logistic Regression actually learn? It uses calculus to optimise its weights. Here is the intuition without going too deep into the maths:
The model first computes a raw score z as a weighted sum of the input features plus a bias term: $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$.
It then passes this score through the Sigmoid function, which "squishes" any number into a value between 0 and 1, which we read as a probability:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
For example, if the model computes a score of z = 3.0 for a suspicious email,
the Sigmoid gives σ(3.0) ≈ 0.95 — a 95% probability of being spam!
If z = -2.0, then σ(-2.0) ≈ 0.12 — only 12% chance of spam.
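The two values above can be checked with a few lines of Python. This sketch uses only the standard `math` module.

```python
import math


def sigmoid(z):
    """Map any real-valued score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (3.0, 0.0, -2.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

A score of exactly 0 gives 0.5, which is the usual cut-off between the two classes.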
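The model starts with initial weights that are essentially guesses. It then repeats a simple loop: make predictions on the training data, measure how wrong they are with a loss function, use the gradient of that loss to see which way to nudge each weight, and update the weights by a small step. This loop is called gradient descent.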
The good news? scikit-learn handles all of this automatically! When you call model.fit(X, y), it runs an iterative optimiser internally (L-BFGS by default, a more sophisticated relative of plain gradient descent) and finds good weights. You just call one line of code. ✨
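For intuition, here is a minimal sketch of plain gradient descent for logistic regression on the tutorial's 12 emails, written with NumPy. It is a teaching version, not what scikit-learn runs internally. The learning rate and the number of steps are assumptions picked for illustration, and the weights start at zero for simplicity.

```python
import numpy as np

# The tutorial's 12 emails: columns are (free, winner, prize).
X = np.array([
    [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1],
    [0, 0, 1], [0, 0, 0], [1, 1, 0], [1, 0, 1],
    [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1],
], dtype=float)
y = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1], dtype=float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, b):
    """Average cross-entropy between predicted probabilities and labels."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


w = np.zeros(X.shape[1])   # start from all-zero weights
b = 0.0
learning_rate = 0.5        # assumed step size, chosen for illustration

initial_loss = log_loss(w, b)
for step in range(1000):
    p = sigmoid(X @ w + b)             # 1. predict
    error = p - y                      # 2. how wrong is each prediction?
    grad_w = X.T @ error / len(y)      # 3. gradient of the loss w.r.t. w
    grad_b = error.mean()              #    ... and w.r.t. the bias
    w -= learning_rate * grad_w        # 4. nudge the weights downhill
    b -= learning_rate * grad_b
final_loss = log_loss(w, b)

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"weights: {np.round(w, 2)}  bias: {b:.2f}")
```

The loss starts at ln 2 ≈ 0.693 (every prediction is 0.5) and falls as the weights improve. Unlike this sketch, scikit-learn's LogisticRegression also applies L2 regularisation by default, so its weights will differ.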
After training a model, how do we know if it is actually good? This is where statistics comes in. Simple accuracy (percentage of correct predictions) can be misleading for imbalanced datasets, where one class heavily outnumbers the other.
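A quick way to see the problem is a hypothetical test set of 95 ham emails and 5 spam emails, scored against a "model" that simply labels everything as ham. The numbers are invented for illustration.

```python
# Hypothetical, heavily imbalanced test set: 95 ham (0) and 5 spam (1).
y_true = [0] * 95 + [1] * 5

# A useless "model" that labels every email as ham.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_total = sum(y_true)

print(f"accuracy: {accuracy:.0%}")
print(f"spam caught: {spam_caught} of {spam_total}")
```

The accuracy is 95%, yet not a single spam email is caught.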
A Confusion Matrix breaks down our model's predictions into four categories, giving us a much clearer picture of what is really happening:
| | Predicted: Ham (0) | Predicted: Spam (1) |
|---|---|---|
| Actual: Ham (0) | ✅ True Negative (TN): correctly said "not spam" | ❌ False Positive (FP): wrongly said "spam" (annoying!) |
| Actual: Spam (1) | ❌ False Negative (FN): missed spam (dangerous!) | ✅ True Positive (TP): correctly caught spam |
Precision = TP / (TP + FP): "Of all emails I called spam, how many actually were?"
High precision = few false alarms. Good for avoiding blocking legitimate emails.
Recall = TP / (TP + FN): "Of all actual spam emails, how many did I catch?"
High recall = catches most spam. Good for security, but may over-flag.
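Both metrics follow directly from the four confusion-matrix counts. The sketch below computes them by hand for a hypothetical filter tested on 100 emails. The counts are made up and the helper name `precision_recall` is an assumption for this example.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for a filter tested on 100 emails (made-up numbers).
tp, fp, fn, tn = 30, 5, 10, 55

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy  = {accuracy:.2f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```

Here the filter is right 85% of the time overall, about 86% of its spam verdicts are correct, and it catches 75% of the actual spam.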
Now we put it all together! We will use scikit-learn's LogisticRegression to train
our spam classifier, split data properly with train_test_split, make predictions,
and evaluate with the full metrics suite.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix,
precision_score, recall_score,
classification_report)
# ── 1. Prepare the dataset ──────────────────────────
email_data = [
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
{'contains_free': 1, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
{'contains_free': 0, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 1},
{'contains_free': 1, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
{'contains_free': 0, 'contains_winner': 1, 'contains_prize': 0, 'is_spam': 0},
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},
{'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},
{'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},
]
df = pd.DataFrame(email_data)
# ── 2. Separate features (X) from label (y) ────────
X = df[['contains_free', 'contains_winner', 'contains_prize']]
y = df['is_spam']
# ── 3. Split data: ~80% training, ~20% testing (3 of our 12 rows) ──
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training samples: {len(X_train)} | Test samples: {len(X_test)}")
# ── 4. Create and train the model ──────────────────
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train) # The magic line — model learns here!
print("✅ Model training complete!")
# ── 5. Make predictions on test data ───────────────
y_pred = model.predict(X_test)
# ── 6. Evaluate performance ────────────────────────
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
conf_mat = confusion_matrix(y_test, y_pred)
print(f"\n📊 Model Performance:")
print(f" Accuracy: {accuracy * 100:.1f}%")
print(f" Precision: {precision * 100:.1f}%")
print(f" Recall: {recall * 100:.1f}%")
print(f"\nConfusion Matrix:")
print(conf_mat)
print("\nFull Report:")
print(classification_report(y_test, y_pred,
target_names=['Not Spam', 'Spam']))
# ── 7. Predict a new email ──────────────────────────
print("\n=== Classify New Emails ===")
test_emails = [
{"email": "FREE WINNER! Claim your PRIZE now!", "features": [[1, 1, 1]]},
{"email": "Team meeting tomorrow at 9am", "features": [[0, 0, 0]]},
{"email": "FREE shipping on your order #4521", "features": [[1, 0, 0]]},
{"email": "You are a WINNER! Claim your gift!", "features": [[0, 1, 0]]},
]
for item in test_emails:
    prediction = model.predict(item["features"])[0]
    probability = model.predict_proba(item["features"])[0][1]
    verdict = "🔴 SPAM" if prediction == 1 else "🟢 NOT SPAM"
    print(f" {verdict} ({probability*100:.0f}% spam probability)")
    print(f" └─ \"{item['email']}\"\n")
Training samples: 9 | Test samples: 3
✅ Model training complete!
📊 Model Performance:
Accuracy: 100.0%
Precision: 100.0%
Recall: 100.0%
Confusion Matrix:
[[1 0]
[0 2]]
Full Report:
precision recall f1-score support
Not Spam 1.00 1.00 1.00 1
Spam 1.00 1.00 1.00 2
accuracy 1.00 3
=== Classify New Emails ===
🔴 SPAM (97% spam probability)
└─ "FREE WINNER! Claim your PRIZE now!"
🟢 NOT SPAM (4% spam probability)
└─ "Team meeting tomorrow at 9am"
🟢 NOT SPAM (31% spam probability)
└─ "FREE shipping on your order #4521"
🔴 SPAM (71% spam probability)
└─ "You are a WINNER! Claim your gift!"
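Our matrix [[1, 0], [0, 2]] means: 1 true negative (ham correctly let through), 0 false positives, 0 false negatives, and 2 true positives (spam correctly caught). With only 3 test emails, though, a perfect score says little about how the model would behave on real mail.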
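For binary labels, the four cells can be unpacked by name with `ravel()`. This sketch uses a hand-written `y_test` and `y_pred` that reproduce the matrix from the sample output, so it runs on its own.

```python
from sklearn.metrics import confusion_matrix

# Labels chosen to reproduce the matrix printed in the sample output.
y_test = [0, 1, 1]
y_pred = [0, 1, 1]

# With labels ordered [0, 1], ravel() yields the counts as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```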
🎓 Machine Learning for Everybody — covers all core concepts including classification, model training, and evaluation. Watch alongside this tutorial for best results!
Our current model uses only 12 samples and 3 features. Here is how to make it production-grade:
A real spam filter trains on tens of thousands of emails. More data generally means more patterns learned and better accuracy. Try the SMS Spam Collection dataset (from the UCI Machine Learning Repository, also available on Kaggle)!
Instead of 3 binary features, use TfidfVectorizer to convert the full email text into thousands of word-frequency features. This is how real spam filters work.
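As a small sketch of what TfidfVectorizer produces, the example below vectorises four short made-up emails with the default settings. A real filter would fit the vectoriser on a much larger labelled corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up emails, used only to show the shape of the output.
emails = [
    "FREE WINNER! Claim your PRIZE now!",
    "Team meeting tomorrow at 9am",
    "FREE shipping on your order",
    "You are a WINNER! Claim your gift!",
]

vectorizer = TfidfVectorizer()             # lowercases and tokenises by default
X_text = vectorizer.fit_transform(emails)  # sparse matrix: emails x vocabulary

print("matrix shape:", X_text.shape)
print("vocabulary:", sorted(vectorizer.vocabulary_))
```

Each row is one email and each column is one word from the learned vocabulary, so `X_text` can be passed to `model.fit` in place of the three hand-made features.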
Compare Logistic Regression with Naive Bayes (MultinomialNB) — the classic spam detection algorithm — and Random Forest. scikit-learn makes swapping easy.
Use cross_val_score to split data into multiple train/test folds and get a more reliable accuracy estimate. Avoids lucky/unlucky random splits.
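A minimal sketch of both ideas together: cross-validating Logistic Regression and MultinomialNB on the tutorial's 12 rows. The choice of `cv=3` is an assumption made because the dataset has only 5 ham rows; with so little data the scores are noisy and should not be read as a real comparison.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

cols = ["contains_free", "contains_winner", "contains_prize", "is_spam"]
# The tutorial's 12 emails as (free, winner, prize, is_spam) tuples.
rows = [
    (1, 1, 1, 1), (0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 1, 1),
    (0, 0, 1, 1), (0, 0, 0, 0), (1, 1, 0, 1), (1, 0, 1, 1),
    (0, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1),
]
df = pd.DataFrame(rows, columns=cols)
X = df[cols[:3]]
y = df["is_spam"]

candidates = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
}

results = {}
for name, estimator in candidates.items():
    # cv=3 is an assumption: with only 5 ham rows, more folds leave too few per fold.
    scores = cross_val_score(estimator, X, y, cv=3, scoring="accuracy")
    results[name] = scores
    print(f"{name}: folds={scores.round(2)} mean={scores.mean():.2f}")
```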
Naive Bayes is a fast, classic baseline for text classification. Try from sklearn.naive_bayes import MultinomialNB and compare results with Logistic Regression.
Save your trained model with joblib.dump(model, 'spam_model.pkl'), then load it in a Django view to create a spam-checking web API. That's full-stack ML! 🚀
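A save-and-reload round trip might look like the sketch below. It writes to a temporary directory so nothing is left on disk, and it checks that the restored model predicts the same labels. The Django view itself is not shown.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# The tutorial's 12 emails: features (free, winner, prize) and labels.
X = [
    [1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1],
    [0, 0, 1], [0, 0, 0], [1, 1, 0], [1, 0, 1],
    [0, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1],
]
y = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1]

model = LogisticRegression().fit(X, y)

# A temporary directory keeps this sketch from leaving files behind;
# the tutorial's 'spam_model.pkl' filename is reused.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

sample = [[1, 1, 1], [0, 0, 0]]
print("original:", model.predict(sample))
print("restored:", restored.predict(sample))
```

Model files saved this way are pickles, so only load ones you created yourself or otherwise trust.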
The exact same technique you just learned — training a binary classifier on labelled data — powers some of the most valuable technology in the world today:
| Application | Input Features | Classes (Output) | Used By |
|---|---|---|---|
| 📧 Spam Filter | Email keywords, sender, links | Spam / Not Spam | Gmail, Outlook |
| 💳 Fraud Detection | Transaction amount, location, time | Fraud / Legitimate | Visa, Mastercard, PayPal |
| 🏥 Disease Diagnosis | Patient symptoms, test results | Disease / Healthy | Hospitals, AI diagnostics |
| 😊 Sentiment Analysis | Review text words | Positive / Negative | Amazon, Twitter analytics |
| 🎬 Content Moderation | Post text, image features | Allowed / Remove | Facebook, YouTube, TikTok |
| 💰 Loan Approval | Income, credit score, history | Approve / Reject | Banks, FinTech apps |
This tutorial puts you firmly on the AI & full-stack development path. Here is where you are and where you are going:
Functions, loops, dictionaries, CRUD operations. The foundation of everything.
pandas, scikit-learn, Logistic Regression, feature engineering, confusion matrix. This tutorial.
TF-IDF text processing, Naive Bayes, neural networks with TensorFlow/Keras. Real spam filter with 10,000+ emails.
Rebuild this spam classifier as a Django web application. Users submit emails through a form and get instant spam/ham classification. ML meets full-stack!
Save your trained ML model with joblib, expose it through a REST API endpoint. Any frontend or mobile app can now use your AI. This is production ML!
Deploy your Django + ML API to a live server. Add automated retraining pipelines. This is what real AI engineers do every day.
You have built, trained, deployed, and shipped real AI-powered applications. You are a full-stack AI developer. Time to apply for jobs or go freelance! 🎓