Building a Spam Filter from Scratch: A Step-by-Step Tutorial for Beginners
In our last tutorial, we predicted a continuous number (a house price). Now, we'll tackle a different kind of problem: **classification**. Our goal is to teach a computer to place items into distinct categories.
We'll build a simple spam filter. The computer will learn to classify an email as either **"Spam"** or **"Not Spam"** (often called "Ham"). We'll climb the same foundational staircase, showing again how each step is crucial for building an intelligent system.
We have a small dataset of emails. To keep it simple, we won't analyze the full email text. Instead, we've simplified each email by checking for the presence of certain keywords often found in spam: "free," "winner," and "prize." Our goal is to create a model that can look at which keywords are in a new email and decide if it's spam.
Once again, Python is our tool for organizing information. For a classification problem, we need to store our "features" (the keywords) and our "labels" (the final verdict: Spam or Not Spam). Pandas DataFrames are perfect for this structured data.
We'll create a table where each row is an email. The columns will show whether a keyword is present (1 for yes, 0 for no) and the final classification. We'll label "Spam" as 1 and "Not Spam" as 0.
# Import our essential library
import pandas as pd
# Our dataset: Each dictionary is an "email"
email_data = [
    {'contains_free': 1, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Not Spam
    {'contains_free': 1, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0},  # Not Spam (e.g., "free shipping")
    {'contains_free': 0, 'contains_winner': 1, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 1, 'is_spam': 1},  # Spam
    {'contains_free': 0, 'contains_winner': 0, 'contains_prize': 0, 'is_spam': 0}   # Not Spam
]
# Create our Pandas DataFrame
df = pd.DataFrame(email_data)
# Display the structured data
print(df)
   contains_free  contains_winner  contains_prize  is_spam
0              1                1               1        1
1              0                0               0        0
2              1                0               0        0
3              0                1               1        1
4              0                0               1        1
5              0                0               0        0
Our DataFrame is a 2D array-like structure. In a real-world spam filter, you might process millions of emails and check against thousands of keywords. An efficient data structure for counting word occurrences is a **hash map** (or a `dictionary` in Python). It allows for nearly instantaneous lookups to see if a word exists in a spam keyword list.
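As a tiny illustration, here's how a dictionary can count word occurrences; the email text below is invented purely for this sketch:

# A minimal word-count sketch using a dictionary (hash map)
# The email text here is made up for illustration.
email_text = "you are a winner claim your free prize now free free"

word_counts = {}
for word in email_text.split():
    # .get() returns the running count, or 0 if the word is new
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts['free'])  # 3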
Let's imagine our algorithm needs to quickly check whether a word from an incoming email is a spam keyword. A Python `set`, which is backed by the same hash-table machinery as a dictionary, is perfect for this.
# A set provides very fast membership checking
spam_keywords = {'free', 'winner', 'prize', 'congratulations', 'viagra'}
incoming_email_word = 'prize'
# The 'in' keyword check is extremely fast with a set
if incoming_email_word in spam_keywords:
    print(f"The word '{incoming_email_word}' is a known spam keyword!")
else:
    print(f"The word '{incoming_email_word}' is not a spam keyword.")
The word 'prize' is a known spam keyword!
Using the right data structure (a set) makes our algorithm faster and more scalable. This is the essence of DSA in practice.
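If you want to see the speed difference yourself, here's a quick, unscientific benchmark; the keyword names and collection size are arbitrary:

import timeit

# The same 100,000 made-up keywords, stored as a list and as a set
keywords_list = [f"keyword{i}" for i in range(100_000)]
keywords_set = set(keywords_list)

# Search for a word near the end (the worst case for a list scan)
target = "keyword99999"

list_time = timeit.timeit(lambda: target in keywords_list, number=1_000)
set_time = timeit.timeit(lambda: target in keywords_set, number=1_000)

print(f"List lookup: {list_time:.4f}s | Set lookup: {set_time:.6f}s")

The list must be scanned element by element, while the set jumps straight to the answer via hashing; the gap only widens as the keyword collection grows.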
As before, we need a clear, step-by-step plan. Our high-level goal, "classify emails," must be broken down into a logical sequence of steps the computer can follow.
Classification is pure discrete mathematics. The output is not a continuous number; it's a choice from a finite set of categories. In our case, the set of possible outputs is {Spam, Not Spam}, which we represent as {1, 0}. We're building a function that maps an input vector (our keyword features) to a single value in this discrete set.
Our model is a function $f$ that takes an email's features as input and outputs a class label:
$$ f(\text{contains\_free}, \text{contains\_winner}, \text{contains\_prize}) \in \{0, 1\} $$

The model learns a **decision boundary**. Think of it as a line or a plane in the space of our features. If an email's data point falls on one side of the boundary, it's classified as "Not Spam" (0). If it falls on the other side, it's "Spam" (1).
How does the model find the *best* decision boundary? Just like before, the answer is calculus! But instead of predicting a number directly, our model will predict a **probability**.
It uses a special function called the **Sigmoid function**, which takes any real number and squishes it into a value between 0 and 1. This output can be interpreted as the probability of the email being spam.
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

The model learns to calculate a score ($z$) based on the input features. The Sigmoid function turns this score into a probability. If the probability is > 0.5, we classify the email as Spam (1); otherwise, it's Not Spam (0).
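Here's the Sigmoid in plain Python, applied to a made-up score; real scores come from weights the model learns, which we haven't computed yet:

import math

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

z = 2.0  # an assumed score; positive scores push toward "Spam"
probability = sigmoid(z)
print(f"sigmoid({z}) = {probability:.3f}")  # ~0.881

# Apply the 0.5 threshold to get a final class
print('Spam' if probability > 0.5 else 'Not Spam')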
To learn the best parameters, the model still uses **Gradient Descent**. It uses a cost function designed for classification (like **Log Loss**), and calculus finds the parameters that minimize that error, effectively moving the decision boundary until it best separates the data.
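To demystify that training loop, here's a minimal NumPy sketch of Gradient Descent on the Log Loss, reusing the DataFrame from earlier. The learning rate and iteration count are assumptions for this sketch; a real library handles all of this far more carefully:

import numpy as np

# Reuse the features and labels from our DataFrame above
X_np = df[['contains_free', 'contains_winner', 'contains_prize']].to_numpy()
y_np = df['is_spam'].to_numpy()

learning_rate = 0.1   # assumed hyperparameter
n_iterations = 1000   # assumed hyperparameter

weights = np.zeros(X_np.shape[1])
bias = 0.0

for _ in range(n_iterations):
    # Forward pass: linear score, then Sigmoid for probabilities
    z = X_np @ weights + bias
    probs = 1 / (1 + np.exp(-z))

    # Gradient of the Log Loss with respect to weights and bias
    error = probs - y_np
    weights -= learning_rate * (X_np.T @ error) / len(y_np)
    bias -= learning_rate * error.mean()

print("Learned weights:", weights)
print("Learned bias:", bias)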
For classification, how we measure success is critical. Simple **accuracy** (percent of correct predictions) can be misleading. For example, if 99% of emails are not spam, a model that always predicts "Not Spam" will be 99% accurate but completely useless! We need better statistical metrics.
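A tiny sketch with made-up, imbalanced labels shows how hollow accuracy can be:

import numpy as np

# Invented labels: 99 ham emails and 1 spam email
labels = np.array([0] * 99 + [1])

# A useless "model" that always predicts Not Spam
predictions = np.zeros(100, dtype=int)

accuracy = (predictions == labels).mean()
print(f"Accuracy: {accuracy:.0%}")  # 99%, yet it catches zero spam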
A **Confusion Matrix** is a table that shows our model's performance:

|                      | Predicted: Not Spam  | Predicted: Spam      |
|----------------------|----------------------|----------------------|
| **Actual: Not Spam** | True Negative (TN)   | False Positive (FP)  |
| **Actual: Spam**     | False Negative (FN)  | True Positive (TP)   |

From this, we derive powerful metrics:

- **Precision** = TP / (TP + FP): of the emails we flagged as spam, how many really were spam?
- **Recall** = TP / (TP + FN): of all the real spam, how much did we catch?
There's often a trade-off. A very aggressive filter might have high recall (catches all spam) but low precision (marks many good emails as spam). A very cautious filter might have high precision but low recall.
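To make the trade-off concrete, here's a quick calculation from invented confusion-matrix counts for an aggressive filter:

# Made-up counts, purely for illustration
true_positives = 95   # spam correctly caught
false_positives = 20  # good emails wrongly flagged
false_negatives = 5   # spam that slipped through

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Precision: {precision:.2f}")  # 0.83 -- many good emails flagged
print(f"Recall:    {recall:.2f}")     # 0.95 -- catches nearly all spam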
Now, we use a library like **scikit-learn** to assemble all these pieces. We'll use `LogisticRegression`, which, despite its name, is a classification model, not a regression model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# 1. Separate features (X) from the label (y)
X = df[['contains_free', 'contains_winner', 'contains_prize']]
y = df['is_spam']
# 2. Create the Logistic Regression model
model = LogisticRegression()
# 3. Train the model on our entire dataset
# The .fit() method finds the best decision boundary using our data.
model.fit(X, y)
# 4. Make predictions on the same data to see what it learned
predictions = model.predict(X)
# 5. Evaluate the performance
accuracy = accuracy_score(y, predictions)
conf_matrix = confusion_matrix(y, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\n[[True Negatives, False Positives],")
print(" [False Negatives, True Positives]]")
# Let's predict a new email
new_email = [[1, 1, 0]]  # Contains 'free' and 'winner', but not 'prize'
new_prediction = model.predict(new_email)
verdict = 'Spam' if new_prediction[0] == 1 else 'Not Spam'
print(f"\nA new email with features {new_email[0]} is classified as: {verdict}")