Module 9: AI & Machine Learning Mini-Projects

Welcome to the workshop! This module is where all the concepts from the previous lessons—data cleaning, model training, evaluation—come together. You'll work through four distinct projects, each covering a different area of machine learning. This is your chance to build a portfolio and prove that you can not only understand AI concepts but also apply them.

For each project, we will follow a simple, repeatable workflow:

Problem: Define the goal. What are we trying to achieve?
Dataset: Understand the data we'll be using.
Plan: Outline the steps from data loading to model evaluation.
Code: Implement the plan using Python and our favorite libraries.

Project 1: House Price Prediction (Regression) 🏡

Problem: Predict the median value of homes in California districts using various features like median income, house age, and number of rooms.

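Dataset: The California Housing dataset, loaded with Scikit-learn's `fetch_california_housing` (downloaded and cached on first use). The target is the median house value per district, expressed in units of 100,000 US dollars, which is why the code multiplies the error by 100000 before printing it.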

Plan: We'll build a linear regression model. The key steps are to load the data, split it, scale the features (a best practice for many models), train our model, and then evaluate its performance using regression-specific metrics like Mean Absolute Error (MAE) and R-squared ($R^2$).

# --- Imports ---
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import pandas as pd

# --- 1. Load Data ---
housing = fetch_california_housing()
X, y = housing.data, housing.target

# --- 2. Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 3. Scale Features ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- 4. Train a Regression Model ---
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# --- 5. Make Predictions & Evaluate ---
predictions = model.predict(X_test_scaled)

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("--- House Price Prediction Results ---")
print(f"Mean Absolute Error (MAE): ${mae*100000:.2f}")
print(f"R-squared (R2 Score): {r2:.2f}")
print("\nInterpretation:")
print(f"On average, our model's predictions are off by about ${mae*100000:.2f}.")
print(f"Our model explains approximately {r2*100:.0f}% of the variance in house prices.")
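
To make the two metrics less of a black box, here is a minimal sketch that computes MAE and $R^2$ by hand. The four values in `y_true` and `y_pred` are made up for illustration and are not taken from the housing data; the variable names are our own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true values and predictions (illustrative numbers only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# MAE: the average absolute difference between prediction and truth
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE by hand: {mae_manual:.3f}")
print(f"R2 by hand:  {r2_manual:.3f}")

# Scikit-learn's functions should agree with the manual versions
print(f"MAE sklearn: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2 sklearn:  {r2_score(y_true, y_pred):.3f}")
```

An $R^2$ of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.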

Project 2: Handwritten Digit Recognition (Classification) ✍️

Problem: Build a model that can correctly identify handwritten digits (0-9) from image data. This is a classic "Hello, World!" of computer vision.

Dataset: The digits dataset bundled with Scikit-learn (`load_digits`). It is not the full MNIST set but a smaller relative: 1,797 grayscale images of 8x8 pixels each, which makes it quick to train on.

Plan: This is a multi-class classification problem. We'll load the dataset, where each 8x8 image is flattened into a vector of 64 pixel values. We'll then train a simple classifier, Logistic Regression, and evaluate its accuracy. We'll also visualize one of the digits to see what the data looks like.

# --- Imports ---
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# --- 1. Load Data ---
digits = load_digits()
X, y = digits.data, digits.target

# --- Let's visualize one of the digits ---
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.title(f'This is the digit: {digits.target[0]}')
# plt.show()  # Uncomment to display the image

# --- 2. Split Data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# --- 3. Train a Classification Model ---
# Logistic Regression is a good baseline model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# --- 4. Make Predictions & Evaluate ---
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)

print("\n--- Handwritten Digit Recognition Results ---")
print(f"Model Accuracy: {accuracy:.4f}")
print("\nFull Classification Report:")
print(classification_report(y_test, predictions))
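
If "flattened into a vector" feels abstract, the following sketch compares the two views of the data that `load_digits` returns. It only inspects shapes and one sample; the names `first_flat` and `restored` are our own.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds the 8x8 grids; digits.data holds the same pixels flattened
print(digits.images.shape)  # (n_samples, 8, 8)
print(digits.data.shape)    # (n_samples, 64)

# Flatten the first image by hand and compare it with the stored feature row
first_flat = digits.images[0].reshape(-1)
same = np.array_equal(first_flat, digits.data[0])
print(f"Flattened image equals feature row: {same}")

# A feature row can be reshaped back into an image, e.g. for plotting
restored = digits.data[0].reshape(8, 8)
```

The classifier only ever sees the flat rows in `digits.data`, so it has no built-in notion of which pixels are neighbors.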

Project 3: Spam Email Detection (Text Classification) 📧

Problem: Classify text messages or emails as either "spam" or "ham" (not spam).

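Dataset: We'll use a tiny sample of five messages defined directly in the code for simplicity. With so few messages, the reported metrics only illustrate the workflow and say little about real-world performance.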

Plan: This project introduces a new challenge: working with text. Machine learning models need numbers, not words. The key step here is **text vectorization**. We'll use a `CountVectorizer` to convert our text into numerical feature vectors. Then, we'll train a Naive Bayes classifier, which is particularly well-suited for text problems, and evaluate its performance.

# --- Imports ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# --- 1. Load Data ---
# Simple dataset for demonstration
data = {'message': ['Free entry in 2 a wkly comp', 'I am not home', 
                     'WINNER!! As a valued network customer...', 'Ok lar... Joking wif u oni...',
                     'URGENT! You have won a 1 week FREE membership'],
        'label': ['spam', 'ham', 'spam', 'ham', 'spam']}
df = pd.DataFrame(data)

X = df['message']
y = df['label']

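# --- 2. Split Data ---
# Split BEFORE vectorizing so the vocabulary is learned from training messages only
# (fitting the vectorizer on all messages would leak test-set words into training)
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 3. Text Vectorization ---
# Convert text into a matrix of token counts: fit on the training set, reuse on the test set
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)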

# --- 4. Train a Naive Bayes Classifier ---
model = MultinomialNB()
model.fit(X_train, y_train)

# --- 5. Make Predictions & Evaluate ---
predictions = model.predict(X_test)

print("\n--- Spam Detection Results ---")
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Project 4: Customer Segmentation (Clustering) 🛍️

Problem: A mall has collected data on its customers. They want to identify distinct groups or segments of customers to create targeted marketing strategies, but they don't have any predefined labels. This is a perfect use case for unsupervised learning.

Dataset: A sample of the "Mall Customers" dataset.

Plan: We'll use K-Means clustering. The workflow is: load and prepare the data, scale the features (critical for K-Means), use the Elbow Method to find the best number of clusters (K), train the model, and finally, visualize the resulting customer segments.

# --- Imports ---
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# --- 1. Load Data ---
data = {'Annual_Income_k': [15, 16, 20, 28, 40, 61, 70, 88, 103, 137],
        'Spending_Score': [39, 81, 6, 95, 42, 55, 77, 9, 85, 18]}
df = pd.DataFrame(data)

# --- 2. Scale Features ---
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# --- 3. Find the optimal K using the Elbow Method ---
inertia = []
for k in range(1, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)
# plt.plot(range(1, 8), inertia, marker='o')
# plt.title('Elbow Method for Optimal K')
# plt.show() # Elbow appears to be around K=3 or 4

# --- 4. Train K-Means Model ---
# Let's choose K=3 based on the elbow plot
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(df_scaled)
df['Cluster'] = kmeans.labels_

# --- 5. Visualize the Clusters ---
plt.figure(figsize=(10, 7))
plt.scatter(df['Annual_Income_k'], df['Spending_Score'], c=df['Cluster'], cmap='viridis', s=100)
plt.title('Customer Segments')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
# plt.show() # Uncomment to display the plot

print("\n--- Customer Segmentation Results ---")
print("Data with assigned clusters:")
print(df)
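
The Elbow Method relies on inertia, so it helps to know what that number measures. The sketch below recomputes it by hand for K=3 on the same ten customers: the squared distance from each point to its assigned centroid, summed over all points. The names `X_scaled`, `assigned_centers`, and `manual_inertia` are our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same ten customers as in the project code: [annual income (k$), spending score]
X = np.array([[15, 39], [16, 81], [20, 6], [28, 95], [40, 42],
              [61, 55], [70, 77], [88, 9], [103, 85], [137, 18]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X_scaled - assigned_centers) ** 2)

print(f"manual inertia:  {manual_inertia:.4f}")
print(f"kmeans.inertia_: {kmeans.inertia_:.4f}")
```

Adding clusters tends to lower inertia because every point ends up closer to some centroid, so we look for the K where the improvement levels off rather than for the smallest value.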

You're an AI Practitioner! 🏆

This is a massive achievement. You've successfully navigated four complete machine learning projects, covering the most common tasks in the field: regression, classification, text analysis, and clustering. You've proven you can apply your skills to solve diverse, real-world problems.

These projects are the foundation of your new data science portfolio.

In our final module, we'll zoom out and look at the bigger picture. We'll discuss the ethics of AI, explore potential career paths, and outline your next steps to continue growing in this incredible field.