
Telecom KNN Classification

KNN classification model for a telecom customer dataset, built with Python and scikit-learn.

Status
Completed

Technology Stack

Python
scikit-learn
Pandas
NumPy

Telecom Customer Segmentation — KNN Classifier

Overview

A telecom provider's customer base is segmented into four service tiers based on usage behavior. This project builds a K-Nearest Neighbors classifier that predicts which service category a new customer belongs to — using only their demographic profile — enabling the company to personalize offers at scale.


Tech Stack

Languages & Libraries

  • Python, NumPy, Pandas
  • scikit-learn (KNeighborsClassifier, StandardScaler, train_test_split, accuracy_score)
  • Matplotlib, Seaborn

Dataset

  • Source: IBM Developer Skills Network — Telecom Customer Dataset
  • Size: 1,000 customer records
  • Features: Region, age, marital status, education, employment, income, tenure, and more
  • Target: 4 service categories
    1. Basic Service — 266 customers
    2. E-Service — 217 customers
    3. Plus Service — 281 customers
    4. Total Service — 236 customers

Problem Statement

Given demographic data about a telecom customer, predict which of the four service categories they belong to. This is a multi-class classification problem where the goal is to learn patterns from labeled historical data and generalize to unseen customers.


Pipeline

  1. Data Loading & Exploration
  2. EDA & correlation
  3. Normalization (StandardScaler)
  4. Train-test split
  5. Model training
  6. Evaluation (accuracy, confusion matrix)
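
The six steps above can be sketched end to end as follows. This is a minimal illustration using synthetic stand-in data (the real notebook loads the IBM telecom CSV; feature count and random seeds here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# 1-2. Stand-in for loading/exploring the telecom data:
# 1,000 rows, 8 demographic-style features, 4 service categories (1..4)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = rng.integers(1, 5, size=1000)

# 3. Normalize so no feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# 4. Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4)

# 5-6. Train a KNN classifier and evaluate it
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```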

What Was Built

  • Exploratory Data Analysis: Examined class distribution across all four service tiers and generated a full correlation heatmap to understand inter-feature relationships.

  • Feature Importance via Correlation: Identified 'ed' (education level) and 'tenure' as the strongest predictors of customer category, while 'gender' and 'retire' showed minimal correlation with the target — informing which features to prioritize.

  • Data Normalization: Applied 'StandardScaler' to all input features before training, ensuring no single feature dominated the KNN distance metric due to scale differences. This is a critical preprocessing step for any distance-based algorithm.

  • Model Training: Built and evaluated KNN classifiers starting at k = 3 and k = 6, then systematically swept k from 1 to 100 — training and evaluating a separate model for each value to identify the optimal configuration.

  • Hyperparameter Tuning: Plotted model accuracy and standard deviation across all 100 values of k. Identified k = 38 as the optimal number of neighbors, yielding the best test set accuracy.

  • Bias-Variance Analysis: Separately evaluated model performance on the training set across all k values, documenting how training accuracy falls from near-perfect at k = 1 as k grows — a direct illustration of the transition from overfitting to underfitting.
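
The k sweep described above can be sketched like this — one model trained and scored per value of k, tracking mean test accuracy and its standard error. Synthetic data via make_classification stands in for the telecom features; the variable names are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset: 1,000 samples, 4 classes, 8 features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=4, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

k_values = range(1, 101)
mean_acc = np.zeros(len(k_values))
std_acc = np.zeros(len(k_values))
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    correct = knn.predict(X_test) == y_test
    mean_acc[i] = correct.mean()
    # standard error of the per-sample correctness indicator
    std_acc[i] = correct.std() / np.sqrt(len(y_test))

best_k = int(np.argmax(mean_acc)) + 1
print(f"best k = {best_k}, test accuracy = {mean_acc[best_k - 1]:.3f}")
```

Plotting mean_acc against k (with a std_acc band) reproduces the tuning curve used to pick the optimal k.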


Results

Configuration        Test Accuracy
k = 3                ~34%
k = 6                ~35%
k = 38 (optimal)     ~41%

Key Insight

The dataset's modest accuracy ceiling (~41%) surfaced an important finding: KNN's distance-based decision boundary is sensitive to feature irrelevance and high dimensionality. The demographic features alone do not form tightly separated clusters in feature space, limiting how well any distance-based model can perform.

Potential improvements:

  • Feature selection or dimensionality reduction (PCA) to remove noise
  • Switching to a tree-based ensemble (Random Forest, XGBoost) that handles mixed-type demographic data more robustly
  • Exploring non-linear decision boundaries via SVM with RBF kernel
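
As one hedged sketch of the first improvement, PCA can be chained before KNN in a scikit-learn Pipeline to strip noisy dimensions before the distance computation. The component count and dataset here are illustrative, not tuned values from the project:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 12 features, only 4 of them informative
X, y = make_classification(n_samples=1000, n_features=12, n_informative=4,
                           n_classes=4, random_state=4)

# Scale, project onto 4 principal components, then classify by distance
pipe = make_pipeline(StandardScaler(), PCA(n_components=4),
                     KNeighborsClassifier(n_neighbors=38))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Wrapping the steps in a Pipeline also guarantees the scaler and PCA are fit only on each training fold, avoiding leakage during cross-validation.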

Key Learnings

  • Feature scaling is non-negotiable for distance-based algorithms — unscaled features cause the model to weight high-magnitude variables disproportionately, distorting the nearest-neighbor search.

  • The bias-variance trade-off is directly observable in KNN — low k overfits (perfect training accuracy, poor generalization), high k underfits (smooth boundary, poor fit). The optimal k must be found empirically per dataset.

  • Balanced class distribution simplifies evaluation — with roughly equal samples per category, standard accuracy is a reliable metric without needing weighted alternatives.

  • Algorithm selection is as important as tuning — the performance ceiling on this dataset suggests that no amount of k-tuning can compensate for a fundamental mismatch between the model's assumptions and the data's structure.
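
The first learning is easy to demonstrate on synthetic data: give KNN one informative feature and one income-scale noise feature, and the unscaled model's Euclidean distance is dominated by the noise (all names and magnitudes below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
informative = rng.normal(size=n)           # actually determines the label
noise = rng.normal(scale=10_000, size=n)   # income-scale irrelevant feature
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unscaled: distances are dominated by the high-magnitude noise feature
raw = KNeighborsClassifier(5).fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: both features contribute equally to the distance
scaler = StandardScaler().fit(X_tr)
scaled = KNeighborsClassifier(5).fit(
    scaler.transform(X_tr), y_tr).score(scaler.transform(X_te), y_te)
print(f"unscaled: {raw:.2f}, scaled: {scaled:.2f}")
```

The unscaled score hovers near chance while the scaled score approaches the separability of the informative feature, which is exactly the distortion StandardScaler removes in the project.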


Project Structure

knn-telecom-classification/
├── KNN_Classification_telecom_dataset.ipynb   # Main notebook
└── README.md

Credits

Developed by Dhyan Dave
© 2026. All rights reserved.