Telecom KNN Classification
KNN classification model for a telecom customer dataset. Built with Python and scikit-learn.
Telecom Customer Segmentation — KNN Classifier
Overview
A telecom provider's customer base is segmented into four service tiers based on usage behavior. This project builds a K-Nearest Neighbors classifier that predicts which service category a new customer belongs to — using only their demographic profile — enabling the company to personalize offers at scale.
Tech Stack
Languages & Libraries
- Python, NumPy, Pandas
- scikit-learn (KNeighborsClassifier, StandardScaler, train_test_split, accuracy_score)
- Matplotlib, Seaborn
Dataset
- Source: IBM Developer Skills Network — Telecom Customer Dataset
- Size: 1,000 customer records
- Features: Region, age, marital status, education, employment, income, tenure, and more
- Target: 4 service categories
  - Basic Service — 266 customers
  - E-Service — 217 customers
  - Plus Service — 281 customers
  - Total Service — 236 customers
Problem Statement
Given demographic data about a telecom customer, predict which of the four service categories they belong to. This is a multi-class classification problem where the goal is to learn patterns from labeled historical data and generalize to unseen customers.
Pipeline
- Data Loading & Exploration
- EDA & correlation
- Normalization (StandardScaler)
- Train-test split
- Model training
- Evaluation (accuracy, confusion matrix)
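The pipeline above can be sketched end to end. This is a minimal illustration in which a synthetic 4-class dataset (`make_classification`) stands in for the real IBM telecom records, which are not reproduced here, so the accuracy will not match the notebook's numbers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the telecom dataset: 1,000 rows, 4 classes.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=4, random_state=42)

# Scale features so no variable dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Hold out a test set, then train and evaluate a KNN classifier.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4)
knn = KNeighborsClassifier(n_neighbors=38).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```

The same six steps apply unchanged once the synthetic arrays are replaced by the real feature matrix and target column.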
What Was Built
- Exploratory Data Analysis: Examined class distribution across all four service tiers and generated a full correlation heatmap to understand inter-feature relationships.
- Feature Importance via Correlation: Identified `ed` (education level) and `tenure` as the strongest predictors of customer category, while `gender` and `retire` showed minimal correlation with the target, informing which features to prioritize.
- Data Normalization: Applied `StandardScaler` to all input features before training, ensuring no single feature dominated the KNN distance metric due to scale differences. This is a critical preprocessing step for any distance-based algorithm.
- Model Training: Built and evaluated KNN classifiers at k = 3 and k = 6, then systematically swept k from 1 to 100, training and evaluating a separate model for each value to identify the optimal configuration.
- Hyperparameter Tuning: Plotted model accuracy and its standard deviation across all 100 values of k. Identified k = 38 as the optimal number of neighbors, yielding the best test-set accuracy.
- Bias-Variance Analysis: Separately evaluated model performance on the training set across all k values, documenting how training accuracy degrades as k increases, and providing a theoretical explanation for the overfitting-to-underfitting transition.
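The k sweep and the bias-variance observation can be reproduced in a few lines. The sketch below again uses a synthetic 4-class dataset in place of the real records, so the best k it reports will differ from the notebook's k = 38:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class stand-in for the telecom dataset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Train one model per k, recording both training and test accuracy.
ks = list(range(1, 101))
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

best_k = ks[int(np.argmax(test_acc))]
print(f"best k = {best_k}")
# k = 1 memorizes the training set (accuracy 1.0); large k over-smooths.
print(f"train acc at k=1: {train_acc[0]:.2f}, at k=100: {train_acc[-1]:.2f}")
```

Plotting `train_acc` and `test_acc` against `ks` gives the overfitting-to-underfitting curve described above.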
Results
| Configuration | Test Accuracy |
|---|---|
| k = 3 | ~34% |
| k = 6 | ~35% |
| k = 38 (optimal) | ~41% |
Key Insight
The dataset's modest accuracy ceiling (~41%) surfaced an important finding: KNN's distance-based decision boundary is sensitive to feature irrelevance and high dimensionality. The demographic features alone do not form tightly separated clusters in feature space, limiting how well any distance-based model can perform.
Potential improvements:
- Feature selection or dimensionality reduction (PCA) to remove noise
- Switching to a tree-based ensemble (Random Forest, XGBoost) that handles mixed-type demographic data more robustly
- Exploring non-linear decision boundaries via SVM with RBF kernel
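As a sketch of the tree-ensemble suggestion, a Random Forest can be dropped in with minimal changes; the synthetic data below is again a stand-in, so the score is illustrative only. Note that tree ensembles are scale-invariant, so the `StandardScaler` step is not required here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class stand-in for the telecom dataset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=4, random_state=42)

# No scaling needed: tree splits are invariant to monotonic rescaling.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f}")
```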
Key Learnings
- Feature scaling is non-negotiable for distance-based algorithms: unscaled features cause the model to weight high-magnitude variables disproportionately, distorting the nearest-neighbor search.
- The bias-variance trade-off is directly observable in KNN: low k overfits (perfect training accuracy, poor generalization), high k underfits (smooth boundary, poor fit). The optimal k must be found empirically per dataset.
- Balanced class distribution simplifies evaluation: with roughly equal samples per category, standard accuracy is a reliable metric without needing weighted alternatives.
- Algorithm selection is as important as tuning: the performance ceiling on this dataset suggests that no amount of k-tuning can compensate for a fundamental mismatch between the model's assumptions and the data's structure.
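The scaling point can be demonstrated concretely. A minimal sketch with two hypothetical features on very different scales (the income and tenure values below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: income in dollars vs. tenure in months.
X = np.array([[72000.0, 12.0],
              [45000.0, 60.0],
              [98000.0,  3.0],
              [53000.0, 36.0]])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and unit variance, so income
# no longer dwarfs tenure in Euclidean distance computations.
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

Without this step, the raw income column alone would determine every nearest-neighbor distance.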
Project Structure
knn-telecom-classification/
├── KNN_Classification_telecom_dataset.ipynb # Main notebook
└── README.md
References
- Dataset: IBM Developer Skills Network
- Algorithm: scikit-learn KNeighborsClassifier
