
Telecom KNN Classification

KNN classification model for a telecom customer dataset, built with Python and scikit-learn.

Status
Completed

Technology Stack

Python
scikit-learn
Pandas
NumPy

Telecom Customer Segmentation — KNN Classifier

Overview

A telecom provider's customer base is segmented into four service tiers based on usage behavior. This project builds a K-Nearest Neighbors classifier that predicts which service category a new customer belongs to — using only their demographic profile — enabling the company to personalize offers at scale.


Tech Stack

Languages & Libraries

  • Python, NumPy, Pandas
  • scikit-learn (KNeighborsClassifier, StandardScaler, train_test_split, accuracy_score)
  • Matplotlib, Seaborn

Dataset

  • Source: IBM Developer Skills Network — Telecom Customer Dataset
  • Size: 1,000 customer records
  • Features: Region, age, marital status, education, employment, income, tenure, and more
  • Target: 4 service categories
    1. Basic Service — 266 customers
    2. E-Service — 217 customers
    3. Plus Service — 281 customers
    4. Total Service — 236 customers

Problem Statement

Given demographic data about a telecom customer, predict which of the four service categories they belong to. This is a multi-class classification problem where the goal is to learn patterns from labeled historical data and generalize to unseen customers.


Pipeline

  1. Data Loading & Exploration
  2. EDA & correlation
  3. Normalization (StandardScaler)
  4. Train-test split
  5. Model training
  6. Evaluation (accuracy, confusion matrix)
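
The six steps above can be sketched end to end as follows. This is a minimal illustration using synthetic stand-in data (the real notebook loads the IBM telecom CSV; feature count and random seeds here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# 1-2. Stand-in for loading/exploring the telecom data:
# 1,000 rows, 8 demographic-style features, 4 service categories (1..4)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = rng.integers(1, 5, size=1000)

# 3. Normalize so no feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# 4. Hold out a test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4)

# 5-6. Train a KNN classifier and evaluate it
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```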

What Was Built

  • Exploratory Data Analysis: Examined class distribution across all four service tiers and generated a full correlation heatmap to understand inter-feature relationships.

  • Feature Importance via Correlation: Identified 'ed' (education level) and 'tenure' as the strongest predictors of customer category, while 'gender' and 'retire' showed minimal correlation with the target — informing which features to prioritize.

  • Data Normalization: Applied 'StandardScaler' to all input features before training, ensuring no single feature dominated the KNN distance metric due to scale differences. This is a critical preprocessing step for any distance-based algorithm.

  • Model Training: Built and evaluated KNN classifiers starting at k = 3 and k = 6, then systematically swept k from 1 to 100 — training and evaluating a separate model for each value to identify the optimal configuration.

  • Hyperparameter Tuning: Plotted model accuracy and standard deviation across all 100 values of k. Identified k = 38 as the optimal number of neighbors, yielding the best test set accuracy.

  • Bias-Variance Analysis: Separately evaluated model performance on the training set across all k values, documenting how training accuracy falls from near-perfect at k = 1 as k grows — a direct illustration of the transition from overfitting to underfitting.
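
The k sweep described above can be sketched like this — one model trained and scored per value of k, tracking mean test accuracy and its standard error. Synthetic data via make_classification stands in for the telecom features; the variable names are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset: 1,000 samples, 4 classes, 8 features
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=4, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

k_values = range(1, 101)
mean_acc = np.zeros(len(k_values))
std_acc = np.zeros(len(k_values))
for i, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    correct = knn.predict(X_test) == y_test
    mean_acc[i] = correct.mean()
    # standard error of the per-sample correctness indicator
    std_acc[i] = correct.std() / np.sqrt(len(y_test))

best_k = int(np.argmax(mean_acc)) + 1
print(f"best k = {best_k}, test accuracy = {mean_acc[best_k - 1]:.3f}")
```

Plotting mean_acc against k (with a std_acc band) reproduces the tuning curve used to pick the optimal k.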


Results

Configuration        Test Accuracy
k = 3                ~34%
k = 6                ~35%
k = 38 (optimal)     ~41%

Key Insight

The dataset's modest accuracy ceiling (~41%) surfaced an important finding: KNN's distance-based decision boundary is sensitive to feature irrelevance and high dimensionality. The demographic features alone do not form tightly separated clusters in feature space, limiting how well any distance-based model can perform.

Potential improvements:

  • Feature selection or dimensionality reduction (PCA) to remove noise
  • Switching to a tree-based ensemble (Random Forest, XGBoost) that handles mixed-type demographic data more robustly
  • Exploring non-linear decision boundaries via SVM with RBF kernel
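
As one hedged sketch of the first improvement, PCA can be chained before KNN in a scikit-learn Pipeline to strip noisy dimensions before the distance computation. The component count and dataset here are illustrative, not tuned values from the project:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 12 features, only 4 of them informative
X, y = make_classification(n_samples=1000, n_features=12, n_informative=4,
                           n_classes=4, random_state=4)

# Scale, project onto 4 principal components, then classify by distance
pipe = make_pipeline(StandardScaler(), PCA(n_components=4),
                     KNeighborsClassifier(n_neighbors=38))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Wrapping the steps in a Pipeline also guarantees the scaler and PCA are fit only on each training fold, avoiding leakage during cross-validation.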

Key Learnings

  • Feature scaling is non-negotiable for distance-based algorithms — unscaled features cause the model to weight high-magnitude variables disproportionately, distorting the nearest-neighbor search.

  • The bias-variance trade-off is directly observable in KNN — low k overfits (perfect training accuracy, poor generalization), high k underfits (smooth boundary, poor fit). The optimal k must be found empirically per dataset.

  • Balanced class distribution simplifies evaluation — with roughly equal samples per category, standard accuracy is a reliable metric without needing weighted alternatives.

  • Algorithm selection is as important as tuning — the performance ceiling on this dataset suggests that no amount of k-tuning can compensate for a fundamental mismatch between the model's assumptions and the data's structure.
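
The first learning is easy to demonstrate on synthetic data: give KNN one informative feature and one income-scale noise feature, and the unscaled model's Euclidean distance is dominated by the noise (all names and magnitudes below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
informative = rng.normal(size=n)           # actually determines the label
noise = rng.normal(scale=10_000, size=n)   # income-scale irrelevant feature
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unscaled: distances are dominated by the high-magnitude noise feature
raw = KNeighborsClassifier(5).fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: both features contribute equally to the distance
scaler = StandardScaler().fit(X_tr)
scaled = KNeighborsClassifier(5).fit(
    scaler.transform(X_tr), y_tr).score(scaler.transform(X_te), y_te)
print(f"unscaled: {raw:.2f}, scaled: {scaled:.2f}")
```

The unscaled score hovers near chance while the scaled score approaches the separability of the informative feature, which is exactly the distortion StandardScaler removes in the project.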


Project Structure

knn-telecom-classification/
├── KNN_Classification_telecom_dataset.ipynb   # Main notebook
└── README.md

Credits

Developed by Dhyan Dave
© 2026. All rights reserved.