Assignment 3: Implementing Linear and Logistic Regression

Overview of the Problem Statement

Machine learning is all about extracting insights and making predictions from data. As part of the Innoquest Cohort-1 Machine Learning module, I tackled an exciting assignment focusing on regression and classification techniques. Here’s a quick overview:

Regression Task: Explore the Ames Housing dataset to predict house prices.
- Implement Simple Linear Regression, Multiple Linear Regression, and Polynomial Regression.
- Preprocess the data thoroughly for accurate model performance.
Classification Task: Work with two datasets to apply Logistic Regression and Multinomial Logistic Regression.
- Focus on binary and multiclass classification problems.
- Evaluate model performance using metrics like accuracy and confusion matrices.

Folder Structure

To keep the project organized and streamlined, I followed a structured approach:

Main Directory

Text Files:
- requirements.txt: Listed libraries and dependencies.
- encoding.txt: Documented how columns were encoded for the regression task.
- data.json: Contained custom Seaborn palettes for visualization.
Datasets:
- Ames Housing Dataset:
  - raw_data, without_na (handled missing values), without_ols (removed outliers), encoded_unscaled, scaled, and most_imp_39_features (features selected after correlation analysis).
- Customer Churn Dataset:
  - raw_data, encoded_unscaled (dataset prepared for modeling).
Notebooks:
- Preprocessing: Data cleaning, encoding, scaling, and feature selection.
- Model Building: Notebooks for each regression and classification model.

Approach and Methodology

1. Regression Task: Predicting House Prices

The Ames Housing dataset was a challenging yet rewarding choice, given its complexity. After encoding, the dataset expanded to over 250 features. Here’s how I tackled it:

Data Preprocessing:
- Handled missing values.
- Removed outliers using statistical techniques.
- Scaled features so that they share a comparable range.
- Reduced the feature set to the 39 most important features, selected through correlation analysis.
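
The preprocessing steps above can be sketched roughly as follows. This is a minimal NumPy illustration on synthetic data, not the notebook code. The median imputation, the 1.5 × IQR outlier rule, the standardization, and the helper names (`impute_median`, `remove_iqr_outliers`, `standardize`, `top_k_by_correlation`) are all assumptions, since the specific techniques are not stated here.

```python
import numpy as np

def impute_median(X):
    """Replace NaNs in each column with that column's median (assumed strategy)."""
    X = X.copy()
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def remove_iqr_outliers(X, y, k=1.5):
    """Keep rows whose target lies within k * IQR of the quartiles (assumed rule)."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    mask = (y >= q1 - k * iqr) & (y <= q3 + k * iqr)
    return X[mask], y[mask]

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - X.mean(axis=0)) / std

def top_k_by_correlation(X, y, k):
    """Return indices of the k columns most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant columns get a correlation of 0
    corr = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(corr))[:k]

# Synthetic stand-in for the housing data: 200 rows, 10 features,
# where only columns 2 and 7 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=200)
X[5, 1] = np.nan  # inject a missing value

X = impute_median(X)
X, y = remove_iqr_outliers(X, y)
X = standardize(X)
selected = top_k_by_correlation(X, y, k=2)
print(sorted(selected.tolist()))
```

On the real data the same selection step would be called with `k=39`; here `k=2` simply recovers the two informative synthetic columns.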
Model Building:
- Explored Simple Linear Regression for single-feature predictions.
- Implemented Multiple Linear Regression for multivariate analysis.
- Enhanced predictions with Polynomial Regression to capture non-linear patterns.
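
The three model types differ mainly in the design matrix passed to the least-squares solver. The sketch below illustrates this with `np.linalg.lstsq` on synthetic data. The feature names (`area`, `quality`), the coefficients, and the helpers (`fit_linear`, `predict_linear`, `polynomial_features`, `mean_abs_error`) are hypothetical and are not taken from the notebooks.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares: returns [intercept, coef_1, ..., coef_n]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    """Apply a fitted coefficient vector to a feature matrix."""
    return beta[0] + X @ beta[1:]

def polynomial_features(x, degree):
    """Expand a single feature x into columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def mean_abs_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Synthetic data with a quadratic dependence on one feature.
rng = np.random.default_rng(42)
area = rng.uniform(0.5, 3.0, size=300)      # hypothetical feature, arbitrary units
quality = rng.uniform(1.0, 10.0, size=300)  # hypothetical second feature
price = 50 + 40 * area + 5 * area ** 2 + 8 * quality + rng.normal(scale=2.0, size=300)

# Simple linear regression: a single feature.
X_simple = area[:, None]
beta_simple = fit_linear(X_simple, price)

# Multiple linear regression: several features at once.
X_multi = np.column_stack([area, quality])
beta_multi = fit_linear(X_multi, price)

# Polynomial regression: linear regression on polynomial terms of a feature.
X_poly = np.column_stack([polynomial_features(area, 2), quality])
beta_poly = fit_linear(X_poly, price)

mae_simple = mean_abs_error(price, predict_linear(beta_simple, X_simple))
mae_multi = mean_abs_error(price, predict_linear(beta_multi, X_multi))
mae_poly = mean_abs_error(price, predict_linear(beta_poly, X_poly))
print(mae_simple, mae_multi, mae_poly)
```

Because the synthetic target contains a squared term, the training error drops at each step here. On real data, that comparison should be made on a held-out split, since higher-degree polynomial terms can overfit.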

2. Classification Task: Predicting Customer Churn

For the classification task, I chose the Customer Churn dataset, which was slightly more manageable:

Binary Classification: Applied Logistic Regression to predict whether customers would churn.
Multiclass Classification: Used Multinomial Logistic Regression for multiclass problems (less relevant for this dataset, since churn is a binary target).
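
To make the difference between the two models concrete, here is a minimal from-scratch sketch in NumPy, assuming plain batch gradient descent on synthetic clusters. It is an illustration only: the function names, learning rate, iteration count, and data are all assumptions, and a library implementation would normally be preferred in practice.

```python
import numpy as np

def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Binary logistic regression via batch gradient descent on the log loss."""
    A = add_intercept(X)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ w)
        w -= lr * A.T @ (p - y) / len(y)
    return w

def softmax(Z):
    """Row-wise softmax, the multiclass generalization of the sigmoid."""
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial(X, y, n_classes, lr=0.1, n_iter=2000):
    """Multinomial logistic regression: one weight column per class."""
    A = add_intercept(X)
    W = np.zeros((A.shape[1], n_classes))
    Y = np.eye(n_classes)[y]  # one-hot encode integer labels
    for _ in range(n_iter):
        P = softmax(A @ W)
        W -= lr * A.T @ (P - Y) / len(y)
    return W

rng = np.random.default_rng(1)

# Binary case: two synthetic clusters (stand-ins for "stays" vs "churns").
X_bin = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y_bin = np.repeat(np.arange(2), 100)
w = fit_logistic(X_bin, y_bin)
pred_bin = (sigmoid(add_intercept(X_bin) @ w) >= 0.5).astype(int)
acc_bin = float(np.mean(pred_bin == y_bin))

# Multiclass case: three synthetic clusters.
centers = np.array([[0.0, 4.0], [-4.0, -2.0], [4.0, -2.0]])
X_multi = np.vstack([rng.normal(c, 1, size=(100, 2)) for c in centers])
y_multi = np.repeat(np.arange(3), 100)
W = fit_multinomial(X_multi, y_multi, n_classes=3)
pred_multi = (add_intercept(X_multi) @ W).argmax(axis=1)
acc_multi = float(np.mean(pred_multi == y_multi))

print(acc_bin, acc_multi)
```

The update rule has the same shape in both cases (predicted probability minus label, multiplied by the inputs). Only the link function changes from the sigmoid to the softmax.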

Visualizations and Business Insights

Although business exploration for the Ames Housing dataset was limited, I visualized key trends in the Customer Churn dataset, uncovering patterns like:

- Features contributing to churn likelihood.
- Insights into customer demographics and subscription behavior.

Results and Learnings

Regression Task Results

RMSE: ~18,000
R² Score: 0.89
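
For reference, both metrics can be computed in a few lines. The sketch below uses pure Python with made-up prices; the `actual` and `predicted` values are hypothetical and unrelated to the results reported above. RMSE is expressed in the same units as the target, while R² is the fraction of the target's variance that the model explains.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [200_000, 150_000, 320_000, 275_000]     # hypothetical sale prices
predicted = [210_000, 140_000, 300_000, 280_000]  # hypothetical model output

print(rmse(actual, predicted))                # 12500.0
print(round(r2_score(actual, predicted), 3))  # 0.964
```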

Classification Task Results

Binary Classification Accuracy: ~79%
- Class imbalance was not addressed due to time constraints, so accuracy alone may overstate how well the minority class is predicted.
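
The reason class imbalance matters for an accuracy figure can be shown with a small pure-Python sketch. The labels are hypothetical (8 customers who stay and 2 who churn), and the "model" is a deliberately naive one that always predicts "stays".

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means churned."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def accuracy(tn, fp, fn, tp):
    return (tn + tp) / (tn + fp + fn + tp)

def recall(tn, fp, fn, tp):
    """Share of actual churners that were caught; 0.0 if there are none."""
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [0] * 8 + [1] * 2  # hypothetical labels: 80% stay, 20% churn
y_pred = [0] * 10           # naive baseline: always predict "stays"

counts = confusion_counts(y_true, y_pred)
print(counts)             # (8, 0, 2, 0)
print(accuracy(*counts))  # 0.8
print(recall(*counts))    # 0.0
```

The baseline reaches 80% accuracy while catching no churners at all, which is why a confusion matrix or recall is worth reporting alongside accuracy on imbalanced data.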

Key Takeaways

Technical Growth:
- Gained hands-on experience with large datasets, feature engineering, and model evaluation.
- Improved proficiency in preprocessing techniques like scaling, encoding, and handling missing data.
Time Management:
- Learned to balance deep dives into datasets with practical deliverables.

Final Thoughts

This assignment was not just an exercise in implementing machine learning techniques but also a lesson in real-world problem-solving, where time and data complexity often dictate the scope of exploration. I’m eager to further refine these models and delve deeper into business insights in future projects!