Mastering Data Preprocessing and Regression Analysis: A Comprehensive Take-Home Project


As a data science intern at DataraFlow, I've had the opportunity to dive deep into the fundamentals of data preprocessing and regression analysis through this intensive take-home assignment. My solution represents a complete walkthrough of essential machine learning concepts, from handling missing data to building and evaluating regression models. In this article, I'll share my journey through the various tasks, highlighting the technical decisions, code implementations, and key insights gained along the way.

Introduction

My solution notebook serves as a comprehensive exploration of data preprocessing techniques and regression modeling, structured into three main parts: Tasks, Assignments, and Assessment. The primary focus is on preparing data for machine learning models and implementing both simple and multiple linear regression algorithms.

The dataset collection includes various CSV files covering different scenarios: customer data with missing values, categorical encoding examples, feature scaling demonstrations, e-commerce customer information, advertising-sales relationships, startup profit prediction, and a real-world housing price dataset.

Throughout the notebook, I demonstrated proficiency in:

  • Handling missing data through imputation strategies

  • Encoding categorical variables using OneHotEncoder and LabelEncoder

  • Feature scaling with StandardScaler

  • Building and evaluating regression models

  • Feature selection using backward elimination

  • Creating comprehensive visualizations for model interpretation

Part 1: Tasks - Building Fundamental Skills

Task 1: Missing Data Management

Objective: Practice handling missing values in datasets using appropriate imputation strategies.

Dataset: A small dataset containing 15 records with columns: Name, Age, City, Income, Product_Rating.

Implementation Approach: The task involved loading the dataset and identifying missing values across numerical (Age, Income, Product_Rating) and categorical (City) features. I used scikit-learn's SimpleImputer with different strategies:

  • Mean imputation for numerical features

  • Mode imputation for categorical features

Key Code Summary

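import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numerical imputation (columns 1, 3, 4: Age, Income, Product_Rating)
num_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
taskData.iloc[:, [1,3,4]] = num_imputer.fit_transform(taskData.iloc[:, [1,3,4]])

# Categorical imputation (column 2: City); ravel() flattens the (n, 1) result back to one column
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
taskData.iloc[:, 2] = cat_imputer.fit_transform(taskData.iloc[:, 2].values.reshape(-1,1)).ravel()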

Results and Insights:

The imputation filled every missing value. Age had 3 missing entries, Income had 3, and City had 1. The means used were 34 for Age, 61,416.67 for Income, and 4.52 for Product_Rating, and the most frequent city was New York.

This approach keeps all 15 rows and leaves each numerical column's mean unchanged, so no records are lost to row deletion. The trade-off is that every filled-in value sits exactly at the mean, which reduces the column's variance. Mean imputation works best for roughly symmetric numerical data, while mode imputation fills categorical gaps with the most common category.
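Here is a minimal sketch of what the two imputation strategies do to a column. The six-row frame is made up for illustration (it is not the task dataset). It shows that the mean of Age is unchanged after imputation while its standard deviation shrinks.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not the task dataset)
df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 45.0, np.nan, 30.0],
    "City": ["New York", "Chicago", np.nan, "New York", "Chicago", "New York"],
})

imputed = df.copy()
# Mean imputation for the numerical column
imputed["Age"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()
# Mode imputation for the categorical column
imputed["City"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["City"]]).ravel()

# pandas skips NaN when computing mean and std on the original column
mean_before, mean_after = df["Age"].mean(), imputed["Age"].mean()
std_before, std_after = df["Age"].std(), imputed["Age"].std()

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")  # unchanged
print(f"std:  {std_before:.2f} -> {std_after:.2f}")    # smaller after imputation
print(imputed["City"].tolist())
```

If the shrinking spread matters for a downstream model, median imputation or a missing-value indicator column are common alternatives worth comparing.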

Task 2: Encoding Categorical Variables

Objective: Master encoding techniques for categorical data to prepare it for machine learning algorithms.

Dataset: A small dataset with features including CustomerID, City, Product_Type (categorical) and Age, Purchase_Amount (numerical), plus Purchased as the target variable.

Implementation Approach: I separated features (X) and target (Y), then applied:

  • OneHotEncoder for independent categorical variables (City, Product_Type)

  • LabelEncoder for the binary target variable (Purchased: Yes/No)

The ColumnTransformer was used to handle multiple encoding operations simultaneously, with `remainder="passthrough"` preserving numerical features.

import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

# Split the data into features and target variable
X = task2Data.iloc[:, :-1].values
Y = task2Data.iloc[:, -1].values

# One-hot encode the categorical features (columns 1 and 2: City, Product_Type)
ct = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), [1,2])], remainder="passthrough")
X_encod = np.array(ct.fit_transform(X))

# Label-encode the binary target
label_encoder = LabelEncoder()
Y_encod = label_encoder.fit_transform(Y)

Results and Insights: The original feature matrix had shape (20, 5). After encoding, X grew from 5 to 10 columns because of one-hot encoding: City's 4 categories became 4 dummy variables, Product_Type's 3 categories became 3 dummy variables, and the 3 remaining features (CustomerID, Age, Purchase_Amount) passed through unchanged.

The encoding successfully converted categorical data into numerical format suitable for regression algorithms.
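The column arithmetic can be checked on a small made-up frame. This is a sketch with hypothetical values, using the column names from the task. `sparse_threshold=0` is my addition so that the transformer always returns a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (hypothetical values): 4 cities, 3 product types
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "City": ["A", "B", "C", "D", "A", "B"],
    "Product_Type": ["X", "Y", "Z", "X", "Y", "Z"],
    "Age": [25, 32, 47, 51, 38, 29],
    "Purchase_Amount": [120.0, 80.5, 230.0, 45.0, 99.9, 150.0],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes", "No"],
})
X = toy.drop(columns="Purchased")
y = toy["Purchased"]

ct = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(), ["City", "Product_Type"])],
    remainder="passthrough",  # keep CustomerID, Age, Purchase_Amount as they are
    sparse_threshold=0,       # always return a dense array
)
X_encoded = ct.fit_transform(X)
y_encoded = LabelEncoder().fit_transform(y)  # classes are sorted: "No" -> 0, "Yes" -> 1

print(X.shape, "->", X_encoded.shape)  # 5 columns become 4 + 3 + 3 = 10
print(y_encoded)
```

The one-hot columns come first in the output, followed by the passthrough columns in their original order.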

Task 3: Feature Scaling Comparison

Objective: Understand the impact of feature scaling on data distribution and model performance.

Dataset: task3_scaling_data.csv with features on different scales:

  • Age (23-46)

  • Annual_Salary ($32,000-$108,000)

  • Years_Experience (1-23)

  • Performance_Score (71-95).

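Implementation Approach: I split the data into training (80%) and test (20%) sets, then fit StandardScaler on the training set only and used the fitted scaler to transform both sets. Fitting on the training data alone keeps test-set statistics from leaking into preprocessing. After scaling, each training feature has zero mean and unit variance.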

Key Code Summary

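from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

# Standard scaling: fit on the training set, reuse the same statistics for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)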

# Plot the distribution of the original and scaled training data
import matplotlib.pyplot as plt
import seaborn as sns

# One row per feature: original data on the left, scaled data on the right
feature_names = ['Age', 'Annual Salary', 'Years of Experience']
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(10, 4))

for row, name in enumerate(feature_names):
    for col, (data, label) in enumerate([(X_train, 'Original'), (X_train_scaled, 'Scaled')]):
        sns.histplot(data[:, row], bins=20, kde=True, ax=ax[row, col])
        ax[row, col].set_title(f'{name} {label} Data Distribution')
        ax[row, col].set_xlabel('Value')
        ax[row, col].set_ylabel('Frequency')

# Adjust layout to prevent titles/labels from overlapping
plt.tight_layout()

# Display the plot
plt.show()

Results and Insights: Before scaling, the features had vastly different ranges and variances. After scaling, all features had mean ≈ 0 and standard deviation ≈ 1. The visualizations showed that while the shape of distributions remained similar, the scales were normalized, making features comparable for algorithms sensitive to scale differences.

This preprocessing step is crucial for gradient-based algorithms and distance-based methods, ensuring no single feature dominates the learning process due to its larger scale.
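A quick numeric check makes the "mean ≈ 0, standard deviation ≈ 1" claim concrete. The sketch below uses randomly generated columns in the same ranges as the task features (synthetic values, not task3_scaling_data.csv) and fits the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features in the ranges quoted above (not the real dataset)
rng = np.random.default_rng(10)
X = np.column_stack([
    rng.uniform(23, 46, 50),           # Age
    rng.uniform(32_000, 108_000, 50),  # Annual_Salary
    rng.uniform(1, 23, 50),            # Years_Experience
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=10)

scaler = StandardScaler().fit(X_train)       # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("train means:", np.round(train_means, 6))
print("train stds: ", np.round(train_stds, 6))
print("test means: ", np.round(test_means, 3))
```

The training columns come out at exactly zero mean and unit standard deviation up to floating-point error. The test columns are only close to those values, because they are scaled with the training statistics rather than their own.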

Part 2: Assignments - Applying Concepts in Practice

Assignment 1: Complete Data Preprocessing Pipeline

Objective: Build an end-to-end preprocessing workflow for an e-commerce customer dataset.

Dataset: Ecommerce data with 100 rows and 7 columns including customer demographics and purchase behavior.

Implementation Approach: I implemented a comprehensive pipeline covering:

  1. Data exploration and quality assessment

  2. Missing value imputation (mean for numerical, mode for categorical)

  3. Categorical encoding with dummy variable trap handling

  4. Train-test splitting (70/30)

  5. Feature scaling on numerical features

Key Code Summary

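import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Imputation: fill missing numerical values with the column mean
num_imputer = SimpleImputer(strategy='mean')
ecomData[['Age', 'Annual_Income']] = num_imputer.fit_transform(ecomData[['Age', 'Annual_Income']])

# Encoding: drop='first' removes one dummy per category to avoid the dummy variable trap
CT = ColumnTransformer(transformers=[('One_Hot_Encoder', OneHotEncoder(drop='first'), [0,2])], remainder='passthrough')
ecom_X = np.array(CT.fit_transform(ecom_X))

# Scaling: fit on the training set only, then apply the same transform to the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)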

Results and Insights: The pipeline successfully processed the data, resulting in 70 training samples and 30 test samples. Missing values in Age (3), Annual_Income (2), and Country (1) were imputed appropriately. The final feature set included encoded categorical variables and scaled numerical features.

Business Implications: This preprocessing pipeline ensures that machine learning models can effectively learn from customer data, potentially improving predictions of repeat purchase behavior.
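The five steps can also be bundled into a single scikit-learn pipeline. The sketch below is one possible arrangement, not the notebook's code. It runs on synthetic data, the `Purchased` target is a hypothetical column name, and it splits before fitting the imputers and encoder so that all preprocessing statistics come from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the e-commerce data (hypothetical values)
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "Country": np.tile(["US", "UK", "DE"], n // 3),
    "Age": rng.integers(18, 65, n).astype(float),
    "Annual_Income": rng.normal(60_000, 15_000, n),
    "Purchased": np.tile(["Yes", "No"], n // 2),  # hypothetical target column
})
data.loc[[2, 11], "Age"] = np.nan   # inject a few missing values
data.loc[5, "Country"] = np.nan

numeric = ["Age", "Annual_Income"]
categorical = ["Country"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(drop="first"))]), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(
    data[numeric + categorical], data["Purchased"], test_size=0.3, random_state=0
)

X_train_ready = preprocess.fit_transform(X_train)  # learn means, modes, categories, scales
X_test_ready = preprocess.transform(X_test)        # reuse them on the test rows

print(X_train_ready.shape, X_test_ready.shape)
```

Keeping everything in one fitted object means the same transformations can be applied to new customers later with a single `transform` call.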

Assignment 2: Simple Linear Regression Analysis

Objective: Implement and evaluate a simple linear regression model to understand advertising-sales relationships.

Dataset: An advertising sales CSV file with advertising spend and sales revenue data.

Implementation Approach: I built a simple linear regression model, trained it on 70% of the data, and evaluated performance using multiple metrics. Visualizations included scatter plots with regression lines for both training and test sets.

Figure: Relationship between advertising spend and revenue.

Figure: Scatter plot with regression line for training and test sets.

Key Code Summary

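from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Model training
lr_model = LinearRegression()
lr_model.fit(X_train.reshape(-1,1), Y_train.reshape(-1,1))

# Predictions on both sets (the training predictions are needed for the training R²)
Y_train_pred = lr_model.predict(X_train.reshape(-1,1))
Y_pred = lr_model.predict(X_test.reshape(-1,1))

# Evaluation
train_r2 = r2_score(Y_train, Y_train_pred)
test_r2 = r2_score(Y_test, Y_pred)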
Results and Insights: The model achieved exceptional performance with R² scores of 0.997 (training) and 0.998 (test), meaning it explains about 99.8% of the variance in sales revenue on the test set. The fitted regression equation was y = 4.86x + 38.97, where x is advertising spend and y is sales revenue.

For $50,000 advertising spend, the model predicted $243,048 in sales revenue.
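To show how the slope, intercept, and a single prediction are read off a fitted model, here is a small sketch. The data is synthetic and noise-free, generated from the rounded equation above, so it is not the advertising dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from the rounded equation y = 4.86x + 38.97
spend = np.linspace(1_000, 100_000, 50)
revenue = 4.86 * spend + 38.97

model = LinearRegression().fit(spend.reshape(-1, 1), revenue)

slope = model.coef_[0]         # change in revenue per extra dollar of spend
intercept = model.intercept_   # predicted revenue at zero spend
prediction = model.predict(np.array([[50_000]]))[0]

print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"prediction at 50,000: {prediction:,.2f}")  # about 243,038.97
```

The rounded equation gives about $243,039, slightly below the $243,048 reported above. A gap of that size is what you would expect if the notebook predicted with unrounded coefficients, although the write-up does not say so explicitly.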

Business Recommendations:

  1. Increase advertising investment as it strongly correlates with revenue

  2. Use the model for budget optimization and forecasting

  3. Validate predictions with A/B testing

Assignment 3: Multiple Linear Regression with Feature Selection

Objective: Build a multiple regression model and optimize it using backward elimination for a startup profit prediction scenario.

Dataset: A startup profit CSV file with 6 features including location categories and business metrics.

Implementation Approach: I implemented backward elimination using statsmodels OLS, iteratively removing features with p-values > 0.05. The process reduced the model from 6 features to 2 significant predictors.

The following table compares the initial model (all features) with the optimized model (selected features):

| Metric | Initial Model | Optimized Model | Improvement |
|---|---|---|---|
| Number of Features | 6 | 2 | 4 features removed |
| R² Score | 0.9711 | 0.9741 | 0.30% |
| Adjusted R² | 0.9364 | 0.9683 | 3.40% |
| RMSE | 9988.32 | 9465.26 | 5.24% |
| MSE | 99766474.25 | 89591174.51 | 10.20% |
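Adjusted R² is derived from R², the number of observations n, and the number of predictors p as 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies that formula to the two R² values in the table. The post does not state n; the value n = 12 used here is an assumption, chosen because it reproduces both Adjusted R² entries.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


N_SAMPLES = 12  # assumption: not stated in the post, picked because it matches the table

initial = adjusted_r2(0.9711, N_SAMPLES, n_features=6)
optimized = adjusted_r2(0.9741, N_SAMPLES, n_features=2)

print(round(initial, 4), round(optimized, 4))  # 0.9364 0.9683
```

Under that assumption the penalty factor (n - 1)/(n - p - 1) falls from 11/5 = 2.2 with six features to 11/9 ≈ 1.22 with two, which is why Adjusted R² improves far more than plain R² once the weak features are dropped.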

Key Code Summary

# Backward elimination with statsmodels
import statsmodels.api as sm
startupXtrain = sm.add_constant(X_train_final)
ols_model = sm.OLS(Y_train, startupXtrain).fit()

# Feature removal based on p-values
# Iteratively removed: Urban, Employee_Count, Administration_Cost, Suburban
# Retained: RD_Spend, Marketing_Spend

Results and Insights: The optimized model (R² = 0.974, Adjusted R² = 0.968) performed similarly to the full model (R² = 0.971, Adjusted R² = 0.936) while using fewer features. Key findings showed RD_Spend and Marketing_Spend as the most significant profit predictors.

Business Recommendations:

  • Prioritize R&D and marketing investments

  • Streamline operations in less impactful areas

  • Use the simplified model for efficient forecasting

Part 3: Assessment - Real-World Housing Price Prediction

Phase 1: Data Understanding & Preprocessing

Dataset: Housing Price Data with 100+ records containing various housing features.

Implementation Approach: I performed comprehensive EDA including correlation analysis, distribution plots, and outlier detection. The preprocessing pipeline handled categorical encoding and feature scaling while avoiding the dummy variable trap.

Key Findings:

  • No missing values or outliers detected

  • Strong correlations between house price and features such as area and property tax

  • Categorical variables (Neighborhood, Garage, Pool) required encoding
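As an illustration of the correlation and outlier checks, here is a small sketch on synthetic housing-style data. The 1.5 × IQR rule is an assumed outlier criterion, since the write-up does not say which method was used, and `iqr_outliers` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the housing dataset): price rises with area plus noise
rng = np.random.default_rng(42)
area = rng.uniform(800, 3000, 100)
houses = pd.DataFrame({
    "Area": area,
    "Price": 200 * area + rng.normal(0, 20_000, 100),
})

# Pearson correlation between area and price
corr = houses.corr().loc["Area", "Price"]


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR outside the first or third quartile."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


area_flags = iqr_outliers(houses["Area"])
demo_flags = iqr_outliers(pd.Series([1, 2, 3, 4, 100]))

print(f"corr(Area, Price) = {corr:.3f}")
print("outliers in Area:", int(area_flags.sum()))
print("demo flags:", demo_flags.tolist())  # only the value 100 is flagged
```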

Phase 2: Model Development

I built two models:

  1. Full Model: Multiple linear regression with all features

  2. Optimized Model: After backward elimination, retaining 6 significant features

Results: Both models achieved exceptional performance (R² = 0.9994), with the optimized model maintaining accuracy while reducing complexity.

Phase 3: Model Evaluation & Validation

Visualizations Created:

  • Scatter plots of predicted vs actual prices

  • Residual plots and distribution analysis

  • Feature importance visualizations

  • Prediction error distributions

Key Insights: The models showed small residuals with approximately normally distributed errors, indicating a good fit. Feature importance analysis identified Property_Tax and Area as the strongest predictors.
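The quantities behind those plots are simple to compute. This sketch fits a linear model to synthetic data (not the housing dataset) and derives the residuals, MSE, RMSE, and R² that residual plots and error distributions are built from.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data with a known linear relationship plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(80, 2))
y = 5 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 1, 80)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

residuals = y - pred                 # what a residual plot shows against pred
mse = mean_squared_error(y, pred)
rmse = np.sqrt(mse)                  # same units as the target
r2 = r2_score(y, pred)

print(f"mean residual: {residuals.mean():.2e}")  # essentially zero on the training data
print(f"RMSE: {rmse:.3f}, R²: {r2:.4f}")
```

For a least-squares fit with an intercept, the training residuals always average to zero, so the informative checks are their spread, their shape, and whether they show a pattern against the predicted values.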

Phase 4: Business Insights & Recommendations

Key Findings:

  • Property tax emerged as the strongest price predictor

  • Luxury neighborhoods command significant premiums

  • Pool and garage features have moderate positive impacts

Recommendations:

  1. Use the model for automated property valuation

  2. Focus marketing on high-impact features

  3. Implement regular model retraining with new data

Sample Predictions:

  • A 2000 sq ft luxury home with pool: ~$850,000

  • A 1500 sq ft standard home: ~$450,000

  • A 1800 sq ft urban home with garage: ~$620,000

Challenges Encountered

  • Ensuring proper column indexing during imputation

  • Handling the dummy variable trap by dropping first dummy columns

  • Maintaining data integrity through the pipeline

Conclusion

This comprehensive notebook demonstrates a complete data science workflow from raw data to actionable business insights. The key skills I've demonstrated include:

Technical Proficiency:

  • Advanced data preprocessing techniques

  • Regression modeling and evaluation

  • Feature engineering and selection

  • Statistical analysis and interpretation

Problem-Solving Approach:

  • Systematic handling of real-world data challenges

  • Iterative model optimization

  • Rigorous validation and testing

Business Acumen:

  • Translating technical results into actionable recommendations

  • Understanding model limitations and practical applications

The project reinforced that successful machine learning isn't just about algorithms; it's about thoughtful data preparation, rigorous evaluation, and clear communication of insights. As I continue my data science journey, these foundational skills will serve as the bedrock for more advanced modeling techniques and complex problem-solving scenarios.

The experience has been invaluable in bridging the gap between theoretical knowledge and practical application, preparing me to contribute meaningfully to real-world data science projects at DataraFlow.
