Mastering Data Preprocessing and Regression Analysis: A Comprehensive Take-Home Project

As a data science intern at DataraFlow, I've had the opportunity to dive deep into the fundamentals of data preprocessing and regression analysis through this intensive take-home assignment. My solution represents a complete walkthrough of essential machine learning concepts, from handling missing data to building and evaluating regression models. In this article, I'll share my journey through the various tasks, highlighting the technical decisions, code implementations, and key insights gained along the way.
Introduction
My solution notebook serves as a comprehensive exploration of data preprocessing techniques and regression modeling, structured into three main parts: Tasks, Assignments, and Assessment. The primary focus is on preparing data for machine learning models and implementing both simple and multiple linear regression algorithms.
The dataset collection includes various CSV files covering different scenarios: customer data with missing values, categorical encoding examples, feature scaling demonstrations, e-commerce customer information, advertising-sales relationships, startup profit prediction, and a real-world housing price dataset.
Throughout the notebook, I demonstrated proficiency in:
Handling missing data through imputation strategies
Encoding categorical variables using
OneHotEncoderandLabelEncoderFeature scaling with
StandardScalerBuilding and evaluating regression models
Feature selection using backward elimination
Creating comprehensive visualizations for model interpretation
Part 1: Tasks - Building Fundamental Skills
Task 1: Missing Data Management
Objective: Practice handling missing values in datasets using appropriate imputation strategies.
Dataset: A small dataset containing 15 records with columns: Name, Age, City, Income, Product_Rating.
Implementation Approach: The task involved loading the dataset and identifying missing values across numerical (Age, Income, Product_Rating) and categorical (City) features. I used scikit-learn's SimpleImputer with different strategies:
Mean imputation for numerical features
Mode imputation for categorical features
Key Code Summary
# Numerical imputation
num_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
taskData.iloc[:, [1,3,4]] = num_imputer.fit_transform(taskData.iloc[:, [1,3,4]])
# Categorical imputation
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
taskData.iloc[:, 2] = cat_imputer.fit_transform(taskData.iloc[:, 2].values.reshape(-1,1))
Results and Insights:
The imputation successfully filled all missing values. Age had 3 missing entries, Income had 3, and City had 1. The mean values used were 34 for Age, 61416.666667 for Income, and 4.515385 for Product_Rating. The most frequent city was New York.
This approach preserves the statistical properties of the data while ensuring no information loss from dropping rows. Mean imputation works well for normally distributed numerical data, while mode imputation maintains the most representative categorical value.
Task 2: Encoding Categorical Variables
Objective: Master encoding techniques for categorical data to prepare it for machine learning algorithms.
Dataset: A small dataset with features including CustomerID, City, Product_Type (categorical) and Age, Purchase_Amount (numerical), plus Purchased as the target variable.
Implementation Approach: I separated features (X) and target (Y), then applied:
OneHotEncoder for independent categorical variables (City, Product_Type)
LabelEncoder for the binary target variable (Purchased: Yes/No)
The ColumnTransformer was used to handle multiple encoding operations simultaneously, with `remainder="passthrough"` preserving numerical features.
# split the data into features and target variable
X = task2Data.iloc[:, :-1].values
Y = task2Data.iloc[:, -1].values
# Encoding Categorical Features
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
# OneHotEncoding for features
ct = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), [1,2])], remainder="passthrough")
X_encod = np.array(ct.fit_transform(X))
# LabelEncoding for target
label_encoder = LabelEncoder()
Y_encod = label_encoder.fit_transform(Y)
Results and Insights: The original dataset shape was (20, 5). After encoding, X transformed from 5 features to 10 features due to one-hot encoding (City: 4 categories → 4 dummy variables, Product_Type: 3 categories → 3 dummy variables, plus 3 other features).
The encoding successfully converted categorical data into numerical format suitable for regression algorithms.
Task 3: Feature Scaling Comparison
Objective: Understand the impact of feature scaling on data distribution and model performance.
Dataset: task3_scaling_data.csv with features on different scales:
Age (23-46)
Annual_Salary ($32,000-$108,000)
Years_Experience (1-23)
Performance_Score (71-95).
Implementation Approach: I split the data into training (80%) and test (20%) sets, then applied StandardScaler to both sets. The scaling transformed features to have zero mean and unit variance.
Key Code Summary
# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
# Standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# plot the distribution of the scaled and original data
# import the matplotlib library
import matplotlib.pyplot as plt
import seaborn as sns
# Create a figure and a set of subplots
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(10, 4))
# Plot the Age original data on the first row, first column
sns.histplot(X_train[:,0], bins=20, kde=True, ax=ax[0,0])
ax[0,0].set_title('Age Original Data Distribution')
ax[0,0].set_xlabel('Value')
ax[0,0].set_ylabel('Frequency')
# Plot the Age Scaled data on the first row, second column
sns.histplot(X_train_scaled[:,0], bins=20, kde=True,ax=ax[0,1])
ax[0,1].set_title('Age Scaled Data Distribution')
ax[0,1].set_xlabel('Value')
ax[0,1].set_ylabel('Frequency')
# Plot the Annual Salary original data on the second row, first column
sns.histplot(X_train[:,1], bins=20, kde=True, ax=ax[1,0])
ax[1,0].set_title('Annual Salary Original Data Distribution')
ax[1,0].set_xlabel('Value')
ax[1,0].set_ylabel('Frequency')
# Plot the Annual Salary Scaled data on the second row, second column
sns.histplot(X_train_scaled[:,1], bins=20, kde=True,ax=ax[1,1])
ax[1,1].set_title('Annual Salary Scaled Data Distribution')
ax[1,1].set_xlabel('Value')
ax[1,1].set_ylabel('Frequency')
# Plot the Years_Experience original data on the third row, first column
sns.histplot(X_train[:,2], bins=20, kde=True, ax=ax[2,0])
ax[2,0].set_title('Years of Experience Original Data Distribution')
ax[2,0].set_xlabel('Value')
ax[2,0].set_ylabel('Frequency')
# Plot the Years_Experience Scaled data on the third row, second column
sns.histplot(X_train_scaled[:,2], bins=20, kde=True,ax=ax[2,1])
ax[2,1].set_title('Years of Experience Scaled Data Distribution')
ax[2,1].set_xlabel('Value')
ax[2,1].set_ylabel('Frequency')
# Adjust layout to prevent titles/labels from overlapping
plt.tight_layout()
# Display the plot
plt.show()
Results and Insights: Before scaling, the features had vastly different ranges and variances. After scaling, all features had mean ≈ 0 and standard deviation ≈ 1. The visualizations showed that while the shape of distributions remained similar, the scales were normalized, making features comparable for algorithms sensitive to scale differences.

This preprocessing step is crucial for gradient-based algorithms and distance-based methods, ensuring no single feature dominates the learning process due to its larger scale.
Part 2: Assignments - Applying Concepts in Practice
Assignment 1: Complete Data Preprocessing Pipeline
Objective: Build an end-to-end preprocessing workflow for an e-commerce customer dataset.
Dataset: Ecommerce data with 100 rows and 7 columns including customer demographics and purchase behavior.
Implementation Approach: I implemented a comprehensive pipeline covering:
Data exploration and quality assessment
Missing value imputation (mean for numerical, mode for categorical)
Categorical encoding with dummy variable trap handling
Train-test splitting (70/30)
Feature scaling on numerical features
Key Code Summary
# Imputation
num_imputer = SimpleImputer(strategy='mean')
ecomData[['Age', 'Annual_Income']] = num_imputer.fit_transform(ecomData[['Age', 'Annual_Income']])
# Encoding
CT = ColumnTransformer(transformers=[('One_Hot_Encoder', OneHotEncoder(drop='first'), [0,2])], remainder='passthrough')
ecom_X = np.array(CT.fit_transform(ecom_X))
# Scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Results and Insights: The pipeline successfully processed the data, resulting in 70 training samples and 30 test samples. Missing values in Age (3), Annual_Income (2), and Country (1) were imputed appropriately. The final feature set included encoded categorical variables and scaled numerical features.
Business Implications: This preprocessing pipeline ensures that machine learning models can effectively learn from customer data, potentially improving predictions of repeat purchase behavior.
Assignment 2: Simple Linear Regression Analysis
Objective: Implement and evaluate a simple linear regression model to understand advertising-sales relationships.
Dataset: Advertising sales csv with advertising spend and sales revenue data.
Implementation Approach: I built a simple linear regression model, trained it on 70% of the data, and evaluated performance using multiple metrics. Visualizations included scatter plots with regression lines for both training and test sets.
Relationship Between Advertisement spending and Revenue

Scatter Plot with Regression Line for Training and Test Sets


Key Code Summary
# Model training
lr_model = LinearRegression()
lr_model.fit(X_train.reshape(-1,1), Y_train.reshape(-1,1))
# Predictions
Y_pred = lr_model.predict(X_test.reshape(-1,1))
# Evaluation
train_r2 = r2_score(Y_train, Y_train_pred)
test_r2 = r2_score(Y_test, Y_pred)
Results and Insights: The model achieved exceptional performance with R² scores of 0.997 (training) and 0.998 (test), indicating the model explains ~99.8% of variance in sales revenue. The regression equation was: y = 4.86x + 38.97.
For $50,000 advertising spend, the model predicted $243,048 in sales revenue.
Business Recommendations:
Increase advertising investment as it strongly correlates with revenue
Use the model for budget optimization and forecasting
Validate predictions with A/B testing
Assignment 3: Multiple Linear Regression with Feature Selection
Objective: Build a multiple regression model and optimize it using backward elimination for a startup profit prediction scenario.
Dataset: Startup Profit csv with 6 features including location categories and business metrics.
Implementation Approach: I implemented backward elimination using statsmodels OLS, iteratively removing features with p-values > 0.05. The process reduced the model from 6 features to 2 significant predictors.
The following table compares the initial model (all features) with the optimized model (selected features)
| Metric | Initial Model | Optimized Model | Improvement |
| Number of Features | 6 | 2 | 4 features removed |
| R² Score | 0.9711 | 0.9741 | 0.30% |
| Adjusted R² | 0.9364 | 0.9683 | 3.40% |
| RMSE | 9988.32 | 9465.26 | 5.24% |
| MSE | 99766474.25 | 89591174.51 | 10.20% |
Key Code Summary
# Backward elimination with statsmodels
import statsmodels.api as sm
startupXtrain = sm.add_constant(X_train_final)
ols_model = sm.OLS(Y_train, startupXtrain).fit()
# Feature removal based on p-values
# Iteratively removed: Urban, Employee_Count, Administration_Cost, Suburban
# Retained: RD_Spend, Marketing_Spend
Results and Insights: The optimized model (R² = 0.974, Adjusted R² = 0.968) performed similarly to the full model (R² = 0.971, Adjusted R² = 0.936) while using fewer features. Key findings showed RD_Spend and Marketing_Spend as the most significant profit predictors.
Business Recommendations:
Prioritize R&D and marketing investments
Streamline operations in less impactful areas
Use the simplified model for efficient forecasting
Part 3: Assessment - Real-World Housing Price Prediction
Phase 1: Data Understanding & Preprocessing
Dataset: Housing Price Data with 100+ records containing various housing features.
Implementation Approach: I performed comprehensive EDA including correlation analysis, distribution plots, and outlier detection. The preprocessing pipeline handled categorical encoding and feature scaling while avoiding the dummy variable trap.
Key Findings:
No missing values or outliers detected
Strong correlations between house price and features like area, property tax
Categorical variables (Neighborhood, Garage, Pool) required encoding
Phase 2: Model Development
I built two models:
Full Model: Multiple linear regression with all features
Optimized Model: After backward elimination, retaining 6 significant features
Results: Both models achieved exceptional performance (R² = 0.9994), with the optimized model maintaining accuracy while reducing complexity.
Phase 3: Model Evaluation & Validation
Visualizations Created:
Scatter plots of predicted vs actual prices

Residual plots and distribution analysis

Feature importance visualizations

Prediction error distributions

Key Insights: The models showed minimal residuals and normally distributed errors, indicating good model fit. Feature importance analysis revealed Property_Tax and Area as strongest predictors.
Phase 4: Business Insights & Recommendations
Key Findings:
Property tax emerged as the strongest price predictor
Luxury neighborhoods command significant premiums
Pool and garage features have moderate positive impacts
Recommendations:
Use the model for automated property valuation
Focus marketing on high-impact features
Implement regular model retraining with new data
Sample Predictions:
A 2000 sq ft luxury home with pool: ~$850,000
A 1500 sq ft standard home: ~$450,000
A 1800 sq ft urban home with garage: ~$620,000
Challenges Encountered
Ensuring proper column indexing during imputation
Handling the dummy variable trap by dropping first dummy columns
Maintaining data integrity through the pipeline
Conclusion
This comprehensive notebook demonstrates a complete data science workflow from raw data to actionable business insights. The key skills I've demonstrated include:
Technical Proficiency:
Advanced data preprocessing techniques
Regression modeling and evaluation
Feature engineering and selection
Statistical analysis and interpretation
Problem-Solving Approach:
Systematic handling of real-world data challenges
Iterative model optimization
Rigorous validation and testing
Business Acumen:
Translating technical results into actionable recommendations
Understanding model limitations and practical applications
The project reinforced that successful machine learning isn't just about algorithms—it's about thoughtful data preparation, rigorous evaluation, and clear communication of insights. As I continue my data science journey, these foundational skills will serve as the bedrock for more advanced modeling techniques and complex problem-solving scenarios.
The experience has been invaluable in bridging the gap between theoretical knowledge and practical application, preparing me to contribute meaningfully to real-world data science projects at DataraFlow.



