From Forests to Funnels: A Complete ML Workflow Across Regression, Classification, and Churn Prediction


Week 16 of my data science internship at DataraFlow pushed everything to a new level. We moved beyond individual model building into end-to-end machine learning workflows, complete with multi-model comparisons, class imbalance handling, threshold tuning, and production deployment strategy. Here's how it all unfolded.


Overview

This week was structured in three layers of increasing complexity:

  • Part 1 (Tasks): Foundational model building - Random Forest regression, comprehensive metric evaluation, and binary spam classification.

  • Part 2 (Assignments): Comparative analysis across multiple models for house price prediction, imbalanced marketing campaign classification, and multi-class credit risk modeling.

  • Part 3 (Assessment): A full end-to-end churn prediction project with preprocessing pipelines, hyperparameter tuning, threshold optimization, feature importance analysis, and business recommendations.

Let's get into it.


Part 1 — The Foundations

Task 1: Random Forest Regression - Crop Yield Prediction

The first task introduced Random Forest Regression using a simple two-column dataset with a single Feature and a Target representing crop yield values.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = rf_data[['Feature']]
Y = rf_data['Target']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, Y_train)
Y_pred = rf_model.predict(X_test)

print(f'R² Score: {r2_score(Y_test, Y_pred):.3f}')  # R² = 0.878

Result: R² = 0.878, which looks strong for a single predictor. The caveat is sample size: with only 20 data points and a 20% hold-out, the test set contains just 4 rows, so this score is a noisy estimate. The feature importance confirmed that Feature was the sole driver (importance = 1.0), which is expected in a univariate setup.

Key takeaway: Even on minimal data, Random Forest can capture non-linear relationships effectively, though with very few samples the model's generalizability should always be questioned.
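One way to act on that caution is to score the model over many resampled splits instead of a single 80/20 split. The sketch below uses repeated k-fold cross-validation. The `X` and `y` arrays are synthetic stand-ins (an assumption, not the internship dataset), so the printed numbers say nothing about the real crop yield result; the point is only how wide the spread of R² can be with 20 rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for a 20-row, single-feature dataset (NOT the real data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(20, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=20)

model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

# 5 folds repeated 5 times gives 25 separate R² estimates, each on 4 held-out rows.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R² mean: {scores.mean():.3f}, std: {scores.std():.3f}")
print(f"R² range: {scores.min():.3f} to {scores.max():.3f}")
```

A wide range across folds is a sign that any single split, including the one reported above, should be read with care.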


Task 2: Comprehensive Model Evaluation - Salary Prediction

This task shifted focus from building to measuring, requiring all five standard regression metrics on a 3-feature salary dataset (Experience, Training Hours, Previous Projects).

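import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

salary_rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
salary_rf_model.fit(X_train, Y_train)
sal_Y_pred = salary_rf_model.predict(X_test)

r2 = r2_score(Y_test, sal_Y_pred)
n, k = X_test.shape  # test-set rows and number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
mae = mean_absolute_error(Y_test, sal_Y_pred)
rmse = np.sqrt(mean_squared_error(Y_test, sal_Y_pred))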

| Metric | Value |
| --- | --- |
| R² Score | 0.9912 |
| Adjusted R² | 0.9890 |
| MAE | 2.1513 |
| RMSE | 2.5821 |

Near-perfect scores across the board. The model learned that salary scales predictably with experience, training hours, and prior project exposure. The actual vs. predicted scatter plot showed points clustering tightly around the ideal diagonal line, a visual confirmation of model quality.

Why Adjusted R² matters: Unlike R², Adjusted R² penalizes unnecessary features. When you have multiple predictors, Adjusted R² tells the truer story by adjusting for the number of variables used.
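To make the penalty concrete, here is a small sketch of the same formula used in the snippet above. The R² value comes from the results table; `n_samples=16` is an assumed test-set size, chosen only because it is consistent with the reported Adjusted R², and the 10-feature case is purely hypothetical.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R² from the table; n_samples=16 is an assumption, not a value from the notebook.
three_features = adjusted_r2(0.9912, n_samples=16, n_features=3)

# Hypothetical: the same R² achieved with 10 predictors is penalized more.
ten_features = adjusted_r2(0.9912, n_samples=16, n_features=10)

print(f"3 features:  {three_features:.4f}")
print(f"10 features: {ten_features:.4f}")
```

The gap between R² and Adjusted R² grows as the number of predictors approaches the number of samples, which is why the adjustment matters most on small test sets.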


Task 3: Binary Classification - Spam Detection

For spam detection, Logistic Regression was applied to a 100-sample dataset of emails with features like word count, link count, sender reputation, capital ratio, and exclamation count.

from sklearn.linear_model import LogisticRegression

spamLog_reg = LogisticRegression(max_iter=1000, random_state=42)
spamLog_reg.fit(X_train, Y_train)
spamY_pred = spamLog_reg.predict(X_test)

Results: Accuracy, Precision, Recall, and F1-Score all returned 1.0, a perfect classifier on the test set.

While this sounds almost too good, the dataset was synthetically generated and well-separated, so a perfect score here is not evidence of how the model would fare on real email. The confusion matrix heatmap showed zero misclassifications across both classes.

Interpreting the metrics: In spam detection, Precision tends to be the more critical metric. A False Positive (classifying a legitimate email as spam) is usually costlier than a False Negative (spam slipping through), because losing an important work email to a spam folder has real consequences. A high F1-Score confirms that precision and recall are both strong at once, which is the ideal outcome.
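When precision matters more than recall, one option is to score models with F-beta using beta below 1 instead of F1. The sketch below is a plain-Python version of the standard formula; the two filters and their precision/recall values are hypothetical and were not part of the task.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical spam filters: A rarely flags real mail, B catches more spam.
filter_a = {"precision": 0.98, "recall": 0.80}
filter_b = {"precision": 0.85, "recall": 0.97}

for name, m in (("A", filter_a), ("B", filter_b)):
    f1 = f_beta(m["precision"], m["recall"], beta=1.0)
    f_half = f_beta(m["precision"], m["recall"], beta=0.5)
    print(f"Filter {name}: F1 = {f1:.3f}, F0.5 = {f_half:.3f}")
```

With these made-up numbers F1 prefers filter B while F0.5 prefers filter A, which is the precision-first ranking the spam use case calls for.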


Part 2 — Deeper Analysis and Comparisons

Assignment 1: Comparative Regression - House Price Prediction

This assignment compared three regression approaches on a 150-sample house price dataset with features including Square Footage, Bedrooms, Bathrooms, Age, Distance to City, Garage, and Pool.

EDA Insights:

The price distribution was approximately normal, peaking in the 310–350 range. The correlation heatmap told a clean story:

  • Square_Feet dominated with a correlation of 0.80

  • Age showed a moderate negative correlation of -0.33 (older = cheaper)

  • Distance_to_City had a mild negative effect of -0.17

Other features like bedrooms, bathrooms, and amenities offered incremental contributions.

Model Comparison:

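from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Three models trained and evaluated
lin_reg = LinearRegression()
house_dtree_reg = DecisionTreeRegressor(max_depth=10, random_state=42)
house_rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)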

| Model | R² | Adj. R² | MAE | RMSE |
| --- | --- | --- | --- | --- |
| Linear Regression | 0.894 | 0.861 | 27.721 | 32.505 |
| Random Forest | 0.862 | 0.818 | 31.848 | 37.137 |
| Decision Tree | 0.563 | 0.424 | 53.913 | 66.076 |

The Surprising Winner: Linear Regression outperformed both ensemble methods.

This is the "when simpler is better" lesson in action. When the dominant relationship in the data is close to linear (as Square Footage vs. Price clearly is), adding tree complexity doesn't help. Trees approximate a straight-line trend with piecewise-constant steps, which adds approximation error, and with only 150 samples a single deep tree also overfits noise, as the Decision Tree row shows. Random Forest shines when relationships are non-linear and feature interactions are complex. Here, they weren't.
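The effect is easy to reproduce on data that is linear by construction. The sketch below is an illustration only: the features, coefficients, and noise level are invented, so it demonstrates the mechanism and does not reproduce the house price table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, purely linear "price" data (all numbers here are assumptions).
rng = np.random.default_rng(42)
n = 150
sqft = rng.uniform(800, 3500, n)
age = rng.uniform(0, 50, n)
dist = rng.uniform(1, 30, n)
X = np.column_stack([sqft, age, dist])
y = 50 + 0.15 * sqft - 1.5 * age - 2.0 * dist + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=42
    ),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name}: RMSE = {rmse[name]:.2f}")
```

Because the target really is a weighted sum of the features, the linear model only has to estimate four numbers, while the forest has to approximate a plane with averaged step functions.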

Feature Importance (Random Forest):

| Feature | Importance |
| --- | --- |
| Square_Feet | 0.681 |
| Age | 0.161 |
| Distance_to_City | 0.063 |
| Bathrooms | 0.038 |
| Bedrooms | 0.028 |
| Pool | 0.018 |
| Garage | 0.011 |

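Square footage alone accounts for about 68% of the forest's impurity-based importance (the quantity scikit-learn reports as feature_importances_), far ahead of every other feature. Size is king.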
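Impurity-based importances are normalized shares, so they sum to 1 and can be read cumulatively. A quick check using the values from the table above (the only assumption is that they were copied correctly):

```python
# Importances copied from the Random Forest table above.
importances = {
    "Square_Feet": 0.681,
    "Age": 0.161,
    "Distance_to_City": 0.063,
    "Bathrooms": 0.038,
    "Bedrooms": 0.028,
    "Pool": 0.018,
    "Garage": 0.011,
}

total = sum(importances.values())

# Walk down the ranking and accumulate each feature's share.
running = 0.0
cumulative = {}
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    running += value
    cumulative[name] = running
    print(f"{name:<18} {value:.3f}  cumulative {running:.3f}")

print(f"Total: {total:.3f}")
```

Square_Feet and Age together cover roughly 84% of the total, which suggests a two-feature model would be a reasonable baseline to compare against.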


Assignment 2: Binary Classification - Marketing Campaign Conversion

This assignment tackled class imbalance head-on. The dataset of 300 customers had a striking imbalance: 87% responded (Class 1) vs 13% did not respond (Class 0).

Two Logistic Regression models were compared:

from sklearn.linear_model import LogisticRegression

# Model A: Default
log_modelA = LogisticRegression(random_state=42, max_iter=1000)

# Model B: Class-balanced
log_modelB = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
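For reference, scikit-learn computes the 'balanced' weights as n_samples / (n_classes * class_count). The sketch below applies that formula to the full dataset; the counts 261 and 39 are derived from the rounded 87%/13% split of 300 customers, and the real fit would use the training split, so treat the exact values as approximate.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weights as used by class_weight='balanced': n / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Approximate counts from the 87% / 13% split of 300 customers.
counts = {0: 39, 1: 261}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"Class {label}: weight = {weight:.3f}")
```

Each non-responder ends up counting several times as much as each responder in the loss, which is what shifts Model B's predictions toward Class 0.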

Results Comparison:

| Metric | Model A (Default) | Model B (Balanced) |
| --- | --- | --- |
| Accuracy | 85.33% | 64.00% |
| Precision | 92.75% | 95.74% |
| Recall | 91.43% | 64.29% |
| F1-Score | 92.09% | 76.92% |
| AUC | 0.637 | 0.631 |

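Key Insight: class_weight='balanced' reweights each class inversely to its frequency, which pushes the model to pay more attention to the minority class. Here the minority is the non-responders (Class 0), so Model B predicts "no response" more often: precision on responders rises slightly (95.74% vs 92.75%), but recall drops from 91.43% to 64.29% and accuracy falls to 64%. AUC is nearly identical (0.637 vs 0.631), which suggests the two models rank customers about equally well and differ mainly in where the decision boundary sits. Note also that Model A's 85.33% accuracy is close to the 87% share of responders, so accuracy alone says little about either model. For a campaign whose goal is reaching likely responders, Model A is the better fit.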
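It can help to read the percentages as confusion-matrix counts. Assuming a 75-row test set (25% of 300), the counts below reproduce every accuracy, precision, recall, and F1 value in the table. They are a reconstruction from the reported metrics, not figures taken from the notebook.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics, with Class 1 (responded) as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Share of actual non-responders (Class 0) that were identified.
        "specificity": tn / (tn + fp),
    }


# Reconstructed counts (assumption): 70 responders and 5 non-responders in test.
model_a = metrics_from_counts(tp=64, fp=5, fn=6, tn=0)
model_b = metrics_from_counts(tp=45, fp=2, fn=25, tn=3)

for name, m in (("Model A", model_a), ("Model B", model_b)):
    summary = ", ".join(f"{k} = {v * 100:.2f}%" for k, v in m.items())
    print(f"{name}: {summary}")
```

If these counts are right, Model A reaches its accuracy without correctly identifying any non-responder, while Model B identifies some of them at the price of missing many responders. That is worth confirming against the actual confusion matrices before deployment.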

Exploratory boxplots revealed that responders tended to be younger, higher earners, with longer membership, more prior purchases, and stronger digital engagement (more email opens, more website visits). These are the segments to target with personalized marketing.

Business Recommendation: Deploy Model A for campaign targeting. Focus on younger, digitally active, high-income customers with established membership history.


Assignment 3: Multi-Class Classification - Credit Risk Prediction

Three risk levels were predicted (Low, Medium, High) across 400 customer applications using features like Credit Score, Income, Debt-to-Income ratio, Employment Years, and Previous Defaults.

Class Distribution:

  • Medium Risk: 40.5%

  • Low Risk: 32.0%

  • High Risk: 27.5%

Relatively balanced — no aggressive resampling needed.

Three classifiers were trained:

log_model = OneVsRestClassifier(LogisticRegression(random_state=42, max_iter=1000))
credit_dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

Accuracy Summary:

| Model | Overall Accuracy |
| --- | --- |
| Logistic Regression | 78.33% |
| Random Forest | 77.50% |
| Decision Tree | 75.00% |

All three models struggled most with the Medium Risk class, the transitional zone between Low and High where feature patterns overlap significantly.

Model choice depends on business priority:

  • Logistic Regression if High Risk detection is the non-negotiable priority (92% recall for High Risk)

  • Random Forest if balanced performance across all three risk levels is needed in production
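Choosing by business priority means looking at per-class recall and precision, not overall accuracy. The sketch below computes both from a 3x3 confusion matrix. The counts are hypothetical, made up to show the calculation, and are not the matrices from this assignment.

```python
labels = ["Low", "Medium", "High"]

# Rows = actual class, columns = predicted class (hypothetical counts).
cm = [
    [30, 7, 1],
    [8, 32, 9],
    [1, 4, 28],
]

recall = {}
precision = {}
for i, label in enumerate(labels):
    actual_total = sum(cm[i])                    # everything truly in this class
    predicted_total = sum(row[i] for row in cm)  # everything predicted as this class
    recall[label] = cm[i][i] / actual_total
    precision[label] = cm[i][i] / predicted_total
    print(f"{label:<7} recall = {recall[label]:.3f}, precision = {precision[label]:.3f}")

accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(map(sum, cm))
print(f"Overall accuracy = {accuracy:.3f}")
```

In this made-up matrix the Medium class has the lowest recall because its errors leak in both directions, which mirrors the overlap described above.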

Feature Importance (Random Forest):

| Feature | Importance |
| --- | --- |
| Credit_Score | 0.2046 |
| Previous_Defaults | 0.1926 |
| Debt_to_Income | 0.1644 |
| Credit_History_Length | 0.0953 |
| Income | 0.0947 |
| Age | 0.0898 |
| Employment_Years | 0.0818 |
| Loan_Amount | 0.0769 |

Credit behavior and repayment history dominate. The actual loan amount is the least predictive; it's not what you borrow but how reliably you've repaid in the past.


Part 3 — End-to-End Assessment: Customer Churn Prediction

This was the capstone: a full production-style ML workflow for a telecommunications company seeking to predict and prevent customer churn.

The Dataset

500 customer records with 19 features:

  • Demographics: Age, Gender

  • Account: Tenure, Contract Type, Payment Method

  • Services: Internet, Streaming, Online Security, Tech Support

  • Engagement: Support Calls, Customer Satisfaction Score

  • Target: Churn (0 = Active, 1 = Churned)

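Churn rate: 45.6%, so the classes are nearly balanced. For reference, always predicting "Active" would score about 54% accuracy, which is the baseline any model here has to beat.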

Phase 1: EDA Insights

Key churn patterns from the data:

  • Month-to-month contracts had significantly higher churn than one or two-year commitments

  • Fiber optic customers churned more, likely reflecting a price-vs-value perception gap

  • Electronic check users showed elevated churn, suggesting billing friction

  • Two-year contract holders and bank transfer/mailed check payers were the most stable

Gender analysis showed females retained at a higher rate; males had a smaller gap between churn and no-churn counts.

Phase 2: Preprocessing Pipeline

A ColumnTransformer + Pipeline approach handled encoding and scaling in a clean, reproducible way:

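from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale numeric columns, one-hot encode categoricals, pass binary flags through
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features),
    ('bin', 'passthrough', binary_features)
])

pipeline_log = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])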

Pipelines are production-friendly: they prevent data leakage (fit only on training data, transform test data), and they make deployment straightforward.
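The leakage protection is easiest to see under cross-validation, where the whole pipeline is refit inside each fold. The sketch below mirrors the structure above on synthetic data. The column names and the random target are assumptions for illustration, so the scores themselves are meaningless.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are illustrative, not the real schema.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "support_calls": rng.integers(0, 8, size=n),
    "contract_type": rng.choice(
        ["Month-to-month", "One year", "Two year"], size=n
    ),
    "online_security": rng.integers(0, 2, size=n),
})
y = rng.integers(0, 2, size=n)  # random labels, for demonstration only

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure", "support_calls"]),
    ("cat", OneHotEncoder(drop="first"), ["contract_type"]),
    ("bin", "passthrough", ["online_security"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])

# In every fold, the scaler and encoder are fit on that fold's training rows
# only, then applied to the held-out rows.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="recall")
print(scores.round(2))
```

Scaling the full dataset before splitting would let test-set statistics influence training; passing the unfitted pipeline to `cross_val_score` avoids that by construction.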

Phase 3: Model Building and Comparison

Five models were built and evaluated:

| Model | Accuracy | Churn Recall | Churn F1 |
| --- | --- | --- | --- |
| Logistic Regression (Baseline) | 0.63 | 0.52 | 0.56 |
| Logistic Regression (GridSearchCV) | 0.58 | 0.61 | 0.57 |
| LR (Threshold = 0.3) | 0.53 | 0.85 | 0.61 |
| Random Forest | 0.57 | 0.46 | 0.49 |
| Decision Tree | 0.56 | 0.54 | 0.53 |

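The threshold trick: Instead of tuning model parameters, adjusting the decision threshold from 0.5 to 0.3 had the most dramatic impact. The model became much more sensitive to churn signals, achieving 85% recall at the cost of lower precision and lower accuracy (0.53 vs 0.63 for the baseline). In practice the threshold is best chosen on a validation split rather than the test set, so that the reported recall stays an honest estimate.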

threshold = 0.3
Y_pred_thresh = (pipeline_log.predict_proba(X_test)[:, 1] >= threshold).astype(int)
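To see what the threshold does mechanically, the sketch below sweeps a few thresholds over a tiny set of labels and predicted probabilities. The ten data points are hypothetical and hand-picked for readability; they are not outputs of `pipeline_log`.

```python
def recall_precision_at(y_true, probs, threshold):
    """Recall and precision for the positive class at a given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical churn labels (1 = churned) and predicted churn probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
probs = [0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.20, 0.10, 0.25, 0.32]

results = {}
for threshold in (0.5, 0.4, 0.3):
    results[threshold] = recall_precision_at(y_true, probs, threshold)
    recall, precision = results[threshold]
    print(f"threshold = {threshold}: recall = {recall:.2f}, precision = {precision:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never goes down, while precision usually falls as more borderline customers get flagged.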

Why this matters for business: In churn prediction, the asymmetry of error costs is critical. Missing a churner (false negative) means losing a customer entirely, a costly outcome. Incorrectly flagging a loyal customer (false positive) means sending them an unnecessary retention offer, a minor cost. Lowering the threshold is a deliberate business tradeoff, not a model failure.
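That asymmetry can be turned into a number by pricing each error type. In the sketch below both the per-error costs and the error counts are hypothetical placeholders; real values would come from finance and from the model's actual confusion matrices.

```python
def total_error_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total cost of a model's mistakes given a price for each error type."""
    return false_negatives * cost_fn + false_positives * cost_fp


# Hypothetical business costs (assumptions, not figures from the project).
COST_MISSED_CHURNER = 500.0  # revenue lost when a churner is not flagged
COST_WASTED_OFFER = 20.0     # retention offer sent to a loyal customer

# Hypothetical error counts at two thresholds.
scenarios = {
    0.5: {"fn": 30, "fp": 10},
    0.3: {"fn": 10, "fp": 45},
}

costs = {}
for threshold, errors in scenarios.items():
    costs[threshold] = total_error_cost(
        errors["fn"], errors["fp"], COST_MISSED_CHURNER, COST_WASTED_OFFER
    )
    print(f"threshold = {threshold}: total error cost = {costs[threshold]:,.0f}")
```

Under these assumed prices the lower threshold is cheaper despite producing more errors overall. If a wasted offer cost nearly as much as a lost customer, the conclusion would flip.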

Phase 4: Feature Importance

Using Logistic Regression coefficients to interpret drivers:

| Feature | Coefficient | Direction |
| --- | --- | --- |
| Two-Year Contract | -1.317 | Reduces churn (strongest signal) |
| Streaming Movies | -0.588 | Reduces churn |
| Fiber Optic Internet | +0.427 | Increases churn |
| Credit Card Payment | +0.368 | Increases churn |
| Customer Satisfaction Score | -0.323 | Higher score = less churn |
| Support Calls | +0.251 | More calls = more churn |

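The negative coefficient on Two-Year contracts is the standout: it is the largest single driver by magnitude. Customers on longer commitments are far less likely to churn, which makes contract length the most promising lever for retention. These coefficients describe associations in the data rather than proven causes, so the effect of moving a customer onto a longer contract is best confirmed with an experiment.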
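Logistic regression coefficients are changes in log-odds, so exponentiating them gives odds ratios that are easier to explain to a business audience. The sketch below uses the coefficients from the table above. Keep in mind that the numeric features were standardized in the pipeline, so for those a "unit" is one standard deviation, and one-hot features are measured against the dropped reference category.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "Two-Year Contract": -1.317,
    "Streaming Movies": -0.588,
    "Fiber Optic Internet": 0.427,
    "Credit Card Payment": 0.368,
    "Customer Satisfaction Score": -0.323,
    "Support Calls": 0.251,
}

# exp(coefficient) = multiplicative change in the odds of churning.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    effect = "lowers" if ratio < 1 else "raises"
    print(f"{name:<28} odds ratio = {ratio:.2f} ({effect} churn odds)")
```

An odds ratio below 1 means lower churn odds relative to the reference, holding the other features fixed; it is a statement about odds, not about churn probability directly.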

Phase 5: Business Recommendations

Six targeted actions emerged directly from the model insights:

1. Push Two-Year Contract Upgrades: Month-to-month customers are the highest churn risk. Offer discounted upgrades with loyalty perks. Expected impact: significant reduction in price-sensitive churners.

2. Address Fiber Optic Dissatisfaction: Fiber customers churn more, likely because of a perceived value gap. Launch targeted satisfaction surveys and offer loyalty discounts or service quality improvements.

3. Migrate Electronic Check Users to Auto-Pay: Billing friction drives churn. Incentivize auto-pay adoption with a small monthly discount, favoring bank transfer, since credit card payment also carries a positive churn coefficient in the table above.

4. Proactively Engage Low-Satisfaction Customers: Customer Satisfaction Score is among the strongest churn predictors. Trigger retention outreach when satisfaction scores drop below 3/5, before the decision to leave is made.

5. Bundle Tech Support and Online Security: Customers without these add-ons show higher churn. Offer free trials or discounted bundles to increase product attachment.

6. Proactive Callback for High-Support Customers: Customers with 3+ support calls are frustrated. Implement a dedicated outreach program for this segment, offering account managers and service credits.

Phase 6: Implementation Plan

A production deployment would follow this structure:

  • Retraining: Quarterly, plus trigger-based retraining if recall or AUC drops below thresholds on live data

  • Monitoring metrics: Churn recall, precision, F1, AUC-ROC monthly; data drift detection on feature distributions

  • Business impact measurement: A/B testing (treatment vs. control groups), revenue retained calculation, campaign ROI tracking

  • Next steps: Collect more data (1,000+ records), engineer time-series behavioral features, explore XGBoost and LightGBM for improved accuracy
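As a rough sketch of how the retraining trigger and drift check could fit together, the code below combines metric floors with a Population Stability Index (PSI) per feature. PSI is one common drift measure and is a choice made here, not something the plan specifies. The metric floors, the 0.2 PSI limit (a widely used rule of thumb), and all numbers are assumptions.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two lists of bin proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against log(0)
        total += (a - e) * math.log(a / e)
    return total


def retraining_reasons(live_metrics, minimums, drift_scores, psi_limit=0.2):
    """Return the reasons (if any) that would trigger retraining."""
    reasons = [
        f"{name} below {floor}"
        for name, floor in minimums.items()
        if live_metrics[name] < floor
    ]
    reasons += [
        f"drift in {feature}"
        for feature, score in drift_scores.items()
        if score > psi_limit
    ]
    return reasons


# Hypothetical share of customers per tenure bin: training data vs. live data.
training_bins = [0.25, 0.25, 0.25, 0.25]
live_bins = [0.10, 0.20, 0.30, 0.40]
drift = {"tenure": psi(training_bins, live_bins)}

reasons = retraining_reasons(
    live_metrics={"recall": 0.78, "auc": 0.66},  # hypothetical live metrics
    minimums={"recall": 0.80, "auc": 0.60},      # hypothetical floors
    drift_scores=drift,
)
print(f"PSI(tenure) = {drift['tenure']:.3f}")
print("Retrain:", reasons if reasons else "not needed")
```

A check like this would run on the monthly monitoring cadence, with the quarterly retrain as the fallback when nothing triggers earlier.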


The Week's Biggest Lessons

1. Simpler models often win when the signal is linear. Linear Regression outperformed Random Forest on house prices. Never assume complexity equals performance.

2. Class imbalance handling requires judgment, not automation. class_weight='balanced' cut recall and accuracy on the marketing model without improving AUC. Always check whether balancing actually improves the metrics that matter for your use case.

3. Threshold tuning is an underrated lever. Adjusting the decision threshold from 0.5 to 0.3 achieved 85% recall on churn without any architectural changes. Sometimes the right tool is already in your pipeline; you just need to calibrate it.

4. Feature importance is a business conversation starter. Credit Score, Previous Defaults, and Debt-to-Income ratio drove credit risk. Two-Year contracts and Customer Satisfaction drove churn. These aren't just model outputs; they're strategic insights that tell retention, risk, and product teams where to focus.

5. Pipelines are the bridge between experimentation and production. Wrapping preprocessing and modeling in Pipeline objects prevents data leakage, simplifies deployment, and makes the workflow reproducible. It's a habit worth building from day one.


Final Thoughts

Week 16 didn't just add more models to the toolkit; it forced a shift in thinking from "build a model" to "solve a business problem." Every metric decision, threshold choice, and feature insight was anchored to a real-world consequence: reducing churn, flagging credit risk, catching spam, or understanding what drives house prices.

The next stage involves deploying these insights beyond notebooks, into APIs, dashboards, and real-time scoring systems. That's where the real work begins.


If you found this useful, follow along for more weekly deep-dives into data science at DataraFlow.