From Forests to Funnels: A Complete ML Workflow Across Regression, Classification, and Churn Prediction

Week 16 of my data science internship at DataraFlow pushed everything to a new level. We moved beyond individual model building into end-to-end machine learning workflows, complete with multi-model comparisons, class imbalance handling, threshold tuning, and production deployment strategy. Here's how it all unfolded.
Overview
This week was structured in three layers of increasing complexity:
Part 1 (Tasks): Foundational model building - Random Forest regression, comprehensive metric evaluation, and binary spam classification.
Part 2 (Assignments): Comparative analysis across multiple models for house price prediction, imbalanced marketing campaign classification, and multi-class credit risk modeling.
Part 3 (Assessment): A full end-to-end churn prediction project with preprocessing pipelines, hyperparameter tuning, threshold optimization, feature importance analysis, and business recommendations.
Let's get into it.
Part 1 — The Foundations
Task 1: Random Forest Regression - Crop Yield Prediction
The first task introduced Random Forest Regression using a simple two-column dataset with a single Feature and a Target representing crop yield values.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X = rf_data[['Feature']]
Y = rf_data['Target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, Y_train)
Y_pred = rf_model.predict(X_test)
print(f'R² Score: {r2_score(Y_test, Y_pred):.3f}') # R² = 0.878
Result: R² = 0.878, a strong result for a dataset of only 20 points and a single predictor, although a 20% test split leaves just four test samples, so the score is a noisy estimate. The feature importance confirmed that Feature was the sole driver (importance = 1.0), which is guaranteed in a univariate setup.
Key takeaway: Even on minimal data, Random Forest can capture non-linear relationships effectively, though with very few samples the model's generalizability should always be questioned.
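One way to question generalizability on a dataset this small is k-fold cross-validation, which reports a spread of scores instead of a single split. The sketch below is a minimal illustration on synthetic data; the 20 generated points, the noise level, and the fold count are assumptions, not the internship dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a 20-row, single-feature dataset (assumed, not the real data)
rng = np.random.default_rng(42)
X = np.linspace(1, 20, 20).reshape(-1, 1)
y = 5 * np.sqrt(X.ravel()) + rng.normal(0, 0.5, size=20)

model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

# 5 folds of 4 samples each; MAE is used because R² is unstable on 4-point folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
fold_mae = -neg_mae

print(f"MAE per fold: {np.round(fold_mae, 3)}")
print(f"Mean MAE: {fold_mae.mean():.3f} (std {fold_mae.std():.3f})")
```

If the per-fold errors vary widely, a single train/test score should be read with caution.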
Task 2: Comprehensive Model Evaluation - Salary Prediction
This task shifted focus from building to measuring, requiring all five standard regression metrics on a 3-feature salary dataset (Experience, Training Hours, Previous Projects).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
salary_rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
salary_rf_model.fit(X_train, Y_train)
sal_Y_pred = salary_rf_model.predict(X_test)
r2 = r2_score(Y_test, sal_Y_pred)
n, k = X_test.shape  # n = test samples, k = number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
mae = mean_absolute_error(Y_test, sal_Y_pred)
rmse = np.sqrt(mean_squared_error(Y_test, sal_Y_pred))
| Metric | Value |
|---|---|
| R² Score | 0.9912 |
| Adjusted R² | 0.9890 |
| MAE | 2.1513 |
| RMSE | 2.5821 |
Near-perfect scores across the board. The model correctly learned that salary scales predictably with experience, training hours, and prior project exposure. The actual vs. predicted scatter plot showed points clustering tightly around the ideal diagonal line, a visual confirmation of model quality.
Why Adjusted R² matters: Unlike R², Adjusted R² penalizes unnecessary features. When you have multiple predictors, Adjusted R² tells the truer story by adjusting for the number of variables used.
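To make the penalty concrete, here is a small sketch of the same formula used in the task code. The R² value, sample size, and feature counts are hypothetical and chosen only to show the effect of adding predictors.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw R², same sample size, different feature counts (hypothetical values)
few_features = adjusted_r2(0.90, n_samples=30, n_features=3)
many_features = adjusted_r2(0.90, n_samples=30, n_features=10)

print(f"k=3:  {few_features:.4f}")
print(f"k=10: {many_features:.4f}")
```

With the raw R² held at 0.90, going from 3 to 10 features lowers the adjusted value from roughly 0.888 to roughly 0.847, so extra features have to earn their place.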
Task 3: Binary Classification - Spam Detection
For spam detection, Logistic Regression was applied to a 100-sample dataset of emails with features like word count, link count, sender reputation, capital ratio, and exclamation count.
spamLog_reg = LogisticRegression(max_iter=1000, random_state=42)
spamLog_reg.fit(X_train, Y_train)
spamY_pred = spamLog_reg.predict(X_test)
Results: Accuracy, Precision, Recall, and F1-Score all returned 1.0, a perfect classifier on the test set.
While this sounds almost too good, the dataset was synthetically generated and well-separated, so a perfect score says more about the data than about the model. The confusion matrix heatmap showed zero misclassifications across both classes.
Interpreting the metrics: In spam detection, precision tends to be the more critical metric. A false positive (a legitimate email classified as spam) is usually costlier than a false negative (spam slipping through), because losing an important work email to a spam folder has real consequences. A high F1-score confirms that precision and recall are both strong, which is the ideal outcome.
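The relationship between these metrics is easiest to see from raw confusion-matrix counts. The following sketch uses hypothetical counts (not the perfect result reported here) to show how false positives pull down precision while false negatives pull down recall.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical spam filter: 2 legitimate emails wrongly flagged, 8 spam emails missed
metrics = classification_metrics(tp=40, fp=2, fn=8, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case precision (about 0.95) stays high because few legitimate emails are flagged, even though recall (about 0.83) shows some spam getting through.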
Part 2 — Deeper Analysis and Comparisons
Assignment 1: Comparative Regression - House Price Prediction
This assignment compared three regression approaches on a 150-sample house price dataset with features including Square Footage, Bedrooms, Bathrooms, Age, Distance to City, Garage, and Pool.
EDA Insights:
The price distribution was approximately normal, peaking in the 310–350 range. The correlation heatmap told a clean story:
Square_Feet dominated with a correlation of 0.80
Age showed a moderate negative correlation of -0.33 (older = cheaper)
Distance_to_City had a mild negative effect of -0.17
Other features like bedrooms, bathrooms, and amenities offered incremental contributions.
Model Comparison:
# Three models trained and evaluated
lin_reg = LinearRegression()
house_dtree_reg = DecisionTreeRegressor(max_depth=10, random_state=42)
house_rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
| Model | R² | Adj. R² | MAE | RMSE |
|---|---|---|---|---|
| Linear Regression | 0.894 | 0.861 | 27.721 | 32.505 |
| Random Forest | 0.862 | 0.818 | 31.848 | 37.137 |
| Decision Tree | 0.563 | 0.424 | 53.913 | 66.076 |
The Surprising Winner: Linear Regression outperformed both ensemble methods.
This is the "when simpler is better" lesson in action. When the dominant relationship in the data is close to linear (as Square Footage vs. Price clearly is), adding tree complexity doesn't help: trees approximate a straight line with piecewise-constant steps and cannot extrapolate beyond the training range, which adds error on a 150-sample dataset. Random Forest shines when relationships are non-linear and feature interactions are complex. Here, they weren't.
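The effect is easy to reproduce on synthetic data. This is a minimal sketch under stated assumptions: a single feature, a purely linear signal with Gaussian noise, and an unpruned decision tree; none of it is the house-price dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, purely linear signal plus noise (assumed data, not the house-price set)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # unpruned tree

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_tree = r2_score(y_test, tree.predict(X_test))
print(f"Linear R²: {r2_linear:.3f}")
print(f"Tree R²:   {r2_tree:.3f}")
```

The tree memorizes the noise in individual training points, so its test error ends up higher than the linear model's even though it is the more flexible learner.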
Feature Importance (Random Forest):
| Feature | Importance |
|---|---|
| Square_Feet | 0.681 |
| Age | 0.161 |
| Distance_to_City | 0.063 |
| Bathrooms | 0.038 |
| Bedrooms | 0.028 |
| Pool | 0.018 |
| Garage | 0.011 |
Square footage alone accounts for 68% of the Random Forest's impurity-based importance. Size is king.
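A ranking like the one above can be read directly from a fitted forest. The sketch below uses synthetic data with hypothetical coefficients in which size dominates price; only the `feature_importances_` attribute and the ranking pattern carry over to the real analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which size drives price far more than age (assumed, for illustration)
rng = np.random.default_rng(1)
n = 300
square_feet = rng.uniform(800, 3500, n)
age = rng.uniform(0, 50, n)
noise_feature = rng.normal(0, 1, n)  # unrelated to the target
price = 0.15 * square_feet - 1.0 * age + rng.normal(0, 10, n)

X = np.column_stack([square_feet, age, noise_feature])
feature_names = ["Square_Feet", "Age", "Noise"]

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, price)

# feature_importances_ are impurity-based and normalized to sum to 1
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<12} {importance:.3f}")
```

Because the values are normalized shares of impurity reduction, they describe how the forest split the data rather than a causal effect on price.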
Assignment 2: Binary Classification - Marketing Campaign Conversion
This assignment tackled class imbalance head-on. The dataset of 300 customers had a striking imbalance: 87% responded (Class 1) vs 13% did not respond (Class 0).
Two Logistic Regression models were compared:
# Model A: Default
log_modelA = LogisticRegression(random_state=42, max_iter=1000)
# Model B: Class-balanced
log_modelB = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
Results Comparison:
| Metric | Model A (Default) | Model B (Balanced) |
|---|---|---|
| Accuracy | 85.33% | 64.00% |
| Precision | 92.75% | 95.74% |
| Recall | 91.43% | 64.29% |
| F1-Score | 92.09% | 76.92% |
| AUC | 0.637 | 0.631 |
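Key Insight: class_weight='balanced' reweights classes inversely to their frequency and pays off mainly under severe imbalance. With an 87/13 split, the imbalance is real but not extreme, and forcing balance caused the model to sacrifice recall significantly (91.43% down to 64.29%). Model A benefits from the natural class distribution and performs better on accuracy, recall, and F1; Model B is ahead only on precision, and AUC is essentially tied. An AUC near 0.63 for both models also signals modest ranking ability, so Model A's high accuracy largely reflects the class distribution.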
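It helps to see what 'balanced' actually does to the loss. scikit-learn documents the heuristic as n_samples / (n_classes * bincount(y)); the sketch below applies it to the full 300 rows at the 87/13 split for simplicity (the real weights would be computed on the training split, whose counts may differ slightly).

```python
import numpy as np

# 300 customers at the 87/13 split: 261 responders (1), 39 non-responders (0)
y = np.array([1] * 261 + [0] * 39)

# scikit-learn's documented 'balanced' heuristic: n_samples / (n_classes * bincount(y))
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)

for label, (count, weight) in enumerate(zip(counts, weights)):
    print(f"class {label}: count={count}, weight={weight:.3f}")
```

Each non-responder ends up weighted about 6.7 times as heavily as a responder, which explains why the balanced model predicts "no response" far more often and loses recall on the responder class.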
Exploratory boxplots revealed that responders tended to be younger, higher earners, with longer membership, more prior purchases, and stronger digital engagement (more email opens, more website visits). These are the segments to target with personalized marketing.
Business Recommendation: Deploy Model A for campaign targeting. Focus on younger, digitally active, high-income customers with established membership history.
Assignment 3: Multi-Class Classification - Credit Risk Prediction
Three risk levels were predicted (Low, Medium, High) across 400 customer applications using features like Credit Score, Income, Debt-to-Income ratio, Employment Years, and Previous Defaults.
Class Distribution:
Medium Risk: 40.5%
Low Risk: 32.0%
High Risk: 27.5%
Relatively balanced — no aggressive resampling needed.
Three classifiers were trained:
log_model = OneVsRestClassifier(LogisticRegression(random_state=42, max_iter=1000))
credit_dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
Accuracy Summary:
| Model | Overall Accuracy |
|---|---|
| Logistic Regression | 78.33% |
| Random Forest | 77.50% |
| Decision Tree | 75.00% |
All three models struggled most with the Medium Risk class, the transitional zone between Low and High where feature patterns overlap significantly.
Model choice depends on business priority:
Logistic Regression if High Risk detection is the non-negotiable priority (92% recall for High Risk)
Random Forest if balanced performance across all three risk levels is needed in production
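Choosing between models on this basis means looking at recall per class rather than overall accuracy. Here is a minimal sketch with hypothetical labels and predictions (not the actual model output) showing how a decent overall accuracy can hide a weak Medium class.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted / actual members of the class."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / actual[label] for label in actual}

# Hypothetical predictions for ten applications (not the real model output)
y_true = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "Medium", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "Medium", "Low", "High", "Medium", "High", "High", "High"]

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label, recall in recalls.items():
    print(f"{label:<6} recall: {recall:.2f}")
```

In practice `sklearn.metrics.classification_report` gives the same per-class breakdown along with precision and F1.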
Feature Importance (Random Forest):
| Feature | Importance |
|---|---|
| Credit_Score | 0.2046 |
| Previous_Defaults | 0.1926 |
| Debt_to_Income | 0.1644 |
| Credit_History_Length | 0.0953 |
| Income | 0.0947 |
| Age | 0.0898 |
| Employment_Years | 0.0818 |
| Loan_Amount | 0.0769 |
Credit behavior and repayment history dominate. The actual loan amount is the least predictive; it's not what you borrow but how reliably you've repaid in the past.
Part 3 — End-to-End Assessment: Customer Churn Prediction
This was the capstone: a full production-style ML workflow for a telecommunications company seeking to predict and prevent customer churn.
The Dataset
500 customer records with 19 features:
Demographics: Age, Gender
Account: Tenure, Contract Type, Payment Method
Services: Internet, Streaming, Online Security, Tech Support
Engagement: Support Calls, Customer Satisfaction Score
Target: Churn (0 = Active, 1 = Churned)
Churn rate: 45.6%, so the two classes are nearly balanced.
Phase 1: EDA Insights
Key churn patterns from the data:
Month-to-month contracts had significantly higher churn than one- or two-year commitments
Fiber optic customers churned more, likely because of a price-vs-value perception gap
Electronic check users showed elevated churn, suggesting billing friction
Two-year contract holders and bank transfer/mailed check payers were the most stable
Gender analysis showed females retained at a higher rate; males had a smaller gap between churn and no-churn counts.
Phase 2: Preprocessing Pipeline
A ColumnTransformer + Pipeline approach handled encoding and scaling in a clean, reproducible way:
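from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Scale numeric columns, one-hot encode categoricals, pass binary flags through
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features),
    ('bin', 'passthrough', binary_features)
])
# Chain preprocessing and the classifier so both are fit on training data only
pipeline_log = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))
])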
Pipelines are production-friendly: because the scaler and encoder are fit only on the training data and merely applied to the test data, they prevent data leakage, and they make deployment straightforward since a single object holds every step.
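The leakage point can be checked directly by inspecting what the scaler learned. The sketch below is a self-contained illustration on a tiny synthetic frame; the column names, values, and the deterministic 30/10 split are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame; column names and values are hypothetical
contracts = ["Month-to-month", "One year", "Two year"]
df = pd.DataFrame({
    "tenure": np.arange(1, 41),
    "contract": [contracts[i % 3] for i in range(40)],
    "senior": [i % 2 for i in range(40)],
})
df["churn"] = (df["contract"] == "Month-to-month").astype(int)

# Deterministic split so every contract type appears in the training rows
train, test = df.iloc[:30], df.iloc[30:]
features = ["tenure", "contract", "senior"]

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(drop="first"), ["contract"]),
    ("bin", "passthrough", ["senior"]),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
])
pipeline.fit(train[features], train["churn"])

# The scaler's mean comes from the 30 training rows only, not all 40 rows
scaler = pipeline.named_steps["preprocessor"].named_transformers_["num"]
print(f"scaler mean: {scaler.mean_[0]:.1f}, train mean: {train['tenure'].mean():.1f}, "
      f"full mean: {df['tenure'].mean():.1f}")
test_predictions = pipeline.predict(test[features])
```

The fitted scaler reports the training mean (15.5) rather than the full-data mean (20.5), which is exactly the separation that scaling the whole dataset up front would break.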
Phase 3: Model Building and Comparison
Five models were built and evaluated:
| Model | Accuracy | Churn Recall | Churn F1 |
|---|---|---|---|
| Logistic Regression (Baseline) | 0.63 | 0.52 | 0.56 |
| Logistic Regression (GridSearchCV) | 0.58 | 0.61 | 0.57 |
| LR (Threshold = 0.3) | 0.53 | 0.85 | 0.61 |
| Random Forest | 0.57 | 0.46 | 0.49 |
| Decision Tree | 0.56 | 0.54 | 0.53 |
The threshold trick: Instead of tuning model parameters, adjusting the decision threshold from 0.5 to 0.3 had the most dramatic impact. The model became much more sensitive to churn signals, achieving 85% recall at the cost of lower precision.
threshold = 0.3
Y_pred_thresh = (pipeline_log.predict_proba(X_test)[:, 1] >= threshold).astype(int)
Why this matters for business: In churn prediction, the asymmetry of error costs is critical. Missing a churner (false negative) means losing a customer entirely, which is a costly outcome. Incorrectly flagging a loyal customer (false positive) means sending them an unnecessary retention offer, which is a minor cost. Lowering the threshold is a deliberate business tradeoff, not a model failure.
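The tradeoff can be expressed as a simple cost comparison across thresholds. This is a sketch under loud assumptions: the probabilities, labels, and the two cost figures are invented for illustration and would need to come from the business in a real project.

```python
import numpy as np

# Hypothetical churn probabilities and true labels (1 = churned)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
proba = np.array([0.90, 0.60, 0.40, 0.35, 0.45, 0.20, 0.10, 0.55, 0.25, 0.32])

COST_MISSED_CHURNER = 500  # assumed cost of a false negative
COST_WASTED_OFFER = 50     # assumed cost of a false positive

def evaluate(threshold: float) -> dict:
    """Recall, precision, and total error cost at a given decision threshold."""
    y_pred = (proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "cost": fn * COST_MISSED_CHURNER + fp * COST_WASTED_OFFER,
    }

results = {t: evaluate(t) for t in (0.5, 0.3)}
for t, r in results.items():
    print(f"threshold={t}: recall={r['recall']:.2f}, "
          f"precision={r['precision']:.2f}, cost={r['cost']}")
```

On these toy numbers, dropping the threshold to 0.3 doubles recall (0.40 to 0.80), lowers precision, and still cuts the total error cost because missed churners are assumed to be ten times more expensive than wasted offers.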
Phase 4: Feature Importance
Using Logistic Regression coefficients to interpret drivers:
| Feature | Coefficient | Direction |
|---|---|---|
| Two-Year Contract | -1.317 | Reduces churn (strongest signal) |
| Streaming Movies | -0.588 | Reduces churn |
| Fiber Optic Internet | +0.427 | Increases churn |
| Credit Card Payment | +0.368 | Increases churn |
| Customer Satisfaction Score | -0.323 | Higher score = less churn |
| Support Calls | +0.251 | More calls = more churn |
Support Calls | +0.251 | More calls = more churn |
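The negative coefficient on Two-Year contracts is the standout: it's the largest single driver by magnitude. Locking customers into longer commitments is the most powerful lever for retention in this model. Because the categorical features were encoded with drop='first', each of their coefficients is measured relative to the dropped reference category.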
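Logistic regression coefficients are on the log-odds scale, so exponentiating them gives a more intuitive odds ratio. The sketch below applies that conversion to the coefficients reported in the table above; the interpretation assumes they come from the pipeline shown earlier.

```python
import math

# Coefficients as reported in the table above
coefficients = {
    "Two-Year Contract": -1.317,
    "Streaming Movies": -0.588,
    "Fiber Optic Internet": 0.427,
    "Credit Card Payment": 0.368,
    "Customer Satisfaction Score": -0.323,
    "Support Calls": 0.251,
}

# exp(coefficient) is the multiplicative change in the odds of churn
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}
for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print(f"{name:<28} odds ratio = {ratio:.3f}")
```

An odds ratio of about 0.27 for the two-year contract means the odds of churn are roughly 73% lower than for the dropped reference contract type, holding the other features fixed. For numeric features passed through StandardScaler, the ratio applies per one standard deviation rather than per raw unit.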
Phase 5: Business Recommendations
Six targeted actions emerged directly from the model insights:
1. Push Two-Year Contract Upgrades: Month-to-month customers are the highest churn risk. Offer discounted upgrades with loyalty perks. Expected impact: significant reduction in price-sensitive churners.
2. Address Fiber Optic Dissatisfaction: Fiber customers churn more, likely because of a perceived value gap. Launch targeted satisfaction surveys and offer loyalty discounts or service quality improvements.
3. Migrate Electronic Check Users to Auto-Pay: Billing friction appears to drive churn. Incentivize auto-pay adoption (bank transfer or credit card) with a small monthly discount.
4. Proactively Engage Low-Satisfaction Customers: Customer Satisfaction Score is among the strongest churn predictors. Trigger retention outreach when satisfaction scores drop below 3/5, before the decision to leave is made.
5. Bundle Tech Support and Online Security: Customers without these add-ons show higher churn. Offer free trials or discounted bundles to increase product attachment.
6. Proactive Callback for High-Support Customers: Customers with 3+ support calls are likely frustrated. Implement a dedicated outreach program for this segment, offering account managers and service credits.
Phase 6: Implementation Plan
A production deployment would follow this structure:
Retraining: Quarterly, plus trigger-based retraining if recall or AUC drops below thresholds on live data
Monitoring metrics: Churn recall, precision, F1, AUC-ROC monthly; data drift detection on feature distributions
Business impact measurement: A/B testing (treatment vs. control groups), revenue retained calculation, campaign ROI tracking
Next steps: Collect more data (1,000+ records), engineer time-series behavioral features, explore XGBoost and LightGBM for improved accuracy
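The trigger-based retraining rule in the plan above could be sketched as a small check that runs on each monitoring cycle. The threshold values, class name, and monthly readings below are hypothetical placeholders, not figures from the project.

```python
from dataclasses import dataclass

@dataclass
class MonitoringThresholds:
    """Minimum acceptable live metrics; the default values are illustrative assumptions."""
    min_recall: float = 0.75
    min_auc: float = 0.60

def needs_retraining(live_recall: float, live_auc: float,
                     thresholds: MonitoringThresholds = MonitoringThresholds()) -> bool:
    """Return True when either monitored metric falls below its floor."""
    return live_recall < thresholds.min_recall or live_auc < thresholds.min_auc

# Hypothetical monthly readings: (churn recall, AUC-ROC)
monthly_metrics = {"month_1": (0.84, 0.66), "month_2": (0.79, 0.63), "month_3": (0.71, 0.61)}
for month, (recall, auc) in monthly_metrics.items():
    flag = "RETRAIN" if needs_retraining(recall, auc) else "ok"
    print(f"{month}: recall={recall:.2f}, auc={auc:.2f} -> {flag}")
```

Keeping the floors in one small object makes them easy to review with the business and to adjust without touching the scoring code.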
The Week's Biggest Lessons
1. Simpler models often win when the signal is linear. Linear Regression outperformed Random Forest on house prices. Never assume complexity equals performance.
2. Class imbalance handling requires judgment, not automation. class_weight='balanced' hurt the marketing model because the imbalance wasn't severe enough. Always check whether balancing actually improves the metrics that matter for your use case.
3. Threshold tuning is an underrated lever. Adjusting the decision threshold from 0.5 to 0.3 achieved 85% recall on churn without any architectural changes. Sometimes the right tool is already in your pipeline; you just need to calibrate it.
4. Feature importance is a business conversation starter. Credit Score, Previous Defaults, and Debt-to-Income ratio drove credit risk. Two-Year contracts and Customer Satisfaction drove churn. These aren't just model outputs; they're strategic insights that tell retention, risk, and product teams where to focus.
5. Pipelines are the bridge between experimentation and production. Wrapping preprocessing and modeling in Pipeline objects prevents data leakage, simplifies deployment, and makes the workflow reproducible. It's a habit worth building from day one.
Final Thoughts
Week 16 didn't just add more models to the toolkit; it forced a shift in thinking from "build a model" to "solve a business problem." Every metric decision, threshold choice, and feature insight was anchored to a real-world consequence: reducing churn, flagging credit risk, catching spam, or understanding what drives house prices.
The next stage involves deploying these insights beyond notebooks, into APIs, dashboards, and real-time scoring systems. That's where the real work begins.
If you found this useful, follow along for more weekly deep-dives into data science at DataraFlow.


