🧹 Data Science Fundamentals: Data Preparation and Model Fitting
Before diving into fancy algorithms and complex models, there's something even more important: your data. As the saying goes, "garbage in, garbage out." Let's learn how to prepare your data properly and avoid common pitfalls.
The Data Preparation Pipeline
1. Data Cleaning Basics
import pandas as pd
import numpy as np
# Load your data
df = pd.read_csv('raw_data.csv')
# Basic cleaning steps
df = df.dropna()           # Drop rows with any missing values
df = df.drop_duplicates()  # Remove exact duplicate rows
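Dropping every row that has a missing value is the bluntest option and can discard a lot of data. Here is a minimal sketch of imputation as an alternative (the 'age' and 'income' column names are hypothetical):

from sklearn.impute import SimpleImputer

# Fill numeric gaps with the column median instead of dropping rows
imputer = SimpleImputer(strategy='median')
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])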
2. Understanding Deduplication
What is Deduplication?
Deduplication is removing redundant data points that might skew your model's learning:
- Exact Duplicates: Identical rows in your dataset
- Near Duplicates: Very similar entries that represent the same information
- Semantic Duplicates: Different representations of the same thing
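Exact duplicates are handled by drop_duplicates above, and near duplicates are tackled with fuzzy matching below. Semantic duplicates usually need domain knowledge; one simple approach is a hand-built canonical mapping (the city values here are made up for illustration):

# Map different spellings of the same entity to one canonical form
CANONICAL = {
    'nyc': 'new york city',
    'ny city': 'new york city',
    'new york': 'new york city',
}

def canonicalize(value):
    value = str(value).lower().strip()
    return CANONICAL.get(value, value)

df['city'] = df['city'].apply(canonicalize)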
Example: Customer Data Deduplication
def normalize_text(text):
    """Basic text normalization"""
    return str(text).lower().strip()

# Create normalized versions of key fields
df['name_normalized'] = df['name'].apply(normalize_text)
df['email_normalized'] = df['email'].apply(normalize_text)

# Find similar entries using fuzzy matching
from fuzzywuzzy import fuzz

def find_similar_entries(df, threshold=80):
    """Return index pairs whose normalized names exceed the similarity threshold."""
    similar_pairs = []
    # Pairwise comparison is O(n^2); fine for small tables,
    # but consider blocking or a dedupe library for large ones
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            similarity = fuzz.ratio(
                df.iloc[i]['name_normalized'],
                df.iloc[j]['name_normalized']
            )
            if similarity > threshold:
                similar_pairs.append((i, j, similarity))
    return similar_pairs
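A usage sketch: any pair scoring above the threshold becomes a candidate for manual review or merging.

pairs = find_similar_entries(df, threshold=85)
for i, j, score in pairs:
    print(f"{df.iloc[i]['name']} <-> {df.iloc[j]['name']} ({score})")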
3. Feature Scaling and Normalization
Features on very different scales can bias many models, especially distance-based ones like k-nearest neighbors:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select the numeric columns to scale
numeric_columns = df.select_dtypes(include=np.number).columns

# Standardization (mean=0, std=1)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[numeric_columns])

# Normalization (0-1 range)
normalizer = MinMaxScaler()
df_normalized = normalizer.fit_transform(df[numeric_columns])
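Note that fit_transform returns a plain NumPy array. To keep working with labeled columns, wrap the result back into a DataFrame:

df_scaled = pd.DataFrame(df_scaled, columns=numeric_columns, index=df.index)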
The Train-Test-Validation Split
Why Split Your Data?
Imagine studying for an exam using the actual test questions - that's cheating! Similarly, testing your model on training data doesn't prove it can handle new data.
The Right Way to Split
from sklearn.model_selection import train_test_split

# First split: keep 70% for training, hold out 30% temporarily
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: divide the held-out 30% into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")
The Golden Rule: No Leakage!
❌ Wrong:
# DON'T DO THIS
scaler.fit(X) # Fitting on all data
X_scaled = scaler.transform(X)
X_train, X_test = train_test_split(X_scaled, ...)
✅ Right:
# DO THIS
X_train, X_test = train_test_split(X, ...)
scaler.fit(X_train) # Fit only on training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
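scikit-learn's Pipeline bundles the scaler and the model so the scaler is only ever fit on training data, even inside cross-validation. A minimal sketch, assuming a classification task:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fit on the training fold of every CV split
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X_train, y_train, cv=5)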
Understanding Model Fit
1. Underfitting: When Your Model Is Too Simple
Think of it like trying to draw a circle using a single straight line.
Signs of Underfitting:
- Poor performance on training data
- Poor performance on test data
- Model is too simple for the problem
# Example: Linear model trying to fit non-linear data
from sklearn.linear_model import LinearRegression
model = LinearRegression() # Might underfit if data is non-linear
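One common remedy is to give a linear model richer features, for example polynomial terms. A sketch, assuming X_train holds numeric features:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# A linear model over squared and interaction terms can capture curvature
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)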
2. Just Right: The Goldilocks Zone
Your model learns the patterns without memorizing the noise.
Signs of Good Fit:
- Good performance on training data
- Similar performance on test data
- Model complexity matches problem complexity
# Example: Using cross-validation to verify
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Mean CV Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
3. Overfitting: When Your Model Memorizes
Like memorizing test answers without understanding the subject.
Signs of Overfitting:
- Excellent performance on training data
- Poor performance on test data
- Model is too complex
# Example: Decision tree without proper pruning
from sklearn.tree import DecisionTreeClassifier
# Likely to overfit
complex_tree = DecisionTreeClassifier(max_depth=None)
# Better balanced
balanced_tree = DecisionTreeClassifier(max_depth=3)
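A quick numeric check for overfitting is the gap between training and test scores:

# A large train-test gap is the telltale sign of overfitting
complex_tree.fit(X_train, y_train)
print(f"Train accuracy: {complex_tree.score(X_train, y_train):.3f}")
print(f"Test accuracy: {complex_tree.score(X_test, y_test):.3f}")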
Visual Diagnosis
Here's how to visualize model fit:
import matplotlib.pyplot as plt

def plot_learning_curves(model, X_train, X_val, y_train, y_val):
    train_sizes = np.linspace(0.1, 1.0, 10)
    train_scores = []
    val_scores = []
    for size in train_sizes:
        # Train on a growing subset of the training data
        subset_size = int(len(X_train) * size)
        model.fit(X_train[:subset_size], y_train[:subset_size])
        # Record scores on the subset and on the validation set
        train_scores.append(model.score(X_train[:subset_size],
                                        y_train[:subset_size]))
        val_scores.append(model.score(X_val, y_val))
    plt.plot(train_sizes, train_scores, 'b-', label='Training score')
    plt.plot(train_sizes, val_scores, 'r-', label='Validation score')
    plt.xlabel('Training set fraction')
    plt.ylabel('Score')
    plt.legend()
    plt.show()
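For example, to diagnose the pruned tree from the previous section:

plot_learning_curves(balanced_tree, X_train, X_val, y_train, y_val)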
Best Practices Checklist
- Data Preparation
  - Remove duplicates
  - Handle missing values
  - Scale/normalize features
  - Check for data leakage
- Data Splitting
  - Split before any preprocessing
  - Use stratification for imbalanced data (see the sketch after this checklist)
  - Keep test set untouched until final evaluation
- Model Fitting
  - Start simple
  - Use cross-validation
  - Monitor training and validation metrics
  - Apply regularization when needed
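As referenced in the checklist, stratification keeps class proportions consistent across splits, which matters for imbalanced classification targets:

# Preserve the class ratio of y in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)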
Next Steps
Now that you understand data preparation and model fitting, you might want to explore:
- Model Evaluation: Beyond Accuracy
- Cross-Validation and Overfitting
- Feature Engineering: The Art of Creating Better Data
Key Takeaways
- Clean and deduplicate your data before modeling
- Always split your data properly to avoid leakage
- Watch for signs of underfitting and overfitting
- Start simple and increase complexity only when needed
- Use cross-validation to verify your model's performance
Remember: A well-prepared dataset and proper validation strategy are often more important than choosing the "perfect" algorithm!