Machine Learning Workflows & ML Model Types — ML/MLOps Pipelines Deep Dive
Never Forget Another ML Step: Battle-Tested Memory Techniques for the Complete Data Science Pipeline.
📋 What You'll Master
- 11-Step ML Workflow: From data loading to model serving
- MLOps Pipeline: Production deployment & monitoring
- Code Structure: Clean, maintainable ML code
- Library-Specific: Pandas, Scikit-learn, TensorFlow, PyTorch
- Function Patterns: ETL, Training, Deployment, Monitoring
🎯 The Complete ML Workflow (11 Steps)
| Step | Purpose | Key Action |
|---|---|---|
| 1. LOAD DATA | Read dataset from file into memory | pd.read_csv() |
| 2. PREPROCESS | Clean data, handle missing values | df.dropna() |
| 3. SPLIT DATA | Separate training and testing sets | train_test_split() |
| 4. SCALE FEATURES | Normalize features (mean=0, std=1) | StandardScaler() |
| 5. INITIALIZE MODEL | Set up model with hyperparameters | RandomForestRegressor() |
| 6. TRAIN MODEL | Learn patterns from training data | model.fit() |
| 7. PREDICT | Apply learned patterns to new data | model.predict() |
| 8. EVALUATE | Measure model performance | r2_score() |
| 9. SAVE MODEL | Persist trained model to disk | joblib.dump() |
| 10. LOAD MODEL | Restore saved model from disk | joblib.load() |
| 11. SERVE PREDICTIONS | Deploy model as API for predictions | @app.route('/predict') |
🧠 MEGA MNEMONICS & MEMORY TRICKS
Primary Master Mnemonic (11 Steps)
"LAZY PROGRAMMERS SHOULD SKIP INTERNET, TRAIN PYTHON EVERYDAY SAVING LOADS OF SERVER POWER"
- Lazy = LOAD DATA
- Programmers = PREPROCESS
- Should = SPLIT DATA
- Skip = SCALE FEATURES
- Internet = INITIALIZE MODEL
- Train = TRAIN MODEL
- Python = PREDICT
- Everyday = EVALUATE
- Saving = SAVE MODEL
- Loads = LOAD MODEL
- Of Server Power = SERVE PREDICTIONS
Alternative Simpler Mnemonic
"LET'S PROMPTLY START SCIENCE: INVESTIGATE, TEACH, PREDICT, EVALUATE, STORE, LAUNCH, SERVE"
L Load → P Preprocess → S Split → S Scale → I Initialize → T Train → P Predict → E Evaluate → S Save → L Load → S Serve
🔢 INDIVIDUAL STEP MNEMONICS
1. LOAD DATA — "LOAD"
LOAD = Locate, Open, Arrange, Decode
- Locate file path
- Open with pandas/numpy
- Arrange in dataframe
- Decode column types
# L - Locate file path
# O - Open with pandas/numpy
# A - Arrange in dataframe
# D - Decode column types
df = pd.read_csv('data.csv')
Memory Trick: "Load the LOAD truck with data boxes"
2. PREPROCESS — "CLEAN-UP"
CLEAN-UP = Check, Look, Eliminate, Adjust, Nulls, Uniformize, Process
- Check for missing values
- Look at data types
- Eliminate duplicates
- Adjust outliers
- Nulls handling
- Uniformize formats
- Process encodings
df = df.dropna()              # drop rows with missing values, or:
df = df.fillna(df.mean())     # impute missing values with the mean
df = df.drop_duplicates()     # remove duplicate rows
Memory Trick: "Clean up your messy data room"
3. SPLIT DATA — "SPLIT"
SPLIT = Separate, Portions, Leave, Into, Train
- Separate X and y
- Portions (80/20 rule)
- Leave test untouched
- Into train/test sets
- Train gets the bigger portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Memory Trick: "Split the pizza into training slices and testing slices"
4. SCALE FEATURES — "SCALE"
SCALE = Standardize, Center, Adjust, Level, Equalize
- Standardize range
- Center mean to 0
- Adjust std to 1
- Level all features
- Equalize importance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Memory Trick: "Scale the mountain - make it flat and even"
Alternative: NORM = Normalize, Optimize, Range, Mean
5. INITIALIZE MODEL — "SETUP"
SETUP = Select, Establish, Tune, Understand, Parameters
- Select algorithm
- Establish architecture
- Tune hyperparameters
- Understand defaults
- Parameters configuration
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
Memory Trick: "Set up your model like setting up a tent"
6. TRAIN MODEL — "TRAIN"
TRAIN = Teach, Repeat, Adjust, Iterate, Numbers
- Teach from data
- Repeat epochs
- Adjust weights
- Iterate batches
- Numbers converge
model.fit(X_train, y_train)
Memory Trick: "Train the model like training a dog - repeat until learned"
Alternative: FIT = Feed, Iterate, Transform
7. PREDICT — "PREDICT"
PREDICT = Process, Run, Extract, Determine, Infer, Calculate, Transform
- Process new data
- Run through model
- Extract patterns
- Determine output
- Infer results
- Calculate probabilities
- Transform to predictions
y_pred = model.predict(X_test)
Memory Trick: "Predict the future with your crystal ball model"
8. EVALUATE — "ASSESS"
ASSESS = Analyze, Score, Statistics, Errors, Summarize, Success
- Analyze performance
- Score predictions
- Statistics calculation
- Errors measurement
- Summarize results
- Success metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
Memory Trick: "Assess your student's (model's) test performance"
9. SAVE MODEL — "SAVE"
SAVE = Serialize, Archive, Version, Export
- Serialize object
- Archive to disk
- Version control
- Export artifacts
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
Memory Trick: "Save your game progress before closing"
10. LOAD MODEL — "LOAD"
LOAD = Locate, Open, Access, Deploy
- Locate saved file
- Open from disk
- Access model object
- Deploy for use
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')
Memory Trick: "Load your saved game to continue playing"
11. SERVE PREDICTIONS — "SERVE"
SERVE = Setup, Endpoint, Route, Validate, Expose
- Setup API server
- Endpoint creation
- Route requests
- Validate inputs
- Expose predictions
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    input_df = pd.DataFrame([data])   # wrap the JSON payload in a DataFrame
    predictions = model.predict(input_df)
    return jsonify({'predictions': predictions.tolist()})
Memory Trick: "Serve predictions like serving food at a restaurant"
🏗️ CODE STRUCTURE MNEMONICS
Python Module Structure — "DICFM"
"DICFM = Dick's Infamous Chocolate Fudge Makes"
| Letter | Component | Description |
|---|---|---|
| D | Docstring | Module description at the top |
| I | Imports | All library imports |
| C | Constants | Global constants and config |
| F | Functions | Function definitions |
| M | Main | if __name__ == '__main__': |
"""
D - Docstring (module description)
"""
# I - Imports
import pandas as pd
import numpy as np
# C - Constants
MAX_DEPTH = 10
RANDOM_STATE = 42
# F - Functions
def train_model(X, y):
"""Train the model"""
model = RandomForestRegressor()
model.fit(X, y)
return model
# M - Main execution
if __name__ == '__main__':
df = pd.read_csv('data.csv')
model = train_model(X, y)
Function Structure — "DAPRO"
DAPRO = Docstring, Arguments, Process, Return, Output
def function_name(args):
    """
    D - Docstring (what, args, returns)
    """
    # A - Arguments validation
    if args is None:
        raise ValueError("Args cannot be None")
    # P - Process/logic
    result = process(args)
    # R - Return statement
    return result
    # O - Output (logged/printed at the call site)
Class Structure — "DIMPF"
DIMPF = Docstring, Init, Methods, Properties, Friends
class ModelPipeline:
    """
    D - Docstring (class purpose)
    """
    # I - Init method
    def __init__(self):
        self.model = None
        self.scaler = None
    # M - Methods
    def train(self, X, y):
        """Train the model"""
        self.model.fit(X, y)
    # P - Properties
    @property
    def is_trained(self):
        return self.model is not None
    # F - Friends (helper methods)
    def _helper_method(self):
        pass
🔧 MLOPS FUNCTION MNEMONICS
Complete MLOps Pipeline — "DTVMDR"
"DTVMDR = Don't Trust Very Much During Retirement"
| Letter | Pipeline Stage | Purpose |
|---|---|---|
| D | Data Pipeline | Extract, Transform, Load data |
| T | Train Pipeline | Model training and tracking |
| V | Validate Pipeline | Model validation and testing |
| M | Monitor Pipeline | Performance monitoring and alerts |
| D | Deploy Pipeline | Production deployment |
| R | Retrain Pipeline | Automated retraining triggers |
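To see how the stages connect, here is a minimal driver sketch. Every `*_stage` function below is an illustrative stub, not real pipeline code; note that the mnemonic lists Monitor before Deploy, while in practice monitoring runs after deployment:
# Illustrative DTVMDR driver; each *_stage function is a placeholder stub.
def data_stage(source):            # D - Data: ETL
    return f"clean data from {source}"

def train_stage(data):             # T - Train: fit and track
    return "trained model"

def validate_stage(model, data):   # V - Validate: quality gate
    return True

def deploy_stage(model):           # D - Deploy: ship to production
    print(f"deployed: {model}")

def monitor_stage(model):          # M - Monitor: True means drift detected
    return False

def retrain_stage(data):           # R - Retrain: trigger a fresh training run
    return train_stage(data)

def run_pipeline(source):
    data = data_stage(source)
    model = train_stage(data)
    if not validate_stage(model, data):
        raise RuntimeError("validation failed - not deploying")
    deploy_stage(model)
    if monitor_stage(model):       # in production this check runs continuously
        model = retrain_stage(data)
    return model

run_pipeline('data.csv')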
Data Pipeline Functions — "ETL"
ETL = Extract, Transform, Load
- Extract: Pull data from sources (APIs, databases, files)
- Transform: Clean, process, and feature engineer
- Load: Store processed data in destination
# E - Extract
def extract_data(source):
    return pd.read_csv(source)
# T - Transform
def transform_data(raw_data):
    data = raw_data.dropna()
    return data.drop_duplicates()
# L - Load
def load_data(processed_data, destination):
    processed_data.to_parquet(destination)
Training Pipeline Functions — "FETV"
FETV = Fit, Evaluate, Track, Version
- Fit: Train the model on data
- Evaluate: Measure performance metrics
- Track: Log experiments with MLflow
- Version: Save model with version tags
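A minimal FETV sketch using scikit-learn and MLflow (assumes `mlflow` is installed; the toy dataset stands in for your real data):
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data so the sketch runs end to end
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # F - Fit: train the model on data
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # E - Evaluate: measure performance metrics
    acc = accuracy_score(y_test, model.predict(X_test))
    # T - Track: log the experiment with MLflow
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", acc)
    # V - Version: tag the run and save the model
    mlflow.set_tag("version", "v1")
    mlflow.sklearn.log_model(model, "model")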
Deployment Pipeline Functions — "BTVD"
BTVD = Build, Test, Validate, Deploy
- Build: Create Docker container
- Test: Run unit and integration tests
- Validate: Check in staging environment
- Deploy: Push to production
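A minimal BTVD sketch that shells out to Docker and pytest (assumes both are installed; the image name and `tests/` path are illustrative placeholders):
import subprocess

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

IMAGE = "my-registry/ml-api:v1"   # hypothetical image name

# B - Build: create the Docker container image
run(f"docker build -t {IMAGE} .")
# T - Test: run unit and integration tests
run("pytest tests/")
# V - Validate: smoke-test the image before promoting it
run(f"docker run --rm {IMAGE} python -c 'import joblib; joblib.load(\"model.pkl\")'")
# D - Deploy: push to the production registry
run(f"docker push {IMAGE}")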
Monitoring Functions — "MADR"
MADR = Measure, Alert, Diagnose, Respond
- Measure: Collect metrics (latency, accuracy, drift)
- Alert: Send notifications on anomalies
- Diagnose: Investigate issues and root causes
- Respond: Trigger retraining or rollback
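A minimal MADR sketch for a single numeric feature, using a Kolmogorov-Smirnov test from SciPy to measure drift (the threshold and the simulated data are illustrative):
import numpy as np
from scipy.stats import ks_2samp

def monitor(train_feature, live_feature, p_threshold=0.01):
    # M - Measure: compare live data against the training distribution
    stat, p_value = ks_2samp(train_feature, live_feature)
    # A - Alert: notify on significant drift
    if p_value < p_threshold:
        print(f"ALERT: drift detected (KS={stat:.3f}, p={p_value:.4f})")
        # D - Diagnose: log summary stats to investigate the root cause
        print(f"train mean={np.mean(train_feature):.2f}, "
              f"live mean={np.mean(live_feature):.2f}")
        # R - Respond: trigger retraining or rollback here
        return True
    return False

# Example: simulate a shifted live distribution
rng = np.random.default_rng(42)
monitor(rng.normal(0, 1, 1000), rng.normal(0.5, 1, 1000))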
📚 LIBRARY-SPECIFIC MNEMONICS
Pandas Operations — "CREAM"
CREAM = Create, Read, Explore, Aggregate, Modify
| Operation | Code |
|---|---|
| Create | df = pd.DataFrame(data) |
| Read | df = pd.read_csv('data.csv') |
| Explore | df.info(), df.describe() |
| Aggregate | df.groupby('col').agg('mean') |
| Modify | df['new'] = df['col'].apply(func) |
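The five CREAM operations in one runnable sketch; the `city`/`sales` columns are made up for illustration:
import pandas as pd

# C - Create
df = pd.DataFrame({"city": ["NY", "LA", "NY", "LA"],
                   "sales": [100, 80, 120, 90]})
# R - Read (from disk this would be: df = pd.read_csv('data.csv'))
# E - Explore
df.info()
print(df.describe())
# A - Aggregate
print(df.groupby("city").agg("mean"))
# M - Modify
df["sales_k"] = df["sales"].apply(lambda s: s / 1000)
print(df)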
Scikit-learn Workflow — "SPLIT-FIT-PREDICT-SCORE"
# SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y)
# FIT
model.fit(X_train, y_train)
# PREDICT
y_pred = model.predict(X_test)
# SCORE
score = model.score(X_test, y_test)
TensorFlow/Keras Workflow — "COMPILE-FIT-EVALUATE"
# COMPILE
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# FIT
model.fit(X_train, y_train, epochs=100)
# EVALUATE
model.evaluate(X_test, y_test)
PyTorch Workflow — "FORWARD-LOSS-BACKWARD-STEP"
# FORWARD
output = model(input)
# LOSS
loss = criterion(output, target)
# BACKWARD
loss.backward()
# STEP
optimizer.step()
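A self-contained version of this loop on toy tensors (a minimal sketch; the network, data, and epoch count are illustrative). Note the one step the mnemonic omits: optimizer.zero_grad() must clear gradients before each iteration:
import torch
import torch.nn as nn

torch.manual_seed(42)
X = torch.randn(64, 4)                    # toy inputs
y = torch.randint(0, 2, (64,)).float()    # toy binary targets

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(10):
    optimizer.zero_grad()           # clear gradients from the last step
    output = model(X).squeeze(1)    # FORWARD
    loss = criterion(output, y)     # LOSS
    loss.backward()                 # BACKWARD
    optimizer.step()                # STEP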
MLflow Tracking — "TRACK"
TRACK = Tag, Record, Artifacts, Checkpoint, Keep
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # T - Tag experiment
    mlflow.set_tag('version', 'v1')
    # R - Record parameters
    mlflow.log_param('n_estimators', 100)
    # A - Artifacts logging
    mlflow.log_artifact('model.pkl')
    # C - Checkpoint metrics
    mlflow.log_metric('accuracy', 0.95)
    # K - Keep model
    mlflow.sklearn.log_model(model, 'model')
🎴 QUICK REFERENCE CARDS
┌─────────────────────────────────────┐
│ ML WORKFLOW MNEMONIC │
├─────────────────────────────────────┤
│ LAZY PROGRAMMERS SHOULD SKIP │
│ INTERNET, TRAIN PYTHON EVERYDAY │
│ SAVING LOADS OF SERVER POWER │
├─────────────────────────────────────┤
│ L - LOAD DATA │
│ P - PREPROCESS │
│ S - SPLIT DATA │
│ S - SCALE FEATURES │
│ I - INITIALIZE MODEL │
│ T - TRAIN MODEL │
│ P - PREDICT │
│ E - EVALUATE │
│ S - SAVE MODEL │
│ L - LOAD MODEL │
│ O - SERVE (Online) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ MLOPS PIPELINE MNEMONIC │
├─────────────────────────────────────┤
│ DTVMDR = Don't Trust Very Much │
│ During Retirement │
├─────────────────────────────────────┤
│ D - DATA PIPELINE │
│ T - TRAIN PIPELINE │
│ V - VALIDATE PIPELINE │
│ M - MONITOR PIPELINE │
│ D - DEPLOY PIPELINE │
│ R - RETRAIN PIPELINE │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ CODE STRUCTURE MNEMONIC │
├─────────────────────────────────────┤
│ DICFM = Dick's Infamous Chocolate │
│ Fudge Makes │
├─────────────────────────────────────┤
│ D - DOCSTRING │
│ I - IMPORTS │
│ C - CONSTANTS │
│ F - FUNCTIONS │
│ M - MAIN │
└─────────────────────────────────────┘
💻 COMPLETE ML WORKFLOW - PYTHON TEMPLATE
Copy-paste-ready template following the 11-step workflow, built with pandas, scikit-learn, joblib, and Flask
Full End-to-End ML Pipeline Template
# ============================================
# COMPLETE ML WORKFLOW - 11 STEPS TEMPLATE
# Following: LAZY PROGRAMMERS SHOULD SKIP
# INTERNET, TRAIN PYTHON EVERYDAY
# SAVING LOADS OF SERVER POWER
# ============================================
# IMPORTS - All Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
from flask import Flask, request, jsonify
# ============================================
# STEP 1: LOAD DATA (L)
# ============================================
df = pd.read_csv('data.csv')
print(f"Dataset shape: {df.shape}")
# ============================================
# STEP 2: PREPROCESS (P)
# ============================================
# Handle missing values
df = df.dropna()
# Remove duplicates
df = df.drop_duplicates()
# Encode categorical variables
df = pd.get_dummies(df, drop_first=True)
# ============================================
# STEP 3: SPLIT DATA (S)
# ============================================
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ============================================
# STEP 4: SCALE FEATURES (S)
# ============================================
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# ============================================
# STEP 5: INITIALIZE MODEL (I)
# ============================================
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
# ============================================
# STEP 6: TRAIN MODEL (T)
# ============================================
model.fit(X_train_scaled, y_train)
print("Model trained!")
# ============================================
# STEP 7: PREDICT (P)
# ============================================
y_pred = model.predict(X_test_scaled)
# ============================================
# STEP 8: EVALUATE (E)
# ============================================
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
# ============================================
# STEP 9: SAVE MODEL (S)
# ============================================
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')
# ============================================
# STEP 10: LOAD MODEL (L)
# ============================================
model = joblib.load('model.pkl')
scaler = joblib.load('scaler.pkl')
# ============================================
# STEP 11: SERVE PREDICTIONS (S)
# ============================================
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # JSON keys must match the training columns (names and order)
    input_df = pd.DataFrame([data])
    input_scaled = scaler.transform(input_df)
    prediction = model.predict(input_scaled)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
💡 Quick Start Guide:
- Install: pip install pandas numpy scikit-learn flask joblib
- Replace 'data.csv' and 'target' with your own data
- Run: python ml_workflow.py
- Test API: curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"feature1": 1}'
📦 Step 1: Install Required Packages
Open your terminal and run:
pip3 install pandas numpy scikit-learn flask joblib
If you get permission errors, try:
pip3 install --user pandas numpy scikit-learn flask joblib
Verify installation:
python3 -c "import pandas; print('✓ Pandas:', pandas.__version__)"
python3 -c "import sklearn; print('✓ Scikit-learn:', sklearn.__version__)"
📊 Step 2: Create Dummy Dataset (data.csv)
Create a file named create_data.py and run it to generate the dummy dataset:
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Create dummy dataset - Loan Approval Prediction
n_samples = 1000
data = {
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.randint(20000, 150000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'loan_amount': np.random.randint(5000, 50000, n_samples),
    'employment_length': np.random.randint(0, 40, n_samples),
    'debt_to_income': np.random.uniform(0, 1, n_samples).round(2),
    'num_credit_lines': np.random.randint(1, 15, n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'home_ownership': np.random.choice(['Rent', 'Own', 'Mortgage'], n_samples),
}
# Create target: Loan approved (1) or rejected (0)
data['target'] = (
    (data['income'] > 60000) &
    (data['credit_score'] > 650) &
    (data['debt_to_income'] < 0.5)
).astype(int)
# Add some randomness (10% noise)
random_flip = np.random.random(n_samples) < 0.1
data['target'] = np.where(random_flip, 1 - data['target'], data['target'])
# Save to CSV
df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
print("✓ data.csv created successfully!")
print(f"✓ Shape: {df.shape}")
print(f"✓ Target distribution:")
print(df['target'].value_counts())
📌 Run this to generate data.csv:
python3 create_data.py
✓ Creates 1,000 loan applications with 9 features + 1 target
🚀 Step 3: Run the ML Workflow
Save the template code as ml_workflow.py and run:
python3 ml_workflow.py
Expected Output:
Dataset shape: (1000, 10)
Model trained!
Accuracy: 0.9850
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       160
           1       0.96      0.95      0.95        40

    accuracy                           0.99       200
* Serving Flask app 'ml_workflow'
* Running on http://0.0.0.0:5000
🧪 Step 4: Test the API
Open a new terminal (keep the Flask server running) and test:
Test 1: High-quality applicant (Expected: Approved ✓). Note: pd.get_dummies(drop_first=True) drops the alphabetically first category of each column ('Bachelor', 'Mortgage'), so those baselines are encoded as all-zero dummies.
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{
"age": 35,
"income": 80000,
"credit_score": 750,
"loan_amount": 25000,
"employment_length": 10,
"debt_to_income": 0.3,
"num_credit_lines": 5,
"education_Bachelor": 1,
"education_Master": 0,
"education_PhD": 0,
"home_ownership_Own": 1,
"home_ownership_Rent": 0
}'
Expected Response:
{"prediction": [1]}
Test 2: Low-quality applicant (Expected: Rejected ✗)
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{
"age": 25,
"income": 30000,
"credit_score": 550,
"loan_amount": 40000,
"employment_length": 2,
"debt_to_income": 0.8,
"num_credit_lines": 2,
"education_Bachelor": 0,
"education_Master": 0,
"education_PhD": 0,
"home_ownership_Own": 0,
"home_ownership_Rent": 1
}'
Expected Response:
{"prediction": [0]}
🔧 Troubleshooting Common Issues
| Error | Solution |
|---|---|
| ModuleNotFoundError: pandas | Run: pip3 install pandas numpy scikit-learn |
| FileNotFoundError: data.csv | Create data.csv using the script in Step 2 |
| Address already in use | Change port: app.run(port=5001) |
| Permission denied | Use: pip3 install --user [package] |
| Port 5000 not responding | Check the Flask server is running in the other terminal |
✅ Validation Checklist
- ✓ All packages installed without errors
- ✓ data.csv created (1,000 rows, 10 columns)
- ✓ Model trains successfully (no errors)
- ✓ Accuracy > 95%
- ✓ Files created: model.pkl, scaler.pkl
- ✓ Flask server starts on port 5000
- ✓ API responds with JSON predictions
- ✓ Test 1 returns {"prediction": [1]}
- ✓ Test 2 returns {"prediction": [0]}
📚 Dataset Information
| Feature | Description | Range |
|---|---|---|
| age | Applicant's age | 18-80 years |
| income | Annual income | $20k-$150k |
| credit_score | Credit score | 300-850 |
| loan_amount | Requested loan amount | $5k-$50k |
| employment_length | Years employed | 0-40 years |
| debt_to_income | Debt to income ratio | 0.0-1.0 |
| num_credit_lines | Number of credit lines | 1-15 |
| education | Education level | Categorical |
| home_ownership | Home ownership status | Rent/Own/Mortgage |
| target | Loan approved (1) or rejected (0) | 0 or 1 |
🎉 Success!
If all tests pass, you've successfully:
- ✅ Executed the complete 11-step ML workflow
- ✅ Trained a model with ~98% accuracy
- ✅ Saved and loaded a production model
- ✅ Deployed a REST API for predictions
- ✅ Tested the API with real requests
Ready to adapt this template for your own ML projects! 🚀
🔑 ULTIMATE MASTER KEY
The One Mnemonic to Rule Them All
"Lazy Programmers Should Skip Internet Training, Predicting Every Semester, Loading Servers - Data Trained Validates, Monitors Deploy, Retraining Continuously"
This single sentence encodes:
- ✅ Complete 11-step ML workflow (Load → Serve)
- ✅ 6-step MLOps pipeline (Data → Retrain)
- ✅ Everything you need for production ML!
✅ QUICK CHECKLIST MNEMONICS
Pre-Training Checklist — "DATA-READY"
- ✓ Data loaded?
- ✓ All preprocessing done?
- ✓ Train/test split?
- ✓ All features scaled?
- ✓ Random seed set?
- ✓ Everything validated?
- ✓ Architecture initialized?
- ✓ Data shapes correct?
- ✓ You're ready to train!
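A minimal sketch of this checklist as code, assuming numeric feature arrays or DataFrames; `data_ready` is a hypothetical helper and the checks shown are only a subset:
import numpy as np

def data_ready(X_train, X_test, y_train, y_test, seed=42):
    np.random.seed(seed)  # Random seed set
    assert len(X_train) == len(y_train), "train X/y length mismatch"
    assert len(X_test) == len(y_test), "test X/y length mismatch"
    assert not np.isnan(np.asarray(X_train, dtype=float)).any(), "NaNs remain in train"
    assert not np.isnan(np.asarray(X_test, dtype=float)).any(), "NaNs remain in test"
    print("✓ You're ready to train!")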
Post-Training Checklist — "MODEL-SAFE"
- ✓ Model trained?
- ✓ Overfitting checked?
- ✓ Data predictions made?
- ✓ Evaluation complete?
- ✓ Logged in MLflow?
- ✓ Saved to disk?
- ✓ Artifacts versioned?
- ✓ Final tests passed?
- ✓ Everything documented?
💡 PRACTICAL USAGE TIPS
Morning Practice
Write these letters: L-P-S-S-I-T-P-E-S-L-S
Say: "Lazy Programmers Should Skip Internet, Train Python..."
Do this for 7 days → automatic recall!
During Coding
Stuck? Ask: "Where am I in LAZY PROGRAMMERS?"
Oh, I just did Split (S), next is Scale (S)!
Code Review
Use DICFM checklist:
- ✓ Docstring?
- ✓ Imports?
- ✓ Constants?
- ✓ Functions?
- ✓ Main?
🤖 ML MODEL TYPES, LIBRARIES & FILE FORMATS
Model Save/Load Formats by Library
| Library | Model Types | File Extensions | Save/Load Methods |
|---|---|---|---|
| Scikit-learn | Random Forest, SVM, Logistic Regression, KNN, Decision Trees | .pkl, .joblib | joblib.dump() / joblib.load() |
| TensorFlow/Keras | Neural Networks, CNN, RNN, LSTM, Transformers | .h5, .keras, .pb, .tflite | model.save() / load_model() |
| PyTorch | Neural Networks, CNN, RNN, GAN, Transformers | .pt, .pth, .onnx | torch.save() / torch.load() |
| XGBoost | Gradient Boosting (Trees) | .model, .json, .ubj | save_model() / load_model() |
| LightGBM | Gradient Boosting (Trees) | .txt, .model | save_model() / Booster() |
| CatBoost | Gradient Boosting (Categorical) | .cbm, .json | save_model() / load_model() |
| Hugging Face | BERT, GPT, T5, Transformers | .bin, .safetensors | save_pretrained() / from_pretrained() |
| ONNX | Universal (Cross-platform) | .onnx | onnx.save() / onnx.load() |
Complete Model Saving/Loading Examples
1. Scikit-learn Models (.pkl / .joblib)
import joblib
import pickle
from sklearn.ensemble import RandomForestClassifier
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# SAVE - Option 1: Joblib (Recommended for sklearn)
joblib.dump(model, 'model.joblib')
# SAVE - Option 2: Pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# LOAD
loaded_model = joblib.load('model.joblib')
# OR
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
2. TensorFlow/Keras Models (.h5 / .keras)
import tensorflow as tf
from tensorflow import keras
# Create model
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
# Train model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=10)
# SAVE - Option 1: Native Keras format (Recommended)
model.save('model.keras')
# SAVE - Option 2: HDF5 format
model.save('model.h5')
# SAVE - Option 3: SavedModel format (TensorFlow)
model.save('saved_model/')
# LOAD
loaded_model = keras.models.load_model('model.keras')
# OR
loaded_model = keras.models.load_model('model.h5')
3. PyTorch Models (.pt / .pth)
import torch
import torch.nn as nn
# Define model
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = NeuralNet()
# SAVE - Option 1: State dict (Recommended)
torch.save(model.state_dict(), 'model.pth')
# SAVE - Option 2: Entire model
torch.save(model, 'model_complete.pt')
# LOAD - Option 1: State dict
model = NeuralNet()
model.load_state_dict(torch.load('model.pth'))
model.eval()
# LOAD - Option 2: Entire model (PyTorch >= 2.6: pass weights_only=False)
model = torch.load('model_complete.pt', weights_only=False)
model.eval()
4. XGBoost Models (.model / .json)
import xgboost as xgb
# Train model
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# SAVE - Option 1: Binary format (Recommended)
model.save_model('xgb_model.model')
# SAVE - Option 2: JSON format
model.save_model('xgb_model.json')
# SAVE - Option 3: Universal Binary JSON
model.save_model('xgb_model.ubj')
# LOAD
loaded_model = xgb.XGBClassifier()
loaded_model.load_model('xgb_model.model')
5. LightGBM Models (.txt / .model)
import lightgbm as lgb
# Train model
model = lgb.LGBMClassifier()
model.fit(X_train, y_train)
# SAVE - Option 1: Text format
model.booster_.save_model('lgb_model.txt')
# SAVE - Option 2: Binary format
model.booster_.save_model('lgb_model.model')
# LOAD
loaded_model = lgb.Booster(model_file='lgb_model.txt')
6. Hugging Face Transformers (.bin / .safetensors)
from transformers import BertForSequenceClassification, BertTokenizer
# Load pre-trained model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# SAVE
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')
# LOAD
loaded_model = BertForSequenceClassification.from_pretrained('./my_model')
loaded_tokenizer = BertTokenizer.from_pretrained('./my_model')
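7. ONNX (.onnx)
The table above also lists ONNX as the cross-platform option; here is a minimal sketch for exporting a scikit-learn model, assuming the skl2onnx and onnxruntime packages are installed (pip install skl2onnx onnxruntime). The toy data is illustrative:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train a small model on toy data
X = np.random.rand(100, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# SAVE - Convert to ONNX and write to disk
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 4]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# LOAD - Run inference with ONNX Runtime
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
predictions = session.run(None, {"input": X[:5]})[0]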
Model Format Comparison
| Format | Pros | Cons | Best For |
|---|---|---|---|
| .pkl (Pickle) | Python native, universal | Security risks, Python-only | Quick prototypes |
| .joblib | Fast for large numpy arrays | Python-only | Scikit-learn models |
| .h5 (HDF5) | Efficient, widely supported | Large file size | Keras/TensorFlow |
| .pt/.pth | PyTorch native, flexible | PyTorch-only | PyTorch models |
| .onnx | Cross-platform, optimized | Conversion complexity | Production deployment |
| .pb (ProtoBuf) | TensorFlow production format | Complex structure | TensorFlow Serving |
| .tflite | Small size, mobile-optimized | Limited operations | Mobile/Edge devices |
| .safetensors | Secure, fast loading | Newer format | Hugging Face models |
Quick Reference: Save & Load Cheat Sheet
# ==========================================
# SCIKIT-LEARN
# ==========================================
import joblib
joblib.dump(model, 'model.joblib') # Save
model = joblib.load('model.joblib') # Load
# ==========================================
# TENSORFLOW/KERAS
# ==========================================
model.save('model.keras') # Save
model = keras.models.load_model('model.keras') # Load
# ==========================================
# PYTORCH
# ==========================================
torch.save(model.state_dict(), 'model.pth') # Save
model.load_state_dict(torch.load('model.pth')) # Load
# ==========================================
# XGBOOST
# ==========================================
model.save_model('model.model') # Save
model.load_model('model.model') # Load
# ==========================================
# LIGHTGBM
# ==========================================
model.booster_.save_model('model.txt') # Save
model = lgb.Booster(model_file='model.txt') # Load
# ==========================================
# HUGGING FACE
# ==========================================
model.save_pretrained('./model') # Save
model = Model.from_pretrained('./model') # Load (Model = your model class, e.g. BertForSequenceClassification)
💡 Best Practices:
- ✓ Version control: include the version in the filename: model_v1.0.pkl
- ✓ Save metadata: store preprocessing objects (scalers, encoders) separately
- ✓ Production: use .onnx for cross-platform deployment
- ✓ Mobile: use .tflite for TensorFlow mobile apps
- ✓ Security: avoid pickle for untrusted sources
- ✓ Size: compress large models: gzip model.pkl
🎯 ML WORKFLOW DECISION TREE
Model Deployment Decision Tree
START: Need to deploy a model?
│
├─→ Real-time predictions needed?
│ │
│ ├─→ YES → Latency < 1 second?
│ │ │
│ │ ├─→ YES → Traffic pattern?
│ │ │ │
│ │ │ ├─→ Constant/Predictable
│ │ │ │ → REAL-TIME ENDPOINT
│ │ │ │ • Always-on server
│ │ │ │ • Auto-scaling
│ │ │ │ • ML instance types
│ │ │ │
│ │ │ └─→ Intermittent/Unpredictable
│ │ │ → SERVERLESS INFERENCE
│ │ │ • Auto-scales to zero
│ │ │ • Cold start acceptable
│ │ │ • Pay per invoke
│ │ │
│ │ └─→ NO → Processing time > 60 sec?
│ │ │
│ │ └─→ YES → ASYNCHRONOUS INFERENCE
│ │ • Queue-based processing
│ │ • S3 trigger integration
│ │ • Long-running tasks
│ │
│ └─→ NO → Large batch of data?
│ │
│ └─→ YES → BATCH TRANSFORM
│ • Process entire datasets
│ • No endpoint needed
│ • Cost-effective for bulk
│
└─→ Deploy to edge devices?
│
└─→ YES → EDGE DEPLOYMENT
• Compile for IoT
• No internet required
• Optimized inference
ML Problem Type Decision Tree
START: What ML problem do you have?
│
├─→ Do you have labeled data?
│ │
│ ├─→ YES → What type of output?
│ │ │
│ │ ├─→ Categories/Classes
│ │ │ → CLASSIFICATION
│ │ │ • Binary (2 classes)
│ │ │ • Multi-class (3+ classes)
│ │ │ • Multi-label (multiple outputs)
│ │ │ Examples: Spam detection, Image recognition
│ │ │
│ │ ├─→ Continuous Numbers
│ │ │ → REGRESSION
│ │ │ • Predict numeric values
│ │ │ • Linear/Non-linear relationships
│ │ │ Examples: House prices, Stock forecasting
│ │ │
│ │ └─→ Sequence/Text
│ │ → SEQUENCE MODELING
│ │ • Time series prediction
│ │ • Text generation
│ │ • Language translation
│ │
│ └─→ NO → What's your goal?
│ │
│ ├─→ Find patterns/groups
│ │ → CLUSTERING
│ │ • K-Means, DBSCAN
│ │ • Customer segmentation
│ │ • Anomaly detection
│ │
│ ├─→ Reduce dimensions
│ │ → DIMENSIONALITY REDUCTION
│ │ • PCA, t-SNE, UMAP
│ │ • Feature extraction
│ │ • Visualization
│ │
│ └─→ Learn from rewards
│ → REINFORCEMENT LEARNING
│ • Agent-based learning
│ • Game playing, Robotics
│ • Sequential decisions
Data Preprocessing Decision Tree
START: How to handle your data?
│
├─→ Missing values present?
│ │
│ ├─→ YES → How much missing?
│ │ │
│ │ ├─→ < 5% missing
│ │ │ → DROP ROWS
│ │ │ • df.dropna()
│ │ │ • Minimal data loss
│ │ │
│ │ ├─→ 5-40% missing
│ │ │ → IMPUTE VALUES
│ │ │ • Mean/Median (numeric)
│ │ │ • Mode (categorical)
│ │ │ • Forward/Backward fill
│ │ │ • KNN Imputer
│ │ │
│ │ └─→ > 40% missing
│ │ → DROP COLUMN
│ │ • Too much missing data
│ │ • Not reliable for training
│ │
│ └─→ NO → Continue to next check
│
├─→ Categorical features?
│ │
│ ├─→ Ordinal (has order)
│ │ → LABEL ENCODING
│ │ • LabelEncoder()
│ │ • Low=0, Medium=1, High=2
│ │
│ └─→ Nominal (no order)
│ │
│ ├─→ Few categories (< 10)
│ │ → ONE-HOT ENCODING
│ │ • pd.get_dummies()
│ │ • Binary columns per category
│ │
│ └─→ Many categories (> 10)
│ → TARGET ENCODING
│ • Replace with mean target
│ • Reduces dimensionality
│
├─→ Numeric features with different scales?
│ │
│ ├─→ Features have outliers?
│ │ │
│ │ ├─→ YES → ROBUST SCALING
│ │ │ • RobustScaler()
│ │ │ • Uses median & IQR
│ │ │ • Outlier resistant
│ │ │
│ │ └─→ NO → Distribution type?
│ │ │
│ │ ├─→ Normal distribution
│ │ │ → STANDARD SCALING
│ │ │ • StandardScaler()
│ │ │ • Mean=0, Std=1
│ │ │
│ │ └─→ Not normal
│ │ → MIN-MAX SCALING
│ │ • MinMaxScaler()
│ │ • Range [0, 1]
│ │
│ └─→ NO → Data ready!
│
└─→ Imbalanced classes?
│
├─→ Slightly imbalanced (60:40)
│ → CLASS WEIGHTS
│ • class_weight='balanced'
│ • Penalize errors differently
│
├─→ Moderately imbalanced (80:20)
│ → RESAMPLING
│ • SMOTE (oversample minority)
│ • Random undersample majority
│
└─→ Severely imbalanced (95:5)
→ ANOMALY DETECTION
• Treat as outlier problem
• Use Isolation Forest
• One-class SVM
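A minimal scikit-learn sketch of the tree's preprocessing branches; the toy columns and the specific choices (median imputation, StandardScaler) are illustrative:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler  # or RobustScaler / MinMaxScaler

df = pd.DataFrame({
    "income": [40_000, None, 85_000, 120_000],   # has a missing value
    "grade": ["Low", "High", "Medium", "Low"],   # ordinal categorical
    "city": ["NY", "LA", "SF", "NY"],            # nominal, few categories
})

# 5-40% missing → impute (median for numeric)
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()
# Ordinal → label-style mapping (Low=0, Medium=1, High=2)
df["grade"] = df["grade"].map({"Low": 0, "Medium": 1, "High": 2})
# Nominal with few categories → one-hot encoding
df = pd.get_dummies(df, columns=["city"])
# Roughly normal, no outliers → StandardScaler
# (outliers → RobustScaler; non-normal → MinMaxScaler)
df[["income"]] = StandardScaler().fit_transform(df[["income"]])
print(df)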
Model Selection Decision Tree
START: Which algorithm should I use?
│
├─→ For CLASSIFICATION problems:
│ │
│ ├─→ Data size?
│ │ │
│ │ ├─→ Small dataset (< 10K rows)
│ │ │ │
│ │ │ ├─→ Need interpretability?
│ │ │ │ │
│ │ │ │ ├─→ YES → LOGISTIC REGRESSION
│ │ │ │ │ • Simple, interpretable
│ │ │ │ │ • Linear decision boundary
│ │ │ │ │ • Fast training
│ │ │ │ │
│ │ │ │ └─→ NO → SVM (RBF kernel)
│ │ │ │ • Non-linear boundaries
│ │ │ │ • High accuracy
│ │ │ │ • Works well in high dimensions
│ │ │ │
│ │ │ └─→ Want tree-based?
│ │ │ → DECISION TREE / RANDOM FOREST
│ │ │ • Handle non-linear data
│ │ │ • Feature importance
│ │ │ • No scaling needed
│ │ │
│ │ └─→ Large dataset (> 10K rows)
│ │ │
│ │ ├─→ Structured/Tabular data?
│ │ │ │
│ │ │ └─→ YES → GRADIENT BOOSTING
│ │ │ • XGBoost, LightGBM, CatBoost
│ │ │ • Best for tabular data
│ │ │ • High accuracy
│ │ │ • Handles missing values
│ │ │
│ │ └─→ Unstructured (images/text)?
│ │ │
│ │ ├─→ Images → CONVOLUTIONAL NEURAL NET (CNN)
│ │ │ • ResNet, VGG, EfficientNet
│ │ │ • Transfer learning available
│ │ │ • GPU recommended
│ │ │
│ │ └─→ Text → TRANSFORMER MODELS
│ │ • BERT, GPT, RoBERTa
│ │ • Pre-trained available
│ │ • Fine-tune on your data
│ │
│ └─→ Special cases:
│ │
│ ├─→ Many features, few samples → NAIVE BAYES
│ ├─→ Need probability estimates → LOGISTIC REGRESSION
│ └─→ Multi-label classification → ONE-VS-REST + BASE MODEL
│
├─→ For REGRESSION problems:
│ │
│ ├─→ Linear relationship?
│ │ │
│ │ ├─→ YES → LINEAR REGRESSION
│ │ │ • Simple, fast
│ │ │ • Ridge/Lasso for regularization
│ │ │ • ElasticNet for both L1/L2
│ │ │
│ │ └─→ NO → Non-linear patterns?
│ │ │
│ │ ├─→ Tree-based preferred
│ │ │ → RANDOM FOREST REGRESSOR
│ │ │ • Handles non-linearity
│ │ │ • Robust to outliers
│ │ │ • Feature importance
│ │ │
│ │ └─→ Need highest accuracy
│ │ → GRADIENT BOOSTING REGRESSOR
│ │ • XGBoost, LightGBM
│ │ • Best performance
│ │ • Ensemble method
│ │
│ └─→ Time series data?
│ → SPECIALIZED TIME SERIES
│ • ARIMA, Prophet
│ • LSTM, GRU (deep learning)
│ • Seasonal decomposition
│
└─→ For CLUSTERING problems:
│
├─→ Know number of clusters?
│ │
│ ├─→ YES → K-MEANS
│ │ • Fast, scalable
│ │ • Spherical clusters
│ │ • Need to specify K
│ │
│ └─→ NO → Density-based needed?
│ │
│ └─→ YES → DBSCAN
│ • Finds arbitrary shapes
│ • Handles noise/outliers
│ • Auto determines clusters
│
└─→ Hierarchical structure?
→ HIERARCHICAL CLUSTERING
• Dendrogram visualization
• Agglomerative/Divisive
• Good for small datasets
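A minimal sketch of the classification branch in practice: compare an interpretable baseline against a tree ensemble with cross-validation before committing (toy data via make_classification):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

for name, model in [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")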
🎯 CONCLUSION
Remember These Three Master Mnemonics
- ML Workflow: "LAZY PROGRAMMERS SHOULD SKIP INTERNET, TRAIN PYTHON EVERYDAY SAVING LOADS OF SERVER POWER"
- MLOps Pipeline: "DTVMDR - Don't Trust Very Much During Retirement"
- Code Structure: "DICFM - Dick's Infamous Chocolate Fudge Makes"
Practice tip: Write out the first letter of each step every morning for a week. By day 7, it'll be automatic!
🎉 Now You'll Never Forget the ML Workflow!
With these mnemonics, you have a complete mental framework for:
- ✅ Building end-to-end ML pipelines
- ✅ Writing clean, structured code
- ✅ Deploying production MLOps systems
- ✅ Remembering library-specific patterns
Happy modeling! 🚀
