MLOps Audit Checklist: Ensuring Your ML Systems Work in Production
A comprehensive checklist for auditing your MLOps pipeline. Learn what to look for and how to fix common issues that cause ML systems to fail in production.
Many ML models perform well in a Jupyter notebook and then fail in production. What separates a demo from a production-ready ML system is the MLOps around the model: the data pipelines, the deployment process, and the monitoring. This checklist helps you audit your MLOps pipeline and find the gaps that cause those failures.
Why MLOps Audits Matter
After auditing dozens of ML systems for clients, I've found that roughly 80% of the production failures I see trace back to MLOps issues rather than to the model itself:
- Data drift: Models trained on old data fail on new data
- Model degradation: Performance drops over time without detection
- Deployment issues: Models work in staging but fail in production
- Monitoring gaps: No visibility into model performance
- Rollback failures: Can't quickly revert when things go wrong
The MLOps Audit Framework
1. Data Pipeline Audit
Data Quality Checks
- Data validation: Automated checks for missing values, outliers, and data types
- Schema validation: Data structure remains consistent over time
- Data freshness: Monitoring data staleness and update frequency
- Data lineage: Tracking data from source to model training
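# Example: Data quality monitoring
import logging
from datetime import datetime, timedelta

import great_expectations as ge  # a real _load_expectations would build its checks with this


class DataQualityMonitor:
    def __init__(self, data_source):
        self.data_source = data_source
        self.logger = logging.getLogger(__name__)
        self.expectations = self._load_expectations()

    def _load_expectations(self):
        # Placeholder: return objects exposing `.name` and `.validate(df)`,
        # for example thin wrappers around Great Expectations checks.
        return []

    def validate_data(self, df):
        """Run every expectation and map its name to a pass/fail flag."""
        results = {}
        for expectation in self.expectations:
            try:
                result = expectation.validate(df)
                results[expectation.name] = result.success
            except Exception as e:
                # A check that crashes counts as a failed check.
                results[expectation.name] = False
                self.logger.error(f"Validation failed for {expectation.name}: {e}")
        return results

    def check_data_freshness(self, max_age_hours=24):
        """Return True if the newest record is younger than max_age_hours."""
        # Assumes get_latest_timestamp() returns a naive local datetime.
        latest_data = self.data_source.get_latest_timestamp()
        age = datetime.now() - latest_data
        return age < timedelta(hours=max_age_hours)

Data Versioning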
- Data versioning system: Track different versions of training data
- Reproducibility: Can recreate exact training data from any point in time
- Data catalog: Searchable metadata about datasets
- Access control: Proper permissions for data access
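One lightweight way to make training data traceable is to give every dataset snapshot a content-derived ID and store that ID with the model. The sketch below is a minimal illustration using only the standard library; `dataset_fingerprint`, the `schema_version` label, and the sample records are hypothetical, and a dedicated data versioning tool would normally replace it at scale.

```python
import hashlib
import json


def dataset_fingerprint(rows, schema_version="v1"):
    """Return a short, deterministic ID for a list of dict records.

    Records are serialized with sorted keys, so key order does not change
    the hash. Row order still matters.
    """
    digest = hashlib.sha256()
    digest.update(schema_version.encode("utf-8"))
    for row in rows:
        line = json.dumps(row, sort_keys=True, default=str)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:16]


# Two snapshots that differ in a single value
train_v1 = [{"user_id": 1, "spend": 10.5}, {"user_id": 2, "spend": 3.0}]
train_v2 = [{"user_id": 1, "spend": 10.5}, {"user_id": 2, "spend": 3.1}]

id_v1 = dataset_fingerprint(train_v1)
id_v2 = dataset_fingerprint(train_v2)
print(id_v1, id_v2, id_v1 == id_v2)
```

Logging the fingerprint as a run parameter next to the model is enough to answer "which data trained this model?" during an audit, provided the snapshot itself is kept somewhere immutable.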
2. Model Development Audit
Experiment Tracking
- Experiment logging: All experiments logged with parameters and results
- Model versioning: Unique identifiers for each model version
- Artifact storage: Models, metrics, and logs stored properly
- Reproducibility: Can recreate any experiment exactly
# Example: Experiment tracking setup
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score


class ExperimentTracker:
    def __init__(self, experiment_name):
        mlflow.set_experiment(experiment_name)
        # Client for querying past runs; not used by log_experiment itself
        self.client = mlflow.tracking.MlflowClient()

    def log_experiment(self, model, X_train, X_test, y_train, y_test, params):
        with mlflow.start_run() as run:
            # Log parameters
            mlflow.log_params(params)

            # Train model
            model.fit(X_train, y_train)

            # Evaluate model
            y_pred = model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='weighted')
            recall = recall_score(y_test, y_pred, average='weighted')

            # Log metrics
            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("precision", precision)
            mlflow.log_metric("recall", recall)

            # Log model
            mlflow.sklearn.log_model(model, "model")

        # Read the run ID from the run object: once the `with` block exits,
        # mlflow.active_run() returns None.
        return {
            'run_id': run.info.run_id,
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall
        }

Model Testing
- Unit tests: Individual components tested in isolation
- Integration tests: End-to-end model testing
- Performance tests: Latency and throughput testing
- A/B testing: Comparing model versions in production
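The unit and performance tests in this list can start small. The sketch below shows a pytest-style latency check and an output-shape check against a stand-in model. `DummyModel`, `measure_latency_ms`, and the 50 ms budget are illustrative assumptions, not recommendations; swap in your own model and service-level target.

```python
import statistics
import time


class DummyModel:
    """Stand-in for a real model; any object with predict() works."""

    def predict(self, rows):
        return [sum(row) > 1.0 for row in rows]


def measure_latency_ms(model, batch, n_runs=200):
    """Time repeated predict() calls and return p50 and p95 in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(batch)
        timings.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(timings, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94]}


def test_predict_latency_budget():
    # The 50 ms budget is an assumed value for illustration only.
    result = measure_latency_ms(DummyModel(), batch=[[0.2, 0.9], [0.1, 0.3]])
    assert result["p95"] < 50.0


def test_predict_output_shape():
    # One prediction per input row
    batch = [[0.2, 0.9], [0.1, 0.3]]
    assert len(DummyModel().predict(batch)) == len(batch)
```

Checking a tail percentile rather than the mean keeps the test sensitive to the slow requests that users actually notice.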
3. Model Deployment Audit
CI/CD Pipeline
- Automated testing: Tests run on every code change
- Model validation: Automated model quality checks
- Staging deployment: Models tested in staging environment
- Production deployment: Automated, safe production deployments
# Example: GitHub Actions MLOps pipeline
name: MLOps Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.9"  # quoted so YAML does not read it as a float
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest
      - name: Run tests
        run: pytest tests/
      - name: Run model validation
        run: python scripts/validate_model.py

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    # Runs on pushes to main only; pull requests stop after the test job
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4  # the manifests and scripts live in the repo
      # Assumes kubectl is already authenticated against the staging cluster
      - name: Deploy to staging
        run: |
          kubectl apply -f k8s/staging/
          python scripts/run_integration_tests.py

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      # Assumes kubectl is already authenticated against the production cluster
      - name: Deploy to production
        run: |
          kubectl apply -f k8s/production/
          python scripts/run_smoke_tests.py

Infrastructure
- Scalability: System can handle increased load
- Reliability: High availability and fault tolerance
- Security: Proper access controls and data protection
- Resource management: Efficient use of compute resources
4. Model Monitoring Audit
Performance Monitoring
- Model metrics: Accuracy, precision, recall tracked over time
- Data drift detection: Automated detection of data distribution changes
- Model drift detection: Automated detection of model performance degradation
- Alerting: Automated alerts when metrics exceed thresholds
# Example: Model monitoring setup
import logging

from scipy import stats
from sklearn.metrics import accuracy_score


class ModelMonitor:
    def __init__(self, model, baseline_data, baseline_accuracy, threshold=0.05):
        self.model = model
        self.baseline_data = baseline_data          # DataFrame of training-time features
        self.baseline_accuracy = baseline_accuracy  # accuracy measured at release time
        # One threshold doubles as the KS-test significance level and the
        # tolerated accuracy drop; split it in two if those should differ.
        self.threshold = threshold
        self.logger = logging.getLogger(__name__)

    def detect_data_drift(self, new_data):
        """Detect if new data has drifted from baseline"""
        drift_detected = False
        for feature in self.baseline_data.columns:
            if feature not in new_data.columns:
                continue
            # Kolmogorov-Smirnov test (suitable for numeric features)
            ks_stat, p_value = stats.ks_2samp(
                self.baseline_data[feature],
                new_data[feature],
            )
            if p_value < self.threshold:
                drift_detected = True
                self.logger.warning(f"Data drift detected in feature {feature}")
        return drift_detected

    def detect_model_drift(self, X_test, y_test):
        """Detect if model performance has degraded"""
        y_pred = self.model.predict(X_test)
        current_accuracy = accuracy_score(y_test, y_pred)

        # Compare with baseline performance
        performance_drop = self.baseline_accuracy - current_accuracy
        if performance_drop > self.threshold:
            self.logger.warning(f"Model drift detected: {performance_drop:.3f} drop in accuracy")
            return True
        return False

Business Metrics
- Business KPIs: Model impact on business metrics tracked
- ROI measurement: Return on investment from ML initiatives
- User feedback: Direct feedback on model performance
- Cost tracking: Infrastructure and operational costs
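Technical and business metrics can share one alerting path: a set of threshold rules evaluated against the latest values. The sketch below is a minimal version of that idea. The `AlertRule` class, the metric names, and the threshold values are all hypothetical; in practice the messages would go to a pager or chat channel rather than to `print`.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float
    direction: str  # "below" fires when value < threshold, "above" when value > threshold


def evaluate_alerts(metrics, rules):
    """Return a human-readable message for every rule that fires."""
    messages = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # A metric that stopped reporting is itself a monitoring gap.
            messages.append(f"{rule.metric}: no value reported")
            continue
        if rule.direction == "below":
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            messages.append(
                f"{rule.metric}={value} is {rule.direction} threshold {rule.threshold}"
            )
    return messages


rules = [
    AlertRule("accuracy", 0.90, "below"),
    AlertRule("p95_latency_ms", 250.0, "above"),
    AlertRule("cost_per_1k_predictions_usd", 0.40, "above"),
]
todays_metrics = {"accuracy": 0.87, "p95_latency_ms": 180.0}

for message in evaluate_alerts(todays_metrics, rules):
    print(message)
```

Treating a missing metric as an alert is a deliberate choice here: silent gaps in reporting are one of the ways degradation goes unnoticed.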
5. Model Governance Audit
Model Lifecycle Management
- Model registry: Centralized repository of all models
- Approval process: Formal process for model promotion
- Retirement process: Process for decommissioning old models
- Documentation: Comprehensive model documentation
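The registry and approval items above amount to a small state machine: each model version sits in one stage, and only certain promotions are allowed. The in-memory sketch below illustrates that logic. The stage names, the `ModelRegistry` class, and the approver rule are assumptions for illustration; a hosted registry such as the one built into MLflow would normally hold this state.

```python
from dataclasses import dataclass, field

# Allowed promotions; the stage names are illustrative assumptions.
ALLOWED_TRANSITIONS = {
    "registered": {"staging", "retired"},
    "staging": {"production", "retired"},
    "production": {"retired"},
    "retired": set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "registered"
    history: list = field(default_factory=list)


class ModelRegistry:
    def __init__(self):
        self._versions = {}

    def register(self, name):
        """Create the next version number for a model name."""
        version = 1 + sum(1 for (n, _) in self._versions if n == name)
        entry = ModelVersion(name, version)
        self._versions[(name, version)] = entry
        return entry

    def promote(self, name, version, target_stage, approved_by):
        """Move a version to a new stage and record who approved it."""
        entry = self._versions[(name, version)]
        if target_stage not in ALLOWED_TRANSITIONS[entry.stage]:
            raise ValueError(f"Cannot move {entry.stage} -> {target_stage}")
        if target_stage == "production" and not approved_by:
            raise PermissionError("Production promotion needs a named approver")
        entry.history.append((entry.stage, target_stage, approved_by))
        entry.stage = target_stage
        return entry


registry = ModelRegistry()
v1 = registry.register("churn-model")
registry.promote("churn-model", v1.version, "staging", approved_by="ci-bot")
registry.promote("churn-model", v1.version, "production", approved_by="alice")
print(v1.stage, v1.history)
```

The `history` list doubles as a minimal audit trail: every stage change records the previous stage, the new stage, and the approver.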
Compliance and Ethics
- Bias detection: Regular bias testing and mitigation
- Explainability: Model decisions can be explained
- Privacy compliance: GDPR, CCPA, and other privacy regulations
- Audit trails: Complete logs of model decisions and changes
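Bias testing can begin with a single number tracked per release. The sketch below computes per-group selection rates and the demographic parity difference, which is one common fairness metric among several and is named here as an example rather than as the required test. The group labels and predictions are made-up sample data.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Share of positive predictions (1) per group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(predictions, groups):
    """Largest gap in selection rate between any two groups."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Made-up predictions for two groups of four people each
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(preds, groups)
gap = demographic_parity_difference(preds, groups)
print(rates, gap)
```

A gap of 0 means every group receives positive predictions at the same rate. What gap is acceptable depends on the use case and the applicable regulation, so the threshold belongs in the model's documentation rather than in code alone.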
Common MLOps Anti-Patterns
1. The "Works on My Machine" Problem
Problem: Models work in development but fail in production
Solution: Use containerization and infrastructure as code
2. The "Set and Forget" Problem
Problem: Models deployed once and never updated
Solution: Implement continuous monitoring and retraining
3. The "Black Box" Problem
Problem: No visibility into model performance or behavior
Solution: Comprehensive monitoring and logging
4. The "Data Leakage" Problem
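Problem: Information that would not be available at prediction time, such as future data or the target itself, accidentally ends up in training
Solution: Split data by time before any feature engineering, fit preprocessing on the training split only, and validate the pipeline for overlap between splits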
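One common guard against this kind of leakage is to split by time and compute every preprocessing statistic from the training split alone. The sketch below shows the idea with plain Python records. The `event_date` and `amount` fields, the cutoff date, and both helper functions are hypothetical.

```python
from datetime import date


def time_based_split(records, cutoff, time_key="event_date"):
    """Train on records strictly before the cutoff, test on the rest."""
    train = [r for r in records if r[time_key] < cutoff]
    test = [r for r in records if r[time_key] >= cutoff]
    return train, test


def fit_mean(train, feature):
    """Fit a preprocessing statistic on the training split only."""
    values = [r[feature] for r in train]
    return sum(values) / len(values)


records = [
    {"event_date": date(2024, 1, 5), "amount": 10.0},
    {"event_date": date(2024, 2, 9), "amount": 20.0},
    {"event_date": date(2024, 3, 1), "amount": 30.0},
    {"event_date": date(2024, 4, 2), "amount": 100.0},
]

train, test = time_based_split(records, cutoff=date(2024, 3, 15))
train_mean = fit_mean(train, "amount")

# Leakage guard: every training record must predate every test record.
assert max(r["event_date"] for r in train) < min(r["event_date"] for r in test)
print(len(train), len(test), train_mean)
```

Had the mean been computed over all four records, the April value would have shifted it and leaked test-period information into training. The same rule applies to scalers, encoders, and imputers.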
MLOps Maturity Assessment
Rate your MLOps maturity on a scale of 1-5 by picking the level whose description best matches how your team works today:
Level 1: Ad Hoc
- Manual processes
- No versioning
- No monitoring
- Frequent production failures
Level 2: Basic
- Some automation
- Basic versioning
- Limited monitoring
- Occasional production issues
Level 3: Intermediate
- Automated CI/CD
- Good versioning
- Comprehensive monitoring
- Rare production issues
Level 4: Advanced
- Full automation
- Advanced monitoring
- Proactive issue detection
- High reliability
Level 5: Expert
- Self-healing systems
- Predictive monitoring
- Continuous optimization
- 99.9%+ uptime
Quick MLOps Health Check
Answer these questions to assess your MLOps health:
- Can you reproduce any model from 6 months ago? (Yes/No)
- Do you know when your model performance degrades? (Yes/No)
- Can you roll back a model deployment in under 5 minutes? (Yes/No)
- Do you have automated tests for your ML pipeline? (Yes/No)
- Can you trace a prediction back to the training data? (Yes/No)
If you answered "No" to any of these, you have MLOps gaps that need addressing.
MLOps Audit Checklist Summary
Data Management
- Data quality monitoring
- Data versioning
- Data lineage tracking
- Data access controls
Model Development
- Experiment tracking
- Model versioning
- Automated testing
- Reproducibility
Deployment
- CI/CD pipeline
- Staging environment
- Automated deployment
- Rollback capability
Monitoring
- Performance monitoring
- Drift detection
- Alerting
- Business metrics
Governance
- Model registry
- Approval process
- Documentation
- Compliance
Next Steps
If your MLOps audit reveals gaps, here's how to prioritize fixes:
- Critical: Fix anything causing production failures
- High: Implement monitoring and alerting
- Medium: Improve CI/CD and testing
- Low: Enhance documentation and governance
Conclusion
Proper MLOps is essential for ML systems that work in production. Most ML failures are MLOps failures, not model failures. The systems I audit and fix for clients follow these best practices and achieve 99%+ reliability.
If you're struggling with ML systems that work in demos but fail in production, I offer a comprehensive MLOps Audit + Quickstart service that identifies and fixes these issues. Ready to make your ML systems production-ready? Book a 30-minute consultation to discuss your MLOps challenges.