How I'd Build a Fraud Detection Pipeline (Real-Time + Batch)

Graph-based features + GBM real-time scorer + rule layer, processing 50K transactions/second. Feature engineering for fraud, class imbalance strategies, latency budget at p99, model staleness risk, and the monitoring stack that catches drift before revenue impact.

The Problem Is Not Classification. It's an Adversarial Cat-and-Mouse Game.

Fraud detection at a payments company like PhonePe or Juspay is not a standard ML classification problem. The classes are severely imbalanced: fraud is 0.01% to 0.1% of transactions. The distribution shifts daily as fraudsters adapt to your model. Latency is hard-constrained: scoring must finish in under 50 ms so it does not hold up the transaction. False positives hurt real customers and drive churn. False negatives lose money. And the moment you deploy a model that works, fraudsters start probing it for blind spots.

Feature Engineering: Where the Real Work Is

Raw transaction features (amount, merchant, timestamp) are necessary but insufficient. The real signal is in behavioral patterns: velocity features (how many transactions has this card made in the last 1 hour / 24 hours / 7 days?), graph features (is this merchant connected to other merchants that were recently flagged?), device fingerprint (is this device ID associated with multiple account IDs?), location coherence (is the transaction location consistent with the user's historical behavior?).
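The velocity features above can be sketched as a per-card sliding-window counter. This is a minimal in-memory illustration, not how a production streaming job would hold state: the class name VelocityCounter, the method names, and the assumption that timestamps are Unix seconds arriving in order per card are all hypothetical.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts transactions per card inside fixed look-back windows (in seconds)."""

    def __init__(self, windows=(3600, 86400, 604800)):  # 1 hour, 24 hours, 7 days
        self.windows = sorted(windows)
        self.events = defaultdict(deque)  # card_id -> timestamps, oldest first

    def record(self, card_id, ts):
        """Store one transaction timestamp (assumed non-decreasing per card)."""
        q = self.events[card_id]
        q.append(ts)
        # Evict timestamps that fall outside the largest window.
        horizon = ts - self.windows[-1]
        while q and q[0] <= horizon:
            q.popleft()

    def features(self, card_id, now):
        """Return {window_seconds: number of transactions in (now - window, now]}."""
        q = self.events.get(card_id, ())
        return {w: sum(1 for t in q if now - w < t <= now) for w in self.windows}


vc = VelocityCounter()
for ts in (0, 1000, 4000):
    vc.record("card-1", ts)
counts = vc.features("card-1", now=4000)
```

Here counts maps each window length to a transaction count that can be fed to the model as a feature. A real deployment would keep this state in the stream processor or feature store rather than in a Python dictionary.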

Computing these features at transaction time requires a streaming feature pipeline. Kafka ingests transaction events. Flink or Spark Streaming computes the velocity aggregations and keeps them updated as events arrive. The feature store serves both pre-computed and real-time features to the scoring model within the latency budget.

Model Architecture

Layer 1 — Rule engine (< 5ms): hard rules that block obviously fraudulent transactions immediately. Same account logged in from 3 countries in 1 hour → block. Transaction amount 50x the account's historical maximum → review queue. Rules are fast, interpretable, and auditable. They catch the easy cases and reduce the load on the ML model.
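A rule layer like this can be as simple as an ordered list of checks that return a decision plus a reason code for the audit trail. The sketch below encodes only the two example rules from the text. The TxnContext fields, the apply_rules function, and the reason-code strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxnContext:
    amount: float
    historical_max_amount: float      # largest past transaction for this account
    login_countries_last_hour: int    # distinct login countries in the last hour


def apply_rules(ctx):
    """Return (decision, reason); decision is 'block', 'review', or 'pass'."""
    # Rule 1: same account seen in 3 or more countries within 1 hour.
    if ctx.login_countries_last_hour >= 3:
        return ("block", "logins_from_3_plus_countries_in_1h")
    # Rule 2: amount is at least 50x the account's historical maximum.
    if ctx.historical_max_amount > 0 and ctx.amount >= 50 * ctx.historical_max_amount:
        return ("review", "amount_50x_historical_max")
    # No rule fired: hand the transaction to the ML model.
    return ("pass", None)


decision, reason = apply_rules(TxnContext(26000.0, 500.0, 1))
```

Returning a reason code alongside the decision is what keeps the layer auditable: every block or review can be traced to a single named rule.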

Layer 2 — ML model (< 30ms): a gradient-boosted tree (LightGBM or XGBoost) scoring the full feature set. Why trees over neural nets for the core model? Interpretability (regulators require explanations), speed (tree inference is faster than neural net inference at this latency budget), and robustness to noisy/missing features. Output: fraud probability score.

Layer 3 — Graph neural network (async, for enrichment): a GNN over the transaction graph detects fraud rings — coordinated groups of accounts and merchants working together. This runs async and enriches the feature store, not the real-time path.
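A GNN does not fit in a short example, so the sketch below substitutes a much simpler technique to illustrate the async enrichment idea: connected components over an account-device graph, computed with union-find. It is not a GNN and would miss most of what one learns. The function name ring_sizes and the input format (pairs of account ID and device ID) are assumptions made for illustration.

```python
from collections import defaultdict


def ring_sizes(account_device_pairs):
    """Map each account to the number of accounts in its connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Accounts and devices are separate node types in one bipartite graph.
    for account, device in account_device_pairs:
        union(("acct", account), ("dev", device))

    members = defaultdict(set)
    for node in list(parent):
        if node[0] == "acct":
            members[find(node)].add(node[1])
    return {acct: len(group) for group in members.values() for acct in group}


pairs = [("a1", "d1"), ("a2", "d1"), ("a2", "d2"), ("a3", "d2"), ("a4", "d9")]
sizes = ring_sizes(pairs)
```

A batch job could write each account's component size to the feature store, where the real-time model reads it like any other pre-computed feature.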

Handling Class Imbalance

Undersample majority class: train on 10% of legitimate transactions + 100% of fraud. Undersampling inflates the predicted fraud probabilities, so rescale the scores or shift the decision threshold at inference to account for the sampling ratio.
Focal loss: downweight easy (clearly legitimate) examples so training focuses on hard cases near the decision boundary.
Anomaly detection as a complementary signal: train an autoencoder on legitimate transactions only. Transactions with high reconstruction error are anomalous — good for catching novel fraud patterns the classifier hasn't seen.
Calibration: the raw model output is a score, not a calibrated probability, and undersampling or focal loss distorts it further. Platt scaling or isotonic regression maps the score to a calibrated probability, so a chosen threshold has a meaningful precision-recall interpretation.

The Adversarial Problem

Fraudsters probe your model. They run small test transactions to find the threshold below which you don't flag. They rotate device IDs and account IDs to defeat velocity features. They use money mule networks to launder through legitimate-looking accounts. The ML model must be retrained frequently (daily or weekly) on recent fraud patterns. New fraud patterns that don't match historical training data require anomaly detection. Human reviewers close the loop — their decisions feed back into training labels.

The interview question that trips people up: 'Your fraud model has 99.9% accuracy. Is it good?' No. At a 0.1% fraud rate, predicting 'legitimate' for every transaction achieves 99.9% accuracy while catching zero fraud. The right metric is precision at a fixed recall (e.g., precision at 80% recall) or the area under the precision-recall curve.
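The arithmetic behind that answer, plus one way to compute precision at a fixed recall, can be sketched in plain Python. The function names and the toy scores and labels are made up for illustration. In practice a library routine for precision-recall curves would do the same job on real data.

```python
def accuracy_of_always_legit(fraud_rate):
    """Accuracy of a 'model' that labels every transaction legitimate."""
    return 1.0 - fraud_rate


def precision_at_recall(scores, labels, target_recall=0.8):
    """Precision at the highest score threshold whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("labels contain no fraud cases")
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    tp = 0
    for i, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Tied scores share a threshold, so evaluate only at the end of a tie group.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if tp / total_pos >= target_recall:
            return tp / i  # i transactions flagged, tp of them fraud
    raise ValueError("target_recall must be at most 1.0")


baseline = accuracy_of_always_legit(0.001)  # 99.9% accurate, catches no fraud

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
p_at_80 = precision_at_recall(scores, labels, target_recall=0.8)
```

In the toy data, catching 4 of the 5 fraud cases requires flagging the top 6 transactions, so precision at 80% recall is 4/6. That is the trade-off a single accuracy number hides.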
