A trading and market operations team needed 24/7 coverage and higher-quality signals without proportionally higher operational cost. Their existing setup relied on manual analysis and rules that could not adapt to regime changes. They had tried one-off ML experiments, but moving from backtest to live required a different level of engineering: consistent features, hourly inference, and a scanner and execution engine that could consume predictions in real time. They came to us for a production-grade system that would improve entry quality and regime awareness while keeping latency and operational burden low.
The goal was not to replace human judgment but to give the team a scalable signal layer. The system would score many symbols, flag regimes (trending vs flat), and provide confidence-calibrated outputs so that position sizing and entry logic could use ML without becoming a black box. We aligned on that outcome and on the constraint that the pipeline had to run reliably: hourly batch inference, sub-minute consumption by the engine, and clear monitoring so that staleness or model drift would be caught before they affected live decisions.
The challenge
Financial market data is severely imbalanced. In the timeframes and symbols the client cared about, upward moves might represent only a few percent of samples, downward moves a bit more, and flat or range-bound behavior the vast majority. A naive model trained on that distribution would predict flat almost always and show weak precision on the directional labels that actually mattered for entry and exit. The client had seen that problem in earlier experiments: great backtest accuracy, poor live precision on the moves they wanted to capture.
A second challenge was scale. The team needed coverage across 100+ symbols and multiple timeframes (e.g. 1h, 4h, daily). Running separate models per symbol or per timeframe would have been a maintenance and compute nightmare. They wanted a single general model that could be trained on combined data and then used for all symbols, with inference running on a schedule (e.g. every hour) and results cached so the trading engine could read them with minimal latency. That required a clean separation between the ML service (Python, training and inference) and the trading engine (Node.js, scanning and execution), with a shared store (e.g. MongoDB) for predictions and metadata.
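To make the separation concrete, a prediction record in the shared store might look like the sketch below. The field names and values here are illustrative assumptions, not the client's actual schema; the point is that the Python service writes a timestamped, versioned document that the Node.js side can read without any coupling to the model code.

```python
from datetime import datetime, timezone

def make_prediction_doc(symbol, timeframe, labels, confidences, model_version):
    """Build one cached prediction record as the ML service might write it
    to the shared store (hypothetical field names)."""
    return {
        "symbol": symbol,
        "timeframe": timeframe,          # e.g. "1h", "4h", "1d"
        "labels": labels,                # e.g. {"trend": "up", "flat": False}
        "confidence": confidences,       # per-label probabilities
        "model_version": model_version,  # lets consumers audit which model scored this
        "generated_at": datetime.now(timezone.utc),  # used for staleness checks
    }

doc = make_prediction_doc(
    "BTCUSDT", "1h",
    {"trend": "up", "flat": False},
    {"trend": 0.72, "flat": 0.11},
    "general-model-v1",
)
```

Keeping the timestamp and model version on every document is what makes the downstream staleness and audit checks possible.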
A third challenge was regime awareness. In flat markets, the same entry logic that works in trending markets can generate false signals and whipsaws. The system needed to distinguish regimes and adjust scoring: for example, boost symbols in a trending regime and penalize or skip symbols when the higher timeframe was flat. That logic had to be consistent across the scanner (which builds the watchlist) and the engine (which confirms entry). We designed the label set and the inference output so that both trend and flat labels were available, and the downstream logic could use them for regime-based filtering.
Our approach
We designed a general model trained on 100+ symbols with multi-timeframe features: returns, momentum, trend, volatility, and volume. To handle class imbalance, we used multi-ATR balance selection: we balanced the training set separately for 1x, 2x, and 3x ATR levels and chose the level that yielded the most balanced samples. The model was trained to predict multiple labels at once (trend direction, breakout, and flat regimes) with a single LightGBM-based pipeline. That way one inference run produced all the signals the scanner and engine needed, and we avoided the complexity of multiple models and sync issues.
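The balance-selection step above can be sketched as follows. This is a minimal illustration under assumed inputs (per-bar forward returns and a precomputed ATR series), not the production feature pipeline: labels are built at each ATR multiple, and the multiple whose smallest class has the largest share is chosen.

```python
import numpy as np

def label_moves(returns, atr, k):
    """Label each bar: 1 = up move beyond k*ATR, -1 = down move beyond -k*ATR, 0 = flat."""
    up = returns > k * atr
    down = returns < -k * atr
    return np.where(up, 1, np.where(down, -1, 0))

def pick_balanced_atr_level(returns, atr, levels=(1.0, 2.0, 3.0)):
    """Choose the ATR multiple whose label set is most balanced,
    measured by the smallest class's share of samples (higher = more balanced)."""
    best_level, best_score = None, -1.0
    for k in levels:
        labels = label_moves(returns, atr, k)
        counts = np.array([(labels == c).sum() for c in (-1, 0, 1)])
        score = counts.min() / counts.sum()
        if score > best_score:
            best_level, best_score = k, score
    return best_level
```

A tighter ATR multiple produces more directional labels (better balance) but noisier moves; a wider one produces cleaner moves but fewer of them, which is why the selection is done per training set rather than fixed.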
Training used a hybrid sampling method (uniform, stratified by time, and volatility regimes) so that the model saw a representative mix of market conditions. We enforced a minimum history (e.g. 208 days) per symbol so that inference was only run when enough data existed. The model was exported and used in a Python service that ran every hour at a fixed offset. It fetched the latest candle data, generated features, ran inference, and wrote predictions to MongoDB with a timestamp. The Node.js scanner and trading engine then queried the store every minute, using predictions that were no older than a set max age. If predictions were missing or stale, the system could fall back to rules or skip the symbol, so live behavior was never driven by bad or outdated data.
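The staleness guard on the consumer side reduces to a single check. The sketch below assumes the prediction document carries a UTC `generated_at` timestamp and uses an illustrative max age (hourly inference plus a buffer); the actual threshold and schema are the client's.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=90)  # illustrative: hourly inference plus a 30-minute buffer

def usable_prediction(doc, now=None, max_age=MAX_AGE):
    """Return the prediction document if it is fresh enough, else None,
    so the caller can fall back to rules or skip the symbol."""
    if doc is None:
        return None  # missing prediction: same fallback path as stale
    now = now or datetime.now(timezone.utc)
    if now - doc["generated_at"] > max_age:
        return None  # stale: never let outdated model output drive live decisions
    return doc
```

Returning `None` for both the missing and the stale case means the engine has exactly one fallback path to test, instead of two subtly different ones.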
The scanner used the predictions to score symbols. It applied regime-based boosts and penalties. The engine used the same predictions for entry confirmation and position sizing. ML confidence was mapped to size within the risk framework. That end-to-end flow gave the team a single pipeline that was explainable, auditable, and tunable without redeploying the whole stack.
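The regime adjustments and confidence-to-size mapping can be sketched as below. The boost, penalty, and confidence-floor values are hypothetical placeholders, not the client's tuned parameters; the structure is what matters: flat regimes suppress the score, trending regimes lift it, and confidence below a floor maps to no position at all.

```python
def score_symbol(base_score, pred, trend_boost=1.25, flat_penalty=0.5):
    """Apply regime adjustments to a scanner score (illustrative multipliers):
    penalize symbols flagged as flat, boost symbols in a trending regime."""
    if pred["labels"].get("flat"):
        return base_score * flat_penalty
    if pred["labels"].get("trend") in ("up", "down"):
        return base_score * trend_boost
    return base_score

def position_size(max_size, confidence, floor=0.55):
    """Map ML confidence to size within the risk framework: below the floor,
    take no position; above it, scale linearly up to max_size."""
    if confidence < floor:
        return 0.0
    return max_size * (confidence - floor) / (1.0 - floor)
```

Because both functions read the same prediction document, the scanner's watchlist ranking and the engine's entry sizing stay consistent by construction.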
Results and metrics
Directional precision improved sharply. The 1440m (daily) up label went from roughly 14% precision to 50–65%; the 1440m down label from roughly 47% to 75–85%. Watchlist quality improved: the share of symbols in a trending regime went from about 50% to about 80%. Operationally, one pipeline now serves scanning, regime detection, and execution. The team does not maintain separate models for different timeframes or symbols. Retraining follows a schedule with the same code and data pipeline. The result is production-grade ML that the team trusts and can iterate on.
What this means for you
If you are running or building automated trading or signal systems and have struggled with class imbalance, regime changes, or the gap between backtest and live, this case study shows that a general model with careful balance selection and a clear inference-to-execution path can work. We brought experience in both the ML pipeline and the trading engine so the client did not have to integrate two separate vendors.
"We went from one-off experiments to a system that runs every hour and feeds the engine. The regime logic and confidence calibration made the difference between backtest and live."

Quant / engineering lead
If you are evaluating ML for trading, risk, or decision systems and want a partner who has shipped this kind of pipeline to production, we can walk through your data, your labels, and your latency requirements in a discovery call.
