status: in_progress
category: ml_and_finance

ML Trading Pipeline (research)

Research and backtesting tooling: a modular ML pipeline for stock prediction with proper temporal splitting, SHAP explainability, and vectorbt portfolio simulation.

fig. 1 · ml-trading-pipeline

What this is (and what it isn't)

This is the research half of a two-project trading stack: it ingests historical data, engineers features, trains and compares supervised ML models, generates SHAP explanations, and runs realistic backtests with transaction costs. The companion project is the Algorithmic Trading Platform, which is the live execution half: event-driven, real-time, microservices.

This pipeline never sees the live market. Its job is to answer "would this strategy have worked, with proper temporal hygiene?" before any of it goes near real money.

Pipeline architecture

  1. Data ingestion. Alpaca Markets historical API with Parquet caching for fast iteration.
  2. Feature engineering. 12 technical indicators: SMA, EMA, RSI, MACD, Bollinger Bands, ATR, OBV, and others.
  3. Temporal splitting. Train / validation / test with strict lookahead prevention. Every transformation that could leak future information is gated.
  4. Model training. Random Forest, XGBoost, LightGBM, all with hyperparameter tuning. The tuning runs on the validation window only, and the test window is locked.
  5. Explainability. SHAP values per model so feature importance is comparable across the three candidates, and so a particular trade signal can be inspected.
  6. Backtesting. vectorbt for portfolio simulation with transaction costs, slippage assumptions, and configurable execution lag.
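
Steps 3 and 6 are the ones most easily gotten wrong, so here is a minimal sketch of both. It is written with the standard library only and is not the project's code. The names `temporal_split`, `net_returns`, `gap`, `cost_per_trade`, and `lag` are hypothetical, and `net_returns` is a hand-rolled illustration of what an execution lag and a proportional cost mean, not vectorbt's API.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    rows: Sequence[T],
    train_frac: float = 0.6,
    val_frac: float = 0.2,
    gap: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered rows into train / validation / test, never shuffling.

    `gap` rows are skipped after each boundary (assumed safeguard) so that a
    label built from future bars cannot straddle two windows.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test window")
    if gap < 0:
        raise ValueError("gap must be non-negative")
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    train = list(rows[:train_end])
    val = list(rows[train_end + gap:val_end])
    test = list(rows[val_end + gap:])
    return train, val, test


def net_returns(
    signals: Sequence[float],
    asset_returns: Sequence[float],
    cost_per_trade: float = 0.001,
    lag: int = 1,
) -> List[float]:
    """Per-bar strategy returns after execution lag and proportional costs.

    signals[t] is the target position decided at bar t. It is only held from
    bar t + lag onward, so a decision never earns the return of the bar it
    was computed on. Cost is charged on the absolute change in position.
    """
    if len(signals) != len(asset_returns):
        raise ValueError("signals and asset_returns must have the same length")
    if lag < 1:
        raise ValueError("lag must be at least 1 to avoid lookahead")
    out: List[float] = []
    prev_pos = 0.0
    for t, bar_return in enumerate(asset_returns):
        pos = float(signals[t - lag]) if t >= lag else 0.0
        cost = cost_per_trade * abs(pos - prev_pos)
        out.append(pos * bar_return - cost)
        prev_pos = pos
    return out


# Usage: ten bars in time order, one bar dropped at each boundary.
train, val, test = temporal_split(list(range(10)), gap=1)
```

Every index in `train` precedes every index in `val`, which precedes every index in `test`. Tuning would then touch only `val`, with `test` evaluated once at the end.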

Why the test suite exists

The suite has 42 unit and integration tests, weighted heavily toward integration. The reason is specific to this domain: in finance ML, almost every bug is a silent one. A leaked feature can give you 99% accuracy in a backtest that doesn't survive contact with the live market. The tests guard against specific failure modes: accidentally shifted labels, uncached data clobbering live data, and indicators that read into the future.
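
One way to test for "indicators that read into the future" is a truncation check: an indicator's value at bar t must not change when every bar after t is removed. The sketch below is illustrative and not taken from the project's suite. `trailing_sma`, `centered_sma`, and `reads_future` are hypothetical names, and the centered average is deliberately leaky to show what the check catches.

```python
from typing import Callable, List, Optional, Sequence

Indicator = Callable[[Sequence[float]], List[Optional[float]]]


def trailing_sma(prices: Sequence[float], window: int = 3) -> List[Optional[float]]:
    """Simple moving average over the current bar and the window - 1 bars before it."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def centered_sma(prices: Sequence[float], window: int = 3) -> List[Optional[float]]:
    """Deliberately leaky: averages bars on both sides of the current bar."""
    half = window // 2
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        lo, hi = i - half, i + half + 1
        if lo < 0 or hi > len(prices):
            out.append(None)
        else:
            out.append(sum(prices[lo:hi]) / window)
    return out


def reads_future(indicator: Indicator, prices: Sequence[float]) -> bool:
    """True if any value at bar t changes once the bars after t are removed."""
    full = indicator(prices)
    for t in range(len(prices)):
        prefix = indicator(prices[:t + 1])
        if prefix[t] != full[t]:
            return True
    return False


PRICES = [10.0, 11.0, 12.5, 12.0, 13.0, 12.8]


# pytest collects these by name; they are also plain functions.
def test_trailing_sma_has_no_lookahead():
    assert not reads_future(trailing_sma, PRICES)


def test_centered_sma_is_flagged():
    assert reads_future(centered_sma, PRICES)
```

The same check applies to any feature function that maps a price history to a per-bar series, which makes it cheap to run against every indicator in a feature set.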

If something here looks over-engineered for a personal project, it's because the alternative (catching a leak the wrong way, in production, with real money) is much worse.

Stack

Python, scikit-learn, XGBoost, LightGBM, SHAP, vectorbt, Alpaca API, Pandas, NumPy, pytest.