
Sensor-Based Exercise Classifier

Classifying 8 gym exercises from smartphone sensor data. Caught a data leak that was inflating accuracy to 100%, then learned what the upper bound actually is.


Overview

Group project for Machine Learning for the Quantified Self (ML4QS) at Vrije Universiteit Amsterdam (M.Sc. AI). Built a system that classifies gym exercises from sensor data recorded by a smartphone strapped to the upper arm, comparing classical ML models with an LSTM. The interesting result is what happened with the data, not the final accuracy.

Data collection

  • Smartphone running PhyPhox strapped to the upper left arm during workouts.
  • Sensors: 3-axis gyroscope, 3-axis accelerometer, compensated accelerometer, light, proximity, GPS, magnetometer.
  • Polling at ~500 Hz (0.002 s intervals) for the accelerometer and gyroscope; sporadic, interrupt-driven readings for GPS and light.
  • 8 exercises across 4 muscle groups: Chest (Bench Press, Cable Flys), Back (Deadlift, Pull-ups), Arms (Bicep Curls, Shoulder Press), Core (Crunches, Russian Twists). Plus a Rest class. 9-way classification → random baseline ≈ 11%.
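
The ≈ 11% figure is simply 1/9 and assumes every class is equally likely. A minimal sketch of that arithmetic follows, next to the majority-class rate, which is the stricter reference point if one class (Rest, say) dominates the recordings. The label list and its balanced counts are made up for illustration and are not the project's data.

```python
from collections import Counter

def chance_baselines(labels):
    """Return (uniform_chance, majority_class_rate) for a list of class labels."""
    counts = Counter(labels)
    uniform = 1.0 / len(counts)                    # guess uniformly among classes
    majority = max(counts.values()) / len(labels)  # always predict the commonest class
    return uniform, majority

CLASSES = [
    "Bench Press", "Cable Flys", "Deadlift", "Pull-ups", "Bicep Curls",
    "Shoulder Press", "Crunches", "Russian Twists", "Rest",
]

# Hypothetical, perfectly balanced label list (10 samples per class).
labels = [name for name in CLASSES for _ in range(10)]
uniform, majority = chance_baselines(labels)
print(f"uniform chance: {uniform:.3f}, majority class: {majority:.3f}")
```

With imbalanced labels the two numbers diverge, and the majority-class rate is the one a model has to beat.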

Feature engineering

Readings were aggregated into 0.02 s intervals. Derived features: velocity (from integrating acceleration) and resultant acceleration magnitude. A Butterworth low-pass filter was applied (50 Hz base, 55 Hz cutoff), and KNN imputation filled the gaps left by the time aggregation. GPS and magnetometer features were removed after correlation analysis showed they carried no signal for stationary exercises.
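
The aggregation and the two derived features can be sketched without any dependencies. This is an illustration under stated assumptions, not the project's pipeline, which used Pandas and NumPy: the `(t, ax, ay, az)` tuple layout, the function names, and the trapezoidal rule for integration are all hypothetical choices. The low-pass filter and KNN imputation are left out.

```python
import math
from collections import defaultdict

STEP = 0.02  # aggregation interval in seconds, as stated in the text

def aggregate(samples, step=STEP):
    """Average raw (t, ax, ay, az) samples into fixed-width time bins."""
    bins = defaultdict(list)
    for t, ax, ay, az in samples:
        # The small epsilon guards against float error at bin edges (0.06 / 0.02 < 3).
        bins[math.floor(t / step + 1e-9)].append((ax, ay, az))
    rows = []
    for k in sorted(bins):  # bins with no samples are simply absent (the "gaps")
        n = len(bins[k])
        means = [sum(axis) / n for axis in zip(*bins[k])]
        rows.append((k * step, *means))
    return rows

def magnitude(rows):
    """Resultant acceleration magnitude per aggregated row."""
    return [math.sqrt(ax * ax + ay * ay + az * az) for _, ax, ay, az in rows]

def integrate(rows, axis):
    """Cumulative trapezoidal integral of one acceleration axis (1=x, 2=y, 3=z)."""
    velocity = [0.0]
    for prev, cur in zip(rows, rows[1:]):
        dt = cur[0] - prev[0]
        velocity.append(velocity[-1] + 0.5 * (prev[axis] + cur[axis]) * dt)
    return velocity

# Hypothetical input: 0.1 s of constant acceleration sampled every 0.002 s.
raw = [(i * 0.002, 3.0, 4.0, 0.0) for i in range(50)]
rows = aggregate(raw)
mag = magnitude(rows)
vx = integrate(rows, axis=1)
```

Integrating raw acceleration accumulates any sensor bias as drift, so a velocity feature built this way is usually more trustworthy over short windows than over a whole recording.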

The data leak

Random Forest hit 100% accuracy on the test set. Stratified split also gave 100%. SVM and KNN, same. That was the red flag. After tracing the pipeline, the cause was an aggregation step that produced a column behaving like a row index, strongly correlated with class because exercises were appended in a fixed order.
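
A cheap check for this kind of leak is to ask whether any feature tracks row position almost perfectly. The sketch below is one way to do that, assuming the data is held as a dict of column name to list of values. The column names, the synthetic data, and the 0.99 threshold are hypothetical; the write-up does not name the leaked column.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def index_like_columns(table, threshold=0.99):
    """Return names of columns whose values track row position almost perfectly."""
    flagged = []
    for name, values in table.items():
        positions = list(range(len(values)))
        if abs(pearson(values, positions)) >= threshold:
            flagged.append(name)
    return flagged

# Hypothetical table: one oscillating sensor column, one counter-like column.
table = {
    "acc_x": [math.sin(i) for i in range(60)],
    "agg_id": [i * 0.02 for i in range(60)],
}
flagged = index_like_columns(table)
print(flagged)
```

A flagged column is only a candidate: a monotone feature becomes a leak when the rows are also ordered by class, as they were here.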

After removing the leaked column, accuracy fell to ~99% on stratified splits, which still felt high. Once a truly held-out validation set (from a different recording session) was used, the real number revealed itself: roughly 30% at best.
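
The difference between the two evaluations comes down to what the split is keyed on. Below is a minimal sketch of a session-level split, assuming each row carries a `session` identifier; the field names and the tiny dataset are hypothetical.

```python
def split_by_session(rows, test_sessions):
    """Split rows so that no recording session appears on both sides."""
    test_sessions = set(test_sessions)
    train = [r for r in rows if r["session"] not in test_sessions]
    test = [r for r in rows if r["session"] in test_sessions]
    return train, test

# Hypothetical rows: three recording sessions, two labels each.
rows = [
    {"session": s, "label": label, "acc_mag": 9.8 + 0.1 * i}
    for i, (s, label) in enumerate(
        [("mon", "Deadlift"), ("mon", "Rest"), ("wed", "Deadlift"),
         ("wed", "Rest"), ("fri", "Deadlift"), ("fri", "Rest")]
    )
]

train, test = split_by_session(rows, test_sessions=["fri"])
train_sessions = {r["session"] for r in train}
held_out = {r["session"] for r in test}
```

scikit-learn offers the same idea through splitters that take a `groups` argument, such as `GroupKFold` and `LeaveOneGroupOut`. A stratified random split balances class proportions but still lets samples from one session land on both sides.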

What the actual upper bound looks like

| Model | Accuracy on truly unseen data |
| :-- | :-- |
| Random Forest | ~30% (best classical model) |
| RF + Recursive Feature Elimination | 23% |
| RF + PCA (n=8, 95% variance retained) | 16% |
| LSTM (varied batch sizes 32 / 64 / 128) | Did not generalize |
| Random baseline | ~11% |

Better than chance, but a long way from a deployable classifier. The LSTM trained but did not improve over the fully connected baseline; the report concludes that the temporal model couldn't be successfully implemented for this task within the project scope.

What I'd do differently

The post-mortem in the report identifies the real fixes: train one model per body group (a much smaller class space), use temporal convolutional networks (TCNs) instead of LSTMs for the time series, and derive frequency-domain features instead of stopping at time-domain ones. The current pipeline conflates very different motion signatures (e.g. bench press and pull-ups are both upper-body exercises but look nothing alike at the arm-mounted sensor).
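
As one example of a frequency-domain feature, the dominant frequency of a window roughly captures repetition tempo. The sketch below uses a naive DFT to stay dependency-free; in practice `numpy.fft.rfft` would be the obvious choice. The 50 Hz rate follows from the 0.02 s aggregation, while the window length and the 2 Hz test signal are hypothetical. This was not part of the project.

```python
import cmath
import math

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component of a window."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset (e.g. gravity)
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate / n

# Hypothetical 2-second window at 50 Hz containing a 2 Hz oscillation.
SAMPLE_RATE = 50
window = [math.sin(2 * math.pi * 2 * i / SAMPLE_RATE) for i in range(100)]
freq = dominant_frequency(window, SAMPLE_RATE)
print(f"dominant frequency: {freq:.1f} Hz")
```

The frequency resolution is `sample_rate / n`, so longer windows separate slow exercises better at the cost of fewer windows per set.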

The honest takeaway from this project is catching your own data leak before someone else does, and being able to articulate why a 30% classifier is closer to the truth than a 100% one.

Stack

Python, scikit-learn (Random Forest, SVM, KNN, RFE, PCA), TensorFlow/Keras (LSTM), Pandas, NumPy, PhyPhox.