Project overview

This system is a thesis project developed at FEU Institute of Technology under the BS Computer Science with Specialization in Data Science program.

The research applies ensemble learning techniques (Random Forest, Gradient Boosting, XGBoost) combined through stacking to predict the commercial success of digital games on Steam using pre-launch features only.

Research team

  • Balajadia, Gabriel Patrick A.
  • Casas, Juan Carlo C.
  • Estrada, Raphael G.
  • Garcia, Aaron Jacob G.

Thesis adviser: Elisa V. Malasaga

Department: Computer Science, FEU Institute of Technology

Submission: March 2026

Methodology

Data source

The models were trained on 10,000 games collected from the Steam Web API and SteamSpy, using pre-launch attributes only (no post-launch reviews or playtime).

Target variable

Games are classified into 6 ordinal owner tiers:

  • Class 0: ≤10K owners (Common Indie)
  • Class 1: 10K–35K owners (Niche)
  • Class 2: 35K–75K owners (Growing)
  • Class 3: 75K–150K owners (Established)
  • Class 4: 150K–350K owners (Popular)
  • Class 5: >350K owners (Breakout Hit)

Note: classes 5–8 from the original 9-class scheme were merged into Class 5 due to sparse samples, so Class 5 covers every game above the Class 4 upper bound.
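
The tier assignment can be sketched as a simple threshold lookup. This is a minimal illustration, not the project's actual labeling code: the function name `owner_tier` is hypothetical, upper bounds are assumed to be inclusive, and every count above 350K is assumed to fall into the merged Class 5.

```python
from bisect import bisect_left

# Inclusive upper bounds of classes 0-4, taken from the tier list.
# Assumption: anything above the last bound belongs to the merged Class 5.
UPPER_BOUNDS = [10_000, 35_000, 75_000, 150_000, 350_000]
TIER_NAMES = [
    "Common Indie", "Niche", "Growing",
    "Established", "Popular", "Breakout Hit",
]


def owner_tier(owners: int) -> int:
    """Return the ordinal class (0-5) for an estimated owner count."""
    if owners < 0:
        raise ValueError("owner count must be non-negative")
    # bisect_left counts the upper bounds strictly below `owners`,
    # which equals the class index when upper bounds are inclusive.
    return bisect_left(UPPER_BOUNDS, owners)


if __name__ == "__main__":
    for n in (5_000, 35_000, 120_000, 2_000_000):
        tier = owner_tier(n)
        print(f"{n:>9,} owners -> Class {tier} ({TIER_NAMES[tier]})")
```

Because SteamSpy reports owner ranges rather than exact counts, a real pipeline would first have to reduce each range to a single number (for example its midpoint) before a lookup like this applies.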

Ensemble architecture

  1. Base models: Random Forest, Gradient Boosting, XGBoost (trained independently)
  2. Meta-learner: XGBoost classifier combining base-model probabilities

Strengths of each base model:

  • Random Forest: robust feature importance, handles non-linearity
  • Gradient Boosting: sequential error correction, reduces bias
  • XGBoost: regularization, handles missing data, captures complex patterns

Feature engineering

From raw Steam data, 48 pre-launch features were extracted and engineered:

  • Raw (41): pricing, release month, genre, platform, languages, screenshots, trailers, developer/publisher info, Steam features, tags, DLC
  • Composite (7): store page quality, marketing investment, platform reach, publisher backing, localization, Steam integration, mature-content flag
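
The exact formulas behind the composite features are not given here, so the following is only a hypothetical sketch of how a few of them could be derived from raw store-page fields. Every field name (`windows`, `languages`, `required_age`, and so on) and every formula below is an assumption made for illustration.

```python
def composite_features(game: dict) -> dict:
    """Derive a few composite scores from raw store-page fields.

    All field names and formulas are hypothetical illustrations.
    """
    platforms = ("windows", "mac", "linux")
    steam_flags = ("achievements", "trading_cards", "workshop")
    return {
        # Number of supported operating systems (0-3).
        "platform_reach": sum(1 for p in platforms if game.get(p)),
        # Number of supported languages.
        "localization": len(game.get("languages", [])),
        # Number of Steam features enabled (0-3).
        "steam_integration": sum(1 for f in steam_flags if game.get(f)),
        # 1 if the store page sets an adult age gate, else 0.
        "mature_content": int(game.get("required_age", 0) >= 18),
    }


if __name__ == "__main__":
    example = {
        "windows": True, "mac": False, "linux": True,
        "languages": ["English", "Japanese", "German"],
        "achievements": True, "trading_cards": False, "workshop": True,
        "required_age": 18,
    }
    print(composite_features(example))
```

Count-based composites like these compress several sparse binary columns into one ordinal signal, which tree ensembles can split on directly.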

Headline results

  • Weighted F1: 0.6211
  • Macro F1: 0.2688
  • Accuracy: 70.85%
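
A wide gap between weighted F1 and macro F1 is the pattern one would expect when a few large classes are predicted well and small classes poorly. The toy sketch below shows how the three metrics are computed and how they diverge on deliberately imbalanced, made-up labels. It does not reproduce the thesis numbers, and the helper names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def summarize(y_true, y_pred):
    """Return (accuracy, macro F1, weighted F1)."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    macro = sum(f1.values()) / len(f1)                      # every class counts equally
    weighted = sum(f1[c] * support[c] / n for c in f1)      # classes weighted by size
    return accuracy, macro, weighted


if __name__ == "__main__":
    # Made-up labels: 8 of 10 samples are class 0, and the model
    # predicts class 0 for everything.
    y_true = [0] * 8 + [1, 2]
    y_pred = [0] * 10
    acc, macro, weighted = summarize(y_true, y_pred)
    print(f"accuracy={acc:.3f} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
    # accuracy=0.800 macro_f1=0.296 weighted_f1=0.711
```

In the toy case the majority class alone carries accuracy and weighted F1, while the two missed minority classes pull macro F1 down to about 0.30.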

Model interpretability (SHAP)

SHAP (SHapley Additive exPlanations) values were computed to understand which features most strongly influence predictions and how they impact each class.
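
To make the idea concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function, then averages their absolute values across samples to get a global importance. The toy model, its coefficients, and the all-zero baseline are assumptions chosen for illustration. The SHAP library uses much faster model-specific algorithms for tree ensembles, so this only illustrates the definition.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features not yet added to a coalition keep their baseline value.
    Cost grows factorially with the feature count: tiny examples only.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]           # add feature i to the coalition
            new = predict(current)
            phi[i] += new - prev        # marginal contribution of feature i
            prev = new
    n_orders = factorial(n)
    return [v / n_orders for v in phi]


def mean_abs_shap(rows):
    """Global importance: mean absolute Shapley value per feature."""
    return [sum(abs(r[j]) for r in rows) / len(rows) for j in range(len(rows[0]))]


def toy_model(features):
    # Hypothetical score: two linear terms plus one interaction.
    trailers, screenshots, languages = features
    return 2.0 * trailers + 0.5 * screenshots + 0.1 * trailers * languages


if __name__ == "__main__":
    baseline = [0, 0, 0]
    games = [[1, 4, 10], [0, 2, 5]]
    rows = [shapley_values(toy_model, g, baseline) for g in games]
    for g, r in zip(games, rows):
        print(g, "->", [round(v, 3) for v in r])
    print("mean |SHAP|:", [round(v, 3) for v in mean_abs_shap(rows)])
```

For each game the values sum to the difference between the model's output and the baseline output, and the interaction term is split evenly between the two features involved.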

Plots in this section:

  • Global feature importance [figure: SHAP global feature importance]
  • Mean absolute SHAP [figure: SHAP summary]
  • Top single-prediction explanation [figure: SHAP top features]
  • Class-specific SHAP values [figure: SHAP class-specific analysis]
  • Partial dependence (top 4) [figure: partial dependence plot]
  • Permutation importance [figure: permutation importance]

Note: Source files for the plots live in model_outputs_prelaunch/shap_outputs/; copy them into static/images/ for display.

Key findings

  • Most influential features: store page quality (trailers, screenshots, detailed descriptions), publisher backing, platform reach, localization.
  • Genre effects: Indie dominates the dataset but shows lower average success; Action, Strategy, and RPG correlate with higher tiers.
  • Pricing strategy: mid-range prices ($9.99–$19.99) perform best. Free-to-play is bimodal (games either succeed strongly or fail fast).
  • Release timing: October–December releases slightly outperform other months (holiday sales effect).
  • Steam integration: achievements, trading cards, and Workshop support correlate positively with owner tier.

Limitations

  • Platform-specific: Steam only; predictions for Epic / GOG / consoles are unreliable.
  • Temporal drift: gaming trends evolve; retrain periodically with fresh data.
  • Data quality: SteamSpy provides estimated owner ranges (not exact counts), introducing noise.
  • Causation vs. correlation: a high predicted tier does not guarantee success; game quality matters most.
  • Niche genres: predictions for under-represented genres (Racing, Sports) are less reliable.

Download full thesis

The complete thesis document is available for download:

Download thesis PDF

If the PDF is not available, place it at static/thesis/Bit-by-Bit_Thesis.pdf.

Acknowledgments

Our thanks to thesis adviser Ms. Elisa V. Malasaga for her guidance, and to the FEU Institute of Technology Department of Computer Science for resources and support. Special thanks to the maintainers of scikit-learn, XGBoost, SHAP, Pandas, Flask, and Gunicorn.

Contact

Email: [email protected]

Institution: FEU Institute of Technology, Manila, Philippines