About — SAGE

Project overview

This system is a thesis project developed at FEU Institute of Technology under the BS Computer Science with Specialization in Data Science program.

The research applies ensemble learning techniques (Random Forest, Gradient Boosting, XGBoost) combined through stacking to predict the commercial success of digital games on Steam using pre-launch features only.

Research team

Balajadia, Gabriel Patrick A.
Casas, Juan Carlo C.
Estrada, Raphael G.
Garcia, Aaron Jacob G.

Thesis adviser: Elisa V. Malasaga

Department: Computer Science, FEU Institute of Technology

Submission: March 2026

Methodology

Data source

Trained on 10,000 games collected from the Steam Web API and SteamSpy, focusing exclusively on pre-launch attributes (no post-launch reviews or playtime).

Target variable

Games are classified into 6 ordinal owner tiers:

Class 0: ≤10K owners — Common Indie
Class 1: 10K–35K owners — Niche
Class 2: 35K–75K owners — Growing
Class 3: 75K–150K owners — Established
Class 4: 150K–350K owners — Popular
Class 5: ≥750K owners — Breakout Hit

Note: classes 5–8 from the original 9-class scheme were merged into Class 5 due to sparse samples.

Ensemble architecture

Base models: Random Forest, Gradient Boosting, XGBoost (trained independently)
Meta-learner: XGBoost classifier combining base-model probabilities

Random Forest — robust feature importance, handles non-linearity
Gradient Boosting — sequential error correction, reduces bias
XGBoost — regularization, handles missing data, captures complex patterns

Feature engineering

From raw Steam data, 48 pre-launch features were extracted and engineered:

Raw (41): pricing, release month, genre, platform, languages, screenshots, trailers, developer/publisher info, Steam features, tags, DLC
Composite (7): store page quality, marketing investment, platform reach, publisher backing, localization, Steam integration, mature-content flag

Headline results

Weighted F1

0.6211

Macro F1

0.2688

Accuracy

70.85%

Model interpretability (SHAP)

SHAP (SHapley Additive exPlanations) values were computed to understand which features most strongly influence predictions and how they impact each class.

Global feature importance

Mean absolute SHAP

Top single-prediction explanation

Class-specific SHAP values

Partial dependence (top 4)

Permutation importance

Note: Click any plot to enlarge. Source files live in model_outputs_prelaunch/shap_outputs/; copy them into static/images/ for display.

Key findings

Most influential features: store page quality (trailers, screenshots, detailed descriptions), publisher backing, platform reach, localization.
Genre effects: indie dominates the dataset but shows lower average success; Action, Strategy, RPG correlate with higher tiers.
Pricing strategy: mid-range ($9.99–$19.99) performs best. Free-to-play is bimodal (very successful or fails fast).
Release timing: October–December slightly outperforms (holiday sales effect).
Steam integration: achievements, trading cards, and Workshop support correlate positively with owner tier.

Limitations

Platform-specific: Steam only; predictions for Epic / GOG / consoles are unreliable.
Temporal drift: gaming trends evolve; retrain periodically with fresh data.
Data quality: SteamSpy provides estimated owner ranges (not exact counts), introducing noise.
Causation vs correlation: high prediction ≠ guaranteed success — game quality matters most.
Niche genres: under-represented genres (Racing, Sports) are less reliable.

Download full thesis

The complete thesis document is available for download:

Download thesis PDF

If the PDF is not available, place it at static/thesis/Bit-by-Bit_Thesis.pdf.

References

Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems.
Ma, X. (2025). Predicting Video Game Success Using Multi-Platform Data.
Mienye, I. D., et al. (2022). A Survey of Ensemble Learning Techniques.
Wood, M., et al. (2023). Diversity in Ensemble Learning: Theoretical and Empirical Analysis.
Andraž, P., et al. (2021). Validating Steam as a Data Source for Success Prediction.
Kainz, F., & Pirker, J. (2023). Large-Scale Steam Data Collection and Analysis.

Acknowledgments

Our thanks to thesis adviser Ms. Elisa V. Malasaga for her guidance, and to the FEU Institute of Technology Department of Computer Science for resources and support. Special thanks to the maintainers of scikit-learn, XGBoost, SHAP, Pandas, Flask, and Gunicorn.

Contact

Email: [email protected]

Institution: FEU Institute of Technology, Manila, Philippines