Training Data
The dataset behind SAGE
SAGE is trained on roughly 10,000 Steam games merged from
SteamSpy (ownership & review aggregates) and the
Steam Web API (store-page metadata). The pipeline
below shows exactly how the raw sources become the
steam_10k_prelaunch.csv file used by the model.
- Games (rows): 10,000
- Columns total: 71
- Pre-launch features used: 48
- Owner-tier classes: 6
1 · Data sources
SteamSpy (base CSV)
Provides aggregate ownership buckets, review counts, playtime
statistics, price, genre flags, and release month. This forms the
base table keyed by appid.
- owners
- positive / negative
- price
- release_month
- genre flags
Steam Web API (JSON)
Per-game store-page metadata: descriptions, screenshots, trailers,
supported languages, platforms, categories, tags, packages, and
achievements. Loaded as a dict keyed by appid.
- screenshots
- movies
- categories
- tags
- supported_languages
- publishers
2 · Preprocessing pipeline
Two scripts produce the final training file.
The first enriches and merges; the second selects pre-launch-only
features and builds the classification target.
1. Load base CSV & Steam JSON
Read SteamSpy CSV into a DataFrame; load Steam Web API JSON
keyed by appid string.
2. Extract pre-launch features from JSON
For each game, derive ~30 columns: platform support,
language counts, screenshot & trailer counts, description
lengths, publisher/developer counts, Steam category flags
(achievements, workshop, cloud, controller, VR…), tag counts
and top-tag vote totals, package & SKU counts.
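A minimal sketch of this extraction step, for a small subset of the columns. The JSON field names (`platforms`, `screenshots`, `movies`, `supported_languages`, `about_the_game`, `publishers`, `categories`) and the helper name `extract_prelaunch_features` are assumptions for illustration; the actual script may read different keys and handles many more columns.

```python
import re


def extract_prelaunch_features(entry: dict) -> dict:
    """Derive a few pre-launch columns from one store-page record.

    Hypothetical helper: the field names are assumed, not taken from the
    original pipeline. Missing fields fall back to empty values.
    """
    platforms = entry.get("platforms") or {}
    screenshots = entry.get("screenshots") or []
    movies = entry.get("movies") or []
    categories = entry.get("categories") or []

    # Assumed format: a comma-separated string that may contain HTML markup.
    languages_raw = re.sub(r"<[^>]+>", "", entry.get("supported_languages") or "")
    languages = [name.strip() for name in languages_raw.split(",") if name.strip()]

    category_names = {c.get("description", "") for c in categories}

    return {
        "platform_count": sum(1 for supported in platforms.values() if supported),
        "supported_languages_count": len(languages),
        "screenshot_count": len(screenshots),
        "trailer_count": len(movies),
        "has_trailer": int(len(movies) > 0),
        "about_length": len(entry.get("about_the_game") or ""),
        "publisher_count": len(entry.get("publishers") or []),
        "category_count": len(categories),
        "has_achievements": int("Steam Achievements" in category_names),
    }


# Toy input keyed by appid string, mirroring how the JSON is loaded.
steam_json = {
    "10": {
        "platforms": {"windows": True, "mac": True, "linux": False},
        "screenshots": [{"id": 0}, {"id": 1}, {"id": 2}],
        "movies": [{"id": 1}],
        "supported_languages": "English<strong>*</strong>, French, German",
        "about_the_game": "A short pitch.",
        "publishers": ["Example Publisher"],
        "categories": [
            {"description": "Single-player"},
            {"description": "Steam Achievements"},
        ],
    },
    "20": {},  # a record with no usable metadata
}

features_by_appid = {
    appid: extract_prelaunch_features(entry) for appid, entry in steam_json.items()
}
```

Falling back to empty containers means a sparse record yields zeros rather than raising, which lines up with the zero-fill applied after the merge.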
3. Merge on appid (left join)
Base CSV ⟕ JSON features. Rows with no JSON match keep
their base columns; new numeric features are filled with
0.
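The merge can be sketched with pandas as follows. The toy frames and the variable names are illustrative assumptions; the point is the key-type alignment (the JSON is keyed by appid strings, while a CSV column is typically read as integers), the left join, and the zero-fill restricted to the newly added columns.

```python
import pandas as pd

# Toy base table standing in for the SteamSpy CSV.
base = pd.DataFrame(
    {"appid": [10, 20, 30], "price": [999, 0, 1499], "release_month": [3, 11, 6]}
)

# Toy per-game features keyed by appid *string*; appid 20 has no JSON match.
features_by_appid = {
    "10": {"screenshot_count": 8, "trailer_count": 2},
    "30": {"screenshot_count": 5, "trailer_count": 0},
}

json_features = pd.DataFrame.from_dict(features_by_appid, orient="index")
# Align key types before joining: string keys -> integer appid column.
json_features.index = json_features.index.astype("int64")
json_features = json_features.rename_axis("appid").reset_index()

# Left join keeps every base row, matched or not.
merged = base.merge(json_features, on="appid", how="left")

# Fill only the new numeric features with 0; base columns are left untouched.
new_cols = [c for c in json_features.columns if c != "appid"]
merged[new_cols] = merged[new_cols].fillna(0).astype("int64")
```

Limiting `fillna(0)` to `new_cols` avoids silently zeroing genuine gaps in the base columns, which are handled later by median imputation.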
4. Engineer composite scores
Seven derived features combine raw signals into
interpretable proxies: store_page_score,
platform_reach, marketing_score,
publisher_backing, localization_score,
steam_integration, and an
is_mature_content flag.
5. Drop post-launch columns (leakage prevention)
Reviews, playtime, CCU, and owners themselves are excluded
from the model — they wouldn't exist before launch and would
leak the target.
6. Build ordinal target & split
Owners bucket → ordinal class 0–5 (top sparse buckets
merged into ≥750K). Median imputation for any
remaining numeric NaNs, then a stratified 80/20 train-test
split followed by z-score scaling.
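The imputation, split, and scaling in this step could look roughly like the pandas-only sketch below. The synthetic data, column names, and random seed are assumptions, and the original script may well use scikit-learn instead. Fitting the scaling statistics on the training rows only is also an assumption made here; the description above does not say which rows they are computed from.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training table (three imbalanced classes).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "price": rng.uniform(0, 60, n),
        "screenshot_count": rng.integers(0, 20, n).astype(float),
        "owner_class": rng.choice([0, 1, 2], size=n, p=[0.7, 0.2, 0.1]),
    }
)
df.loc[df.index[::25], "price"] = np.nan  # inject a few missing values

feature_cols = ["price", "screenshot_count"]

# 1) Median imputation for remaining numeric NaNs.
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

# 2) Stratified 80/20 split: sample 20% of each class for the test set.
test = df.groupby("owner_class").sample(frac=0.2, random_state=42)
train = df.drop(test.index)

# 3) z-score scaling with statistics taken from the training rows.
mu = train[feature_cols].mean()
sigma = train[feature_cols].std(ddof=0).replace(0, 1.0)  # guard constant columns
X_train = (train[feature_cols] - mu) / sigma
X_test = (test[feature_cols] - mu) / sigma
y_train, y_test = train["owner_class"], test["owner_class"]
```

Sampling per class keeps the class proportions close to identical in both splits, which matters when the majority class holds roughly 69% of the rows.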
3 · Target variable
The model predicts an ordinal owner
tier, not a raw owner count. The original 9 SteamSpy buckets
are collapsed to 6 because the top 4 had too few samples (n = 136, 88,
53, 19) for reliable learning.
| Class | Owner tier | Notes |
|-------|------------|-------|
| 0 | ≤ 10K | Majority class (~69%) |
| 1 | ≤ 35K | |
| 2 | ≤ 75K | |
| 3 | ≤ 150K | |
| 4 | ≤ 350K | |
| 5 | ≥ 750K | Merged top 4 sparse buckets |
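The table can be read as a lookup from an owner figure to a class index. The sketch below assumes the owners value is already numeric (for example a bucket midpoint); how the raw SteamSpy bucket is represented in the CSV is not stated here, and `owner_class` is a hypothetical helper name.

```python
from bisect import bisect_left

# Upper bounds of classes 0-4, taken from the table; anything above the
# last bound falls into the merged top class 5.
TIER_UPPER_BOUNDS = [10_000, 35_000, 75_000, 150_000, 350_000]
TIER_LABELS = ["≤ 10K", "≤ 35K", "≤ 75K", "≤ 150K", "≤ 350K", "≥ 750K"]


def owner_class(owners: float) -> int:
    """Map a numeric owner figure to its ordinal class (0-5).

    Values between 350K and 750K are not defined by the table; this sketch
    places them in class 5 (an assumption).
    """
    return bisect_left(TIER_UPPER_BOUNDS, owners)


examples = [10_000, 35_000, 150_000, 350_000, 750_000, 3_500_000]
classes = [owner_class(v) for v in examples]
labels = [TIER_LABELS[c] for c in classes]
```

Because the classes are ordered, the integer index carries rank information that a one-hot label would discard.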
4 · Pre-launch features (49)
Only signals that exist before a game
ships are used. Grouped by origin:
Pricing & release
- price
- initialprice
- is_free
- release_month
Genre flags (one-hot)
- Action
- Adventure
- RPG
- Strategy
- Simulation
- Indie
- Sports
- Racing
Store-page richness
- screenshot_count
- has_trailer
- trailer_count
- about_length
- has_detailed_desc
- has_website
- has_support_email
Platform & localization
- platform_count
- supported_languages_count
- full_audio_languages_count
- required_age
Studio signals
- developer_count
- publisher_count
- has_publisher
- is_solo_dev
Steam ecosystem flags
- is_multiplayer
- has_achievements
- has_trading_cards
- has_workshop
- has_cloud_save
- has_controller_support
- has_vr_support
- has_in_app_purchases
- has_family_sharing
Categories, tags & packaging
- category_count
- tag_count
- has_multiplayer_tag
- top_tag_votes_total
- top_tag_votes_mean
- dlc_count
- package_count
- sku_count
- achievement_count
Engineered composite scores
- store_page_score
- platform_reach
- marketing_score
- publisher_backing
- localization_score
- steam_integration
- is_mature_content
Columns explicitly excluded (post-launch / leakage)
- positive
- negative
- total_reviews
- positive_ratio
- average_forever
- average_2weeks
- median_forever
- median_2weeks
- ccu
- owners
- log_owners
- json_price_raw
- appid
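One way to enforce this exclusion list in code is sketched below. The column names come from the list above; the helper name `select_prelaunch`, the toy frame, and the choice of `owners` as the column holding the target are assumptions for illustration.

```python
import pandas as pd

# Post-launch / leakage columns, copied from the list above.
EXCLUDED_COLUMNS = [
    "positive", "negative", "total_reviews", "positive_ratio",
    "average_forever", "average_2weeks", "median_forever", "median_2weeks",
    "ccu", "owners", "log_owners", "json_price_raw", "appid",
]


def select_prelaunch(df: pd.DataFrame, target_col: str = "owners"):
    """Split a merged table into pre-launch features X and the target source y."""
    y = df[target_col]
    # errors="ignore" tolerates tables that lack some of the excluded columns.
    X = df.drop(columns=EXCLUDED_COLUMNS, errors="ignore")
    leaked = set(X.columns) & set(EXCLUDED_COLUMNS)
    if leaked:
        raise ValueError(f"Leakage columns still present: {sorted(leaked)}")
    return X, y


toy = pd.DataFrame(
    {
        "appid": [10, 20],
        "price": [999, 0],
        "screenshot_count": [8, 3],
        "positive": [120, 4],
        "ccu": [35, 0],
        "owners": [35_000, 10_000],
    }
)
X, y = select_prelaunch(toy)
```

Keeping the list as a single constant makes it easy to audit, and the explicit check fails loudly if an excluded column ever survives into the feature matrix.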
5 · Reproducibility
The full pipeline lives in two Python scripts; their output is the file the model trains on:
- enrich_prelaunch.py: Merges SteamSpy CSV with Steam API JSON and engineers composite features.
- train_prelaunch_model.py: Selects pre-launch features, builds the ordinal target, and trains the stacked ensemble.
- steam_10k_prelaunch.csv: Final 10,000 × 71 training file consumed by the model.