Training Data

The dataset behind SAGE

SAGE is trained on roughly 10,000 Steam games merged from SteamSpy (ownership & review aggregates) and the Steam Web API (store-page metadata). The pipeline below shows exactly how the raw sources become the steam_10k_prelaunch.csv file used by the model.

10,000 Games (rows)
71 Columns total
48 Pre-launch features used
6 Owner-tier classes

1 · Data sources

SteamSpy (base CSV)

Provides aggregate ownership buckets, review counts, playtime statistics, price, genre flags, and release month. This forms the base table keyed by appid.

  • owners
  • positive / negative
  • price
  • release_month
  • genre flags

Steam Web API (JSON)

Per-game store-page metadata: descriptions, screenshots, trailers, supported languages, platforms, categories, tags, packages, and achievements. Loaded as a dict keyed by appid.

  • screenshots
  • movies
  • categories
  • tags
  • supported_languages
  • publishers
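
As a rough illustration of how the fields listed above might turn into model columns, the sketch below derives a few count features from one store-page record and left-joins them onto base rows by appid. The record layout (lists for screenshots, movies, publishers, and categories; a dict for tags) and the helper names extract_features and left_join are assumptions for illustration, not taken from enrich_prelaunch.py. It uses plain dicts rather than a DataFrame so that it runs without dependencies.

```python
def extract_features(record):
    """Derive a few pre-launch count features from one store-page record.

    Assumed record shape (hypothetical): list-valued "screenshots", "movies",
    "publishers", "categories", and a dict-valued "tags".
    """
    screenshots = record.get("screenshots") or []
    movies = record.get("movies") or []
    # Drop empty publisher names so an entry like [""] counts as "no publisher".
    publishers = [p for p in (record.get("publishers") or []) if p]
    categories = record.get("categories") or []
    tags = record.get("tags") or {}
    return {
        "screenshot_count": len(screenshots),
        "trailer_count": len(movies),
        "has_trailer": int(len(movies) > 0),
        "publisher_count": len(publishers),
        "has_publisher": int(len(publishers) > 0),
        "category_count": len(categories),
        "tag_count": len(tags),
    }


def left_join(base_rows, steam_json):
    """Left-join JSON features onto base rows; unmatched rows get zeros."""
    zero_features = dict.fromkeys(extract_features({}), 0)
    merged = []
    for row in base_rows:
        # The JSON is keyed by appid *strings*, so cast before the lookup.
        record = steam_json.get(str(row["appid"]))
        features = extract_features(record) if record is not None else zero_features
        merged.append({**row, **features})
    return merged


# Toy inputs: two base rows, only one of which has store-page metadata.
base_rows = [{"appid": 10, "price": 999}, {"appid": 20, "price": 0}]
steam_json = {
    "10": {
        "screenshots": [{"id": 0}, {"id": 1}],
        "movies": [{"id": 7}],
        "publishers": ["Example Publisher"],
        "categories": [{"description": "Single-player"}],
        "tags": {"Indie": 120, "Puzzle": 45},
    }
}
merged = left_join(base_rows, steam_json)
```

Every base row survives the join, and the game without a JSON entry keeps its base columns with the new features set to 0.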

2 · Preprocessing pipeline

Two scripts produce the final training file. The first enriches and merges; the second selects pre-launch-only features and builds the classification target.

  1. Load base CSV & Steam JSON

    Read SteamSpy CSV into a DataFrame; load Steam Web API JSON keyed by appid string.

  2. Extract pre-launch features from JSON

    For each game, derive ~30 columns: platform support, language counts, screenshot & trailer counts, description lengths, publisher/developer counts, Steam category flags (achievements, workshop, cloud, controller, VR…), tag counts and top-tag vote totals, package & SKU counts.

  3. Merge on appid (left join)

    Base CSV ⟕ JSON features. Rows with no JSON match keep their base columns; new numeric features are filled with 0.

  4. Engineer composite scores

    Seven derived features combine raw signals into interpretable proxies: store_page_score, platform_reach, marketing_score, publisher_backing, localization_score, steam_integration, and an is_mature_content flag.

  5. Drop post-launch columns (leakage prevention)

    Reviews, playtime, CCU, and the owner counts themselves are excluded from the model. The first three do not exist before launch, and owners is the quantity the target is built from, so keeping any of them would leak the answer.

  6. Build ordinal target & split

    Owners bucket → ordinal class 0–5 (top sparse buckets merged into ≥750K). Median imputation for any remaining numeric NaNs, then a stratified 80/20 train-test split followed by z-score scaling.
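
The final step (median imputation, stratified 80/20 split, z-score scaling) can be sketched in plain Python as follows. This is a minimal, dependency-free illustration rather than the code in train_prelaunch_model.py: the helper names are hypothetical, and it assumes the scaling statistics are computed on the training portion only, which the description above does not state explicitly.

```python
import random
from collections import defaultdict
from statistics import mean, median, pstdev


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [x for x in column if x is not None]
    fill = median(observed)
    return [fill if x is None else x for x in column]


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with about test_frac of each class in test."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)  # seeded so the split is reproducible
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def zscore_params(train_values):
    """Mean and population std of the training values (assumption: train only)."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero on constant columns
    return mu, sigma


# Toy data: 100 games with imbalanced classes and one missing price.
labels = [0] * 70 + [1] * 20 + [5] * 10
price = [float(i) for i in range(100)]
price[3] = None

price = impute_median(price)
train_idx, test_idx = stratified_split(labels)
mu, sigma = zscore_params([price[i] for i in train_idx])
train_scaled = [(price[i] - mu) / sigma for i in train_idx]
test_scaled = [(price[i] - mu) / sigma for i in test_idx]
```

Stratifying by class keeps the rare upper tiers represented in both portions, which a purely random split would not guarantee when one class holds most of the rows.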

3 · Target variable

The model predicts an ordinal owner tier, not a raw owner count. The original 9 SteamSpy buckets are collapsed to 6: the 5 lowest are kept as-is and the top 4 are merged into a single class, because each of those had too few samples (n = 136, 88, 53, 19; 296 combined) for reliable learning.

Class   Owner tier   Notes
0       ≤ 10K        Majority class (~69%)
1       ≤ 35K
2       ≤ 75K
3       ≤ 150K
4       ≤ 350K
5       ≥ 750K       Merged top 4 sparse buckets
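
One way to express the mapping in the table is a threshold lookup. The sketch below is an assumption-laden illustration: it treats the tier labels as inclusive upper bounds on a numeric owners value, and the names TIER_UPPER_BOUNDS and owner_class are hypothetical. How the real script parses SteamSpy's bucket values is not shown here.

```python
from bisect import bisect_left

# Upper bounds of classes 0-4, taken from the table; anything above the last
# bound falls into the merged top class 5.
TIER_UPPER_BOUNDS = [10_000, 35_000, 75_000, 150_000, 350_000]


def owner_class(owners):
    """Map a numeric owners value to an ordinal class in 0..5."""
    # bisect_left returns the index of the first bound >= owners, so a value
    # exactly on a bound stays in the lower class.
    return bisect_left(TIER_UPPER_BOUNDS, owners)


examples = [10_000, 35_000, 150_000, 750_000, 7_500_000]
classes = [owner_class(v) for v in examples]
```

Because the classes are ordered, class 5 absorbs every value above 350K, which corresponds to merging the sparse top buckets into a single tier.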

4 · Pre-launch features

Only signals that exist before a game ships are used. Grouped by origin:

Pricing & release

  • price
  • initialprice
  • is_free
  • release_month

Genre flags (one-hot)

  • Action
  • Adventure
  • RPG
  • Strategy
  • Simulation
  • Indie
  • Sports
  • Racing

Store-page richness

  • screenshot_count
  • has_trailer
  • trailer_count
  • about_length
  • has_detailed_desc
  • has_website
  • has_support_email

Platform & localization

  • platform_count
  • supported_languages_count
  • full_audio_languages_count
  • required_age

Studio signals

  • developer_count
  • publisher_count
  • has_publisher
  • is_solo_dev

Steam ecosystem flags

  • is_multiplayer
  • has_achievements
  • has_trading_cards
  • has_workshop
  • has_cloud_save
  • has_controller_support
  • has_vr_support
  • has_in_app_purchases
  • has_family_sharing

Categories, tags & packaging

  • category_count
  • tag_count
  • has_multiplayer_tag
  • top_tag_votes_total
  • top_tag_votes_mean
  • dlc_count
  • package_count
  • sku_count
  • achievement_count

Engineered composite scores

  • store_page_score
  • platform_reach
  • marketing_score
  • publisher_backing
  • localization_score
  • steam_integration
  • is_mature_content

Columns explicitly excluded (post-launch / leakage)

  • positive
  • negative
  • total_reviews
  • positive_ratio
  • average_forever
  • average_2weeks
  • median_forever
  • median_2weeks
  • ccu
  • owners
  • log_owners
  • json_price_raw
  • appid
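
The exclusion list above can be applied as a simple deny-list filter with a guard that fails loudly if a leaky column slips through. This is a sketch under assumptions: the text does not say whether the real script filters with a deny-list or selects features from an allow-list, and the names EXCLUDED_COLUMNS and select_feature_columns are hypothetical.

```python
# Column names copied from the exclusion list above.
EXCLUDED_COLUMNS = {
    "positive", "negative", "total_reviews", "positive_ratio",
    "average_forever", "average_2weeks", "median_forever", "median_2weeks",
    "ccu", "owners", "log_owners", "json_price_raw", "appid",
}


def select_feature_columns(columns):
    """Keep the original column order, dropping anything on the exclusion list."""
    return [c for c in columns if c not in EXCLUDED_COLUMNS]


# Toy header mixing pre-launch and post-launch columns.
columns = ["appid", "price", "is_free", "positive", "ccu", "screenshot_count", "owners"]
features = select_feature_columns(columns)

# Guard: no excluded column may remain among the model inputs.
assert EXCLUDED_COLUMNS.isdisjoint(features)
```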

5 · Reproducibility

The full pipeline lives in two Python scripts, and their output is the file the model trains on:

enrich_prelaunch.py        Merges the SteamSpy CSV with the Steam API JSON and engineers composite features.
train_prelaunch_model.py   Selects pre-launch features, builds the ordinal target, and trains the stacked ensemble.
steam_10k_prelaunch.csv    Final 10,000 × 71 training file consumed by the model.
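
After regenerating the file, a quick shape check can confirm it matches the stated 10,000 × 71. The helper below is a hypothetical convenience, not part of either script, and it assumes the CSV has a single header row.

```python
import csv


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file with one header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Intended use once the pipeline has been run:
#   assert csv_shape("steam_10k_prelaunch.csv") == (10_000, 71)
```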