Training Data

The dataset behind SAGE

SAGE is trained on roughly 10,000 Steam games merged from SteamSpy (ownership & review aggregates) and the Steam Web API (store-page metadata). The pipeline below shows exactly how the raw sources become the steam_10k_prelaunch.csv file used by the model.

10,000 Games (rows)

71 Columns total

48 Pre-launch features used

6 Owner-tier classes

1 · Data sources

SteamSpy (base CSV)

Provides aggregate ownership buckets, review counts, playtime statistics, price, genre flags, and release month. This forms the base table keyed by appid.

owners
positive / negative
price
release_month
genre flags

Steam Web API (JSON)

Per-game store-page metadata: descriptions, screenshots, trailers, supported languages, platforms, categories, tags, packages, and achievements. Loaded as a dict keyed by appid.

screenshots
movies
categories
tags
supported_languages
publishers
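
As a rough illustration of how counts and flags might be derived from one store-page record, here is a minimal sketch. It assumes each JSON entry holds `screenshots`, `movies`, and `publishers` as lists and `supported_languages` as a comma-separated string that may contain HTML markup. Those field shapes, the sample record, and the helper names `count_languages` and `extract_features` are assumptions for illustration, not the contents of the project's scripts.

```python
import re


def count_languages(raw):
    """Count entries in a comma-separated language string (assumed format)."""
    if not raw:
        return 0
    main = raw.split("<br>")[0]          # drop any trailing footnote line
    text = re.sub(r"<[^>]+>", "", main)  # strip HTML tags
    text = text.replace("*", "")         # strip footnote markers
    return len([part for part in (p.strip() for p in text.split(",")) if part])


def extract_features(entry):
    """Turn one store-page record into a flat dict of numeric features."""
    screenshots = entry.get("screenshots") or []
    movies = entry.get("movies") or []
    publishers = [p for p in (entry.get("publishers") or []) if p.strip()]
    return {
        "screenshot_count": len(screenshots),
        "trailer_count": len(movies),
        "has_trailer": int(bool(movies)),
        "supported_languages_count": count_languages(entry.get("supported_languages")),
        "publisher_count": len(publishers),
        "has_publisher": int(bool(publishers)),
    }


# Hypothetical record: two screenshots, no trailers, an empty publisher name.
sample_entry = {
    "screenshots": [{"id": 0}, {"id": 1}],
    "supported_languages": "English<strong>*</strong>, French, German"
                           "<br><strong>*</strong>languages with full audio support",
    "publishers": [""],
}
sample_features = extract_features(sample_entry)
print(sample_features)
```

Missing keys fall back to empty values, so a record with no trailers or no publisher yields zeros rather than raising an error.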

2 · Preprocessing pipeline

Two scripts produce the final training file. The first enriches and merges; the second selects pre-launch-only features and builds the classification target.

1. Load base CSV & Steam JSON

Read the SteamSpy CSV into a DataFrame; load the Steam Web API JSON as a dict keyed by appid string.

2. Extract pre-launch features from JSON

For each game, derive ~30 columns: platform support, language counts, screenshot & trailer counts, description lengths, publisher/developer counts, Steam category flags (achievements, workshop, cloud, controller, VR…), tag counts and top-tag vote totals, package & SKU counts.

3. Merge on appid (left join)

Base CSV ⟕ JSON features. Rows with no JSON match keep their base columns; new numeric features are filled with 0.

4. Engineer composite scores

Seven derived features combine raw signals into interpretable proxies: store_page_score, platform_reach, marketing_score, publisher_backing, localization_score, steam_integration, and an is_mature_content flag.

5. Drop post-launch columns (leakage prevention)

Reviews, playtime, CCU, and owners themselves are excluded from the model: none of them exist before launch, and owners is the quantity the target is built from, so keeping any of them would leak the target.

6. Build ordinal target & split

Owners bucket → ordinal class 0–5 (top sparse buckets merged into ≥750K). Median imputation for any remaining numeric NaNs, then a stratified 80/20 train-test split followed by z-score scaling.
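
Steps 3 and 6 can be sketched in plain Python on toy data. This is a simplified stand-in for the DataFrame-based scripts, not their actual code. The names `left_join`, `stratified_split`, `fit_zscore`, `apply_zscore`, and `owner_class` are hypothetical; fitting the scaler on the training rows only is an assumption; median imputation is omitted for brevity.

```python
import random
import statistics
from collections import defaultdict

FEATURES = ["screenshot_count", "trailer_count"]  # illustrative subset


def left_join(base_rows, json_features, feature_names):
    """Keep every base row; features missing from the JSON side become 0."""
    merged = []
    for row in base_rows:
        extra = json_features.get(str(row["appid"]), {})
        filled = {name: extra.get(name, 0) for name in feature_names}
        merged.append({**row, **filled})
    return merged


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split each class separately so both parts keep the class proportions."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def fit_zscore(train_rows, column):
    """Compute mean and standard deviation from the training rows only."""
    values = [row[column] for row in train_rows]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant columns
    return mean, std


def apply_zscore(rows, column, mean, std):
    """Return scaled copies of the rows, leaving the inputs untouched."""
    return [{**row, column: (row[column] - mean) / std} for row in rows]


# Toy data: ten games, two classes; appids 9 and 10 have no JSON match.
base = [{"appid": i, "owner_class": i % 2} for i in range(1, 11)]
json_features = {str(i): {"screenshot_count": i, "trailer_count": 1} for i in range(1, 9)}

merged = left_join(base, json_features, FEATURES)
train, test = stratified_split(merged, "owner_class")
mean, std = fit_zscore(train, "screenshot_count")
train_scaled = apply_zscore(train, "screenshot_count", mean, std)
test_scaled = apply_zscore(test, "screenshot_count", mean, std)
print(len(train_scaled), len(test_scaled))
```

Splitting per class matters here because the majority class dominates the data; a plain random split could leave the rarest tiers nearly absent from the test set.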

3 · Target variable

The model predicts an ordinal owner tier, not a raw owner count. The original 9 SteamSpy buckets are collapsed to 6 because the top 4 had too few samples (n = 136, 88, 53, 19) for reliable learning.

Class	Owner tier	Notes
`0`	≤ 10K	Majority class (~69%)
`1`	≤ 35K
`2`	≤ 75K
`3`	≤ 150K
`4`	≤ 350K
`5`	≥ 750K	Merged top 4 sparse buckets
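
The collapse from nine buckets to six classes can be illustrated with a short sketch. It assumes the `owners` column holds range strings of the form `low .. high` and ranks the distinct buckets by their lower bound, capping the rank at 5. The sample ranges and the helper names `parse_lower_bound` and `build_class_map` are illustrative assumptions, not values taken from the training file.

```python
def parse_lower_bound(owners: str) -> int:
    """Extract the lower bound from a range string such as '20,000 .. 50,000'."""
    low = owners.split("..")[0]
    return int(low.replace(",", "").strip())


def build_class_map(owner_strings, n_classes=6):
    """Rank distinct buckets by lower bound; merge the top ones into the last class."""
    lows = sorted({parse_lower_bound(s) for s in owner_strings})
    return {low: min(rank, n_classes - 1) for rank, low in enumerate(lows)}


# Hypothetical set of nine ascending buckets.
sample_owners = [
    "0 .. 20,000",
    "20,000 .. 50,000",
    "50,000 .. 100,000",
    "100,000 .. 200,000",
    "200,000 .. 500,000",
    "500,000 .. 1,000,000",
    "1,000,000 .. 2,000,000",
    "2,000,000 .. 5,000,000",
    "5,000,000 .. 10,000,000",
]
class_map = build_class_map(sample_owners)
classes = [class_map[parse_lower_bound(s)] for s in sample_owners]
print(classes)  # [0, 1, 2, 3, 4, 5, 5, 5, 5]
```

Ranking by lower bound only works if every bucket actually occurs in the data; a fixed lookup table would be safer when some buckets may be empty.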

4 · Pre-launch features (49)

Only signals that exist before a game ships are used. Grouped by origin:

Pricing & release

price
initialprice
is_free
release_month

Genre flags (one-hot)

Action
Adventure
RPG
Strategy
Simulation
Indie
Sports
Racing

Store-page richness

screenshot_count
has_trailer
trailer_count
about_length
has_detailed_desc
has_website
has_support_email

Platform & localization

platform_count
supported_languages_count
full_audio_languages_count
required_age

Studio signals

developer_count
publisher_count
has_publisher
is_solo_dev

Steam ecosystem flags

is_multiplayer
has_achievements
has_trading_cards
has_workshop
has_cloud_save
has_controller_support
has_vr_support
has_in_app_purchases
has_family_sharing

Engineered composite scores

store_page_score
platform_reach
marketing_score
publisher_backing
localization_score
steam_integration
is_mature_content

Columns explicitly excluded (post-launch / leakage)

positive
negative
total_reviews
positive_ratio
average_forever
average_2weeks
median_forever
median_2weeks
ccu
owners
log_owners
json_price_raw
appid
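
One simple way to enforce this exclusion list is to filter the column names before training and then check that nothing slipped through. The sketch below uses the excluded names listed above; the target column name `owner_class`, the function `select_feature_columns`, and the sample columns are hypothetical.

```python
EXCLUDED = {
    "positive", "negative", "total_reviews", "positive_ratio",
    "average_forever", "average_2weeks", "median_forever", "median_2weeks",
    "ccu", "owners", "log_owners", "json_price_raw", "appid",
}


def select_feature_columns(columns, target="owner_class"):
    """Keep columns that are neither excluded nor the target itself."""
    return [c for c in columns if c not in EXCLUDED and c != target]


# Hypothetical column list mixing pre-launch and post-launch fields.
columns = ["appid", "price", "positive", "ccu", "screenshot_count", "owners", "owner_class"]
features = select_feature_columns(columns)
leaked = EXCLUDED.intersection(features)
print(features, leaked)
```

Keeping the list in one named set makes it easy to fail the run with an assertion whenever `leaked` is non-empty.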

5 · Reproducibility

The full pipeline lives in two Python scripts; their output is the file the model trains on:

enrich_prelaunch.py: merges the SteamSpy CSV with the Steam API JSON and engineers composite features.

train_prelaunch_model.py: selects pre-launch features, builds the ordinal target, and trains the stacked ensemble.

steam_10k_prelaunch.csv: final 10,000 × 71 training file consumed by the model.
