Methodology - Volleyball ELO Rankings

The short version

Every team starts at 1500. After each match, both teams' ratings move up or down based on the result, adjusted by how they won (point margin, set margin, match outcome), who they played (stronger opponents = bigger reward for a win), what kind of match it was (division tier, tournament strength, repeat meetings within the same event), and how much we trust the signal (more matches = more confidence).

The engine processes every match from the current season (Aug 2025 – Jul 2026) in chronological order. If you re-run the pipeline with the same settings files, the numbers are fully reproducible from the raw match CSV.

1 Expected Score

Before a match, the engine asks: given the two teams' current ratings, what's the expected win probability for each side? This is the classic ELO formula — a team rated 200 points higher has roughly a 76% expected win rate.

E_A = 1 / (1 + 10^((ELO_B − ELO_A) / 400))

After the match, the engine compares the actual result to this expected result. Beating a team you were supposed to beat gains you a little. Beating a team you weren't supposed to beat gains you a lot. The gap between expected and actual is the engine of the rating change.
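
As a minimal sketch of the calculation above (Python; the function names are illustrative and not taken from the engine's source), the expected score and the textbook ELO update look like this. The `K × (actual − expected)` form is the standard ELO update; the engine's real update layers more on top of it, as the later sections describe.

```python
def expected_score(elo_a: float, elo_b: float) -> float:
    """Expected win probability for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def base_delta(k: float, actual: float, expected: float) -> float:
    """Textbook ELO change: K times the gap between actual and expected."""
    return k * (actual - expected)


favorite = expected_score(1700, 1500)  # about 0.76
underdog = expected_score(1500, 1700)  # about 0.24

# With K = 32, an expected win earns far less than an upset win.
expected_win_gain = base_delta(32, 1.0, favorite)  # about 7.7
upset_win_gain = base_delta(32, 1.0, underdog)     # about 24.3
```

The two expected scores always sum to 1, so whatever one side gains in expectation the other side gives up.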

2 Three Layers of Result

A volleyball match isn't just a win or loss — it's a sequence of points inside sets inside a match. The ranking uses all three:

Layer	Weight
Points	40%
Sets	25%
Match	35%

Point layer (40%)

The most granular signal. A 25–10 set transfers more rating than a 25–23 set. Margin is scaled logarithmically, so each extra point of margin adds less than the one before and a 40-point blowout doesn't dominate. This protects ratings from being skewed by lopsided wins against vastly weaker opponents.

Set layer (25%)

Winning 2–0 is a stronger signal than 2–1. The set layer rewards sweeps. A sweep multiplier of roughly 1.15–1.3× applies based on the set differential.

Match layer (35%)

The binary outcome. You won or you didn't. This is the classic ELO component and guarantees that winning always moves you up, regardless of margin.

Why three layers? Relying only on match outcome throws away information — a team that loses 25–23, 26–24 is clearly closer to its opponent than one that loses 25–5, 25–8. Splitting the signal across points, sets, and matches gives a more accurate picture, especially early in a season when sample sizes are small.
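
One way to picture the blend is as a single 0-to-1 "actual score" built from the three layers. The sketch below is an illustration only: the 40/25/35 weights come from this page, but the exact log scaling, the margin cap (`ASSUMED_MARGIN_CAP`), the set-share formula, and the function names are all assumptions, and the sweep multiplier is left out.

```python
import math

# Layer weights as stated on this page.
LAYER_WEIGHTS = {"points": 0.40, "sets": 0.25, "match": 0.35}

# Hypothetical: margins at or beyond this many points count as maximal.
ASSUMED_MARGIN_CAP = 50


def point_score(points_for: int, points_against: int) -> float:
    """Log-damped point margin mapped onto 0..1, where 0.5 is dead even."""
    margin = points_for - points_against
    damped = math.copysign(math.log1p(abs(margin)), margin)
    scaled = damped / math.log1p(ASSUMED_MARGIN_CAP)
    return 0.5 + 0.5 * max(-1.0, min(1.0, scaled))


def set_score(sets_for: int, sets_against: int) -> float:
    """Share of sets won: 1.0 for a 2-0 sweep, about 0.67 for a 2-1 win."""
    return sets_for / (sets_for + sets_against)


def blended_result(points_for, points_against, sets_for, sets_against) -> float:
    """Weighted 0..1 'actual score' across the three layers."""
    match_score = 1.0 if sets_for > sets_against else 0.0
    return (
        LAYER_WEIGHTS["points"] * point_score(points_for, points_against)
        + LAYER_WEIGHTS["sets"] * set_score(sets_for, sets_against)
        + LAYER_WEIGHTS["match"] * match_score
    )


close_loss = blended_result(47, 51, 0, 2)    # lost 23-25, 24-26
blowout_loss = blended_result(13, 50, 0, 2)  # lost 5-25, 8-25
```

Under these assumptions both losses score below 0.5, but the close loss scores noticeably higher than the blowout, which is the extra information a match-only rating would discard.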

3 K-Factor (How Much Ratings Move)

K controls the magnitude of the rating change. The bigger K, the faster ratings move. The engine uses an adaptive K:

Stage	Description	K
New team	Fewer than 10 matches processed	48
Established	10+ matches processed	32

Newer teams calibrate faster — if a true top-tier team starts at 1500, higher K lets them climb to their real rating quickly. Once we have enough data on them, K drops so ratings stabilize and don't swing wildly after a single bad match.
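
The table above reduces to a one-line lookup. This is a sketch with made-up constant and function names; only the threshold (10 matches) and the two K values come from this page.

```python
NEW_TEAM_MATCH_LIMIT = 10  # fewer than this many matches = "new team"
K_NEW_TEAM = 48
K_ESTABLISHED = 32


def k_factor(matches_processed: int) -> int:
    """Return the K used for a team's next match."""
    if matches_processed < NEW_TEAM_MATCH_LIMIT:
        return K_NEW_TEAM
    return K_ESTABLISHED
```

A team's first ten matches (counts 0 through 9) use K = 48; from the eleventh match on, K = 32.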

4 Context Multipliers

Not every match carries equal weight. The engine applies several multipliers on top of the base rating change:

Division tier

Tier	Code	Multiplier
Open	O	1.0×
USA	U	0.5×
Club	C	0.5×

Open-division matches carry full weight — that's the strongest competition and the most reliable signal of team quality. USA and Club matches count for half. A win in Club doesn't tell us as much about a team's ceiling as a win in Open, so it moves the rating less.

Bracket vs pool

Pool-play and bracket matches currently carry the same weight (1.0×). The engine supports a bracket bonus, but it's disabled in the current settings because the sample of scouted matches already skews toward competitive pools.

Head-to-head dampening (same tournament)

If two teams meet multiple times in the same tournament (e.g., pool then gold bracket), the engine progressively dampens the rating change so a single event doesn't dominate:

Meeting	Multiplier
1st	1.0×
2nd	0.75×
3rd+	0.5×

Tournament quality weight

Some tournaments draw stronger fields than others. Individual events can be weighted up or down by an admin-configurable multiplier (typically between 0.5× and 1.0×) to reflect field strength. The biggest events currently carry the full 1.0× weight; regional and mid-tier events are weighted lower.
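
The multipliers in this section can be sketched as a single combined factor. The values below are the ones listed on this page; that the engine combines them by straight multiplication is an assumption, and the function and parameter names are illustrative.

```python
# Division tier multipliers, keyed by division code.
TIER_MULTIPLIER = {"O": 1.0, "U": 0.5, "C": 0.5}

# Same-tournament head-to-head dampening: 1st, 2nd, 3rd-or-later meeting.
H2H_DAMPENING = (1.0, 0.75, 0.5)


def h2h_multiplier(meeting_number: int) -> float:
    """Dampening for the Nth meeting (1-based) within one tournament."""
    if meeting_number < 1:
        raise ValueError("meeting_number is 1-based")
    index = min(meeting_number, len(H2H_DAMPENING)) - 1
    return H2H_DAMPENING[index]


def context_multiplier(
    division_code: str,
    meeting_number: int,
    tournament_weight: float = 1.0,
    bracket_bonus: float = 1.0,  # currently disabled, so 1.0 for every match
) -> float:
    """Assumed combination: the product of all context multipliers."""
    return (
        TIER_MULTIPLIER[division_code]
        * h2h_multiplier(meeting_number)
        * tournament_weight
        * bracket_bonus
    )
```

For example, a second meeting between two Club teams at an event weighted 0.8× would scale the base rating change by 0.5 × 0.75 × 0.8 = 0.3 under this assumption.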

5 Confidence & Display Rating

A team that's played 40 matches has a rating we can trust. A team that's played 3 matches does not — those three results might be wildly atypical. Instead of showing raw ELO to everyone, the dashboard shows a confidence-blended display rating:

display_ELO = raw_ELO × confidence + 1500 × (1 − confidence)

Confidence ramps linearly from 0 to 1 over the first 15 matches. A team with only 5 matches gets confidence = 5/15 ≈ 0.33, so its display rating is pulled about two-thirds of the way back toward the 1500 baseline. This prevents fresh teams with a lucky weekend from leapfrogging established programs on the leaderboard.

Teams with fewer than 20 matches are flagged as provisional in the rankings — their rating is shown but tagged as preliminary.
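
A short sketch of the display-rating formula and the provisional flag, using the numbers from this section (the constant and function names are illustrative):

```python
BASELINE_ELO = 1500.0
CONFIDENCE_RAMP_MATCHES = 15   # confidence reaches 1.0 here
PROVISIONAL_MATCH_LIMIT = 20   # fewer matches than this = provisional


def confidence(matches_played: int) -> float:
    """Linear ramp from 0 to 1 over the first 15 matches."""
    return min(matches_played / CONFIDENCE_RAMP_MATCHES, 1.0)


def display_elo(raw_elo: float, matches_played: int) -> float:
    """Blend the raw rating toward the 1500 baseline by confidence."""
    c = confidence(matches_played)
    return raw_elo * c + BASELINE_ELO * (1.0 - c)


def is_provisional(matches_played: int) -> bool:
    return matches_played < PROVISIONAL_MATCH_LIMIT


# A team with a raw 1800 after 5 matches displays at roughly 1600:
# one third of the way from 1500 to 1800.
example = display_elo(1800, 5)
```

Note that the two thresholds differ: a team with 15 to 19 matches shows its full raw rating but still carries the provisional tag.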

6 Safeguards

Chronological processing

Matches are sorted by date before processing. This matters: an October win against a team that later develops into a powerhouse should use that team's October rating, not their April rating. Out-of-order processing would inflate or deflate the signal.
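
A minimal sketch of the ordering step, assuming each match is a dict with an ISO-format `date` field (the field name and record shape are assumptions, not the engine's actual schema):

```python
from datetime import date


def in_processing_order(matches: list[dict]) -> list[dict]:
    """Return matches sorted by date, oldest first."""
    return sorted(matches, key=lambda m: date.fromisoformat(m["date"]))


raw = [
    {"date": "2026-04-11", "id": "april-rematch"},
    {"date": "2025-10-04", "id": "october-first"},
    {"date": "2025-10-04", "id": "october-second"},
]
ordered_ids = [m["id"] for m in in_processing_order(raw)]
```

Python's `sorted` is stable, so matches that share a date keep their original file order, which keeps the output deterministic for a fixed input file.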

Forfeit filtering

Sets with a 25–0 or 0–25 scoreline are treated as forfeits and excluded from ELO processing entirely. Forfeit scores would corrupt the point-differential layer.
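
The forfeit rule can be sketched as a simple filter over set scores. Representing a set as a `(score_a, score_b)` tuple is an assumption made for this example.

```python
FORFEIT_SCORELINES = {(25, 0), (0, 25)}


def is_forfeit_set(score_a: int, score_b: int) -> bool:
    """True for the 25-0 / 0-25 scorelines treated as forfeits."""
    return (score_a, score_b) in FORFEIT_SCORELINES


def playable_sets(sets: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Drop forfeit sets before any ELO processing."""
    return [s for s in sets if not is_forfeit_set(*s)]


kept = playable_sets([(25, 0), (25, 18), (0, 25), (23, 25)])
```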

ELO floor

Ratings can't drop below 800. This prevents teams from being driven into statistical oblivion after a rough stretch — they stay in the system so future results can still pull them back toward their true level.
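
In code, the floor is a clamp applied after each update. A sketch, with illustrative names:

```python
ELO_FLOOR = 800.0


def apply_update(rating: float, delta: float) -> float:
    """Apply a rating change, never letting the result fall below the floor."""
    return max(ELO_FLOOR, rating + delta)
```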

Cross-division merging

A team that plays in both 16 Open and 16 Club (or similar) is merged into a single entry in the universal rankings. Their ELO history across divisions is stitched together chronologically and their match totals are summed.
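
A rough sketch of the merge, assuming each per-division entry carries a match count and a history of `(iso_date, elo)` points. The record shape and function name are hypothetical; the real pipeline may store this differently.

```python
def merge_division_entries(entries: list[dict]) -> dict:
    """Combine one team's per-division entries into a single record."""
    history = sorted(
        (point for entry in entries for point in entry["history"]),
        key=lambda point: point[0],  # ISO dates sort chronologically as text
    )
    return {
        "matches": sum(entry["matches"] for entry in entries),
        "history": history,
    }


merged = merge_division_entries([
    {"division": "16 Open", "matches": 12,
     "history": [("2025-10-04", 1512.0), ("2026-01-17", 1560.0)]},
    {"division": "16 Club", "matches": 5,
     "history": [("2025-11-08", 1530.0)]},
])
```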

7 Honest Limitations

Data coverage is incomplete. The rankings only include matches from tournaments that published results on AES or SportWrench and that have been ingested. Pre-season scrimmages, regional events that haven't been ingested, and any event that doesn't publish scores are invisible to the model. A team that only plays unscored events will have no rating.

Roster changes aren't modeled. ELO assumes the "team" is the stable unit. If a 17s squad loses three starters mid-season, their rating doesn't know that. Treat early-season ratings with more skepticism than late-season ones.

Regional isolation. ELO only propagates strength through actual matchups. If the Pacific region and the Northeast region rarely play each other, their ratings aren't directly comparable — they're two weakly-connected graphs. Inter-regional events (SoCal Cup, Badger, SacTown) are the bridges that calibrate across regions, which is why those tournaments get full weight.

Ratings reflect results, not potential. This isn't a projection model. It's a summary of what's happened on the court with the data we have. A team with a tough early schedule and some narrow losses may be underrated relative to their ceiling.

8 Tournament Competitive Index

The Tournaments page ranks every event in the database with a 0–100 Competitive Index that answers the question: how strong was the field at this tournament?

Four factors feed the score. Each factor is independently percentile-ranked across every tournament in the system, then combined with fixed weights. Percentile-ranking matters — it means the weights are comparable across factors measured in different units (ELO numbers vs. club counts vs. ratios).

The four factors

Factor	What it measures	Weight
Top-10 Field Strength	Average ELO of the top 10 teams that attended	25%
Field Depth	Average ELO across every team at the event	20%
Geographic Reach	Number of distinct states represented	35%
Field Quality Mix	Log-scaled field size blended with ranked-team ratio	20%
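
The percentile-then-weight scheme can be sketched as follows. The weights come from the table above; the percentile definition used here (the share of tournaments at or below a value) and the dictionary keys are assumptions, since this page doesn't specify them.

```python
from bisect import bisect_right

# Factor weights from the table above.
FACTOR_WEIGHTS = {"top10": 0.25, "depth": 0.20, "reach": 0.35, "quality_mix": 0.20}


def percentile_ranks(values: list[float]) -> list[float]:
    """Share of values at or below each value, in (0, 1]."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) / len(ordered) for v in values]


def competitive_index(tournaments: dict[str, dict[str, float]]) -> dict[str, float]:
    """0-100 score: weighted sum of per-factor percentile ranks."""
    names = list(tournaments)
    scores = {name: 0.0 for name in names}
    for factor, weight in FACTOR_WEIGHTS.items():
        ranks = percentile_ranks([tournaments[n][factor] for n in names])
        for name, rank in zip(names, ranks):
            scores[name] += 100.0 * weight * rank
    return scores


# Made-up events, purely for illustration.
sample = {
    "Event A": {"top10": 1900, "depth": 1650, "reach": 15, "quality_mix": 0.90},
    "Event B": {"top10": 1700, "depth": 1550, "reach": 4, "quality_mix": 0.60},
    "Event C": {"top10": 1600, "depth": 1500, "reach": 1, "quality_mix": 0.40},
}
scores = competitive_index(sample)
```

Because every factor is converted to a rank before weighting, a 300-point ELO gap and an 11-state gap contribute on the same scale.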

Why reach-first?

Geography carries the heaviest weight (35%) because it's what most separates a genuine national event from a loaded regional. A stacked 12-team invitational can have elite top-end ELO, but if every team is from the same area, it's a regional battle, not a national one. The reach weight prevents locally dominant events from gaming the index.

How reach is measured. Each club is mapped to its home state via a curated lookup table (club_states.js). The reach factor counts the number of distinct states represented at the event. A 20-club event where every club is from the same state scores far lower than one drawing from 15 different states — exactly the distinction this factor is designed to make.
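
A sketch of the reach count in Python, with the lookup passed in as a plain dict. The club names below are invented, and skipping clubs that are missing from the lookup is an assumption about how gaps are handled.

```python
def geographic_reach(clubs: list[str], club_states: dict[str, str]) -> int:
    """Count distinct home states among the clubs at an event."""
    return len({club_states[club] for club in clubs if club in club_states})


# Hypothetical lookup entries, for illustration only.
lookup = {"Club One": "CA", "Club Two": "CA", "Club Three": "WI"}
reach = geographic_reach(["Club One", "Club Two", "Club Three", "Unmapped Club"], lookup)
```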

Field Quality Mix formula

This factor blends size and data quality into a single 0–1 score:

quality_mix = 0.5 × log(1 + teamCount) / log(65) + 0.5 × (teams with ≥20 matches) / teamCount

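The log scaling on size means a 64-team event scores exactly 1.0 on the size term, while 16-team and 32-team events score ~0.68 and ~0.84 (before the 0.5 weighting is applied). As written, the formula lets events with more than 64 teams score slightly above 1.0 on size. The ranked-team ratio penalizes events padded with unknown-quality teams we don't yet have data on.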
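
The formula translates directly into code. In this sketch the function and parameter names are illustrative, and `ranked_teams` stands for the number of teams with at least 20 matches.

```python
import math


def size_term(team_count: int) -> float:
    """Log-scaled field size; equals 1.0 for a 64-team event."""
    return math.log(1 + team_count) / math.log(65)


def quality_mix(team_count: int, ranked_teams: int) -> float:
    """Equal blend of log-scaled size and the ranked-team ratio."""
    return 0.5 * size_term(team_count) + 0.5 * ranked_teams / team_count


size_16 = size_term(16)  # about 0.68
size_32 = size_term(32)  # about 0.84
size_64 = size_term(64)  # 1.0

# A 32-team event where 24 teams have 20+ matches:
example_mix = quality_mix(32, 24)  # about 0.79
```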

Tier labels

Tier	Score threshold
National	75+
Super-Regional	55 – 74
Regional	30 – 54
Local	< 30
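
The tier table maps to a simple threshold check. This sketch assumes the bands are contiguous, so a fractional score such as 74.5 falls in Super-Regional; the function name is illustrative.

```python
def tier_label(score: float) -> str:
    """Map a 0-100 Competitive Index score to its tier label."""
    if score >= 75:
        return "National"
    if score >= 55:
        return "Super-Regional"
    if score >= 30:
        return "Regional"
    return "Local"
```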

The index is display-only. It does NOT feed back into ELO calculations: the per-tournament weights used by the ELO pipeline are still set manually in tournament_weights.json. The index is purely a public-facing signal about field strength.

9 Reproducibility

The pipeline is deterministic. Given the same input CSV and the same settings file, running the rebuild script produces the same ratings every time. The inputs are:

combined_all_tournaments.csv — raw match results (teams, sets, division, date, source)
elo_settings.json — layer weights, K-factors, tier multipliers, thresholds
tournament_weights.json — per-event quality multipliers
club_aliases.json — name normalization for clubs with duplicate spellings

Found a bug or have a suggestion? Use the Contribute page to submit an event, flag a duplicate, or share feedback.

Last updated: April 2026