Backgammon Galaxy Engine Analysis

Over the past few months, we have made significant efforts to enhance the performance of the open-source GNU Backgammon engine (referred to here as Gnu).

Gnu's neural network is already extremely strong, nearly on par with that of XG (ExtremeGammon). However, its limited adoption within the backgammon community has mainly been due to two factors:

a) Despite its strength, it is significantly slower than XG. Additionally, the Gnu application is single-threaded, meaning users cannot perform other tasks while it is analyzing a move.
b) The user interface is less polished compared to other engines such as ExtremeGammon and BGBlitz.

As a result, Gnu has seen limited usage over the past two decades. Despite this, it remains a sleeping giant due to its strength and open-source nature.

Our Improvements to the Gnu Engine

We have implemented the following enhancements:

Increased evaluation speed by more than 7x, enabling significantly deeper and stronger rollouts.
Fine-tuned engine parameters to achieve incremental performance gains.
Identified optimal quality and speed configurations for rollout-based analysis.
Added a new and improved 4-ply eval for Star members, slightly stronger than the original 4-ply.
Upgraded our servers and plan to increase investment in rollout-based analysis for Backgammon Galaxy Star Pro members.

Conclusion

We have introduced three new evaluation levels based on work with the Gnu engine:

New and improved 4-ply eval
Gnu Rollout
Gnu Deep Rollout

When it comes to online backgammon, Backgammon Galaxy delivers the strongest and deepest analysis of any platform or app on the market.

Table 1.1: Mean absolute deviation in equity (millipoints) from the XG and Gnu target rollouts

Engine	XG Target Rollout	Gnu Target Rollout
Gnu 0-ply	84.8	80.3
Gnu 1-ply	65.3	61.2
Gnu 2-ply	52.2	46.7
Gnu 3-ply	45.3	39.9
Gnu 4-ply	43.0	36.8
Gnu Target Rollout	17.0	0.0
XG 1-ply	74.7	72.4
XG 2-ply	55.2	53.3
XG 3-ply	45.1	42.0
XG 4-ply	37.5	34.8
XG+	28.9	25.5
XG++	21.6	21.4
XG Target Rollout	0.0	17.0
BG Blitz 1-ply	93.7	91.7
BG Blitz 2-ply	77.9	76.0
BG Blitz 3-ply	73.1	71.2
BG Blitz 4-ply	77.9	67.0
BG Blitz 5-ply	67.1	65.2
Galaxy Gnu 4-ply	42.1	36.3
Galaxy Gnu Rollout	33.6	26.5
Galaxy Gnu Deep Rollout	27.8	20.2

Membership Tier Distribution

Here is how our analysis engines are distributed across membership tiers:

Free Members: Get access to Gnu 1-ply (called 2-ply on Galaxy) for quick, baseline analysis.
Star Members: Upgraded to the new and stronger Galaxy Gnu 4-ply.
Star Pro Members: Get elite-level analysis with the Galaxy Gnu Rollout.

Note: We are currently saving the ultra-powerful Galaxy Gnu Deep Rollout for upcoming custom functionality.

The Backtesting Environment

The backtesting environment consists of 1,721 cube action positions randomly selected from the Backgammon Galaxy database to ensure coverage across the entire equity range. Further details about the dataset are provided later in this report.

XG Target Rollouts

3-ply checker play
XGR cube actions
1,296 trials
Truncated at bear-off
Variance reduction and quasi-random dice

Gnu Target Rollouts

1-ply checker play
2-ply cube action
1,296 trials
Truncated at bear-off
Variance reduction and quasi-random dice
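
The trial count of 1,296 equals 36 × 36, the number of ordered outcomes for two consecutive rolls of two dice. One plausible reading, which this report does not state explicitly, is that quasi-random dice stratify the first two rolls so that every combination is played exactly once. The sketch below only enumerates those combinations; the variable names are illustrative and are not taken from either engine.

```python
from itertools import product

DIE_FACES = range(1, 7)

# All 36 ordered outcomes of rolling two dice once.
single_rolls = list(product(DIE_FACES, repeat=2))

# All ordered pairs of (first roll, second roll): 36 * 36 combinations.
# Assumption: a stratified rollout would start one trial from each pair.
first_two_rolls = list(product(single_rolls, repeat=2))

print(len(single_rolls), len(first_two_rolls))
```

Under that assumption, the opening dice contribute no sampling noise to a 1,296-trial rollout, because every opening sequence is represented once.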

Equity Prediction Data

Measuring equity predictions against a deep rollout benchmark is the most effective way to evaluate a model's performance.

Rather than relying solely on play agent data (which we also explore), we measure how closely a model predicts equity compared to deep target rollouts. This allows even a few hundred data points to provide meaningful insights.

In our case, we use 3,442 equity predictions in the backtesting environment.
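
As a minimal sketch of the metric reported in Table 1.1, the mean absolute deviation can be computed as shown here. It assumes equities are expressed in points and that one millipoint is 1/1000 of a point; the function name and the three sample equities are hypothetical and do not come from the dataset.

```python
def mean_absolute_deviation_millipoints(predicted, target):
    """Mean absolute deviation between two equity lists, in millipoints."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted, target))
    # Assumption: equities are in points, so multiply by 1000 for millipoints.
    return 1000.0 * total / len(predicted)


# Hypothetical equities for three positions (not real data).
engine_equities = [0.250, -0.100, 0.600]
rollout_equities = [0.300, -0.120, 0.580]

mad = mean_absolute_deviation_millipoints(engine_equities, rollout_equities)
print(round(mad, 1))
```

In the backtest, the same calculation would run over all 3,442 predictions, once against each target rollout.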

The primary goal of our analysis engine is to assess errors as accurately as possible, since it is intended to serve as the analysis engine for Backgammon Galaxy.

There is inherent uncertainty in model predictions. Stronger models deviate less, on average, from the true equity ("God's prediction"), but no model is perfect.

For example:

Gnu 0-ply has significantly higher uncertainty than Gnu 3-ply.
However, weaker models may occasionally produce correct predictions by chance.

As models approach very high strength (e.g., XG++ or Gnu Deep Rollout), we encounter a fundamental limitation: the target rollouts themselves contain uncertainty, making it difficult to determine absolute accuracy. In Table 1.1, the XG and Gnu target rollouts differ from each other by 17.0 millipoints on average.

Methodology

Our approach:

Selected 1,721 positions where a Backgammon Galaxy user made a blunder.
Removed dice for checker play blunders.
Standardized the score to 0-0 in a match to 7 to eliminate match equity table considerations.
Sampled around 100 positions from each of the 16 position categories in the Backgammon Galaxy blunder database, excluding Opening Game.
Sampled around 300 positions from Middle Game, the most frequent category.
Evaluated each position twice:
- Cubeful no double
- Cubeful double/take

This resulted in 3,442 total data points.
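
The sampling and evaluation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the category names other than Middle Game, the position identifiers, and both helper functions are hypothetical.

```python
import random


def sample_by_category(positions_by_category, quotas, default_quota, seed=0):
    """Draw up to a quota of positions from each category without replacement."""
    rng = random.Random(seed)
    sampled = []
    for category, positions in positions_by_category.items():
        k = min(quotas.get(category, default_quota), len(positions))
        sampled.extend(rng.sample(positions, k))
    return sampled


def expand_to_data_points(positions):
    """Pair every position with the two cubeful evaluations."""
    evaluations = ("no double", "double/take")
    return [(pos, ev) for pos in positions for ev in evaluations]


# Hypothetical blunder database with made-up category sizes.
database = {
    "Middle Game": [f"mg-{i}" for i in range(500)],
    "Race": [f"race-{i}" for i in range(200)],
    "Bear-off": [f"bo-{i}" for i in range(50)],
}

sampled = sample_by_category(database, {"Middle Game": 300}, default_quota=100)
data_points = expand_to_data_points(sampled)
print(len(sampled), len(data_points))

# The report's own figures: 1,721 positions, each evaluated twice.
TOTAL_DATA_POINTS = 1721 * 2
```

A category with fewer positions than its quota contributes everything it has, which is one possible reason for per-category counts being "around" 100 rather than exactly 100.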

Including checker play blunders ensures coverage across the entire equity spectrum, not just cube action scenarios.

This methodology measures how accurately models predict equity relative to XG and Gnu rollouts, which is precisely what is required for analyzing played matches.

Importantly, we are not evaluating playing strength (that is, move selection) but equity prediction accuracy in absolute terms.

Play Agent Data

Play agent data refers to data generated by having engines play games against each other.

We chose not to rely on this method for the following reasons:

It is extremely time-consuming to generate sufficiently deep rollouts.
Without rollouts, there is no reliable referee for evaluation.
Engines may select the correct move despite poor equity evaluation.
Results depend heavily on move filter settings, which introduce trade-offs between speed and accuracy.
Stronger engines make fewer mistakes, requiring much larger sample sizes for reliable evaluation.

Practical Use of Play Agent Data

Despite its limitations, play agent data can still serve as a sanity check:

Run 100+ money games between the engines.
Analyze the games with Gnu Roller Ultradeep or XG++.

This approach can distinguish weaker models reasonably well. However, for top-tier models, differences become too small to measure reliably, and full rollouts are required. Even then, determining which rollout to trust remains challenging.