Simulators

Simulate pairwise comparison data with controlled non-transitivity.

Three generators correspond to different accounts of where non-transitivity comes from:

simulate_transitive — pure BT data (no cycles).
simulate_heterogeneous — models have different strengths per category. Aggregate win rates can cycle even though within-category preferences are transitive.
simulate_rock_paper_scissors — irreducible cyclic structure that cannot be explained by category decomposition.

winference.simulate.simulate_transitive(n_models=6, n_comparisons=3000, strength_spread=1.5, seed=42)

Generate data from a standard Bradley-Terry (BT) model, which is fully transitive.

Parameters:

n_models (int, default: 6) – Number of models to simulate.
n_comparisons (int, default: 3000) – Number of pairwise comparisons to generate.
strength_spread (float, default: 1.5) – Standard deviation of model strengths.
seed (int, default: 42) – Random seed for reproducibility.

Returns:

Dictionary with keys comparisons (list of (model_a, model_b, a_wins) tuples), models (list of str), true_strengths (dict {model: theta}), and categories (list of str, all “general”).

Return type:

dict[str, Any]
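To make the Bradley-Terry sampling step concrete, here is a minimal standard-library sketch. It is not the winference implementation: drawing strengths from a zero-mean normal distribution, sampling pairs uniformly, the `model_i` names, and storing `a_wins` as a bool are all assumptions made for illustration. Only the parameter names and the returned keys mirror the documentation above.

```python
import math
import random


def sketch_transitive(n_models=6, n_comparisons=3000, strength_spread=1.5, seed=42):
    """Illustrative Bradley-Terry sampler (hypothetical, not winference code)."""
    rng = random.Random(seed)
    models = [f"model_{i}" for i in range(n_models)]  # assumed naming scheme
    # Assumption: latent strengths theta_i ~ Normal(0, strength_spread).
    theta = {m: rng.gauss(0.0, strength_spread) for m in models}
    comparisons = []
    for _ in range(n_comparisons):
        a, b = rng.sample(models, 2)  # two distinct models, chosen uniformly
        # BT win probability: logistic function of the strength difference.
        p_a = 1.0 / (1.0 + math.exp(-(theta[a] - theta[b])))
        comparisons.append((a, b, rng.random() < p_a))
    return {
        "comparisons": comparisons,
        "models": models,
        "true_strengths": theta,
        "categories": ["general"] * n_comparisons,
    }


data = sketch_transitive()
```

Because every win probability depends only on a difference of scalar strengths, the expected win rates in this sketch cannot form a cycle.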

winference.simulate.simulate_heterogeneous(n_models=6, n_categories=3, n_comparisons=5000, strength_spread=1.5, category_names=None, seed=42)

Generate data where model strengths differ by category.

Each model has a different strength in each category. Within each category, preferences are transitive, but the aggregate win rates (marginalized over categories) can be non-transitive.

Parameters:

n_models (int, default: 6) – Number of models to simulate.
n_categories (int, default: 3) – Number of categories.
n_comparisons (int, default: 5000) – Number of pairwise comparisons to generate.
strength_spread (float, default: 1.5) – Standard deviation of model strengths.
category_names (list[str] | None, default: None) – Optional custom category names.
seed (int, default: 42) – Random seed for reproducibility.

Returns:

Dictionary with keys comparisons (list of (model_a, model_b, a_wins) tuples), models (list of str), true_strengths (dict {category: {model: theta}}), categories (list of str, one per comparison), and category_weights (dict {category: proportion}).

Return type:

dict[str, Any]

Raises:

ValueError – If category_names length doesn’t match n_categories.
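The claim that aggregate win rates can cycle while each category stays transitive can be checked by hand. The sketch below uses three hypothetical models (A, B, C) with made-up per-category strengths and equal category weights. These numbers are assumptions chosen for illustration, not values produced by simulate_heterogeneous.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical strengths: each model is strongest in exactly one category.
strengths = {
    "reasoning": {"A": 2.0, "B": 0.0, "C": 1.0},
    "creative_writing": {"A": 1.0, "B": 2.0, "C": 0.0},
    "coding": {"A": 0.0, "B": 1.0, "C": 2.0},
}
# Assumed equal category weights.
weights = {category: 1.0 / 3.0 for category in strengths}


def aggregate_win_prob(a, b):
    """P(a beats b), averaged over categories with BT odds inside each one."""
    return sum(
        w * sigmoid(strengths[c][a] - strengths[c][b]) for c, w in weights.items()
    )


p_ab = aggregate_win_prob("A", "B")
p_bc = aggregate_win_prob("B", "C")
p_ca = aggregate_win_prob("C", "A")
for name, p in [("A vs B", p_ab), ("B vs C", p_bc), ("C vs A", p_ca)]:
    print(f"{name}: {p:.4f}")
```

All three probabilities come out at roughly 0.473, so in aggregate B is favored over A, C over B, and A over C. Each model wins one category by a wide margin and loses the other two narrowly, and because the logistic function saturates, the one large win does not offset the two small losses.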

winference.simulate.simulate_rock_paper_scissors(n_models=6, n_comparisons=5000, cycle_strength=0.8, transitive_strength=1.0, seed=42)

Generate data with irreducible cyclic structure.

Models have a transitive component (some are generally better) plus a cyclic component that cannot be explained away by categories. The cyclic part corresponds to the curl term of a Hodge decomposition of the pairwise comparisons.

The cyclic component is constructed by assigning each model a position on a circle and adding a rotational advantage: models beat those “clockwise-adjacent” to them but lose to those “counter-clockwise”.

Parameters:

n_models (int, default: 6) – Number of models to simulate.
n_comparisons (int, default: 5000) – Number of pairwise comparisons to generate.
cycle_strength (float, default: 0.8) – Strength of the cyclic component.
transitive_strength (float, default: 1.0) – Strength of the transitive component.
seed (int, default: 42) – Random seed for reproducibility.

Returns:

Dictionary with keys comparisons (list of (model_a, model_b, a_wins) tuples), models (list of str), true_transitive (dict {model: s_i}, the gradient component), and true_curl_magnitude (float).

Return type:

dict[str, Any]
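One way to picture the transitive-plus-cyclic construction is as a pairwise log-odds with two terms. The functional form below (a sine of the angular separation on the circle, with increasing index treated as clockwise) is an assumption made for illustration. The documentation does not specify the exact formula the library uses, and `pairwise_logit` is a hypothetical helper.

```python
import math


def pairwise_logit(i, j, s, cycle_strength=0.8, transitive_strength=1.0):
    """Log-odds that model i beats model j (hypothetical form, not winference code)."""
    n = len(s)
    # Gradient (transitive) part: difference of scalar strengths.
    gradient = transitive_strength * (s[i] - s[j])
    # Cyclic part: positive when j sits just after i on the circle, negative
    # when j sits just before i. It is antisymmetric because sin is odd.
    curl = cycle_strength * math.sin(2.0 * math.pi * (j - i) / n)
    return gradient + curl


s = [0.5, 0.0, -0.5]  # made-up transitive strengths for three models

# Sum of log-odds around the triangle 0 -> 1 -> 2 -> 0.
loop_sum = pairwise_logit(0, 1, s) + pairwise_logit(1, 2, s) + pairwise_logit(2, 0, s)
print(f"loop sum: {loop_sum:.4f}")
```

The gradient terms cancel around any closed loop, so a nonzero loop sum is contributed entirely by the cyclic term. No reassignment of scalar strengths can remove it, which is the sense in which the cycle is irreducible.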

winference.simulate.simulate_llm_arena(seed=42)

Simulate a realistic LLM arena scenario.

Six models with names evoking real LLMs, three task categories (reasoning, creative_writing, coding) with plausible strength profiles.

Parameters:

seed (int, default: 42) – Random seed for reproducibility.

Returns:

Same format as simulate_heterogeneous.

Return type:

dict[str, Any]