Skip to content

Thompson Sampling: Intelligent Agent Selection

Thompson Sampling: Intelligent Agent Selection

Section titled “Thompson Sampling: Intelligent Agent Selection”

The Problem: How Do You Pick the Right Agent?

Section titled “The Problem: How Do You Pick the Right Agent?”

You have multiple AI agents with different strengths. A data analyst agent, a coder agent, a research agent. When a task arrives, which one should handle it?

Naive approaches fail:

  • Random selection: Wastes good agents on bad-fit tasks
  • Hard-coded rules: Can’t adapt when performance changes
  • Always use the best: Never discovers if other agents improved

What you need: A system that:

  1. Learns which agent performs best for each task type
  2. Tries uncertain agents occasionally (they might be great)
  3. Exploits proven performers most of the time
  4. Adapts automatically as performance changes

This is the multi-armed bandit problem, and Thompson Sampling is the solution.


After this example, you will understand:

  • The exploration-exploitation dilemma and why it matters
  • Beta distributions as representations of uncertainty
  • Thompson Sampling as “optimistic under uncertainty”
  • Belief updating based on outcomes
  • Task categorization for specialized routing

Imagine slot machines (“one-armed bandits”) with unknown payout rates. You want to maximize winnings with limited pulls.

StrategyApproachProblem
Explore onlyTry each equallyWaste pulls on bad machines
Exploit onlyAlways use current bestMiss better machines
BalancedExplore uncertain, exploit provenThompson Sampling!

Agent selection is identical:

  • Machines = AI agents
  • Pulls = Task assignments
  • Payouts = Successful completions
AlgorithmMechanismWeakness
Epsilon-GreedyRandom 10% of the timeExplores bad options unnecessarily
UCBUpper confidence boundDeterministic, exploitable
Thompson SamplingSample from belief distributions✓ Naturally balanced

Thompson Sampling wins because:

  • Uncertain agents get more chances (higher variance in samples)
  • Proven agents dominate (tight distributions around true rate)
  • Exploration decreases automatically as beliefs converge
  • No tuning parameters required

Beta Distributions: Representing Uncertainty

Section titled “Beta Distributions: Representing Uncertainty”

A Beta distribution represents belief about a probability:

Beta(α, β) where:
α = successes + prior
β = failures + prior
Mean = α / (α + β)
Variance decreases as α + β increases

Visual intuition:

Beta(2, 2) - Flat, uncertain ████████████████
Beta(5, 5) - Moderate certainty ████████
Beta(20, 5) - High confidence ██████ (peaked around 0.8)

When you sample from Beta(5, 5), you might get anything from 0.2 to 0.8. When you sample from Beta(50, 10), you’ll almost always get ~0.83.

This is the magic: uncertain agents occasionally win the lottery.

For each task:
1. Classify task → category (Coding, Analysis, Research, etc.)
2. For each available agent:
a. Get belief Beta(α, β) for (agent, category)
b. Sample θ ~ Beta(α, β)
3. Select agent with highest sampled θ
4. Execute task, observe outcome
5. Update belief: success → α += 1, failure → β += 1

DecisionWhyAlternativeTrade-off
Beta distributionConjugate prior for binary outcomesGaussianBeta is mathematically elegant for success/failure
Per-category beliefsAgents specializeSingle belief per agentMore parameters, but better routing
Prior α=2, β=2Uninformative, slight regularizationα=1, β=1 (uniform)Prevents extreme early beliefs
Keyword classificationSimple, interpretableLLM-basedFast, no external calls

Good fit:

  • Multiple agents with overlapping capabilities
  • Performance varies by task type
  • Feedback is available (success/failure)
  • Task volume is sufficient for learning (100+ per category)

Poor fit:

  • Single agent (nothing to select)
  • No feedback mechanism
  • Highly variable task types (no stable categories)
  • Need deterministic selection (audit requirements)
Anti-PatternProblemFix
Ignoring categoriesAgents conflated across task typesUse per-category beliefs
No outcome recordingBeliefs never improveAlways call RecordOutcomeAsync
Cold start panicExpecting good results immediatelySet realistic prior, allow learning period
Too many categoriesSparse data per categoryConsolidate similar categories

// Configure agent selection
services.AddAgentSelection(options => options
.WithPrior(alpha: 2, beta: 2) // Uninformative prior
.WithCategories(
TaskCategory.Analysis,
TaskCategory.Coding,
TaskCategory.Research,
TaskCategory.Writing));
// Register agents
services.AddAgent("analyst", new AgentConfig
{
Name = "Data Analyst",
Capabilities = ["data-analysis", "visualization", "statistics"]
});
services.AddAgent("coder", new AgentConfig
{
Name = "Software Developer",
Capabilities = ["code-generation", "debugging", "refactoring"]
});
public class TaskRouter
{
private readonly IAgentSelector _selector;
private readonly IAgentRegistry _agents;
public async Task<AgentResult> RouteTaskAsync(
string taskDescription,
CancellationToken ct)
{
// 1. Select agent via Thompson Sampling
var selection = await _selector.SelectAgentAsync(
new AgentSelectionContext
{
AvailableAgentIds = ["analyst", "coder", "researcher"],
TaskDescription = taskDescription
}, ct);
// 2. Execute with selected agent
var agent = _agents.Get(selection.SelectedAgentId);
var result = await agent.ExecuteAsync(taskDescription, ct);
// 3. Record outcome for learning (CRITICAL!)
await _selector.RecordOutcomeAsync(
selection.SelectedAgentId,
selection.TaskCategory,
result.Success
? AgentOutcome.Succeeded(result.ConfidenceScore)
: AgentOutcome.Failed(result.ErrorMessage),
ct);
return result;
}
}

Built-in categories with keyword detection:

CategoryKeywordsExample
Analysisanalyze, examine, evaluate”Analyze sales trends”
Codingcode, implement, debug”Implement OAuth flow”
Researchresearch, investigate”Research competitor pricing”
Writingwrite, draft, compose”Write API documentation”
Datadata, transform, migrate”Transform CSV to JSON”
Integrationintegrate, connect, api”Connect Stripe API”
General(fallback)“Help with this task”
public class DomainFeatureExtractor : ITaskFeatureExtractor
{
public TaskFeatures Extract(string taskDescription)
{
var lower = taskDescription.ToLowerInvariant();
// Domain-specific rules
if (lower.Contains("compliance") || lower.Contains("regulation"))
{
return new TaskFeatures(
Category: TaskCategory.Analysis,
Complexity: TaskComplexity.High,
MatchedKeywords: ["compliance"]);
}
// Fall back to default
return DefaultFeatureExtractor.Extract(taskDescription);
}
}
services.AddSingleton<ITaskFeatureExtractor, DomainFeatureExtractor>();
// Development: In-memory (resets on restart)
services.AddSingleton<IBeliefStore, InMemoryBeliefStore>();
// Production: PostgreSQL (persists across restarts)
services.AddSingleton<IBeliefStore, PostgresBeliefStore>();
public class ThompsonSamplingSelector : IAgentSelector
{
public async Task<AgentSelection> SelectAgentAsync(
AgentSelectionContext context,
CancellationToken ct)
{
// 1. Classify task
var features = _featureExtractor.Extract(context.TaskDescription);
// 2. Get beliefs for all agents
var beliefs = await _beliefStore.GetBeliefsAsync(
context.AvailableAgentIds,
features.Category,
ct);
// 3. Sample from each agent's Beta distribution
var samples = beliefs.Select(b => new
{
b.AgentId,
Sample = SampleBeta(b.Alpha, b.Beta) // THE KEY STEP
});
// 4. Select highest sample (not highest mean!)
var selected = samples.MaxBy(s => s.Sample)!;
return new AgentSelection(
SelectedAgentId: selected.AgentId,
TaskCategory: features.Category,
SampledValue: selected.Sample);
}
private double SampleBeta(int alpha, int beta)
{
return BetaDistribution.Sample(_random, alpha, beta);
}
}
// Success
await selector.RecordOutcomeAsync(
agentId: "analyst",
category: TaskCategory.Analysis,
outcome: AgentOutcome.Succeeded(confidenceScore: 0.92),
ct);
// Failure
await selector.RecordOutcomeAsync(
agentId: "coder",
category: TaskCategory.Coding,
outcome: AgentOutcome.Failed("Syntax errors in generated code"),
ct);
// Partial success (weighted)
await selector.RecordOutcomeAsync(
agentId: "researcher",
category: TaskCategory.Research,
outcome: AgentOutcome.Partial(completionRate: 0.7),
ct);
public class DelegateToAgent : IWorkflowStep<TaskState>
{
public async Task<StepResult<TaskState>> ExecuteAsync(
TaskState state,
StepContext context,
CancellationToken ct)
{
// Select best agent
var selection = await _selector.SelectAgentAsync(
new AgentSelectionContext
{
AvailableAgentIds = state.AvailableAgents,
TaskDescription = state.CurrentTask.Description
}, ct);
// Execute
var agent = _agents.Get(selection.SelectedAgentId);
var result = await agent.ExecuteAsync(state.CurrentTask, ct);
// Learn from outcome
await _selector.RecordOutcomeAsync(
selection.SelectedAgentId,
selection.TaskCategory,
result.ToOutcome(),
ct);
return state
.With(s => s.SelectedAgent, selection.SelectedAgentId)
.With(s => s.TaskResult, result)
.AsResult();
}
}

Uncertainty is opportunity, not risk.

An agent with 2 successes and 0 failures has a 100% success rate—but only 2 data points. An agent with 80 successes and 20 failures has an 80% rate with 100 data points.

Which is actually better? We don’t know! Thompson Sampling acknowledges this by letting the uncertain agent occasionally win the selection lottery. If it performs well, we learn something valuable. If it fails, we’ve only lost one task.


// Get performance report
var report = await beliefStore.GetPerformanceReportAsync(ct);
// Output:
// Agent: analyst
// Analysis: 90.0% (45/50 trials)
// Research: 83.3% (20/24 trials)
// Coding: 30.0% (3/10 trials)
//
// Agent: coder
// Coding: 88.0% (44/50 trials)
// Analysis: 45.0% (9/20 trials)

Watch for:

  • Low trial counts: Beliefs haven’t converged yet
  • Cross-category performance: Agents may surprise you
  • Belief drift: Performance changes over time

  1. Thompson Sampling balances exploration and exploitation automatically
  2. Beta distributions represent uncertainty—wider = less certain
  3. Sampling, not averaging—uncertain agents get lucky sometimes
  4. Per-category beliefs enable specialization
  5. Outcome recording is essential—no feedback = no learning
  6. Performance improves over time—patience during cold start