CAPTCHA Solve Rate SLI/SLO: How to Define and Monitor

"Our CAPTCHA solving works most of the time" isn't a reliability target. SLIs (Service Level Indicators) and SLOs (Service Level Objectives) give you measurable thresholds, error budgets, and actionable alerts for your CAPTCHA pipeline.

Definitions

| Term | Meaning | CAPTCHA Example |
|---|---|---|
| SLI | A metric that measures service quality | Solve success rate: 94.2% |
| SLO | A target value for an SLI | Solve success rate ≥ 92% over 30 days |
| Error Budget | Allowed failures before SLO breach | 8% failure budget = 800 failures per 10,000 tasks |
| Burn Rate | How fast you're consuming error budget | 2x burn rate = budget exhausted in 15 days |
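
The arithmetic behind the error budget and burn rate rows can be sketched in a few lines. This is a minimal illustration rather than part of any library; the function names are hypothetical, and the 30-day window is taken from the SLO example above.

```python
def error_budget(total_tasks: int, slo_target: float) -> int:
    """Number of failures allowed before the SLO is breached."""
    return round(total_tasks * (1 - slo_target))


def burn_rate(failures: int, total_tasks: int, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows."""
    allowed_rate = 1 - slo_target
    if total_tasks == 0 or allowed_rate <= 0:
        return 0.0
    return (failures / total_tasks) / allowed_rate


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the budget is gone if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


budget = error_budget(10_000, 0.92)      # 800 allowed failures
rate = burn_rate(1_600, 10_000, 0.92)    # failing twice as fast as allowed
print(budget, round(rate, 2), days_to_exhaustion(2.0))
```

A burn rate of 1.0 means failures arrive exactly as fast as the SLO permits, so the budget lasts the whole window.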

SLI 1: Solve Success Rate

Success Rate = Successful Solves / Total Solve Attempts

| CAPTCHA Type | Typical Rate | SLO Target |
|---|---|---|
| reCAPTCHA v2 | 95–99% | ≥ 92% |
| reCAPTCHA v3 | 90–97% | ≥ 88% |
| Cloudflare Turnstile | 95–99% | ≥ 92% |
| hCaptcha | 90–97% | ≥ 88% |
| Image/OCR | 85–95% | ≥ 82% |
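
One way to use these targets is a plain lookup keyed by CAPTCHA type. The sketch below assumes you already count successes and attempts per type; the dictionary keys (`recaptcha_v2`, `image_ocr`, and so on) are hypothetical labels, not identifiers from any API.

```python
# SLO targets from the table above, keyed by hypothetical type labels.
SLO_TARGETS = {
    "recaptcha_v2": 0.92,
    "recaptcha_v3": 0.88,
    "turnstile": 0.92,
    "hcaptcha": 0.88,
    "image_ocr": 0.82,
}


def check_type(captcha_type: str, successes: int, attempts: int) -> dict:
    """Compare one type's observed success rate with its SLO target."""
    target = SLO_TARGETS[captcha_type]
    # With no attempts there is nothing to violate, so report 100%.
    rate = successes / attempts if attempts else 1.0
    return {
        "type": captcha_type,
        "rate": round(rate, 4),
        "target": target,
        "met": rate >= target,
    }


print(check_type("image_ocr", 850, 1000))
print(check_type("recaptcha_v2", 900, 1000))
```

The same 85–90% success rate meets the Image/OCR target but misses the reCAPTCHA v2 target, which is why the targets differ by type.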

SLI 2: Solve Latency

Latency = Time from task submission to solution received

| Percentile | Target | Alert Threshold |
|---|---|---|
| p50 | < 25s | none |
| p95 | < 90s | > 120s |
| p99 | < 180s | > 300s |
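
The sketch below shows one way to evaluate a batch of latencies against these targets and alert thresholds. It uses the nearest-rank percentile method, which is one of several common definitions; the status labels (`ok`, `slo_missed`, `alert`) and function names are assumptions made for illustration.

```python
import math

# Targets and alert thresholds from the table above (seconds).
LATENCY_LIMITS = {
    50: {"target": 25.0, "alert": None},
    95: {"target": 90.0, "alert": 120.0},
    99: {"target": 180.0, "alert": 300.0},
}


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def evaluate_latency(latencies):
    """Return a value and status per percentile; empty input yields {}."""
    data = sorted(latencies)
    if not data:
        return {}
    results = {}
    for p, limits in LATENCY_LIMITS.items():
        value = nearest_rank(data, p)
        if limits["alert"] is not None and value > limits["alert"]:
            status = "alert"
        elif value >= limits["target"]:
            status = "slo_missed"
        else:
            status = "ok"
        results[f"p{p}"] = {"value": value, "status": status}
    return results


print(evaluate_latency([12.0, 18.5, 24.0, 31.0, 95.0]))
```

Separating "target missed" from "alert" keeps pages for the cases where latency is far outside the target, not merely above it.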

SLI 3: Pipeline Availability

Availability = Time pipeline is accepting and solving tasks / Total time

Target: ≥ 99.5% (allows 3.6 hours downtime per month)
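
The downtime allowance follows directly from the target. A minimal sketch, assuming a 30-day month and hypothetical function names:

```python
def allowed_downtime_hours(target: float, window_days: int = 30) -> float:
    """Hours of downtime an availability target permits in the window."""
    return (1 - target) * window_days * 24


def availability(up_seconds: float, total_seconds: float) -> float:
    """Fraction of the window in which the pipeline accepted and solved tasks."""
    return up_seconds / total_seconds if total_seconds else 1.0


print(round(allowed_downtime_hours(0.995), 2))  # 0.5% of 720 hours

# Example: 2 hours of downtime in a 30-day window.
window = 30 * 24 * 3600
print(availability(window - 2 * 3600, window) >= 0.995)
```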

Python — SLI/SLO Tracker

import os
import time
from collections import deque
from dataclasses import dataclass, field

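# Not used by the tracker itself; only needed when you call the solver API.
# .get() avoids a KeyError when the variable is unset.
API_KEY = os.environ.get("CAPTCHAAI_API_KEY")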


@dataclass
class SLITracker:
    """Track CAPTCHA solving SLIs over a sliding window."""

    window_seconds: int = 86400 * 30  # 30 days default
    events: deque = field(default_factory=deque)

    def record_success(self, latency_seconds):
        self.events.append({
            "time": time.time(),
            "success": True,
            "latency": latency_seconds
        })
        self._prune()

    def record_failure(self, error_code):
        self.events.append({
            "time": time.time(),
            "success": False,
            "error": error_code
        })
        self._prune()

    def _prune(self):
        cutoff = time.time() - self.window_seconds
        while self.events and self.events[0]["time"] < cutoff:
            self.events.popleft()

    @property
    def success_rate(self):
        if not self.events:
            return 1.0
        successes = sum(1 for e in self.events if e["success"])
        return successes / len(self.events)

    @property
    def latency_percentiles(self):
        # Failures carry no latency; "is not None" keeps legitimate 0.0 values.
        latencies = sorted(
            e["latency"] for e in self.events if e.get("latency") is not None
        )
        if not latencies:
            return {"p50": 0, "p95": 0, "p99": 0}

        def percentile(data, p):
            idx = int(len(data) * p / 100)
            return data[min(idx, len(data) - 1)]

        return {
            "p50": round(percentile(latencies, 50), 2),
            "p95": round(percentile(latencies, 95), 2),
            "p99": round(percentile(latencies, 99), 2),
        }

    @property
    def error_breakdown(self):
        errors = {}
        for e in self.events:
            if not e["success"]:
                code = e.get("error", "unknown")
                errors[code] = errors.get(code, 0) + 1
        return errors


class SLOChecker:
    """Check SLIs against SLO targets."""

    def __init__(self, tracker):
        self.tracker = tracker
        self.slos = {
            "success_rate": 0.92,    # ≥ 92%
            "latency_p95": 90.0,     # < 90 seconds
            "latency_p99": 180.0,    # < 180 seconds
        }

    @property
    def error_budget_total(self):
        """Total allowed failures in the window."""
        total = len(self.tracker.events)
        return int(total * (1 - self.slos["success_rate"]))

    @property
    def error_budget_remaining(self):
        """How many more failures before SLO breach."""
        failures = sum(1 for e in self.tracker.events if not e["success"])
        return max(0, self.error_budget_total - failures)

    @property
    def error_budget_pct(self):
        """Percentage of error budget remaining."""
        total = self.error_budget_total
        if total == 0:
            return 100.0
        return round(self.error_budget_remaining / total * 100, 1)

    @property
    def burn_rate(self):
        """How fast error budget is being consumed.
        1.0 = on track, 2.0 = will exhaust in half the window.
        """
        total = len(self.tracker.events)
        if total == 0:
            return 0.0
        failures = sum(1 for e in self.tracker.events if not e["success"])
        expected_failures = total * (1 - self.slos["success_rate"])
        if expected_failures == 0:
            return 0.0
        return round(failures / expected_failures, 2)

    def check_all(self):
        """Check all SLOs and return status."""
        rate = self.tracker.success_rate
        latencies = self.tracker.latency_percentiles

        return {
            "success_rate": {
                "current": round(rate, 4),
                "target": self.slos["success_rate"],
                "met": rate >= self.slos["success_rate"]
            },
            "latency_p95": {
                "current": latencies["p95"],
                "target": self.slos["latency_p95"],
                "met": latencies["p95"] <= self.slos["latency_p95"]
            },
            "latency_p99": {
                "current": latencies["p99"],
                "target": self.slos["latency_p99"],
                "met": latencies["p99"] <= self.slos["latency_p99"]
            },
            "error_budget": {
                "remaining_pct": self.error_budget_pct,
                "remaining_count": self.error_budget_remaining,
                "burn_rate": self.burn_rate,
            },
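            # Overall passes only if every SLO above is met, including p99.
            "overall": rate >= self.slos["success_rate"]
                       and latencies["p95"] <= self.slos["latency_p95"]
                       and latencies["p99"] <= self.slos["latency_p99"]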
        }


# Usage
tracker = SLITracker(window_seconds=86400 * 30)
slo = SLOChecker(tracker)

# After each solve:
# tracker.record_success(latency_seconds=24.5)
# tracker.record_failure("ERROR_CAPTCHA_UNSOLVABLE")

# Check SLOs:
# print(slo.check_all())

JavaScript — SLO Dashboard

class SLODashboard {
  constructor(windowMs = 30 * 24 * 60 * 60 * 1000) {
    this.windowMs = windowMs;
    this.events = [];
    this.slos = {
      successRate: 0.92,
      latencyP95: 90,
      latencyP99: 180,
    };
  }

  recordSuccess(latencySeconds) {
    this.events.push({ time: Date.now(), success: true, latency: latencySeconds });
    this._prune();
  }

  recordFailure(errorCode) {
    this.events.push({ time: Date.now(), success: false, error: errorCode });
    this._prune();
  }

  _prune() {
    const cutoff = Date.now() - this.windowMs;
    this.events = this.events.filter((e) => e.time > cutoff);
  }

  get successRate() {
    if (this.events.length === 0) return 1;
    const successes = this.events.filter((e) => e.success).length;
    return successes / this.events.length;
  }

  get errorBudget() {
    const total = this.events.length;
    const allowedFailures = Math.floor(total * (1 - this.slos.successRate));
    const actualFailures = this.events.filter((e) => !e.success).length;
    const remaining = Math.max(0, allowedFailures - actualFailures);

    return {
      total: allowedFailures,
      consumed: actualFailures,
      remaining,
      remainingPct: allowedFailures > 0
        ? ((remaining / allowedFailures) * 100).toFixed(1)
        : "100.0",
      burnRate: allowedFailures > 0
        ? (actualFailures / allowedFailures).toFixed(2)
        : "0.00",
    };
  }

  get report() {
    const latencies = this.events
      .filter((e) => e.success && e.latency)
      .map((e) => e.latency)
      .sort((a, b) => a - b);

    const p95 = latencies.length > 0
      ? latencies[Math.floor(latencies.length * 0.95)]
      : 0;

    return {
      sliSuccessRate: (this.successRate * 100).toFixed(2) + "%",
      sloSuccessRate: (this.slos.successRate * 100).toFixed(0) + "%",
      sloMet: this.successRate >= this.slos.successRate,
      latencyP95: p95.toFixed(1) + "s",
      errorBudget: this.errorBudget,
      totalEvents: this.events.length,
    };
  }
}

const dashboard = new SLODashboard();
// dashboard.recordSuccess(24.5);
// console.log(dashboard.report);

Burn Rate Alert Thresholds

| Burn Rate | Meaning (30-day window) | Alert |
|---|---|---|
| 1.0 | On track: budget lasts the full window | None |
| 2.0 | Budget exhausted in half the window (15 days) | Warning |
| 6.0 | Budget exhausted in 5 days | Page on-call |
| 14.0 | Budget exhausted in ~2 days | Critical: immediate action |
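
These thresholds can be turned into a small classifier. The sketch below treats each row as a lower bound (for example, any burn rate from 6.0 up to 14.0 pages on-call), which is an assumption about how the table is meant to be read; the level names are hypothetical.

```python
# (minimum burn rate, alert level), checked from most to least severe.
BURN_RATE_ALERTS = [
    (14.0, "critical"),
    (6.0, "page"),
    (2.0, "warning"),
]


def alert_level(burn_rate: float) -> str:
    """Map a burn rate to the most severe alert level it reaches."""
    for threshold, level in BURN_RATE_ALERTS:
        if burn_rate >= threshold:
            return level
    return "none"


for rate in (1.0, 2.0, 7.5, 14.0):
    print(rate, alert_level(rate))
```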

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| SLO always breached | Target too aggressive | Start with current performance minus 3 percentage points as the SLO |
| Error budget always full | SLO too loose | Tighten the SLO to drive improvements |
| Burn rate spikes | Burst of failures | Check whether it is transient (retry storm) or systemic |
| Budget consumed by one error type | Single root cause | Fix that error type; see the error breakdown |
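
To spot the last case in the table, you can check whether a single error code accounts for most failures. This is a minimal sketch: the 50% share threshold is an arbitrary assumption, and `TIMEOUT` is a made-up code used only for the example.

```python
from collections import Counter


def dominant_error(error_codes, share_threshold=0.5):
    """Return (code, share) if one error code dominates, else None."""
    counts = Counter(error_codes)
    if not counts:
        return None
    code, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    return (code, round(share, 2)) if share >= share_threshold else None


failures = ["ERROR_CAPTCHA_UNSOLVABLE"] * 8 + ["TIMEOUT"] * 2
print(dominant_error(failures))
```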

FAQ

What SLO should I start with?

Measure your current success rate over 7 days. Subtract 3 percentage points — that's your starting SLO. Tighten it as you improve reliability.
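
As a rough sketch of that rule, assuming you have success and attempt counts for the 7-day measurement period (the function name and the rounding to one decimal place are illustrative choices):

```python
def starting_slo(successes: int, attempts: int, margin_points: float = 3.0) -> float:
    """Starting SLO as a fraction: measured success rate minus a safety margin."""
    if attempts == 0:
        raise ValueError("need at least one attempt to measure a success rate")
    current_pct = successes / attempts * 100
    return max(0.0, round(current_pct - margin_points, 1)) / 100


# 94.2% measured over the period gives a starting target of about 91.2%.
print(starting_slo(9_420, 10_000))
```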

Who owns the CAPTCHA SLO?

The team that operates the CAPTCHA solving pipeline. If scraping and CAPTCHA solving are separate teams, the CAPTCHA team owns solve rate SLOs while the scraping team owns end-to-end SLOs.

Should I set different SLOs per CAPTCHA type?

Yes. Image/OCR CAPTCHAs have fundamentally different success rates than reCAPTCHA v2. Setting per-type SLOs prevents one type from masking another's issues.
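
The masking effect is easy to reproduce with made-up numbers. In the sketch below, the event counts are invented for illustration and the type labels are hypothetical; a high-volume type hides a low-volume type that is well below its own target.

```python
from collections import defaultdict


def rates_by_type(events):
    """events: iterable of (captcha_type, success) pairs -> success rate per type."""
    counts = defaultdict(lambda: [0, 0])  # type -> [successes, attempts]
    for captcha_type, success in events:
        counts[captcha_type][1] += 1
        if success:
            counts[captcha_type][0] += 1
    return {t: s / n for t, (s, n) in counts.items()}


# Invented traffic: 1,000 reCAPTCHA v2 tasks at 98%, 100 Image/OCR tasks at 70%.
events = (
    [("recaptcha_v2", True)] * 980 + [("recaptcha_v2", False)] * 20
    + [("image_ocr", True)] * 70 + [("image_ocr", False)] * 30
)

overall = sum(ok for _, ok in events) / len(events)
per_type = rates_by_type(events)

print(f"overall: {overall:.1%}")
for captcha_type, rate in per_type.items():
    print(f"{captcha_type}: {rate:.1%}")
```

Here the aggregate rate clears a 92% target while Image/OCR sits far below its 82% target, which a single pipeline-wide SLO would not reveal.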

Next Steps

Set measurable reliability targets — get your CaptchaAI API key and define SLOs for your pipeline.
