Webhook Endpoint Monitoring for CAPTCHA Solve Callbacks

Your CaptchaAI callback endpoint is a critical dependency: if it goes down, solved CAPTCHAs never reach your application. Monitoring built into the handler itself catches problems before they cascade.

What to Monitor

| Metric | Why It Matters | Healthy Range |
| --- | --- | --- |
| Endpoint uptime | Callbacks fail during downtime | > 99.5% |
| Response latency | Slow responses may time out | < 500 ms |
| Error rate (4xx/5xx) | Indicates handler bugs | < 1% |
| Callback delivery rate | Ratio of callbacks received vs tasks submitted | > 95% |
| Time between callbacks | Detects sudden stops | < 5× average interval |
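The last row of the table (a gap longer than 5× the average interval) is not covered by a fixed timeout. A minimal sketch of that check follows, assuming you call `record()` from your callback handler. The class name `CallbackGapDetector` and its parameters are illustrative and not part of any CaptchaAI API.

```python
import time
from collections import deque


class CallbackGapDetector:
    """Flags a stall when the time since the last callback exceeds
    a multiple of the recent average interval between callbacks."""

    def __init__(self, multiplier=5.0, window=100):
        self.multiplier = multiplier
        self.intervals = deque(maxlen=window)  # recent inter-arrival times (seconds)
        self.last_seen = None

    def record(self, now=None):
        """Call once per received callback."""
        now = time.time() if now is None else now
        if self.last_seen is not None:
            self.intervals.append(now - self.last_seen)
        self.last_seen = now

    def is_stalled(self, now=None):
        """True if the current gap exceeds multiplier x average interval."""
        if not self.intervals:
            return False  # not enough history to judge
        now = time.time() if now is None else now
        average = sum(self.intervals) / len(self.intervals)
        return (now - self.last_seen) > self.multiplier * average
```

The optional `now` argument exists only to make the logic easy to test. With bursty traffic the average interval can be very small, so you may want a minimum gap before alerting.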

Self-Monitoring Middleware

Add monitoring directly to your callback handler.

Python (Flask)

import time
import threading
from collections import deque
from flask import Flask, request, jsonify

app = Flask(__name__)

# Rolling window metrics (last 1000 callbacks)
metrics = {
    "total_received": 0,
    "total_errors": 0,
    "latencies": deque(maxlen=1000),
    "last_callback_at": 0,
    "error_counts": {}
}
metrics_lock = threading.Lock()


@app.route("/callback")
def captcha_callback():
    start = time.time()

    task_id = request.args.get("id")
    solution = request.args.get("code")

    try:
        # Process the callback (store_result is your application's own persistence function)
        store_result(task_id, solution)
        status = "ok"
        http_code = 200
    except Exception as e:
        status = "error"
        http_code = 200  # Still ACK to CaptchaAI
        error_type = type(e).__name__
        with metrics_lock:
            metrics["total_errors"] += 1
            metrics["error_counts"][error_type] = \
                metrics["error_counts"].get(error_type, 0) + 1

    # Record metrics
    latency_ms = (time.time() - start) * 1000
    with metrics_lock:
        metrics["total_received"] += 1
        metrics["latencies"].append(latency_ms)
        metrics["last_callback_at"] = time.time()

    return "OK", http_code


@app.route("/health/callbacks")
def callback_health():
    """Health endpoint for monitoring."""
    with metrics_lock:
        # Snapshot everything under the lock so the report is internally consistent
        latencies = sorted(metrics["latencies"])
        last_at = metrics["last_callback_at"]
        total_received = metrics["total_received"]
        total_errors = metrics["total_errors"]
        error_counts = dict(metrics["error_counts"])

    now = time.time()
    avg_latency = sum(latencies) / len(latencies) if latencies else 0
    p95_latency = latencies[int(len(latencies) * 0.95)] if latencies else 0
    # -1 means no callback has arrived since startup; this still reports "healthy"
    seconds_since_last = now - last_at if last_at > 0 else -1

    health = {
        "status": "healthy" if seconds_since_last < 300 else "stale",
        "total_received": total_received,
        "total_errors": total_errors,
        "error_rate": total_errors / max(total_received, 1),
        "avg_latency_ms": round(avg_latency, 2),
        "p95_latency_ms": round(p95_latency, 2),
        "seconds_since_last_callback": round(seconds_since_last, 1),
        "error_breakdown": error_counts
    }

    status_code = 200 if health["status"] == "healthy" else 503
    return jsonify(health), status_code

JavaScript (Express)

const express = require("express");
const app = express();

const metrics = {
  totalReceived: 0,
  totalErrors: 0,
  latencies: [],
  lastCallbackAt: 0,
  errorCounts: {},
};

const MAX_LATENCIES = 1000;

app.get("/callback", (req, res) => {
  const start = Date.now();
  const taskId = req.query.id;
  const solution = req.query.code;

  try {
    storeResult(taskId, solution);
  } catch (err) {
    metrics.totalErrors++;
    const errType = err.constructor.name;
    metrics.errorCounts[errType] = (metrics.errorCounts[errType] || 0) + 1;
  }

  const latencyMs = Date.now() - start;
  metrics.totalReceived++;
  metrics.latencies.push(latencyMs);
  if (metrics.latencies.length > MAX_LATENCIES) metrics.latencies.shift();
  metrics.lastCallbackAt = Date.now();

  res.sendStatus(200);
});

app.get("/health/callbacks", (req, res) => {
  const latencies = [...metrics.latencies].sort((a, b) => a - b);
  const avgLatency =
    latencies.length > 0
      ? latencies.reduce((a, b) => a + b, 0) / latencies.length
      : 0;
  const p95Latency =
    latencies.length > 0
      ? latencies[Math.floor(latencies.length * 0.95)]
      : 0;
  const secondsSinceLast =
    metrics.lastCallbackAt > 0
      ? (Date.now() - metrics.lastCallbackAt) / 1000
      : -1;

  const health = {
    status: secondsSinceLast < 300 ? "healthy" : "stale",
    totalReceived: metrics.totalReceived,
    totalErrors: metrics.totalErrors,
    errorRate: metrics.totalErrors / Math.max(metrics.totalReceived, 1),
    avgLatencyMs: Math.round(avgLatency * 100) / 100,
    p95LatencyMs: Math.round(p95Latency * 100) / 100,
    secondsSinceLastCallback: Math.round(secondsSinceLast * 10) / 10,
    errorBreakdown: metrics.errorCounts,
  };

  res.status(health.status === "healthy" ? 200 : 503).json(health);
});

app.listen(3000);

Delivery Rate Tracking

Compare tasks submitted with callbacks received to measure delivery success:

Python

import time

submitted_tasks = {}  # task_id -> submitted_at
delivered_tasks = set()
delivery_timeout = 300  # 5 minutes


def on_submit(task_id):
    """Call after submitting to CaptchaAI with pingback."""
    submitted_tasks[task_id] = time.time()


def on_callback(task_id):
    """Call when callback is received."""
    delivered_tasks.add(task_id)
    submitted_tasks.pop(task_id, None)


def get_delivery_stats():
    """Calculate delivery metrics."""
    now = time.time()

    # Expired tasks = submitted > 5 min ago, never received callback
    expired = [
        tid for tid, ts in submitted_tasks.items()
        if now - ts > delivery_timeout
    ]

    total = len(delivered_tasks) + len(expired)
    rate = len(delivered_tasks) / max(total, 1)

    return {
        "delivered": len(delivered_tasks),
        "missed": len(expired),
        "pending": len(submitted_tasks) - len(expired),
        "delivery_rate": round(rate, 4),
        "missed_task_ids": expired[:10]  # Sample for debugging
    }
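The dictionaries above live in process memory, so a restart loses every pending task. One way to keep the same bookkeeping across restarts is a small SQLite table. This is a sketch under assumptions: the `TaskStore` class, the `tasks.db` path, and the table layout are all hypothetical, and a Redis or other database backend would follow the same shape.

```python
import sqlite3
import time


class TaskStore:
    """Persists submitted and delivered task IDs in SQLite.

    Intended for single-threaded use; a multi-threaded server would
    typically open one connection per thread.
    """

    def __init__(self, path="tasks.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "task_id TEXT PRIMARY KEY, "
            "submitted_at REAL NOT NULL, "
            "delivered_at REAL)"
        )
        self.db.commit()

    def on_submit(self, task_id, now=None):
        now = time.time() if now is None else now
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR IGNORE INTO tasks (task_id, submitted_at) VALUES (?, ?)",
                (task_id, now),
            )

    def on_callback(self, task_id, now=None):
        now = time.time() if now is None else now
        with self.db:
            self.db.execute(
                "UPDATE tasks SET delivered_at = ? WHERE task_id = ?",
                (now, task_id),
            )

    def stats(self, timeout=300, now=None):
        now = time.time() if now is None else now
        count = lambda sql, args=(): self.db.execute(sql, args).fetchone()[0]
        delivered = count("SELECT COUNT(*) FROM tasks WHERE delivered_at IS NOT NULL")
        missed = count(
            "SELECT COUNT(*) FROM tasks "
            "WHERE delivered_at IS NULL AND ? - submitted_at > ?",
            (now, timeout),
        )
        pending = count(
            "SELECT COUNT(*) FROM tasks "
            "WHERE delivered_at IS NULL AND ? - submitted_at <= ?",
            (now, timeout),
        )
        return {
            "delivered": delivered,
            "missed": missed,
            "pending": pending,
            "delivery_rate": round(delivered / max(delivered + missed, 1), 4),
        }
```

As in the in-memory version, the delivery rate counts only delivered and expired tasks, so tasks still inside the timeout window do not drag the rate down.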

Alert Conditions

Set up alerts for these conditions:

| Alert | Trigger | Severity |
| --- | --- | --- |
| Stale endpoint | No callback received in 5+ minutes | Warning |
| High error rate | > 5% error rate over 100 requests | Critical |
| Slow responses | p95 latency > 1000 ms | Warning |
| Low delivery rate | < 90% delivery rate | Critical |
| Endpoint down | Health check returns 503 or timeout | Critical |
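The "over 100 requests" condition implies a sliding window, whereas a lifetime counter dilutes a recent spike as total traffic grows. A minimal sketch of a windowed error rate follows; the class name `WindowedErrorRate` and its defaults are assumptions chosen to match the table.

```python
from collections import deque


class WindowedErrorRate:
    """Tracks the error rate over the most recent `window` requests."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = request ended in an error
        self.threshold = threshold

    def record(self, is_error):
        self.outcomes.append(bool(is_error))

    def rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_alert(self):
        # Wait for a full window so one early failure does not trip the alert
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rate() > self.threshold
```

Call `record()` once per callback in the handler, passing whether it raised, and check `should_alert()` from the health endpoint or the alert loop.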

Simple Alert Script

import requests
import time


def check_callback_health(health_url, alert_callback):
    """Periodic health checker."""
    while True:
        try:
            resp = requests.get(health_url, timeout=5)
            try:
                health = resp.json()
            except ValueError:
                health = {}  # Non-JSON body, e.g. a proxy error page

            if resp.status_code != 200:
                alert_callback("CRITICAL", f"Callback endpoint unhealthy: {health.get('status', resp.status_code)}")

            if health.get("error_rate", 0) > 0.05:
                alert_callback("CRITICAL", f"High error rate: {health['error_rate']:.1%}")

            if health.get("p95_latency_ms", 0) > 1000:
                alert_callback("WARNING", f"Slow callbacks: p95={health['p95_latency_ms']}ms")

            if health.get("seconds_since_last_callback", -1) > 300:
                alert_callback("WARNING", f"No callbacks for {health['seconds_since_last_callback']:.0f}s")

        except requests.RequestException as e:
            alert_callback("CRITICAL", f"Health check failed: {e}")

        time.sleep(60)  # Check every minute
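The checker above takes an `alert_callback` but does not show one, and because it runs every minute, a persistent problem would fire the same alert repeatedly. The sketch below is one possible callback that logs alerts and suppresses repeats for a cooldown period. The names `make_throttled_alert` and `log_alert`, and the 10-minute cooldown, are illustrative assumptions.

```python
import logging
import time

logger = logging.getLogger("callback_alerts")


def log_alert(severity, message):
    """Simplest possible sink: write the alert to the log."""
    level = logging.CRITICAL if severity == "CRITICAL" else logging.WARNING
    logger.log(level, message)


def make_throttled_alert(send, cooldown_seconds=600, clock=time.time):
    """Wrap `send(severity, message)` so repeats of the same alert type
    are suppressed for `cooldown_seconds`."""
    last_sent = {}

    def alert_callback(severity, message):
        # Group by the text before the first colon so changing values
        # (e.g. "p95=1200ms" vs "p95=1500ms") count as the same alert
        key = (severity, message.split(":")[0])
        now = clock()
        if key in last_sent and now - last_sent[key] < cooldown_seconds:
            return False  # suppressed
        last_sent[key] = now
        send(severity, message)
        return True

    return alert_callback
```

It could be wired up as `check_callback_health(health_url, make_throttled_alert(log_alert))`. Replacing `log_alert` with a function that posts to your paging or chat tool keeps the throttling unchanged.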

External Monitoring Integration

For production systems, pair self-monitoring with external uptime checks:

| Tool | Integration |
| --- | --- |
| UptimeRobot | Monitor /health/callbacks endpoint |
| Pingdom | HTTP check with response body validation |
| AWS CloudWatch | Synthetic canary on health endpoint |
| Self-hosted | Cron job calling health check script |
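For the self-hosted row, a cron job needs a check that runs once and reports through its exit code rather than looping forever. The following is a sketch using only the standard library. The function names and the exit-code convention (0 ok, 1 warning, 2 critical) are assumptions, and the thresholds mirror the alert table above.

```python
import json
import urllib.error
import urllib.request


def evaluate_health(status_code, health):
    """Map a health response to (exit_code, problems)."""
    problems = []
    exit_code = 0
    if status_code != 200:
        problems.append(f"endpoint returned HTTP {status_code}")
        exit_code = 2
    if health.get("error_rate", 0) > 0.05:
        problems.append(f"error rate {health['error_rate']:.1%}")
        exit_code = 2
    if health.get("p95_latency_ms", 0) > 1000:
        problems.append(f"p95 latency {health['p95_latency_ms']} ms")
        exit_code = max(exit_code, 1)
    return exit_code, problems


def run_once(health_url, timeout=5):
    """Fetch the health endpoint once and evaluate it."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return evaluate_health(resp.status, json.load(resp))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; a 503 body may still carry the health JSON
        try:
            body = json.load(err)
        except ValueError:
            body = {}
        return evaluate_health(err.code, body)
    except (OSError, ValueError) as err:
        # Connection refused, DNS failure, timeout, or an unparseable body
        return 2, [f"health check failed: {err}"]
```

A wrapper script would call `run_once()` with your health URL, print the problems, and pass the first element of the result to `sys.exit()` so cron or a supervisor can react to a non-zero status.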

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Health endpoint shows "stale" with no callbacks | No tasks submitted recently, or callbacks not reaching server | Check if tasks are being submitted with pingback; verify firewall rules |
| High latency on callback handler | Slow database writes in handler | Process asynchronously: accept the callback, then queue it for background processing |
| Delivery rate dropping | Server restarts clearing in-memory task tracking | Use Redis or a database to persist submitted task IDs |
| Error rate spikes | Downstream service (database) failing | Check error breakdown; fix underlying service |
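The fix for high handler latency (accept the callback, then process it in the background) can be sketched with a standard-library queue and a worker thread. Here `store_result` stands in for your slow write, and `enqueue_callback`, `work_queue`, and the queue size are hypothetical names and values. A production setup might use a task queue such as Celery or a message broker instead.

```python
import logging
import queue
import threading

work_queue = queue.Queue(maxsize=10000)
processed = []  # stand-in for a database table


def store_result(task_id, solution):
    """Placeholder for the slow write (database, downstream API)."""
    processed.append((task_id, solution))


def enqueue_callback(task_id, solution):
    """Called from the HTTP handler; returns immediately."""
    try:
        work_queue.put_nowait((task_id, solution))
        return True
    except queue.Full:
        return False  # worth counting as an error in your metrics


def worker():
    """Drains the queue so the HTTP handler never waits on the slow write."""
    while True:
        task_id, solution = work_queue.get()
        try:
            store_result(task_id, solution)
        except Exception:
            # Log and continue so one bad item does not kill the worker
            logging.exception("Failed to store result for task %s", task_id)
        finally:
            work_queue.task_done()


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()
```

In the handler, replace the direct `store_result(task_id, solution)` call with `enqueue_callback(task_id, solution)`. Note that items still in the queue are lost if the process exits, so a durable queue is the safer choice when every solve matters.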

FAQ

Should I use a separate service for monitoring?

For small setups, self-monitoring middleware is sufficient. For production systems with SLAs, add external monitoring (UptimeRobot, Pingdom) that checks from outside your infrastructure.

How long should I keep metrics in memory?

A rolling window of the last 1,000 events is usually enough for real-time dashboards. For historical analysis, export metrics to Prometheus, Datadog, or a time-series database.
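As one illustration of exporting for historical analysis, the health snapshot can be rendered in the Prometheus text exposition format and served from a scrape endpoint. This is a hand-rolled sketch: the metric names are invented for the example, and in practice the official `prometheus_client` library is the more usual route.

```python
def render_prometheus(health):
    """Render a /health/callbacks snapshot in Prometheus text exposition format.

    Metric names here are illustrative, not an established convention.
    """
    lines = [
        "# HELP captcha_callbacks_received_total Callbacks received since startup.",
        "# TYPE captcha_callbacks_received_total counter",
        f"captcha_callbacks_received_total {health['total_received']}",
        "# HELP captcha_callback_p95_latency_ms p95 handler latency over the rolling window.",
        "# TYPE captcha_callback_p95_latency_ms gauge",
        f"captcha_callback_p95_latency_ms {health['p95_latency_ms']}",
        "# HELP captcha_callback_errors_total Handler errors by exception type.",
        "# TYPE captcha_callback_errors_total counter",
    ]
    # One labelled sample per exception type from the error breakdown
    for error_type, count in sorted(health.get("error_breakdown", {}).items()):
        lines.append(f'captcha_callback_errors_total{{type="{error_type}"}} {count}')
    return "\n".join(lines) + "\n"
```

A Flask route could return this string with the content type `text/plain; version=0.0.4` so that a Prometheus server can scrape it.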

What if my callback endpoint is behind a load balancer?

Each instance tracks its own metrics. Aggregate across instances in your monitoring platform, or expose a shared metrics store (Redis) that all instances write to.
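If you poll each instance's health endpoint yourself, the per-instance payloads can be merged into a single view. The sketch below assumes the JSON shape produced by the Flask health endpoint earlier in this guide; `aggregate_health` and the output field names are hypothetical.

```python
def aggregate_health(snapshots):
    """Combine per-instance /health/callbacks payloads into one fleet-wide view."""
    total_received = sum(s["total_received"] for s in snapshots)
    total_errors = sum(s["total_errors"] for s in snapshots)

    # -1 means "no callback yet" on that instance, so skip those values
    ages = [
        s["seconds_since_last_callback"]
        for s in snapshots
        if s["seconds_since_last_callback"] >= 0
    ]

    error_breakdown = {}
    for s in snapshots:
        for error_type, count in s.get("error_breakdown", {}).items():
            error_breakdown[error_type] = error_breakdown.get(error_type, 0) + count

    return {
        "instances": len(snapshots),
        "total_received": total_received,
        "total_errors": total_errors,
        "error_rate": total_errors / max(total_received, 1),
        # Percentiles cannot be averaged; the worst instance is a conservative bound
        "worst_p95_latency_ms": max((s["p95_latency_ms"] for s in snapshots), default=0),
        # The fleet is receiving callbacks if any instance saw one recently
        "seconds_since_last_callback": min(ages) if ages else -1,
        "error_breakdown": error_breakdown,
    }
```

Summing counters before dividing gives a traffic-weighted error rate, which avoids a quiet instance with one failure skewing the average.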

Next Steps

Monitor your callback endpoints: get your CaptchaAI API key and add health checks from day one.
