How CAPTCHA Detection Works in Web Scraping

Understanding how sites detect scrapers helps you build better automation. This guide covers the technical mechanisms behind CAPTCHA triggers — and how to handle them with CaptchaAI when they fire.

Detection Layers

Modern anti-bot systems use multiple detection layers. A CAPTCHA appears when enough signals combine to indicate automated traffic.

Layer 1: IP-Based Detection

The simplest and most common trigger:

| Signal | Threshold | Result |
| --- | --- | --- |
| Requests per minute | >20-30 from one IP | Rate limit or CAPTCHA |
| Requests per hour | >200-500 from one IP | Temporary block |
| IP reputation | Known datacenter range | Immediate CAPTCHA |
| Geographic mismatch | VPN/proxy detected | Elevated scrutiny |

Mitigation: Proxy rotation distributes requests across IPs. See Proxy Rotation for CAPTCHA Scraping.
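
The mitigation above can be sketched as a small round-robin pool that also spaces out requests per proxy. This is a minimal sketch, not a production rotator: the class name `RotatingProxyPool`, the example proxy URLs, and the budget of 15 requests per minute per IP (picked only to stay under the 20-30 range in the table) are assumptions for illustration.

```python
import itertools
import time


class RotatingProxyPool:
    """Round-robin proxy pool that enforces a per-proxy request budget."""

    def __init__(self, proxies, max_per_minute=15, clock=time.monotonic):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        # Minimum gap between two requests sent through the same proxy.
        self._min_interval = 60.0 / max_per_minute
        self._cycle = itertools.cycle(proxies)
        self._next_free = {proxy: None for proxy in proxies}
        self._clock = clock

    def acquire(self):
        """Return (proxy, wait_seconds); the caller sleeps before sending."""
        proxy = next(self._cycle)
        now = self._clock()
        last = self._next_free[proxy]
        wait = 0.0 if last is None else max(0.0, self._min_interval - (now - last))
        # Record when this proxy will actually be used.
        self._next_free[proxy] = now + wait
        return proxy, wait


# Hypothetical usage with the requests library:
#   pool = RotatingProxyPool(["http://proxy-a:8080", "http://proxy-b:8080"])
#   proxy, wait = pool.acquire()
#   time.sleep(wait)
#   requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```

Keeping the wait calculation separate from the sleep makes the pool easy to test with a fake clock and easy to reuse from async code.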

Layer 2: HTTP Header Analysis

Servers inspect request headers for bot indicators:

# Bot-like request (triggers CAPTCHA)
GET /page HTTP/1.1
User-Agent: python-requests/2.28.0

# Human-like request (less likely to trigger)
GET /page HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Referer: https://www.google.com/

Key headers that trigger detection:

  • User-Agent — Default library UAs are instantly flagged
  • Accept-Language — Missing = bot
  • Referer — No referrer on deep pages = suspicious
  • Cookie — No session cookies = new/bot visitor
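
A minimal sketch of assembling the browser-like header set from the example above. The function name `build_headers` and the specific User-Agent string are assumptions; in practice the User-Agent should match a current browser release. The sketch leaves `br` out of `Accept-Encoding` because the `requests` library only decodes Brotli when an optional Brotli package is installed, and advertising an encoding the client cannot decode produces unreadable responses.

```python
# Example User-Agent for illustration only; keep it in sync with a real browser.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(referer=None, user_agent=DEFAULT_UA):
    """Return request headers that resemble a desktop browser."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
    }
    # Only send a Referer when there is a plausible previous page.
    if referer:
        headers["Referer"] = referer
    return headers


# Hypothetical usage with a requests.Session, which also keeps cookies:
#   session = requests.Session()
#   session.headers.update(build_headers(referer="https://www.google.com/"))
#   session.get("https://example.com/page", timeout=30)
```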

Layer 3: JavaScript Fingerprinting

Anti-bot services run JavaScript to profile the browser:

// What fingerprinting scripts check:
navigator.webdriver        // true in automated browsers
navigator.plugins.length   // often 0 in headless browsers
window.chrome              // undefined in non-Chrome
navigator.languages        // empty or unusual in headless browsers
// WebGL renderer string:  "SwiftShader" suggests software rendering (headless)
// Canvas fingerprint:     identical across headless instances

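reCAPTCHA v3 uses these signals to compute a trust score (0.0 = bot, 1.0 = human). v3 itself never shows a challenge; when the score is low, the site decides whether to block the request or fall back to a visible CAPTCHA.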
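
To make the checks above concrete, here is a rough sketch of how collected fingerprint values might be turned into red flags, written in Python for consistency with the rest of this guide. The dictionary keys (`webdriver`, `plugins_length`, `languages`, `webgl_renderer`) are hypothetical names for values read in the browser; real anti-bot vendors use many more signals and do not publish their rules.

```python
def fingerprint_red_flags(fp):
    """Return a list of human-readable red flags for a fingerprint dict."""
    flags = []
    if fp.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if fp.get("plugins_length", 0) == 0:
        flags.append("navigator.plugins is empty")
    if not fp.get("languages"):
        flags.append("navigator.languages is empty")
    if "swiftshader" in fp.get("webgl_renderer", "").lower():
        flags.append("WebGL uses a software renderer")
    return flags


# Illustrative inputs, not captured from real browsers.
headless = {
    "webdriver": True,
    "plugins_length": 0,
    "languages": [],
    "webgl_renderer": "Google SwiftShader",
}
desktop = {
    "webdriver": False,
    "plugins_length": 5,
    "languages": ["en-US", "en"],
    "webgl_renderer": "ANGLE (NVIDIA GeForce)",
}
```

Running the same checks against your own automated browser before a crawl shows which of these signals it leaks.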

Layer 4: Behavioral Analysis

Advanced systems track user behavior over time:

| Human Behavior | Bot Behavior |
| --- | --- |
| Random navigation patterns | Sequential page access |
| Variable time on page | Consistent quick loads |
| Mouse movement and scrolling | No mouse/scroll events |
| Click variations | Exact coordinate clicks |
| Search then navigate | Direct URL access |
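
Two rows of this comparison, page order and time on page, can be addressed at the request level. The sketch below shuffles the crawl order and assigns each page a variable delay; the function name `plan_visits` and the 2-9 second range are assumptions, not recommended values. Mouse, scroll, and click signals are out of scope here because they need a real browser.

```python
import random


def plan_visits(urls, min_delay=2.0, max_delay=9.0, rng=None):
    """Return [(url, delay_seconds), ...] in a non-sequential order."""
    rng = rng or random.Random()
    order = list(urls)
    # Avoid walking /page/1, /page/2, ... in strict sequence.
    rng.shuffle(order)
    # Draw a different pause for every page instead of a fixed sleep.
    return [(url, rng.uniform(min_delay, max_delay)) for url in order]


# Hypothetical usage:
#   for url, delay in plan_visits(page_urls):
#       fetch(url)          # your own request function
#       time.sleep(delay)
```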

Sites plant tracking cookies to identify return visitors:

# First visit — site sets tracking cookies
# Second visit — site checks:
# - Are the cookie values consistent?
# - Was the cookie modified?
# - Is this a fresh session?

Missing or inconsistent cookies elevate the suspicion score.
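
One way to look like a returning visitor is to keep the site's cookies between runs instead of starting every crawl with an empty jar. The sketch below uses only the Python standard library; the helper names `make_opener` and `save_cookies` and the cookie file path are assumptions for illustration.

```python
import http.cookiejar
import os
import urllib.request


def make_opener(cookie_path):
    """Return (opener, jar); the jar is pre-loaded if cookie_path exists."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    if os.path.exists(cookie_path):
        # ignore_discard=True also restores session cookies (no expiry date).
        jar.load(ignore_discard=True)
    # The opener sends stored cookies and records new Set-Cookie headers.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def save_cookies(jar):
    """Write the jar back to disk so the next run reuses the same session."""
    jar.save(ignore_discard=True)


# Hypothetical usage:
#   opener, jar = make_opener("cookies.txt")
#   opener.open("https://example.com/", timeout=30)
#   save_cookies(jar)
```

Cookies are tied to the rest of the visitor profile, so reusing a jar makes the most sense when the IP and User-Agent also stay the same for that session.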

How reCAPTCHA v3 Scoring Works

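reCAPTCHA v3 runs invisibly and assigns a score. Each site sets its own cutoffs; the ranges below are typical:

| Score Range | Classification | Action |
| --- | --- | --- |
| 0.7 - 1.0 | Likely human | Allow through |
| 0.3 - 0.7 | Uncertain | May show CAPTCHA |
| 0.0 - 0.3 | Likely bot | Block or CAPTCHA |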

Inputs to the score:

  • Browser JavaScript environment
  • Mouse/keyboard interaction patterns
  • Historical Google cookie data
  • IP reputation
  • Page interaction time

When reCAPTCHA v3 assigns a low score, the site can choose to serve a reCAPTCHA v2 challenge. CaptchaAI solves both versions.
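
Seen from the site's side, the score table reduces to a threshold check on the score it receives for each request. The sketch below mirrors the ranges listed above; the function name `classify_score`, the action labels, and the choice to treat the boundary values 0.3 and 0.7 as belonging to the higher band are assumptions, since each site picks its own cutoffs.

```python
def classify_score(score, allow_at=0.7, challenge_at=0.3):
    """Map a reCAPTCHA v3 score to a site-side action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= allow_at:
        return "allow"       # likely human
    if score >= challenge_at:
        return "challenge"   # uncertain: may fall back to reCAPTCHA v2
    return "block"           # likely bot: block or challenge
```

This is why the same scraper can pass on one site and be challenged on another: the score may be identical while the cutoffs differ.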

How Cloudflare Detection Works

Cloudflare's Bot Management combines the following challenges and signals:

  1. JavaScript challenge — Runs browser tests in an interstitial page
  2. Managed challenge — Shows Turnstile widget for borderline traffic
  3. Block — Rejects known malicious IPs
  4. IP reputation — Cloudflare sees ~20% of internet traffic, building massive IP profiles

CaptchaAI solves both Turnstile widgets (method=turnstile) and full challenge pages (method=cloudflare_challenge).

Handling Detection with CaptchaAI

When your scraper encounters a CAPTCHA, CaptchaAI solves it regardless of what triggered it:

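import time

import requests

API_KEY = "YOUR_API_KEY"
API_BASE = "https://ocr.captchaai.com"

def handle_captcha(captcha_type, site_key, page_url, **kwargs):
    """Submit a CAPTCHA to CaptchaAI and poll until a token comes back."""
    params = {
        "key": API_KEY,
        "pageurl": page_url
    }

    if captcha_type == "recaptcha_v2":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
    elif captcha_type == "recaptcha_v3":
        params["method"] = "userrecaptcha"
        params["googlekey"] = site_key
        params["version"] = "v3"
        params["action"] = kwargs.get("action", "verify")
    elif captcha_type == "turnstile":
        params["method"] = "turnstile"
        params["sitekey"] = site_key
    elif captcha_type == "cloudflare":
        # Full challenge pages are submitted together with a proxy.
        if "proxy" not in kwargs:
            raise ValueError("cloudflare challenges require a proxy argument")
        params["method"] = "cloudflare_challenge"
        params["proxy"] = kwargs["proxy"]
        params["proxytype"] = kwargs.get("proxytype", "HTTP")
    else:
        raise ValueError(f"Unsupported captcha type: {captcha_type}")

    resp = requests.get(f"{API_BASE}/in.php", params=params, timeout=30)
    resp.raise_for_status()
    # Expected reply: "OK|<task id>". Anything else is an error code.
    if not resp.text.startswith("OK|"):
        raise RuntimeError(f"Submit failed: {resp.text}")
    task_id = resp.text.split("|", 1)[1]

    # Poll every 5 seconds, for up to 5 minutes in total.
    for _ in range(60):
        time.sleep(5)
        result = requests.get(f"{API_BASE}/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise RuntimeError(f"Solve failed: {result.text}")

    # Without this, an unsolved task would silently return None.
    raise TimeoutError("CAPTCHA was not solved within 5 minutes")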
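
Before calling `handle_captcha`, the scraper has to work out which kind of challenge it hit. The sketch below is one rough way to do that from a response's HTML and headers. It assumes the standard widget markup (a `g-recaptcha` or `cf-turnstile` element carrying a `data-sitekey` attribute, with `class` written before `data-sitekey`) and Cloudflare's `cf-mitigated: challenge` response header; the function name `detect_captcha` is hypothetical, and pages that load widgets from JavaScript or use reCAPTCHA v3 need separate handling.

```python
import re

# Matches class="... <cls> ..." followed later in the same tag by data-sitekey="...".
_SITEKEY_RE = r'class="[^"]*\b{cls}\b[^"]*"[^>]*\bdata-sitekey="([^"]+)"'

_WIDGETS = (
    ("turnstile", "cf-turnstile"),
    ("recaptcha_v2", "g-recaptcha"),
)


def detect_captcha(html, headers=None):
    """Return (captcha_type, site_key), or (None, None) if nothing is found."""
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    # Full Cloudflare challenge pages have no site key to extract.
    if headers.get("cf-mitigated", "").lower() == "challenge":
        return ("cloudflare", None)
    for captcha_type, css_class in _WIDGETS:
        match = re.search(_SITEKEY_RE.format(cls=css_class), html)
        if match:
            return (captcha_type, match.group(1))
    return (None, None)
```

The returned pair maps directly onto the first two arguments of `handle_captcha`; for the `cloudflare` type, a `proxy` keyword argument is needed as well.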

FAQ

Can I avoid all CAPTCHAs with stealth techniques?

At low volumes, yes — stealth headers, proxies, and realistic behavior patterns avoid most triggers. At scale, CAPTCHAs become inevitable. CaptchaAI handles them when they appear.

Why do I get CAPTCHAs with residential proxies?

Residential IPs aren't immune. High request rates, missing cookies, or bot-like headers can still trigger CAPTCHAs. Proxies reduce frequency but don't eliminate detection.

How does reCAPTCHA know I'm a bot if I'm in a real browser?

reCAPTCHA checks dozens of signals including cookie history, mouse movement patterns, and Google account activity. Automated browsers lack the organic interaction patterns of real users.

