Scraping CAPTCHA-Protected Websites

Many high-value websites use CAPTCHAs as part of their anti-bot defenses. This guide covers strategies for scraping these sites reliably with CaptchaAI: how to identify the CAPTCHA type, solve it automatically, and build scrapers that recover when a challenge appears.

Common CAPTCHA Implementations

| CAPTCHA | Where Used | CaptchaAI Method |
|---|---|---|
| reCAPTCHA v2 | Login forms, search pages | method=userrecaptcha |
| reCAPTCHA v3 | Background scoring on any page | method=userrecaptcha&version=v3 |
| Cloudflare Turnstile | Sites behind Cloudflare | method=turnstile |
| Cloudflare Challenge | Full-page Cloudflare block | method=cloudflare_challenge |
| Image/OCR CAPTCHA | Legacy sites, Amazon | method=base64 |
| hCaptcha | Privacy-focused sites | method=hcaptcha |
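
To make the table concrete, here is a minimal sketch that maps widget markers in a page's HTML to the method values listed above. The function name `detect_captcha_method` and the class-name mapping are assumptions for illustration. Pages that inject the widget with JavaScript, reCAPTCHA v3, full-page Cloudflare challenges, and image CAPTCHAs would need separate checks.

```python
from html.parser import HTMLParser

# Assumed mapping from widget CSS class to the method value in the table above.
WIDGET_METHODS = {
    "g-recaptcha": "userrecaptcha",
    "cf-turnstile": "turnstile",
    "h-captcha": "hcaptcha",
}

class _WidgetFinder(HTMLParser):
    """Records the first element whose class list contains a known widget class."""

    def __init__(self):
        super().__init__()
        self.found = None  # (method, sitekey) once a widget is seen

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attr_map = dict(attrs)
        # Valueless attributes come through as None, hence the "or ''"
        for css_class in (attr_map.get("class") or "").split():
            if css_class in WIDGET_METHODS:
                self.found = (WIDGET_METHODS[css_class], attr_map.get("data-sitekey"))
                return

def detect_captcha_method(html):
    """Return (method, sitekey) for the first known widget, or None if none is found."""
    finder = _WidgetFinder()
    finder.feed(html)
    return finder.found
```

Matching on the element's class list rather than on a raw substring avoids false positives on pages that merely mention the word "captcha" in their text.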

Strategy 1: Detect and Solve on Demand

The most reliable approach is to scrape normally and solve CAPTCHAs only when they appear:

import requests
import time
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

class ProtectedScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def scrape(self, url):
        resp = self.session.get(url)

        # Check for CAPTCHA
        if self._has_captcha(resp.text):
            resp = self._handle_captcha(resp.text, url)

        return resp.text

    def _has_captcha(self, html):
        # Lowercase once; the generic "captcha" marker also flags unsupported types
        html = html.lower()
        indicators = ["g-recaptcha", "cf-turnstile", "h-captcha", "captcha"]
        return any(ind in html for ind in indicators)

    def _handle_captcha(self, html, url):
        soup = BeautifulSoup(html, "html.parser")

        # reCAPTCHA v2
        rc = soup.find("div", class_="g-recaptcha")
        if rc:
            token = self._solve_recaptcha(rc["data-sitekey"], url)
            return self.session.post(url, data={"g-recaptcha-response": token})

        # Cloudflare Turnstile
        ts = soup.find("div", class_="cf-turnstile")
        if ts:
            token = self._solve_turnstile(ts["data-sitekey"], url)
            return self.session.post(url, data={"cf-turnstile-response": token})

        raise Exception("Unknown CAPTCHA type")

    def _submit(self, params):
        # Submit a task; a successful response has the form "OK|<task_id>"
        resp = requests.get("https://ocr.captchaai.com/in.php",
                            params={"key": API_KEY, **params}, timeout=30)
        status, _, task_id = resp.text.partition("|")
        if status != "OK":
            raise RuntimeError(f"CaptchaAI submit failed: {resp.text}")
        return self._poll(task_id)

    def _solve_recaptcha(self, site_key, page_url):
        return self._submit({"method": "userrecaptcha",
                             "googlekey": site_key, "pageurl": page_url})

    def _solve_turnstile(self, site_key, page_url):
        return self._submit({"method": "turnstile",
                             "sitekey": site_key, "pageurl": page_url})

    def _poll(self, task_id):
        # Poll every 5 seconds, for up to 5 minutes
        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": API_KEY, "action": "get", "id": task_id
            }, timeout=30)
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                # Split once so a "|" inside the token is preserved
                return result.text.split("|", 1)[1]
            raise Exception(result.text)
        raise TimeoutError(f"Task {task_id} not solved after 5 minutes")

# Usage
scraper = ProtectedScraper()
html = scraper.scrape("https://example.com/data")

Strategy 2: Pre-Solve for Known CAPTCHA Pages

If you know which pages always have CAPTCHAs, solve preemptively:

def scrape_known_captcha_page(url, site_key):
    # Solve before even loading the page, reusing the solver from Strategy 1
    scraper = ProtectedScraper()
    token = scraper._solve_recaptcha(site_key, url)

    # Submit directly with the token; "query" is an example form field
    resp = scraper.session.post(url, data={
        "g-recaptcha-response": token,
        "query": "search term"
    })
    return resp.text
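
A pre-solved token is only useful while it is still valid. The sketch below wraps a token with the time it was solved so the caller can check freshness before submitting. The class name `PreSolvedToken` is hypothetical, and the 120-second default is taken from the troubleshooting guidance in this article; the real validity window depends on the CAPTCHA provider and the target site.

```python
import time
from dataclasses import dataclass, field

@dataclass
class PreSolvedToken:
    value: str
    max_age: float = 120.0  # assumed validity window, in seconds
    # Monotonic clock so system time changes do not affect the age check
    solved_at: float = field(default_factory=time.monotonic)

    def is_fresh(self, now=None):
        """Return True while the token is younger than max_age."""
        now = time.monotonic() if now is None else now
        return (now - self.solved_at) < self.max_age
```

A caller could hold `PreSolvedToken(token)` and solve again when `is_fresh()` returns False, rather than submitting a token the site is likely to reject.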

Strategy 3: Cloudflare-Protected Sites

Sites behind Cloudflare often require a cf_clearance cookie:

def get_cloudflare_clearance(url, proxy):
    resp = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY,
        "method": "cloudflare_challenge",
        "pageurl": url,
        "proxy": proxy,
        "proxytype": "HTTP"
    }, timeout=30)
    status, _, task_id = resp.text.partition("|")
    if status != "OK":
        raise RuntimeError(f"Submit failed: {resp.text}")

    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        }, timeout=30)
        if result.text == "CAPCHA_NOT_READY":
            continue
        if "cf_clearance" in result.text:
            # The caller must parse cf_clearance and user_agent from this response
            return result.text
        # Anything else is treated as an error, so fail now instead of polling until timeout
        raise RuntimeError(result.text)
    raise TimeoutError()
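
Once `cf_clearance` and the user agent have been parsed from the solver response, they need to be attached to every later request, along with the same proxy. The following is a minimal sketch of that step. `apply_clearance` is a hypothetical helper; it assumes `session` is a `requests.Session` and that `proxy_url` is already in the URL form `requests` expects (for example `http://user:pass@host:port`), which may differ from the format passed to the solver.

```python
def apply_clearance(session, cf_clearance, user_agent, proxy_url, domain):
    """Attach the clearance cookie, user agent, and proxy to an existing session."""
    # The cookie is scoped to the target domain so it is sent only there
    session.cookies.set("cf_clearance", cf_clearance, domain=domain, path="/")
    # Use the same user agent that solved the challenge
    session.headers["User-Agent"] = user_agent
    # Route both schemes through the proxy that solved the challenge
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session
```

A typical call would be `apply_clearance(requests.Session(), cookie_value, ua, proxy_url, "example.com")`, after which the session is used for all requests to that site.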

Multi-Page Scraping Pattern

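import random  # needed for the randomized delay between pages

def scrape_multiple_pages(base_url, site_key, pages):
    # site_key is unused here: ProtectedScraper reads the key from each page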
    scraper = ProtectedScraper()
    results = []

    for page in pages:
        url = f"{base_url}?page={page}"
        try:
            html = scraper.scrape(url)
            soup = BeautifulSoup(html, "html.parser")
            items = soup.find_all("div", class_="item")
            results.extend([item.text.strip() for item in items])
            print(f"Page {page}: {len(items)} items")
        except Exception as e:
            print(f"Page {page} failed: {e}")

        time.sleep(random.uniform(2, 5))

    return results
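
The loop above logs a failed page and moves on. Where a page matters, one option is to retry it with growing, jittered delays. This is a sketch only: `retry_with_backoff` and its parameters are hypothetical, and `sleep` is injectable so the helper can be exercised without waiting.

```python
import random
import time

def retry_with_backoff(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose, like the loop above
            last_error = exc
            if attempt < attempts - 1:
                # Exponential delay plus up to 1 second of random jitter
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise last_error
```

Inside the page loop this could be used as `html = retry_with_backoff(lambda: scraper.scrape(url))`. Keep in mind that every retry may trigger another paid CAPTCHA solve.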

Troubleshooting

| Issue | Fix |
|---|---|
| CAPTCHA appears on every page | Use proxies; reduce request rate |
| Token rejected after solving | Token may have expired; use within 120s |
| Cloudflare blocks despite clearance | Use same proxy and user-agent for all requests |
| Site returns different page after solve | Check for additional redirects or cookies |
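
For the last issue, one way to see what happened after a solve is to list the redirect chain of the response. The helper below is a small sketch; `describe_redirects` is a hypothetical name, and it relies only on the `history`, `status_code`, and `url` attributes that `requests` responses provide.

```python
def describe_redirects(resp):
    """Return [(status_code, url), ...] for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in resp.history]
    hops.append((resp.status_code, resp.url))
    return hops
```

Printing the result for the response returned after submitting a token shows whether the site redirected to an intermediate page. Inspecting `session.cookies` afterwards shows whether it set additional cookies along the way.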

FAQ

Which sites are hardest to scrape?

Sites using Cloudflare Enterprise, PerimeterX, or Akamai Bot Manager are the most challenging. CaptchaAI handles their CAPTCHA components; combine with stealth browsers and proxies for best results.

Can I scrape sites that require login?

Yes. Log in first (solving any login CAPTCHA), maintain the session cookies, then scrape authenticated pages. CaptchaAI handles CAPTCHAs at any stage.

How do I handle JavaScript-rendered pages?

Use Selenium, Puppeteer, or Playwright to render JavaScript, then extract CAPTCHA parameters and solve via CaptchaAI. See Selenium CAPTCHA Handling.

Full Working Code

Complete runnable examples for this article in Python, Node.js, PHP, Go, Java, C#, Ruby, Rust, Kotlin & Bash.

