Scraping Reliability Guide: Managing Blocks and CAPTCHA Challenges

Getting blocked during scraping wastes time and resources. This guide covers techniques to minimize detection and CAPTCHA triggers, and shows how CaptchaAI handles the CAPTCHAs that still appear.

How Sites Detect Scrapers

Layer | Detection Method | Difficulty
IP | Rate limiting, reputation, geolocation | Easy to circumvent
Headers | User-Agent, Accept-Language, Referer | Easy
Cookies | Session tracking, fingerprinting cookies | Medium
JavaScript | Browser fingerprinting, behavior analysis | Medium
CAPTCHA | reCAPTCHA, Turnstile, hCaptcha | Solvable via CaptchaAI
Behavioral | Mouse movement, scroll patterns, timing | Hard
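
Before choosing a countermeasure, it helps to know which layer rejected a request. The sketch below is one rough way to sort a response into "captcha", "blocked", or "ok". The function name, the status-code set, and the marker list are illustrative assumptions, not rules any particular site follows.

```python
# Heuristic only: the status codes and markers below are assumptions
# and should be tuned for the target site.
BLOCK_STATUS_CODES = {403, 429, 503}
CAPTCHA_MARKERS = ("g-recaptcha", "cf-turnstile", "h-captcha")

def classify_response(status_code, body):
    """Return 'captcha', 'blocked', or 'ok' for a fetched page."""
    text = body.lower()
    # A CAPTCHA widget in the markup takes priority over the status code,
    # because challenge pages are often served with a 403.
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return "captcha"
    if status_code in BLOCK_STATUS_CODES:
        return "blocked"
    return "ok"
```

A "blocked" result usually points at the IP or header layers, while "captcha" is the case a solving service can handle.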

Technique 1: Realistic HTTP Headers

import requests
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
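        # Advertise "br" only if the brotli package is installed; otherwise
        # requests cannot decode Brotli-compressed responses.
        "Accept-Encoding": "gzip, deflate",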
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

Technique 2: Request Timing and Rate Limiting

import random
import time

import requests

def polite_delay():
    """Sleep for a random 2-7 seconds between requests."""
    time.sleep(random.uniform(2, 7))

def scrape_pages(urls):
    """Fetch each URL in turn, pausing between requests."""
    session = requests.Session()
    results = []

    for url in urls:
        # get_headers() is defined in Technique 1
        session.headers.update(get_headers())
        resp = session.get(url, timeout=30)
        results.append(resp.text)
        polite_delay()

    return results
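
A fixed random delay treats every request the same, even when a batch of URLs spans several sites. As a possible refinement, the sketch below enforces a minimum interval per host instead. `HostThrottle` and its parameters are hypothetical names introduced for illustration; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time the last request was released

    def wait(self, url):
        """Sleep if the host was hit too recently. Returns the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` just before `session.get(url)` keeps requests to one host spaced out while letting requests to different hosts proceed immediately.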

Technique 3: Proxy Rotation

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

def get_proxy():
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

session = requests.Session()
session.proxies = get_proxy()
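
Random choice can keep selecting a proxy that has already been banned. One possible alternative, sketched below, is a round-robin pool that skips proxies marked as failed. `ProxyPool` is a hypothetical helper, and deciding when a proxy counts as failed (for example after repeated 403 responses) is left to the caller.

```python
import itertools

class ProxyPool:
    """Round-robin proxy rotation that skips proxies marked as failed."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._failed = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_failed(self, proxy):
        """Exclude a proxy URL from future rotation."""
        self._failed.add(proxy)

    def next_proxy(self):
        """Return a requests-style proxies dict for the next healthy proxy."""
        # One full pass over the list is enough to find a healthy proxy
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._failed:
                return {"http": proxy, "https": proxy}
        raise RuntimeError("no healthy proxies left")
```

The returned dict has the same shape as the one from `get_proxy()`, so it can be assigned to `session.proxies` directly.
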

Technique 4: Session and Cookie Management

def create_warm_session(base_url):
    """Create a session with realistic cookie history."""
    session = requests.Session()
    session.headers = get_headers()

    # Visit the homepage first to get cookies
    session.get(base_url)
    time.sleep(random.uniform(1, 3))

    # Visit a few pages to build cookie history
    session.get(f"{base_url}/about")
    time.sleep(random.uniform(1, 3))

    return session

Technique 5: Referrer Chain

def scrape_with_referrer(session, urls):
    """Fetch URLs in order, sending the previous URL as the Referer."""
    prev_url = None
    results = []

    for url in urls:
        headers = get_headers()
        if prev_url:
            headers["Referer"] = prev_url

        resp = session.get(url, headers=headers)
        results.append(resp.text)
        prev_url = url
        polite_delay()

    return results

When CAPTCHAs Still Appear: CaptchaAI Integration

Even with a perfect stealth configuration, CAPTCHAs will eventually appear at scale. Add CaptchaAI as a fallback:

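import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def solve_if_captcha(session, resp, url):
    """Return resp unchanged unless it contains a reCAPTCHA; then solve and resubmit."""
    captcha_indicators = ["g-recaptcha", "cf-turnstile", "captcha"]
    if not any(ind in resp.text.lower() for ind in captcha_indicators):
        return resp  # No CAPTCHA, return original response

    soup = BeautifulSoup(resp.text, "html.parser")

    # Only reCAPTCHA is handled here; other CAPTCHA types fall through
    rc = soup.find("div", class_="g-recaptcha")
    if rc and rc.get("data-sitekey"):
        site_key = rc["data-sitekey"]
        submit = requests.get("https://ocr.captchaai.com/in.php", params={
            "key": API_KEY, "method": "userrecaptcha",
            "googlekey": site_key, "pageurl": url
        }, timeout=30)
        # A successful submit looks like "OK|<task id>"; anything else is an error
        if not submit.text.startswith("OK|"):
            raise RuntimeError(f"CaptchaAI submit failed: {submit.text}")
        task_id = submit.text.split("|", 1)[1]

        # Poll every 5 seconds, for up to 5 minutes
        for _ in range(60):
            time.sleep(5)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": API_KEY, "action": "get", "id": task_id
            }, timeout=30)
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                token = result.text.split("|", 1)[1]
                # Assumes the page accepts the token as a form POST to the same URL
                return session.post(url, data={"g-recaptcha-response": token})
            raise RuntimeError(f"CaptchaAI solve failed: {result.text}")

    # No solvable CAPTCHA found, or the solve timed out
    return resp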

# Usage in scraper
session = create_warm_session("https://example.com")
resp = session.get("https://example.com/data")
resp = solve_if_captcha(session, resp, "https://example.com/data")
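
The function above checks the API's plain-text replies in two separate places. A small parser, sketched below, could centralize that logic. `parse_api_response` is a hypothetical helper; it relies only on the two reply shapes used in this guide ("OK|..." and "CAPCHA_NOT_READY") and assumes that any other reply is an error string.

```python
def parse_api_response(text):
    """Split a plain-text API reply into a (status, payload) tuple."""
    text = text.strip()
    if text.startswith("OK|"):
        # Split once so a payload that itself contains "|" stays intact
        return "ok", text.split("|", 1)[1]
    if text == "CAPCHA_NOT_READY":
        return "pending", None
    # Assumption: every other reply is an error string from the service
    return "error", text
```

With this in place, the polling loop can branch on the status value and surface the error text instead of silently retrying.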

Complete Stealth Scraper

import requests
import time
import random
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

class StealthScraper:
    def __init__(self, proxies=None):
        self.session = requests.Session()
        self.proxies = proxies or []

    def scrape(self, url):
        self.session.headers = get_headers()
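        if self.proxies:
            # Pick from the proxies passed to this instance, not the global PROXIES list
            proxy = random.choice(self.proxies)
            self.session.proxies = {"http": proxy, "https": proxy}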

        resp = self.session.get(url)
        resp = solve_if_captcha(self.session, resp, url)

        polite_delay()
        return resp.text

    def scrape_batch(self, urls):
        results = []
        for url in urls:
            try:
                html = self.scrape(url)
                results.append({"url": url, "html": html, "success": True})
            except Exception as e:
                results.append({"url": url, "error": str(e), "success": False})
        return results

Detection Checklist

Check Status
Rotating User-Agent strings
Realistic Accept/Language headers
Random delays between requests (2-7s)
Proxy rotation (residential preferred)
Session cookie management
Referrer headers
CaptchaAI integration for CAPTCHA fallback
Error handling and retries
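
The last checklist item, retries, has no example in the sections above. The sketch below shows one common approach: exponential backoff with jitter on retryable status codes. The function names, the retryable status set, and the default delays are assumptions chosen for illustration; `fetch` is any callable that returns an object with a `status_code` attribute, such as `session.get`.

```python
import random
import time

def backoff_delays(retries=4, base=2.0, cap=60.0):
    """Exponential delays: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

def fetch_with_retries(fetch, url, retries=4, base=2.0, sleep=time.sleep):
    """Call fetch(url), retrying on retryable statuses until retries run out."""
    retryable = {429, 500, 502, 503, 504}  # assumed set, adjust per site
    resp = fetch(url)
    for delay in backoff_delays(retries, base):
        if resp.status_code not in retryable:
            break
        # Up to 1 s of jitter so parallel workers do not retry in lockstep
        sleep(delay + random.uniform(0, 1))
        resp = fetch(url)
    return resp
```

For example, `fetch_with_retries(session.get, "https://example.com/data")` returns the first non-retryable response, or the last response once the retries are exhausted.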

FAQ

What's the most important stealth configuration technique?

Proxy rotation usually has the highest impact, because many blocking decisions are IP-based. Combining residential proxies with CaptchaAI covers both IP-level and CAPTCHA-level protection.

Can I scrape at high speed without getting blocked?

Not from a single IP. Distribute requests across many proxies and accept that CAPTCHAs will appear. CaptchaAI solves them in seconds, so the overhead is minimal.

Does CaptchaAI work with all anti-bot systems?

CaptchaAI solves the CAPTCHA component of anti-bot systems (reCAPTCHA, Turnstile, hCaptcha, Cloudflare Challenge). Other detection layers (JavaScript fingerprinting, behavioral analysis) require browser-level solutions.
