Reducing CAPTCHA Interruptions in Web Scraping

Every CAPTCHA solved costs time and money. These techniques reduce how often CAPTCHAs appear during scraping, and CaptchaAI handles the ones that still get through.

Prevention Techniques

1. Use Residential Proxies

Datacenter IPs trigger CAPTCHAs 5-10x more often than residential IPs:

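import requests

url = "https://example.com/data"  # page to fetch

# Route both HTTP and HTTPS traffic through a residential proxy
proxies = {
    "http": "http://user:pass@residential-proxy.example.com:8080",
    "https": "http://user:pass@residential-proxy.example.com:8080"
}
resp = requests.get(url, proxies=proxies, timeout=30)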
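
The snippet above points every request at one endpoint. Where a provider hands out several endpoints, rotation could be sketched roughly as follows. The hostnames and credentials are placeholders, and `ProxyPool` is a hypothetical helper written for this article, not part of `requests`.

```python
import itertools


class ProxyPool:
    """Round-robin over a list of proxy URLs (hypothetical helper)."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxies(self):
        """Return a dict in the shape requests expects for its `proxies` argument."""
        proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}


# Placeholder endpoints; substitute the ones your provider gives you.
pool = ProxyPool([
    "http://user:pass@residential-proxy-1.example.com:8080",
    "http://user:pass@residential-proxy-2.example.com:8080",
])
```

Each request would then pass `proxies=pool.next_proxies()` to `requests.get` or `session.get`. Some providers rotate the exit IP behind a single gateway address, in which case client-side rotation like this may be unnecessary.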

2. Implement Request Delays

import random
import time

# Random delay between 3-8 seconds
time.sleep(random.uniform(3, 8))

Sites track request timing. Consistent intervals (exactly every 1 second) are a strong bot signal. Random delays mimic human behavior.
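
A reusable version of the delay might look like the sketch below. The occasional longer pause is an assumption added here to break up the rhythm further; `human_delay` and its defaults are illustrative, not tuned values.

```python
import random
import time


def human_delay(min_s=3.0, max_s=8.0, long_pause_chance=0.1, sleep=time.sleep):
    """Sleep for a random interval, occasionally much longer, and return the duration."""
    delay = random.uniform(min_s, max_s)
    # Now and then, pause longer, the way a reader lingers on a page.
    if random.random() < long_pause_chance:
        delay += random.uniform(max_s, 2 * max_s)
    sleep(delay)
    return delay
```

Calling `human_delay()` between requests replaces the bare `time.sleep(...)` line. The `sleep` parameter exists only so the function can be exercised in tests without actually waiting.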

3. Set Realistic Headers

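headers = {
    # Send a complete browser User-Agent; a truncated one stands out
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Add "br" only if the brotli package is installed; requests cannot decode it otherwise
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.google.com/",
    "DNT": "1",
    "Connection": "keep-alive"
}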

4. Maintain Session Cookies

import time

import requests

session = requests.Session()

# Visit homepage first to establish cookies
session.get("https://example.com")
time.sleep(2)

# Then access target pages with the same session, so the cookies are sent
session.get("https://example.com/data")

Sites expect returning visitors to have cookie history. A fresh session hitting deep pages is suspicious.
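
To carry cookie history from one run to the next, the cookies can be written to disk and reloaded. The following is a minimal sketch under a simplifying assumption: cookies are flattened to a name-to-value dict, which drops domain, path, and expiry information. `save_cookies` and `load_cookies` are hypothetical helper names.

```python
import json
from pathlib import Path


def save_cookies(cookies, path):
    """Write a name -> value cookie dict to a JSON file."""
    Path(path).write_text(json.dumps(cookies))


def load_cookies(path):
    """Read cookies saved by save_cookies; return {} if the file does not exist yet."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text())
```

With `requests`, this would be used as `session.cookies.update(load_cookies("cookies.json"))` before the first request and `save_cookies(session.cookies.get_dict(), "cookies.json")` at the end of the run.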

5. Use Referrer Chains

# Navigate like a human: homepage → search results → detail page
session.get("https://example.com")
time.sleep(2)
session.get("https://example.com/search?q=product", headers={"Referer": "https://example.com"})
time.sleep(3)
session.get("https://example.com/product/123", headers={"Referer": "https://example.com/search?q=product"})
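
The same pattern can be wrapped in a small helper so that each page automatically sends the previous URL as its Referer. This is a sketch: `browse_chain` is a hypothetical function, and the fixed pause is an assumption (a random delay would fit equally well).

```python
import time


def browse_chain(session, urls, pause=2.0, sleep=time.sleep):
    """Fetch urls in order, sending each previous URL as the Referer of the next."""
    responses = []
    previous = None
    for url in urls:
        # The first request has no Referer; later ones point at the page before.
        headers = {"Referer": previous} if previous else {}
        responses.append(session.get(url, headers=headers))
        previous = url
        sleep(pause)
    return responses
```

Passing a `requests.Session()` and the three URLs from the snippet above reproduces the same sequence of requests.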

6. Lower Concurrency

Concurrency | CAPTCHA Rate | Speed
1 thread | Lowest | Slow
3 threads | Low | Moderate
10 threads | High | Fast
50 threads | Very high | Fast but blocked

Start with 1-3 concurrent scrapers per site.
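
One way to enforce that cap is a thread pool with a small `max_workers`. The sketch below assumes a `fetch` callable that you supply (for example, a function that performs one request and returns the page text); `scrape_all` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, max_workers=3):
    """Run fetch(url) for every URL with at most max_workers requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order in its results.
        return list(pool.map(fetch, urls))
```

If `fetch` uses `requests`, giving each worker its own `Session` is the cautious choice, since sharing one session across threads is not guaranteed to be safe.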

7. Use APIs When Available

Many sites offer public APIs that don't require CAPTCHA solving:

Site | API Available | Notes
Amazon | Product Advertising API | Requires approval
Google | Custom Search API | 100 free queries/day
Twitter/X | API v2 | Paid tiers
Reddit | Reddit API | Free with app registration

Check if your target has an API before building a scraper.

8. Scrape During Off-Peak Hours

Some sites are less aggressive with bot detection during low-traffic periods (late night and weekends in the site's local time zone). Rate limits may be higher and monitoring less strict.
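
A scheduler can gate scraping on a simple time check. In this sketch the 01:00 to 06:00 window is an assumed default, not a measured one, and `is_off_peak` is a hypothetical helper; `now` should be expressed in the target site's local time.

```python
from datetime import datetime


def is_off_peak(now, start_hour=1, end_hour=6):
    """Return True on weekends, or on weekdays between start_hour and end_hour."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return start_hour <= now.hour < end_hour
```

A job could call `is_off_peak(datetime.now())` before starting a batch and sleep or exit when it returns False.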

When Prevention Fails: CaptchaAI

No prevention technique eliminates CAPTCHAs entirely. At scale, you need both prevention and solving:

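import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"

def scrape_with_fallback(url, session):
    resp = session.get(url)

    # No reCAPTCHA widget on the page: return it as-is
    if "g-recaptcha" not in resp.text:
        return resp.text

    # Read the site key from the widget markup
    soup = BeautifulSoup(resp.text, "html.parser")
    widget = soup.find("div", class_="g-recaptcha")
    if widget is None or not widget.get("data-sitekey"):
        raise RuntimeError("reCAPTCHA marker found but no data-sitekey attribute")
    site_key = widget["data-sitekey"]

    # Submit the task to CaptchaAI; a successful reply looks like "OK|<task id>"
    submit = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY, "method": "userrecaptcha",
        "googlekey": site_key, "pageurl": url
    })
    if not submit.text.startswith("OK|"):
        raise RuntimeError(f"CaptchaAI submit failed: {submit.text}")
    task_id = submit.text.split("|", 1)[1]

    # Poll every 5 seconds, for up to 5 minutes
    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        })
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            token = result.text.split("|", 1)[1]
            # Assumes the target accepts the token as a form field POSTed to the same URL
            resp = session.post(url, data={"g-recaptcha-response": token})
            return resp.text
        # Any other reply is an error code from the solver
        raise RuntimeError(f"CaptchaAI solve failed: {result.text}")

    raise TimeoutError("CaptchaAI did not return a token within 5 minutes")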

Cost Impact of Prevention

Good prevention techniques reduce CaptchaAI usage significantly:

Approach | CAPTCHAs per 1K pages | Cost per 1K pages
No prevention | ~200-500 | $0.20-0.50
Basic headers + delays | ~50-100 | $0.05-0.10
Residential proxies + headers | ~10-30 | $0.01-0.03
Full stealth setup | ~5-15 | $0.005-0.015

Investing in prevention pays for itself through lower CAPTCHA solving costs.
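
The figures in the table work out to roughly $1 per 1,000 solves (200 CAPTCHAs costing $0.20). That rate is inferred from the table rather than taken from a price list, so treat it as an assumption. Scaling it to a larger crawl is simple arithmetic, sketched here with a hypothetical `solving_cost` function:

```python
def solving_cost(pages, captchas_per_1k_pages, price_per_1k_solves=1.0):
    """Estimate solving cost in dollars for a crawl of `pages` pages."""
    captchas = pages / 1000 * captchas_per_1k_pages
    return captchas / 1000 * price_per_1k_solves


# One million pages at the low end of the first and last rows of the table.
no_prevention = solving_cost(1_000_000, 200)  # 200,000 solves
full_stealth = solving_cost(1_000_000, 5)     # 5,000 solves
```

Under these assumptions, a one-million-page crawl costs about $200 in solving fees with no prevention and about $5 with a full stealth setup. Proxy and infrastructure costs are not included in either figure.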

FAQ

What's the single most effective technique?

Residential proxy rotation. It addresses the most common trigger (IP reputation) and works across all sites.

Do I still need CaptchaAI if I use all these techniques?

Yes, for production reliability. Prevention reduces CAPTCHAs but doesn't eliminate them. CaptchaAI ensures your scraper never gets stuck on an unsolved CAPTCHA.

How do I know which technique helps most for my target site?

Monitor your CAPTCHA rate. Add techniques one at a time and measure the reduction. Start with proxies and headers as they have the highest impact.
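
Measuring the rate can be as simple as counting pages and CAPTCHA hits per configuration. The sketch below uses a hypothetical `CaptchaRateTracker` class; the labels are whatever names you give your setups.

```python
class CaptchaRateTracker:
    """Count pages fetched and CAPTCHAs seen, per configuration label."""

    def __init__(self):
        self._stats = {}

    def record(self, label, saw_captcha):
        """Register one fetched page and whether it showed a CAPTCHA."""
        pages, captchas = self._stats.get(label, (0, 0))
        self._stats[label] = (pages + 1, captchas + int(saw_captcha))

    def rate_per_1k(self, label):
        """CAPTCHAs per 1,000 pages for the given label (0.0 if no pages yet)."""
        pages, captchas = self._stats.get(label, (0, 0))
        return 1000 * captchas / pages if pages else 0.0
```

After each request, something like `tracker.record("proxies+headers", "g-recaptcha" in resp.text)` logs the outcome, and comparing `rate_per_1k` across labels shows which change actually helped.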

