Most high-value websites use CAPTCHAs as part of their anti-bot defense. This guide covers strategies for scraping these sites reliably using CaptchaAI, including how to identify CAPTCHA types, solve them automatically, and build resilient scrapers.
Common CAPTCHA Implementations
| CAPTCHA | Where Used | CaptchaAI Method |
|---|---|---|
| reCAPTCHA v2 | Login forms, search pages | method=userrecaptcha |
| reCAPTCHA v3 | Background scoring on any page | method=userrecaptcha&version=v3 |
| Cloudflare Turnstile | Sites behind Cloudflare | method=turnstile |
| Cloudflare Challenge | Full-page Cloudflare block | method=cloudflare_challenge |
| Image/OCR CAPTCHA | Legacy sites, Amazon | method=base64 |
| hCaptcha | Privacy-focused sites | method=hcaptcha |
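
As a rough illustration of how the table can drive a scraper, the sketch below maps widget markers found in a page's HTML to the method names listed above. The marker strings are assumptions based on the default widget class names (sites can rename them), and `identify_captcha_method` is a hypothetical helper, not part of CaptchaAI. reCAPTCHA v3 and the full-page Cloudflare challenge do not render these widgets, so this sketch does not cover them.

```python
# Marker -> CaptchaAI method, taken from the table above.
# The markers are assumed default class names and may differ per site.
CAPTCHA_MARKERS = [
    ("g-recaptcha", "userrecaptcha"),
    ("cf-turnstile", "turnstile"),
    ("h-captcha", "hcaptcha"),
]


def identify_captcha_method(html):
    """Return the method for the first known widget marker in html, or None."""
    lowered = html.lower()
    for marker, method in CAPTCHA_MARKERS:
        if marker in lowered:
            return method
    return None
```

A caller could branch on the returned method name and fall back to plain scraping when the result is `None`.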
Strategy 1: Detect and Solve on Demand
Often the most reliable approach is to scrape normally and solve CAPTCHAs only when they appear:
```python
import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"
SUBMIT_URL = "https://ocr.captchaai.com/in.php"
RESULT_URL = "https://ocr.captchaai.com/res.php"


class ProtectedScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def scrape(self, url):
        resp = self.session.get(url)
        # Check for CAPTCHA
        if self._has_captcha(resp.text):
            resp = self._handle_captcha(resp.text, url)
        return resp.text

    def _has_captcha(self, html):
        # The generic "captcha" marker also matches types this class cannot
        # solve; those end up raising in _handle_captcha.
        indicators = ["g-recaptcha", "cf-turnstile", "h-captcha", "captcha"]
        lowered = html.lower()
        return any(ind in lowered for ind in indicators)

    def _handle_captcha(self, html, url):
        soup = BeautifulSoup(html, "html.parser")

        # reCAPTCHA v2
        rc = soup.find("div", class_="g-recaptcha")
        if rc:
            token = self._solve_recaptcha(rc["data-sitekey"], url)
            # Real forms usually need their other fields posted as well
            return self.session.post(url, data={"g-recaptcha-response": token})

        # Cloudflare Turnstile
        ts = soup.find("div", class_="cf-turnstile")
        if ts:
            token = self._solve_turnstile(ts["data-sitekey"], url)
            return self.session.post(url, data={"cf-turnstile-response": token})

        raise RuntimeError("Unknown CAPTCHA type")

    def _submit(self, params):
        # Fail loudly on an error response instead of hitting an IndexError
        resp = requests.get(SUBMIT_URL, params={"key": API_KEY, **params})
        if not resp.text.startswith("OK|"):
            raise RuntimeError(f"CaptchaAI submit failed: {resp.text}")
        return resp.text.split("|", 1)[1]

    def _solve_recaptcha(self, site_key, page_url):
        task_id = self._submit({
            "method": "userrecaptcha",
            "googlekey": site_key, "pageurl": page_url
        })
        return self._poll(task_id)

    def _solve_turnstile(self, site_key, page_url):
        task_id = self._submit({
            "method": "turnstile",
            "sitekey": site_key, "pageurl": page_url
        })
        return self._poll(task_id)

    def _poll(self, task_id):
        # 60 attempts x 5 seconds = at most 5 minutes of waiting
        for _ in range(60):
            time.sleep(5)
            result = requests.get(RESULT_URL, params={
                "key": API_KEY, "action": "get", "id": task_id
            })
            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise RuntimeError(f"CaptchaAI error: {result.text}")
        raise TimeoutError("CAPTCHA was not solved within 5 minutes")


# Usage
scraper = ProtectedScraper()
html = scraper.scrape("https://example.com/data")
```
Strategy 2: Pre-Solve for Known CAPTCHA Pages
If you know which pages always have CAPTCHAs, solve preemptively:
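```python
# Reuses ProtectedScraper from Strategy 1.
def scrape_known_captcha_page(url, site_key):
    scraper = ProtectedScraper()

    # Solve before even loading the page
    token = scraper._solve_recaptcha(site_key, url)

    # Submit directly with token, keeping the scraper's session and headers
    resp = scraper.session.post(url, data={
        "g-recaptcha-response": token,
        "query": "search term"
    })
    return resp.text
```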
Strategy 3: Cloudflare-Protected Sites
Sites behind Cloudflare often require a cf_clearance cookie:
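```python
# Reuses requests, time, and API_KEY from Strategy 1.
def get_cloudflare_clearance(url, proxy):
    resp = requests.get("https://ocr.captchaai.com/in.php", params={
        "key": API_KEY,
        "method": "cloudflare_challenge",
        "pageurl": url,
        "proxy": proxy,
        "proxytype": "HTTP"
    })
    if not resp.text.startswith("OK|"):
        raise RuntimeError(f"CaptchaAI submit failed: {resp.text}")
    task_id = resp.text.split("|", 1)[1]

    for _ in range(60):
        time.sleep(5)
        result = requests.get("https://ocr.captchaai.com/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id
        })
        if result.text == "CAPCHA_NOT_READY":
            continue
        if "cf_clearance" in result.text:
            # Parse cf_clearance and user_agent from response
            return result.text
        # Any other response is treated as an error, so stop polling early
        raise RuntimeError(f"CaptchaAI error: {result.text}")
    raise TimeoutError("Cloudflare challenge was not solved within 5 minutes")
```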
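
Once `cf_clearance` and the user agent have been parsed out of the solver's response, they need to travel with every later request, through the same proxy that was used for the solve. The sketch below shows one way to bundle them, assuming the two values are already available as strings (this guide does not show the response format, so the parsing step is left out). `clearance_request_kwargs` is a hypothetical helper name.

```python
def clearance_request_kwargs(cf_clearance, user_agent, proxy_url):
    """Build keyword arguments for requests.get/post that reuse a clearance.

    All three inputs are assumed to come from the same solve: the cookie
    value, the user agent reported with it, and the proxy used to obtain it.
    """
    return {
        "cookies": {"cf_clearance": cf_clearance},
        "headers": {"User-Agent": user_agent},
        "proxies": {"http": proxy_url, "https": proxy_url},
    }
```

The result can be unpacked into a call such as `requests.get(url, **kwargs)`, which keeps the cookie, user agent, and proxy consistent across requests.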
Multi-Page Scraping Pattern
```python
import random

# Reuses time, BeautifulSoup, and ProtectedScraper from Strategy 1.
def scrape_multiple_pages(base_url, pages):
    scraper = ProtectedScraper()
    results = []

    for page in pages:
        url = f"{base_url}?page={page}"
        try:
            html = scraper.scrape(url)
            soup = BeautifulSoup(html, "html.parser")
            items = soup.find_all("div", class_="item")
            results.extend([item.text.strip() for item in items])
            print(f"Page {page}: {len(items)} items")
        except Exception as e:
            print(f"Page {page} failed: {e}")

        # Random pause between pages to keep the request rate low
        time.sleep(random.uniform(2, 5))

    return results
```
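
The pattern above skips a page as soon as it fails. If a transient block is likely, one option is to retry a few times with a growing delay before giving up. The following is a minimal sketch of that idea, not part of the original pattern; `with_retries` and its parameters are hypothetical names, and the delay values are arbitrary.

```python
import random
import time


def with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter.

    The last failure is re-raised so the caller can still log and skip.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise
            # 2s, 4s, 8s, ... plus up to 1s of jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

Inside the loop, `html = with_retries(scraper.scrape, url)` could stand in for the direct `scraper.scrape(url)` call.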
Troubleshooting
| Issue | Fix |
|---|---|
| CAPTCHA appears on every page | Use proxies; reduce request rate |
| Token rejected after solving | Token may have expired; use within 120s |
| Cloudflare blocks despite clearance | Use same proxy and user-agent for all requests |
| Site returns different page after solve | Check for additional redirects or cookies |
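
For the expired-token case, it can help to record when each token was solved and check its age before submitting. The sketch below assumes the 120-second window from the table; `SolvedToken` is a hypothetical helper, not part of CaptchaAI.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SolvedToken:
    """A solved CAPTCHA token plus the monotonic time it was obtained."""
    value: str
    solved_at: float = field(default_factory=time.monotonic)

    def is_fresh(self, max_age=120.0, now=None):
        # `now` is injectable so the check can be exercised without waiting
        now = time.monotonic() if now is None else now
        return (now - self.solved_at) < max_age
```

A scraper could wrap each solved token in `SolvedToken` and request a new solve whenever `is_fresh()` returns `False`.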
FAQ
Which sites are hardest to scrape?
Sites using Cloudflare Enterprise, PerimeterX, or Akamai Bot Manager are the most challenging. CaptchaAI handles their CAPTCHA components; combine with stealth browsers and proxies for best results.
Can I scrape sites that require login?
Yes. Log in first (solving any login CAPTCHA), maintain the session cookies, then scrape authenticated pages. CaptchaAI handles CAPTCHAs at any stage.
How do I handle JavaScript-rendered pages?
Use Selenium, Puppeteer, or Playwright to render JavaScript, then extract CAPTCHA parameters and solve via CaptchaAI. See Selenium CAPTCHA Handling.
Related Guides
- How to Handle CAPTCHA Challenges in Web Scraping Workflows
- Headless Browser CAPTCHA Issues
- CAPTCHA Detection in Scraping Explained
Full Working Code
Complete runnable examples for this article in Python, Node.js, PHP, Go, Java, C#, Ruby, Rust, Kotlin & Bash.
View on GitHub →