How to Handle reCAPTCHA v2 in Web Scraping Workflows

When your scraper hits a reCAPTCHA v2 challenge, the workflow stops. The page waits for a human to solve the checkbox or image grid before serving the data you need. The fastest way to resume scraping is to route the CAPTCHA to a solver API: extract the sitekey and page URL, send them to CaptchaAI, receive a valid token, and inject it back into the page.

This guide shows the complete flow with working code for Python (Selenium + requests) and Node.js (Puppeteer).


How the workflow works

Every reCAPTCHA v2 widget has two parameters your scraper needs:

  1. googlekey — the public sitekey embedded in the page HTML
  2. pageurl — the URL where the CAPTCHA appears

Your scraper sends these to the CaptchaAI API, waits for a solved token, and injects the token back into the page's g-recaptcha-response field (or calls the callback function). The target site's backend verifies the token against Google and lets the request through.
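
The two parameters can often be read straight from the page source. The sketch below is a minimal illustration, assuming the widget is rendered as an element with a data-sitekey attribute in the static HTML (pages that render the widget from JavaScript would need a browser instead). The helper names extract_sitekey and build_submit_params are hypothetical, not part of any CaptchaAI client, and EXAMPLE_SITEKEY is a placeholder.

```python
import re
from typing import Optional

# Matches data-sitekey="..." or data-sitekey='...' anywhere in the markup.
_SITEKEY_RE = re.compile(r"""data-sitekey\s*=\s*["']([^"']+)["']""")


def extract_sitekey(html: str) -> Optional[str]:
    """Return the first data-sitekey value found in the HTML, or None."""
    match = _SITEKEY_RE.search(html)
    return match.group(1) if match else None


def build_submit_params(api_key: str, sitekey: str, page_url: str) -> dict:
    """Assemble the query parameters for the in.php submit request."""
    return {
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": sitekey,
        "pageurl": page_url,
        "json": 1,
    }


html = '<div class="g-recaptcha" data-sitekey="EXAMPLE_SITEKEY"></div>'
sitekey = extract_sitekey(html)
params = build_submit_params(
    "YOUR_API_KEY", sitekey, "https://example.com/protected-page"
)
```

The resulting params dictionary has the same shape as the one passed to in.php in the examples that follow.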


Python: Selenium + CaptchaAI

import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Step 1: Open the page with Selenium
driver = webdriver.Chrome()
driver.get("https://example.com/protected-page")

# Step 2: Extract the sitekey
sitekey = driver.find_element(By.CSS_SELECTOR, ".g-recaptcha").get_attribute("data-sitekey")
page_url = driver.current_url

# Step 3: Submit to CaptchaAI
response = requests.get("https://ocr.captchaai.com/in.php", params={
    "key": "YOUR_API_KEY",
    "method": "userrecaptcha",
    "googlekey": sitekey,
    "pageurl": page_url,
    "json": 1
}).json()

task_id = response["request"]

# Step 4: Poll for result
token = None
for _ in range(40):
    time.sleep(5)
    result = requests.get("https://ocr.captchaai.com/res.php", params={
        "key": "YOUR_API_KEY",
        "action": "get",
        "id": task_id,
        "json": 1
    }).json()

    if result.get("status") == 1:
        token = result["request"]
        break
    if result.get("request") != "CAPCHA_NOT_READY":
        raise RuntimeError(f"Solve failed: {result['request']}")

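# Step 5: Inject the token and submit
if token is None:
    raise TimeoutError("CAPTCHA was not solved within the polling window")

# Pass the token as a script argument instead of interpolating it into the JavaScript source
driver.execute_script(
    'document.getElementById("g-recaptcha-response").innerHTML = arguments[0];',
    token,
)

# If the widget declares a data-callback, call it; otherwise submit the form
callback = driver.execute_script(
    'var el = document.querySelector(".g-recaptcha"); '
    'return el ? el.getAttribute("data-callback") : null;'
)
if callback:
    driver.execute_script(f"{callback}(arguments[0]);", token)
else:
    driver.find_element(By.CSS_SELECTOR, "form").submit()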

# Step 6: Scrape the data
print(driver.page_source[:500])
driver.quit()
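
The polling branch in the script above distinguishes three outcomes: a solved token, a task that is still pending, and anything else, which is treated as a failure. As a rough sketch, that decision can be pulled into a small pure function that is easy to test in isolation. The name classify_poll_response is hypothetical, and the reply shapes are assumed to match those used in the script above.

```python
from typing import Optional, Tuple


def classify_poll_response(result: dict) -> Tuple[str, Optional[str]]:
    """Map a res.php JSON reply to (state, payload).

    ("ready", token)   -> status is 1 and "request" holds the token
    ("pending", None)  -> the solver reports CAPCHA_NOT_READY
    ("error", code)    -> any other reply is treated as a failure
    """
    if result.get("status") == 1:
        return "ready", result["request"]
    if result.get("request") == "CAPCHA_NOT_READY":
        return "pending", None
    return "error", str(result.get("request"))


state, payload = classify_poll_response(
    {"status": 0, "request": "CAPCHA_NOT_READY"}
)
```

A polling loop would sleep and retry on "pending", return the payload on "ready", and raise on "error".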

Node.js: Puppeteer + CaptchaAI

// Requires Node.js 18+ (uses the built-in fetch API)
const puppeteer = require("puppeteer");

async function scrapeWithCaptcha(url) {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });

  // Extract sitekey
  const sitekey = await page.$eval(".g-recaptcha", (el) => el.dataset.sitekey);

  // Submit to CaptchaAI
  const submitRes = await fetch(
    `https://ocr.captchaai.com/in.php?${new URLSearchParams({
      key: "YOUR_API_KEY",
      method: "userrecaptcha",
      googlekey: sitekey,
      pageurl: url,
      json: 1,
    })}`
  );
  const { request: taskId } = await submitRes.json();

  // Poll for result
  let token;
  for (let i = 0; i < 40; i++) {
    await new Promise((r) => setTimeout(r, 5000));
    const res = await fetch(
      `https://ocr.captchaai.com/res.php?${new URLSearchParams({
        key: "YOUR_API_KEY",
        action: "get",
        id: taskId,
        json: 1,
      })}`
    );
    const data = await res.json();
    if (data.status === 1) {
      token = data.request;
      break;
    }
    if (data.request !== "CAPCHA_NOT_READY")
      throw new Error(`Solve failed: ${data.request}`);
  }

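  if (!token) throw new Error("CAPTCHA was not solved within the polling window");

  // Inject the token, then call the callback or fall back to submitting the form.
  // Start waiting for navigation before triggering it to avoid a race.
  await Promise.all([
    page.waitForNavigation({ waitUntil: "networkidle2" }),
    page.evaluate((t) => {
      document.getElementById("g-recaptcha-response").innerHTML = t;
      const cb = document.querySelector(".g-recaptcha")?.dataset.callback;
      if (cb && typeof window[cb] === "function") window[cb](t);
      else document.querySelector("form").submit();
    }, token),
  ]);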
  const content = await page.content();
  await browser.close();
  return content;
}

scrapeWithCaptcha("https://example.com/protected-page").then(console.log);

Headless vs headed mode

Some sites detect headless browsers and block them before the CAPTCHA even appears. If you get blocked before seeing reCAPTCHA:

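  • Use headless: "new" in Puppeteer (Chrome's newer headless mode, which behaves more like a regular headed browser and tends to be harder to detect than the old one)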
  • Add --disable-blink-features=AutomationControlled to Chromium flags
  • Use a real User-Agent string
  • Consider rotating proxies for your scraping requests alongside your CaptchaAI solves
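
For the Selenium example, the flag and User-Agent suggestions above might be applied roughly as follows. This is a sketch, not a guaranteed bypass: chromium_flags and build_chrome_options are hypothetical helper names, and the User-Agent value is only an example of a complete Chrome string.

```python
from typing import List, Optional

# Example of a full desktop Chrome User-Agent string (illustrative value).
EXAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def chromium_flags(user_agent: Optional[str] = None) -> List[str]:
    """Build Chromium command-line flags that reduce obvious automation signals."""
    flags = ["--disable-blink-features=AutomationControlled"]
    if user_agent:
        flags.append(f"--user-agent={user_agent}")
    return flags


def build_chrome_options(user_agent: Optional[str] = None):
    """Return Selenium Chrome options with the flags applied (needs selenium installed)."""
    from selenium.webdriver.chrome.options import Options  # imported lazily

    options = Options()
    for flag in chromium_flags(user_agent):
        options.add_argument(flag)
    return options


flags = chromium_flags(EXAMPLE_UA)
# Usage with Selenium:
#   driver = webdriver.Chrome(options=build_chrome_options(EXAMPLE_UA))
```

These flags only remove the most obvious signals; sites with stronger bot detection may still block the session.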

HTTP-only approach (no browser)

If the target site checks the CAPTCHA as part of an ordinary form submission, with the token sent as a POST field, you can skip the browser entirely:

import requests
import time

session = requests.Session()
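# Use a complete, realistic User-Agent string; a truncated one is itself a bot signal
session.headers["User-Agent"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)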

# Load the page to get cookies
session.get("https://example.com/protected-page")

# Solve the CAPTCHA
sitekey = "6Le-wvkSAAAAAN..."  # extracted from page HTML
solve_resp = requests.get("https://ocr.captchaai.com/in.php", params={
    "key": "YOUR_API_KEY", "method": "userrecaptcha",
    "googlekey": sitekey, "pageurl": "https://example.com/protected-page",
    "json": 1
}).json()

task_id = solve_resp["request"]
time.sleep(15)

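# Poll until solved; fail fast on any reply other than "not ready"
token = None
for _ in range(30):
    result = requests.get("https://ocr.captchaai.com/res.php", params={
        "key": "YOUR_API_KEY", "action": "get", "id": task_id, "json": 1
    }).json()
    if result.get("status") == 1:
        token = result["request"]
        break
    if result.get("request") != "CAPCHA_NOT_READY":
        raise RuntimeError(f"Solve failed: {result['request']}")
    time.sleep(5)

if token is None:
    raise TimeoutError("CAPTCHA was not solved within the polling window")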

# Submit with token
resp = session.post("https://example.com/protected-page", data={
    "g-recaptcha-response": token,
    "other_field": "value"
})
print(resp.text[:500])

FAQ

Does solving reCAPTCHA v2 slow down my scraper?

Each solve takes 15–60 seconds. For high-volume scraping, run multiple solves in parallel (CaptchaAI supports concurrent tasks per thread).
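
One way to run solves in parallel is a thread pool, since each solve spends most of its time waiting on the network. The sketch below assumes a hypothetical solve(page_url) callable that wraps the submit-and-poll steps and returns a token; solve_many and the default of five workers are illustrative choices, not CaptchaAI recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def solve_many(
    page_urls: List[str],
    solve: Callable[[str], str],
    max_workers: int = 5,
) -> Dict[str, str]:
    """Run solve(url) for every URL concurrently and map each URL to its token."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so tokens line up with page_urls.
        tokens = list(pool.map(solve, page_urls))
    return dict(zip(page_urls, tokens))


def fake_solve(page_url: str) -> str:
    """Stand-in solver used only to demonstrate the call shape."""
    return f"token-for-{page_url}"


results = solve_many(["https://example.com/a", "https://example.com/b"], fake_solve)
```

If any solve raises, pool.map re-raises that exception when its result is consumed, so a failed task surfaces rather than silently producing a missing token.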

Can I cache reCAPTCHA tokens?

No. Each token is single-use and expires after ~2 minutes. You need a fresh solve for each protected page request.
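
Because of that short lifetime, it can help to record when each token was obtained and refuse to submit one that is close to expiring. The sketch below is illustrative: SolvedToken is a hypothetical helper, the 120-second lifetime follows the approximate figure above, and the 15-second safety margin is an arbitrary assumption.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

TOKEN_LIFETIME_SECONDS = 120  # roughly two minutes, per the answer above
SAFETY_MARGIN_SECONDS = 15    # arbitrary buffer for network latency (assumption)


@dataclass
class SolvedToken:
    value: str
    # Monotonic clock reading taken when the token was received.
    solved_at: float = field(default_factory=time.monotonic)

    def is_fresh(self, now: Optional[float] = None) -> bool:
        """True while the token is still inside its lifetime minus the margin."""
        current = time.monotonic() if now is None else now
        age = current - self.solved_at
        return age < (TOKEN_LIFETIME_SECONDS - SAFETY_MARGIN_SECONDS)


token = SolvedToken("example-token", solved_at=0.0)
fresh_early = token.is_fresh(now=10.0)
fresh_late = token.is_fresh(now=110.0)
```

A scraper would check is_fresh() immediately before submitting and request a new solve when it returns False.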

Do I need a browser to handle reCAPTCHA v2?

Not always. If the site accepts the g-recaptcha-response as a POST field, you can use an HTTP-only approach. If the site requires JavaScript-based token injection, you need a browser.

How do I handle rotating proxies with CaptchaAI?

CaptchaAI solves CAPTCHAs on its own infrastructure, so you do not need to pass your proxy for standard reCAPTCHA v2. Use your proxies for the scraping requests that follow.
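
For the scraping requests themselves, a simple round-robin rotation might look like the sketch below. It yields dictionaries in the shape accepted by the proxies argument of the requests library; proxy_rotation is a hypothetical helper and the proxy addresses are placeholders.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def proxy_rotation(proxy_urls: List[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style proxy mappings, cycling through the pool indefinitely."""
    for url in cycle(proxy_urls):
        yield {"http": url, "https": url}


# Placeholder proxy addresses; substitute your own pool.
proxies = proxy_rotation(
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]
)
first = next(proxies)
second = next(proxies)
third = next(proxies)  # wraps around to the first proxy again

# With requests: session.get(url, proxies=next(proxies))
```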

What if the site uses Enterprise reCAPTCHA?

Add enterprise=1 to your CaptchaAI request. See How to Solve reCAPTCHA v2 Enterprise Using API.
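
As a small illustration of that change, the submit parameters from the earlier examples would gain one extra field. The helper name with_enterprise is hypothetical, and any further Enterprise-specific parameters are outside the scope of this sketch.

```python
def with_enterprise(params: dict) -> dict:
    """Return a copy of the submit parameters with the enterprise flag set."""
    return {**params, "enterprise": 1}


base_params = {
    "key": "YOUR_API_KEY",
    "method": "userrecaptcha",
    "googlekey": "EXAMPLE_SITEKEY",  # placeholder sitekey
    "pageurl": "https://example.com/protected-page",
    "json": 1,
}
enterprise_params = with_enterprise(base_params)
```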


Start scraping through reCAPTCHA v2

  1. Get your API key at captchaai.com/api.php
  2. Extract the sitekey from the target page
  3. Use the code examples above to solve and inject
  4. Scale with concurrent solves for high-volume workflows
