
Scrapy + CaptchaAI Integration Guide

Scrapy is one of the most widely used Python crawling frameworks. This guide shows how to add CaptchaAI CAPTCHA solving to your spiders with a custom downloader middleware.

Requirements

| Requirement | Details |
| --- | --- |
| Python | 3.8+ |
| Scrapy | 2.5+ |
| requests | For CaptchaAI API calls |
| CaptchaAI API key | Required; read from the CAPTCHAAI_API_KEY setting |

pip install scrapy requests

CaptchaAI Solver Module

Create captcha_solver.py in your Scrapy project root (the directory that contains scrapy.cfg):

import requests
import time


class CaptchaAISolver:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://ocr.captchaai.com"

    def solve_recaptcha(self, site_key, page_url, timeout=300):
        # Submit the task; a successful response looks like "OK|<task_id>".
        resp = requests.get(f"{self.base_url}/in.php", params={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": site_key,
            "pageurl": page_url,
        }, timeout=30)  # per-request network timeout, separate from the solve deadline

        if not resp.text.startswith("OK|"):
            raise RuntimeError(f"Submit failed: {resp.text}")

        task_id = resp.text.split("|", 1)[1]
        deadline = time.time() + timeout

        # Poll every 5 seconds until the task is solved or the deadline passes.
        while time.time() < deadline:
            time.sleep(5)
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            }, timeout=30)

            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise RuntimeError(f"Solve failed: {result.text}")

        raise TimeoutError(f"Task {task_id} timed out")

    def solve_image(self, image_base64, timeout=120):
        # Send the image as POST form data: a base64 body can easily exceed
        # URL length limits if it is placed in a GET query string.
        resp = requests.post(f"{self.base_url}/in.php", data={
            "key": self.api_key,
            "method": "base64",
            "body": image_base64,
        }, timeout=30)

        if not resp.text.startswith("OK|"):
            raise RuntimeError(f"Submit failed: {resp.text}")

        task_id = resp.text.split("|", 1)[1]
        deadline = time.time() + timeout

        # Poll every 5 seconds until the task is solved or the deadline passes.
        while time.time() < deadline:
            time.sleep(5)
            result = requests.get(f"{self.base_url}/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
            }, timeout=30)

            if result.text == "CAPCHA_NOT_READY":
                continue
            if result.text.startswith("OK|"):
                return result.text.split("|", 1)[1]
            raise RuntimeError(f"Solve failed: {result.text}")

        raise TimeoutError(f"Task {task_id} timed out")
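
Both solver methods branch on the same three response shapes from res.php: the CAPCHA_NOT_READY status, an OK|<answer> result, or anything else, which is treated as an error. As a minimal sketch of that branching in isolation, the helper below classifies a response body so the logic can be checked without network access. The name classify_result and the SOME_ERROR_CODE string are illustrative assumptions, not part of the CaptchaAI API or the module above.

```python
def classify_result(text):
    """Classify a res.php response body the way the polling loops do.

    Returns a (status, value) tuple where status is "pending", "solved",
    or "error". Hypothetical helper for illustration only.
    """
    if text == "CAPCHA_NOT_READY":  # status string as checked in the solver code
        return ("pending", None)
    if text.startswith("OK|"):
        # Split only once so an answer containing "|" stays intact.
        return ("solved", text.split("|", 1)[1])
    return ("error", text)


print(classify_result("CAPCHA_NOT_READY"))  # ('pending', None)
print(classify_result("OK|abc|def"))        # ('solved', 'abc|def')
print(classify_result("SOME_ERROR_CODE"))   # placeholder error string
```

Splitting with a limit of 1 matters for the answer: a solved value that happens to contain a pipe character would otherwise be truncated.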

Scrapy Middleware

Add the following to your project's middlewares.py (myproject/middlewares.py in this guide):

import re
from scrapy.http import HtmlResponse
from captcha_solver import CaptchaAISolver


class CaptchaAIMiddleware:
    """Scrapy downloader middleware that detects and solves CAPTCHAs."""

    def __init__(self, api_key):
        self.solver = CaptchaAISolver(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("CAPTCHAAI_API_KEY")
        if not api_key:
            raise ValueError("CAPTCHAAI_API_KEY setting is required")
        return cls(api_key)

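    def process_response(self, request, response, spider):
        # Non-HTML responses (images, JSON, binary files) cannot contain
        # CAPTCHA markup, and binary responses have no .text attribute.
        if not isinstance(response, HtmlResponse):
            return response

        # Check for reCAPTCHA on the page
        site_key = self._find_recaptcha_key(response.text)
        if site_key:
            spider.logger.info(f"reCAPTCHA detected on {response.url}")
            # Blocking call: returns only once the solver has a token.
            token = self.solver.solve_recaptcha(site_key, response.url)
            request.meta["captcha_token"] = token
            spider.logger.info("CAPTCHA solved successfully")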

        # Check for image CAPTCHA
        captcha_img = self._find_image_captcha(response)
        if captcha_img:
            spider.logger.info(f"Image CAPTCHA detected on {response.url}")
            text = self.solver.solve_image(captcha_img)
            request.meta["captcha_text"] = text
            spider.logger.info(f"Image CAPTCHA solved: {text}")

        return response

    def _find_recaptcha_key(self, html):
        match = re.search(
            r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', html
        )
        return match.group(1) if match else None

    def _find_image_captcha(self, response):
        img = response.css("img#captcha-image::attr(src)").get()
        if img and img.startswith("data:image"):
            return img.split(",", 1)[1]
        return None
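
To see what the two detection helpers actually match, here is a small standalone sketch that applies the same regex and the same data-URI check to sample inputs. The function names mirror the middleware's private methods, and the sample markup and key value are made up for illustration.

```python
import re

# Same pattern as _find_recaptcha_key in the middleware above.
SITEKEY_RE = re.compile(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']')


def find_recaptcha_key(html):
    """Return the first data-sitekey value found in html, or None."""
    match = SITEKEY_RE.search(html)
    return match.group(1) if match else None


def extract_data_uri_payload(src):
    """Return the base64 payload of a data:image URI, or None otherwise."""
    if src and src.startswith("data:image") and "," in src:
        return src.split(",", 1)[1]
    return None


# Made-up sample values, not real keys or images.
sample_html = '<div class="g-recaptcha" data-sitekey="EXAMPLE_site-key_123"></div>'

print(find_recaptcha_key(sample_html))                            # EXAMPLE_site-key_123
print(find_recaptcha_key("<p>no captcha here</p>"))               # None
print(extract_data_uri_payload("data:image/png;base64,aGVsbG8="))  # aGVsbG8=
print(extract_data_uri_payload("/static/captcha.png"))            # None
```

The last call shows a limitation of the image check as written: a CAPTCHA image served from a regular URL rather than an inline data URI is not picked up, so such pages would need the image downloaded and base64-encoded first.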

Settings Configuration

Add to settings.py:

import os

CAPTCHAAI_API_KEY = os.environ.get("CAPTCHAAI_API_KEY")

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}

Spider Example

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # If CAPTCHA was solved, the token is in meta
        token = response.meta.get("captcha_token")
        if token:
            # Resubmit the page with the token
            yield scrapy.FormRequest(
                url=response.url,
                formdata={"g-recaptcha-response": token},
                callback=self.parse_products,
            )
        else:
            yield from self.parse_products(response)

    def parse_products(self, response):
        for product in response.css(".product-item"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.urljoin(
                    product.css("a::attr(href)").get()
                ),
            }

        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

Retry on CAPTCHA Pages

Add automatic retry when CAPTCHAs appear:

class CaptchaRetryMiddleware:
    """Retry requests that return CAPTCHA challenge pages."""

    max_retries = 3

    def process_response(self, request, response, spider):
        if self._is_captcha_page(response):
            retries = request.meta.get("captcha_retries", 0)
            if retries < self.max_retries:
                request.meta["captcha_retries"] = retries + 1
                spider.logger.info(
                    f"CAPTCHA page detected, retry {retries + 1}"
                )
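                # Copy the request and bypass the duplicate filter; otherwise
                # the scheduler drops the retry as an already-seen URL.
                retry_request = request.copy()
                retry_request.dont_filter = True
                return retry_request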

        return response

    def _is_captcha_page(self, response):
        indicators = [
            "g-recaptcha",
            "cf-turnstile",
            "captcha-image",
            "Please verify you are human",
        ]
        return any(ind in response.text for ind in indicators)
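
The retry middleware only takes effect once it is registered in settings.py. Scrapy calls process_response in decreasing priority order, so a middleware with a higher number sees each response first. The sketch below assumes you want a CAPTCHA page retried before the solver is invoked; the value 570 is an assumption chosen to sit just above the 560 used earlier, and myproject is the same placeholder package name.

```python
# settings.py (sketch): register both middlewares.
DOWNLOADER_MIDDLEWARES = {
    # Higher number = closer to the downloader = sees responses first,
    # so CAPTCHA pages are retried before any solve is attempted.
    "myproject.middlewares.CaptchaRetryMiddleware": 570,
    # Runs on responses that the retry middleware lets through.
    "myproject.middlewares.CaptchaAIMiddleware": 560,
}
```

With this ordering, a CAPTCHA page is re-requested up to max_retries times, and only a response that still shows a CAPTCHA after that reaches the solving middleware. Reversing the two numbers would solve every CAPTCHA page and then retry it anyway, which spends solves unnecessarily.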

Running the Spider

export CAPTCHAAI_API_KEY="YOUR_API_KEY"
scrapy crawl products -o products.json

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| ValueError: CAPTCHAAI_API_KEY setting is required | Missing env var | Set CAPTCHAAI_API_KEY |
| CAPTCHA not detected | Different HTML structure | Update regex pattern in middleware |
| TimeoutError on solve | Slow solve or network | Increase timeout in solver |
| Spider gets blocked after solving | IP-based blocking | Add proxy rotation middleware |
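
For the TimeoutError case, the solver methods already accept a timeout argument, but the middleware calls them with the defaults (300 seconds for reCAPTCHA, 120 for images). One possible way to make this configurable is sketched below. The setting names CAPTCHAAI_RECAPTCHA_TIMEOUT and CAPTCHAAI_IMAGE_TIMEOUT are hypothetical, not Scrapy or CaptchaAI built-ins, and a plain dict stands in for Scrapy's settings object.

```python
# Defaults taken from the solver method signatures in captcha_solver.py.
DEFAULT_TIMEOUTS = {"recaptcha": 300, "image": 120}


def read_timeouts(settings):
    """Build per-type solve timeouts (in seconds) from a settings-like mapping.

    The two setting names are hypothetical and exist only in this sketch.
    """
    return {
        "recaptcha": int(
            settings.get("CAPTCHAAI_RECAPTCHA_TIMEOUT", DEFAULT_TIMEOUTS["recaptcha"])
        ),
        "image": int(
            settings.get("CAPTCHAAI_IMAGE_TIMEOUT", DEFAULT_TIMEOUTS["image"])
        ),
    }


# Values from environment variables arrive as strings, hence the int() calls.
timeouts = read_timeouts({"CAPTCHAAI_RECAPTCHA_TIMEOUT": "600"})
print(timeouts)  # {'recaptcha': 600, 'image': 120}
```

In the middleware, the result could be computed once in from_crawler from crawler.settings and then passed along, for example self.solver.solve_recaptcha(site_key, response.url, timeout=timeouts["recaptcha"]).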

FAQ

Can I use this with Scrapy-Splash or Scrapy-Playwright?

Yes. For JavaScript-rendered pages, the middleware works the same way: it inspects the final HTML response for CAPTCHA elements.

Does the middleware slow down crawling?

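Only pages with CAPTCHAs cause delays, typically 5-15 seconds per solve. Be aware that the solver shown here makes blocking calls (requests and time.sleep) inside the middleware, which stalls Scrapy's event loop: while a solve is in progress, no other request advances, regardless of CONCURRENT_REQUESTS. To keep crawling other pages during a solve, move the solver call off the reactor thread, for example with Twisted's deferToThread.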

How do I handle different CAPTCHA types per page?

Extend the middleware's process_response method to check for Turnstile, GeeTest, or other types and call the appropriate solver method.
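
As a rough sketch of that dispatch step, the function below maps page markup to a CAPTCHA type using the same marker strings as the retry middleware's indicator list. The name detect_captcha_type is an assumption for illustration. The solver module in this guide only defines solve_recaptcha and solve_image, so a Turnstile or GeeTest branch would need a solver method you add yourself.

```python
def detect_captcha_type(html):
    """Return "recaptcha", "turnstile", "image", or None for the given HTML.

    Hypothetical helper; markers are checked in order and the first hit wins.
    """
    markers = (
        ("recaptcha", "g-recaptcha"),
        ("turnstile", "cf-turnstile"),
        ("image", "captcha-image"),
    )
    for captcha_type, marker in markers:
        if marker in html:
            return captcha_type
    return None


# Made-up sample markup for illustration.
print(detect_captcha_type('<div class="g-recaptcha"></div>'))   # recaptcha
print(detect_captcha_type('<div class="cf-turnstile"></div>'))  # turnstile
print(detect_captcha_type('<img id="captcha-image" src="">'))   # image
print(detect_captcha_type("<p>plain page</p>"))                 # None
```

Inside process_response, the returned type can select the solver call, with unknown types logged and passed through unchanged.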

