Octoparse + CaptchaAI: Visual Scraping with CAPTCHA Handling

Octoparse is a visual web scraping tool that lets non-coders extract data. When a CAPTCHA blocks an extraction run, CaptchaAI can solve it so the run can continue.


When Octoparse Encounters CAPTCHAs

| Scenario | What Happens |
|---|---|
| reCAPTCHA on target page | Extraction stops, manual solve needed |
| Cloudflare challenge | Page loads but no data extracted |
| Rate-limiting CAPTCHA | After N pages, CAPTCHA appears |
| Login-protected data | Login form has CAPTCHA |

Since Octoparse is a visual tool, the integration uses a Python helper to solve CAPTCHAs and export session cookies for Octoparse:

import requests
import time
import json


class OctoparseCaptchaHelper:
    """Solve CAPTCHAs and export cookies for Octoparse."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.session = requests.Session()

    def solve_and_get_cookies(self, login_url, sitekey, credentials):
        """
        Solve login CAPTCHA and return session cookies.

        Steps:

        1. Visit login page to get initial cookies
        2. Solve CAPTCHA via CaptchaAI
        3. Submit login form with token
        4. Export authenticated cookies
        """
        # Step 1: Get initial cookies
        self.session.get(login_url, timeout=15)

        # Step 2: Solve CAPTCHA
        token = self._solve_recaptcha(sitekey, login_url)

        # Step 3: Submit login
        login_data = {
            **credentials,
            "g-recaptcha-response": token,
        }
        resp = self.session.post(login_url, data=login_data, timeout=30)

        if resp.status_code != 200:
            raise RuntimeError(f"Login failed: {resp.status_code}")

        # Step 4: Export cookies
        cookies = []
        for cookie in self.session.cookies:
            cookies.append({
                "name": cookie.name,
                "value": cookie.value,
                "domain": cookie.domain,
                "path": cookie.path,
            })

        return cookies

    def export_cookies_for_octoparse(self, cookies, output_file="cookies.json"):
        """Save cookies in format importable by Octoparse."""
        with open(output_file, "w") as f:
            json.dump(cookies, f, indent=2)
        print(f"Cookies saved to {output_file}")
        print("Import these in Octoparse: Task → Advanced Settings → Cookies")

    def _solve_recaptcha(self, sitekey, pageurl):
        """Solve reCAPTCHA via CaptchaAI."""
        resp = requests.post("https://ocr.captchaai.com/in.php", data={
            "key": self.api_key,
            "method": "userrecaptcha",
            "googlekey": sitekey,
            "pageurl": pageurl,
            "json": 1,
        }, timeout=30)
        result = resp.json()

        if result.get("status") != 1:
            raise RuntimeError(f"Submit error: {result.get('request')}")

        task_id = result["request"]
        time.sleep(15)

        for _ in range(24):
            resp = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": self.api_key, "action": "get",
                "id": task_id, "json": 1,
            }, timeout=15)
            data = resp.json()

            if data.get("status") == 1:
                return data["request"]
            if data["request"] != "CAPCHA_NOT_READY":
                raise RuntimeError(data["request"])
            time.sleep(5)

        raise TimeoutError("Solve timeout")


# Usage
helper = OctoparseCaptchaHelper("YOUR_API_KEY")

cookies = helper.solve_and_get_cookies(
    login_url="https://example.com/login",
    sitekey="6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-",
    credentials={"username": "user", "password": "pass"},
)

helper.export_cookies_for_octoparse(cookies)
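
Before importing the file, it can help to sanity-check the export with another HTTP client. The sketch below builds a `Cookie` request header from the saved JSON. It assumes the file layout written by `export_cookies_for_octoparse` (a list of objects with `name` and `value` keys); `cookie_header_from_file` is a hypothetical helper name, not part of Octoparse or CaptchaAI.

```python
import json


def cookie_header_from_file(path="cookies.json"):
    """Build a Cookie header value from the exported JSON file.

    Assumes the file holds a list of {"name": ..., "value": ...} objects,
    as written by export_cookies_for_octoparse above.
    """
    with open(path) as f:
        cookies = json.load(f)
    # Join the pairs in the "name=value; name2=value2" form used in HTTP requests
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

The resulting string can be sent as a `Cookie` header from curl or `requests` to confirm that the session is still authenticated before starting an Octoparse run.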

Approach: API-Based Extraction with CAPTCHA Solving

For more control, use CaptchaAI directly in a Python script alongside Octoparse:

def extract_with_captcha(api_key, urls, sitekey):
    """Extract data from CAPTCHA-protected pages."""
    results = []
    # Create the helper once and reuse it for every URL
    helper = OctoparseCaptchaHelper(api_key)

    for url in urls:
        print(f"Processing: {url}")

        # Solve CAPTCHA for this page
        token = helper._solve_recaptcha(sitekey, url)

        # Access page with token
        resp = requests.post(url, data={
            "g-recaptcha-response": token,
        }, timeout=30)

        # Parse response
        if resp.status_code == 200:
            results.append({
                "url": url,
                "content_length": len(resp.text),
                "status": "success",
            })
        else:
            results.append({
                "url": url,
                "status": f"failed ({resp.status_code})",
            })

        time.sleep(3)  # Rate limit

    return results
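
The returned list mixes successes and failures, so a follow-up step usually needs to separate them. This is a minimal sketch, assuming the dictionary shape produced by `extract_with_captcha` (a `url` key and a `status` key that equals `"success"` on success); `failed_urls` is a hypothetical helper name.

```python
def failed_urls(results):
    """Return the URLs whose status is anything other than "success"."""
    return [r["url"] for r in results if r.get("status") != "success"]
```

The failed URLs can then be passed back into `extract_with_captcha` for another attempt or logged for manual review.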

Octoparse Configuration Tips

| Setting | Recommendation |
|---|---|
| Page load wait | Set to 10+ seconds for CAPTCHA pages |
| Retry on error | Enable with 3 retries |
| Cookie import | Use exported cookies from helper |
| Cloud extraction | Use Octoparse cloud with pre-solved cookies |
| Local extraction | Use local mode for initial CAPTCHA bypass |
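
The "Retry on error" setting applies inside Octoparse, but the same idea can be mirrored on the Python side so that a failed solve or login does not leave you without cookies. The wrapper below is a sketch under assumptions: `with_retries` is a hypothetical name, and the three attempts and five-second delay are illustrative defaults rather than values recommended by either tool.

```python
import time


def with_retries(func, attempts=3, delay=5.0):
    """Call func() up to `attempts` times, sleeping `delay` seconds between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed, so surface the most recent error
    raise last_error
```

For example, the cookie export could be wrapped as `with_retries(lambda: helper.solve_and_get_cookies(login_url, sitekey, credentials))`.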

FAQ

Can Octoparse solve CAPTCHAs automatically?

Octoparse has limited built-in CAPTCHA handling. For reliable solving, use CaptchaAI to pre-solve and export session cookies, or switch to a code-based approach for CAPTCHA-heavy sites.

When should I use Octoparse vs. a coded solution?

Use Octoparse for simple, low-CAPTCHA sites. For sites with frequent CAPTCHAs, a Python script with CaptchaAI gives you more control and reliability.

Can I refresh the cookies automatically?

Yes. Run the Python helper on a schedule (e.g., via cron or Task Scheduler) to refresh cookies before each Octoparse extraction run.
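
A scheduled job does not need to re-solve a CAPTCHA on every run if the previous export is still recent. This sketch checks the age of the cookie file first. The one-hour default is an assumption, since real session lifetimes depend on the target site, and `cookies_are_stale` is a hypothetical helper name.

```python
import os
import time


def cookies_are_stale(path="cookies.json", max_age_seconds=3600):
    """Return True if the cookie file is missing or older than max_age_seconds."""
    if not os.path.exists(path):
        return True
    # Age is measured from the file's last modification time
    age = time.time() - os.path.getmtime(path)
    return age > max_age_seconds
```

The scheduled script can call this check and run the helper only when it returns `True`, which avoids spending solves on cookies that are still usable.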



Handle CAPTCHAs in visual scraping — try CaptchaAI.
