Use Cases

CAPTCHA Handling for Sports Statistics Data Collection

Sports statistics portals protect player databases, game logs, and historical records with Cloudflare Turnstile and reCAPTCHA. CAPTCHAs trigger during rapid queries across multiple players, seasons, or games — especially on sites that monetize their data behind paywalls. Here's how to collect sports data reliably.

CAPTCHA Patterns on Sports Portals

Data type Portal type CAPTCHA Trigger
Player statistics Reference sites Cloudflare Turnstile Rapid player page loads
Game box scores Score portals Cloudflare Challenge Bulk game lookups
Season standings League sites reCAPTCHA v2 Automated navigation
Fantasy projections Fantasy platforms reCAPTCHA v3 Frequent API-like access
Betting odds/lines Odds portals Cloudflare Turnstile High-frequency refreshes
Historical records Archive sites Image CAPTCHA Data export requests

Sports Data Collector

import requests
import time
import re
from dataclasses import dataclass, field

@dataclass
class PlayerStats:
    name: str
    team: str
    position: str
    stats: dict = field(default_factory=dict)
    season: str = ""
    source: str = ""

class SportsDataCollector:
    def __init__(self, api_key):
        self.api_key = api_key
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

    def get_player_stats(self, portal_url, player_slug, season=None):
        """Fetch player statistics, solving CAPTCHAs as needed."""
        url = f"{portal_url}/players/{player_slug}"
        if season:
            url += f"/{season}"

        response = self.session.get(url)

        if self._is_captcha_page(response):
            response = self._solve_turnstile_and_retry(response, url)

        return self._parse_player_stats(response.text)

    def get_game_scores(self, portal_url, date):
        """Fetch all game scores for a specific date."""
        url = f"{portal_url}/scores/{date}"
        response = self.session.get(url)

        if self._is_captcha_page(response):
            response = self._solve_turnstile_and_retry(response, url)

        return self._parse_scores(response.text)

    def collect_team_roster(self, portal_url, team_slug, season):
        """Collect stats for all players on a team roster."""
        roster_url = f"{portal_url}/teams/{team_slug}/{season}/roster"
        response = self.session.get(roster_url)

        if self._is_captcha_page(response):
            response = self._solve_turnstile_and_retry(response, roster_url)

        player_slugs = self._extract_player_links(response.text)

        all_stats = []
        for slug in player_slugs:
            try:
                stats = self.get_player_stats(portal_url, slug, season)
                all_stats.append(stats)
                time.sleep(2)  # Respectful delay
            except Exception as e:
                print(f"Failed for {slug}: {e}")

        return all_stats

    def _is_captcha_page(self, response):
        return (
            response.status_code == 403 or
            "cf-turnstile" in response.text or
            "challenges.cloudflare.com" in response.text
        )

    def _solve_turnstile_and_retry(self, response, url):
        match = re.search(r'data-sitekey="(0x[^"]+)"', response.text)
        if not match:
            raise ValueError("Turnstile sitekey not found")

        resp = requests.post("https://ocr.captchaai.com/in.php", data={
            "key": self.api_key,
            "method": "turnstile",
            "sitekey": match.group(1),
            "pageurl": url,
            "json": 1
        })
        task_id = resp.json()["request"]

        for _ in range(60):
            time.sleep(3)
            result = requests.get("https://ocr.captchaai.com/res.php", params={
                "key": self.api_key,
                "action": "get",
                "id": task_id,
                "json": 1
            })
            data = result.json()
            if data["status"] == 1:
                return self.session.post(url, data={
                    "cf-turnstile-response": data["request"]
                })

        raise TimeoutError("Turnstile solve timed out")

    def _parse_player_stats(self, html):
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, "html.parser")

        # Extract stat rows from tables
        stats = {}
        stat_table = soup.select_one("table.stats, #stats-table")
        if stat_table:
            headers = [th.text.strip() for th in stat_table.select("thead th")]
            for row in stat_table.select("tbody tr"):
                cells = [td.text.strip() for td in row.select("td")]
                if len(cells) == len(headers):
                    for header, value in zip(headers, cells):
                        stats[header] = value

        return PlayerStats(
            name=soup.select_one("h1, .player-name")?.text?.strip() or "",
            team=soup.select_one(".team-name, .team")?.text?.strip() or "",
            position=soup.select_one(".position, .pos")?.text?.strip() or "",
            stats=stats
        )

    def _parse_scores(self, html):
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, "html.parser")
        games = []

        for game in soup.select(".game-card, .scoreboard-item"):
            games.append({
                "away": game.select_one(".away-team")?.text?.strip(),
                "home": game.select_one(".home-team")?.text?.strip(),
                "away_score": game.select_one(".away-score")?.text?.strip(),
                "home_score": game.select_one(".home-score")?.text?.strip(),
                "status": game.select_one(".game-status")?.text?.strip()
            })

        return games

    def _extract_player_links(self, html):
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, "html.parser")
        links = []
        for a in soup.select("a[href*='/players/']"):
            slug = a["href"].rstrip("/").split("/")[-1]
            if slug and slug not in links:
                links.append(slug)
        return links


# Usage
collector = SportsDataCollector("YOUR_API_KEY")

# Get player stats
stats = collector.get_player_stats(
    "https://sports.example.com", "lebron-james", "2024"
)
print(f"{stats.name} ({stats.team}): {stats.stats}")

# Get all scores for a date
scores = collector.get_game_scores("https://sports.example.com", "2024-12-25")
for game in scores:
    print(f"{game['away']} {game['away_score']} @ {game['home']} {game['home_score']}")

Season Data Aggregator (JavaScript)

class SportsAggregator {
  constructor(apiKey) {
    this.apiKey = apiKey;
  }

  async collectSeasonData(portalUrl, sport, season, teams) {
    const allData = {};

    for (const team of teams) {
      try {
        const roster = await this.getTeamStats(portalUrl, team, season);
        allData[team] = roster;
      } catch (error) {
        allData[team] = { error: error.message };
      }
      // Rate limit between teams
      await new Promise(r => setTimeout(r, 3000));
    }

    return allData;
  }

  async getTeamStats(portalUrl, teamSlug, season) {
    const url = `${portalUrl}/teams/${teamSlug}/${season}`;
    const response = await fetch(url);
    const html = await response.text();

    if (html.includes('cf-turnstile') || response.status === 403) {
      return this.solveAndFetch(url, html);
    }

    return this.parseTeamPage(html);
  }

  async solveAndFetch(url, html) {
    const match = html.match(/data-sitekey="(0x[^"]+)"/);
    if (!match) throw new Error('Turnstile sitekey not found');

    const submitResp = await fetch('https://ocr.captchaai.com/in.php', {
      method: 'POST',
      body: new URLSearchParams({
        key: this.apiKey,
        method: 'turnstile',
        sitekey: match[1],
        pageurl: url,
        json: '1'
      })
    });
    const { request: taskId } = await submitResp.json();

    for (let i = 0; i < 60; i++) {
      await new Promise(r => setTimeout(r, 3000));
      const result = await fetch(
        `https://ocr.captchaai.com/res.php?key=${this.apiKey}&action=get&id=${taskId}&json=1`
      );
      const data = await result.json();
      if (data.status === 1) {
        const response = await fetch(url, {
          method: 'POST',
          body: new URLSearchParams({ 'cf-turnstile-response': data.request })
        });
        return this.parseTeamPage(await response.text());
      }
    }
    throw new Error('Turnstile solve timed out');
  }

  parseTeamPage(html) {
    const players = [];
    const rowMatches = html.matchAll(/<tr[^>]*class="[^"]*player[^"]*"[^>]*>([\s\S]*?)<\/tr>/gi);

    for (const row of rowMatches) {
      const cells = [...row[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/gi)]
        .map(m => m[1].replace(/<[^>]+>/g, '').trim());
      if (cells.length >= 3) {
        players.push({
          name: cells[0],
          position: cells[1],
          stats: cells.slice(2)
        });
      }
    }

    return { players, count: players.length };
  }
}

// Usage
const aggregator = new SportsAggregator('YOUR_API_KEY');
const seasonData = await aggregator.collectSeasonData(
  'https://sports.example.com', 'basketball', '2024',
  ['lakers', 'celtics', 'warriors']
);

Collection Strategy by Sport

Sport Peak data volume CAPTCHA sensitivity Recommended approach
Baseball Daily game logs Moderate Collect after games end
Basketball Game nights High during games Use off-peak hours
Football Weekly games Low between games Weekly bulk collection
Soccer Daily across leagues Moderate Per-league sessions
Hockey Nightly Moderate Post-game collection

Troubleshooting

Issue Cause Fix
Turnstile on every page No cookie persistence Maintain cf_clearance in session
Player page shows different stats Season/career toggle Include season parameter in URL
Scores page empty Games not yet played Check game schedule before querying
Rate limited after 50 requests Portal's daily cap Distribute across time, use proxies

FAQ

How often should I collect live game data?

During active games, scores update every few minutes. Use 2–5 minute intervals with session persistence to minimize CAPTCHAs. CaptchaAI handles Turnstile with 100% success rate.

Can I collect historical stats going back multiple seasons?

Yes. Historical data rarely changes, so collect it once and cache locally. CAPTCHAs may trigger during initial bulk collection but won't affect stored data.

Do sports APIs eliminate the need for CAPTCHA solving?

Some leagues offer official APIs, but they're often expensive, rate-limited, or missing advanced stats available on reference sites. CaptchaAI lets you access additional data sources protected by CAPTCHAs.

Next Steps

Collect sports data reliably — get your CaptchaAI API key and handle sports portal CAPTCHAs automatically.

Full Working Code

Complete runnable examples for this article in Python, Node.js, PHP, Go, Java, C#, Ruby, Rust, Kotlin & Bash.

View on GitHub →

Discussions (0)

No comments yet.