Web Scraping with Intelligent Strategy Selection

This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends the optimal strategy (traffic interception/sitemap/API/DOM scraping/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.


Description

---
name: web-scraping
description: This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends optimal strategy (traffic interception/sitemap/API/DOM scraping/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.
license: MIT
---

# Web Scraping with Intelligent Strategy Selection

## When This Skill Activates

Activate automatically when user requests:

- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`)
- "Make this an Apify Actor" (loads `apify/` subdirectory)
- "Productionize this scraper"

## Input Parsing

Determine reconnaissance depth from user request:

| User Says | Mode | Phases Run |
|-----------|------|------------|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |

Default is Standard mode. Escalate to Full if protection signals appear during any phase.

## Adaptive Reconnaissance Workflow

This skill uses an adaptive phased workflow with quality gates. Each gate asks **"Do I have enough?"** and the workflow continues only when the answer is no.

**See**: `strategies/framework-signatures.md` for framework detection tables referenced throughout.

### Phase 0: QUICK ASSESSMENT (curl, no browser)

Gather maximum intelligence with minimum cost: a single HTTP request.
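Before Phase 0 begins, the mode selection from the Input Parsing table could be sketched roughly as below. The function names `selectMode` and `phasesFor` are hypothetical, and the matching rule (case-insensitive substring) is an assumption of this sketch; the skill itself may interpret requests more loosely. It also assumes Quick mode stays at Phase 0 even when a protection signal is noted, which the table does not state.

```javascript
// Trigger phrases copied from the Input Parsing table.
const MODE_KEYWORDS = {
  quick: ['quick recon', 'just check', 'what framework'],
  full: ['full recon', 'deep scan', 'production scraping'],
};

// Pick a mode from the user's request; Standard is the default.
function selectMode(request) {
  const text = request.toLowerCase();
  if (MODE_KEYWORDS.full.some((k) => text.includes(k))) return 'full';
  if (MODE_KEYWORDS.quick.some((k) => text.includes(k))) return 'quick';
  return 'standard';
}

// Phases to run for a mode. In Standard mode, Phase 4 is added only
// when protection signals were detected (escalation to Full).
function phasesFor(mode, protectionSignals = false) {
  if (mode === 'quick') return [0]; // assumption: Quick never escalates
  if (mode === 'full' || protectionSignals) return [0, 1, 2, 3, 4, 5];
  return [0, 1, 2, 3, 5];
}
```

For example, `selectMode('Scrape example.com')` falls through to Standard mode, and `phasesFor('standard', true)` adds Phase 4 once a protection signal has been recorded.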
**Step 0a: Fetch raw HTML and headers**

```bash
curl -s -D- -L "https://target.com/page" -o response.html
```

**Step 0b: Check response headers**

- Match headers against `strategies/framework-signatures.md` → Response Header Signatures table
- Note `Server`, `X-Powered-By`, `X-Shopify-Stage`, `Set-Cookie` (protection markers)
- Check HTTP status code (200 = accessible, 403 = protected, 3xx = redirects)

**Step 0c: Check Known Major Sites table**

- Match domain against `strategies/framework-signatures.md` → Known Major Sites
- If matched: use the specified data strategy, skip generic pattern scanning

**Step 0d: Detect framework from HTML**

- Search raw HTML for signatures in `strategies/framework-signatures.md` → HTML Signatures table
- Look for `__NEXT_DATA__`, `__NUXT__`, `ld+json`, `/wp-content/`, `data-reactroot`

**Step 0e: Search for target data points**

- For each data point the user wants: search raw HTML for that content
- Track which data points are found vs missing
- Check for sitemaps: `curl -s https://[site]/robots.txt | grep -i Sitemap`

**Step 0f: Note protection signals**

- 403/503 status, Cloudflare challenge HTML, CAPTCHA elements, `cf-ray` header
- Record for Phase 4 decision

**See**: `strategies/cheerio-vs-browser-test.md` for the Cheerio viability assessment

> **QUALITY GATE A**: All target data points found in raw HTML + no protection signals?
> → YES: Skip to Phase 3 (Validate Findings). No browser needed.
> → NO: Continue to Phase 1.

### Phase 1: BROWSER RECONNAISSANCE (only if Phase 0 needs it)

Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
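Deciding which data points are actually missing from the raw HTML (Steps 0d and 0e, then Quality Gate A) might look like the sketch below. The marker strings come from Step 0d; the labels attached to them, the function names, and the idea of searching for a known sample value per data point are all assumptions made for illustration, since the real lookup tables live in `strategies/framework-signatures.md`.

```javascript
// Marker strings from Step 0d. Labels are illustrative, not the skill's table.
const HTML_SIGNATURES = [
  { marker: '__NEXT_DATA__', label: 'Next.js' },
  { marker: '__NUXT__', label: 'Nuxt' },
  { marker: '/wp-content/', label: 'WordPress' },
  { marker: 'data-reactroot', label: 'React' },
  { marker: 'ld+json', label: 'JSON-LD' },
];

// Step 0d: list every signature whose marker appears in the raw HTML.
function detectSignatures(html) {
  return HTML_SIGNATURES.filter((s) => html.includes(s.marker)).map((s) => s.label);
}

// Step 0e: `targets` maps a data point name to a sample value that is
// expected somewhere on the page (for example a known product title).
function splitDataPoints(html, targets) {
  const found = [];
  const missing = [];
  for (const [name, sample] of Object.entries(targets)) {
    (html.includes(sample) ? found : missing).push(name);
  }
  return { found, missing };
}

// Quality Gate A: the browser can be skipped only when nothing is missing
// and no protection signals were recorded in Step 0f.
function gateA(html, targets, protectionSignals) {
  const { missing } = splitDataPoints(html, targets);
  return { skipBrowser: missing.length === 0 && protectionSignals.length === 0, missing };
}
```

The `missing` list is what Phase 1 would then look for in the rendered DOM and captured traffic. Finding a sample value in the HTML is only a first signal; Phase 3 still has to confirm a working selector or path.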
**Step 1a: Initialize browser session**

- `proxy_start()` → Start traffic interception proxy
- `interceptor_chrome_launch(url, stealthMode: true)` → Launch Chrome with anti-detection
- `interceptor_chrome_devtools_attach(target_id)` → Attach DevTools bridge
- `interceptor_chrome_devtools_screenshot()` → Capture visual state

**Step 1b: Capture traffic and rendered DOM**

- `proxy_list_traffic()` → Review all traffic from page load
- `proxy_search_traffic(query: "application/json")` → Find JSON responses
- `interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])` → XHR/fetch calls
- `interceptor_chrome_devtools_snapshot()` → Accessibility tree (rendered DOM)

**Step 1c: Search rendered DOM for missing data points**

- For each data point NOT found in Phase 0: search rendered DOM
- Use framework-specific search strategy from `strategies/framework-signatures.md` → Framework → Search Strategy table
- Only search patterns relevant to the detected framework

**Step 1d: Inspect discovered endpoints**

- `proxy_get_exchange(exchange_id)` → Full request/response for promising endpoints
- Document: method, headers, auth, response structure, pagination

> **QUALITY GATE B**: All target data points now covered (raw HTML + rendered DOM + traffic)?
> → YES: Skip to Phase 3 (Validate Findings). No deep scan needed.
> → NO: Continue to Phase 2 for missing data points only.

### Phase 2: DEEP SCAN (only for missing data points)

Targeted investigation for data points not yet found. Only search for what's missing.
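Keeping the deep scan limited to what is still missing implies some bookkeeping across the three sources named in Quality Gate B. A minimal sketch of that bookkeeping follows; the `coverageReport` name and the shape of the `findings` object are hypothetical, not part of the skill.

```javascript
// Sources in the order the workflow consults them (Phase 0, then Phase 1).
const SOURCES = ['rawHtml', 'renderedDom', 'traffic'];

// targets:  array of data point names the user asked for
// findings: { rawHtml: [...names], renderedDom: [...names], traffic: [...names] }
function coverageReport(targets, findings) {
  const coveredBy = {};
  for (const name of targets) {
    // Record the cheapest source that covers this data point, or null.
    coveredBy[name] = SOURCES.find((s) => (findings[s] ?? []).includes(name)) ?? null;
  }
  const missing = targets.filter((name) => coveredBy[name] === null);
  // Quality Gate B: Phase 2 runs only when something is still uncovered.
  return { coveredBy, missing, needsDeepScan: missing.length > 0 };
}
```

Only the names in `missing` would be carried into Steps 2a-2c; everything else goes straight to validation.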
**Step 2a: Test interactions for missing data**

- `proxy_clear_traffic()` before each action → Isolate API calls
- `humanizer_click(target_id, selector)` → Trigger dynamic content loads
- `humanizer_scroll(target_id, direction, amount)` → Trigger lazy loading / infinite scroll
- `humanizer_idle(target_id, duration_ms)` → Wait for delayed content
- After each action: `proxy_list_traffic()` → Check for new API calls

**Step 2b: Sniff APIs (framework-aware)**

- Search only patterns relevant to detected framework:
  - Next.js → `proxy_list_traffic(url_filter: "/_next/data/")`
  - WordPress → `proxy_list_traffic(url_filter: "/wp-json/")`
  - GraphQL → `proxy_search_traffic(query: "graphql")`
  - Generic → `proxy_list_traffic(url_filter: "/api/")` + `proxy_search_traffic(query: "application/json")`
- Skip patterns that don't apply to the detected framework

**Step 2c: Test pagination and filtering**

- Only if pagination data is a missing data point or needed for coverage assessment
- `proxy_clear_traffic()` → click next page → `proxy_list_traffic(url_filter: "page=")`
- Document pagination type (URL-based, API offset, cursor, infinite scroll)

> **QUALITY GATE C**: Enough data points covered for a useful report?
> → YES: Go to Phase 3.
> → NO: Document gaps, go to Phase 3 anyway (report will note missing data in self-critique).

### Phase 3: VALIDATE FINDINGS

Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
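For a JSON-based extraction (a `__NEXT_DATA__` blob or an API response), "specified and tested" could be made concrete along the lines of this sketch. The helper names are hypothetical, the dot-separated path syntax is an assumption, and the mapping of failures onto YES / PARTIAL / NO is one possible reading of the report's `Validated?` status rather than the skill's definition.

```javascript
// Follow a dot-separated path such as "props.pageProps.product.price".
// Numeric segments work as array indices ("items.0.name").
function resolvePath(root, path) {
  return path.split('.').reduce(
    (node, key) => (node !== null && node !== undefined ? node[key] : undefined),
    root
  );
}

// Check that the path resolves and that the value has the expected type.
function validateJsonPath(json, path, expectedType) {
  const value = resolvePath(json, path);
  if (value === undefined) {
    return { status: 'NO', reason: `path "${path}" did not resolve` };
  }
  if (typeof value !== expectedType) {
    return { status: 'PARTIAL', reason: `expected ${expectedType}, got ${typeof value}` };
  }
  return { status: 'YES', value };
}
```

A type check alone does not prove the value is the right one, so a real validation pass would also compare the returned value against what the page displays.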
**See**: `strategies/cheerio-vs-browser-test.md` for validation methodology

**Step 3a: Validate CSS selectors**

- For each Cheerio/selector-based method: confirm the selector matches actual HTML
- Test against raw HTML (curl output) or rendered DOM (snapshot)
- Confirm selector extracts the correct value, not a different element

**Step 3b: Validate JSON paths**

- For each JSON extraction (e.g., `__NEXT_DATA__`, API response): confirm the path resolves
- Parse the JSON, follow the path, verify it returns the expected data type and value

**Step 3c: Validate API endpoints**

- For each discovered API: replay the request (curl or `proxy_get_exchange`)
- Confirm: response status 200, expected data structure, correct values
- Test pagination if claimed (at least page 1 and page 2)

**Step 3d: Downgrade or re-investigate failures**

- If a selector doesn't match: try alternative selectors, or downgrade to PARTIAL confidence
- If an API returns 403: note protection requirement, flag for Phase 4
- If a JSON path is wrong: re-examine the JSON structure, correct the path

### Phase 4: PROTECTION TESTING (conditional)

**See**: `strategies/proxy-escalation.md` for complete skip/run decision logic

**Skip Phase 4 when ALL true**:

- No protection signals detected in Phases 0-2
- All data points have validated extraction methods
- User didn't request "full recon"

**Run Phase 4 when ANY true**:

- 403/challenge page observed during any phase
- Known high-protection domain
- High-volume or production intent
- User explicitly requested it

**If running**:

**Step 4a: Test raw HTTP access**

```bash
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
```

- 200 → Cheerio viable, no browser needed for accessible endpoints
- 403/503 → Escalate to stealth browser

**Step 4b: Test with stealth browser** (if needed)

- Already running from Phase 1: check if pages loaded without challenges
- `interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")` → Protection cookies
- `interceptor_chrome_devtools_list_storage_keys(storage_type: "local")` → Fingerprint markers
- `proxy_get_tls_fingerprints()` → TLS fingerprint analysis

**Step 4c: Test with upstream proxy** (if needed)

- `proxy_set_upstream("http://user:pass@proxy-provider:port")`
- Re-test blocked endpoints through proxy
- Document minimum access level for each data point

**Step 4d: Document protection profile**

- What protections exist, what worked to bypass them, what production scrapers will need

### Phase 5: REPORT + SELF-CRITIQUE

Generate the intelligence report, then critically review it for gaps.

**See**: `reference/report-schema.md` for complete report format

**Step 5a: Generate report**

- Follow `reference/report-schema.md` schema (Sections 1-6)
- Include `Validated?` status for every strategy (YES / PARTIAL / NO)
- Include all discovered endpoints with full specs

**Step 5b: Self-critique**

- Write Section 7 (Self-Critique) per `reference/report-schema.md`:
  - **Gaps**: Data points not found, with why and what would find them
  - **Skipped steps**: Which phases skipped, with quality gate reasoning
  - **Unvalidated claims**: Anything marked PARTIAL or NO
  - **Assumptions**: Things not verified (e.g., "consistent layout across categories")
  - **Staleness risk**: Geo-dependent prices, A/B layouts, session-specific content
  - **Recommendations**: Targeted next steps (not "re-run everything")

**Step 5c: Fix gaps with targeted re-investigation**

- If self-critique reveals fixable gaps: go back to the specific phase/step, not a full re-run
- Example: "Price selector untested" → run one curl + parse, don't re-launch browser
- Update report with results

**Step 5d: Record session** (if browser was used)

- `proxy_session_start(name)` → `proxy_session_stop(session_id)` → `proxy_export_har(session_id, path)`
- HAR file captures all traffic for replay. See `strategies/session-workflows.md`

---

### IMPLEMENTATION (after reconnaissance)

After reconnaissance report is accepted, implement scraper iteratively.

**Core Pattern**:

1. Implement recommended approach (minimal code)
2. Test with small batch (5-10 items)
3. Validate data quality
4. Scale to full dataset or fallback
5. Handle blocking if encountered
6. Add robustness (error handling, retries, logging)

**See**: `workflows/implementation.md` for complete implementation patterns and code examples

### PRODUCTIONIZATION (on request)

Convert scraper to production-ready Apify Actor.

**Activation triggers**: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"

**Core Pattern**:

1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
2. Initialize with `apify create` command (CRITICAL)
3. Port scraping logic to Actor format
4. Test locally and deploy

**Note**: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.
**See**: `workflows/productionization.md` for complete workflow and `apify/` for Actor development guides

## Quick Reference

| Task | Pattern/Command | Documentation |
|------|----------------|---------------|
| **Reconnaissance** | **Adaptive Phases 0-5** | **`workflows/reconnaissance.md`** |
| Framework detection | Header + HTML signature matching | `strategies/framework-signatures.md` |
| Cheerio vs Browser | Three-way test + early exit | `strategies/cheerio-vs-browser-test.md` |
| Traffic analysis | `proxy_list_traffic()` + `proxy_get_exchange()` | `strategies/traffic-interception.md` |
| Protection testing | Conditional escalation | `strategies/proxy-escalation.md` |
| Report format | Sections 1-7 with self-critique | `reference/report-schema.md` |
| Find sitemaps | `RobotsFile.find(url)` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | `RequestList + regex` | `reference/regex-patterns.md` |
| Discover APIs | Traffic capture (automatic) | `strategies/api-discovery.md` |
| DOM scraping | DevTools bridge + humanizer | `strategies/dom-scraping.md` |
| HTTP scraping | `CheerioCrawler` | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | Stealth mode + upstream proxies | `strategies/anti-blocking.md` |
| Session recording | `proxy_session_start()` / `proxy_export_har()` | `strategies/session-workflows.md` |
| Proxy-MCP tools | Complete reference | `reference/proxy-tool-reference.md` |
| Fingerprint configs | Stealth + TLS presets | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `workflows/productionization.md` |
| Input schema | `.actor/input_schema.json` | `apify/input-schemas.md` |
| Deploy actor | `apify push` | `apify/deployment.md` |

## Common Patterns

### Pattern 1: Sitemap-Based Scraping

```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        const data = {
            url: request.url,
            title: $('h1').text().trim(),
            // ... extract data
        };
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();
```

See `examples/sitemap-basic.js` for complete example.

### Pattern 2: API-Based Scraping

```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });
    console.log(response.body);
}
```

See `examples/api-scraper.js` for complete example.

### Pattern 3: Hybrid (Sitemap + API)

```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    // The parsed JSON payload is on the response body
    const { body: data } = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process data
}
```

See `examples/hybrid-sitemap-api.js` for complete example.

## Directory Navigation

This skill uses **progressive disclosure**: detailed information is organized in subdirectories and loaded only when needed.
### Workflows (Implementation Patterns)

**For**: Step-by-step workflow guides for each stage

- `workflows/reconnaissance.md` - **Adaptive reconnaissance, Phases 0-5 (CRITICAL)**
- `workflows/implementation.md` - Iterative implementation patterns (after reconnaissance)
- `workflows/productionization.md` - Apify Actor creation workflow (on request)

### Strategies (Deep Dives)

**For**: Detailed guides on specific scraping approaches

- `strategies/framework-signatures.md` - **Framework detection lookup tables (Phase 0/1)**
- `strategies/cheerio-vs-browser-test.md` - **Cheerio vs Browser decision test with early exit**
- `strategies/proxy-escalation.md` - **Protection testing skip/run conditions (Phase 4)**
- `strategies/traffic-interception.md` - Traffic interception via MITM proxy
- `strategies/sitemap-discovery.md` - Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md` - Finding and using APIs
- `strategies/dom-scraping.md` - DOM scraping via DevTools bridge
- `strategies/cheerio-scraping.md` - HTTP-only scraping
- `strategies/hybrid-approaches.md` - Combining strategies
- `strategies/anti-blocking.md` - Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- `strategies/session-workflows.md` - Session recording, HAR export, replay

### Examples (Runnable Code)

**For**: Working code to reference or execute

**JavaScript Learning Examples** (Simple standalone scripts):

- `examples/sitemap-basic.js` - Simple sitemap scraper
- `examples/api-scraper.js` - Pure API approach
- `examples/traffic-interception-basic.js` - Proxy-based reconnaissance
- `examples/hybrid-sitemap-api.js` - Combined approach
- `examples/iterative-fallback.js` - Try traffic interception→sitemap→API→DOM scraping

**TypeScript Production Examples** (Complete Actors):

- `apify/examples/basic-scraper/` - Sitemap + Playwright
- `apify/examples/anti-blocking/` - Fingerprinting + proxies
- `apify/examples/hybrid-api/` - Sitemap + API (optimal)

### Reference (Quick Lookup)

**For**: Quick patterns and troubleshooting

- `reference/report-schema.md` - **Intelligence report format (Sections 1-7, including self-critique)**
- `reference/proxy-tool-reference.md` - Proxy-MCP tool reference (all 80+ tools)
- `reference/regex-patterns.md` - Common URL regex patterns
- `reference/fingerprint-patterns.md` - Stealth mode + TLS fingerprint presets
- `reference/anti-patterns.md` - What NOT to do

### Apify (Production Deployment)

**For**: Creating production Apify Actors

- `apify/README.md` - When and how to use Apify
- `apify/typescript-first.md` - **Why TypeScript for actors**
- `apify/cli-workflow.md` - **apify create workflow (CRITICAL)**
- `apify/initialization.md` - Complete setup guide
- `apify/input-schemas.md` - Input validation patterns
- `apify/configuration.md` - actor.json setup
- `apify/deployment.md` - Testing and deployment
- `apify/templates/` - TypeScript boilerplate

**Note**: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

## Core Principles

### 1. Assess Before Committing Resources

Start cheap (curl), escalate only when needed:

- Phase 0 (curl) before Phase 1 (browser) before Phase 2 (deep scan)
- Quality gates skip phases when data is sufficient
- Never launch a browser if curl gives you everything

### 2. Detect First, Then Search Relevant Patterns

Use framework detection to focus searches:

- Match against `strategies/framework-signatures.md` before scanning
- Skip patterns that don't apply (no `__NEXT_DATA__` on Amazon)
- Known major sites get direct strategy lookup

### 3. Validate, Don't Assume

Every claimed extraction method must be tested:

- "Found text in HTML" is not enough; a working selector/path is required
- Phase 3 validates every finding before the report
- Unvalidated claims are marked PARTIAL or NO in the report

### 4. Iterative Implementation

Build incrementally:

- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last

### 5. Production-Ready Code

When productionizing:

- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring

---

**Remember**: Traffic interception first, sitemaps second, APIs third, DOM scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.
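The priority order above, combined with the "scale or fallback" step, can be pictured as a simple fallback chain. The sketch below is an assumption about how such a chain might be written and is not the content of `examples/iterative-fallback.js`; `runWithFallback`, the `{ name, run }` strategy shape, and the `minItems` threshold are hypothetical names introduced here.

```javascript
// Try each strategy in priority order and stop at the first one that
// yields at least `minItems` items. Each strategy is { name, run }, where
// `run` is an async function returning an array of scraped items.
async function runWithFallback(strategies, { minItems = 1 } = {}) {
  const attempts = [];
  for (const { name, run } of strategies) {
    try {
      const items = await run();
      const ok = items.length >= minItems;
      attempts.push({ name, ok, count: items.length });
      if (ok) return { strategy: name, items, attempts };
    } catch (err) {
      // A thrown error (for example a 403) counts as a failed attempt.
      attempts.push({ name, ok: false, error: err.message });
    }
  }
  // Nothing worked: report every attempt so the gaps can be documented.
  return { strategy: null, items: [], attempts };
}
```

A caller would pass the strategies in the order traffic interception, sitemap, API, DOM scraping, each wrapped around a small test batch, and keep the `attempts` log for the report.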
