Web Scraping

The automated process of extracting data from websites using software tools or scripts, commonly used for data collection, price monitoring, and content aggregation.

What is Web Scraping?

Web scraping is the automated extraction of data from websites by programs that fetch web pages, parse their content, and pull out specific information. Unlike manual copy-pasting, web scraping enables large-scale data collection, processing thousands or millions of pages efficiently. Organizations use web scraping for competitive intelligence, price monitoring, market research, lead generation, content aggregation, sentiment analysis, and other data-driven applications.

Web scraping exists on a spectrum of complexity. Simple scraping downloads HTML pages and extracts data using pattern matching or DOM queries. More sophisticated approaches handle JavaScript-rendered content, bypass bot detection systems, solve CAPTCHAs, manage sessions, rotate proxies, and mimic human browsing patterns. The appropriate technique depends on target website characteristics, anti-scraping measures deployed, required scale, and legal constraints.

Understanding HTML Parsing

HTML parsing forms the foundation of web scraping. Websites structure content using HTML elements with classes, IDs, and hierarchical relationships. Scrapers navigate this structure to locate the elements that hold the desired data: product prices in span tags with class “price”, article titles in h1 elements, image URLs in the src attributes of img elements. Modern parsing libraries provide CSS selectors and XPath expressions for precise element targeting without complex string manipulation.

The Document Object Model (DOM) represents HTML as a tree structure that parsers can traverse and query. CSS selectors like “.product .price” find elements matching patterns, while XPath expressions like //div[@class='product']//span[@class='price'] offer more expressive selection, such as matching on text content or moving up to parent elements. Scrapers iterate over the selected elements and extract text content, attribute values, or structural relationships to build structured datasets from unstructured web pages.

// Basic HTML parsing and data extraction
interface Product {
  name: string;
  price: string;
  url: string;
  image?: string;
}

async function scrapeProductPage(url: string): Promise<Product[]> {
  const proxyUrl = `https://corsproxy.io/?url=${encodeURIComponent(url)}&key=your-api-key`;

  const response = await fetch(proxyUrl, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  });

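  // Fail early on HTTP errors instead of parsing an error page
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${response.statusText}`);
  }

  const html = await response.text();
  // DOMParser is a browser API; outside a browser an HTML parsing library is needed
  const parser = new DOMParser();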
  const doc = parser.parseFromString(html, 'text/html');

  const products: Product[] = [];
  const productElements = doc.querySelectorAll('.product-card');

  productElements.forEach(element => {
    const nameElement = element.querySelector('.product-name');
    const priceElement = element.querySelector('.product-price');
    const linkElement = element.querySelector('a.product-link');
    const imageElement = element.querySelector('img.product-image');

    if (nameElement && priceElement && linkElement) {
      products.push({
        name: nameElement.textContent?.trim() || '',
        price: priceElement.textContent?.trim() || '',
        url: linkElement.getAttribute('href') || '',
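        // getAttribute returns null when src is missing; map that to undefined to match Product
        image: imageElement?.getAttribute('src') ?? undefined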
      });
    }
  });

  return products;
}

Handling Dynamic Content

Many modern websites render content dynamically using JavaScript frameworks like React, Vue, or Angular. The initial HTML contains minimal content plus JavaScript code that fetches data from APIs and builds the DOM after page load. Scrapers that fetch only the raw HTML receive nearly empty pages or loading spinners because that JavaScript never runs.

Headless browsers solve the JavaScript rendering challenge by running a full browser engine programmatically. These browsers execute JavaScript, wait for DOM updates, handle AJAX requests, and provide access to the fully rendered page. Playwright and Puppeteer are popular headless browser tools offering APIs for navigation, waiting, interaction, and data extraction from JavaScript-heavy sites.

Alternatively, many JavaScript sites fetch data from backend APIs that scrapers can access directly. Browser developer tools reveal network requests showing API endpoints returning JSON data. Scraping these APIs directly bypasses rendering overhead, improves performance, and often provides cleaner data structures than parsing HTML. This approach requires reverse engineering API authentication, rate limiting, and request formats.

// Advanced scraper handling both HTML parsing and API data
class WebScraper {
  constructor(
    private apiKey: string,
    private proxyType: 'residential' | 'datacenter' = 'residential'
  ) {}

  async scrapeHTMLContent(url: string) {
    const proxyUrl = this.buildProxyUrl(url);

    const response = await fetch(proxyUrl, {
      headers: this.getBrowserHeaders()
    });

    const html = await response.text();
    return this.parseHTML(html);
  }

  async scrapeAPIEndpoint(apiUrl: string) {
    const proxyUrl = this.buildProxyUrl(apiUrl);

    const response = await fetch(proxyUrl, {
      headers: {
        'Accept': 'application/json',
        'User-Agent': this.getBrowserHeaders()['User-Agent']
      }
    });

    return response.json();
  }

  async scrapeWithPagination(baseUrl: string, maxPages: number = 10) {
    const results: any[] = [];

    for (let page = 1; page <= maxPages; page++) {
      const url = `${baseUrl}?page=${page}`;
      console.log(`Scraping page ${page}...`);

      const pageData = await this.scrapeHTMLContent(url);

      // An empty page means we are past the last page; stop without waiting
      if (pageData.length === 0) {
        break;
      }

      results.push(...pageData);

      // Respectful delay between requests
      await this.delay(2000);
    }

    return results;
  }

  private buildProxyUrl(url: string): string {
    return `https://corsproxy.io/?url=${encodeURIComponent(url)}&key=${this.apiKey}&type=${this.proxyType}&colo=fra`;
  }

  private getBrowserHeaders() {
    return {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept-Encoding': 'gzip, deflate, br',
      'Connection': 'keep-alive',
      'Upgrade-Insecure-Requests': '1'
    };
  }

  private parseHTML(html: string) {
    const parser = new DOMParser();
    const doc = parser.parseFromString(html, 'text/html');

    // Implement your parsing logic here
    const data = doc.querySelectorAll('.data-item');
    return Array.from(data).map(item => ({
      text: item.textContent?.trim()
    }));
  }

  private delay(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const scraper = new WebScraper('your-api-key', 'residential');

// Scrape HTML content
const products = await scraper.scrapeHTMLContent('https://example.com/products');

// Scrape API endpoint
const apiData = await scraper.scrapeAPIEndpoint('https://api.example.com/v1/products');

// Scrape with pagination
const allProducts = await scraper.scrapeWithPagination('https://example.com/products', 5);

Rate Limiting and Politeness

Responsible web scraping respects server resources and website terms of service. Rate limiting controls request frequency, which prevents server overload and reduces the risk of IP bans. A simple approach waits a fixed interval between requests: one second per request caps load at 60 requests per minute. More sophisticated rate limiters implement token buckets or sliding windows, which allow short bursts while maintaining an average rate.
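The token bucket idea can be sketched in a few lines. This is a minimal illustration, not a production limiter: the TokenBucket class, its parameter names, and the injectable clock are assumptions introduced here for the example.

```typescript
// Token bucket sketch: allows short bursts while capping the average rate.
// The class name and its parameters are illustrative, not part of any library.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number,        // maximum burst size
    private readonly refillPerSecond: number, // sustained average rate
    private readonly now: () => number = () => Date.now() // injectable clock
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  // Add the tokens earned since the last refill, never exceeding capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
  }

  // Take one token if available; returns false when the caller should wait.
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until the next token becomes available (0 if one is ready).
  msUntilNextToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }
}
```

A scraper would call tryAcquire() before each request and, when it returns false, sleep for msUntilNextToken() milliseconds. A bucket created with new TokenBucket(5, 1) permits a burst of five requests and then settles at roughly one request per second.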

Distributed scraping across multiple IPs enables higher aggregate throughput while keeping per-IP rates reasonable. Dedicated proxy pools handle rotation and geographic distribution for large‑scale scraping workloads.

Scrapers should identify themselves through User-Agent headers rather than masquerading as browsers when operating at scale. Many sites appreciate being notified about bot traffic and may provide official APIs or data access agreements for legitimate use cases. Including contact information in User-Agent strings enables site owners to reach out about excessive load or policy violations rather than immediately blocking access.
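A self-identifying User-Agent might be built as follows. This is only a sketch: the bot name, URL, and email address are placeholders, and the "+URL" form simply follows a convention seen in the User-Agent strings of several well-known crawlers. Browsers may not let page scripts override this header, so the approach mainly applies to server-side scrapers.

```typescript
// Builds a self-identifying User-Agent string.
// "ExampleScraper" and the contact details are placeholders, not real values.
interface BotIdentity {
  botName: string;
  version: string;
  infoUrl: string;      // page describing what the bot does
  contactEmail: string; // where site owners can report problems
}

function buildBotUserAgent(identity: BotIdentity): string {
  const { botName, version, infoUrl, contactEmail } = identity;
  return `${botName}/${version} (+${infoUrl}; ${contactEmail})`;
}

const exampleUserAgent = buildBotUserAgent({
  botName: 'ExampleScraper',
  version: '1.0',
  infoUrl: 'https://example.com/bot',
  contactEmail: 'bot-admin@example.com'
});
// exampleUserAgent is "ExampleScraper/1.0 (+https://example.com/bot; bot-admin@example.com)"
```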

Using CorsProxy for Web Scraping

CorsProxy is a CORS‑friendly proxy that lets browser code fetch cross‑origin content without running a separate proxy server. It is best suited for browser‑based scraping or API access where CORS blocks requests.

For server‑side scraping or high‑volume rotation, use a dedicated proxy provider. CorsProxy focuses on simplifying CORS‑blocked fetches in front‑end environments.

Business plan note: extract, input, output, and ttl require a Business plan and a valid API key.

CorsProxy’s URL parameter approach simplifies integration compared to traditional proxy protocols requiring low-level HTTP client configuration. Applications make standard HTTPS requests to CorsProxy URLs without proxy protocol support, session management, or authentication handling beyond API key inclusion. This works seamlessly with browsers, serverless functions, and any environment supporting HTTPS fetch.

// Production-ready scraper with CorsProxy integration
class ProductionScraper {
  private requestCount = 0;
  private lastRequestTime = 0;
  private minRequestInterval = 1000; // 1 second between requests

  constructor(
    private apiKey: string,
    private options: {
      maxRetries?: number;
      timeout?: number;
    } = {}
  ) {}

  async scrape(url: string, parseFunction: (html: string) => any) {
    await this.enforceRateLimit();

    let lastError: Error | null = null;
    const maxRetries = this.options.maxRetries || 3;

    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      try {
        const data = await this.fetchWithTimeout(url);
        const parsed = parseFunction(data);

        this.requestCount++;
        return parsed;
      } catch (error) {
        lastError = error as Error;
        console.log(`Attempt ${attempt} failed:`, error);

        if (attempt < maxRetries) {
          // Exponential backoff
          const delay = Math.pow(2, attempt) * 1000;
          await this.delay(delay);
        }
      }
    }

    throw new Error(`Failed after ${maxRetries} attempts: ${lastError?.message}`);
  }

  async scrapeBatch(urls: string[], parseFunction: (html: string) => any, concurrency: number = 5) {
    const results: any[] = [];
    const batches: string[][] = [];

    // Split URLs into batches
    for (let i = 0; i < urls.length; i += concurrency) {
      batches.push(urls.slice(i, i + concurrency));
    }

    // Process batches sequentially, items in batch parallel
    for (const batch of batches) {
      const batchResults = await Promise.allSettled(
        batch.map(url => this.scrape(url, parseFunction))
      );

      batchResults.forEach((result, index) => {
        if (result.status === 'fulfilled') {
          results.push({
            url: batch[index],
            data: result.value,
            success: true
          });
        } else {
          results.push({
            url: batch[index],
            error: result.reason?.message,
            success: false
          });
        }
      });
    }

    return results;
  }

  private async fetchWithTimeout(url: string): Promise<string> {
    const proxyUrl = `https://corsproxy.io/?url=${encodeURIComponent(url)}&key=${this.apiKey}`;

    const timeout = this.options.timeout || 30000;
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);

    try {
      const response = await fetch(proxyUrl, {
        signal: controller.signal,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        }
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }

      return await response.text();
    } finally {
      clearTimeout(timeoutId);
    }
  }

  private async enforceRateLimit() {
    const now = Date.now();

    // Reserve the next free slot before awaiting, so concurrent callers
    // (for example from scrapeBatch) queue up instead of all passing at once
    const scheduledTime = Math.max(now, this.lastRequestTime + this.minRequestInterval);
    this.lastRequestTime = scheduledTime;

    if (scheduledTime > now) {
      await this.delay(scheduledTime - now);
    }
  }

  private delay(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

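  // Construction time, used as the start of the measurement window
  private readonly startTime = Date.now();

  getStats() {
    const elapsedSeconds = (Date.now() - this.startTime) / 1000;
    return {
      totalRequests: this.requestCount,
      // Successful requests per second since the scraper was created
      averageRate: elapsedSeconds > 0 ? this.requestCount / elapsedSeconds : 0
    };
  }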
}

Legal Considerations

Web scraping legality varies by jurisdiction, website terms of service, and how the data is used. Many websites prohibit automated access in their Terms of Service, though enforceability depends on local laws. The Computer Fraud and Abuse Act (CFAA) in the United States criminalizes unauthorized access, but courts disagree on whether violating ToS constitutes unauthorized access or whether publicly accessible data qualifies as protected.

Always respect robots.txt files, which specify which paths bots may access and, on some sites, what crawl delay to observe. While not legally binding, robots.txt states the site owner's explicit preferences about automated access. Ignoring it can be presented as evidence of bad faith, which may strengthen legal claims in jurisdictions where scraping legality remains ambiguous. Commercial scraping services often respect robots.txt as industry best practice regardless of strict legal requirements.
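A scraper can check robots.txt before fetching a path. The following is a simplified sketch, not a complete implementation of the Robots Exclusion Protocol: it handles User-agent groups, Allow, Disallow, and the widely seen but non-standard Crawl-delay directive, and it ignores wildcard and end-anchor patterns in paths. The function names and the RobotsRules shape are assumptions made for this example.

```typescript
// Simplified robots.txt check: no support for * or $ inside path patterns.
interface RobotsRules {
  allow: string[];
  disallow: string[];
  crawlDelaySeconds?: number;
}

function parseRobotsTxt(content: string, userAgent: string): RobotsRules {
  const groups = new Map<string, RobotsRules>();
  let currentAgents: string[] = [];
  let lastWasAgentLine = false;

  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const separator = line.indexOf(':');
    if (separator === -1) continue;

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgentLine) currentAgents = [];
      const agent = value.toLowerCase();
      currentAgents.push(agent);
      if (!groups.has(agent)) groups.set(agent, { allow: [], disallow: [] });
      lastWasAgentLine = true;
      continue;
    }

    lastWasAgentLine = false;
    for (const agent of currentAgents) {
      const rules = groups.get(agent)!;
      if (field === 'allow' && value) rules.allow.push(value);
      if (field === 'disallow' && value) rules.disallow.push(value);
      if (field === 'crawl-delay' && value !== '' && !Number.isNaN(Number(value))) {
        rules.crawlDelaySeconds = Number(value);
      }
    }
  }

  // Prefer a group naming this bot; otherwise fall back to the wildcard group.
  const wanted = userAgent.toLowerCase();
  const specific = Array.from(groups.keys()).find(
    agent => agent !== '*' && agent.length > 0 && wanted.includes(agent)
  );
  return groups.get(specific ?? '*') ?? { allow: [], disallow: [] };
}

function isPathAllowed(rules: RobotsRules, path: string): boolean {
  // The longest matching prefix wins; Allow wins a tie.
  const longest = (prefixes: string[]) =>
    Math.max(-1, ...prefixes.filter(p => path.startsWith(p)).map(p => p.length));
  return longest(rules.allow) >= longest(rules.disallow);
}
```

In use, the scraper would fetch /robots.txt once per host, call parseRobotsTxt with its own User-Agent, skip any URL for which isPathAllowed returns false, and wait at least crawlDelaySeconds between requests when that value is present.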

Personal data scraping faces additional restrictions under regulations like GDPR and CCPA. These laws limit collection, storage, and processing of personally identifiable information without consent or legitimate interest. Scraping LinkedIn profiles, Facebook data, or other platforms containing personal information risks regulatory enforcement even when data appears publicly accessible. Always consult legal counsel before scraping personal data at scale.

Best Practices and Error Handling

Implement comprehensive error handling that accounts for network failures, parsing errors, rate limits, and anti-bot blocks. Distinguish transient errors that warrant a retry from permanent failures that need human intervention. Log detailed error information, including URLs, timestamps, and error messages, to support debugging and pattern recognition. Use exponential backoff for retries so that many clients do not hit a service at the same moment as it recovers from an outage (the thundering herd effect).
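One way to separate transient from permanent failures is by HTTP status code, paired with a capped backoff schedule. The sketch below makes two assumptions beyond the text above: the set of status codes treated as retryable is a common choice rather than a standard, and the delay includes random jitter, which spreads retries out so clients do not all return at the same instant. Function names are illustrative.

```typescript
type FailureKind = 'transient' | 'permanent';

// Status codes treated as retryable here are a common choice, not a standard.
function classifyHttpStatus(status: number): FailureKind {
  if (status === 408 || status === 429) return 'transient'; // timeout, rate limited
  if (status >= 500 && status <= 599) return 'transient';   // server-side trouble
  return 'permanent';                                       // e.g. 400, 403, 404
}

// Exponential backoff with jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,                    // 0 for the first retry
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random  // injectable for tests
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

A retry loop would call classifyHttpStatus on each failed response, give up immediately on 'permanent', and otherwise wait backoffDelayMs(attempt) before trying again.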

Cache scraped data to avoid redundant requests for unchanged content. Store the ETag or Last-Modified header from each response so that later requests can be conditional; the server then answers 304 Not Modified when the resource has not changed. Keep scraped data in persistent storage with deduplication so the same item is not processed twice. Database indexes on URLs or content hashes make duplicate detection efficient across scraping sessions.
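The conditional-request flow can be sketched with two small pure functions: one builds the validator headers from a cached entry, the other decides what to store once the response arrives. The CachedPage and FetchedPage shapes and both function names are assumptions for illustration; whether an intermediate proxy forwards these headers unchanged is something to verify for a given setup.

```typescript
interface CachedPage {
  body: string;
  etag?: string;
  lastModified?: string;
}

// Minimal shape of the response fields this sketch needs.
interface FetchedPage {
  status: number;
  body: string;
  etag?: string;
  lastModified?: string;
}

// Build the validators to send with a repeat request for a cached URL.
function conditionalHeaders(cached: CachedPage | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached?.etag) headers['If-None-Match'] = cached.etag;
  if (cached?.lastModified) headers['If-Modified-Since'] = cached.lastModified;
  return headers;
}

// Decide what to keep in the cache after the response arrives.
function updateCache(cached: CachedPage | undefined, response: FetchedPage): CachedPage {
  if (response.status === 304 && cached) {
    return cached; // unchanged on the server: reuse the stored copy
  }
  return {
    body: response.body,
    etag: response.etag,
    lastModified: response.lastModified
  };
}
```

On the first visit the cache is empty, so conditionalHeaders returns an empty object and the full page is stored. On later visits the validators are sent, and a 304 response leaves the cached copy in place without the body being downloaded again.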

Monitor scraping effectiveness by tracking success rates, error types, and performance metrics. Alert on elevated error rates, which usually indicate either a site structure change that requires parser updates or new anti-bot countermeasures that require a change of approach. A/B test different scraping strategies, measuring success rate, speed, and resource consumption, to find what works best for each use case and target site.
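A rolling error-rate monitor is one simple way to drive such alerts. The sketch below is illustrative only: the ScrapeMonitor class, its default window of 100 requests, and its 20% alert threshold are assumptions chosen for the example, not recommended values.

```typescript
// Tracks recent outcomes in a sliding window and counts failures by type.
class ScrapeMonitor {
  private outcomes: boolean[] = [];
  private errorCounts = new Map<string, number>();

  constructor(
    private readonly windowSize: number = 100,     // how many recent requests to consider
    private readonly alertThreshold: number = 0.2  // alert when the error rate exceeds this
  ) {}

  recordSuccess(): void {
    this.push(true);
  }

  recordFailure(errorType: string): void {
    this.push(false);
    this.errorCounts.set(errorType, (this.errorCounts.get(errorType) ?? 0) + 1);
  }

  // Keep only the most recent windowSize outcomes.
  private push(ok: boolean): void {
    this.outcomes.push(ok);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift();
  }

  // Fraction of failed requests within the current window.
  errorRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter(ok => !ok).length;
    return failures / this.outcomes.length;
  }

  shouldAlert(): boolean {
    return this.errorRate() > this.alertThreshold;
  }

  // Total failures of one type since the monitor was created.
  countFor(errorType: string): number {
    return this.errorCounts.get(errorType) ?? 0;
  }
}
```

Recording a distinct error type per failure (for example 'parse', 'http-429', 'timeout') makes it easier to tell a broken selector from a rate limit when an alert fires.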
