Scraping API — Endpoint Reference

Base URL: https://api.rl-analytix.de
API Version: v1
Content-Type: application/json


Authentication

Every request must include your API key in the X-API-Key header.

X-API-Key: your_api_key_here

If the key is missing or invalid the API returns 401 Unauthorized. Quota-related failures return 402 Payment Required (insufficient tokens) or 429 Too Many Requests (rate limit exceeded).


Common Data Types

All three scraping endpoints return individual page results as a ScrapeResponse object. Understanding this shared structure first makes the per-endpoint response schemas easier to read.

ScrapeResponse

Field        Type                  Description
url          string                The URL that was scraped
status_code  integer               HTTP status code returned by the target server
success      boolean               true if the page was fetched and parsed without error
data         ExtractedData | null  Extracted content (present when success is true)
error        string | null         Human-readable error message (present when success is false)

ExtractedData

Field              Type           Limit        Description
title              string | null  -            Content of the <title> tag
description        string | null  -            Content of the <meta name="description"> tag
text_content       string | null  5 000 chars  Visible body text with script/style/nav/header/footer removed
links              array[string]  100 items    Absolute URLs from all <a href> attributes (excludes #, javascript:, mailto:)
images             array[string]  50 items     Absolute URLs from all <img src> attributes
emails             array[string]  50 items     Email addresses found on the page, including obfuscated variants such as name(at)domain.com, name[at]domain.com, name{at}domain.com, and name-at-domain.com
selected_elements  array[string]  50 items     Text content of elements matched by css_selector (empty when no selector was given)
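
To make the obfuscated email variants listed above concrete, here is a minimal client-side sketch of how such addresses could be normalised. The regular expression and the helper names (normalise_email, unique_emails) are assumptions for illustration only; the service's actual detection logic is not documented here.

```python
import re

# Matches the four obfuscation markers named in the table above.
# The pattern itself is an assumption, not the service's implementation.
_AT_PATTERN = re.compile(r"\s*(?:\(at\)|\[at\]|\{at\})\s*|-at-", re.IGNORECASE)


def normalise_email(raw: str) -> str:
    """Replace the first obfuscated 'at' marker with '@' and lowercase the result."""
    return _AT_PATTERN.sub("@", raw.strip(), count=1).lower()


def unique_emails(candidates):
    """Normalise, deduplicate, and sort a list of candidate addresses."""
    return sorted({normalise_email(c) for c in candidates})
```

Lowercasing and deduplication mirror what the limits table says about the emails field; everything else is a guess at reasonable behaviour.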

POST /v1/scrape/ — Scrape a Single URL

Fetches a single web page and returns its content and extracted data.

Endpoint: POST /v1/scrape/

Request Parameters

Parameter       Type     Required  Default  Description
url             string   yes       -        The page URL to scrape. The protocol (https://) is added automatically if omitted.
extract_links   boolean  no        false    When true, the response includes all hyperlinks found on the page.
extract_images  boolean  no        false    When true, the response includes all image URLs found on the page.
extract_emails  boolean  no        false    When true, the page is scanned for email addresses, including common obfuscation patterns, and the matches are returned in emails.
css_selector    string   no        null     A CSS selector (e.g. "h2.article-title") whose matching elements' text content is returned in selected_elements.

Response

Returns a single ScrapeResponse object.

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
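response.raise_for_status()  # surface 4xx/5xx responses instead of parsing an error body
result = response.json()
if result["success"]:  # data is null when the scrape failed
    print(result["data"]["title"])
    print(result["data"]["text_content"])
else:
    print("Error:", result["error"])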

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={
        "url":             "https://example.com/contact",
        "extract_links":   True,
        "extract_images":  True,
        "extract_emails":  True,
        "css_selector":    "h2, p.intro",
    },
)
result = response.json()

if result["success"]:
    data = result["data"]
    print("Title      :", data["title"])
    print("Description:", data["description"])
    print("Links      :", data["links"])
    print("Images     :", data["images"])
    print("Emails     :", data["emails"])
    print("Selected   :", data["selected_elements"])
else:
    print("Error:", result["error"])

POST /v1/scrape/batch — Scrape Multiple URLs

Scrapes up to 50 URLs in parallel in a single request. Ideal for scraping a known list of pages without writing your own concurrency logic.

Endpoint: POST /v1/scrape/batch

Request Parameters

Parameter       Type           Required  Default  Limits      Description
urls            array[string]  yes       -        1–50 items  List of page URLs to scrape. The protocol is added automatically if omitted.
extract_links   boolean        no        false    -           When true, extract all hyperlinks from every page.
extract_images  boolean        no        false    -           When true, extract all image URLs from every page.
extract_emails  boolean        no        false    -           When true, detect email addresses on every page.

Note: The batch endpoint does not take a css_selector parameter, so a selector cannot be set per URL. Use the single-URL endpoint for selector-based extraction.

Response

Field       Type                   Description
total       integer                Total number of URLs submitted
successful  integer                Number of URLs scraped without error
failed      integer                Number of URLs that could not be scraped
results     array[ScrapeResponse]  One entry per URL in the same order as the request

Each item in results is a ScrapeResponse. Always check the success field per result — a failed URL does not cause the whole batch to fail.
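
Because a batch accepts at most 50 URLs and each result carries its own success flag, client code often needs two small helpers: one to split a longer list into request-sized chunks, and one to separate good results from failed ones. The sketch below is one possible way to do this; chunk_urls and partition_results are hypothetical helper names, not part of the API.

```python
def chunk_urls(urls, size=50):
    """Split a URL list into consecutive chunks of at most `size` items.

    The default of 50 matches the documented batch limit.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def partition_results(batch):
    """Separate a parsed batch response body into (successful, failed) lists."""
    ok = [r for r in batch["results"] if r["success"]]
    failed = [r for r in batch["results"] if not r["success"]]
    return ok, failed
```

Each chunk returned by chunk_urls can be sent as the urls field of its own batch request, and partition_results can then be applied to each parsed response.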

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

urls = [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={"urls": urls},
)
response.raise_for_status()  # error bodies carry "detail", not the batch fields
data = response.json()
print(f"Scraped {data['successful']}/{data['total']} pages successfully")

# --- Full options + per-result handling ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={
        "urls":           urls,
        "extract_links":  True,
        "extract_images": False,
        "extract_emails": True,
    },
)
batch = response.json()

for result in batch["results"]:
    if result["success"]:
        print(f"[OK]  {result['url']}")
        print(f"      Title : {result['data']['title']}")
        print(f"      Emails: {result['data']['emails']}")
    else:
        print(f"[ERR] {result['url']} — {result['error']}")

POST /v1/scrape/crawl — Crawl a Website

Starting from a seed URL, the crawler follows links and scrapes every discovered page using a breadth-first search (BFS) strategy. You control how deep it goes, how many pages it visits, and which links it should follow or ignore.
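
For readers unfamiliar with breadth-first crawling, the following sketch shows the general technique on an in-memory link graph. It is not the service's implementation: the links mapping is a hypothetical stand-in for the links a crawler would extract from each page, and the max_depth and max_pages arguments simply borrow the names and defaults of the request parameters documented in this section.

```python
from collections import deque


def bfs_order(seed, links, max_depth=2, max_pages=50):
    """Return (url, depth) pairs in breadth-first visit order.

    `links` maps each page to its outgoing links (hypothetical test data).
    """
    visited = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for target in links.get(url, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return order
```

All pages at depth 1 are visited before any page at depth 2, which is why a low page cap tends to cover a site's top-level pages first.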

Endpoint: POST /v1/scrape/crawl

Request Parameters

Core extraction parameters

Parameter       Type     Required  Default  Description
url             string   yes       -        The starting URL (seed) for the crawl.
extract_images  boolean  no        false    Extract image URLs from every crawled page.
extract_emails  boolean  no        false    Detect email addresses on every crawled page.
css_selector    string   no        null     CSS selector applied to every crawled page. Matched element text is returned in selected_elements.

Note: extract_links is always true for a crawl — the crawler relies on extracted links to discover new pages.

Crawl scope parameters

Parameter            Type     Required  Default  Range  Description
max_depth            integer  no        2        1–10   Maximum link depth to follow from the seed URL. The seed page is depth 0, pages linked directly from it are depth 1, and so on.
max_pages            integer  no        50       1–500  Hard cap on the total number of pages to visit. The crawl stops as soon as this limit is reached.
same_domain_only     boolean  no        true     -      When true (recommended), only links pointing to the same domain are followed. The www. prefix is ignored for this comparison, so www.example.com and example.com are treated as the same domain.
ignore_query_params  boolean  no        false    -      When true, URLs that differ only in query string (e.g. /page?lang=en vs /page?lang=de) are treated as the same page and only visited once.
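
The two boolean scope rules can be pictured with a short sketch using only the Python standard library. This is an illustration under assumptions, not the service's code: same_domain compares exact hostnames after dropping a leading www. (how other subdomains are treated is not documented), and dedup_key is a hypothetical helper showing how two URLs could collapse to one page.

```python
from urllib.parse import urlsplit, urlunsplit


def bare_host(url):
    """Return the lowercased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_domain(url_a, url_b):
    """True when both URLs share a hostname, ignoring the www. prefix."""
    return bare_host(url_a) == bare_host(url_b)


def dedup_key(url, ignore_query_params=False):
    """Key deciding whether two URLs count as the same page.

    The fragment is always dropped; the query string only when requested.
    """
    parts = urlsplit(url)
    query = "" if ignore_query_params else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```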

URL filtering parameters

Parameter                Type           Required  Default  Description
include_patterns         array[string]  no        []       If one or more patterns are provided, only URLs whose path matches at least one pattern are crawled. Patterns use Unix shell-style wildcards (* matches any sequence of characters). Example: ["/blog/*", "/news/*"]
exclude_patterns         array[string]  no        []       URLs whose path matches any of these patterns are skipped. Example: ["/admin/*", "*/login", "*/logout"]
exclude_file_extensions  boolean        no        true     When true, links to non-HTML resources are automatically skipped. This includes: PDFs, ZIP/RAR archives, images (.jpg, .png, .gif, .webp, .svg, …), videos, audio files, Office documents (.docx, .xlsx, .pptx), executables, and static assets (.css, .js, .json, .xml).
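
To preview which URLs a set of patterns would admit, the matching can be approximated locally with Python's fnmatch module, which implements Unix shell-style wildcards. The sketch below rests on two assumptions that this reference does not confirm: that exclude patterns take precedence over include patterns, and that matching is case-sensitive. should_crawl is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Apply include/exclude wildcard patterns to the path of a URL."""
    path = urlsplit(url).path or "/"
    # Assumption: an exclude match wins over any include match.
    if any(fnmatchcase(path, p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatchcase(path, p) for p in include_patterns)
    return True  # no include patterns means every non-excluded path passes
```

Note that in shell-style matching * also matches slashes and the empty string, so "*/login" matches both /login and /account/login.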

Response

Field              Type                   Description
url                string                 The seed URL the crawl started from
total_crawled      integer                Total number of pages visited
successful         integer                Pages scraped without error
failed             integer                Pages that could not be scraped
max_depth_reached  integer                The deepest level actually visited during this crawl
stopped_reason     string                 Why the crawl ended: "completed" (no more links), "max_pages" (limit hit), or "max_depth" (depth limit hit)
results            array[ScrapeResponse]  Full scraping result for every visited page
sitemap            array[SitemapEntry]    Structural map of every discovered URL

SitemapEntry

Field        Type           Description
url          string         The page URL
depth        integer        Depth at which this page was found (seed = 0)
parent_url   string | null  The page from which this URL was discovered
status_code  integer        HTTP status code returned by the target server
success      boolean        Whether the page was scraped successfully
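
Since every SitemapEntry names its parent, the flat sitemap list can be regrouped into a parent-to-children mapping on the client. The sketch below shows one way to do that; children_by_parent and failed_urls are hypothetical helpers, and the sample data in the usage lines assumes the seed entry has a parent_url of null, which the field's type suggests but this reference does not state outright.

```python
from collections import defaultdict


def children_by_parent(sitemap):
    """Group sitemap URLs under the page from which they were discovered."""
    tree = defaultdict(list)
    for entry in sitemap:
        tree[entry["parent_url"]].append(entry["url"])
    return dict(tree)


def failed_urls(sitemap):
    """List the URLs whose scrape did not succeed."""
    return [entry["url"] for entry in sitemap if not entry["success"]]


# Usage with made-up entries shaped like SitemapEntry objects
sample = [
    {"url": "https://example.com", "depth": 0, "parent_url": None, "status_code": 200, "success": True},
    {"url": "https://example.com/a", "depth": 1, "parent_url": "https://example.com", "status_code": 200, "success": True},
    {"url": "https://example.com/b", "depth": 1, "parent_url": "https://example.com", "status_code": 404, "success": False},
]
tree = children_by_parent(sample)
```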

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
response.raise_for_status()  # error bodies carry "detail", not the crawl fields
crawl = response.json()
print(f"Crawled {crawl['total_crawled']} pages, stopped because: {crawl['stopped_reason']}")

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={
        "url":                      "https://example.com",
        "extract_images":           False,
        "extract_emails":           True,
        "css_selector":             "h1, h2",
        "max_depth":                3,
        "max_pages":                100,
        "same_domain_only":         True,
        "ignore_query_params":      True,
        "include_patterns":         ["/blog/*", "/news/*"],
        "exclude_patterns":         ["/admin/*", "*/login", "*/tag/*"],
        "exclude_file_extensions":  True,
    },
)
crawl = response.json()

# Summary
print(f"Pages visited : {crawl['total_crawled']}")
print(f"Successful    : {crawl['successful']}")
print(f"Failed        : {crawl['failed']}")
print(f"Max depth hit : {crawl['max_depth_reached']}")
print(f"Stop reason   : {crawl['stopped_reason']}")
print()

# All discovered emails across the entire site
all_emails = set()
for page in crawl["results"]:
    if page["success"] and page["data"]["emails"]:
        all_emails.update(page["data"]["emails"])
print("Emails found:", sorted(all_emails))

# Sitemap — print tree structure
print("\nSitemap:")
for entry in crawl["sitemap"]:
    indent = "  " * entry["depth"]
    status = "OK " if entry["success"] else "ERR"
    print(f"{indent}[{status}] {entry['url']}  (depth={entry['depth']})")

Error Reference

HTTP Status               Meaning                Common Cause
400 Bad Request           Invalid input          Malformed URL, parameter out of allowed range, private/localhost IP blocked by SSRF protection
401 Unauthorized          Authentication failed  Missing or invalid X-API-Key
402 Payment Required      Quota exhausted        API key has no remaining token balance
422 Unprocessable Entity  Validation error       Request body does not match the expected schema
429 Too Many Requests     Rate limit exceeded    Too many requests in a short period
502 Bad Gateway           Scraping failed        Target server unreachable, returned an error, or timed out (30 s limit per page)
503 Service Unavailable   Auth service down      The authentication service is temporarily unavailable

Error responses include a JSON body with a detail field describing the problem:

{
  "detail": "URL targets a private IP address range and cannot be scraped."
}
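
Client code usually wants to turn an error response into a message and a retry decision. The sketch below shows one way to do this from a status code and the raw response body. Treating 429, 502 and 503 as retryable is an assumption made for this example, not a documented policy, and describe_error is a hypothetical helper name.

```python
import json

# Assumption: only these statuses are worth retrying after a pause.
RETRYABLE = {429, 502, 503}


def describe_error(status_code, body_text):
    """Return (retryable, message) for a non-2xx response.

    Falls back to a generic message when the body is not JSON or the
    detail field is missing or not a plain string.
    """
    try:
        detail = json.loads(body_text).get("detail")
    except (ValueError, AttributeError):
        detail = None
    message = detail if isinstance(detail, str) else f"HTTP {status_code}"
    return status_code in RETRYABLE, message
```

With the requests library this could be called as describe_error(response.status_code, response.text) whenever response.ok is false.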

Limits & Behaviour Notes

Item                  Limit / Behaviour
Batch size            Maximum 50 URLs per batch request
Crawl depth           1–10 levels (default: 2)
Crawl pages           1–500 pages per crawl (default: 50)
Concurrency           Up to 10 pages fetched in parallel (applies to both batch and crawl)
Request timeout       30 seconds per individual page request
Crawl delay           100 ms between requests to avoid overwhelming target servers
Text content          Truncated at 5 000 characters
Links per page        Capped at 100
Images per page       Capped at 50
Emails per page       Capped at 50, deduplicated and lowercased
CSS selector results  Capped at 50 elements
SSRF protection       URLs pointing to localhost, loopback addresses (127.*), or RFC 1918 private networks (10.*, 172.16–31.*, 192.168.*) are rejected with HTTP 400
URL normalisation     Fragments (#section) are stripped before deduplication; query parameters are optionally stripped during crawls (ignore_query_params)
Protocol              https:// is automatically prepended to URLs that have no scheme
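
A client can avoid some HTTP 400 responses by checking URLs locally before sending them. The following sketch mirrors the protocol and SSRF rows above using the standard ipaddress module. It is only a pre-check under assumptions: it inspects literal IPv4 addresses and the name localhost, and it cannot know what a hostname resolves to, so the service's own check remains authoritative. ensure_scheme and looks_blocked are hypothetical helper names.

```python
import ipaddress
from urllib.parse import urlsplit

# Loopback plus the three RFC 1918 private ranges named in the table above.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def ensure_scheme(url):
    """Prepend https:// when the URL has no scheme."""
    return url if "://" in url else "https://" + url


def looks_blocked(url):
    """Best-effort guess at whether the URL would be rejected as private."""
    host = urlsplit(ensure_scheme(url)).hostname or ""
    if host.lower() == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a regular hostname; only the service can resolve it
    return any(address in network for network in BLOCKED_NETWORKS)
```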