Scraping API — Endpoint Reference

Base URL: https://api.rl-analytix.de
API Version: v1
Content-Type: application/json

Authentication

Every request must include your API key in the X-API-Key header.

X-API-Key: your_api_key_here

If the key is missing or invalid, the API returns 401 Unauthorized. Quota-related failures return 402 Payment Required (insufficient tokens) or 429 Too Many Requests (rate limit exceeded).
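As a minimal sketch, a client might centralise the header and the three status codes described above as follows. The helper names `auth_headers` and `explain_auth_status` are illustrative assumptions and not part of the API.

```python
from __future__ import annotations

# Status codes and meanings as described in the Authentication section.
_AUTH_STATUS_HINTS = {
    401: "API key missing or invalid",
    402: "insufficient tokens",
    429: "rate limit exceeded",
}


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the headers sent with every request (hypothetical helper)."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {"X-API-Key": api_key, "Content-Type": "application/json"}


def explain_auth_status(status_code: int) -> str | None:
    """Return a short hint for auth or quota failures, or None otherwise."""
    return _AUTH_STATUS_HINTS.get(status_code)
```

The returned dictionary can be passed directly as the `headers` argument of an HTTP client call.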

Common Data Types

All three scraping endpoints return individual page results as a ScrapeResponse object. Understanding this shared structure first makes the per-endpoint response schemas easier to read.

`ScrapeResponse`

Field	Type	Description
`url`	string	The URL that was scraped
`status_code`	integer	HTTP status code returned by the target server
`success`	boolean	`true` if the page was fetched and parsed without error
`data`	`ExtractedData` \| null	Extracted content (present when `success` is `true`)
`error`	string \| null	Human-readable error message (present when `success` is `false`)

`ExtractedData`

Field	Type	Max items	Description
`title`	string \| null	—	Content of the `<title>` tag
`description`	string \| null	—	Content of the `<meta name="description">` tag
`text_content`	string \| null	5 000 chars	Visible body text with script/style/nav/header/footer removed
`links`	array[string]	100	Absolute URLs from all `<a href>` attributes (excludes `#`, `javascript:`, `mailto:`)
`images`	array[string]	50	Absolute URLs from all `<img src>` attributes
`emails`	array[string]	50	Email addresses found on the page — including obfuscated variants such as `name(at)domain.com`, `name[at]domain.com`, `name{at}domain.com`, and `name-at-domain.com`
`selected_elements`	array[string]	50	Text content of elements matched by `css_selector` (empty when no selector was given)
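The service's actual email detection logic is not documented here. Purely as an illustration of the obfuscation variants listed for `emails`, the sketch below normalises them with regular expressions. The function name `deobfuscate_emails` and both patterns are assumptions made for this example, and the `-at-` form in particular can produce false positives on ordinary hyphenated text.

```python
import re

_LOCAL = r"[A-Za-z0-9._%+-]+"
_DOMAIN = r"[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# Plain addresses such as name@domain.com.
_PLAIN = re.compile(_LOCAL + "@" + _DOMAIN)

# Obfuscated separators: (at), [at], {at} and -at-.
# The local part is matched lazily so "-at-" is not swallowed by it.
_OBFUSCATED = re.compile(
    "(" + _LOCAL + r"?)\s*(?:\(at\)|\[at\]|\{at\}|-at-)\s*(" + _DOMAIN + ")",
    re.IGNORECASE,
)


def deobfuscate_emails(text: str, limit: int = 50) -> list:
    """Return deduplicated, lowercased addresses, capped at `limit` items."""
    found = {m.lower() for m in _PLAIN.findall(text)}
    for local, domain in _OBFUSCATED.findall(text):
        found.add((local + "@" + domain).lower())
    return sorted(found)[:limit]
```

The default `limit` of 50 mirrors the documented cap. Sorting is only there to make the output deterministic.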

POST /v1/scrape/ — Scrape a Single URL

Fetches a single web page and returns its content and extracted data.

Endpoint: POST /v1/scrape/

Request Parameters

Parameter	Type	Required	Default	Description
`url`	string	yes	—	The page URL to scrape. The protocol (`https://`) is added automatically if omitted.
`extract_links`	boolean	no	`false`	When `true`, the response includes all hyperlinks found on the page.
`extract_images`	boolean	no	`false`	When `true`, the response includes all image URLs found on the page.
`extract_emails`	boolean	no	`false`	When `true`, the page is scanned for email addresses, including common obfuscation patterns, and the matches are returned in `emails`.
`css_selector`	string	no	`null`	A CSS selector (e.g. `"h2.article-title"`) whose matching elements' text content is returned in `selected_elements`.

Response

Returns a single ScrapeResponse object.
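For clients that prefer typed objects over raw dictionaries, the response could be mapped onto dataclasses. The sketch below mirrors the field tables in Common Data Types. The class layout and the `from_dict` helpers are assumptions for illustration and are not shipped by the API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExtractedData:
    title: Optional[str] = None
    description: Optional[str] = None
    text_content: Optional[str] = None
    links: list[str] = field(default_factory=list)
    images: list[str] = field(default_factory=list)
    emails: list[str] = field(default_factory=list)
    selected_elements: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "ExtractedData":
        # Missing or null list fields are normalised to empty lists.
        return cls(
            title=raw.get("title"),
            description=raw.get("description"),
            text_content=raw.get("text_content"),
            links=list(raw.get("links") or []),
            images=list(raw.get("images") or []),
            emails=list(raw.get("emails") or []),
            selected_elements=list(raw.get("selected_elements") or []),
        )


@dataclass
class ScrapeResponse:
    url: str
    status_code: int
    success: bool
    data: Optional[ExtractedData] = None
    error: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "ScrapeResponse":
        data = raw.get("data")
        return cls(
            url=raw["url"],
            status_code=raw["status_code"],
            success=raw["success"],
            # "data" is null when the scrape failed.
            data=ExtractedData.from_dict(data) if data is not None else None,
            error=raw.get("error"),
        )
```

With this in place, `ScrapeResponse.from_dict(response.json())` yields an object whose `data` attribute is `None` for failed scrapes.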

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
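response.raise_for_status()  # raise on API-level errors such as 401 or 429
result = response.json()

# "data" is null when the scrape failed, so check "success" first
if result["success"]:
    print(result["data"]["title"])
    print(result["data"]["text_content"])
else:
    print("Error:", result["error"])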

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={
        "url":             "https://example.com/contact",
        "extract_links":   True,
        "extract_images":  True,
        "extract_emails":  True,
        "css_selector":    "h2, p.intro",
    },
)
result = response.json()

if result["success"]:
    data = result["data"]
    print("Title      :", data["title"])
    print("Description:", data["description"])
    print("Links      :", data["links"])
    print("Images     :", data["images"])
    print("Emails     :", data["emails"])
    print("Selected   :", data["selected_elements"])
else:
    print("Error:", result["error"])

POST /v1/scrape/batch — Scrape Multiple URLs

Scrapes up to 50 URLs in a single request. Pages are fetched concurrently (up to 10 at a time), so you can scrape a known list of pages without writing your own concurrency logic.

Endpoint: POST /v1/scrape/batch

Request Parameters

Parameter	Type	Required	Default	Limits	Description
`urls`	array[string]	yes	—	1–50 items	List of page URLs to scrape. The protocol is added automatically if omitted.
`extract_links`	boolean	no	`false`	—	When `true`, extract all hyperlinks from every page.
`extract_images`	boolean	no	`false`	—	When `true`, extract all image URLs from every page.
`extract_emails`	boolean	no	`false`	—	When `true`, detect email addresses on every page.

Note: The batch endpoint does not accept a `css_selector`. Use the single-URL endpoint for selector-based extraction.

Response

Field	Type	Description
`total`	integer	Total number of URLs submitted
`successful`	integer	Number of URLs scraped without error
`failed`	integer	Number of URLs that could not be scraped
`results`	array[`ScrapeResponse`]	One entry per URL in the same order as the request

Each item in results is a ScrapeResponse. Always check the success field per result — a failed URL does not cause the whole batch to fail.
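Because a batch accepts at most 50 URLs and individual failures do not fail the whole request, client code typically needs two small utilities: one to split long URL lists into batches, and one to separate successful from failed results. The following is a minimal sketch; `chunk_urls` and `split_results` are hypothetical helper names.

```python
from typing import Iterator

MAX_BATCH_SIZE = 50  # documented limit per batch request


def chunk_urls(urls: list, size: int = MAX_BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive slices of `urls`, each small enough for one batch."""
    if not 1 <= size <= MAX_BATCH_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def split_results(batch: dict) -> tuple:
    """Split a batch response into (successful, failed) result lists."""
    ok = [r for r in batch["results"] if r["success"]]
    failed = [r for r in batch["results"] if not r["success"]]
    return ok, failed
```

Each chunk produced by `chunk_urls` can be sent as the `urls` field of its own batch request, and `split_results` can be applied to each parsed response.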

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

urls = [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={"urls": urls},
)
data = response.json()
print(f"Scraped {data['successful']}/{data['total']} pages successfully")

# --- Full options + per-result handling ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={
        "urls":           urls,
        "extract_links":  True,
        "extract_images": False,
        "extract_emails": True,
    },
)
batch = response.json()

for result in batch["results"]:
    if result["success"]:
        print(f"[OK]  {result['url']}")
        print(f"      Title : {result['data']['title']}")
        print(f"      Emails: {result['data']['emails']}")
    else:
        print(f"[ERR] {result['url']} — {result['error']}")

POST /v1/scrape/crawl — Crawl a Website

Starting from a seed URL, the crawler follows links and scrapes every discovered page using a breadth-first search (BFS) strategy. You control how deep it goes, how many pages it visits, and which links it should follow or ignore.

Endpoint: POST /v1/scrape/crawl

Request Parameters

Core extraction parameters

Parameter	Type	Required	Default	Description
`url`	string	yes	—	The starting URL (seed) for the crawl.
`extract_images`	boolean	no	`false`	Extract image URLs from every crawled page.
`extract_emails`	boolean	no	`false`	Detect email addresses on every crawled page.
`css_selector`	string	no	`null`	CSS selector applied to every crawled page. Matched element text is returned in `selected_elements`.

Note: extract_links is always true for a crawl — the crawler relies on extracted links to discover new pages.

Crawl scope parameters

Parameter	Type	Required	Default	Range	Description
`max_depth`	integer	no	`2`	1–10	Maximum link depth to follow from the seed URL. The seed page is depth 0, pages it links to directly are depth 1, and so on.
`max_pages`	integer	no	`50`	1–500	Hard cap on the total number of pages to visit. The crawl stops as soon as this limit is reached.
`same_domain_only`	boolean	no	`true`	—	When `true` (recommended), only links pointing to the same domain are followed. The `www.` prefix is ignored for this comparison, so `www.example.com` and `example.com` are treated as the same domain.
`ignore_query_params`	boolean	no	`false`	—	When `true`, URLs that differ only in query string (e.g. `/page?lang=en` vs `/page?lang=de`) are treated as the same page and only visited once.
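To make the `same_domain_only` and `ignore_query_params` rules concrete, the sketch below shows one way such comparisons could be implemented with the standard library. It is an illustration under assumptions, not the service's code: the function names are hypothetical, and treating other subdomains (for example `blog.example.com`) as a different domain is an assumption, since only the `www.` prefix rule is documented.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_host(url: str) -> str:
    """Lowercased host name with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def same_domain(url_a: str, url_b: str) -> bool:
    """True when both URLs share a host, ignoring the 'www.' prefix."""
    return canonical_host(url_a) == canonical_host(url_b)


def dedupe_key(url: str, ignore_query_params: bool = False) -> str:
    """Key used to decide whether two URLs count as the same page.

    The fragment is always dropped; the query string is dropped only
    when ignore_query_params is True.
    """
    parts = urlsplit(url)
    query = "" if ignore_query_params else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

With `ignore_query_params` enabled, `/page?lang=en` and `/page?lang=de` produce the same key and would therefore be visited once.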

URL filtering parameters

Parameter	Type	Required	Default	Description
`include_patterns`	array[string]	no	`[]`	If one or more patterns are provided, only URLs whose path matches at least one pattern are crawled. Patterns use Unix shell-style wildcards (`*` matches any sequence of characters). Example: `["/blog/*", "/news/*"]`
`exclude_patterns`	array[string]	no	`[]`	URLs whose path matches any of these patterns are skipped. Example: `["/admin/*", "*/login", "*/logout"]`
`exclude_file_extensions`	boolean	no	`true`	When `true`, links to non-HTML resources are automatically skipped. This includes: PDFs, ZIP/RAR archives, images (`.jpg`, `.png`, `.gif`, `.webp`, `.svg`, …), videos, audio files, Office documents (`.docx`, `.xlsx`, `.pptx`), executables, and static assets (`.css`, `.js`, `.json`, `.xml`).
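Python's `fnmatch` module implements the same Unix shell-style wildcards, so patterns can be tried out locally before starting a crawl. The sketch below is an approximation: `should_crawl` is a hypothetical helper, and letting exclude patterns take precedence over include patterns is an assumption, since the precedence is not stated in this reference.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a URL's path passes the include/exclude filters."""
    path = urlsplit(url).path or "/"
    # Assumption: a matching exclude pattern always wins.
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    # With no include patterns, every non-excluded path is allowed.
    if include_patterns:
        return any(fnmatchcase(path, pattern) for pattern in include_patterns)
    return True
```

Note that `*` in `fnmatch` also matches `/`, so `"/blog/*"` covers nested paths such as `/blog/2024/post`.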

Response

Field	Type	Description
`url`	string	The seed URL the crawl started from
`total_crawled`	integer	Total number of pages visited
`successful`	integer	Pages scraped without error
`failed`	integer	Pages that could not be scraped
`max_depth_reached`	integer	The deepest level actually visited during this crawl
`stopped_reason`	string	Why the crawl ended: `"completed"` (no more links), `"max_pages"` (limit hit), or `"max_depth"` (depth limit hit)
`results`	array[`ScrapeResponse`]	Full scraping result for every visited page
`sitemap`	array[`SitemapEntry`]	Structural map of every discovered URL

`SitemapEntry`

Field	Type	Description
`url`	string	The page URL
`depth`	integer	Depth at which this page was found (seed = 0)
`parent_url`	string \| null	The page from which this URL was discovered
`status_code`	integer	HTTP status code returned by the target server
`success`	boolean	Whether the page was scraped successfully

Python Example

import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY  = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
crawl = response.json()
print(f"Crawled {crawl['total_crawled']} pages, stopped because: {crawl['stopped_reason']}")

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={
        "url":                      "https://example.com",
        "extract_images":           False,
        "extract_emails":           True,
        "css_selector":             "h1, h2",
        "max_depth":                3,
        "max_pages":                100,
        "same_domain_only":         True,
        "ignore_query_params":      True,
        "include_patterns":         ["/blog/*", "/news/*"],
        "exclude_patterns":         ["/admin/*", "*/login", "*/tag/*"],
        "exclude_file_extensions":  True,
    },
)
crawl = response.json()

# Summary
print(f"Pages visited : {crawl['total_crawled']}")
print(f"Successful    : {crawl['successful']}")
print(f"Failed        : {crawl['failed']}")
print(f"Max depth hit : {crawl['max_depth_reached']}")
print(f"Stop reason   : {crawl['stopped_reason']}")
print()

# All discovered emails across the entire site
all_emails = set()
for page in crawl["results"]:
    if page["success"] and page["data"]["emails"]:
        all_emails.update(page["data"]["emails"])
print("Emails found:", sorted(all_emails))

# Sitemap — print tree structure
print("\nSitemap:")
for entry in crawl["sitemap"]:
    indent = "  " * entry["depth"]
    status = "OK " if entry["success"] else "ERR"
    print(f"{indent}[{status}] {entry['url']}  (depth={entry['depth']})")

Error Reference

HTTP Status	Meaning	Common Cause
`400 Bad Request`	Invalid input	Malformed URL, parameter out of allowed range, private/localhost IP blocked by SSRF protection
`401 Unauthorized`	Authentication failed	Missing or invalid `X-API-Key`
`402 Payment Required`	Quota exhausted	API key has no remaining token balance
`422 Unprocessable Entity`	Validation error	Request body does not match the expected schema
`429 Too Many Requests`	Rate limit exceeded	Too many requests in a short period
`502 Bad Gateway`	Scraping failed	Target server unreachable, returned an error, or timed out (30 s limit per page)
`503 Service Unavailable`	Auth service down	The authentication service is temporarily unavailable

Error responses include a JSON body with a detail field describing the problem:

{
  "detail": "URL targets a private IP address range and cannot be scraped."
}
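One way to handle these errors on the client is to convert any non-success status into an exception that carries the `detail` text. The sketch below assumes nothing beyond the documented status codes and the `detail` field. The class `ScrapeApiError`, the function `raise_for_api_error`, and the choice of which codes count as retryable are illustrative assumptions.

```python
import json

# Assumption: these codes usually indicate a transient condition.
RETRYABLE_STATUS = {429, 502, 503}


class ScrapeApiError(Exception):
    """Raised for any 4xx/5xx response from the API (hypothetical class)."""

    def __init__(self, status_code, detail):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail
        self.retryable = status_code in RETRYABLE_STATUS


def raise_for_api_error(status_code, body):
    """Raise ScrapeApiError when status_code signals an error.

    `body` is the raw response text. If it is not a JSON object with a
    "detail" field, the raw text is used as the detail instead.
    """
    if status_code < 400:
        return
    try:
        detail = json.loads(body).get("detail", body)
    except (ValueError, AttributeError):
        detail = body
    raise ScrapeApiError(status_code, str(detail))
```

A caller can inspect `exc.retryable` to decide whether to back off and retry or to surface the error immediately.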

Limits & Behaviour Notes

Item	Limit / Behaviour
Batch size	Maximum 50 URLs per batch request
Crawl depth	1–10 levels (default: 2)
Crawl pages	1–500 pages per crawl (default: 50)
Concurrency	Up to 10 pages fetched in parallel (applies to both batch and crawl)
Request timeout	30 seconds per individual page request
Crawl delay	100 ms between requests to avoid overwhelming target servers
Text content	Truncated at 5 000 characters
Links per page	Capped at 100
Images per page	Capped at 50
Emails per page	Capped at 50, deduplicated and lowercased
CSS selector results	Capped at 50 elements
SSRF protection	URLs pointing to `localhost`, loopback addresses (`127.*`), or RFC 1918 private networks (`10.*`, `172.16–31.*`, `192.168.*`) are rejected with HTTP 400
URL normalisation	Fragments (`#section`) are stripped before deduplication; query parameters are optionally stripped during crawls (`ignore_query_params`)
Protocol	`https://` is automatically prepended to URLs that have no scheme
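A client can avoid some guaranteed 400 responses by checking URLs locally before submitting them. The sketch below approximates the protocol and SSRF rules from the table using the standard `ipaddress` module. The helper names are hypothetical, no DNS resolution is performed, and `ipaddress` flags a somewhat broader set of reserved ranges than RFC 1918 alone, so the server's decision remains authoritative.

```python
import ipaddress
from urllib.parse import urlsplit


def with_scheme(url: str) -> str:
    """Prepend https:// when the URL has no scheme."""
    return url if "://" in url else "https://" + url


def looks_private(url: str) -> bool:
    """Best-effort check for localhost, loopback, or private IP targets."""
    host = urlsplit(with_scheme(url)).hostname or ""
    if host.lower() == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Host names are not resolved in this sketch.
        return False
    return ip.is_loopback or ip.is_private
```

A URL for which `looks_private` returns `True` can be dropped from a batch before the request is sent.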