Scraping API — Endpoint Reference
Base URL: https://api.rl-analytix.de
API Version: v1
Content-Type: application/json
Authentication
Every request must include your API key in the X-API-Key header.
X-API-Key: your_api_key_here
If the key is missing or invalid the API returns 401 Unauthorized. Quota-related failures return 402 Payment Required (insufficient tokens) or 429 Too Many Requests (rate limit exceeded).
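As a rough illustration of the rules above, the following sketch builds the required header and maps the three documented failure codes to short explanations. The helper names `auth_headers` and `explain_failure` are hypothetical and not part of the API.

```python
from typing import Dict, Optional


def auth_headers(api_key: str) -> Dict[str, str]:
    """Build the headers sent with every request (hypothetical helper)."""
    return {"X-API-Key": api_key, "Content-Type": "application/json"}


# Status codes and meanings as stated in the Authentication section.
AUTH_FAILURES = {
    401: "API key missing or invalid",
    402: "insufficient tokens",
    429: "rate limit exceeded",
}


def explain_failure(status_code: int) -> Optional[str]:
    """Return a short reason for an auth or quota failure, else None."""
    return AUTH_FAILURES.get(status_code)


print(auth_headers("your_api_key_here"))
print(explain_failure(402))
```

A client could call `explain_failure(response.status_code)` before attempting to parse the body as a scrape result.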
Common Data Types
All three scraping endpoints return individual page results as a ScrapeResponse object. Understanding this shared structure first makes the per-endpoint response schemas easier to read.
ScrapeResponse
| Field | Type | Description |
|---|---|---|
| url | string | The URL that was scraped |
| status_code | integer | HTTP status code returned by the target server |
| success | boolean | true if the page was fetched and parsed without error |
| data | ExtractedData \| null | Extracted content (present when success is true) |
| error | string \| null | Human-readable error message (present when success is false) |
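Because data and error are mutually exclusive in practice, client code usually branches on success first. The sketch below shows one way to do that; the `summarize` helper and both sample payloads are hand-written for illustration and are not real API output.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> str:
    """Return a one-line summary of a ScrapeResponse-shaped dict."""
    if result["success"] and result["data"] is not None:
        title = result["data"].get("title") or "(no title)"
        return f"OK {result['status_code']} {result['url']}: {title}"
    return f"FAILED {result['url']}: {result['error']}"


# Hand-written sample payloads (assumed shapes, not captured responses).
ok = {
    "url": "https://example.com",
    "status_code": 200,
    "success": True,
    "data": {"title": "Example Domain"},
    "error": None,
}
bad = {
    "url": "https://example.com/missing",
    "status_code": 404,
    "success": False,
    "data": None,
    "error": "Not found",
}

print(summarize(ok))
print(summarize(bad))
```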
ExtractedData
| Field | Type | Max items | Description |
|---|---|---|---|
| title | string \| null | - | Content of the <title> tag |
| description | string \| null | - | Content of the <meta name="description"> tag |
| text_content | string \| null | 5 000 chars | Visible body text with script/style/nav/header/footer removed |
| links | array[string] | 100 | Absolute URLs from all <a href> attributes (excludes #, javascript:, mailto:) |
| images | array[string] | 50 | Absolute URLs from all <img src> attributes |
| emails | array[string] | 50 | Email addresses found on the page, including obfuscated variants such as name(at)domain.com, name[at]domain.com, name{at}domain.com, and name-at-domain.com |
| selected_elements | array[string] | 50 | Text content of elements matched by css_selector (empty when no selector was given) |
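To make the obfuscation handling concrete, here is a minimal sketch of how the four listed variants could be normalised before matching. It is an assumption about one possible approach, not the service's actual implementation; the regular expressions and the `find_emails` name are illustrative.

```python
import re
from typing import List

# The four obfuscation styles listed in the table: (at), [at], {at}, -at-
_AT_VARIANTS = re.compile(r"\s*(?:\(at\)|\[at\]|\{at\}|-at-)\s*", re.IGNORECASE)
# Deliberately simple address pattern; real-world matching is more involved.
_EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")


def find_emails(text: str, limit: int = 50) -> List[str]:
    """Return lowercased, deduplicated addresses in order of appearance."""
    normalised = _AT_VARIANTS.sub("@", text).lower()
    found: List[str] = []
    for address in _EMAIL.findall(normalised):
        if address not in found:
            found.append(address)
    return found[:limit]


sample = (
    "Write to Info(at)Example.com, sales [at] example.com "
    "or support-at-example.com; info{at}example.com again."
)
print(find_emails(sample))
```

A naive `-at-` replacement can produce false positives on ordinary hyphenated text, so a production matcher would need stricter context checks.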
POST /v1/scrape/ — Scrape a Single URL
Fetches a single web page and returns its content and extracted data.
Endpoint: POST /v1/scrape/
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | yes | - | The page URL to scrape. The protocol (https://) is added automatically if omitted. |
| extract_links | boolean | no | false | When true, the response includes all hyperlinks found on the page. |
| extract_images | boolean | no | false | When true, the response includes all image URLs found on the page. |
| extract_emails | boolean | no | false | When true, the page is scanned for email addresses, including common obfuscation patterns. |
| css_selector | string | no | null | A CSS selector (e.g. "h2.article-title") whose matching elements' text content is returned in selected_elements. |
Response
Returns a single ScrapeResponse object.
Python Example
import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
response.raise_for_status()  # surface 4xx/5xx errors before parsing
result = response.json()
if result["success"]:  # data is null when the scrape failed
    print(result["data"]["title"])
    print(result["data"]["text_content"])
else:
    print("Error:", result["error"])

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/",
    headers={"X-API-Key": API_KEY},
    json={
        "url": "https://example.com/contact",
        "extract_links": True,
        "extract_images": True,
        "extract_emails": True,
        "css_selector": "h2, p.intro",
    },
)
response.raise_for_status()
result = response.json()
if result["success"]:
    data = result["data"]
    print("Title      :", data["title"])
    print("Description:", data["description"])
    print("Links      :", data["links"])
    print("Images     :", data["images"])
    print("Emails     :", data["emails"])
    print("Selected   :", data["selected_elements"])
else:
    print("Error:", result["error"])
POST /v1/scrape/batch — Scrape Multiple URLs
Scrapes up to 50 URLs in a single request, fetching up to 10 pages in parallel. Use it for a known list of pages when you do not want to write your own concurrency logic.
Endpoint: POST /v1/scrape/batch
Request Parameters
| Parameter | Type | Required | Default | Limits | Description |
|---|---|---|---|---|---|
| urls | array[string] | yes | - | 1–50 items | List of page URLs to scrape. The protocol is added automatically if omitted. |
| extract_links | boolean | no | false | - | When true, extract all hyperlinks from every page. |
| extract_images | boolean | no | false | - | When true, extract all image URLs from every page. |
| extract_emails | boolean | no | false | - | When true, detect email addresses on every page. |
Note: A CSS selector cannot be specified per-URL in a batch request. Use the single-URL endpoint for selector-based extraction.
Response
| Field | Type | Description |
|---|---|---|
| total | integer | Total number of URLs submitted |
| successful | integer | Number of URLs scraped without error |
| failed | integer | Number of URLs that could not be scraped |
| results | array[ScrapeResponse] | One entry per URL in the same order as the request |
Each item in results is a ScrapeResponse. Always check the success field per result — a failed URL does not cause the whole batch to fail.
Python Example
import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY = "your_api_key_here"

urls = [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={"urls": urls},
)
response.raise_for_status()  # surface 4xx/5xx errors before parsing
data = response.json()
print(f"Scraped {data['successful']}/{data['total']} pages successfully")

# --- Full options + per-result handling ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/batch",
    headers={"X-API-Key": API_KEY},
    json={
        "urls": urls,
        "extract_links": True,
        "extract_images": False,
        "extract_emails": True,
    },
)
response.raise_for_status()
batch = response.json()

# A failed URL does not fail the batch, so check each result individually.
for result in batch["results"]:
    if result["success"]:
        print(f"[OK]  {result['url']}")
        print(f"      Title : {result['data']['title']}")
        print(f"      Emails: {result['data']['emails']}")
    else:
        print(f"[ERR] {result['url']}: {result['error']}")
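Lists longer than the 50-URL limit have to be split on the client side. The following sketch shows one way to do that; `chunk_urls` and the generated URL list are illustrative and not part of the API.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 50  # documented limit for POST /v1/scrape/batch


def chunk_urls(urls: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Made-up list of 120 URLs to demonstrate the split.
all_urls = [f"https://example.com/page-{i}" for i in range(120)]
batches = list(chunk_urls(all_urls))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each chunk can then be sent as the urls field of its own batch request.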
POST /v1/scrape/crawl — Crawl a Website
Starting from a seed URL, the crawler follows links and scrapes every discovered page using a breadth-first search (BFS) strategy. You control how deep it goes, how many pages it visits, and which links it should follow or ignore.
Endpoint: POST /v1/scrape/crawl
Request Parameters
Core extraction parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | yes | - | The starting URL (seed) for the crawl. |
| extract_images | boolean | no | false | Extract image URLs from every crawled page. |
| extract_emails | boolean | no | false | Detect email addresses on every crawled page. |
| css_selector | string | no | null | CSS selector applied to every crawled page. Matched element text is returned in selected_elements. |
Note: extract_links is always true for a crawl, because the crawler relies on extracted links to discover new pages.
Crawl scope parameters
| Parameter | Type | Required | Default | Range | Description |
|---|---|---|---|---|---|
| max_depth | integer | no | 2 | 1–10 | Maximum link depth to follow from the seed URL. Depth is counted from the seed: the seed page is depth 0, pages linked directly from it are depth 1, and so on. |
| max_pages | integer | no | 50 | 1–500 | Hard cap on the total number of pages to visit. The crawl stops as soon as this limit is reached. |
| same_domain_only | boolean | no | true | - | When true (recommended), only links pointing to the same domain are followed. The www. prefix is ignored for this comparison, so www.example.com and example.com are treated as the same domain. |
| ignore_query_params | boolean | no | false | - | When true, URLs that differ only in query string (e.g. /page?lang=en vs /page?lang=de) are treated as the same page and only visited once. |
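The same_domain_only comparison can be pictured with the sketch below. It only mirrors the documented www. rule; treating other subdomains (such as blog.example.com) as different domains is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Lowercased host with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_same_domain(seed: str, candidate: str) -> bool:
    """Compare two URLs by host, ignoring the www. prefix."""
    return domain_key(seed) == domain_key(candidate)


print(is_same_domain("https://www.example.com/a", "https://example.com/b"))
# Assumption: other subdomains count as a different domain.
print(is_same_domain("https://example.com", "https://blog.example.com"))
```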
URL filtering parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| include_patterns | array[string] | no | [] | If one or more patterns are provided, only URLs whose path matches at least one pattern are crawled. Patterns use Unix shell-style wildcards (* matches any sequence of characters). Example: ["/blog/*", "/news/*"] |
| exclude_patterns | array[string] | no | [] | URLs whose path matches any of these patterns are skipped. Example: ["/admin/*", "*/login", "*/logout"] |
| exclude_file_extensions | boolean | no | true | When true, links to non-HTML resources are automatically skipped. This includes: PDFs, ZIP/RAR archives, images (.jpg, .png, .gif, .webp, .svg, …), videos, audio files, Office documents (.docx, .xlsx, .pptx), executables, and static assets (.css, .js, .json, .xml). |
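Python's standard fnmatch module implements Unix shell-style wildcards, so the filtering rules can be approximated locally to test patterns before starting a crawl. This is a sketch under assumptions: the extension set is only a subset of the list above, letting exclude patterns win over include patterns is a guess about precedence, and `should_crawl` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Sequence
from urllib.parse import urlsplit

# Partial list taken from the table above; the service's full list is longer.
SKIPPED_EXTENSIONS = (
    ".pdf", ".zip", ".rar", ".jpg", ".png", ".gif", ".webp", ".svg",
    ".docx", ".xlsx", ".pptx", ".css", ".js", ".json", ".xml",
)


def should_crawl(
    url: str,
    include_patterns: Sequence[str] = (),
    exclude_patterns: Sequence[str] = (),
    exclude_file_extensions: bool = True,
) -> bool:
    """Decide locally whether a URL's path passes the filters."""
    path = urlsplit(url).path or "/"
    if exclude_file_extensions and path.lower().endswith(SKIPPED_EXTENSIONS):
        return False
    # Assumption: an exclude match takes precedence over an include match.
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatchcase(path, pattern) for pattern in include_patterns)
    return True


print(should_crawl("https://example.com/blog/post-1", include_patterns=["/blog/*"]))
print(should_crawl("https://example.com/user/login", exclude_patterns=["*/login"]))
```

Note that in fnmatch the `*` wildcard also matches `/`, so "/blog/*" covers nested paths such as /blog/2024/post.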
Response
| Field | Type | Description |
|---|---|---|
| url | string | The seed URL the crawl started from |
| total_crawled | integer | Total number of pages visited |
| successful | integer | Pages scraped without error |
| failed | integer | Pages that could not be scraped |
| max_depth_reached | integer | The deepest level actually visited during this crawl |
| stopped_reason | string | Why the crawl ended: "completed" (no more links), "max_pages" (limit hit), or "max_depth" (depth limit hit) |
| results | array[ScrapeResponse] | Full scraping result for every visited page |
| sitemap | array[SitemapEntry] | Structural map of every discovered URL |
SitemapEntry
| Field | Type | Description |
|---|---|---|
| url | string | The page URL |
| depth | integer | Depth at which this page was found (seed = 0) |
| parent_url | string \| null | The page from which this URL was discovered |
| status_code | integer | HTTP status code returned by the target server |
| success | boolean | Whether the page was scraped successfully |
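To show how depth, parent_url, and stopped_reason relate to a breadth-first crawl, here is a small simulation over an in-memory link graph. It is a sketch, not the service's implementation: the `links` dictionary stands in for real page fetching, and the way stopped_reason is chosen here is an assumption.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_crawl(
    links: Dict[str, List[str]], seed: str, max_depth: int = 2, max_pages: int = 50
) -> Tuple[List[dict], str]:
    """Breadth-first traversal returning sitemap-like entries and a stop reason."""
    sitemap = [{"url": seed, "depth": 0, "parent_url": None}]
    seen = {seed}
    queue = deque([(seed, 0)])
    depth_limited = False
    while queue:
        url, depth = queue.popleft()
        for child in links.get(url, []):
            if child in seen:
                continue  # every URL is visited at most once
            if depth + 1 > max_depth:
                depth_limited = True  # a link was dropped for being too deep
                continue
            if len(sitemap) >= max_pages:
                return sitemap, "max_pages"
            seen.add(child)
            sitemap.append({"url": child, "depth": depth + 1, "parent_url": url})
            queue.append((child, depth + 1))
    return sitemap, ("max_depth" if depth_limited else "completed")


# Made-up link graph used only for this demonstration.
links = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/c", "https://example.com/"],
    "https://example.com/c": ["https://example.com/d"],
}
sitemap, reason = bfs_crawl(links, "https://example.com/", max_depth=2)
for entry in sitemap:
    print(entry["depth"], entry["url"], "<-", entry["parent_url"])
print("stopped_reason:", reason)
```

Because the traversal is breadth-first, a page reachable by several paths is recorded at its shallowest depth, with the first page that linked to it as parent_url.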
Python Example
import requests

BASE_URL = "https://api.rl-analytix.de"
API_KEY = "your_api_key_here"

# --- Minimal call ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={"url": "https://example.com"},
)
response.raise_for_status()  # surface 4xx/5xx errors before parsing
crawl = response.json()
print(f"Crawled {crawl['total_crawled']} pages, stopped because: {crawl['stopped_reason']}")

# --- Full options ---
response = requests.post(
    f"{BASE_URL}/v1/scrape/crawl",
    headers={"X-API-Key": API_KEY},
    json={
        "url": "https://example.com",
        "extract_images": False,
        "extract_emails": True,
        "css_selector": "h1, h2",
        "max_depth": 3,
        "max_pages": 100,
        "same_domain_only": True,
        "ignore_query_params": True,
        "include_patterns": ["/blog/*", "/news/*"],
        "exclude_patterns": ["/admin/*", "*/login", "*/tag/*"],
        "exclude_file_extensions": True,
    },
)
response.raise_for_status()
crawl = response.json()

# Summary
print(f"Pages visited : {crawl['total_crawled']}")
print(f"Successful    : {crawl['successful']}")
print(f"Failed        : {crawl['failed']}")
print(f"Max depth hit : {crawl['max_depth_reached']}")
print(f"Stop reason   : {crawl['stopped_reason']}")
print()

# All discovered emails across the entire site
all_emails = set()
for page in crawl["results"]:
    # data is null for failed pages, so check success first
    if page["success"] and page["data"]["emails"]:
        all_emails.update(page["data"]["emails"])
print("Emails found:", sorted(all_emails))

# Sitemap: print an indented tree, one level of indent per depth
print("\nSitemap:")
for entry in crawl["sitemap"]:
    indent = "  " * entry["depth"]
    status = "OK " if entry["success"] else "ERR"
    print(f"{indent}[{status}] {entry['url']} (depth={entry['depth']})")
Error Reference
| HTTP Status | Meaning | Common Cause |
|---|---|---|
| 400 Bad Request | Invalid input | Malformed URL, parameter out of allowed range, private/localhost IP blocked by SSRF protection |
| 401 Unauthorized | Authentication failed | Missing or invalid X-API-Key |
| 402 Payment Required | Quota exhausted | API key has no remaining token balance |
| 422 Unprocessable Entity | Validation error | Request body does not match the expected schema |
| 429 Too Many Requests | Rate limit exceeded | Too many requests in a short period |
| 502 Bad Gateway | Scraping failed | Target server unreachable, returned an error, or timed out (30 s limit per page) |
| 503 Service Unavailable | Auth service down | The authentication service is temporarily unavailable |
Error responses include a JSON body with a detail field describing the problem:
{
  "detail": "URL targets a private IP address range and cannot be scraped."
}
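A client can turn such a body into a readable message with a few lines of code. The sketch below assumes only what is stated above (a JSON object with a detail field) and falls back to the raw text when the body is not JSON; `describe_error` is a hypothetical helper name.

```python
import json


def describe_error(status_code: int, body_text: str) -> str:
    """Format an error response, preferring the JSON `detail` field."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        # Not JSON (for example an HTML error page from a proxy).
        return f"HTTP {status_code}: {body_text.strip() or 'no response body'}"
    detail = body.get("detail") if isinstance(body, dict) else None
    if detail is None:
        return f"HTTP {status_code}: {body_text.strip()}"
    return f"HTTP {status_code}: {detail}"


print(describe_error(
    400,
    '{"detail": "URL targets a private IP address range and cannot be scraped."}',
))
print(describe_error(503, ""))
```

With the requests library this would typically be called as `describe_error(response.status_code, response.text)` whenever the status code is 400 or higher.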
Limits & Behaviour Notes
| Item | Limit / Behaviour |
|---|---|
| Batch size | Maximum 50 URLs per batch request |
| Crawl depth | 1–10 levels (default: 2) |
| Crawl pages | 1–500 pages per crawl (default: 50) |
| Concurrency | Up to 10 pages fetched in parallel (applies to both batch and crawl) |
| Request timeout | 30 seconds per individual page request |
| Crawl delay | 100 ms between requests to avoid overwhelming target servers |
| Text content | Truncated at 5 000 characters |
| Links per page | Capped at 100 |
| Images per page | Capped at 50 |
| Emails per page | Capped at 50, deduplicated and lowercased |
| CSS selector results | Capped at 50 elements |
| SSRF protection | URLs pointing to localhost, loopback addresses (127.*), or RFC 1918 private networks (10.*, 172.16–31.*, 192.168.*) are rejected with HTTP 400 |
| URL normalisation | Fragments (#section) are stripped before deduplication; query parameters are optionally stripped during crawls (ignore_query_params) |
| Protocol | https:// is automatically prepended to URLs that have no scheme |
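The URL normalisation, protocol, and SSRF rows can be approximated on the client to catch bad input before spending a request. This is a sketch under assumptions, not the service's code: the helper names are hypothetical, hostnames are not resolved via DNS, and Python's `is_private` flag covers somewhat more ranges than the RFC 1918 networks listed in the table.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str, ignore_query_params: bool = False) -> str:
    """Prepend https:// if no scheme is given, drop the fragment, optionally the query."""
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    query = "" if ignore_query_params else parts.query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def is_blocked_host(host: str) -> bool:
    """Rough local pre-check for the documented SSRF rules."""
    if host.lower() == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname; this sketch does not resolve DNS
    # is_private is broader than RFC 1918 (it also covers link-local, etc.).
    return address.is_loopback or address.is_private


print(normalise_url("example.com/docs#intro"))
print(normalise_url("http://example.com/page?lang=en#top", ignore_query_params=True))
print(is_blocked_host("192.168.1.10"), is_blocked_host("8.8.8.8"))
```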