API Reference

The Spider API is based on REST. Our API is predictable, returns JSON-encoded responses, and uses standard HTTP response codes, authentication, and verbs. Set your API secret key in the Authorization header to get started. You can set the Content-Type header to application/json, application/xml, text/csv, or application/jsonl to shape the response format.

The Spider API supports multi-domain actions. You can work with multiple domains per request by passing the URLs as a comma-separated list.

The Spider API differs for every account as we release new versions and tailor functionality. You can add v1 before any path to pin your requests to that version.
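As a minimal sketch of these options together, the request below crawls two domains in one call and pins the path to v1. The domains are placeholders; per the note above, swap the Content-Type value for application/xml, text/csv, or application/jsonl to change the response format.

import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    # Swap for application/xml, text/csv, or application/jsonl to shape the response.
    'Content-Type': 'application/json',
}

# Two domains in one request, comma separated (placeholder URLs).
json_data = {"limit": 5, "url": "https://www.example.com,https://www.example.org"}

# The v1 prefix pins the request to that API version.
response = requests.post('https://api.spider.cloud/v1/crawl',
  headers=headers,
  json=json_data)

print(response.json())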

Just getting started?

Check out our development quickstart guide.

Not a developer?

Use Spider's no-code options or apps to get started with Spider and do more with your Spider account, no code required.

Base URL
https://api.spider.cloud

Crawl websites

Start crawling one or more websites to collect resources.

POST https://api.spider.cloud/crawl

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

  • request string

    The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default and switch to JavaScript rendering when the HTML requires it.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

  • depth number

    The maximum crawl depth. If zero, no limit is applied.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":50,"url":"http://www.example.com"}

response = requests.post('https://api.spider.cloud/crawl', 
  headers=headers, 
  json=json_data)

print(response.json())
Response
[
  {
    "content": "<html>...",
    "error": null,
    "status": 200,
    "url": "http://www.example.com"
  },
  // more content...
]
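The response is a list of page objects, so post-processing is straightforward. Below is a minimal sketch that saves the HTML of each successfully crawled page; the filename scheme is purely illustrative.

import re

# Assuming `response` is the requests.Response from the example above.
pages = response.json()

for page in pages:
    if page["status"] == 200 and page["content"]:
        # Derive a crude filename from the URL (illustrative only).
        name = re.sub(r"[^a-zA-Z0-9]+", "_", page["url"]).strip("_") or "page"
        with open(f"{name}.html", "w", encoding="utf-8") as f:
            f.write(page["content"])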

Crawl websites and get links

Start crawling one or more websites to collect the links found.

POST https://api.spider.cloud/links

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

  • request string

    The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default and switch to JavaScript rendering when the HTML requires it.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

  • depth number

    The maximum crawl depth. If zero, no limit is applied.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":50,"url":"http://www.example.com"}

response = requests.post('https://api.spider.cloud/links', 
  headers=headers, 
  json=json_data)

print(response.json())
Response
[
  {
    "content": "",
    "error": null,
    "status": 200,
    "url": "http://www.example.com"
  },
  // more content...
]
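To gather the discovered links, iterate the response items. A minimal sketch, assuming each item's url field holds a link found during the crawl:

# Assuming `response` is the requests.Response from the example above,
# and that each item's "url" field is a link found during the crawl.
items = response.json()

links = sorted({item["url"] for item in items if item.get("error") is None})
for link in links:
    print(link)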

Screenshot websites

Start taking screenshots of one or more websites to collect images as base64 or binary.

POST https://api.spider.cloud/screenshot

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

  • request string

    The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default and switch to JavaScript rendering when the HTML requires it.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

  • depth number

    The maximum crawl depth. If zero, no limit is applied.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":50,"url":"http://www.example.com"}

response = requests.post('https://api.spider.cloud/screenshot', 
  headers=headers, 
  json=json_data)

print(response.json())
Response
[
  {
    "content": "base64...",
    "error": null,
    "status": 200,
    "url": "http://www.example.com"
  },
  // more content...
]
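Since content is returned as base64 by default, decoding it yields the raw image bytes. A minimal sketch that writes each screenshot to disk; the .png extension is an assumption about the image format.

import base64

# Assuming `response` is the requests.Response from the example above.
shots = response.json()

for i, shot in enumerate(shots):
    if shot["status"] == 200 and shot["content"]:
        image_bytes = base64.b64decode(shot["content"])
        # The PNG extension is an assumption; adjust to the format actually returned.
        with open(f"screenshot_{i}.png", "wb") as f:
            f.write(image_bytes)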

Pipelines

Create powerful workflows with our pipeline API endpoints. Use AI to easily extract contacts from any website or filter links with prompts.

Crawl websites and extract contacts

Start crawling one or more websites to collect contacts using AI. A minimum of $25 in credits is required for extraction.

POST https://api.spider.cloud/pipeline/extract-contacts

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

  • request string

    The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default and switch to JavaScript rendering when the HTML requires it.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

  • depth number

    The maximum crawl depth. If zero, no limit is applied.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":50,"url":"http://www.example.com"}

response = requests.post('https://api.spider.cloud/pipeline/extract-contacts', 
  headers=headers, 
  json=json_data)

print(response.json())
Response
[
  {
    "content": [{ "full_name": "John Doe", "email": "johndoe@gmail.com", "phone": "555-555-555", "title": "Baker"}, ...],
    "error": null,
    "status": 200,
    "url": "http://www.example.com"
  },
  // more content...
]
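The content field holds the extracted contact records, so they can be aggregated directly. A minimal sketch that collects unique email addresses across all crawled pages, assuming the record shape shown in the sample response:

# Assuming `response` is the requests.Response from the example above.
results = response.json()

emails = set()
for result in results:
    for contact in result.get("content") or []:
        if contact.get("email"):
            emails.add(contact["email"].lower())

print(sorted(emails))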

Label website

Crawl a website and accurately categorize it using AI.

POST https://api.spider.cloud/pipeline/label

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

  • request string

    The request type to perform. Possible values are http, chrome, and smart. Use smart to perform an HTTP request by default and switch to JavaScript rendering when the HTML requires it.

  • limit number

    The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

  • depth number

    The maximum crawl depth. If zero, no limit is applied.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"limit":50,"url":"http://www.example.com"}

response = requests.post('https://api.spider.cloud/pipeline/label', 
  headers=headers, 
  json=json_data)

print(response.json())
Response
[
  {
    "content": ["Government"],
    "error": null,
    "status": 200,
    "url": "http://www.example.com"
  },
  // more content...
]

Queries

Query the data that you collect. Add dynamic filters for extracting exactly what is needed.

Crawl State

Get the state of the crawl for the domain.

POST https://api.spider.cloud/crawl/status

Request body

  • url required string

    The URI resource to crawl. This can be a comma-separated list for multiple URLs.

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {"url": "http://www.example.com"}

response = requests.post('https://api.spider.cloud/crawl/status',
  headers=headers,
  json=json_data)

print(response.json())
Response
{
    "data": {
      "id": "195bf2f2-2821-421d-b89c-f27e57ca71fh",
      "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
      "domain": "example.com",
      "url": "https://example.com/",
      "links": 1,
      "credits_used": 3,
      "mode": 2,
      "crawl_duration": 340,
      "message": null,
      "request_user_agent": "Spider",
      "level": "info",
      "status_code": 0,
      "created_at": "2024-04-21T01:21:32.886863+00:00",
      "updated_at": "2024-04-21T01:21:32.886863+00:00"
    },
    "error": ""
  }

Credits Available

Get the remaining credits available.

GET https://api.spider.cloud/data/credits

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/credits',
  headers=headers)

print(response.json())
Response
{
    "data": {
      "id": "8d662167-5a5f-41aa-9cb8-0cbb7d536891",
      "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg",
      "credits": 53334,
      "created_at": "2024-04-21T01:21:32.886863+00:00",
      "updated_at": "2024-04-21T01:21:32.886863+00:00"
    }
  }

Websites Collection

Get the websites stored.

GET https://api.spider.cloud/data/websites

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/websites',
  headers=headers)

print(response.json())
Response
{
 "data": [
  {
    "id": "2a503c02-f161-444b-b1fa-03a3914667b6",
    "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfd",
    "url": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfd/example.com/index.html",
    "domain": "example.com",
    "created_at": "2024-04-18T15:40:25.667063+00:00",
    "updated_at": "2024-04-18T15:40:25.667063+00:00",
    "pathname": "/",
    "fts": "",
    "scheme": "https:",
    "last_checked_at": "2024-05-10T13:39:32.293017+00:00",
    "screenshot": null
  }
 ] 
} 

Pages Collection

Get the pages/resources stored.

GET https://api.spider.cloud/data/pages

Example request
import requests, os

headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

response = requests.get('https://api.spider.cloud/data/pages',
  headers=headers)

print(response.json())
Response
{
  "data": [
    {
      "id": "733b0d0f-e406-4229-949d-8068ade54752",
      "user_id": "6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfd",
      "url": "https://www.example.com",
      "domain": "www.example.com",
      "created_at": "2024-04-17T01:28:15.016975+00:00",
      "updated_at": "2024-04-17T01:28:15.016975+00:00",
      "proxy": true,
      "headless": true,
      "crawl_budget": null,
      "scheme": "https:",
      "last_checked_at": "2024-04-17T01:28:15.016975+00:00",
      "full_resources": false,
      "metadata": true,
      "gpt_config": null,
      "smart_mode": false,
      "fts": "'www.example.com':1"
    }
  ]
}