Getting started with the Spider API


API built to scale

Our services can handle thousands of requests per second, with systems set to scale elastically. Request latency stays roughly the same regardless of the response being processed. Navigate to the API docs to learn more about the allowed request params. We do not offer any client libraries since our API is simple enough to use as is; this may change in the future.

API Usage

Getting started with the API is simple and straightforward. After you get your secret key, you can access our instance directly. We have one main endpoint, /crawl, that handles all things related to data curation. The crawler is highly configurable through the params to fit all needs.
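
The snippet below is a minimal sketch of a request to the /crawl endpoint using Python's requests library. It assumes the same base URL and Authorization header used in the streaming example further down, with your secret key exported as SPIDER_API_KEY.

import requests, os

# Authenticate with the secret key and send JSON.
headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# The JSON body configures the crawl; here only the target url is set.
json_data = {"url": "http://www.example.com"}

response = requests.post('https://spider.a11ywatch.com/crawl',
  headers=headers,
  json=json_data)

# Parse the completed payload (assumed here to be a single JSON document).
print(response.json())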

Crawling One Page

In most cases you probably just want to crawl one page. Even if you only need one page, our system performs fast enough to lead the race. The most straightforward way to make sure you only crawl a single page is to set the budget param with the wildcard key * to 1. You can also pass the limit param in the JSON body with the maximum number of pages.
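
As a sketch, reusing the request setup from the example above, the JSON body for a single-page crawl could look like either of the following. The exact shape of the budget param is an assumption based on the description above.

# Cap all paths ("*") at 1 page via the budget param (assumed shape).
json_data = {"budget": {"*": 1}, "url": "http://www.example.com"}

# Or equivalently, cap the total page count directly with limit.
json_data = {"limit": 1, "url": "http://www.example.com"}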

Crawling Multiple Pages

When you crawl multiple pages, the concurrency horsepower of Spider kicks in. You might wonder how one request can take (x)ms to come back while 100 requests take about the same time. The isolated concurrency built in allows crawling thousands to millions of pages in very little time. It is currently the only solution that can handle large websites with over 100k pages within a minute or two (sometimes even a blink or two). By default we do not add any limits to crawls unless specified.
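
A multi-page crawl simply raises (or omits) the limit. The sketch below reuses the request setup from the first example and caps the crawl at 100 pages.

# Cap the crawl at 100 pages; omit limit entirely to crawl with no cap.
json_data = {"limit": 100, "url": "http://www.example.com"}

response = requests.post('https://spider.a11ywatch.com/crawl',
  headers=headers,
  json=json_data)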

Planet Scale Crawling

If you plan on processing crawls that have over 200 pages, we recommend you stream the request from the client instead of parsing the entire payload once finished. We have an example of this in Python on the API docs page, also shown below.

import requests, os, json

# Authenticate with the secret key and send JSON.
headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

# Crawl up to 250 pages of the target site.
json_data = {"limit":250,"url":"http://www.example.com"}

# stream=True keeps the connection open so results can be read as they arrive.
response = requests.post('https://spider.a11ywatch.com/crawl',
  headers=headers,
  json=json_data,
  stream=True)

# Each non-empty line of the response is a JSON object for a crawled page.
for line in response.iter_lines():
  if line:
      print(json.loads(line))

Automatic Configuration

Spider handles concurrency and IP rotation automatically to make it simple to curate data. The more credits or usage you have available, the higher your concurrency limit.
