Filtering links with AI

The Spider API enables unique workflows for almost any AI data-curation step. A common pattern in AI pipelines is to store a set of links belonging to a website and then check whether each page is relevant to a given topic. With Spider, we can collect metadata from the website and use AI to decide whether each linked page is relevant. If you enabled metadata collection when crawling the website before filtering, the AI will use that metadata to help determine relevance.

import requests, os, json

# The API key is read from the environment; metadata should be
# enabled on a prior crawl for best filtering results.
headers = {
    'Authorization': os.environ["SPIDER_API_KEY"],
    'Content-Type': 'application/json',
}

json_data = {
    "limit": 1,
    "url": ["http://www.example.com/contacts", "http://www.example.com/health"],
    "model": "gpt-4-1106-preview",
    "prompt": "Use only links that relate to Health and Medicine.",
}

response = requests.post(
    'https://spider.a11ywatch.com/pipeline/filter-links',
    headers=headers,
    json=json_data,
    stream=True,
)

# The endpoint streams newline-delimited JSON; skip empty keep-alive lines.
for line in response.iter_lines():
    if line:
        # e.g. http://www.example.com/health
        print(json.loads(line))
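Because the endpoint streams newline-delimited JSON, it can help to factor the line handling into a small helper that collects the filtered links into a list. A minimal sketch is below; the sample byte strings stand in for streamed chunks and are hypothetical, not real API output.

```python
import json

def parse_filtered_links(lines):
    """Decode a newline-delimited JSON stream of filtered links,
    skipping empty keep-alive chunks."""
    links = []
    for line in lines:
        if line:  # empty chunks carry no data
            links.append(json.loads(line))
    return links

# Hypothetical sample chunks, mimicking response.iter_lines() output
sample = [b'"http://www.example.com/health"', b'']
print(parse_filtered_links(sample))
# ['http://www.example.com/health']
```

Keeping the parsing separate from the request makes the stream handling easy to unit-test without hitting the API.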
