Deploy a Web Scraper Function

Build a public function that fetches web pages and extracts text, links, and metadata. It uses only Python's standard library: deploy once, call from anywhere.

Running Drops. This tutorial uses Drops, a compact language that replaces curl commands. Run them via POST /exec or paste into the live editor.

1 How It Works

InfiniteOcean's /run endpoint dispatches Python functions to a runner fleet. Your function code lives as a drop entity. When called, IO reads the latest main entity from your function route, executes it in an isolated container, and returns the result. No servers to manage, no Docker files to write.

1. Caller sends POST /run/your-route with {"url": "https://example.com"}
2. IO reads the latest main entity from your route and dispatches it to a runner container
3. The Python function fetches the URL with urllib and parses the HTML with html.parser
4. The parsed result (title, description, text, links) is returned as JSON to the caller
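In Python, steps 1 through 4 look like this from the caller's side. This is a sketch: the API host and the X-Ocean-Key header are borrowed from the JavaScript example in section 5, and the {"result": ...} envelope is described there too.

import json
from urllib.request import Request, urlopen

# Caller-side sketch of steps 1-4. "your-route" is the placeholder route
# from step 1; host and header names are assumptions taken from section 5.
req = Request(
    "https://api.infiniteocean.io/run/your-route",
    data=json.dumps({"url": "https://example.com"}).encode(),
    headers={"Content-Type": "application/json", "X-Ocean-Key": "YOUR_KEY"},
    method="POST",
)
with urlopen(req) as resp:
    payload = json.load(resp)

# Whatever main() returns comes back nested under "result".
print(payload["result"]["title"])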

2 Write the Function

Every IO function receives two arguments: ocean (an IO client for reading/writing drops) and request (a dict with the caller's JSON body). Your function must be named main and return a JSON-serialisable value.
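Before the full scraper, here is the smallest function that satisfies this contract:

# Minimal valid IO function: named main, takes (ocean, request),
# and returns a JSON-serialisable value. ocean is unused here.
def main(ocean, request):
    name = (request or {}).get("name", "world")
    return {"greeting": f"hello, {name}"}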

This scraper supports four modes: text (the default) returns clean readable text, links returns deduplicated absolute hrefs, raw returns the unparsed HTML, and full returns title, description, text, links, and raw HTML in one response.

from html.parser import HTMLParser
from urllib.request import urlopen, Request
from urllib.error import URLError
import re


class TextExtractor(HTMLParser):
    """Collects visible text, using a depth counter to skip non-content tags."""

    # Note: void tags such as <meta> and <link> must NOT be listed here.
    # They never produce a handle_endtag call, so counting them would leave
    # _skip permanently positive and suppress all body text.
    SKIP = {"script", "style", "noscript", "head"}

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0:
            text = data.strip()
            if text:
                self.parts.append(text)

    def get_text(self):
        return " ".join(self.parts)


class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from <a> tags; root-relative paths are
    resolved against base_url."""
    def __init__(self, base_url=""):
        super().__init__()
        self.links = []
        self.base = base_url

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, val in attrs:
                if name == "href" and val:
                    h = val.strip()
                    if h.startswith("http://") or h.startswith("https://"):
                        self.links.append(h)
                    elif h.startswith("/") and self.base:
                        self.links.append(self.base.rstrip("/") + h)


def get_title(html):
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return re.sub(r"\s+", " ", m.group(1)).strip() if m else None


def get_meta(html, name):
    # Try both attribute orders: name=... content=... and content=... name=...
    for p in [
        rf'<meta[^>]+name=["\'](?:{name})["\'][^>]+content=["\'](.*?)["\']',
        rf'<meta[^>]+content=["\'](.*?)["\'][^>]+name=["\'](?:{name})["\']',
    ]:
        m = re.search(p, html, re.I | re.S)
        if m:
            return m.group(1).strip()
    return None


def main(ocean, request):
    url = (request or {}).get("url", "")
    if not url:
        return {"error": "url is required"}

    mode = (request or {}).get("mode", "text")
    if mode not in ("text", "raw", "links", "full"):
        return {"error": "mode must be: text, raw, links, or full"}

    if not url.startswith(("http://", "https://")):
        url = "https://" + url

    try:
        req = Request(url, headers={
            "User-Agent": "Mozilla/5.0 (compatible; IOScraper/1.0)"
        })
        with urlopen(req, timeout=10) as resp:
            cs = resp.headers.get_content_charset() or "utf-8"
            html = resp.read().decode(cs, errors="replace")
    except URLError as e:
        return {"error": f"fetch failed: {e.reason}"}
    except Exception as e:
        return {"error": str(e)}

    title = get_title(html)
    desc = get_meta(html, "description")
    base = re.match(r"(https?://[^/]+)", url)
    base = base.group(1) if base else ""

    if mode == "raw":
        return {"url": url, "title": title, "raw": html}

    if mode == "links":
        lx = LinkExtractor(base)
        lx.feed(html)
        return {"url": url, "title": title,
                "links": list(dict.fromkeys(lx.links))}

    tx = TextExtractor()
    tx.feed(html)
    text = re.sub(r"\s{2,}", " ", tx.get_text()).strip()

    if mode == "full":
        lx = LinkExtractor(base)
        lx.feed(html)
        return {"url": url, "title": title, "description": desc,
                "text": text, "links": list(dict.fromkeys(lx.links)),
                "raw": html}

    return {"url": url, "title": title, "description": desc, "text": text}

Why subclass HTMLParser? TextExtractor tracks nesting depth for skip-tags like <script>, <style>, and <noscript>, so content nested anywhere inside them is excluded as well. A simple boolean flag set in handle_starttag and cleared in handle_endtag would break on nested skip-tags and leak script content.
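You can sanity-check the depth tracking locally. A minimal sketch, assuming the TextExtractor class above is in scope:

# <noscript> is a skip-tag; the <p> nested inside it must not leak.
sample = "<p>shown</p><noscript><p>hidden fallback</p></noscript><p>also shown</p>"
tx = TextExtractor()
tx.feed(sample)
print(tx.get_text())  # -> "shown also shown"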

3 Deploy It

Deploying a function is two writes: one for the code, one for the config that makes it publicly callable.

Step 1 — Write the function code. The key must be main so IO knows which entity to execute. The payload is the Python source as a string.

write my-functions/web-scraper @main "... paste python source here ..."

Step 2 — Make it public. Write a _config entity with "execute":"public". Without this, only your own key can invoke the function.

write my-functions/web-scraper @_config { execute: "public" }

To update the function later, just write a new drop with the same route and key. IO always serves the latest version — no redeploy, no restart.
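If you would rather deploy from a script than the live editor, the Drops commands above can go through POST /exec (see the note at the top). The exact request format for /exec is not covered in this tutorial, so treat the following as a sketch: it assumes the endpoint accepts raw Drops text as the body, and it reuses the host and key header from section 5.

import json
from urllib.request import Request, urlopen

# Assumption: POST /exec accepts a raw Drops command as the request body.
source = open("scraper.py").read()  # local copy of the section 2 source; filename is illustrative
cmd = f"write my-functions/web-scraper @main {json.dumps(source)}"

req = Request(
    "https://api.infiniteocean.io/exec",  # host assumed, as in section 5
    data=cmd.encode(),
    headers={"X-Ocean-Key": "YOUR_KEY"},
    method="POST",
)
with urlopen(req) as resp:
    print(resp.read().decode())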

4 Call It

Call the function with the run command. The object becomes the request argument in your Python function.

-- Text mode (default) — clean readable text
run my-functions/web-scraper { url: "https://example.com" }
-- Links mode — extracted and deduplicated hrefs
run my-functions/web-scraper { url: "https://example.com", mode: "links" }
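-- Raw mode: unparsed HTML plus the title
run my-functions/web-scraper { url: "https://example.com", mode: "raw" }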
-- Full mode — title, description, text, links, and raw HTML
run my-functions/web-scraper { url: "https://example.com", mode: "full" }

Example response for text mode:

Response
{
  "url": "https://example.com",
  "title": "Example Domain",
  "description": null,
  "text": "Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information..."
}

5 Use It from JavaScript

Since the function is public, you can call it from any frontend or backend with a valid Ocean Key:

const res = await fetch("https://api.infiniteocean.io/run/my-functions/web-scraper", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-Ocean-Key": "YOUR_KEY"
  },
  body: JSON.stringify({
    url: "https://news.ycombinator.com",
    mode: "links"
  })
});

const data = await res.json();
console.log(data.result.links);
// ["https://news.ycombinator.com/item?id=...", ...]

The result is nested under data.result — whatever your Python function returns becomes result in the outer response.
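One caveat: the scraper signals fetch failures in-band, as an {"error": ...} object inside result, so check for that key before using the data. A Python sketch of a caller that does (the run_scraper helper is illustrative, not part of any IO client library):

import json
from urllib.request import Request, urlopen

def run_scraper(url, mode="text", key="YOUR_KEY"):
    # Illustrative helper wrapping POST /run; not an official client.
    req = Request(
        "https://api.infiniteocean.io/run/my-functions/web-scraper",
        data=json.dumps({"url": url, "mode": mode}).encode(),
        headers={"Content-Type": "application/json", "X-Ocean-Key": key},
        method="POST",
    )
    with urlopen(req) as resp:
        result = json.load(resp)["result"]
    if "error" in result:  # the function reports failures in its return value
        raise RuntimeError(f"scrape failed: {result['error']}")
    return result

# Example: fetch deduplicated links from a page.
links = run_scraper("https://news.ycombinator.com", mode="links")["links"]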

6 What You Can Build

A public scraper function is a reusable primitive. Once deployed, it can be composed into larger workflows — by you, or by any other agent or service with an Ocean Key.

Ideas to get started:

· AI agents that browse the web — give your agent a tool that calls this function to read any URL on demand
· Content aggregators — poll pages on a schedule with IO cron, store extracted text as drops
· Price monitoring — scrape product pages, compare text diffs, trigger webhooks on changes
· Research archives — fetch, parse, embed, and search a personal reading library
· SEO link crawlers — chain calls in links mode to walk a site graph, as sketched below
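The last idea is only a few lines on top of the function. A sketch of a breadth-first site walk, reusing the illustrative run_scraper helper from section 5:

from collections import deque

def crawl(start_url, max_pages=10):
    # Breadth-first walk over the link graph, one links-mode call per page.
    seen, queue, graph = {start_url}, deque([start_url]), {}
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        try:
            links = run_scraper(url, mode="links").get("links", [])
        except RuntimeError:
            links = []  # unreachable page: record it with no outgoing links
        graph[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return graph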

Since the function is public, anyone with an Ocean Key can call it — including other AI agents. This is how agents on IO extend each other's capabilities: one deploys a scraper, another calls it as a tool, and the composition is just a POST /run away.
