Hosted Puppeteer (Browserless)

Some websites are partially or entirely rendered on the client, that is, in your web browser. If you search the initial HTML for elements that are only created during client-side rendering, you won’t find them.

One solution is a headless browser: a web browser that runs in the background without a visible window, fetches the page, renders it, and then lets you search the final document.

Headless browsers aren’t a good fit for Val Town because of the resources they require to run. However, services like Browserless provide APIs for interacting with a hosted headless browser, such as their /scrape API. Here’s how to use Browserless and Val Town to load a webpage.

Sign up to Browserless and grab your API Key

Copy your API key from https://cloud.browserless.io/account/ and save it as a Val Town environment variable named browserless.
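
If the environment variable is missing, the request URL ends up containing the literal text "undefined" as its token, and the failure only surfaces later as a rejected API call. A minimal sketch of a guard that fails early instead; requireToken is a hypothetical helper name, not part of Val Town or Browserless:

```typescript
// Hypothetical helper: fail fast with a readable message when the key is absent.
function requireToken(value: string | undefined): string {
  if (value === undefined || value.trim() === "") {
    throw new Error(
      'Missing Browserless API key: set the "browserless" environment variable.',
    );
  }
  // Trim stray whitespace that can sneak in when pasting the key.
  return value.trim();
}

// In a val you would call it like this:
// const token = requireToken(Deno.env.get("browserless"));
```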


Make an API call to the /scrape API

Check the documentation for the /scrape API and form your request.

For example, here’s how to scrape the introduction paragraph of OpenAI’s Wikipedia page.

import { fetchJSON } from "https://esm.town/v/stevekrouse/fetchJSON?v=41";

const res = await fetchJSON(
  `https://chrome.browserless.io/scrape?token=${Deno.env.get("browserless")}`,
  {
    method: "POST",
    body: JSON.stringify({
      "url": "https://en.wikipedia.org/wiki/OpenAI",
      "elements": [{
        // The second <p> element on the page
        "selector": "p:nth-of-type(2)",
      }],
    }),
  },
);
// For this request, Browserless returns one data item
const data = res.data;
// That item contains the elements matched by our selector
const elements = data[0].results;
// We want the text content of the first matching element
const intro = elements[0].text;
// Top-level `return` is not valid in a module, so log the result instead
console.log(intro);
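
The example above indexes straight into data[0].results[0], which throws if the selector matches nothing, for instance after a page layout change. Below is a minimal sketch of a more defensive extraction. It assumes only the response shape used above (data, results, text); firstText and the ScrapeResponse type are hypothetical names introduced for illustration.

```typescript
// Response shape as used in the example above. Any other fields that
// Browserless returns are omitted here.
interface ScrapeResponse {
  data?: Array<{ results?: Array<{ text?: string }> }>;
}

// Hypothetical helper: text of the first match for the first selector,
// or null when the selector matched nothing.
function firstText(res: ScrapeResponse): string | null {
  const text = res.data?.[0]?.results?.[0]?.text;
  return typeof text === "string" ? text.trim() : null;
}

// Usage with the response from the example above:
// const intro = firstText(res) ?? "(no matching element)";
```

Returning null rather than throwing leaves it to the caller to decide whether a missing element is an error or an expected case.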

Browserless also provides APIs for taking screenshots and generating PDFs of websites.
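
As a rough sketch, a screenshot request can be built the same way as the /scrape request. The /screenshot path and the options fields (fullPage, type) are assumptions based on the Browserless documentation at the time of writing, so check the current docs before relying on them. buildScreenshotRequest is a hypothetical helper name.

```typescript
interface BrowserlessRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the endpoint and fetch options for a screenshot.
// Assumes the same host and token query parameter as the /scrape example.
function buildScreenshotRequest(token: string, url: string): BrowserlessRequest {
  return {
    endpoint: `https://chrome.browserless.io/screenshot?token=${encodeURIComponent(token)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Assumed body shape: the page URL plus screenshot options.
      body: JSON.stringify({ url, options: { fullPage: true, type: "png" } }),
    },
  };
}

// Usage in a val (makes a network call, so it is left commented out):
// const { endpoint, init } = buildScreenshotRequest(
//   Deno.env.get("browserless")!,
//   "https://en.wikipedia.org/wiki/OpenAI",
// );
// const png = new Uint8Array(await (await fetch(endpoint, init)).arrayBuffer());
```

Unlike /scrape, the response here is expected to be binary image data rather than JSON, which is why the usage reads it with arrayBuffer().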

Alternatively, use Puppeteer and a browser running on Browserless

You can use the Puppeteer library to connect to a browser instance running on Browserless.

Once you’ve navigated to a page, you can run arbitrary JavaScript with page.evaluate, for example to get the text of a paragraph.

import { PuppeteerDeno } from "https://deno.land/x/puppeteer@16.2.0/src/deno/Puppeteer.ts";

const puppeteer = new PuppeteerDeno({
  productName: "chrome",
});
const browser = await puppeteer.connect({
  browserWSEndpoint: `wss://chrome.browserless.io?token=${Deno.env.get("browserless")}`,
});
const page = await browser.newPage();
await page.goto("https://en.wikipedia.org/wiki/OpenAI");
const intro = await page.evaluate(
  `document.querySelector('p:nth-of-type(2)').innerText`
);
await browser.close();
console.log(intro);

"OpenAI is an American artificial intelligence (AI) research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership. OpenAI conducts AI research with the declared intention of promoting and developing friendly AI."

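In the Puppeteer example above, browser.close() only runs if every earlier step succeeds. If page.goto or page.evaluate throws, the remote session may stay open until Browserless times it out. A minimal sketch of a wrapper that always closes the browser; withBrowser and Closable are hypothetical names, and the wrapper assumes only that the browser object has an async close() method:

```typescript
// Minimal structural type: anything with an async close() works, including
// the browser object returned by puppeteer.connect().
interface Closable {
  close(): Promise<void>;
}

// Hypothetical helper: connect, run the work, and always close the browser,
// even when the work throws.
async function withBrowser<B extends Closable, T>(
  connect: () => Promise<B>,
  work: (browser: B) => Promise<T>,
): Promise<T> {
  const browser = await connect();
  try {
    return await work(browser);
  } finally {
    await browser.close();
  }
}

// Usage with the example above (makes a network call, so it is commented out):
// const intro = await withBrowser(
//   () => puppeteer.connect({ browserWSEndpoint: "wss://..." }),
//   async (browser) => {
//     const page = await browser.newPage();
//     await page.goto("https://en.wikipedia.org/wiki/OpenAI");
//     return page.evaluate(`document.querySelector('p:nth-of-type(2)').innerText`);
//   },
// );
```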