Hosted Puppeteer (Browserless)
Some websites are partially (or entirely) rendered on the client (aka your web browser). If you try to search the initial HTML for elements that haven’t finished rendering, you won’t find them.
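To make the problem concrete, here is a minimal sketch using a made-up HTML shell. The markup and the `hasElementWithClass` helper are hypothetical and not taken from any real site. The initial response holds only an empty container and a script tag, so searching it for the rendered paragraph finds nothing.

```typescript
// Hypothetical initial HTML for a client-rendered page: the server sends
// an empty container, and a script fills it in later inside the browser.
const initialHtml = `<!doctype html>
<html>
  <body>
    <div id="app"></div>
    <script src="/bundle.js"></script>
  </body>
</html>`;

// Naive check: does the raw markup contain an element with this class?
function hasElementWithClass(html: string, className: string): boolean {
  return new RegExp(`class=["'][^"']*\\b${className}\\b`).test(html);
}

// The intro paragraph only exists after the script has run in a browser,
// so it cannot be found in the initial markup.
const foundInInitialHtml = hasElementWithClass(initialHtml, "intro");
console.log(foundInInitialHtml);
```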
One solution is to use a headless browser: a web browser that runs in the background, fetches the page, renders it, and then lets you search the final document.
Headless browsers aren’t a good fit for Val Town because of the resources they require to run. However, services like Browserless provide APIs for interacting with a hosted headless browser, such as their /scrape API. Here’s how to use Browserless and Val Town to load a webpage.
Sign up to Browserless and grab your API Key
Copy your API Key from https://cloud.browserless.io/account/ and save it as a Val Town environment variable named browserless.
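If the variable is missing, the token in the request URL silently becomes the string "undefined". A small guard can surface that earlier. The sketch below is one possible approach; `requireEnv` is a hypothetical helper, and the lookup function is passed in so the example runs without any real secret.

```typescript
// Hypothetical helper: look up a required secret and fail loudly if missing.
// `getEnv` is injected so the function works with any runtime's env lookup.
function requireEnv(
  name: string,
  getEnv: (key: string) => string | undefined,
): string {
  const value = getEnv(name);
  if (value === undefined || value.trim() === "") {
    throw new Error(`Environment variable "${name}" is not set`);
  }
  return value;
}

// Stand-in lookup with a placeholder value, for illustration only.
const fakeEnv: Record<string, string> = { browserless: "example-token" };
const token = requireEnv("browserless", (key) => fakeEnv[key]);
console.log(token.length > 0);
```

On Val Town you would pass `(key) => Deno.env.get(key)` as the lookup instead of the stand-in.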
Make an API call to the /scrape API
Check the documentation for the /scrape API and form your request.
For example, here’s how to scrape the introduction paragraph of OpenAI’s Wikipedia page.
import { fetchJSON } from "https://esm.town/v/stevekrouse/fetchJSON?v=41";
const res = await fetchJSON(
  `https://chrome.browserless.io/scrape?token=${Deno.env.get("browserless")}`,
  {
    method: "POST",
    body: JSON.stringify({
      "url": "https://en.wikipedia.org/wiki/OpenAI",
      "elements": [{
        // The second <p> element on the page
        "selector": "p:nth-of-type(2)",
      }],
    }),
  },
);
// For this request, Browserless returns one data item
const data = res.data;
// That contains a single element
const elements = data[0].results;
// That we want to turn into its innerText value
const intro = elements[0].text;
console.log(intro);
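The request body and the response lookups above can be wrapped in small typed helpers, which also guards against a selector that matches nothing. This is a sketch under assumptions: `buildScrapeBody` and `firstText` are hypothetical names, and the types cover only the fields the example reads (`data`, `results`, `text`), not the full response.

```typescript
// Shapes limited to the fields used in the example above; the real
// /scrape response may carry additional fields that are ignored here.
interface ScrapeResult {
  text: string;
}
interface ScrapeItem {
  results: ScrapeResult[];
}
interface ScrapeResponse {
  data: ScrapeItem[];
}

// Hypothetical helper: build the JSON body for a /scrape request,
// with one `elements` entry per CSS selector.
function buildScrapeBody(url: string, selectors: string[]): string {
  return JSON.stringify({
    url,
    elements: selectors.map((selector) => ({ selector })),
  });
}

// Hypothetical helper: return the text of the first match for the first
// selector, or null if nothing matched.
function firstText(res: ScrapeResponse): string | null {
  const first = res.data?.[0]?.results?.[0];
  return first ? first.text : null;
}

// Stand-in response for illustration only (not real Browserless output).
const sample: ScrapeResponse = {
  data: [{ results: [{ text: "Example intro paragraph." }] }],
};
console.log(
  buildScrapeBody("https://en.wikipedia.org/wiki/OpenAI", ["p:nth-of-type(2)"]),
);
console.log(firstText(sample));
```

Returning null instead of indexing directly avoids a TypeError when the page layout changes and the selector no longer matches.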
Browserless also has APIs for taking screenshots and generating PDFs of websites.
Alternatively, use Puppeteer and a browser running on Browserless
You can use the Puppeteer library to connect to a browser instance running on Browserless.
Once you’ve navigated to a page, you can run arbitrary JavaScript with page.evaluate, like getting the text from a paragraph.
import { PuppeteerDeno } from "https://deno.land/x/puppeteer@16.2.0/src/deno/Puppeteer.ts";
const puppeteer = new PuppeteerDeno({
  productName: "chrome",
});
const browser = await puppeteer.connect({
  browserWSEndpoint: `wss://chrome.browserless.io?token=${Deno.env.get("browserless")}`,
});
const page = await browser.newPage();
await page.goto("https://en.wikipedia.org/wiki/OpenAI");
const intro = await page.evaluate(
  `document.querySelector('p:nth-of-type(2)').innerText`,
);
await browser.close();
console.log(intro);
"OpenAI is an American artificial intelligence (AI) research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership. OpenAI conducts AI research with the declared intention of promoting and developing friendly AI."
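In the example above, `browser.close()` is skipped if navigation or evaluation throws, which may leave the hosted session open. One way to handle this is a try/finally wrapper. The sketch below is an assumption-laden illustration: `withBrowser` and `Closable` are hypothetical names, and the helper relies only on the browser object having an async `close()` method.

```typescript
// Minimal structural type: only the method this helper relies on.
interface Closable {
  close(): Promise<void>;
}

// Hypothetical helper: run `task` against a connected browser and always
// close the connection afterwards, even when `task` throws.
async function withBrowser<B extends Closable, T>(
  connect: () => Promise<B>,
  task: (browser: B) => Promise<T>,
): Promise<T> {
  const browser = await connect();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

With the Puppeteer setup shown earlier, `connect` would wrap the `puppeteer.connect` call and `task` would contain the `newPage`, `goto`, and `evaluate` steps, returning the extracted text.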