Pipet is a command-line scraper for HTML and JSON endpoints. You describe a fetch target and a set of selectors in a .pipet file, run pipet file.pipet, and get structured output — text, JSON, or rendered via a Go template.
Why I starred it
Most scraping tools make you write code. Even the lightweight ones (BeautifulSoup, cheerio) require you to scaffold a script before you get anything. Pipet inverts this: the scraper is the file, not the code. The .pipet format is declarative and minimal enough that you can write one in under a minute from a browser's "Copy as cURL" menu item.
The monitoring angle also caught my attention. Pass --interval 60 --on-change "notify-send {}" and it becomes a no-code change detector for any URL. That's a real use case — flight prices, GitHub releases, any page that doesn't have a proper RSS feed.
How it works
The execution path is clean. ParseSpecFile in internal/app/app.go reads the .pipet file line by line with a bufio.Scanner, building a slice of Block structs:
type Block struct {
	Type     string
	Command  interface{}
	Queries  []string
	NextPage string
}
Blank lines delimit blocks; lines beginning with // are skipped. The first line of each block is either curl ... or playwright ... — that sets Block.Type and Block.Command. Every subsequent non-empty line becomes a query string; lines starting with > set Block.NextPage.
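To make those parsing rules concrete, here is a small hypothetical spec (URLs and selectors invented for illustration) that exercises each one — a comment, two blocks separated by a blank line, and a next-page selector:

```
// comment lines like this are skipped
curl https://example.com/products
.product .name
> a.next-page

curl https://api.example.com/stats.json
#.count
```

The blank line splits this into two Block values: the first is a curl block with one query and a NextPage selector, the second queries what is presumably a JSON response.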
ExecuteBlocks in the same file then runs each block. For curl blocks it shells out directly — exec.Command(parts[0], parts[1:]...) — which means any valid curl invocation works as-is, including pasted browser cURL commands with all their headers and cookies intact.
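A stripped-down sketch of that shell-out step (the function name and the naive strings.Fields tokenization are mine, not Pipet's — a real implementation has to honor shell quoting in pasted cURL commands, which strings.Fields does not):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runCommandLine splits a stored command line on whitespace and executes
// it verbatim, returning stdout. This mirrors the exec.Command(parts[0],
// parts[1:]...) shape described above, minus quote-aware tokenization.
func runCommandLine(line string) (string, error) {
	parts := strings.Fields(line)
	out, err := exec.Command(parts[0], parts[1:]...).Output()
	return string(out), err
}

func main() {
	// echo stands in for curl so the sketch runs without network access.
	out, err := runCommandLine("echo hello world")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```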
The content detection in parsers/utils.go is blunt but effective:
isJSON := json.Valid(bytes.TrimSpace(output))
if isJSON {
	return ParseJSONQueries(output, block.Queries)
} else {
	return ParseHTMLQueries(output, block.Queries, block.NextPage)
}
No content-type header inspection. If the response body is valid JSON, treat it as JSON; otherwise treat it as HTML. Crude, but it works for the 99% case.
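A self-contained sketch of that check (detectFormat is a hypothetical name for illustration; the logic is the json.Valid test quoted above):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// detectFormat trims whitespace and lets json.Valid decide, exactly as
// in the snippet above: valid JSON goes to the JSON parser, everything
// else is treated as HTML.
func detectFormat(body []byte) string {
	if json.Valid(bytes.TrimSpace(body)) {
		return "json"
	}
	return "html"
}

func main() {
	fmt.Println(detectFormat([]byte(`{"ok": true}`)))  // json
	fmt.Println(detectFormat([]byte(`<html></html>`))) // html
	// Edge case: a bare number is valid JSON, so a plain-text "42"
	// response would be routed to the JSON parser.
	fmt.Println(detectFormat([]byte(`42`))) // json
}
```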
HTML parsing uses goquery (a Go jQuery port). The indentation model is where it gets interesting — whitespace nesting encodes iteration. When ParseHTMLQueries in parsers/html.go sees a line whose next sibling has greater indentation, it treats the current selector as a parent iterator and runs the child selectors against each matched element's outer HTML. The recursion means you can nest multiple levels deep:
elements.Each(func(subi int, subdoc *goquery.Selection) {
	html, _ := goquery.OuterHtml(subdoc)
	// table rows get wrapped in <table> for goquery to parse correctly
	if strings.HasPrefix(html, "<tr") || strings.HasPrefix(html, "<td") {
		html = "<table>" + html + "</table>"
	}
	value2, _, _ := ParseHTMLQueries([]byte(html), lines, "")
	...
})
That <tr>/<td> wrapping is a real detail — goquery chokes on table fragments without a table ancestor, so they wrap them before re-parsing. Small thing, real fix.
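Since the recursion supports arbitrary depth, a two-level spec looks like this (page and selectors invented for illustration):

```
curl https://example.com/forum
.thread
  .subject
  .reply
    .author
    .time
```

Each `.thread` match is re-parsed against the indented lines beneath it, and each `.reply` within it is in turn re-parsed against `.author` and `.time`.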
JSON parsing uses gjson with the same indentation trick. A parent GJSON path iterates over array results; child paths extract fields per item. You can break out to jq mid-query by piping: @this | jq '.[].name' is valid syntax, and Pipet will parse the returned string back as JSON if it's valid.
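A minimal sketch of mixing GJSON paths with the jq escape hatch (endpoint and field names are hypothetical):

```
curl https://api.example.com/v1/users
#.name
@this | jq '[.[].email]'
```

The first query is plain GJSON; the second hands the whole document to jq via `@this`, and the jq output — a JSON array here — gets parsed back into structured data.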
Pagination works by replacing the URL in Block.Command with whatever the > CSS selector resolves to on the current page, then re-running the block — up to --max-pages times (default 3).
Using it
The HN example from the README is a good starting point:
curl https://news.ycombinator.com/
.title .titleline
  span > a
  .sitebit a
Run pipet --json hackernews.pipet and you get a nested JSON array — one entry per story, each with title and domain.
For JSON APIs, the GJSON syntax handles most structures:
curl https://api.github.com/repos/bjesus/pipet/releases
#.tag_name
#.published_at
Add --interval 300 --on-change 'notify-send "New release: {}"' to monitor for new releases.
Playwright blocks work differently — each query is raw JavaScript evaluated in the page context after load:
playwright https://github.com/bjesus/pipet
Array.from(document.querySelectorAll('.about-margin .Link')).map(e => e.innerText.trim()).filter(t => /^\d/.test(t))
No nesting here. You write JS that returns a value, and that value lands in the output.
Rough edges
There are no tests for the application logic — parsers/html_test.go, parsers/json_test.go, and parsers/playwright_test.go exist, but the app.go parsing and execution path isn't covered. This matters because the indentation parsing is the trickiest part of the format and has no automated guard against regressions.
The --max-pages flag defaults to 3 even when you don't have a > next-page selector, which means an extra loop iteration on every non-paginated block before it breaks on empty nextPageURL. Minor, but visible in the code.
Error messages from piped shell commands are swallowed silently — ExecutePipe returns "" on error and the loop just breaks. If your | jq '...' expression is wrong you get empty output, not a useful message.
The last commit was October 2024 — docs update only. The project feels stable-ish but not actively developed.
Bottom line
Pipet is the right tool when you need a quick, repeatable scrape without standing up a script. The .pipet file format is easy to version, share, and hand off. If you're already comfortable with CSS selectors and curl, you can extract data from most pages in under five minutes.
