Advanced usage

Running in Docker

webchanges can be run in a Docker container. Please see https://github.com/yubiuser/webchanges-docker for one such implementation.

Making POST requests

The POST HTTP request method is used to submit form-encoded data to the specified resource (server). In webchanges, simply supply your data in the data directive. The method will be automatically changed to POST and, if no Content-type header is supplied, it will be set to application/x-www-form-urlencoded.

If the data needs to be sent in JSON, add the directive data_as_json to the job. If no Content-type header is supplied, it will be set to application/json.

See examples here.
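
To make the difference between the two encodings concrete, the following minimal sketch shows roughly what the request body looks like in each case. It uses only the Python standard library and illustrates the encodings themselves, not webchanges' internal code; the field names are made up.

```python
import json
from urllib.parse import urlencode

# Hypothetical form fields, for illustration only.
fields = {"user": "alice", "query": "price drop"}

# Body as sent with Content-type: application/x-www-form-urlencoded
form_body = urlencode(fields)

# Body as sent with Content-type: application/json
json_body = json.dumps(fields)

print(form_body)
print(json_body)
```

Note how the form encoding turns the space into a plus sign, while the JSON body keeps the value as a quoted string.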

Selecting items from a JSON dictionary

If you are watching JSON-encoded dictionary data but are only interested in the data contained in a certain key, you can use the jq filter (Linux/macOS only, ASCII only) to extract it, or write a cross-platform Python command like the one below:

url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
  - execute: "python3 -c \"import sys, json; print(json.load(sys.stdin)['data'])\""

Escaping the Python code is a bit complex because it sits inside a double-quoted shell string, which in turn sits inside a double-quoted YAML string. For example, " becomes \\\" and \n becomes \\n, and so on. The example below uses this seemingly complex escaping and also informs the downstream html reporter that the extracted data is in Markdown:

url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
  - execute: "python3 -c \"import sys, json; d = json.load(sys.stdin); [print(f\\\"[{v['Title']}]\\n({v['DownloadUrl']})\\\") for v in d['value']]\""
is_markdown: true

Alternatively, you could run a script like this:

url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
  - execute: python3 ~/.config/webchanges/parse.py
is_markdown: true

With the script file ~/.config/webchanges/parse.py containing the following:

# ~/.config/webchanges/parse.py
import json
import sys

data = json.load(sys.stdin)
for v in data['value']:
    print(f"[{v['Title']}]\n({v['DownloadUrl']})")
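
Before wiring a script like this into a job, it can help to try the parsing logic on a sample payload. The sketch below wraps the same formatting in a function (the name format_items is hypothetical) and feeds it made-up data through an in-memory stream instead of standard input.

```python
import io
import json


def format_items(stream):
    """Mirror parse.py: read JSON from a text stream and format each item."""
    data = json.load(stream)
    return [f"[{v['Title']}]\n({v['DownloadUrl']})" for v in data["value"]]


# Made-up payload with the same shape the script expects.
sample = io.StringIO(
    '{"value": [{"Title": "Release 1.0", "DownloadUrl": "https://example.com/1.0.zip"}]}'
)
for item in format_items(sample):
    print(item)
```

A missing Title or DownloadUrl key raises KeyError here, which is usually what you want while testing, since it surfaces changes in the structure of the data early.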

More advanced programmers can write their own Class and hook it into webchanges.

Selecting HTML elements with wildcards

Some pages append randomly generated characters to the end of a class name, and these change every time the page is loaded. For example: contentWrap--qVat7asG, contentWrap--wSlxapCk, contentWrap--JV0HGsqD, etc.

element-by-class does not support this, but XPath does:

filters:
  - xpath: //div[contains(@class, 'contentWrap-')]
  - html2text

Alternatively, especially if you want to do more custom filtering, you can write an external Python script that uses e.g. Beautiful Soup and call it:

filters:
  - execute: python3 ~/.config/webchanges/content_wrap.py
  - html2text

With the script file ~/.config/webchanges/content_wrap.py containing the following:

# ~/.config/webchanges/content_wrap.py
import re
import sys

from bs4 import BeautifulSoup

data = sys.stdin.read()
soup = BeautifulSoup(data, 'lxml')

# Find <div> elements with a class name starting with "contentWrap-"
for element in soup.find_all('div', {'class': re.compile(r'^contentWrap-')}):
    print(element)
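
The class-name match used above can be checked in isolation, without any HTML. The sketch below uses an anchored pattern and made-up class names. Note that an unanchored pattern ending in -* would also accept a name like contentWrapper, since -* means "zero or more hyphens".

```python
import re

# Anchored pattern: the class name must start with "contentWrap-".
pattern = re.compile(r"^contentWrap-")

# Made-up class names, two of which carry a random suffix.
class_names = [
    "contentWrap--qVat7asG",
    "contentWrap--wSlxapCk",
    "sidebar",
    "contentWrapper",
]

matching = [name for name in class_names if pattern.search(name)]
print(matching)
```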

More advanced programmers can write their own Class and hook it into webchanges.

.onion (Tor) top level domain name

.onion is a special-use top level domain name designating an anonymous onion service reachable only via the Tor network. As sites with URLs in the .onion pseudo-TLD are not accessible via public DNS and TCP, you need to run a Tor service as a SOCKS5 proxy and route requests to these websites through it, as per this example:

name: A .onion website (unencrypted http)
url: http://www.example.onion
proxy: socks5h://localhost:9050
---
name: Another .onion website
url: https://www.example2.onion
proxy: socks5h://localhost:9050

Note the “h” in socks5h://, which tells the underlying urllib3 library to resolve the hostname using the SOCKS5 server (see here).

Setting up Tor is out of scope for this document, but in Windows install the Windows Expert Bundle from here and execute tor --service install as an Administrator; in Linux the installation of the tor package usually is sufficient to create a SOCKS5 proxy service, otherwise run with tor --options RunAsDaemon 1. Some useful options may be HardwareAccel 1 CircuitPadding 0 ConnectionPadding 0 ClientUseIPv6 1 FascistFirewall 1 (check documentation).

Alternatively (Linux/macOS only), instead of proxying those sites you can use the torsocks (formerly torify) tool from the tor package to make every Internet communication go through the Tor network. Just run webchanges within the torsocks wrapper:

torsocks webchanges

Passing diff output to a custom script

In some situations, it might be useful to run a script with the diff as input when changes are detected (e.g. to start an update or process something). This can be done by combining diff_filters with the execute filter, which can run any custom script.

The output of the custom script then becomes the diff result reported by webchanges, so if the script outputs any status, the CHANGED notification will contain the script's output, not the original diff. The job can also have a “normal” filter attached, for example to watch only links (the css: a part of the filter definitions):

url: https://example.org/downloadlist.html
filters:
  - css: a
diff_filters:
  - execute: /usr/local/bin/process_new_links.sh

If running on Linux/macOS, please read about file permission restrictions in the filter’s explanation here.
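
As an illustration of what such a script might do, the sketch below is a hypothetical Python counterpart to process_new_links.sh. It assumes the diff arrives in unified format and keeps only the added lines; the function name and the sample diff are made up.

```python
def added_lines(diff_text):
    """Return the lines added in a unified diff, without the leading '+'."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        # Skip the "+++" file header, which is not an added line.
        if line.startswith("+") and not line.startswith("+++")
    ]


# Made-up unified diff, standing in for what the script would receive.
sample_diff = "\n".join(
    [
        "--- @ old",
        "+++ @ new",
        "@@ -1,2 +1,3 @@",
        " https://example.org/a.zip",
        "+https://example.org/b.zip",
        " https://example.org/c.zip",
    ]
)

for link in added_lines(sample_diff):
    print(link)
```

In a real script, the sample string would presumably be replaced by sys.stdin.read(), and whatever the script prints would become the reported diff.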

Using word-based differ for Markdown (`pandiff`)

You can also specify an external diff-style tool (a tool that takes two filenames (old, new) as parameter and returns the difference of the files on its standard output). For example, to get Markdown differences you can use PanDiff:

url: https://example.com/
differ:
  name: command
  command: pandiff

or:

url: https://example.com/
differ:
  name: command
  command: pandiff --to=HTML
  is_html: true

In order for this to work, pandiff needs to be installed separately (see PanDiff).
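
Any program that follows the same calling convention can serve as the differ. The sketch below is a minimal, hypothetical example of such a tool written with Python's difflib; it assumes it is invoked with the old and new filenames as its two arguments, as described above, and writes a unified diff to standard output.

```python
import difflib
import sys


def diff_files(old_path, new_path):
    """Return a unified diff of two text files as a single string."""
    with open(old_path, encoding="utf-8") as f:
        old = f.readlines()
    with open(new_path, encoding="utf-8") as f:
        new = f.readlines()
    return "".join(difflib.unified_diff(old, new, fromfile="old", tofile="new"))


# Only act when called with exactly two filenames (old, new).
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.stdout.write(diff_files(sys.argv[1], sys.argv[2]))
```

Saved under a name of your choosing, the script could then be referenced from the command sub-directive in place of pandiff.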

Creating a separate notification for each change

Each type of report (Text, HTML or Markdown) has an optional sub-directive separate which, when set to true, causes webchanges to send a separate report for each job instead of a single combined report covering all jobs.

These sub-directives are set in the configuration.

Selecting recipients by individual job

Currently, configuring reporters on a per-job basis is not supported. All jobs share the same reporter configuration.

To manage different reporting needs, you can use multiple job files, each with its own configuration (--config), or, if using email, set separate reports (separate: true in config.yaml) and use email filtering to forward the reports as needed.

Using environment variables in URLs

Currently this cannot be done natively.

However, as a workaround you can use a job with a command directive to invoke e.g. curl or wget, which in turn reads the environment variable. Example:

command: wget -qO- "https://www.example.com/test?resource=$RESOURCE"
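
If the value of the variable may contain characters that need URL-encoding (spaces, ampersands and so on), a small Python script called from the command job is one possible alternative. The sketch below only builds the URL; the function name build_url is hypothetical, and RESOURCE is the variable from the example above.

```python
import os
from urllib.parse import urlencode


def build_url(base, env=os.environ):
    """Append the RESOURCE environment variable as an encoded query parameter."""
    resource = env.get("RESOURCE", "")
    return f"{base}?{urlencode({'resource': resource})}"


print(build_url("https://www.example.com/test"))
```

To fetch the page as well, the script could pass the resulting URL to urllib.request.urlopen and print the decoded response body.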

Authenticated requests

Set the Authorization header to provide credentials that authenticate a url job with a server, allowing access to a protected resource. Some of the most popular authentication schemes are Basic, Digest and NTLM. For more information, see here.
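
For the Basic scheme, the header value is the word Basic followed by the Base64 encoding of username:password. The sketch below computes it with the standard library; the function name and the credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an Authorization header for the Basic scheme."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Keep in mind that Base64 is an encoding, not encryption, so this scheme should only be used over https.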

Using persistent browser storage (for e.g. authentication)

Some sites use cookies and/or data stored in ‘Local Storage’ to authenticate or initialize their state, and will not display the content you want unless you first authenticate (or accept cookies, etc.). In these circumstances, you can use webchanges with the use_browser: true directive and its user_data_dir sub-directive to instruct it to use an existing user directory that you have initialized beforehand. Specifically:

1. Create an empty directory somewhere (e.g. mkdir ~/chrome_user_data_webchanges);
2. Run a Google Chrome browser with the --user-data-dir switch pointing to this directory (e.g. chrome.exe --user-data-dir=~/chrome_user_data_webchanges);
3. Browse to the site that you’re interested in tracking and log in or do whatever is needed for it to save the authentication data in local storage;
4. Exit the browser.

You can now run a webchanges job defined like this:

url: https://example.org/usedatadir.html
use_browser: true
user_data_dir: ~/chrome_user_data_webchanges

Overriding the content encoding

(rare) For web pages with missing or incorrect 'Content-type' HTTP header or whose encoding cannot be correctly guessed by the chardet library our default HTTP client uses, it may be useful to explicitly specify an encoding from Python’s Standard Encodings list like this:

url: https://example.com/
encoding: utf-8
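
The effect of a wrong encoding guess is easy to reproduce in plain Python. The sketch below is only an illustration of the problem, not of how webchanges decodes responses: it decodes the same bytes twice, once with the wrong encoding and once with the right one.

```python
# Bytes as a server might send them (UTF-8 encoded).
raw = "café".encode("utf-8")

wrong = raw.decode("latin-1")  # what an incorrect guess produces
right = raw.decode("utf-8")    # what the correct encoding produces

print(wrong)  # cafÃ©
print(right)  # café
```

Garbled sequences like the first line in a report are a typical sign that the encoding directive is needed.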

Monitoring the HTTP response status code

To monitor the HTTP response status code of a resource and be notified when it changes, use an external command like curl to extract it. Here’s a job example:

command: curl --silent --output /dev/null --write-out '%{response_code}' https://example.com
name: Example.com response status code
note: Requires curl
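
Where curl is not available, a short Python script run from the command job is one possible substitute. The sketch below is a hypothetical helper built on urllib from the standard library. Unlike the curl command above, urlopen follows redirects, so it reports the status of the final response rather than a 301 or 302.

```python
import urllib.error
import urllib.request


def status_code(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for error statuses; the code is still what we want.
        return err.code
```

A script using this helper would simply print status_code("https://example.com") so that webchanges can track the value. Network failures (DNS errors, timeouts) raise URLError and are deliberately left unhandled here.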

Creating job urls based on keywords

webchanges does not support arrays and loops to generate jobs (e.g. to check different pricing of a set of products on a set of shops). The best way to do this is to use some template language outside of webchanges and let it generate the urls.yaml file from that template.
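
A full template language is not strictly necessary; a few lines of Python can play the same role. The sketch below is one possible approach, with made-up shop names, URL patterns and keywords: it produces one job per shop and keyword combination, separated by --- lines.

```python
from itertools import product

# Hypothetical inputs: shop URL templates and product keywords.
shops = {
    "Shop A": "https://shop-a.example.com/search?q={keyword}",
    "Shop B": "https://shop-b.example.com/find/{keyword}",
}
keywords = ["widget", "gadget"]


def render_jobs(shops, keywords):
    """Return the text of a jobs file with one job per (shop, keyword) pair."""
    jobs = [
        f"name: {keyword} at {shop}\nurl: {template.format(keyword=keyword)}\n"
        for (shop, template), keyword in product(shops.items(), keywords)
    ]
    return "---\n".join(jobs)


print(render_jobs(shops, keywords))
```

Redirecting the output to a file gives a jobs file that can be regenerated whenever the lists change. The sketch assumes keywords contain no characters that need URL-encoding or YAML quoting.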

Add bullet points in reports for clarity

You can improve the readability of line-by-line data by adding bullet points using the technique below. Notice the two spaces before the asterisk (*): this ensures it works in Markdown, and therefore also after the `html2text` filter, which outputs Markdown. When using an HTML reporter, these bullet points will be converted to proper <ul><li> tags, and most browsers will display each line with an indented bullet point (●), providing clear visual separation between items.

filters:
  - html2text #  example only: can be used with any or no previous filters!
  - re.sub:
      pattern: (?m)^
      repl: '  * '

Speeding up browser jobs by blocking elements

Warning

This Pyppeteer feature is not (yet?) implemented by Playwright, and therefore the block_elements directive is ignored (does nothing) for the time being.

If you’re running a browser job (use_browser: true) and are not interested in all elements of a website, you can skip downloading the ones that you don’t care about, keeping in mind that some elements may be required for the correct rendering of the website (always test!). Typical elements to skip include stylesheet, font, image, media, and other, and they can be specified like this on a job-by-job basis:

name: This is a JavaScript site
note: It's just a test
url: https://www.example.com
use_browser: true
block_elements:
  - stylesheet
  - font
  - image
  - media
  - other

or like this in the config file for all use_browser: true jobs:

job_defaults:
  browser:
    block_elements:
      - stylesheet
      - font
      - image
      - media
      - other

The full list of supported resources is the following (from here):

document
stylesheet
image
media
font
script
texttrack
xhr
fetch
eventsource
websocket
manifest
other