Advanced usage
Running in Docker
webchanges can be run in a Docker container. Please see https://github.com/yubiuser/webchanges-docker for one such implementation.
Making POST requests
The POST HTTP request method is used to submit
form-encoded data to the specified resource (server). In webchanges, simply supply your data in the data
directive. The method will be automatically changed to POST and, if no Content-type header is supplied, it will be set to
application/x-www-form-urlencoded.
If the data needs to be sent in JSON, add the directive data_as_json to the job. If no Content-type header is supplied, it will be set to
application/json.
See examples here.
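To make the two encodings concrete, here is a minimal Python sketch of what the request body looks like in each case. It only illustrates the wire formats using the standard library; it is not how webchanges is implemented, and the helper name encode_body is made up for this example.

```python
import json
from urllib.parse import urlencode


def encode_body(data: dict, as_json: bool = False) -> tuple[str, str]:
    """Return (body, content_type) for a POST request (illustrative helper)."""
    if as_json:
        # JSON body, corresponding to a job that sets data_as_json
        return json.dumps(data), "application/json"
    # Form-encoded body, corresponding to the default for the data directive
    return urlencode(data), "application/x-www-form-urlencoded"


print(encode_body({"q": "web changes", "page": 1}))
print(encode_body({"q": "web changes", "page": 1}, as_json=True))
```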
Selecting items from a JSON dictionary
If you are watching JSON-encoded dictionary data but are only interested in the data contained in a certain key, you can use the jq filter (Linux/macOS only, ASCII only) to extract it, or write a cross-platform Python command like the one below:
url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
- execute: "python3 -c \"import sys, json; print(json.load(sys.stdin)['data'])\""
Escaping the Python code is somewhat complex because it sits inside a double-quoted shell string, which is itself inside a double-quoted YAML
string. For example, a " in the code becomes \\\" and \n becomes \\n, and so on. The example below shows
this escaping and also informs the downstream html reporter that the extracted data is in Markdown:
url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
- execute: "python3 -c \"import sys, json; d = json.load(sys.stdin); [print(f\\\"[{v['Title']}]\\n({v['DownloadUrl']})\\\") for v in d['value']]\""
is_markdown: true
Alternatively, you could run a script like this:
url: https://example.com/api_data.json
user_visible_url: https://example.com
filters:
- execute: python3 ~/.config/webchanges/parse.py
is_markdown: true
With the script file ~/.config/webchanges/parse.py containing the following:
# ~/.config/webchanges/parse.py
import json
import sys
# Read the JSON document that webchanges passes on standard input
data = json.load(sys.stdin)
for v in data['value']:
    print(f"[{v['Title']}]\n({v['DownloadUrl']})")
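If some entries may lack a key, a slightly more defensive variant could look like the sketch below. The Title and DownloadUrl keys are taken from the example above; the helper name format_entries and the sample payload are assumptions made for illustration only.

```python
import json


def format_entries(payload: dict) -> list[str]:
    """Build one '[Title]\\n(DownloadUrl)' string per complete entry."""
    lines = []
    for entry in payload.get("value", []):
        title = entry.get("Title")
        url = entry.get("DownloadUrl")
        # Skip entries missing either field instead of raising KeyError
        if title and url:
            lines.append(f"[{title}]\n({url})")
    return lines


# Hypothetical sample data standing in for the API response
sample = json.loads(
    '{"value": [{"Title": "Release 1.0", "DownloadUrl": "https://example.com/1.0"},'
    ' {"Title": "Entry without a URL"}]}'
)
print("\n".join(format_entries(sample)))
```

In a real filter script the payload would come from json.load(sys.stdin) rather than from the sample string.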
More advanced programmers can write their own Class and hook it into webchanges.
Selecting HTML elements with wildcards
Some pages append randomly generated characters to the end of a class name, and these change every time the page is loaded. For example: contentWrap--qVat7asG, contentWrap--wSlxapCk, contentWrap--JV0HGsqD, etc.
element-by-class does not support this, but XPATH does:
filters:
- xpath: //div[contains(@class, 'contentWrap-')]
- html2text
Alternatively, especially if you want to do more custom filtering, you can write an external Python script that uses e.g. Beautiful Soup and call it:
filters:
- execute: python3 ~/.config/webchanges/content_wrap.py
- html2text
With the script file ~/.config/webchanges/content_wrap.py containing the following:
# ~/.config/webchanges/content_wrap.py
import re
import sys
from bs4 import BeautifulSoup
# Read the HTML that webchanges passes on standard input
data = sys.stdin.read()
soup = BeautifulSoup(data, 'lxml')
# Search for "div" elements with a class containing "contentWrap-"
for element in soup.find_all('div', {'class': re.compile(r'contentWrap-')}):
    print(element)
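Before relying on a regular expression for class matching, it can help to try it against a few class names. The sketch below is only a way to experiment with patterns (the helper select_classes and the sample names are made up); it shows that a trailing -* makes the hyphen optional, so the pattern also selects unrelated classes such as contentWrapper.

```python
import re


def select_classes(pattern: str, class_names: list[str]) -> list[str]:
    """Return the class names in which the pattern is found."""
    regex = re.compile(pattern)
    return [name for name in class_names if regex.search(name)]


names = ["contentWrap--qVat7asG", "contentWrap--wSlxapCk", "contentWrapper", "sidebar"]

# Requires a literal hyphen after "contentWrap"
print(select_classes(r"contentWrap-", names))
# "-*" means "zero or more hyphens", so "contentWrapper" is selected too
print(select_classes(r"contentWrap-*", names))
```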
More advanced programmers can write their own Class and hook it into webchanges.
.onion (Tor) top level domain name
.onion is a special-use top level domain name designating an anonymous onion service reachable only via the Tor network. As sites with URLs in the .onion pseudo-TLD are not accessible via public DNS and TCP, you need to run a Tor service as a SOCKS5 proxy and route requests to these websites through it, as per this example:
name: A .onion website (unencrypted http)
url: http://www.example.onion
proxy: socks5h://localhost:9050
---
name: Another .onion website
url: https://www.example2.onion
proxy: socks5h://localhost:9050
Note the “h” in socks5h://, which tells the underlying urllib3 library to resolve the hostname using the SOCKS5
server (see here).
Setting up Tor is out of scope for this document. On Windows, install the Windows Expert Bundle from here and execute tor --service install as an Administrator; on Linux, installing
the tor package is usually sufficient to create a SOCKS5 proxy service, otherwise run with
tor --options RunAsDaemon 1. Some useful options may be HardwareAccel 1 CircuitPadding 0 ConnectionPadding 0
ClientUseIPv6 1 FascistFirewall 1 (check the documentation).
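Before running such jobs, it can be useful to confirm that something is listening on the proxy port. The following is a minimal sketch using only the standard library; the helper name port_is_open is made up, and the check only tells you that a TCP connection is accepted on the port, not that the listener is actually Tor.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, host unreachable, etc.
        return False


# Example (9050 is the port used in the jobs above):
# port_is_open("localhost", 9050)
```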
Alternatively (Linux/macOS only), instead of proxying those sites you can use the torsocks (formerly known as torify) tool from the tor package to make every Internet communication go through the Tor network. Just run webchanges within the torsocks wrapper:
torsocks webchanges
Passing diff output to a custom script
In some situations, it might be useful to run a script with the diff as input when changes are detected (e.g. to start
an update or process something). This can be done by combining diff_filters with the execute filter, which
can run any custom script.
The output of the custom script then becomes the diff result reported by webchanges, so if the script outputs any status, the
CHANGED notification from webchanges will contain the output of the custom script, not the original diff. The job
can also have a “normal” filter attached, for example to only watch links (the css: a part of the filter definitions):
url: https://example.org/downloadlist.html
filters:
- css: a
diff_filters:
- execute: /usr/local/bin/process_new_links.sh
If running on Linux/macOS, please read about file permission restrictions in the filter’s explanation here.
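As an illustration of what such a script might do, here is a sketch in Python that pulls the added lines out of a diff. It assumes the diff is in unified format, which is an assumption to verify against your own output; the helper name added_lines and the sample diff are invented for this example.

```python
def added_lines(diff_text: str) -> list[str]:
    """Return the lines added in a unified diff, without the leading '+'."""
    added = []
    for line in diff_text.splitlines():
        # '+++' marks the new-file header rather than an added line
        # (simplification: added content that itself starts with '++' is skipped)
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
    return added


# Hypothetical diff of a list of links
sample = """--- @ old
+++ @ new
@@ -1,2 +1,3 @@
 https://example.org/a.zip
-https://example.org/b.zip
+https://example.org/c.zip
+https://example.org/d.zip
"""
print(added_lines(sample))
```

A script used as a diff filter would apply the function to sys.stdin.read() and print whatever it wants to appear in the report.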
Using word-based differ for Markdown (pandiff)
You can also specify an external diff-style tool (a tool that takes two filenames (old, new) as parameters and
returns the difference between the files on its standard output). For example, to get Markdown differences you can use
PanDiff:
url: https://example.com/
differ:
  name: command
  command: pandiff
or:
url: https://example.com/
differ:
  name: command
  command: pandiff --to=HTML
  is_html: true
In order for this to work, pandiff needs to be installed separately (see PanDiff).
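Any program that follows the same contract (two filenames in, differences on standard output) can be used in this way. As a rough sketch of that contract, and not a replacement for PanDiff, the following uses Python's difflib; the function name diff_files is made up for this example.

```python
import difflib
from pathlib import Path


def diff_files(old_path: str, new_path: str) -> str:
    """Return a unified diff between two text files (empty string if identical)."""
    old = Path(old_path).read_text(encoding="utf-8").splitlines(keepends=True)
    new = Path(new_path).read_text(encoding="utf-8").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old, new, fromfile="old", tofile="new"))
```

A command-line wrapper would call diff_files with the two filenames it receives as arguments and print the result.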
Creating a separate notification for each change
Each type of report (Text, HTML or Markdown) has an optional sub-directive separate, which
when set to true causes webchanges to send a separate report for each job instead of a single combined
report with all jobs.
These sub-directives are set in the configuration.
Selecting recipients by individual job
Currently, configuring reporters on a per-job basis is not supported. All jobs share the same reporter configuration.
To manage different reporting needs, you can use multiple job files, each with its own configuration (--config),
or, if using email, set separate reports (separate: true in config.yaml) and use email filtering to
forward the reports as needed.
Using environment variables in URLs
Currently this cannot be done natively.
However, as a workaround you can use a command job to invoke e.g. curl or wget, which in turn reads
the environment variable. Example:
command: wget https://www.example.com/test?resource=$RESOURCE
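If the value may contain spaces or other special characters, building the URL in a small script avoids quoting problems. This is only a sketch: the helper name build_url is made up, and the RESOURCE variable and resource parameter simply mirror the example above.

```python
import os
from urllib.parse import urlencode


def build_url(base: str, env_name: str = "RESOURCE", param: str = "resource") -> str:
    """Append the value of an environment variable as a URL-encoded query parameter."""
    value = os.environ.get(env_name)
    if value is None:
        # Fail loudly rather than requesting a URL with an empty parameter
        raise SystemExit(f"environment variable {env_name} is not set")
    return f"{base}?{urlencode({param: value})}"
```

A script built around this helper could fetch the resulting URL, print the response, and be invoked from a command job in place of wget.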
Authenticated requests
Set the Authorization header to provide credentials that authenticate a url job with a server, allowing access
to a protected resource. Some of the most popular authentication schemes are Basic, Digest and NTLM. For
more information, see here.
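For the Basic scheme, the header value is the word Basic followed by the Base64 encoding of username:password. A minimal sketch for computing it (the helper name and the credentials are placeholders):

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

Keep in mind that Base64 is an encoding, not encryption, so the Basic scheme should only be used over HTTPS.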
Using persistent browser storage (for e.g. authentication)
Some sites may use a combination of cookies and/or their functional equivalent of storing data in ‘Local Storage’ to
authenticate or initialize their state and will not display the content you want unless you first authenticate (or
accept cookies or whatever). In these circumstances, you can use webchanges with the use_browser: true
directive and its user_data_dir sub-directive to instruct it to use a pre-existing user directory, which you can
initialize beforehand. Specifically:
Create an empty directory somewhere (e.g. mkdir ~/chrome_user_data_webchanges);
Run a Google Chrome browser with the --user-data-dir switch pointing to this directory (e.g. chrome.exe --user-data-dir=~/chrome_user_data_webchanges);
Browse to the site that you’re interested in tracking and log in or do whatever is needed for it to save the authentication data in local storage;
Exit the browser.
You can now run a webchanges job defined like this:
url: https://example.org/usedatadir.html
use_browser: true
user_data_dir: ~/chrome_user_data_webchanges
Overriding the content encoding
(rare) For web pages with a missing or incorrect 'Content-type' HTTP header, or whose encoding cannot be
correctly guessed by the chardet library our default HTTP client uses, it may be useful to explicitly specify
an encoding from Python’s Standard Encodings list like this:
url: https://example.com/
encoding: utf-8
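The symptom of a wrongly guessed encoding is usually garbled accented characters. The sketch below reproduces the effect with the standard library only; it illustrates why the override matters, not how webchanges decodes responses, and the helper name decode_body is made up.

```python
def decode_body(raw: bytes, encoding: str) -> str:
    """Decode a response body using an explicitly chosen encoding."""
    return raw.decode(encoding)


raw = "café".encode("utf-8")  # bytes as sent by a hypothetical server

print(decode_body(raw, "latin-1"))  # wrong guess: cafÃ©
print(decode_body(raw, "utf-8"))    # correct encoding: café
```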
Monitoring the HTTP response status code
To monitor the HTTP response status code of a resource and be notified when it changes, use an external command like curl to extract it. Here’s a job example:
command: curl --silent --output /dev/null --write-out '%{response_code}' https://example.com
name: Example.com response status code
note: Requires curl
Creating job urls based on keywords
webchanges does not support arrays and loops to generate jobs (e.g. to check different pricing of a set of
products on a set of shops). The best way to do this is to use some template language outside of
webchanges and let it generate the urls.yaml file from that template.
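A plain Python script can serve as the template language. The sketch below generates one job per product and shop combination, separating jobs with --- lines; the product keywords, shop names and URL patterns are all hypothetical, and the output is not YAML-escaped, so it assumes names without special YAML characters.

```python
from itertools import product
from urllib.parse import quote_plus

# Hypothetical keywords and shop URL templates
PRODUCTS = ["widget", "gadget"]
SHOPS = {
    "Shop A": "https://shop-a.example.com/search?q={keyword}",
    "Shop B": "https://shop-b.example.com/find/{keyword}",
}


def render_jobs(products: list[str], shops: dict[str, str]) -> str:
    """Return the text of a jobs file with one job per (product, shop) pair."""
    docs = []
    for keyword, (shop, template) in product(products, shops.items()):
        url = template.format(keyword=quote_plus(keyword))
        docs.append(f"name: {keyword} at {shop}\nurl: {url}\n")
    # Jobs are separated by a line containing three dashes
    return "---\n".join(docs)


print(render_jobs(PRODUCTS, SHOPS))
```

Redirecting the output of such a script to the jobs file before each run keeps the job list in sync with the keyword lists.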
Add bullet points in reports for clarity
You can improve the readability of line-by-line data by adding bullet points using the technique below. Notice the two
spaces before the asterisk (*): this ensures it works in Markdown, and therefore also after the html2text filter,
which outputs Markdown. When using an HTML reporter, these bullet points will be converted to proper <ul><li> tags, and
most browsers will display each line with an indented bullet point (●), providing clear visual separation between items.
filters:
  - html2text  # example only: can be used with any or no previous filters!
  - re.sub:
      pattern: (?m)^
      repl: '  * '
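The effect of this substitution can be previewed in Python, whose re module uses the same pattern syntax. This is just a sketch for experimenting; the helper name add_bullets is made up.

```python
import re


def add_bullets(text: str) -> str:
    """Prefix every line with two spaces, an asterisk and a space."""
    # (?m) makes ^ match at the start of every line, not just the start of the text
    return re.sub(r"(?m)^", "  * ", text)


print(add_bullets("first item\nsecond item"))
```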
Speeding up browser jobs by blocking elements
Warning
This Pyppeteer feature is not (yet?) implemented by Playwright, and therefore the block_elements directive
is ignored (does nothing) for the time being.
If you’re running a browser job (use_browser: true) and are not interested in all elements of a website, you can skip
downloading the ones that you don’t care about, keeping in mind that some elements may be required for the correct rendering
of the website (always test!). Typical elements to skip include stylesheet, font, image, media, and
other, and they can be specified like this on a job-by-job basis:
name: This is a Javascript site
note: It's just a test
url: https://www.example.com
use_browser: true
block_elements:
- stylesheet
- font
- image
- media
- other
or like this in the config file for all use_browser: true jobs:
job_defaults:
  browser:
    block_elements:
      - stylesheet
      - font
      - image
      - media
      - other
The full list of supported resources is the following (from here):
document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other