Jobs

Each job specifies the source of the data to monitor (a URL or a command) along with related directives, and optionally directives for the transformations (filters) to apply to the data (and/or the diff) once it is retrieved.

The list of jobs is contained in the jobs file jobs.yaml, a YAML text file editable with the command webchanges --edit or using any text editor.

YAML tips

The YAML syntax has lots of idiosyncrasies that make it finicky, and new users often have issues with it. Below are some tips and things to look for when using YAML, but please also see a more comprehensive introduction to YAML here.

  • Indentation: All indentation must be done with spaces (2 spaces is suggested); tabs are not recognized/allowed. Indentation is mandatory and needs to be consistent throughout the file.

  • Nesting: The indentation logic sometimes changes when nesting dictionaries.

filter:
  - html2text:           # a list item; notice 2 spaces before the '-'
      pad_tables: true   # a dictionary item; notice 6 spaces before the name
  • There must be a space after the : between the key name and its value. The lack of such space is often the reason behind “Unknown filter kind” errors with no arguments.

filter:
  - re.sub: text  # This is correct
filter:
  - re.sub:text  # This is INCORRECT; space is required
  • Escaping special characters: Certain characters at the beginning of the line such as a -, a : followed by a space, a space followed by #, the % sign (anywhere), all sorts of brackets, and more are all considered special characters by YAML. Strings containing these characters or sequences need to be enclosed in quotes:

name: This is a human-readable name/label of the job  # and this is a remark
name: "This one has a: colon followed by a space and a space followed by a # hash mark"
name: "I must escape \"double\" quotes within a double quoted string"
  • You can learn more about quoting special characters here (the library we use supports YAML 1.1, and our examples use “flow scalars”). URLs and XPaths are always safe and don’t need to be enclosed in quotes.

For additional information on YAML, see the YAML syntax page and the references at the bottom of that page.

Multiple jobs

Multiple jobs are separated by a line containing three hyphens, i.e. ---.

Naming a job

While optional, it is recommended that each job starts with a name entry. If omitted and the data monitored is HTML or XML, webchanges will automatically use the first 60 characters of the document’s title (if present) as the job’s name.

name: This is a human-readable name/label of the job
url: https://example.org/

URL

This is the main job type. It retrieves a document from a web server (https:// or http://), an FTP server (ftp://), or a local file (file://).

name: Example homepage
url: https://www.example.org/
---
name: Example page 2
url: https://www.example.org/page2
---
name: Example of a local file
url: file://syslog
---
name: Example of an FTP file (username anonymous if not specified)
url: ftp://username:password@ftp.example.com/file.txt

Caution

Due to a legacy architectural choice, URLs must be unique to each job. If for some reason you want to monitor the same resource multiple times, make each job’s URL unique by e.g. adding # at the end of the link followed by a unique remark (the # and everything after is typically discarded by a web server, but captured by webchanges):

name: Example homepage
url: https://example.org/
---
name: Example homepage -- again!
url: https://example.org/#2

If you specify user_visible_url, then the value of this directive is the one used for this restriction.

Internally, this type of job has the attribute kind: url.

Changed in version 3.6: Added support for ftp:// URIs.

JavaScript rendering (use_browser: true)

If the data you want to monitor only appears after the page’s content is rendered with JavaScript, add the directive use_browser: true to the job:

name: A page with JavaScript
url: https://example.org/
use_browser: true

Warning

As this job type renders the page in a headless Google Chrome instance, it requires more resources and time than a simple url job. Only use when you can’t find alternate ways to get to the data (e.g. an API being called by the page, see tip below) you want to monitor.

Tip

In many instances you can get the data you want to monitor directly from a REST API (URL) called by the site during page loading. Use your browser’s developer tools (e.g. Chrome DevTools, Ctrl+Shift+I) and their network activity tab to inspect the requests the page makes. If you find relevant API calls, extract the URL, method, and data, and monitor the API in a url job without the need to specify use_browser: true.

Important

  • The optional Playwright Python package must be installed; run pip install webchanges[use_browser] to install it.

  • The first time you run a job with use_browser: true, if the latest version of Google Chrome is not found, Playwright will download it (~350 MiB). This could take some time (and bandwidth). You can pre-install the latest version of Chrome at any time by running webchanges --install-chrome.

When using use_browser: true, you do not need to set any headers in the configuration file or job unless the site you’re monitoring has special requirements.

While we implement measures to minimize website detection of headless Chrome, passing basic detection tests here, some sites use advanced anti-automation methods such as rate limiting, session initialization (see initialization_url for handling), CAPTCHAs, browser fingerprinting, etc. that might block your monitoring.

Tip

Please see the no_conditional_request directive if you need to turn off the use of conditional requests for those extremely rare websites that don’t handle it (e.g. Google Flights).

Tip

If a job fails, you can run in verbose (-v) mode to save in the temporary folder a screenshot, a full page image, and the HTML contents at the moment of failure (see log for filenames) to aid in debugging.

Internally, this type of job has the attribute kind: browser.

Changed in version 3.0: JavaScript rendering is done using the use_browser: true directive instead of replacing the url directive with navigate, which is now deprecated.

Changed in version 3.10: Using Playwright and Google Chrome instead of Pyppeteer and Chromium.

Changed in version 3.11: Implemented measures to reduce the chance of detection.

Changed in version 3.14: Saves the screenshot, full page image and HTML contents when a job fails while running in verbose mode.

Required directives

url

The URI of the resource to monitor. https://, http://, ftp:// and file:// are supported.

Optional directives (all url jobs)

The following optional directives are available for all url jobs:

use_browser

Whether to use a Chrome web browser (true/false). Defaults to false.

If true, it renders the URL via a JavaScript-enabled web browser and extracts the HTML after rendering (see above for important information).

compared_versions

Number of saved snapshots to compare against (int). Defaults to 1.

If set to a number greater than 1, instead of comparing the current data to only the very last snapshot captured, it is matched against any of n snapshots. This is very useful when a webpage frequently changes between several known stable states (e.g. they’re doing A/B testing), as changes will be reported only when the content changes to a new unknown state, in which case the differences are shown relative to the closest match.

Refer to the command line argument --max-snapshots to ensure that you are saving the number of snapshots you need for this directive to run successfully (default is 4) (see here).

New in version 3.10.2.
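
As a minimal sketch of this directive, a job for a page that alternates between a few known variants might look like the following. The job name and URL are placeholders, and the value 3 is an arbitrary choice that stays within the default of 4 saved snapshots mentioned above.

```yaml
name: Page under A/B testing        # hypothetical job name
url: https://example.org/landing    # placeholder URL
compared_versions: 3                # match against any of the last 3 snapshots
```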

cookies

Cookies to send with the request (a dict).

See examples here.

Changed in version 3.0: Works for all url jobs, including those with use_browser: true.
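
For illustration only, a cookies dict could be written as below. The cookie names and values are invented for this sketch; use whatever cookies the site you monitor actually expects.

```yaml
name: Page requiring a session cookie   # hypothetical job name
url: https://example.org/account        # placeholder URL
cookies:
  sessionid: abc123                     # hypothetical cookie name and value
  language: en                          # hypothetical cookie name and value
```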

enabled

Convenience setting to disable running the job while leaving it in the jobs file (true/false). Defaults to true.

New in version 3.18.

headers

Headers to send along with the request (a dict).

See examples here.

Note that with use_browser: true the Referer header will be replaced by the contents of the referer directive if specified.

Changed in version 3.0: Works for all url jobs, including those with use_browser: true.
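
A short sketch of a headers dict follows. The header values are illustrative assumptions rather than recommended settings, and the URL is a placeholder.

```yaml
name: Page with custom headers   # hypothetical job name
url: https://example.org/        # placeholder URL
headers:
  Accept-Language: en-US         # illustrative value
  User-Agent: my-monitor/1.0     # illustrative value
```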

http_client

The Python HTTP client library to be used, either HTTPX or requests. Defaults to HTTPX.

HTTPX is the default because some web servers refuse a connection or serve an error when a client uses a protocol older than HTTP/2. Use http_client: requests to use the requests library, which was the default in releases prior to 3.16 but supports only up to the HTTP/1.1 protocol.

Required packages

To use http_client: requests, unless the requests library is already installed in the system, you need to first install additional Python packages as follows:

pip install --upgrade webchanges[requests]

New in version 3.16.

http_proxy

Proxy server to use for HTTP requests (a string). If unspecified or null/false, the system environment variable HTTP_PROXY, if defined, will be used. Can be one of https://, http:// or socks5:// protocols.

E.g. https://username:password@proxy.com:8080.

Changed in version 3.0: Works for all url jobs, including those with use_browser: true.

https_proxy

Proxy server to use for HTTPS (i.e. secure) requests (a string). If unspecified or null/false, the system environment variable HTTPS_PROXY, if defined, will be used. Can be one of https://, http:// or socks5:// protocols.

E.g. https://username:password@proxy.com:8080.

Changed in version 3.0: Works for all url jobs, including those with use_browser: true.
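
As a sketch, a job that sends its HTTPS request through a SOCKS5 proxy might be written as follows. The proxy host and port are placeholders; because the job’s URL is https://, the https_proxy directive is the one that applies.

```yaml
name: Page fetched through a proxy              # hypothetical job name
url: https://example.org/                       # placeholder URL
https_proxy: socks5://proxy.example.com:1080    # placeholder proxy address
```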

data

The request payload to send with an HTTP request method like POST (a dict or string).

If the data is a dict, it will be sent urlencoded unless the directive data_as_json: true is also present, in which case it will be serialized as JSON before being sent.

When this directive is specified:

  • If no method directive is specified, it is set to POST.

  • If no Content-type header is specified, such header is set to application/x-www-form-urlencoded unless the data_as_json: true directive is present, in which case it is set to application/json.

See example here.

Changed in version 3.8: Works for all url jobs, including those with use_browser: true.

Changed in version 3.15: Added data_as_json: true.

data_as_json

The data in data is to be sent in JSON format (true/false). Defaults to false.

If true, the data will be serialized into JSON before being sent, and if no Content-type header is specified, such header is set to application/json.

See example within the directive ‘data’.

New in version 3.15.
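
The interaction between data and data_as_json can be sketched as below. The endpoint and the payload keys are hypothetical. With these directives, and per the rules described under data, the method would default to POST and the Content-type header to application/json.

```yaml
name: Search API                        # hypothetical job name
url: https://example.org/api/search     # placeholder endpoint
data:
  query: widgets                        # hypothetical payload key and value
  limit: 10                             # hypothetical payload key and value
data_as_json: true                      # serialize the dict as JSON instead of urlencoding it
```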

method

HTTP request method to use (a string).

Must be one of GET, OPTIONS, HEAD, POST, PUT, PATCH, or DELETE. Defaults to GET unless the data directive is set, in which case it defaults to POST.

Error

Setting a method other than GET with use_browser: true may result in any 3xx redirections received from the website being ignored and the job hanging until it times out. This is due to bug #937719 in Chromium. Please take the time to add a star to the bug report so it will be prioritized for a faster fix.

Changed in version 3.8: Works for all url jobs, including those with use_browser: true.

no_conditional_request

In order to speed things up, webchanges sets the If-Modified-Since and/or If-None-Match headers on all requests, making them conditional requests (see more here). In extremely rare cases (e.g. Google Flights) the If-Modified-Since header will cause the website to hang or return invalid data; in such cases you can disable conditional requests with the directive no_conditional_request: true, which ensures these headers are not added to the request.
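
A minimal sketch of a job with conditional requests turned off (the name and URL are placeholders):

```yaml
name: Site that mishandles conditional requests   # hypothetical job name
url: https://example.org/flights                  # placeholder URL
no_conditional_request: true
```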

note

Informational note added under the header in reports (a string). Example:

name: Weather warnings
note: If there's a hurricane watch, book a flight to get out of town
url: https://example.org/weatherwarnings

New in version 3.2.

ignore_cached

Do not use cache control values (ETag/Last-Modified) (true/false). Defaults to false.

Also see no_conditional_request.

Changed in version 3.10: Works for all url jobs, including those with use_browser: true.

ignore_connection_errors

Ignore (temporary) connection errors (true/false). Defaults to false.

See more here.

Changed in version 3.5: Works for all url jobs, including those with use_browser: true.

ignore_http_error_codes

Ignore error if a specified HTTP response status code is received (an integer, string, or list).

Also accepts 3xx, 4xx, and 5xx as values to denote an entire class of response status codes. For example, 4xx will suppress any error from 400 to 499 inclusive, i.e. all client error response status codes.

See more here.

Changed in version 3.5: Works for all url jobs, including those with use_browser: true.
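
For example, a job that should stay silent on any client error could be sketched as follows (the name and URL are placeholders):

```yaml
name: Page that is sometimes missing         # hypothetical job name
url: https://example.org/sometimes-missing   # placeholder URL
ignore_http_error_codes: 4xx                 # suppress errors for status codes 400 to 499
```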

ignore_timeout_errors

Ignore error if caused by a timeout (true/false). Defaults to false.

See more here.

Changed in version 3.5: Works for all url jobs, including those with use_browser: true.

ignore_too_many_redirects

Ignore error if caused by a redirect loop (true/false). Defaults to false.

See more here.

Changed in version 3.5: Works for all url jobs, including those with use_browser: true.

timeout

Override the default timeout, in seconds (a number). The default is 60 seconds for url jobs unless they have the directive use_browser: true, in which case it’s 90 seconds. If set to 0, timeout is disabled.

See example here.

Changed in version 3.0: Works for all url jobs, including those with use_browser: true.
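
A brief sketch of overriding the timeout for a slow server; the name and URL are placeholders and 120 is an arbitrary value:

```yaml
name: Slow page                        # hypothetical job name
url: https://example.org/slow-page     # placeholder URL
timeout: 120                           # seconds, instead of the default 60 for this job type
```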

Optional directives (without use_browser: true)

The following directives are available only for url jobs without use_browser: true:

encoding

Override the character encoding from the server or determined programmatically by the HTTP client library (a string).

See more here.

ignore_dh_key_too_small

Enable insecure workaround for servers using a weak (smaller than 2048-bit) Diffie-Hellman key (true/false). Defaults to false.

A weak key can allow a man-in-the-middle attack through the Logjam attack against the TLS protocol and therefore generates an error. This workaround attempts the use of a potentially weaker cipher, one that doesn’t rely on a DH key and therefore doesn’t trigger the error.

Set it as a last resort if you’re getting a ssl.SSLError: [SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:1129) error and can’t get anyone to fix the security vulnerability on the server.

New in version 3.9.2.

no_redirects

Disables GET, OPTIONS, POST, PUT, PATCH, DELETE, HEAD redirection (true/false). Defaults to false (i.e. redirection is enabled) for all methods except HEAD. See more here. Redirection takes place whenever an HTTP status code of 301, 302, 303, 307 or 308 is returned.

Example:

url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/203001.pdf
no_redirects: true
filter:
  - html2text:

Returns:

302 Found
---------

# Found
The document has moved [here](https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible).
* * *
Apache/2.2.15 (CentOS) Server at donneespubliques.meteofrance.fr Port 80

New in version 3.2.7.

retries

Number of times to retry a url after receiving an error before giving up (a number). Defaults to 0.

Setting it to 1 will often solve the ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) error received when attempting to connect to a misconfigured server.

url: https://www.example.com/
retries: 1
filter:
  - html2text:

ssl_no_verify

Do not verify SSL certificates (true/false).

See more here.

Optional directives (only with use_browser: true)

The following directives are available only for url jobs with use_browser: true (i.e. using Chrome):

block_elements

Do not load specified resource types requested by page loading (a list).

Used to speed up loading (typical elements to skip include stylesheet, font, image, media, and other).

name: This is a JavaScript site
note: It's just a test
url: https://www.example.com
use_browser: true
block_elements:
  - stylesheet
  - font
  - image
  - media
  - other

Supported resources are document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, and other.

New in version 3.19.

ignore_default_args

If true, Playwright does not pass its own configuration args to Google Chrome and only uses the ones from switches (args in Playwright-speak); if a list is given, then it filters out the given default arguments (true/false or list). Defaults to false.

Dangerous option; use with care. However, the following settings at times improve things:

New in version 3.10.

ignore_https_errors

Ignore HTTPS errors (true/false). Defaults to false.

New in version 3.0.

init_script

Executes the JavaScript in Chrome after launching it and before navigating to url (a string).

This could be useful to e.g. unset certain default Chrome navigator properties by calling a JavaScript function to do so.

New in version 3.19.

initialization_js

Only used with initialization_url, executes the JavaScript in Chrome after navigating to initialization_url and before navigating to url (a string).

This could be useful to e.g. emulate logging in when it’s done by a JavaScript function.

New in version 3.10.

initialization_url

The browser will navigate to initialization_url before navigating to url (a string).

This could be useful for monitoring subpages on websites that rely on a state established when first landing on their “home” page. Also see initialization_js above. Note that all the wait_for_* directives apply only after navigating to url and not after initialization_url.

New in version 3.10.
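
As a sketch of how initialization_url and initialization_js fit together, the job below first loads a home page, runs a snippet there, and only then navigates to the page being monitored. The URLs, the cookie name, and the JavaScript snippet are all invented for illustration; a real site will need its own logic.

```yaml
name: Subpage that needs the home page first              # hypothetical job name
url: https://example.org/account/orders                   # placeholder URL to monitor
use_browser: true
initialization_url: https://example.org/                  # placeholder; visited first
initialization_js: "document.cookie = 'consent=yes';"     # hypothetical snippet run on initialization_url
```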

referer

The referer header value (a string).

If provided, it will take precedence over the Referer header value set within the headers directive.

New in version 3.10.

switches

Additional command line switch(es) to pass to Google Chrome, which is a derivative of Chromium (a list). These are called args in Playwright.

New in version 3.0.

user_data_dir

A path to a pre-existing user directory (containing e.g. cookies) that Chrome should be using (a string).

New in version 3.0.

wait_for_function

Waits for a JavaScript string to be evaluated in the browser context to return a truthy value (a string or dict).

If the string (or the string in the expression key of the dict) looks like a function declaration, it is interpreted as a function. Otherwise, it is evaluated as an expression.

Additional options can be passed when a dict is used: see here.

If wait_for_url and/or wait_for_selector is also used, wait_for_function is applied after.

New in version 3.10.

Changed in version 3.10: Replaces wait_for with a JavaScript function.

wait_for_selector

Waits for the element specified by selector string to become visible (a string or dict).

An element is considered visible when it has a non-empty bounding box and no visibility:hidden. Note that an element without any content or with display:none has an empty bounding box and is not considered visible.

Selectors supported include text, css, layout, XPath, React and Vue, as well as the :has-text(), :text(), :has() and :nth-match() pseudo classes. More information on working with selectors is here.

Additional options (especially what state to wait for, which could be one of attached, detached and hidden in addition to the default visible) can be passed by using a dict. See here for all the arguments and additional details.

If wait_for_url is also used, wait_for_selector is applied after.

New in version 3.10.

Changed in version 3.10: Replaces wait_for with a selector or xpath string.
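
A minimal sketch using a CSS selector in its string form; the URL and the element id are placeholders. Note that the selector is quoted because, as described in the YAML tips, an unquoted # would start a remark.

```yaml
name: Dashboard rendered by JavaScript   # hypothetical job name
url: https://example.org/dashboard       # placeholder URL
use_browser: true
wait_for_selector: "#results"            # hypothetical element id; quotes are required
```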

wait_for_timeout

Waits for the given timeout in seconds (a number).

If wait_for_url, wait_for_selector and/or wait_for_function is also used, wait_for_timeout is applied after.

Cannot be used with block_elements.

New in version 3.10.

Changed in version 3.10: Replaces wait_for with a number.

wait_for_url

Wait until navigation lands on a URL matching this text (a string or dict).

The string (or the string in the url key of the dict) can be a glob pattern or regex pattern to match while waiting for the navigation. Note that if the parameter is a string without wildcard characters, the method will wait for navigation to a URL that is exactly equal to the string.

Useful to avoid capturing intermediate redirect pages.

Additional options can be passed when a dict is used: see here.

If other wait_for_* directives are used, wait_for_url is applied first.

Cannot be used with block_elements.

New in version 3.10.

Changed in version 3.10: Replaces wait_for_navigation.
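
As an illustration of the glob form, the job below waits until navigation lands on a URL ending in /home before capturing the page. The URLs and the pattern are placeholders; the pattern is quoted because a leading * is a special character in YAML.

```yaml
name: Page behind an intermediate redirect   # hypothetical job name
url: https://example.org/login-redirect      # placeholder URL
use_browser: true
wait_for_url: "**/home"                      # hypothetical glob pattern
```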

wait_until

The event of when to consider navigation succeeded (a string):

  • load (default): Consider operation to be finished when the load event is fired.

  • domcontentloaded: Consider operation to be finished when the DOMContentLoaded event is fired.

  • networkidle (old networkidle0 and networkidle2 map here): Consider operation to be finished when there are no network connections for at least 500 ms.

  • commit: Consider operation to be finished when network response is received and the document started loading.

New in version 3.0.

Changed in version 3.10: networkidle0 and networkidle2 are replaced by networkidle; added commit.
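
For instance, a page that keeps loading data after the load event might be monitored with the following sketch (the name and URL are placeholders):

```yaml
name: Page that loads data late   # hypothetical job name
url: https://example.org/         # placeholder URL
use_browser: true
wait_until: networkidle           # wait for no network connections for at least 500 ms
```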

Command

This job type allows you to watch the output of arbitrary shell commands. This could be useful for monitoring files in a folder, output of scripts that query external devices (RPi GPIO), and many other applications.

name: What is in my home directory?
command: dir -al ~

Important

On Linux and macOS systems, due to security reasons, a command job or a job with a command differ will not run unless both the jobs file and the directory it is located in are owned and writeable by only the user who is running the job (and not by its group or by other users). To set this up:

cd ~/.config/webchanges  # could be different
sudo chown $USER:$(id -g -n) . *.yaml
sudo chmod go-w . *.yaml
  • sudo may or may not be required.

  • Replace $USER with the username that runs webchanges if different than the user you’re logged in as when making the above changes, similarly with $(id -g -n) for the group.

Internally, this type of job has the attribute kind: command.

Changed in version 3.11: kind attribute was renamed from shell to command but the former is still recognized.

Required directives

command

The shell command to execute.

Optional directives (for all job types)

These optional directives apply to all job types:

additions_only

Filter the unified diff output to keep only addition lines (no value required).

See here.

New in version 3.0.

deletions_only

Filter the unified diff output to keep only deleted lines (no value required).

See here.

New in version 3.0.

diff_filter

Filter(s) to be applied to the diff result (a list of dicts).

See here.

Can be tested with --test-differ.

diff_tool (deprecated)

Deprecated command to an external tool for generating diff text (a string). See new Differs directive command.

Replace:

diff_tool: my_command

with:

differ:
  command: my_command

Changed in version 3.21: Deprecated and replaced with differ command.

Changed in version 3.0.1:

  • Reports now show date/time of diffs generated using diff_tool.

  • Output from diff_tool: wdiff is colorized in html reports.

filter

Filter(s) to apply to the data retrieved (a list of dicts).

See here.

Can be tested with --test.

kind

For Python programmers only, this is used to associate the job to a custom job Class defined in hooks.py, by matching the contents of this directive to the __kind__ variable of the custom Class.

The three built-in job Classes are:

  • kind: url for url jobs without the use_browser directive;

  • kind: browser for url jobs with the use_browser: true directive;

  • kind: command for command jobs (formerly called shell).

is_markdown

Data is in Markdown format (true/false). Defaults to false unless set by a filter such as html2text.

Tells the html report that the data is in Markdown format and should be reconstructed into HTML.

max_tries

Number of consecutive times the job has to fail before reporting an error (an integer). Defaults to 1.

Due to legacy naming, this directive doesn’t do what intuition would tell you it should do. Rather than limiting the number of attempts, it tells webchanges not to report a job error until the job has failed max_tries consecutive times.

Specifically, when a job fails for any reason, webchanges increases an internal counter; it will report an error only when this counter reaches or exceeds the number of max_tries (default: 1, i.e. at the first error encountered). The internal counter is reset to 0 when the job succeeds.

For example, if you set a job with max_tries: 12 and run webchanges every 5 minutes, you will only get notified after the job has failed every single time during the span of one hour (5 minutes * 12 = 60 minutes), and from then onwards at every run until the job succeeds again and the counter resets to 0.
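
The scenario just described corresponds to a job along these lines; the name and URL are placeholders, and the 5-minute schedule is set outside the jobs file by whatever runs webchanges.

```yaml
name: Flaky server              # hypothetical job name
url: https://example.org/flaky  # placeholder URL
max_tries: 12                   # report an error only after 12 consecutive failures
```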

monospace

Data is to be reported using a monospace font (true/false). Defaults to false unless set to true by a filter or a differ (see that filter/differ).

When using an html report the data will be displayed using a monospace font. Useful e.g. when the pdf2text filter extracts tabular text.

New in version 3.9.

Changed in version 3.20: Default setting can be overridden by a filter or differ.
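
A sketch of the pdf2text case mentioned above, with a placeholder name and URL. It sets monospace explicitly; whether the filter already does so by default is not stated here.

```yaml
name: Price list PDF                 # hypothetical job name
url: https://example.org/prices.pdf  # placeholder URL
filter:
  - pdf2text:
monospace: true                      # keep extracted tabular text aligned in html reports
```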

name

Human-readable name/label of the job used in reports (a string).

If this directive is not specified, the label used in reports will either be the url or the command itself or, for url jobs retrieving HTML or XML data, the first 60 characters of the contents of the <title> field if found.

While jobs are executed in parallel for speed, they appear in the report in alphabetical order by name, so you can control the order in which they appear through their naming.

Changed in version 3.0: Added auto-detect <title> tag in HTML or XML.

Changed in version 3.11: Reports are sorted by job name.

user_visible_url

URL or text to use in reports instead of contents of url or command (a string).

Useful e.g. when a watched URL is a REST API endpoint or you are using a custom script but you want a link to the webpage on your report.

New in version 3.0.3.

Changed in version 3.8: Added support for command jobs; previously worked only with url jobs.
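
As a sketch of the REST API case, the job below monitors an API endpoint while reports link to the human-readable page. Both URLs are placeholders.

```yaml
name: Product availability                         # hypothetical job name
url: https://example.org/api/v1/products/42        # placeholder API endpoint being monitored
user_visible_url: https://example.org/products/42  # placeholder page shown in reports
```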

Setting default directives

See here for how to set default directives for all jobs or for jobs of an individual kind.