Filters

Overview

Filters can be applied at either of two stages of processing:

Applied to the downloaded data before storing it and diffing for changes (filter).
Applied to the diff result before reporting the changes (diff_filter).

While creating your job pipeline, you might want to preview what the filtered output looks like. For filters applied to the data, you can run webchanges with the --test command-line option, passing in the index (from --list) or the URL/command of the job to be tested:

webchanges --test 1   # Test the first job in the jobs list and show the data collected after it's filtered
webchanges --test https://example.net/  # Test the job that matches the given URL
webchanges --test -1  # Test the last job in the jobs list

This command shows the output that will be captured, stored, and compared with the version stored from a previous run against the same URL or command.

Once webchanges has collected at least 2 historic snapshots of a job (e.g. two different states of a webpage) you can start testing the effects of your diff_filter with the command-line option --test-differ, passing in the index (from --list) or the URL/command of the job to be tested:

webchanges --test-differ 1    # Test the first job in the jobs list and show the report
webchanges --test-differ -2   # Test the second-to-last job in the jobs list and show the report

At the moment, the following filters are available:

To select HTML (or XML) elements:
- css: Filter XML/HTML using CSS selectors.
- xpath: Filter XML/HTML using XPath expressions.
- element-by-class: Get all HTML elements matching a class.
- element-by-id: Get all HTML elements matching an id.
- element-by-style: Get all HTML elements matching a style.
- element-by-tag: Get all HTML elements matching a tag.
To extract text from HTML (or XML):
- html2text: Convert HTML to plaintext.
To make HTML more readable:
- beautify: Beautify HTML.
To fix HTML links from relative to absolute:
- absolute_links: Convert relative links in HTML to absolute ones.
To make CSV files more readable:
- csv2text: Convert CSV to plaintext.
To extract text from PDFs:
- pypdf: Convert PDF to plaintext (recommended).
- pdf2text: Convert PDF to plaintext (Poppler required to be manually installed in the OS).
To save images:
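- ascii85: Convert binary data such as images to text (for downstream differ image).
- base64: Convert binary data such as images to text using Base64 (less space-efficient than ascii85).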
To extract text from images:
- ocr: Extract text from images.
To extract ASCII text from JSON:
- jq: Filter ASCII JSON.
To make JSON more readable:
- jsontoyaml: Reformat JSON to YAML.
- format-json: Reformat (pretty-print) JSON.
To make XML more readable:
- format-xml: Reformat (pretty-print) XML (using lxml.etree).
- pretty-xml: Reformat (pretty-print) XML (using Python’s xml.minidom).
To make iCal more readable:
- ical2text: Convert iCalendar to plaintext.
To make binary readable:
- hexdump: Display data in hex dump format.
To just detect if anything changed:
- sha1sum: Calculate the SHA-1 checksum of the data.
To filter and/or edit text:
- delete_lines_containing: Delete lines containing specified text or matching a Python regular expression.
- keep_lines_containing: Keep only lines containing specified text or matching a Python regular expression.
- re.findall: Extract, replace or remove all non-overlapping text matching a Python regular expression.
- re.sub: Replace or remove text matching a Python regular expression.
- remove_repeated: Remove repeated items (lines).
- reverse: Reverse the order of items (lines).
- sort: Sort lines.
- strip: Strip leading and/or trailing whitespace or specified characters.
To run any custom script or program:
- execute: Run a program that filters the data (see also shellpipe, to be avoided).

Advanced Python programmers can write their own custom filters; see Hook your own Python code.

`absolute_links`

Converts the relative URLs in the action, href and src attributes of any HTML tag, as well as in the data attribute of the <object> tag, to absolute ones.

Note

This filter is not needed (and could interfere) if you already are using the beautify filter (which has an absolute_links sub-directive that defaults to true) or the html2text filter (which already converts relative links).

url: https://example.net/absolute_links.html
filters:
  - absolute_links
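The kind of conversion this filter performs can be illustrated with Python's standard library. This is only a sketch of the underlying idea, not webchanges' actual implementation; the page URL and the relative links below are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page on which the links were found.
base_url = "https://example.net/docs/absolute_links.html"

# Relative references as they might appear in href, src or action attributes.
relative_links = ["images/logo.png", "/about", "https://other.example/page"]

# urljoin resolves each reference against the page URL; links that are
# already absolute are returned unchanged.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

Resolving links this way means that a page moving its assets between relative and absolute notation does not by itself show up as a change.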

Added in version 3.16.

Changed in version 3.21: Converts URLs of all action, href and src attributes found in any tag, as well as the data attribute of the <object> tag.

`ascii85`

Encodes binary data (e.g. image data) to text using Ascii85. Ascii85 is more space-efficient than Base64, encoding more bytes into fewer characters. This filter can be useful to monitor images in combination with the image differ.

url: https://example.net/favicon_85.ico
filters:
  - ascii85

Added in version 3.21.
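To get a feel for the space savings, the two encodings can be compared with Python's standard base64 module. This is a rough sketch; it assumes the filters behave like a85encode and b64encode, which may not match webchanges' exact settings, and the sample bytes are made up.

```python
import base64

# Made-up binary payload standing in for image data (1024 bytes).
data = bytes(range(256)) * 4

a85 = base64.a85encode(data)  # Ascii85: 4 bytes become 5 characters
b64 = base64.b64encode(data)  # Base64: 3 bytes become 4 characters

print(f"raw: {len(data)}  ascii85: {len(a85)}  base64: {len(b64)}")

# Both encodings are lossless and can be reversed.
assert base64.a85decode(a85) == data
assert base64.b64decode(b64) == data
```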

`base64`

Encodes binary data (e.g. image data) to text using RFC 4648 Base64. This filter can be useful to monitor images in combination with the image differ. Also see ascii85, which is more efficient.

url: https://example.net/favicon.ico
filters:
  - base64

Added in version 3.16.

`beautify`

This filter uses the Beautiful Soup, jsbeautifier and cssbeautifier Python packages to reformat the HTML in a document to make it more readable (keeping it as HTML).

url: https://example.net/beautify.html
filters:
  - beautify: 1

Optional sub-directives

absolute_links (true/false): Convert relative links to absolute ones (default: true).
indent (integer or string): If indent is a non-negative integer or a string, the contents of HTML elements are indented accordingly when pretty-printing. An indent of 0, a negative number, or “” only inserts newlines. A positive integer indents by that many spaces per level. If indent is a string (such as “\t”), that string is used to indent each level (default: 1, i.e. one space per level).

url: https://example.net/beautify_absolute_links_false.html
filters:
  - beautify:
      absolute_links: false
      indent: 1

Changed in version 3.16: Relative links are converted to absolute ones; use the absolute_links: false sub-directive to disable.

Changed in version 3.16: Added absolute_links sub-directive.

Changed in version 3.9.2: Added indent sub-directive.

Required packages

To run jobs with this filter, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[beautify]

`css`

The css filter extracts HTML or XML content based on a CSS selector. It uses the cssselect Python package, which has limitations and extensions as explained in its documentation. This filter works very similarly to, and its sub-directives are almost identical to, those of the xpath filter.

Example: to extract only the <li> elements with class unchecked that are direct children of the <ul> element with id groceries, stripping out everything else:

url: https://example.net/css.html
filters:
  - css: ul#groceries > li.unchecked

Tip

If you are looking at a website using Google Chrome, you can find the css of an HTML node in DevTools (Ctrl+Shift+I) by right clicking on the element and selecting ‘Copy -> Copy selector’. You can learn more about Chrome DevTools here.

Using the `css` filter with XML

By default, the css filter is set up to handle HTML documents, but it also works on XML documents when you declare the sub-directive method: xml.

For example, to parse an RSS feed for the titles and publication dates, use:

url: https://example.com/blog/css-index.rss
filters:
  - css:
      method: xml
      selector: 'item > title, item > pubDate'
  - html2text: strip_tags

To match an element in an XML namespace, use a namespace prefix before the tag name. Use a | to separate the namespace prefix and the tag name in a CSS selector.

url: https://example.org/feed/css-namespace.xml
filters:
  - css:
      method: xml
      selector: 'item > media|keywords'
      namespaces:
        media: http://search.yahoo.com/mrss/
  - html2text:

Using the `css` filter to exclude content

Elements selected by the exclude sub-directive are removed from the final result. For example, the following job will not have any <a> tag in its results:

url: https://example.org/css-exclude.html
filters:
  - css:
      selector: 'body'
      exclude: 'a'

Limiting the returned items from a CSS selector

If you only want to return a subset of the items returned by a CSS selector, you can use two additional sub-directives:

skip: How many elements to skip from the beginning (default: 0).
maxitems: How many elements to return at most (default: no limit).

For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:

url: https://example.net/css-skip-maxitems.html
filters:
  - css:
      selector: div.cpu
      skip: 1
      maxitems: 2
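The effect of skip and maxitems amounts to taking a slice of the list of matched elements. A minimal sketch, assuming a hypothetical helper limit_items and made-up match names (this is not webchanges' code):

```python
def limit_items(items, skip=0, maxitems=None):
    """Drop the first `skip` items, then keep at most `maxitems` of the rest."""
    end = None if maxitems is None else skip + maxitems
    return items[skip:end]


# Made-up stand-ins for the elements matched by the selector div.cpu.
matches = ["cpu-1", "cpu-2", "cpu-3", "cpu-4"]

# skip: 1 and maxitems: 2 select the second and third match.
selected = limit_items(matches, skip=1, maxitems=2)
print(selected)
```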

Duplicated results

If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use maxitems: 1 to only return the first item.

Sorting output to fix list reordering

In some cases, the ordering of items on a webpage might change regularly without the actual content changing. By default, this would show up in the diff output as an element being removed from one part of the page and inserted in another part of the page.

In cases where the order of items doesn’t matter, it’s possible to sort matched items lexicographically to avoid spurious reports when only the ordering of items changes on the page.

The sub-directive of the css filter that enables this is sort, which can be true or false (the default):

url: https://example.org/css-items-random-order.html
filters:
  - css:
      selector: span.item
      sort: true

Alternatively, you can chain the sort filter.

Optional sub-directives

selector (default): The CSS selector.
method: Either of html (default) or xml.
namespaces: Mapping of XML namespaces for matching (default: None).
exclude: CSS selector for elements to remove from the final result (default: None).
skip: Number of elements to skip from the beginning (default: 0).
maxitems: Maximum number of items to return (default: all).
sort (true/false): Sort elements lexicographically (default: false).

`csv2text`

The csv2text filter turns tabular data formatted as comma-separated values (CSV) into a more readable textual representation. You supply a Python format string, and the values of each CSV row are substituted into it. If the CSV has a header, the format string should use the header names (lowercased).

For example, given the following csv data:

Name,Company
Smith,Apple
Doe,Google

we can make it more readable by using:

url: https://example.org/data.csv
filters:
  - csv2text:
     format_message: Mr. or Ms. {name} works at {company}.  # note the lowercase in the replacement_fields
     has_header: true

to produce:

Mr. or Ms. Smith works at Apple.
Mr. or Ms. Doe works at Google.

If there is no header row, or ignore_header is set to true, you will need to use numeric replacement fields: Mr. or Ms. {0} works at {1}.

Optional sub-directives

format_message (default): The Python format string containing “replacement fields” into which the data from the csv is substituted. Field names are the column headers (in lowercase) if the data has column headers or numeric starting from 0 if the data has no column headers or ignore_header is set to true.
has_header (true/false): Specifies whether the first row is a series of column headers (default: use the rough heuristics provided by Python’s csv.Sniffer.has_header method).
ignore_header (true/false): If set to true, it will parse the format_message as having numeric replacement fields even if the data has column headers (or has_header, immediately above, is set to true).
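The substitution described above can be approximated in a few lines of Python. This is a simplified sketch, not the filter's actual code; the function name csv_to_text is hypothetical, and the header detection via csv.Sniffer is left out for brevity.

```python
import csv
import io


def csv_to_text(data, format_message, has_header=True, ignore_header=False):
    """Format each CSV row using a Python format string."""
    rows = list(csv.reader(io.StringIO(data)))
    header = []
    if has_header:
        # Header names are lowercased before being used as field names.
        header = [name.lower() for name in rows[0]]
        rows = rows[1:]
    lines = []
    for row in rows:
        if has_header and not ignore_header:
            lines.append(format_message.format(**dict(zip(header, row))))
        else:
            # Numeric replacement fields: {0}, {1}, ...
            lines.append(format_message.format(*row))
    return "\n".join(lines)


data = "Name,Company\nSmith,Apple\nDoe,Google\n"
by_name = csv_to_text(data, "Mr. or Ms. {name} works at {company}.")
by_index = csv_to_text(data, "Mr. or Ms. {0} works at {1}.", ignore_header=True)
print(by_name)
```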

`delete_lines_containing`

This filter discards all lines that contain the text specified (default) or match the Python regular expression specified (with re sub-directive), keeping the others.

Note that while this filter emulates Linux’s grep, it does not use the executable grep.

Examples:

name: "eliminate lines that contain 'xyz'"
url: https://example.com/delete_lines_containing.txt
filters:
  - delete_lines_containing: 'xyz'

name: "eliminate lines that start with 'warning' irrespective of its case (e.g. Warning, WARNING, warning, etc.)"
url: https://example.com/delete_lines_containing_re.txt
filters:
  - delete_lines_containing:
      re: '(?i)^warning'

Notes: in regex, (?i) is the inline flag for case-insensitive matching and ^ (caret) matches the start of the string.

Optional sub-directives

text (default): Match the text provided.
re: Match the Python regular expression provided.
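The behavior of the two sub-directives can be sketched in plain Python as follows. The function below is a hypothetical illustration; in particular, the use of re.search on each line is an assumption about how the regular expression is applied.

```python
import re


def delete_lines_containing(data, text=None, regex=None):
    """Drop the lines that contain `text` or that match `regex`."""
    lines = data.splitlines()
    if regex is not None:
        kept = [line for line in lines if not re.search(regex, line)]
    else:
        kept = [line for line in lines if text not in line]
    return "\n".join(kept)


sample = "ok\nWarning: disk\nhas xyz inside\nwarning: cpu\nlast"
without_xyz = delete_lines_containing(sample, text="xyz")
without_warnings = delete_lines_containing(sample, regex=r"(?i)^warning")
print(without_warnings)
```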

Changed in version 3.0: Renamed from grepi to avoid confusion.

`element-by-` [ `class` | `id` | `style` | `tag` ]

The filters element-by-class, element-by-id, element-by-style, and element-by-tag allow you to select all matching instances of a given HTML element.

Examples:

To extract only the <body> of a page:

url: https://example.org/bodytag.html
filters:
  - element-by-tag: body

To extract <div id="something">...</div> from a page:

url: https://example.org/idtest.html
filters:
  - element-by-id: something

Since you can chain filters, use this to extract an element within another element:

url: https://example.org/idtest_2.html
filters:
  - element-by-id: outer_container
  - element-by-id: something_inside

To make the output human-friendly you can chain html2text on the result:

url: https://example.net/id2text.html
filters:
  - element-by-id: something
  - html2text:

To extract <div style="something">...</div> from a page:

url: https://example.org/styletest.html
filters:
  - element-by-style: something

`execute`

The data to be filtered is passed as the input to a command to be run, and the output from the command is used in webchanges’s next step. All environment variables are preserved and the following ones added:

Environment variable	Description
`WEBCHANGES_JOB_JSON`	All job parameters in JSON format
`WEBCHANGES_JOB_LOCATION`	Value of either `url` or `command`
`WEBCHANGES_JOB_NAME`	Name of the job
`WEBCHANGES_JOB_NUMBER`	The job’s index number

For example, we can execute a Python script:

name: Test execute filter
url: https://example.net/execute.html
filters:
  # For multiline YAML, quote the string and unindent its continuation. A space is added at the end
  # of each line. Pay attention to escaping!
  - execute: "python -c \"import os, sys;
  print(f\\\"The data is '{sys.stdin.read()}'\\nThe job location is
  '{os.getenv('WEBCHANGES_JOB_LOCATION')}'\\nThe job name is
  '{os.getenv('WEBCHANGES_JOB_NAME')}'\\nThe job number is
  '{os.getenv('WEBCHANGES_JOB_INDEX_NUMBER')}'\\nThe job JSON is
  '{os.getenv('WEBCHANGES_JOB_JSON')}'\\\", end='')\""

Or instead we can call a script we have saved, e.g. - execute: python3 myscript.py.

If the command generates an error, the output of the error will be in the first line, before the traceback.
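The mechanism described in this section (data passed on standard input, output read back, environment extended with job information) can be sketched with Python's subprocess module. This is an illustration under assumptions rather than webchanges' code: the helper name run_filter_command is hypothetical, and the command is passed as an argument list rather than as a single string.

```python
import os
import subprocess
import sys


def run_filter_command(command, data, job_env):
    """Feed `data` to `command` on stdin and return what it writes to stdout."""
    env = os.environ.copy()  # preserve the existing environment variables
    env.update(job_env)      # add the job-specific ones
    result = subprocess.run(
        command, input=data, capture_output=True, text=True, env=env, check=True
    )
    return result.stdout


# A tiny child program that upper-cases its input and appends the job name.
child = (
    "import os, sys; "
    "print(sys.stdin.read().upper() + os.environ['WEBCHANGES_JOB_NAME'], end='')"
)
output = run_filter_command(
    [sys.executable, "-c", child],
    "hello ",
    {"WEBCHANGES_JOB_NAME": "demo"},
)
print(output)
```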

Tip

If you are running on Windows and are getting a UnicodeEncodeError, make sure that you are running Python in UTF-8 mode as per instructions here.

Optional sub-directives

command (default, str): The command to execute.
escape_characters (bool): When running in Windows, escape characters in command (e.g. % becomes %% and ! becomes ^!) (default: false).

Changed in version 3.8: Added additional WEBCHANGES_JOB_* environment variables.

Changed in version 3.34: Added escape_characters sub-directive.

`format-json`

This filter serializes the JSON data to a pretty-printed indented string using Python’s json.dumps (or, if installed, the same function from the simplejson library) with a default indent level of 4.

Tip

For a more compact and legible output, use jsontoyaml instead.

If the job directive monospace is unset, this filter sets it to true to improve readability in HTML reports. To override, add the directive monospace: false to the job (see here).

Optional sub-directives

indentation (integer or string): Either the number of spaces or a string to be used to indent each level with; if 0, a negative number or "" then no indentation (default: 4, i.e. 4 spaces).
sort_keys (true/false): Whether to sort the output of dictionaries by key (default: false).
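The reformatting can be reproduced with the standard json module. The sketch below uses a hypothetical wrapper format_json; the ensure_ascii=False argument is an assumption made here to keep non-ASCII characters readable and may not reflect the filter's settings.

```python
import json


def format_json(data, indentation=4, sort_keys=False):
    """Parse a JSON string and serialize it again, pretty-printed."""
    return json.dumps(
        json.loads(data), indent=indentation, sort_keys=sort_keys, ensure_ascii=False
    )


pretty = format_json('{"b": 1, "a": [1, 2]}', sort_keys=True)
print(pretty)
```

Sorting the keys can reduce spurious diffs when a server returns the same object with its keys in a different order.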

Added in version 3.0.1: sort_keys sub-directive.

Changed in version 3.20: The filter sets the job’s monospace directive to true.

`format-xml`

This filter deserializes an XML object and reformats it. It uses the etree.tostring function of the lxml Python package with pretty-printing enabled.

name: "reformat XML using lxml's etree.tostring"
url: https://example.com/format_xml.xml
filters:
  - format-xml:

Added in version 3.0.

`hexdump`

This filter displays the contents as both hexadecimal byte values and ASCII, using the hex dump format.

name: Display binary and ASCII test
command: cat testfile
filters:
  - hexdump:

`html2text`

This filter converts HTML (or XML) to Unicode text.

Optional sub-directives

method: One of:

html2text (default): Uses the html2text Python package and retains some simple formatting from HTML, outputting Markdown with absolute links;

bs4: Uses the Beautiful Soup Python package to extract text from either HTML or XML;

strip_tags: Uses regex to strip tags (HTML or XML).

`html2text`

This method is the default (does not need to be specified) and converts HTML into Markdown using the html2text Python package.

Warning

As this filter relies on the external html2text Python package, new releases of this package may generate text that is formatted slightly differently, and, if so, will cause webchanges to send a one-off change report.

It is the recommended option to convert all types of HTML into readable text, as it can be displayed (after conversion) in HTML.

Example configuration:

url: https://example.com/html2text.html
filters:
  - xpath: '//section[@role="main"]'
  - html2text:
      pad_tables: true

Note

If the content has tables, adding the sub-directive pad_tables: true may improve readability.

Optional sub-directives

See the optional sub-directives in the html2text Python package’s documentation. The following options are set by webchanges but can be overridden:
- unicode_snob: true to ensure that accented characters are kept as they are;
- body_width: 0 to ensure that lines aren’t chopped up;
- ignore_images: true to ignore images (since we’re dealing with text);
- single_line_break: true to ensure that additional empty lines aren’t added between sections;
- wrap_links: false to ensure that links are not wrapped (in case body_width is not set to 0) as it’s not Markdown compatible.

`strip_tags`

This filter method is a simple HTML/XML tag stripper based on applying a regular expression-based function. Very fast but may not yield the prettiest of results.

url: https://example.com/html2text_strip_tags.html
filters:
  - html2text: strip_tags
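A regular-expression tag stripper of this kind can be as small as the sketch below. The pattern shown is an assumption chosen for illustration; the expression used by webchanges may differ.

```python
import re


def strip_tags(markup):
    """Remove anything that looks like an HTML/XML tag, keeping the text."""
    return re.sub(r"<[^>]*>", "", markup)


text = strip_tags("<p>Hello <b>world</b></p>")
print(text)
```

Because no parsing takes place, entities such as &amp; are left as they are and whitespace is not normalized, which is why the output may not be as tidy as that of the other methods.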

`bs4`

This filter method extracts visible text from HTML using the Beautiful Soup Python package, specifically its get_text(strip=True) method.

url: https://example.com/html2text_bs4.html
filters:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: bs4
      strip: true

Parsers

Beautiful Soup supports multiple parsers as documented here. We default to the use of the lxml parser as recommended, but you can specify the parser by using the parser sub-directive:

url: https://example.com/html2text_bs4_html5lib.html
filters:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: bs4
      parser: html5lib
      strip: true

Extracting text from XML

This filter can be used to extract text from XML by using the xml parser as follows:

url: https://example.com/html2text_bs4_xml
filters:
  - html2text:
      method: bs4
      parser: xml

Optional sub-directives

parser: the name of the parser library you want to use as per documentation (default: lxml).
separator: Strings extracted from the HTML or XML object will be concatenated using this separator (defaults to the empty string "").
strip (true/false): If true, strings will be stripped before being concatenated (defaults to false).

Required packages

To run jobs with this filter method, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[bs4]

If (and only if) you specify parser: html5lib, then you also need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[bs4,html5lib]

Changed in version 3.0: Filter defaults to the use of Python html2text package.

Changed in version 3.0: Method re renamed to strip_tags.

Deprecated since version urlwatch: Removed method lynx (external OS-specific dependency).

`ical2text`

This filter reads an iCalendar document and converts it to easy-to-read text.

name: "Make iCal file readable"
url: https://example.com/cal.ics
filters:
  - ical2text:

Required packages

To run jobs with this filter, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[ical2text]

`jq`

Linux/macOS ASCII only

The jq filter uses the Python bindings for jq, a lightweight ASCII JSON processor. It is currently available only for Linux (most flavors) and macOS (no Windows) and does not handle Unicode; see below for a cross-platform and Unicode-friendly way of selecting JSON.

Please also note that command line arguments of the standalone jq program are NOT supported by this library.

url: https://example.net/jq-ascii.json
filters:
  - jq: '.[].title'

Supports aggregations, selections, and the built-in operators like length.

For more information on the operations permitted, see the jq Manual.

Required packages

To run jobs with this filter, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[jq]

Filtering JSON on Windows, or JSON containing Unicode, without `jq`

Python programmers on all OSs can use an advanced technique to select only certain elements of the JSON object; see Selecting items from a JSON dictionary. This method will preserve Unicode characters.

`jsontoyaml`

This filter serializes the JSON data to YAML using Python’s PyYAML library with a default indent level of 2.

If the job directive monospace is unset, this filter sets it to true to improve readability in HTML reports. To override, add the directive monospace: false to the job (see here).

Optional sub-directives

indentation (integer or string): Either the number of spaces or a string to be used to indent each level with; if 0, a negative number or "" then no indentation (default: 2, i.e. 2 spaces).

Added in version 3.30.

`keep_lines_containing`

This filter keeps only lines that contain the text specified (default) or match the Python regular expression specified (with re sub-directive), discarding the others.

Note that while this filter emulates Linux’s grep, it does not use the executable grep.

Examples:

name: "convert HTML to text and only keep lines that have the sequence 'a,b:' in them"
url: https://example.com/keep_lines_containing.html
filters:
  - html2text:
  - keep_lines_containing: 'a,b:'

name: "keep only lines that contain 'error' irrespective of its case (e.g. Error, ERROR, error, etc.)"
url: https://example.com/keep_lines_containing_re.txt
filters:
  - keep_lines_containing:
      re: '(?i)error'

Note: in regex (?i) is the inline flag for case-insensitive matching.

Optional sub-directives

text (default): Match the text provided.
re: Match the Python regular expression provided.

Changed in version 3.0: Renamed from grep to avoid confusion.

`ocr`

This filter extracts text from images using the Tesseract OCR engine. Any file format supported by the Pillow (PIL Fork) Python package is supported.

This filter must be the first filter in a chain of filters, since it consumes binary data.

url: https://example.net/ocr-test.png
filters:
  - ocr:
      timeout: 5
      language: eng

Optional sub-directives

timeout: Timeout for the recognition, in seconds (default: 10 seconds).
language: Text language (e.g. fra or eng+fra) (default: eng).

Required packages

To run jobs with this filter, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[ocr]

In addition, you need to install Tesseract itself.

`pdf2text`

This filter converts a PDF file to plaintext using the pdftotext Python library, itself based on the Poppler library.

For most uses, we recommend using the filter pypdf, which achieves similar results without having to separately install OS-specific dependencies (Poppler).

This filter must be the first filter in a chain of filters, since it consumes binary data.

url: https://example.net/pdf-test.pdf
filters:
  - pdf2text

If the PDF file is password protected, you can specify its password:

url: https://example.net/pdf-test-password.pdf
filters:
  - pdf2text:
      password: webchangessecret

By default, pdf2text tries to reproduce the layout of the original document by using spaces. Be aware that these spaces may change when a document is updated, so you may get reports containing a lot of changes consisting of nothing but changes in the spacing between the columns; in this case try turning it off with the sub-directive physical: false.

url: https://example.net/pdf-test-no-physical-layout.pdf
filters:
  - pdf2text:
      physical: false
monospace: true

Tip

If your reports are in HTML format and the PDF is columnar in nature, try using the job directive monospace: true to improve readability (see here).

url: https://example.net/pdf-test-keep-monospace.pdf
filters:
  - pdf2text:
monospace: true

Conversely, if you don’t care about the layout, you might want to strip the additional spaces that this filter may add:

url: https://example.net/pdf-no-multiple-spaces.pdf
filters:
  - pdf2text:
  - re.sub:
      pattern: ' +'
      repl: ' '
  - strip:
      splitlines: true

Optional sub-directives

password: Password for a password-protected PDF file.
physical (true/false): If true, page text is output in the order it appears on the page, regardless of columns or other layout features (default: true). Only one of raw and physical can be set to true.
raw (true/false): If true, page text is output in the order it appears in the content stream (default: false). Only one of raw and physical can be set to true.

Changed in version 3.8.2: Added physical and raw sub-directives.

Required packages

To run jobs with this filter, you need to first install additional Python packages as follows:

uv pip install --upgrade webchanges[pdf2text]

In addition, you need to install any of the OS-specific dependencies of Poppler (see website).

`pretty-xml`

This filter deserializes an XML object and pretty-prints it. It uses Python’s xml.dom.minidom toprettyxml function.

name: "reformat XML using Python's xml.dom.minidom toprettyxml function"
url: https://example.com/pretty_xml.xml
filters:
  - pretty-xml:

Added in version 3.3.

`pypdf`

This filter converts a PDF file to plaintext using the pypdf Python library.

This filter must be the first filter in a chain of filters, since it consumes binary data.

url: https://example.net/pypdf-test.pdf
filters:
  - pypdf

If the PDF file is password protected, you can specify its password:

url: https://example.net/pypdf-test-password.pdf
filters:
  - pypdf:
      password: webchangessecret

The pypdf library locates all text drawing commands in the order they appear in the PDF’s content stream, and then extracts the text. To extract text in a fixed width format that closely adheres to the rendered layout in the source PDF (experimental), use the sub-directive extraction_mode: layout:

url: https://example.net/pypdf-test-layout.pdf
filters:
  - pypdf:
      extraction_mode: layout

Tip

If your reports are in HTML format and the PDF is columnar in nature, try using the job directive monospace: true to improve readability (see here).

url: https://example.net/pypdf-test-monospace.pdf
filters:
  - pypdf:
      extraction_mode: layout
monospace: true

If the layout is not a concern, you may want to remove any additional spaces that the filter might have introduced.

url: https://example.net/pypdf-no-multiple-spaces.pdf
filters:
  - pypdf:
  - re.sub:
      pattern: ' +'
      repl: ' '
  - strip:
      splitlines: true


Note

Users should be aware that updating the underlying pypdf library may trigger webchanges to generate a new report, even if the actual content of the PDFs has not changed. This is due to the potential formatting improvements introduced by pypdf updates.

Optional sub-directives

password: Password for a password-protected PDF file (dependency required; see below).
extraction_mode: set to layout for experimental layout mode functionality.

Added in version 3.16.

Changed in version 3.27: Added extraction_mode sub-directive.

Required packages

To run jobs with this filter, you need to first install additional Python packages. If you’re not using the password sub-directive, then use the following:

uv pip install --upgrade webchanges[pypdf]

To run jobs with the password sub-directive, use the following instead:

uv pip install --upgrade webchanges[pypdf_crypto]

`re.findall`

This filter extracts, deletes or replaces non-overlapping text using Python’s re.findall regular expression operation.

Just specifying a regular expression (regex) or string as the value will extract the match. Patterns can be replaced with another string using pattern as the expression and repl as the replacement, or deleted by setting repl to an empty string.

All features are described in Python’s re.findall’s documentation. The pattern is first iteratively matched using re.finditer and the repl value is applied to each non-overlapping match; if repl is missing, then group “0” (the entire match) is extracted.

Each match is outputted on its own line.

The following example applies the filter twice:

- Just specifying a string as the value will include the full match in the output (the default).
- You can use groups (()) and back-reference them with \1 (etc.) to put groups into the replacement string.

url: https://example.com/regex-findall.html
filters:
    - re.findall: '<span class="price">.*</span>'
    - re.findall:
        pattern: 'Price: \$([0-9]+)'
        repl: '\1'

Tip

Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.

You can use the entire range of Python’s regular expression (regex) syntax, and you can ask your favorite Generative AI chatbot for help. Some examples:

To extract the first line:

url: https://example.com/regex-firstline.html
command: python -c "[print(f'line {n}') for n in range(1, 3)]"
filters:
  - re.findall: '^.*'

To extract the last line, we use the inline MULTILINE flag ((?m)) and look for a line ((^.*$)) that is not followed (negative lookahead assertion) by a newline plus additional text ((?!\n.+)):

url: https://example.com/regex-lastline.html
command: python -c "[print(f'line {n}') for n in range(3)]"
filters:
  - re.findall: '(?m)(^.*$)(?!\n.+)'
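The matching logic described in this section (iterate over the matches with re.finditer, apply repl to each, output one match per line) can be tried out directly in Python. The helper findall_filter is a hypothetical sketch written for this illustration, not the filter's source code.

```python
import re


def findall_filter(data, pattern, repl=r"\g<0>"):
    """Expand `repl` for every non-overlapping match, one result per line."""
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


text = "line 0\nline 1\nline 2"

# Without the MULTILINE flag, ^ only matches at the very start of the data.
first = findall_filter(text, r"^.*")

# With (?m), the negative lookahead rejects every line followed by more text.
last = findall_filter(text, r"(?m)(^.*$)(?!\n.+)")

# Using a group and a back-reference to keep only part of each match.
prices = findall_filter("Price: $42, Price: $7", r"Price: \$([0-9]+)", r"\1")

print(first, last, prices, sep="\n")
```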

Optional sub-directives

pattern: Regular expression pattern or string for matching; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.findall (in which case a match will be extracted).
repl: The string applied iteratively to each match (default: ‘\g<0>’, i.e. extract all matches).

Added in version 3.20.

`re.sub`

This filter deletes or replaces text using Python’s re.sub regular expression operation.

Just specifying a regular expression (regex) or string as the value will remove the match. Patterns can be replaced with another string by specifying repl as the replacement.

All features are described in the documentation of Python’s re.sub. The pattern and repl values are passed to this function as-is; if repl is missing, it is treated as an empty string, and this filter deletes the leftmost non-overlapping occurrences of pattern.

Tip

Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.

The following example applies the filter 3 times:

name: "Strip href and change a few tags"
url: https://example.com/re_sub.html
filters:
  - re.sub: '\s*href="[^"]*"'
  - re.sub:
      pattern: '<h1>'
      repl: 'HEADING 1: '
  - re.sub:
      pattern: '</([^>]*)>'
      repl: '<END OF TAG \1>'
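
The same three substitutions can be reproduced with plain Python to see what each step does. This is only a sketch on a made-up HTML fragment; the variable names are illustrative.

```python
import re

# Made-up input, for illustration only.
html = '<h1><a href="page.html">Title</a></h1>'

# Step 1: no repl, so the match is deleted (replaced by an empty string).
step1 = re.sub(r'\s*href="[^"]*"', "", html)
# Step 2: replace a literal tag with plain text.
step2 = re.sub(r"<h1>", "HEADING 1: ", step1)
# Step 3: reuse the captured tag name in the replacement via \1.
step3 = re.sub(r"</([^>]*)>", r"<END OF TAG \1>", step2)

print(step3)
```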

You can use the entire range of Python’s regular expression (regex) syntax. For example, use groups (()) in the pattern and \1 (etc.) to refer to these groups in the repl, as in the example below, which replaces the number of milliseconds (which may vary each time you check this page and would therefore generate a change report) with an X (which never changes):

name: "Replace a changing number in a sentence with an X"
url: https://example.com/re_sub_group.html
filters:
  - html2text:
  - re.sub:
      pattern: '(Page generated in )([0-9.])*( milliseconds.)'
      repl: '\1X\3'
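
To see why this keeps the report quiet, the sketch below applies the same pattern and replacement to two made-up snapshots that differ only in the timing value. After substitution they are identical, so there is nothing left to diff.

```python
import re

pattern = r"(Page generated in )([0-9.])*( milliseconds.)"
repl = r"\1X\3"

# Two made-up snapshots of the same sentence with different timings.
run1 = "Page generated in 12.5 milliseconds."
run2 = "Page generated in 7.25 milliseconds."

normalized1 = re.sub(pattern, repl, run1)
normalized2 = re.sub(pattern, repl, run2)
print(normalized1)
print(normalized1 == normalized2)
```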

Optional sub-directives

pattern: Regular expression pattern or string to match for replacement; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.sub (in which case a match will be deleted).
repl: The string for replacement (default: empty string, i.e. deletes the string matched in pattern).

`remove_repeated`

This filter compares adjacent items (lines), and the second and succeeding copies of repeated items (lines) are removed. Repeated items (lines) must be adjacent in order to be found. Works similarly to Unix’s uniq.

By default, it acts over adjacent lines. Three lines consisting of dog - dog - cat will be turned into dog - cat, while dog - cat - dog will stay the same.

url: https://example.com/remove-repeated.txt
filters:
  - remove_repeated

This behavior can be changed by using an optional separator string argument. Also, ignore_case will tell it to ignore differences in case and of leading and/or trailing whitespace when comparing. For example, the below will turn mixed-case items separated by a pipe (|) a|b|B |c into a|b|c:

url: https://example.net/remove-repeated-separator.txt
filters:
  - remove_repeated:
      separator: '|'
      ignore_case: true

Prepend it with sort to capture globally unique lines, e.g. to turn dog - cat - dog into cat - dog:

url: https://example.com/remove-repeated-sorted.txt
filters:
  - sort
  - remove_repeated

Finally, setting the adjacent sub-directive to false will cause all duplicates to be removed, even if not adjacent. For example, the below will turn items separated by a pipe (|) a|b|a|c into a|b|c:

url: https://example.net/remove-repeated-non-adjacent.txt
filters:
  - remove_repeated:
      separator: '|'
      adjacent: false
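
The behaviors described above (adjacent versus global de-duplication, a custom separator, and ignore_case) can be sketched in a few lines of Python. This is an illustrative approximation and not the code webchanges runs. The function name remove_repeated and its keyword arguments simply mirror the sub-directives, and the use of casefold() for case-insensitive comparison is an assumption.

```python
def remove_repeated(data, separator="\n", ignore_case=False, adjacent=True):
    """Hypothetical sketch: drop repeated items, keeping the first copy."""

    def key(item):
        # With ignore_case, compare without case and surrounding whitespace.
        return item.strip().casefold() if ignore_case else item

    kept = []
    seen = set()
    previous = object()  # sentinel that never equals a string
    for item in data.split(separator):
        k = key(item)
        if adjacent:
            if k != previous:
                kept.append(item)
            previous = k
        elif k not in seen:
            kept.append(item)
            seen.add(k)
    return separator.join(kept)


print(remove_repeated("dog\ndog\ncat"))
print(remove_repeated("a|b|B |c", separator="|", ignore_case=True))
print(remove_repeated("a|b|a|c", separator="|", adjacent=False))
```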

Optional sub-directives

separator (default): The string used to separate the items to be checked for repetition (default: \n, i.e. line-based); it can also be specified inline as the value of remove_repeated.
ignore_case: Ignore differences in case and of leading and/or trailing whitespace when comparing (true/false) (default: false).
adjacent: Remove only adjacent lines or items (true/false) (default: true).

Added in version 3.8.

Changed in version 3.13: Added adjacent sub-directive.

`reverse`

This filter reverses the order of items (lines) without sorting:

url: https://example.com/reverse-lines.txt
filters:
  - reverse

This behavior can be changed by using an optional separator string argument (e.g. items separated by a pipe (|) symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):

url: https://example.net/reverse-separator.txt
filters:
  - reverse: '|'

Alternatively, the filter can be specified more verbosely with a dict. In this example "\n\n" is used to separate paragraphs (items that are separated by an empty line):

url: https://example.org/reverse-paragraphs.txt
filters:
  - reverse:
      separator: "\n\n"
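
Conceptually, reversing with a separator amounts to splitting, reversing, and joining again. The sketch below shows that idea in Python; the function name reverse_items is hypothetical and the code is not taken from webchanges.

```python
def reverse_items(data, separator="\n"):
    """Hypothetical sketch: reverse the order of separator-delimited items."""
    return separator.join(reversed(data.split(separator)))


print(reverse_items("1|4|2|3", separator="|"))
print(reverse_items("first paragraph\n\nsecond paragraph", separator="\n\n"))
```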

Optional sub-directives

separator: The string used to separate items whose order is to be reversed (default: \n, i.e. line-based reversing); it can also be specified inline as the value of reverse.

`sha1sum`

This filter calculates a SHA-1 hash of the contents. It is useful when you want to be notified that something has changed without needing any detail of the change, and it avoids saving large snapshots of data.

name: "Calculate SHA-1 hash"
url: https://example.com/sha.html
filters:
  - sha1sum:
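
For reference, a SHA-1 digest can be computed in Python with hashlib, as in the sketch below. It assumes the data is text encoded as UTF-8 before hashing and that the result is shown as a hexadecimal string; whether webchanges makes exactly these choices is not stated here. The function name sha1_of is hypothetical.

```python
import hashlib


def sha1_of(data: str) -> str:
    """Hypothetical sketch: hex SHA-1 digest of UTF-8 encoded text."""
    return hashlib.sha1(data.encode("utf-8")).hexdigest()


before = sha1_of("abc")
after = sha1_of("abd")
print(before)
print(before != after)
```

Because only the fixed-length digest is stored, any change in the data shows up as a different hash, but the report cannot say what changed.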

`shellpipe`

This filter works like execute, except that an intermediate shell process is spawned to run the command. This is to allow for certain corner situations (e.g. relying on variables, glob patterns, and other special shell features in the command) that the execute filter cannot handle.

Danger

The execution of a shell command opens up all sorts of security issues, and the use of this filter should be avoided in favor of the execute filter.

Example:

url: https://example.net/shellpipe.html
filters:
  - shellpipe: echo TEST

Important

On Linux and macOS systems, due to security reasons the shellpipe filter will not run unless both the jobs file and the directory it is located in are owned and writeable by only the user who is running the job (and not by its group or by other users) or by the root user. To set this up:

cd ~/.config/webchanges  # could be different
sudo chown $USER:$(id -g -n) . *.yaml
sudo chmod go-w . *.yaml

sudo may or may not be required.
If making the change from a different account than the one you run webchanges from, replace $USER:$(id -g -n) with the username:group of the account running webchanges.
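
To check whether a file or directory already meets the ownership and permission requirement described above, something like the following Python sketch can be used on Linux or macOS. It is only an approximation of the rule as stated in this section (owned by the current user or root, and not writable by group or others) and is not the check that webchanges itself performs. The function names are hypothetical.

```python
import os
import stat


def is_private(owner_uid: int, mode: int, current_uid: int) -> bool:
    """Hypothetical check: owned by current user or root, no group/other write."""
    owned = owner_uid in (current_uid, 0)
    shared_write = bool(mode & (stat.S_IWGRP | stat.S_IWOTH))
    return owned and not shared_write


def path_is_private(path: str) -> bool:
    """Apply is_private to a path on disk (POSIX only, uses os.getuid)."""
    st = os.stat(path)
    return is_private(st.st_uid, st.st_mode, os.getuid())


# Examples with explicit values: uid 1000 checking files with various modes.
print(is_private(1000, 0o100644, 1000))  # owner-writable only
print(is_private(1000, 0o100664, 1000))  # group-writable
```

Calling path_is_private on both the jobs file and its directory would cover the two items named in the requirement.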

Tip

If you are running on Windows and getting a UnicodeEncodeError, make sure that you are running Python in UTF-8 mode as per instructions here.

Optional sub-directives

command (default, str): The command to execute.
escape_characters (bool): When running in Windows, escape characters in command (e.g. % becomes %% and ! becomes ^!) (default: false).

Changed in version 3.34: Added escape_characters sub-directive.

`sort`

This filter performs line-based sorting, ignoring case (i.e. case folding as per Python’s implementation).

If the source provides data in random order, you should sort it before the comparison in order to avoid diffing based only on changes in the sequence.

name: "Sorting lines test"
url: https://example.net/sorting.txt
filters:
  - sort

The sort filter takes an optional separator parameter that defines the item separator (by default sorting is line-based), for example to sort text paragraphs (text separated by an empty line):

url: https://example.org/paragraphs.txt
filters:
  - sort:
      separator: "\n\n"

This can be combined with a true/false reverse option, which is useful for sorting and reversing with the same separator (using % as separator, this would turn 3%2%4%1 into 4%3%2%1):

url: https://example.org/sort-reverse-percent.txt
filters:
  - sort:
      separator: '%'
      reverse: true
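
The combination of separator, case folding, and reverse can be pictured with the short Python sketch below. It is an illustration of the described behavior, not webchanges's implementation; the function name sort_items is hypothetical.

```python
def sort_items(data, separator="\n", reverse=False):
    """Hypothetical sketch: case-insensitive sort of separator-delimited items."""
    items = data.split(separator)
    return separator.join(sorted(items, key=str.casefold, reverse=reverse))


print(sort_items("3%2%4%1", separator="%", reverse=True))
print(sort_items("banana\nApple\ncherry"))
```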

Optional sub-directives

separator (default): The string used to separate items to be sorted (default: \n, i.e. line-based sorting).
reverse (true/false): Whether the sorting direction is reversed (default: false).

`strip`

This filter removes leading and trailing whitespace, or any of a specified set of characters, from the data. Whitespace includes the characters space, tab, linefeed, return, formfeed, and vertical tab.

name: "Strip leading and trailing whitespace from the block of data"
url: https://example.com/strip.html
filters:
  - strip:

name: "Strip trailing commas or periods from all lines"
url: https://example.com/strip_by_line.html
filters:
  - strip:
      chars: ',.'
      side: right
      splitlines: true

name: "Strip beginning spaces, tabs, etc. from all lines"
url: https://example.com/strip_leading_spaces.txt
filters:
  - strip:
      side: left
      splitlines: true

name: "Strip spaces, tabs etc. from both ends of all lines"
url: https://example.com/strip_each_line.html
filters:
  - strip:
      splitlines: true
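
The four jobs above map closely onto Python's str.strip, str.lstrip, and str.rstrip. The sketch below shows that correspondence under the assumption that the sub-directives behave as described in this section. The function name strip_filter is hypothetical and the code is not taken from webchanges.

```python
def strip_filter(data, chars=None, side=None, splitlines=False):
    """Hypothetical sketch of the strip options using str methods."""
    method = {"left": str.lstrip, "right": str.rstrip}.get(side, str.strip)
    if splitlines:
        # Apply the stripping to every line separately.
        return "\n".join(method(line, chars) for line in data.splitlines())
    # Otherwise treat the data as one block.
    return method(data, chars)


print(repr(strip_filter("  whole block \n")))
print(repr(strip_filter("a,\nb.\nc", chars=",.", side="right", splitlines=True)))
print(repr(strip_filter("  a\n\tb", side="left", splitlines=True)))
```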

Optional sub-directives

chars (default): A string specifying the set of characters to be removed instead of the default whitespace.
side: For one-sided removal: either left (strip only leading whitespace or matching characters) or right (strip only trailing whitespace or matching characters).
splitlines (true/false): Apply the filter on each line of text (default: false, apply to the entire data as a block).

Changed in version 3.5: Added optional sub-directives chars, side and splitlines.

`xpath`

The xpath filter extracts HTML or XML content based on an XPath version 1.0 expression. This filter works very similarly to the css filter, and its sub-directives are almost identical.

See Microsoft’s XPath Examples page for additional information on XPath.

Warning

Make sure to use XPath 1.0 syntax and avoid using certain constructs available only in later versions, as in many cases they will simply be ignored without an error and cause unexpected results to be returned.

Example: to keep only the <marquee> elements located directly under the <body> element of the HTML document, stripping out everything else:

url: https://example.net/xpath.html
filters:
  - xpath: /html/body/marquee

Tip

If you are looking at a website using Google Chrome, you can find the XPath of an HTML node in DevTools (Ctrl+Shift+I) by right-clicking on the element and selecting ‘Copy -> Copy XPath’. You can learn more about Chrome DevTools here.

Using the `xpath` filter with XML

By default, the xpath filter is set up to handle HTML documents, but it also works on XML documents by declaring the sub-directive method: xml.

For example, to parse an RSS feed for the titles and publication dates, use:

url: https://example.com/blog/xpath-index.rss
filters:
  - xpath:
      method: xml
      path: //item/title/text()|//item/pubDate/text()

To match an element in an XML namespace, use a namespace prefix before the tag name. Use a : to separate the namespace prefix and the tag name in an XPath expression, and map the prefix to its URI with the namespaces sub-directive.

url: https://example.net/feed/xpath-namespace.xml
filters:
  - xpath:
      method: xml
      path: //item/media:keywords/text()
      namespaces:
        media: http://search.yahoo.com/mrss/
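
The way a prefix is mapped to a namespace URI can be tried out with Python's standard library. The sketch below uses xml.etree.ElementTree, which supports only a limited subset of XPath (for instance, it has no text() step, so the text is read from each element instead); it is meant to show the prefix mapping only and is not the engine behind the xpath filter. The feed content is made up.

```python
import xml.etree.ElementTree as ET

# Made-up feed, for illustration only.
rss = """<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <media:keywords>alpha, beta</media:keywords>
    </item>
    <item>
      <title>Second post</title>
      <media:keywords>gamma</media:keywords>
    </item>
  </channel>
</rss>"""

# Same prefix-to-URI mapping as in the job's namespaces sub-directive.
namespaces = {"media": "http://search.yahoo.com/mrss/"}

root = ET.fromstring(rss)
keywords = [el.text for el in root.findall(".//item/media:keywords", namespaces)]
print(keywords)
```

The prefix used in the expression only has to match the key in the mapping; it is the URI that identifies the namespace.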

Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass the namespace entirely.

Using the `xpath` filter to exclude content

Elements selected by the exclude sub-directive are removed from the final result. For example, the following job will not have any <a> tag in its results:

url: https://example.org/xpath-exclude.html
filters:
  - xpath:
      path: //body
      exclude: //a

Limiting the returned items from an XPath expression

If you only want to return a subset of the items returned by an XPath expression, you can use two additional sub-directives:

skip: How many elements to skip from the beginning (default: 0).
maxitems: How many elements to return at most (default: no limit).

For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:

url: https://example.net/xpath-skip-maxitems.html
filters:
  - xpath:
      path: //div[@class="cpu"]
      skip: 1
      maxitems: 2
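
The effect of skip and maxitems is comparable to slicing a list of matches, as the Python sketch below shows. The function name limit_items and the sample list are hypothetical; the sketch only restates the rule given above.

```python
def limit_items(matches, skip=0, maxitems=None):
    """Hypothetical sketch: drop the first `skip` matches, keep at most `maxitems`."""
    end = None if maxitems is None else skip + maxitems
    return matches[skip:end]


matches = ["cpu 1", "cpu 2", "cpu 3", "cpu 4"]
print(limit_items(matches, skip=1, maxitems=2))
print(limit_items(matches, maxitems=1))
```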

Duplicated results

If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use maxitems: 1 to only return the first item.

Sorting output to fix list reordering

In some cases, the ordering of items on a webpage might change regularly without the actual content changing. By default, this would show up in the diff output as an element being removed from one part of the page and inserted in another part of the page.

In cases where the order of items doesn’t matter, it’s possible to sort matched items lexicographically to avoid spurious reports when only the ordering of items changes on the page.

The sub-directive of the xpath filter that enables this is sort, which can be true or false (the default):

url: https://example.org/xpath-items-random-order.html
filters:
  - xpath:
      path: //span[@class="item"]
      sort: true

Alternatively, you can chain the sort filter.

Optional directives

path (default): the XPath expression.
method: Either of html (default) or xml.
namespaces: Mapping of XML namespaces for matching (default: None).
exclude: XPath expression for elements to remove from the final result (default: None).
skip: Number of elements to skip from the beginning (default: 0).
maxitems: Maximum number of items to return (default: all).
sort (true/false): Sort elements lexicographically (default: false).
