Filters
Overview
Filters can be applied at either of two stages of processing:
Applied to the downloaded data before storing it and diffing for changes (filter).
Applied to the diff result before reporting the changes (diff_filter).
While creating your job pipeline, you might want to preview what the filtered output looks like. For filters applied
to the data, you can run webchanges with the --test-filter command-line option, passing in the index
(from --list) or the URL/command of the job to be tested:
webchanges --test 1 # Test the first job in the jobs list and show the data collected after it's filtered
webchanges --test https://example.net/ # Test the job that matches the given URL
webchanges --test -1 # Test the last job in the jobs list
This command shows the output that will be captured, stored, and compared with the snapshot saved by a previous run against the same URL or command.
Once webchanges has collected at least 2 historic snapshots of a job (e.g. two different states of a webpage)
you can start testing the effects of your diff_filter with the command-line option --test-differ, passing in the
index (from --list) or the URL/command of the job to be tested:
webchanges --test-differ 1 # Test the first job in the jobs list and show the report
webchanges --test-differ -2 # Test the second-to-last job in the jobs list and show the report
At the moment, the following filters are available:
To select HTML (or XML) elements:
css: Filter XML/HTML using CSS selectors.
xpath: Filter XML/HTML using XPath expressions.
element-by-class: Get all HTML elements matching a class.
element-by-id: Get all HTML elements matching an id.
element-by-style: Get all HTML elements matching a style.
element-by-tag: Get all HTML elements matching a tag.
To extract text from HTML (or XML):
html2text: Convert HTML to plaintext.
To make HTML more readable:
beautify: Beautify HTML.
To fix HTML links from relative to absolute:
absolute_links: fix HTML relative links.
To make CSV files more readable:
csv2text: Convert CSV to plaintext.
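To extract text from PDFs:
pdf2text: Convert PDF to plaintext (requires Poppler).
pypdf: Convert PDF to plaintext.
To save images:
ascii85: Encode binary data to text using Ascii85.
base64: Encode binary data to text using Base64.
To extract text from images: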
ocr: Extract text from images.
To extract ASCII text from JSON:
jq: Filter ASCII JSON.
To make JSON more readable:
jsontoyaml: Reformat JSON to YAML.
format-json: Reformat (pretty-print) JSON.
To make XML more readable:
format-xml: Reformat (pretty-print) XML (using lxml.etree).
pretty-xml: Reformat (pretty-print) XML (using Python’s xml.minidom).
To make iCal more readable:
ical2text: Convert iCalendar to plaintext.
To make binary readable:
hexdump: Display data in hex dump format.
To just detect if anything changed:
sha1sum: Calculate the SHA-1 checksum of the data.
To filter and/or edit text:
delete_lines_containing: Delete lines containing specified text or matching a Python regular expression.
keep_lines_containing: Keep only lines containing specified text or matching a Python regular expression.
re.findall: Extract, replace or remove all non-overlapping text matching a Python regular expression.
re.sub: Replace or remove text matching a Python regular expression.
remove_repeated: Remove repeated items (lines).
reverse: Reverse the order of items (lines).
sort: Sort lines.
strip: Strip leading and/or trailing whitespace or specified characters.
To run any custom script or program:
execute: Run a command that filters the data.
Advanced Python programmers can write their own custom filters; see Hook your own Python code.
absolute_links
Converts the relative URLs in all action, href and src attributes of any HTML tag, as well as the data
attribute of the <object> tag, to absolute ones.
Note
This filter is not needed (and could interfere) if you already are using the beautify filter (which has
an absolute_links sub-directive that defaults to true) or the html2text filter (which already converts
relative links).
url: https://example.net/absolute_links.html
filters:
- absolute_links
Added in version 3.16.
Changed in version 3.21: Converts URLs of all action, href and src attributes found in any tag as well as the data attribute
of the <object> tag.
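The kind of conversion described above can be illustrated with urljoin from Python's standard library. This is only a sketch of the idea, not the filter's actual implementation; the page URL and the link values are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page URL and link values, for illustration only.
page_url = "https://example.net/docs/absolute_links.html"
links = ["/images/logo.png", "page2.html", "https://example.org/other"]

# Relative links are resolved against the page URL; absolute ones are unchanged.
absolute = [urljoin(page_url, link) for link in links]
for link in absolute:
    print(link)
```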
ascii85
Encodes binary data (e.g. image data) to text using Ascii85. Ascii85 is more space-efficient than Base64, encoding more bytes into fewer characters. This filter can be useful to monitor images in combination with the image differ.
url: https://example.net/favicon_85.ico
filters:
- ascii85
Added in version 3.21.
base64
Encodes binary data (e.g. image data) to text using RFC 4648 Base64. This filter can be useful to monitor images in combination with the image differ. Also see ascii85, which is more efficient.
url: https://example.net/favicon.ico
filters:
- base64
Added in version 3.16.
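To see the difference in size between the two encodings, the following sketch encodes the same bytes with Python's standard base64 module. It only illustrates the encodings themselves; whether webchanges calls these exact functions is an assumption.

```python
import base64

# The 8-byte PNG file signature, standing in for image data.
data = b"\x89PNG\r\n\x1a\n"

b64 = base64.b64encode(data)  # 4 characters for every 3 bytes
a85 = base64.a85encode(data)  # 5 characters for every 4 bytes

print(len(data), len(b64), len(a85))
```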
beautify
This filter uses the Beautiful Soup, jsbeautifier and cssbeautifier Python packages to reformat the HTML in a document to make it more readable (keeping it as HTML).
url: https://example.net/beautify.html
filters:
- beautify: 1
Optional sub-directives
absolute_links (true/false): Convert relative links to absolute ones (default: true).
indent (integer or string): If indent is a non-negative integer or string, the contents of HTML elements are indented accordingly when pretty-printing them. An indent level of 0, a negative number, or "" only inserts newlines. A positive integer indents by that many spaces per level. If indent is a string (such as "\t"), that string is used to indent each level (default: 1, i.e. indent one space per level).
url: https://example.net/beautify_absolute_links_false.html
filters:
- beautify:
    absolute_links: false
    indent: 1
Changed in version 3.16: Relative links are converted to absolute ones; use the absolute_links: false sub-directive to disable.
Changed in version 3.16: Added absolute_links sub-directive.
Changed in version 3.9.2: Added indent sub-directive.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[beautify]
css
The css filter extracts HTML or XML content based on a CSS selector. It uses
the cssselect Python package, which has limitations and extensions as
explained in its documentation. This filter works
very similarly to, and its sub-directives are almost identical to, those of the xpath filter.
Example: to keep only the unchecked <li> items that are direct children of the <ul id="groceries"> list, stripping out everything else:
url: https://example.net/css.html
filters:
- css: ul#groceries > li.unchecked
Tip
If you are looking at a website using Google Chrome, you can find the css of an HTML node in DevTools (Ctrl+Shift+I) by right clicking on the element and selecting ‘Copy -> Copy selector’. You can learn more about Chrome DevTools here.
Using the css filter with XML
By default, the css filter is set up to handle HTML documents, but it also works on XML documents when you declare the
sub-directive method: xml.
For example, to parse an RSS feed for the titles and publication dates, use:
url: https://example.com/blog/css-index.rss
filters:
- css:
    method: xml
    selector: 'item > title, item > pubDate'
- html2text: strip_tags
To match an element in an XML namespace, use a namespace prefix before the tag
name. Use a | to separate the namespace prefix and the tag name in a CSS selector (an XPath expression uses a : instead).
url: https://example.org/feed/css-namespace.xml
filters:
- css:
    method: xml
    selector: 'item > media|keywords'
    namespaces:
      media: http://search.yahoo.com/mrss/
- html2text:
Using the css filter to exclude content
Elements selected by the exclude sub-directive are removed from the final result. For example, the following job
will not have any <a> tag in its results:
url: https://example.org/css-exclude.html
filters:
- css:
    selector: 'body'
    exclude: 'a'
Limiting the returned items from a CSS selector
If you only want to return a subset of the items returned by a CSS selector, you can use two additional sub-directives:
skip: How many elements to skip from the beginning (default: 0).
maxitems: How many elements to return at most (default: no limit).
For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:
url: https://example.net/css-skip-maxitems.html
filters:
- css:
    selector: div.cpu
    skip: 1
    maxitems: 2
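The effect of skip and maxitems is the same as slicing a list of matches. The sketch below is purely illustrative: the helper name limit and the sample matches are hypothetical and not part of webchanges.

```python
# Hypothetical list standing in for the elements matched by a selector.
matches = [
    "<div class='cpu'>A</div>",
    "<div class='cpu'>B</div>",
    "<div class='cpu'>C</div>",
    "<div class='cpu'>D</div>",
]


def limit(items, skip=0, maxitems=None):
    """Drop the first `skip` items, then keep at most `maxitems` of the rest."""
    end = None if maxitems is None else skip + maxitems
    return items[skip:end]


# Skip the first match and return at most two: the second and third elements.
selected = limit(matches, skip=1, maxitems=2)
print(selected)
```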
Duplicated results
If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and
desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use
maxitems: 1 to only return the first item.
Sorting output to fix list reordering
In some cases, the ordering of items on a webpage might change regularly without the actual content changing. By default, this would show up in the diff output as an element being removed from one part of the page and inserted in another part of the page.
In cases where the order of items doesn’t matter, it’s possible to sort matched items lexicographically to avoid spurious reports when only the ordering of items changes on the page.
The sub-directive of the css filter for this is sort, which can be true or false (the default):
url: https://example.org/css-items-random-order.html
filters:
- css:
    selector: span.item
    sort: true
Alternatively, you can chain the sort filter.
Optional directives
selector (default): The CSS selector.
method: Either html (default) or xml.
namespaces: Mapping of XML namespaces for matching (default: None).
exclude: CSS selector for elements to remove from the final result (default: None).
skip: Number of elements to skip from the beginning (default: 0).
maxitems: Maximum number of items to return (default: all).
sort (true/false): Sort elements lexicographically (default: false).
csv2text
The csv2text filter turns tabular data formatted as comma-separated values (CSV) into a prettier textual representation. This is done by supplying a Python format string into which the CSV data is substituted. If the CSV has a header, the format string should use the header names (lowercased).
For example, given the following csv data:
Name,Company
Smith,Apple
Doe,Google
we can make it more readable by using:
url: https://example.org/data.csv
filters:
- csv2text:
    format_message: Mr. or Ms. {name} works at {company}.  # note the lowercase in the replacement_fields
    has_header: true
to produce:
Mr. or Ms. Smith works at Apple.
Mr. or Ms. Doe works at Google.
If there is no header row, or ignore_header is set to true, you will need to use the numeric array notation: Mr.
or Ms. {0} works at {1}.
Optional sub-directives
format_message (default): The Python format string containing “replacement fields” into which the data from the CSV is substituted. Field names are the column headers (in lowercase) if the data has column headers, or numeric starting from 0 if the data has no column headers or ignore_header is set to true.
has_header (true/false): Specifies whether the first row is a series of column headers (default: use the rough heuristics provided by Python’s csv.Sniffer.has_header method).
ignore_header (true/false): If set to true, it will parse the format_message as having numeric replacement fields even if the data has column headers (or has_header, immediately above, is set to true).
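The substitution described above can be sketched with Python's csv module and str.format. This is a minimal illustration under the assumption that each row is formatted independently; the helper name csv_to_text is hypothetical and is not the filter's actual code.

```python
import csv
import io

csv_data = "Name,Company\nSmith,Apple\nDoe,Google\n"
format_message = "Mr. or Ms. {name} works at {company}."


def csv_to_text(data, format_message, has_header=True):
    """Format each CSV row with a Python format string (illustrative only)."""
    rows = list(csv.reader(io.StringIO(data)))
    if has_header:
        # Header names are lowercased before being used as field names.
        header = [name.lower() for name in rows[0]]
        return "\n".join(
            format_message.format(**dict(zip(header, row))) for row in rows[1:]
        )
    # Without a header, fields are addressed by position: {0}, {1}, ...
    return "\n".join(format_message.format(*row) for row in rows)


print(csv_to_text(csv_data, format_message))
```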
delete_lines_containing
This filter discards all lines that contain the text specified (default) or match the Python regular
expression specified (with the re sub-directive), keeping the others.
Note that while this filter emulates Linux’s grep, it does not use the executable grep.
Examples:
name: "eliminate lines that contain 'xyz'"
url: https://example.com/delete_lines_containing.txt
filters:
- delete_lines_containing: 'xyz'
name: "eliminate lines that start with 'warning' irrespective of its case (e.g. WARNING, Warning, warning, etc.)"
url: https://example.com/delete_lines_containing_re.txt
filters:
- delete_lines_containing:
    re: '(?i)^warning'
Notes: in regex, (?i) is the inline flag for case-insensitive matching and ^ (caret) matches the start of the string.
Optional sub-directives
text (default): Match the text provided.
re: Match the Python regular expression provided.
Changed in version 3.0: Renamed from grepi to avoid confusion.
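The line-by-line behavior can be pictured with the following sketch. It assumes plain substring matching for text and re.search for re, applied to each line in turn; the function is an illustration and not webchanges's own code.

```python
import re


def delete_lines_containing(data, text=None, regex=None):
    """Return `data` without the lines that contain `text` or match `regex`."""
    lines = data.splitlines(keepends=True)
    if regex is not None:
        pattern = re.compile(regex)
        kept = [line for line in lines if not pattern.search(line)]
    else:
        kept = [line for line in lines if text not in line]
    return "".join(kept)


sample = "Warning: disk full\nall good\nWARNING: again\n"
print(delete_lines_containing(sample, regex="(?i)^warning"), end="")
```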
element-by- [ class | id | style | tag ]
The filters element-by-class, element-by-id, element-by-style, and element-by-tag allow you to select all matching instances of a given HTML element.
Examples:
To extract only the <body> of a page:
url: https://example.org/bodytag.html
filters:
- element-by-tag: body
To extract <div id="something">...</div> from a page:
url: https://example.org/idtest.html
filters:
- element-by-id: something
Since you can chain filters, use this to extract an element within another element:
url: https://example.org/idtest_2.html
filters:
- element-by-id: outer_container
- element-by-id: something_inside
To make the output human-friendly you can chain html2text on the result:
url: https://example.net/id2text.html
filters:
- element-by-id: something
- html2text:
To extract <div style="something">...</div> from a page:
url: https://example.org/styletest.html
filters:
- element-by-style: something
execute
The data to be filtered is passed as the input to a command to be run, and the output from the command is used in webchanges’s next step. All environment variables are preserved and the following ones added:
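| Environment variable | Description |
|---|---|
| WEBCHANGES_JOB_JSON | All job parameters in JSON format |
| WEBCHANGES_JOB_LOCATION | Value of either url or command |
| WEBCHANGES_JOB_NAME | Name of the job |
| WEBCHANGES_JOB_INDEX_NUMBER | The job’s index number |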
For example, we can execute a Python script:
name: Test execute filter
url: https://example.net/execute.html
filters:
# For multiline YAML, quote the string and unindent its continuation. A space is added at the end
# of each line. Pay attention to escaping!
- execute: "python -c \"import os, sys;
print(f\\\"The data is '{sys.stdin.read()}'\\nThe job location is
'{os.getenv('WEBCHANGES_JOB_LOCATION')}'\\nThe job name is
'{os.getenv('WEBCHANGES_JOB_NAME')}'\\nThe job number is
'{os.getenv('WEBCHANGES_JOB_INDEX_NUMBER')}'\\nThe job JSON is
'{os.getenv('WEBCHANGES_JOB_JSON')}'\\\", end='')\""
Or instead we can call a script we have saved, e.g. - execute: python3 myscript.py.
If the command generates an error, the output of the error will be in the first line, before the traceback.
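A saved script such as the myscript.py mentioned above could look like the sketch below. It assumes the data arrives on standard input and the result is read from standard output; the transformation itself (upper-casing and prefixing the job name) and the helper name transform are invented for illustration.

```python
import os
import sys


def transform(data, env):
    """Prefix the data with the job name exported in the environment."""
    job_name = env.get("WEBCHANGES_JOB_NAME", "unknown job")
    return f"[{job_name}] {data.upper()}"


# Only read standard input when data is being piped in (not from a terminal).
if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.write(transform(sys.stdin.read(), os.environ))
```

Keeping the logic in a function separate from the input and output handling makes the script easy to test without running webchanges.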
Tip
If you are running on Windows and getting a UnicodeEncodeError, make sure that you are running Python in UTF-8
mode as per instructions here.
Optional sub-directives
command (default, str): The command to execute.
escape_characters (bool): When running in Windows, escape characters in command (e.g. % becomes %% and ! becomes ^!) (default: false).
Changed in version 3.8: Added additional WEBCHANGES_JOB_* environment variables.
Changed in version 3.34: Added escape_characters sub-directive.
format-json
This filter serializes the JSON data to a pretty-printed indented string using Python’s json.dumps (or, if installed, the same function from the simplejson library) with a default indent level of 4.
Tip
For a more compact and legible output, use jsontoyaml instead.
If the job directive monospace is unset, to improve the readability in HTML reports this filter will set it to
true. To override, add the directive monospace: false to the job (see here).
Optional sub-directives
indentation (integer or string): Either the number of spaces or a string used to indent each level; if 0, a negative number or "", no indentation is applied (default: 4, i.e. 4 spaces).
sort_keys (true/false): Whether to sort the output of dictionaries by key (default: false).
Added in version 3.0.1: sort_keys sub-directive.
Changed in version 3.20: The filter sets the job’s monospace directive to true.
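The serialization step can be reproduced with Python's json module, as in the sketch below. The wrapper function format_json is hypothetical; only json.loads and json.dumps are real library calls, and the sample data is made up.

```python
import json

raw = '{"b": [1, 2], "a": {"nested": true}}'


def format_json(data, indentation=4, sort_keys=False):
    """Parse a JSON string and pretty-print it (illustrative only)."""
    return json.dumps(json.loads(data), indent=indentation, sort_keys=sort_keys)


print(format_json(raw, sort_keys=True))
```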
format-xml
This filter deserializes an XML object and reformats it. It uses the lxml Python package’s etree.tostring pretty_print function.
name: "reformat XML using lxml's etree.tostring"
url: https://example.com/format_xml.xml
filters:
- format-xml:
Added in version 3.0.
hexdump
This filter displays the contents both in binary and ASCII using the hex dump format.
name: Display binary and ASCII test
command: cat testfile
filters:
- hexdump:
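For readers unfamiliar with hex dumps, the sketch below builds one in plain Python: each line shows the bytes in hexadecimal followed by their printable ASCII characters. The column widths are an assumption, and the exact layout produced by webchanges may differ.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as lines of hex values followed by printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        # Pad the hex column so the ASCII column lines up on short lines.
        lines.append(f"{hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)


print(hexdump(b"Hello\x00\xff"))
```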
html2text
This filter converts HTML (or XML) to Unicode text.
Optional sub-directives
method: One of:
html2text (default): Uses the html2text Python package and retains some simple formatting from HTML, outputting Markdown with absolute links;
bs4: Uses the Beautiful Soup Python package to extract text from either HTML or XML;
strip_tags: Uses regex to strip tags (HTML or XML).
html2text
This method is the default (does not need to be specified) and converts HTML into Markdown using the html2text Python package.
Warning
As this filter relies on the external html2text Python package, new releases of this package may generate text that is formatted slightly
differently, and, if so, will cause webchanges to send a one-off change report.
It is the recommended option to convert all types of HTML into readable text, as it can be displayed (after conversion) in HTML.
Example configuration:
url: https://example.com/html2text.html
filters:
- xpath: '//section[@role="main"]'
- html2text:
    pad_tables: true
Note
If the content has tables, adding the sub-directive pad_tables: true may improve readability.
Optional sub-directives
See the optional sub-directives in the html2text Python package’s documentation. The following options are set by webchanges but can be overridden:
unicode_snob: true to ensure that accented characters are kept as they are;
body_width: 0 to ensure that lines aren’t chopped up;
ignore_images: true to ignore images (since we’re dealing with text);
single_line_break: true to ensure that additional empty lines aren’t added between sections;
wrap_links: false to ensure that links are not wrapped (in case body_width is not set to 0), as wrapping is not Markdown compatible.
bs4
This filter method extracts visible text from HTML using the Beautiful Soup Python package, specifically its get_text(strip=True) method.
url: https://example.com/html2text_bs4.html
filters:
- xpath: '//section[@role="main"]'
- html2text:
    method: bs4
    strip: true
Parsers
Beautiful Soup supports multiple parsers as documented here. We default to the use of the
lxml parser as recommended, but you can specify the parser by using the parser sub-directive:
url: https://example.com/html2text_bs4_html5lib.html
filters:
- xpath: '//section[@role="main"]'
- html2text:
    method: bs4
    parser: html5lib
    strip: true
Extracting text from XML
This filter can be used to extract text from XML by using the xml parser as follows:
url: https://example.com/html2text_bs4_xml
filters:
- html2text:
    method: bs4
    parser: xml
Optional sub-directives
parser: The name of the parser library you want to use as per documentation (default: lxml).
separator: Strings extracted from the HTML or XML object will be concatenated using this separator (defaults to the empty string).
strip (true/false): If true, strings will be stripped before being concatenated (defaults to false).
Required packages
To run jobs with this filter method, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[bs4]
If (and only if) you specify parser: html5lib, then you also need to first install additional Python
packages as follows:
uv pip install --upgrade webchanges[bs4,html5lib]
Changed in version 3.0: Filter defaults to the use of Python html2text package.
Changed in version 3.0: Method re renamed to strip_tags.
Deprecated since version urlwatch: Removed method lynx (external OS-specific dependency).
ical2text
This filter reads an iCalendar document and converts it to easy-to-read text.
name: "Make iCal file readable"
url: https://example.com/cal.ics
filters:
- ical2text:
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[ical2text]
jq
Linux/macOS ASCII only
The jq filter uses the Python bindings for jq, a lightweight ASCII JSON
processor. It is currently available only for Linux (most flavors) and macOS (no Windows) and does not handle Unicode;
see below for a cross-platform and Unicode-friendly way of selecting JSON.
Please also note that command line arguments of the standalone jq program are NOT supported by this library.
url: https://example.net/jq-ascii.json
filters:
- jq: '.[].title'
Supports aggregations, selections, and the built-in operators like length.
For more information on the operations permitted, see the jq Manual.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[jq]
Filtering JSON on Windows or containing Unicode and without jq
Python programmers on all OSs can use an advanced technique to select only certain elements of the JSON object; see Selecting items from a JSON dictionary. This method will preserve Unicode characters.
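As a rough idea of what such Python code involves, the sketch below performs the equivalent of the jq expression .[].title using only the standard library. It shows the selection logic alone; how to hook it into webchanges is covered in the section referenced above, and the sample data and function name are made up.

```python
import json

raw = '[{"title": "Café menu", "id": 1}, {"title": "日本語", "id": 2}]'


def select_titles(data):
    """Return the title of every item in a JSON array, one per line."""
    items = json.loads(data)
    return "\n".join(item["title"] for item in items)


# Unicode characters pass through unchanged.
print(select_titles(raw))
```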
jsontoyaml
This filter serializes the JSON data to YAML using Python’s PyYAML library with a default indent level of 2.
If the job directive monospace is unset, to improve the readability in HTML reports this filter will set it to
true. To override, add the directive monospace: false to the job (see here).
Optional sub-directives
indentation (integer or string): Either the number of spaces or a string used to indent each level; if 0, a negative number or "", no indentation is applied (default: 2, i.e. 2 spaces).
Added in version 3.30.
keep_lines_containing
This filter keeps only lines that contain the text specified (default) or match the Python regular
expression specified (with the re sub-directive), discarding the others.
Note that while this filter emulates Linux’s grep, it does not use the executable grep.
Examples:
name: "convert HTML to text, strip whitespace, and only keep lines that have the sequence ``a,b:`` in them"
url: https://example.com/keep_lines_containing.html
filters:
- html2text:
- keep_lines_containing: 'a,b:'
name: "keep only lines that contain 'error' irrespective of its case (e.g. Error, ERROR, error, etc.)"
url: https://example.com/keep_lines_containing_re.txt
filters:
- keep_lines_containing:
    re: '(?i)error'
Note: in regex (?i) is the inline flag for case-insensitive matching.
Optional sub-directives
text (default): Match the text provided.
re: Match the Python regular expression provided.
Changed in version 3.0: Renamed from grep to avoid confusion.
ocr
This filter extracts text from images using the Tesseract OCR engine. Any file format supported by the Pillow (PIL Fork) Python package is supported.
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/ocr-test.png
filters:
- ocr:
    timeout: 5
    language: eng
Optional sub-directives
timeout: Timeout for the recognition, in seconds (default: 10 seconds).
language: Text language (e.g. fra or eng+fra) (default: eng).
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[ocr]
In addition, you need to install Tesseract itself.
pdf2text
This filter converts a PDF file to plaintext using the pdftotext Python library, itself based on the Poppler library.
For most uses, we recommend using the filter pypdf, which achieves similar results without having to separately install OS-specific dependencies (Poppler).
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/pdf-test.pdf
filters:
- pdf2text
If the PDF file is password protected, you can specify its password:
url: https://example.net/pdf-test-password.pdf
filters:
- pdf2text:
    password: webchangessecret
By default, pdf2text tries to reproduce the layout of the original document by using spaces. Be aware that these
spaces may change when a document is updated, so you may get reports containing a lot of changes consisting of
nothing but changes in the spacing between the columns; in this case try turning it off with the sub-directive
physical: false.
url: https://example.net/pdf-test-no-physical-layout.pdf
filters:
- pdf2text:
    physical: false
monospace: true
Tip
If your reports are in HTML format and the PDF is columnar in nature, try using the job directive
monospace: true to improve readability (see here).
url: https://example.net/pdf-test-keep-monospace.pdf
filters:
- pdf2text:
monospace: true
Conversely, if you don’t care about the layout, you might want to strip all additional spaces that might be added by this filter:
url: https://example.net/pdf-no-multiple-spaces.pdf
filters:
- pdf2text:
- re.sub:
    pattern: ' +'
    repl: ' '
- strip:
    splitlines: true
Optional sub-directives
password: Password for a password-protected PDF file.
physical (true/false): If true, page text is output in the order it appears on the page, regardless of columns or other layout features (default: true). Only one of raw and physical can be set to true.
raw (true/false): If true, page text is output in the order it appears in the content stream (default: false). Only one of raw and physical can be set to true.
Changed in version 3.8.2: Added physical and raw sub-directives.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
uv pip install --upgrade webchanges[pdf2text]
In addition, you need to install any of the OS-specific dependencies of Poppler (see website).
pretty-xml
This filter deserializes an XML object and pretty-prints it. It uses Python’s xml.dom.minidom toprettyxml function.
name: "reformat XML using Python's xml.dom.minidom toprettyxml function"
url: https://example.com/pretty_xml.xml
filters:
- pretty-xml:
Added in version 3.3.
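The underlying standard-library call can be tried directly, as in the sketch below. The sample document and the indent value are assumptions made for illustration; the filter's own settings may differ.

```python
from xml.dom import minidom

raw = "<feed><item><title>First</title></item><item><title>Second</title></item></feed>"

# Parse the XML string and re-serialize it with one element per line.
pretty = minidom.parseString(raw).toprettyxml(indent="  ")
print(pretty)
```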
pypdf
This filter converts a PDF file to plaintext using the pypdf Python library.
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/pypdf-test.pdf
filters:
- pypdf
If the PDF file is password protected, you can specify its password:
url: https://example.net/pypdf-test-password.pdf
filters:
- pypdf:
    password: webchangessecret
The pypdf library locates all text drawing commands in the order they appear in the PDF’s content stream, and then
extracts the text. To extract text in a fixed width format that closely adheres to the rendered layout in the source
PDF (experimental), use the sub-directive extraction_mode: layout:
url: https://example.net/pypdf-test-layout.pdf
filters:
- pypdf:
    extraction_mode: layout
Tip
If your reports are in HTML format and the PDF is columnar in nature, try using the job directive
monospace: true to improve readability (see here).
url: https://example.net/pypdf-test-monospace.pdf
filters:
- pypdf:
    extraction_mode: layout
monospace: true
If the layout is not a concern, you may want to remove any additional spaces that the filter might have introduced.
url: https://example.net/pypdf-no-multiple-spaces.pdf
filters:
- pypdf:
- re.sub:
    pattern: ' +'
    repl: ' '
- strip:
    splitlines: true
Note
Users should be aware that updating the underlying pypdf library may trigger webchanges to generate a new report, even if the actual content of the PDFs has not changed. This is due to the potential formatting improvements introduced by pypdf updates.
Optional sub-directives
password: Password for a password-protected PDF file (dependency required; see below).
extraction_mode: Set to layout for experimental layout mode functionality.
Added in version 3.16.
Changed in version 3.27: Added extraction_mode sub-directive.
Required packages
To run jobs with this filter, you need to first install additional Python packages. If
you’re not using the password sub-directive, then use the following:
uv pip install --upgrade webchanges[pypdf]
To run jobs with the password sub-directive, use the following:
uv pip install --upgrade webchanges[pypdf_crypto]
re.findall
This filter extracts, deletes or replaces non-overlapping text using Python’s re.findall regular expression operation.
Just specifying a regular expression (regex) or string as the value will extract the match. Patterns can be replaced
with another string using pattern as the expression and repl as the replacement, or deleted by setting
repl to an empty string.
All features are described in Python’s re.findall’s documentation. The pattern is first iteratively matched using
re.finditer and the repl value is applied to each
non-overlapping match; if repl is missing, then group “0” (the entire match) is extracted.
Each match is outputted on its own line.
The following example applies the filter twice:
Just specifying a string as the value will include the full match in the output.
You can use groups (()) and back-reference them with \1 (etc.) to put groups into the replacement string.
By default, the full match will be included in the output.
url: https://example.com/regex-findall.html
filters:
- re.findall: '<span class="price">.*</span>'
- re.findall:
    pattern: 'Price: \$([0-9]+)'
    repl: '\1'
Tip
Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.
You can use the entire range of Python’s regular expression (regex) syntax, and you can ask your favorite Generative AI chatbot for help. Some examples:
To extract the first line:
url: https://example.com/regex-firstline.html
command: python -c "[print(f'line {n}') for n in range(1, 3)]"
filters:
- re.findall: '^.*'
To extract the last line, we use the inline MULTILINE
flag ((?m)) and look for a line ((^.*$)) that is not followed (negative lookahead assertion) by a newline
plus additional text ((?!\n.+)):
url: https://example.com/regex-lastline.html
command: python -c "[print(f'line {n}') for n in range(3)]"
filters:
- re.findall: '(?m)(^.*$)(?!\n.+)'
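Both patterns can be checked directly with Python's re module before adding them to a job. The sample text below is an assumption (three lines, no trailing newline) chosen to keep the illustration simple.

```python
import re

text = "line 0\nline 1\nline 2"  # sample input without a trailing newline

# Without MULTILINE, ^ only matches at the very start of the data.
first = re.findall(r"^.*", text)

# With MULTILINE, keep only the line not followed by a newline plus more text.
last = re.findall(r"(?m)(^.*$)(?!\n.+)", text)

print(first)
print(last)
```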
Optional sub-directives
pattern: Regular expression pattern or string for matching; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.findall (in which case a match will be extracted).
repl: The string applied iteratively to each match (default: ‘\g<0>’, i.e. extract all matches).
Added in version 3.20.
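The matching logic described in this section (iterate over the matches, apply repl to each, output one result per line) can be sketched as follows. The function name findall_filter and the sample HTML are hypothetical; re.finditer and Match.expand are the standard-library calls assumed to mirror that behavior.

```python
import re


def findall_filter(data, pattern, repl=r"\g<0>"):
    """Apply `repl` to every non-overlapping match and join results by line."""
    return "\n".join(match.expand(repl) for match in re.finditer(pattern, data))


html = '<span class="price">Price: $12</span> <span class="price">Price: $7</span>'

# Keep only the first group (the number) of each match.
print(findall_filter(html, r"Price: \$([0-9]+)", r"\1"))
```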
re.sub
This filter deletes or replaces text using Python’s re.sub regular expression operation.
Just specifying a regular expression (regex) or string as the value will remove the match. Patterns can be replaced
with another string by specifying repl as the replacement.
All features are described in Python’s re.sub’s documentation.
The pattern and repl values are passed to this function as-is; if repl is missing, it is treated as
an empty string, and this filter deletes the leftmost non-overlapping occurrences of pattern.
Tip
Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.
The following example applies the filter 3 times:
name: "Strip href and change a few tags"
url: https://example.com/re_sub.html
filters:
- re.sub: '\s*href="[^"]*"'
- re.sub:
    pattern: '<h1>'
    repl: 'HEADING 1: '
- re.sub:
    pattern: '</([^>]*)>'
    repl: '<END OF TAG \1>'
You can use the entire range of Python’s regular expression (regex) syntax: for example groups (()) in the pattern
and \1 (etc.) to refer to these groups in the repl as in the example below, which replaces the number of
milliseconds (which may vary each time you check this page and generate a change report) with an X (which therefore
never changes):
name: "Replace a changing number in a sentence with an X"
url: https://example.com/re_sub_group.html
filters:
- html2text:
- re.sub:
    pattern: '(Page generated in )([0-9.])*( milliseconds.)'
    repl: '\1X\3'
Optional sub-directives
pattern: Regular expression pattern or string to match for replacement; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.sub (in which case a match will be deleted).
repl: The string for replacement (default: empty string, i.e. deletes the string matched in pattern).
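Patterns and replacements can be tried out in Python before putting them in a job, since the values are handed to re.sub unchanged. The sketch below reuses two patterns from the examples in this section on made-up sample strings.

```python
import re

# Replace a changing number with a constant so it no longer triggers reports.
text = "Page generated in 12.5 milliseconds."
stable = re.sub(r"(Page generated in )([0-9.])*( milliseconds.)", r"\1X\3", text)
print(stable)

# With an empty replacement, the matched text is simply deleted.
link = '<a href="page.html">link</a>'
stripped = re.sub(r'\s*href="[^"]*"', "", link)
print(stripped)
```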
remove_repeated
This filter compares adjacent items (lines), and the second and succeeding copies of repeated items (lines) are
removed. Repeated items (lines) must be adjacent in order to be found. Works similarly to Unix’s uniq.
By default, it acts over adjacent lines. Three lines consisting of dog - dog - cat will be turned into
dog - cat, while dog - cat - dog will stay the same:
url: https://example.com/remove-repeated.txt
filters:
- remove_repeated
This behavior can be changed by using an optional separator string argument. Also, ignore_case will tell it to
ignore differences in case and of leading and/or trailing whitespace when comparing. For example, the below will turn
mixed-case items separated by a pipe (|) a|b|B |c into a|b|c:
url: https://example.net/remove-repeated-separator.txt
filters:
- remove_repeated:
    separator: '|'
    ignore_case: true
Prepend it with sort to capture globally unique lines, e.g. to turn dog - cat - dog to cat -
dog:
url: https://example.com/remove-repeated-sorted.txt
filters:
- sort
- remove_repeated
Finally, setting the adjacent sub-directive to false will cause all duplicates to be removed, even if not
adjacent. For example, the below will turn items separated by a pipe (|) a|b|a|c into a|b|c:
url: https://example.net/remove-repeated-non-adjacent.txt
filters:
- remove_repeated:
    separator: '|'
    adjacent: false
Optional sub-directives
separator (default): The string used to separate the items to be compared (default: \n, i.e. line-based); it can also be specified inline as the value of remove_repeated.
ignore_case: Ignore differences in case and of leading and/or trailing whitespace when comparing (true/false) (default: false).
adjacent: Remove only adjacent lines or items (true/false) (default: true).
Added in version 3.8.
Changed in version 3.13: Added adjacent sub-directive.
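The behaviors described in this section can be approximated in a few lines of Python. The function below is a hypothetical re-implementation written for illustration only; its name and internals are assumptions and do not come from webchanges.

```python
def remove_repeated(data, separator='\n', ignore_case=False, adjacent=True):
    """Illustrative sketch only; not the webchanges implementation."""

    def key(item):
        # With ignore_case, compare items without surrounding whitespace or case.
        return item.strip().casefold() if ignore_case else item

    kept = []
    seen = set()
    previous = object()  # sentinel that never equals a string
    for item in data.split(separator):
        k = key(item)
        if adjacent:
            # uniq-like: drop an item only if it repeats the one just before it.
            if k != previous:
                kept.append(item)
            previous = k
        elif k not in seen:
            # Non-adjacent mode: keep only the first occurrence anywhere.
            kept.append(item)
            seen.add(k)
    return separator.join(kept)


print(remove_repeated('dog\ndog\ncat'))                               # dog, cat
print(remove_repeated('a|b|B |c', separator='|', ignore_case=True))   # a|b|c
print(remove_repeated('a|b|a|c', separator='|', adjacent=False))      # a|b|c
```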
reverse
This filter reverses the order of items (lines) without sorting:
url: https://example.com/reverse-lines.txt
filters:
- reverse
This behavior can be changed by using an optional separator string argument (e.g. items separated by a pipe (|)
symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):
url: https://example.net/reverse-separator.txt
filters:
- reverse: '|'
Alternatively, the filter can be specified more verbosely as a dict. In this example "\n\n" is used to separate
paragraphs (items that are separated by an empty line):
url: https://example.org/reverse-paragraphs.txt
filters:
- reverse:
    separator: "\n\n"
Optional sub-directives
separator: The string used to separate items whose order is to be reversed (default: \n, i.e. line-based reversing); it can also be specified inline as the value of reverse.
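In Python terms, reversing by separator amounts to a split, a reversal and a join. The helper name below is hypothetical and the code is a sketch of the idea, not the filter's actual implementation.

```python
def reverse_filter(data, separator='\n'):
    # Split into items, reverse their order (no sorting), and join them again.
    return separator.join(reversed(data.split(separator)))


print(reverse_filter('1|4|2|3', separator='|'))  # 3|2|4|1

# Paragraph-based reversing: items are separated by an empty line.
print(reverse_filter('first\nparagraph\n\nsecond', separator='\n\n'))
```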
sha1sum
This filter calculates a SHA-1 hash of the contents. It is useful when you want to be notified that something has changed without any detail, and it avoids saving large snapshots of data.
name: "Calculate SHA-1 hash"
url: https://example.com/sha.html
filters:
- sha1sum:
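The idea can be sketched with Python's hashlib. The helper name and the choice of UTF-8 encoding before hashing are assumptions made for this example, not details confirmed by the text above.

```python
import hashlib


def sha1sum(data: str) -> str:
    # Assumption: the text is encoded as UTF-8 before hashing.
    return hashlib.sha1(data.encode('utf-8')).hexdigest()


digest = sha1sum('abc')
print(digest)                    # a9993e364706816aba3e25717850c26c9cd0d89d
print(len(digest))               # 40 hex characters, whatever the input size
print(sha1sum('abd') == digest)  # False: any change yields a different hash
```

Only the short digest needs to be stored and compared, which is why the report can say that something changed but not what.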
shellpipe
This filter works like execute, except that an intermediate shell process is spawned to run the command. This
is to allow for certain corner situations (e.g. relying on variables, glob patterns, and other special shell features in
the command) that the execute filter cannot handle.
Danger
The execution of a shell command opens up all sorts of security issues, and the use of this filter should be avoided in favor of the execute filter.
Example:
url: https://example.net/shellpipe.html
filters:
- shellpipe: echo TEST
Important
On Linux and macOS systems, due to security reasons the shellpipe filter will not run unless both
the jobs file and the directory it is located in are owned and writeable by only the user who is
running the job (and not by its group or by other users) or by the root user. To set this up:
cd ~/.config/webchanges # could be different
sudo chown $USER:$(id -g -n) . *.yaml
sudo chmod go-w . *.yaml
sudo may or may not be required. If making the change from a different account than the one you run webchanges from, replace
$USER:$(id -g -n) with the username:group of the account running webchanges.
Tip
If you are running on Windows and getting a UnicodeEncodeError, make sure that you are running Python in UTF-8
mode as per instructions here.
Optional sub-directives
command (default, str): The command to execute.
escape_characters (bool): When running in Windows, escape characters in command (e.g. % becomes %% and ! becomes ^!) (default: false).
Changed in version 3.34: Added escape_characters sub-directive.
sort
This filter performs line-based sorting, ignoring case (i.e. case folding as per Python's implementation).
If the source provides data in random order, you should sort it before the comparison in order to avoid diffing based only on changes in the sequence.
name: "Sorting lines test"
url: https://example.net/sorting.txt
filters:
- sort
The sort filter takes an optional separator parameter that defines the item separator (by default sorting is
line-based), for example to sort text paragraphs (text separated by an empty line):
url: https://example.org/paragraphs.txt
filters:
- sort:
    separator: "\n\n"
This can be combined with a true/false reverse option, which is useful for sorting and reversing with the same
separator (using % as separator, this would turn 3%2%4%1 into 4%3%2%1):
url: https://example.org/sort-reverse-percent.txt
filters:
- sort:
    separator: '%'
    reverse: true
Optional sub-directives
separator (default): The string used to separate items to be sorted (default: \n, i.e. line-based sorting).
reverse (true/false): Whether the sorting direction is reversed (default: false).
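A case-insensitive sort with an optional separator and direction can be sketched in Python with str.casefold as the sort key. The function name is hypothetical and the code only illustrates the described behavior.

```python
def sort_filter(data, separator='\n', reverse=False):
    # Split into items and sort them ignoring case (case folding).
    items = data.split(separator)
    return separator.join(sorted(items, key=str.casefold, reverse=reverse))


print(sort_filter('banana\nApple\ncherry'))                 # Apple, banana, cherry
print(sort_filter('3%2%4%1', separator='%', reverse=True))  # 4%3%2%1
```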
strip
This filter removes leading and trailing whitespace, or the characters in a specified set, from the data. Whitespace includes the characters space, tab, linefeed, return, formfeed, and vertical tab.
name: "Strip leading and trailing whitespace from the block of data"
url: https://example.com/strip.html
filters:
- strip:

name: "Strip trailing commas or periods from all lines"
url: https://example.com/strip_by_line.html
filters:
- strip:
    chars: ',.'
    side: right
    splitlines: true

name: "Strip beginning spaces, tabs, etc. from all lines"
url: https://example.com/strip_leading_spaces.txt
filters:
- strip:
    side: left
    splitlines: true

name: "Strip spaces, tabs etc. from both ends of all lines"
url: https://example.com/strip_each_line.html
filters:
- strip:
    splitlines: true
Optional sub-directives
chars (default): A string specifying the set of characters to be removed instead of the default whitespace.
side: For one-sided removal: either left (strip only leading whitespace or matching characters) or right (strip only trailing whitespace or matching characters).
splitlines (true/false): Apply the filter on each line of text (default: false, apply to the entire data as a block).
Changed in version 3.5: Added optional sub-directives chars, side and splitlines.
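How the three sub-directives interact can be sketched with Python's str.strip, str.lstrip and str.rstrip. The function below is a hypothetical stand-in written for illustration; it is not the filter's real code.

```python
def strip_filter(data, chars=None, side=None, splitlines=False):
    def strip_one(text):
        # chars=None means "strip whitespace", as with Python's str.strip().
        if side == 'left':
            return text.lstrip(chars)
        if side == 'right':
            return text.rstrip(chars)
        return text.strip(chars)

    if splitlines:
        # Apply the stripping to every line separately.
        return '\n'.join(strip_one(line) for line in data.splitlines())
    # Default: treat the whole data as one block.
    return strip_one(data)


print(repr(strip_filter('  a \n b  ')))                   # 'a \n b'
print(repr(strip_filter('  a \n b  ', splitlines=True)))  # 'a\nb'
print(repr(strip_filter('one,\ntwo.\nthree', chars=',.', side='right', splitlines=True)))
```

Note how, without splitlines, whitespace around the line break in the middle of the block is left untouched.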
xpath
The xpath filter extracts HTML or XML content based on an XPath version
1.0 expression. It works very similarly to the css filter, and its sub-directives are almost identical to those of
that filter.
See Microsoft’s XPath Examples page for additional information on XPath.
Warning
Make sure to use XPath 1.0 syntax and avoid using certain constructs available only in later versions, as in many cases they will simply be ignored without an error and cause unexpected results to be returned.
Example: to keep only the <marquee> elements that are direct children of the <body> element of the HTML document, stripping out everything else:
url: https://example.net/xpath.html
filters:
- xpath: /html/body/marquee
Tip
If you are looking at a website using Google Chrome, you can find the XPath of an HTML node in DevTools (Ctrl+Shift+I) by right clicking on the element and selecting ‘Copy -> Copy XPath’. You can learn more about Chrome DevTools here.
Using the xpath filter with XML
By default, the xpath filter is set up to handle HTML documents, but it also works on XML documents by declaring the
sub-directive method: xml.
For example, to parse an RSS feed for the titles and publication dates, use:
url: https://example.com/blog/xpath-index.rss
filters:
- xpath:
    method: xml
    path: //item/title/text()|//item/pubDate/text()
To match an element in an XML namespace, use a namespace prefix before the tag
name. Use a : to separate the namespace prefix and the tag name in an XPath expression.
url: https://example.net/feed/xpath-namespace.xml
filters:
- xpath:
    method: xml
    path: //item/media:keywords/text()
    namespaces:
      media: http://search.yahoo.com/mrss/
Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass the namespace entirely.
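The role of the namespaces mapping can be illustrated with Python's standard-library ElementTree. This is only a sketch: ElementTree supports just a subset of XPath (for instance, it has no text() step, so the element's .text attribute is read instead) and may behave differently from the XPath engine used by the filter. The sample feed is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment using the media namespace.
rss = (
    '<rss xmlns:media="http://search.yahoo.com/mrss/">'
    '<channel><item><title>Post</title>'
    '<media:keywords>a, b</media:keywords>'
    '</item></channel></rss>'
)

root = ET.fromstring(rss)

# The prefix used in the expression is resolved through this mapping,
# just as the namespaces sub-directive maps "media" to its URI.
namespaces = {'media': 'http://search.yahoo.com/mrss/'}
keywords = [el.text for el in root.findall('.//item/media:keywords', namespaces)]

print(keywords)  # ['a, b']
```

The prefix in the expression only has to match the key in the mapping; what identifies the element is the namespace URI.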
Using the xpath filter to exclude content
Elements selected by the exclude sub-directive are removed from the final result. For example, the following job
will not have any <a> tag in its results:
url: https://example.org/xpath-exclude.html
filters:
- xpath:
    path: //body
    exclude: //a
Limiting the returned items from an XPath expression
If you only want to return a subset of the items returned by an XPath expression, you can use two additional sub-directives:
skip: How many elements to skip from the beginning (default: 0).
maxitems: How many elements to return at most (default: no limit).
For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:
url: https://example.net/xpath-skip-maxitems.html
filters:
- xpath:
    path: //div[@class="cpu"]
    skip: 1
    maxitems: 2
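Conceptually, skip and maxitems act like a slice over the list of matched elements. The helper below is a hypothetical sketch of that idea, with made-up element strings standing in for the matches.

```python
def limit(matches, skip=0, maxitems=None):
    # Drop the first `skip` matches, then keep at most `maxitems` of the rest.
    selected = matches[skip:]
    if maxitems is not None:
        selected = selected[:maxitems]
    return selected


# Hypothetical matches for //div[@class="cpu"], in document order.
matches = ['cpu1', 'cpu2', 'cpu3', 'cpu4']

print(limit(matches, skip=1, maxitems=2))  # ['cpu2', 'cpu3']
print(limit(matches, maxitems=1))          # ['cpu1']
```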
Duplicated results
If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and
desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use
maxitems: 1 to only return the first item.
Sorting output to fix list reordering
In some cases, the ordering of items on a webpage might change regularly without the actual content changing. By default, this would show up in the diff output as an element being removed from one part of the page and inserted in another part of the page.
In cases where the order of items doesn’t matter, it’s possible to sort matched items lexicographically to avoid spurious reports when only the ordering of items changes on the page.
The sub-directive of the xpath filter for this is sort, which can be true or false (the default):
url: https://example.org/xpath-items-random-order.html
filters:
- xpath:
    path: //span[@class="item"]
    sort: true
Alternatively, you can chain the sort filter.
Optional sub-directives
path (default): The XPath expression.
method: Either html (default) or xml.
namespaces: Mapping of XML namespaces for matching (default: None).
exclude: XPath expression for elements to remove from the final result (default: None).
skip: Number of elements to skip from the beginning (default: 0).
maxitems: Maximum number of items to return (default: all).
sort (true/false): Sort elements lexicographically (default: false).