Filters
Filters can be applied at either of two stages of processing:

Applied to the downloaded data before storing it and diffing for changes (filter).
Applied to the diff result before reporting the changes (diff_filter).
While creating your job pipeline, you might want to preview what the filtered output looks like. For filters applied to the data, you can run webchanges with the --test command-line option, passing in the index (from --list) or the URL/command of the job to be tested:
webchanges --test 1 # Test the first job in the jobs list and show the data collected after it's filtered
webchanges --test https://example.net/ # Test the job that matches the given URL
webchanges --test -1 # Test the last job in the jobs list
This command will show the output that will be captured, stored, and used for the comparison with the old version saved from a previous run of the same URL or command.
Once webchanges has collected at least 2 historic snapshots of a job (e.g. two different states of a webpage), you can start testing the effects of your diff_filter with the command-line option --test-differ, passing in the index (from --list) or the URL/command of the job to be tested:
webchanges --test-differ 1 # Test the first job in the jobs list and show the report
webchanges --test-differ -2 # Test the second-to-last job in the jobs list and show the report
At the moment, the following filters are available:
To select HTML (or XML) elements:
css: Filter XML/HTML using CSS selectors.
xpath: Filter XML/HTML using XPath expressions.
element-by-class: Get all HTML elements matching a class.
element-by-id: Get all HTML elements matching an id.
element-by-style: Get all HTML elements matching a style.
element-by-tag: Get all HTML elements matching a tag.
To extract text from HTML (or XML):
html2text: Convert HTML to plaintext.
To make HTML more readable:
beautify: Beautify HTML.
To fix HTML links from relative to absolute:
absolute_links: Fix relative HTML links.
To make CSV files more readable:
csv2text: Convert CSV to plaintext.
To extract text from PDFs:
pdf2text: Convert PDF to plaintext (using the Poppler library).
pypdf: Convert PDF to plaintext (using the pypdf library).
To save images:
ascii85: Encode binary data (e.g. images) to text using Ascii85.
To extract text from images:
ocr: Extract text from images.
To extract ASCII text from JSON:
jq: Filter ASCII JSON.
To make JSON more readable:
format-json: Reformat (pretty-print) JSON.
To make XML more readable:
format-xml: Reformat (pretty-print) XML (using lxml.etree).
pretty-xml: Reformat (pretty-print) XML (using Python’s xml.minidom).
To make iCal more readable:
ical2text: Convert iCalendar to plaintext.
To make binary readable:
hexdump: Display data in hex dump format.
To just detect if anything changed:
sha1sum: Calculate the SHA-1 checksum of the data.
To filter and/or edit text:
keep_lines_containing: Keep only lines containing specified text or matching a Python regular expression.
delete_lines_containing: Delete lines containing specified text or matching a Python regular expression.
re.sub: Replace or remove text matching a Python regular expression.
re.findall: Extract, replace or remove all non-overlapping text matching a Python regular expression.
strip: Strip leading and/or trailing whitespace or specified characters.
sort: Sort lines.
remove_repeated: Remove repeated items (lines).
reverse: Reverse the order of items (lines).
To run any custom script or program:
execute: Filter the data using any executable (e.g. a script).
shellpipe: Filter the data using a shell command.
Advanced Python programmers can write their own custom filters; see Hook your own Python code.
absolute_links
Convert relative URLs in all action, href and src attributes of any HTML tag, as well as in the data attribute of the <object> tag, to absolute ones.
Note
This filter is not needed (and could interfere) if you are already using the beautify filter (which has an absolute_links sub-directive that defaults to true) or the html2text filter (which already converts relative links).
url: https://example.net/absolute_links.html
filter:
- absolute_links
Added in version 3.16.
Changed in version 3.21: Converts URLs of all action, href and src attributes found in any tag, as well as the data attribute of the <object> tag.
ascii85
Encodes binary data (e.g. image data) to text using Ascii85. Ascii85 is more space-efficient than Base64, encoding more bytes into fewer characters. This filter can be useful to monitor images in combination with the image differ.
url: https://example.net/favicon_85.ico
filter:
- ascii85
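The space advantage over Base64 can be checked with Python's standard library (a sketch; within webchanges the filter handles the job's binary data automatically):

```python
import base64

data = bytes(range(256))  # stand-in for binary data such as an image
a85 = base64.a85encode(data)
b64 = base64.b64encode(data)
# Ascii85 packs 4 bytes into 5 characters; Base64 packs 3 bytes into 4,
# so the Ascii85 encoding is shorter for the same input.
print(len(a85), len(b64))  # 320 344
```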
Added in version 3.21.
beautify
This filter uses the Beautiful Soup, jsbeautifier and cssbeautifier Python packages to reformat the HTML in a document to make it more readable (keeping it as HTML).
url: https://example.net/beautify.html
filter:
- beautify:
Optional sub-directives

absolute_links (true/false): Convert relative links to absolute ones (default: true).
indent (integer or string): If indent is a non-negative integer or string, then the contents of HTML elements will be indented appropriately when pretty-printing them. An indent level of 0, a negative number, or "" will only insert newlines. A positive integer indents that many spaces per level. If indent is a string (such as "\t"), that string is used to indent each level (default: 1, i.e. indent one space per level).
url: https://example.net/beautify_absolute_links_false.html
filter:
  - beautify:
      absolute_links: false
      indent: 1
Changed in version 3.16: Relative links are converted to absolute ones; use the absolute_links: false sub-directive to disable.
Added in version 3.16: absolute_links sub-directive.
Added in version 3.9.2: indent sub-directive.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[beautify]
css and xpath
The css filter extracts HTML or XML content based on a CSS selector. It uses the cssselect Python package, which has limitations and extensions as explained in its documentation.
The xpath filter extracts HTML or XML content based on an XPath version 1.0 expression.
Examples of extracting only selected elements of the document, stripping out everything else:
url: https://example.net/css.html
filter:
- css: ul#groceries > li.unchecked
url: https://example.net/xpath.html
filter:
- xpath: /html/body/marquee
Tip
If you are looking at a website using Google Chrome, you can find the XPath of an HTML node in DevTools (Ctrl+Shift+I) by right-clicking on the element and selecting 'Copy -> Copy XPath', or its CSS selector by selecting 'Copy -> Copy selector'. You can learn more about Chrome DevTools here.
See Microsoft’s XPath Examples page for additional information on XPath.
Using CSS and XPath filters with XML
By default, CSS and XPath filters are set up for HTML documents, but they also work on XML documents by declaring the sub-directive method: xml.
For example, to parse an RSS feed and filter only the titles and publication dates, use:
url: https://example.com/blog/css-index.rss
filter:
  - css:
      method: xml
      selector: 'item > title, item > pubDate'
  - html2text: strip_tags
url: https://example.com/blog/xpath-index.rss
filter:
  - xpath:
      method: xml
      path: '//item/title/text()|//item/pubDate/text()'
To match an element in an XML namespace, use a namespace prefix before the tag name. Use a | to separate the namespace prefix and the tag name in a CSS selector, and use a : in an XPath expression.
url: https://example.org/feed/css-namespace.xml
filter:
  - css:
      method: xml
      selector: 'item > media|keywords'
      namespaces:
        media: http://search.yahoo.com/mrss/
  - html2text:
url: https://example.net/feed/xpath-namespace.xml
filter:
  - xpath:
      method: xml
      path: '//item/media:keywords/text()'
      namespaces:
        media: http://search.yahoo.com/mrss/
Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass the namespace entirely.
Using CSS and XPath filters to exclude content
Elements selected by the exclude sub-directive are removed from the final result. For example, the following job will not have any <a> tag in its results:
url: https://example.org/css-exclude.html
filter:
  - css:
      selector: 'body'
      exclude: 'a'
Limiting the returned items from a CSS Selector or XPath
If you only want to return a subset of the items returned by a CSS selector or XPath filter, you can use two additional sub-directives:
skip: How many elements to skip from the beginning (default: 0).
maxitems: How many elements to return at most (default: no limit).
For example, if the page has multiple elements, but you only want to select the second and third matching element (skip the first, and return at most two elements), you can use this filter:
url: https://example.net/css-skip-maxitems.html
filter:
  - css:
      selector: div.cpu
      skip: 1
      maxitems: 2
Duplicated results
If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and a desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use maxitems: 1 to only return the first item.
Fixing list reorderings with CSS Selector or XPath filters
In some cases, the ordering of items on a webpage might change regularly without the actual content changing. By default, this would show up in the diff output as an element being removed from one part of the page and inserted in another part of the page.
In cases where the order of items doesn’t matter, it’s possible to sort matched items lexicographically to avoid spurious reports when only the ordering of items changes on the page.
The sub-directive for the css and xpath filters is sort, which can be true or false (the default):
url: https://example.org/items-random-order.html
filter:
  - css:
      selector: span.item
      sort: true
Optional directives

selector (for css) or path (for xpath): Can also be entered as the value of the css or xpath directive.
method: Either of html (default) or xml.
namespaces: Mapping of XML namespaces for matching.
exclude: Elements to remove from the final result.
skip: Number of elements to skip from the beginning (default: 0).
maxitems: Maximum number of items to return (default: all).
sort: Sort elements lexicographically (boolean) (default: false).
csv2text
The csv2text filter turns tabular data formatted as comma-separated values (CSV) into a prettier textual representation. This is done by supplying a Python format string into which the CSV data is substituted. If the CSV has a header, the format string should use the header names (lowercased).
For example, given the following CSV data:
Name,Company
Smith,Apple
Doe,Google
we can make it more readable by using:
url: https://example.org/data.csv
filter:
  - csv2text:
      format_message: Mr. or Ms. {name} works at {company}. # note the lowercase in the replacement fields
      has_header: true
to produce:
Mr. or Ms. Smith works at Apple.
Mr. or Ms. Doe works at Google.
If there is no header row, or ignore_header is set to true, you will need to use the numeric array notation: Mr. or Mrs. {0} works at {1}.
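The substitution follows Python's str.format semantics; a quick sketch of both notations (the dict below is illustrative, not the filter's internal data structure):

```python
# With a header, replacement fields are the lowercased column names.
row = {"name": "Smith", "company": "Apple"}
print("Mr. or Ms. {name} works at {company}.".format(**row))
# → Mr. or Ms. Smith works at Apple.

# Without a header (or with ignore_header: true), fields are positional.
print("Mr. or Mrs. {0} works at {1}.".format("Smith", "Apple"))
# → Mr. or Mrs. Smith works at Apple.
```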
Optional sub-directives
format_message
(default): The Python format string containing “replacement fields” into which the data from the csv is substituted. Field names are the column headers (in lowercase) if the data has column headers or numeric starting from 0 if the data has no column headers orignore_header
is set to true.has_header
(true/false): Specifies whether the first row is a series of column headers (default: use the rough heuristics provided by Python’s csv.Sniffer.has_header method.ignore_header
(true/false): If set to true, it will parse the format_message as having numeric replacement fields even if the data has column headers (orhas_header
, immediately above, is set to true).
delete_lines_containing
This filter is the inverse of keep_lines_containing below and discards all lines that contain the text specified (default) or match the Python regular expression, keeping the others.
Examples:
name: "eliminate lines that contain 'xyz'"
url: https://example.com/delete_lines_containing.txt
filter:
- delete_lines_containing: 'xyz'
name: "eliminate lines that start with 'warning' irrespective of its case (e.g. WARNING, Warning, warning, etc.)"
url: https://example.com/delete_lines_containing_re.txt
filter:
  - delete_lines_containing:
      re: '(?i)^warning'
Notes: in regex, (?i) is the inline flag for case-insensitive matching and ^ (caret) matches the start of the string.
Optional sub-directives

text (default): Match the text provided.
re: Match the Python regular expression provided.
Changed in version 3.0: Renamed from grepi to avoid confusion.
element-by-[class|id|style|tag]
The filters element-by-class, element-by-id, element-by-style, and element-by-tag allow you to select all matching instances of a given HTML element.
Examples:
To extract only the <body> of a page:
url: https://example.org/bodytag.html
filter:
- element-by-tag: body
To extract <div id="something">...</div> from a page:
url: https://example.org/idtest.html
filter:
- element-by-id: something
Since you can chain filters, use this to extract an element within another element:
url: https://example.org/idtest_2.html
filter:
- element-by-id: outer_container
- element-by-id: something_inside
To make the output human-friendly you can chain html2text on the result:
url: https://example.net/id2text.html
filter:
- element-by-id: something
- html2text:
To extract <div style="something">...</div> from a page:
url: https://example.org/styletest.html
filter:
- element-by-style: something
execute
The data to be filtered is passed as the input to the command being run, and the command's output is used in webchanges's next step. All environment variables are preserved, and the following ones are added:
Environment variable | Description
---|---
WEBCHANGES_JOB_JSON | All job parameters in JSON format
WEBCHANGES_JOB_LOCATION | Value of either the url or command directive
WEBCHANGES_JOB_NAME | Name of the job
WEBCHANGES_JOB_INDEX_NUMBER | The job's index number
For example, we can execute a Python script:
name: Test execute filter
url: https://example.net/execute.html
filter:
  # For multiline YAML, quote the string and unindent its continuation. A space is added at the end
  # of each line. Pay attention to escaping!
  - execute: "python -c \"import os, sys;
    print(f\\\"The data is '{sys.stdin.read()}'\\nThe job location is
    '{os.getenv('WEBCHANGES_JOB_LOCATION')}'\\nThe job name is
    '{os.getenv('WEBCHANGES_JOB_NAME')}'\\nThe job number is
    '{os.getenv('WEBCHANGES_JOB_INDEX_NUMBER')}'\\nThe job JSON is
    '{os.getenv('WEBCHANGES_JOB_JSON')}'\\\", end='')\""
Or instead we can call a script we have saved, e.g. - execute: python3 myscript.py.
If the command generates an error, the output of the error will be in the first line, before the traceback.
Tip
If you are running on Windows and getting UnicodeEncodeError, make sure that you are running Python in UTF-8 mode as per the instructions here.
Changed in version 3.8: Added additional WEBCHANGES_JOB_* environment variables.
format-json
This filter serializes the JSON data to a pretty-printed indented string using Python’s json.dumps (or, if installed, the same function from the simplejson library) with a default indent level of 4.
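The result can be previewed directly with Python's json module (illustrative data):

```python
import json

data = {"b": 1, "a": [2, 3]}
# indent=4 matches the filter's default; sort_keys mirrors the optional
# sort_keys sub-directive.
print(json.dumps(data, indent=4, sort_keys=True))
```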
If the job directive monospace is unset, this filter will set it to true to improve readability in HTML reports. To override, add the directive monospace: false to the job (see here).
Optional sub-directives

indentation (integer or string): Either the number of spaces or a string to be used to indent each level with; if 0, a negative number, or "" then no indentation (default: 4, i.e. 4 spaces).
sort_keys (true/false): Whether to sort the output of dictionaries by key (default: false).
Added in version 3.0.1: sort_keys sub-directive.
Changed in version 3.20: The filter sets the job's monospace directive to true.
format-xml
This filter deserializes an XML object and reformats it using the lxml Python package's etree.tostring function with pretty_print.
name: "reformat XML using lxml's etree.tostring"
url: https://example.com/format_xml.xml
filter:
- format-xml:
Added in version 3.0.
hexdump
This filter displays the contents in both hexadecimal and ASCII using the hex dump format.
name: Display binary and ASCII test
command: cat testfile
filter:
- hexdump:
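As a rough illustration of the format (a sketch, not the filter's exact output), a hex dump pairs each group of bytes with its hex values and a printable-ASCII column:

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Approximate a hex dump: offset, hex bytes, and an ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as '.' in the ASCII column.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)

print(hexdump(b"webchanges\x00\x01"))
```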
html2text
This filter converts HTML (or XML) to Unicode text.
Optional sub-directives

method: One of:
html2text (default): Uses the html2text Python package and retains some simple formatting from HTML, outputting Markdown with absolute links;
bs4: Uses the Beautiful Soup Python package to extract text from either HTML or XML;
strip_tags: Uses regex to strip tags (HTML or XML).
html2text
This method is the default (does not need to be specified) and converts HTML into Markdown using the html2text Python package.
Warning
As this filter relies on the external html2text Python package, new releases of this package may generate text that is formatted slightly differently and, if so, will cause webchanges to send a one-off change report.
It is the recommended option to convert all types of HTML into readable text, as its output can be displayed (after conversion) in HTML reports.
Example configuration:
url: https://example.com/html2text.html
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      pad_tables: true
Note
If the content has tables, adding the sub-directive pad_tables: true may improve readability.
Optional sub-directives
See the optional sub-directives in the html2text Python package's documentation. The following options are set by webchanges but can be overridden:

unicode_snob: true to ensure that accented characters are kept as they are;
body_width: 0 to ensure that lines aren't chopped up;
ignore_images: true to ignore images (since we're dealing with text);
single_line_break: true to ensure that additional empty lines aren't added between sections;
wrap_links: false to ensure that links are not wrapped (in case body_width is not set to 0) as it's not Markdown-compatible.
bs4
This filter method extracts visible text from HTML using the Beautiful Soup Python package, specifically its get_text(strip=True) method.
url: https://example.com/html2text_bs4.html
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: bs4
      strip: true
Parsers
Beautiful Soup supports multiple parsers as documented here. We default to the use of the lxml parser as recommended, but you can specify the parser by using the parser sub-directive:
url: https://example.com/html2text_bs4_html5lib.html
filter:
  - xpath: '//section[@role="main"]'
  - html2text:
      method: bs4
      parser: html5lib
      strip: true
Extracting text from XML
This filter can be used to extract text from XML by using the xml parser as follows:
url: https://example.com/html2text_bs4_xml
filter:
  - html2text:
      method: bs4
      parser: xml
Optional sub-directives
parser
: the name of the parser library you want to use as per documentation (default:lxml
).separator
: Strings extracted from the HTML or XML object will be concatenated using this separator (defaults to the empty string``
).strip
(true/false): If true, strings will be stripped before being concatenated (defaults to false).
Required packages
To run jobs with this filter method, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[bs4]
If (and only if) you specify parser: html5lib, then you also need to first install additional Python packages as follows:
pip install --upgrade webchanges[bs4,html5lib]
Changed in version 3.0: Filter defaults to the use of the Python html2text package.
Changed in version 3.0: Method re renamed to strip_tags.
Deprecated since version urlwatch: Removed method lynx (external OS-specific dependency).
ical2text
This filter reads an iCalendar document and converts it to easy-to-read text.
name: "Make iCal file readable"
url: https://example.com/cal.ics
filter:
- ical2text:
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[ical2text]
jq
Linux/macOS ASCII only
The jq filter uses the Python bindings for jq, a lightweight ASCII JSON processor. It is currently available only for Linux (most flavors) and macOS (not Windows) and does not handle Unicode; see below for a cross-platform and Unicode-friendly way of selecting JSON.
url: https://example.net/jq-ascii.json
filter:
- jq: '.[].title'
It supports aggregations, selections, and built-in operators like length.
For more information on the operations permitted, see the jq Manual.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[jq]
Filtering JSON on Windows or containing Unicode and without jq
Python programmers on all OSs can use an advanced technique to select only certain elements of the JSON object; see Selecting items from a JSON dictionary. This method will preserve Unicode characters.
keep_lines_containing
This filter keeps only lines that contain the text specified (default) or match the Python regular expression specified, discarding the others. Note that while this filter emulates Linux's grep, it does not use the grep executable.
Examples:
name: "convert HTML to text, strip whitespace, and only keep lines that have the sequence 'a,b:' in them"
url: https://example.com/keep_lines_containing.html
filter:
- html2text:
- keep_lines_containing: 'a,b:'
name: "keep only lines that contain 'error' irrespective of its case (e.g. Error, ERROR, error, etc.)"
url: https://example.com/keep_lines_containing_re.txt
filter:
  - keep_lines_containing:
      re: '(?i)error'
Note: in regex, (?i) is the inline flag for case-insensitive matching.
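The line selection is equivalent to filtering with Python's re.search (a sketch, with illustrative data):

```python
import re

lines = ["build ok", "ERROR: disk full", "Error in module x", "done"]
# Keep only lines matching the case-insensitive pattern.
kept = [line for line in lines if re.search(r"(?i)error", line)]
print(kept)  # ['ERROR: disk full', 'Error in module x']
```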
Optional sub-directives

text (default): Match the text provided.
re: Match the Python regular expression provided.
Changed in version 3.0: Renamed from grep to avoid confusion.
ocr
This filter extracts text from images using the Tesseract OCR engine. Any file format supported by the Pillow (PIL Fork) Python package is supported.
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/ocr-test.png
filter:
  - ocr:
      timeout: 5
      language: eng
Optional sub-directives

timeout: Timeout for the recognition, in seconds (default: 10 seconds).
language: Text language (e.g. fra or eng+fra) (default: eng).
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[ocr]
In addition, you need to install Tesseract itself.
pdf2text
This filter converts a PDF file to plaintext using the pdftotext Python library, itself based on the Poppler library.
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/pdf-test.pdf
filter:
- pdf2text
If the PDF file is password protected, you can specify its password:
url: https://example.net/pdf-test-password.pdf
filter:
  - pdf2text:
      password: webchangessecret
By default, pdf2text tries to reproduce the layout of the original document by using spaces. Be aware that these spaces may change when a document is updated, so you may get reports containing a lot of changes consisting of nothing but changes in the spacing between the columns; in this case, try turning it off with the sub-directive physical: false.
Tip
If your reports are in HTML format and the PDF is columnar in nature, try using the job directive monospace: true to improve readability (see here).
url: https://example.net/pdf-test-keep-physical-layout.pdf
filter:
  - pdf2text:
      physical: true
monospace: true
Conversely, if you don't care about the layout, you might want to strip all additional spaces that might be added by this filter:
url: https://example.net/pdf-no-multiple-spaces.pdf
filter:
  - pdf2text:
  - re.sub:
      pattern: ' +'
      repl: ' '
  - strip:
      splitlines: true
Optional sub-directives

password: Password for a password-protected PDF file.
physical (true/false): If true, page text is output in the order it appears on the page, regardless of columns or other layout features (default: true). Only one of raw and physical can be set to true.
raw (true/false): If true, page text is output in the order it appears in the content stream (default: false). Only one of raw and physical can be set to true.
Added in version 3.8.2: physical and raw sub-directives.
Required packages
To run jobs with this filter, you need to first install additional Python packages as follows:
pip install --upgrade webchanges[pdf2text]
In addition, you need to install any of the OS-specific dependencies of Poppler (see website).
pretty-xml
This filter deserializes an XML object and pretty-prints it. It uses Python’s xml.dom.minidom toprettyxml function.
name: "reformat XML using Python's xml.dom.minidom toprettyxml function"
url: https://example.com/pretty_xml.xml
filter:
- pretty-xml:
Added in version 3.3.
pypdf
This filter converts a PDF file to plaintext using the pypdf Python library.
This filter must be the first filter in a chain of filters, since it consumes binary data.
url: https://example.net/pypdf-test.pdf
filter:
- pypdf
If the PDF file is password protected, you can specify its password:
url: https://example.net/pypdf-test-password.pdf
filter:
  - pypdf:
      password: webchangessecret
pypdf locates all text drawing commands, in the order they are provided in the content stream of the PDF, and extracts the text.
Tip
If your reports are in HTML format and the PDF is columnar in nature, try using the job directive monospace: true to improve readability (see here).
url: https://example.net/pypdf-test-keep-physical-layout.pdf
filter:
  - pypdf:
monospace: true
Conversely, if you don't care about the layout, you might want to strip all additional spaces that might be added by this filter:
url: https://example.net/pypdf-no-multiple-spaces.pdf
filter:
  - pypdf:
  - re.sub:
      pattern: ' +'
      repl: ' '
  - strip:
      splitlines: true
Optional sub-directives

password: Password for a password-protected PDF file (dependency required; see below).
Added in version 3.16.
Required packages
To run jobs with this filter, you need to first install additional Python packages. If you're not using the password sub-directive, use the following:
pip install --upgrade webchanges[pypdf]
To run jobs with the password sub-directive, use the following:
pip install --upgrade webchanges[pypdf_crypto]
re.findall
This filter extracts, deletes or replaces non-overlapping text using Python's re.findall regular expression operation.
Just specifying a regular expression (regex) or string as the value will extract the match. Patterns can be replaced with another string using pattern as the expression and repl as the replacement, or deleted by setting repl to an empty string.
All features are described in Python's re.findall documentation. The pattern is first iteratively matched using re.finditer and the repl value is applied to each non-overlapping match; if repl is missing, then group "0" (the entire match) is extracted. Each match is output on its own line.
The following example applies the filter twice:

Just specifying a string as the value will include the full match in the output.
You can use groups (()) and back-reference them with \1 (etc.) to put groups into the replacement string.
url: https://example.com/regex-findall.html
filter:
  - re.findall: '<span class="price">.*</span>'
  - re.findall:
      pattern: 'Price: \$([0-9]+)'
      repl: '\1'
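The second pass above behaves like Python's re.findall with a capturing group, with the filter joining the matches on separate lines (illustrative data):

```python
import re

text = '<span class="price">Price: $10</span> <span class="price">Price: $25</span>'
# With a capturing group, re.findall returns the captured substrings.
matches = re.findall(r"Price: \$([0-9]+)", text)
print("\n".join(matches))  # 10 and 25, each on its own line
```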
Tip
Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.
You can use the entire range of Python’s regular expression (regex) syntax.
Optional sub-directives

pattern: Regular expression pattern or string for matching; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.findall (in which case the match will be extracted).
repl: The string applied iteratively to each match (default: '\g<0>', i.e. extract the entire match).
Added in version 3.20.
re.sub
This filter deletes or replaces text using Python's re.sub regular expression operation.
Just specifying a regular expression (regex) or string as the value will remove the match. Patterns can be replaced with another string by specifying repl as the replacement.
All features are described in Python's re.sub documentation. The pattern and repl values are passed to this function as-is; if repl is missing, then it's considered to be an empty string, and this filter deletes the leftmost non-overlapping occurrences of pattern.
Tip
Remember that some useful Python regex flags, such as IGNORECASE, MULTILINE, DOTALL, and VERBOSE, can be specified as inline flags and therefore can be used with webchanges.
The following example applies the filter 3 times:
name: "Strip href and change a few tags"
url: https://example.com/re_sub.html
filter:
  - re.sub: '\s*href="[^"]*"'
  - re.sub:
      pattern: '<h1>'
      repl: 'HEADING 1: '
  - re.sub:
      pattern: '</([^>]*)>'
      repl: '<END OF TAG \1>'
You can use the entire range of Python's regular expression (regex) syntax: for example, groups (()) in the pattern and \1 (etc.) to refer to these groups in the repl, as in the example below, which replaces the number of milliseconds (which may vary each time you check this page and generate a change report) with an X (which therefore never changes):
name: "Replace a changing number in a sentence with an X"
url: https://example.com/re_sub_group.html
filter:
  - html2text:
  - re.sub:
      pattern: '(Page generated in )([0-9.]*)( milliseconds.)'
      repl: '\1X\3'
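The group back-referencing can be verified with Python's re.sub directly (illustrative input):

```python
import re

text = "Page generated in 12.345 milliseconds."
# Groups 1 and 3 are kept via back-references; group 2 (the number) is
# replaced with a literal X.
result = re.sub(r"(Page generated in )([0-9.]*)( milliseconds.)", r"\1X\3", text)
print(result)  # Page generated in X milliseconds.
```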
Optional sub-directives

pattern: Regular expression pattern or string to match for replacement; this sub-directive must be specified when using the repl sub-directive, otherwise the pattern can be specified as the value of re.sub (in which case the match will be deleted).
repl: The string for replacement (default: empty string, i.e. deletes the string matched in pattern).
remove_repeated
This filter compares adjacent items (lines) and removes the second and succeeding copies of repeated items (lines). Repeated items (lines) must be adjacent in order to be found. It works similarly to Unix's uniq.
By default, it acts over adjacent lines. Three lines consisting of dog - dog - cat will be turned into dog - cat, while dog - cat - dog will stay the same.
url: https://example.com/remove-repeated.txt
filter:
- remove_repeated
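A sketch of the equivalent adjacent-deduplication logic using Python's itertools.groupby, which groups consecutive equal items:

```python
from itertools import groupby

# Adjacent duplicates collapse to a single copy.
print([key for key, _ in groupby(["dog", "dog", "cat"])])  # ['dog', 'cat']
# Non-adjacent duplicates are untouched.
print([key for key, _ in groupby(["dog", "cat", "dog"])])  # ['dog', 'cat', 'dog']
```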
Prepend it with sort to capture globally unique lines, e.g. to turn dog - cat - dog into cat - dog:
url: https://example.com/remove-repeated-sorted.txt
filter:
- sort
- remove_repeated
This behavior can be changed by using an optional separator string argument. Also, ignore_case will tell it to ignore differences in case and in leading and/or trailing whitespace when comparing. For example, the below will turn mixed-case items separated by a pipe (|) a|b|B |c into a|b|c:
url: https://example.net/remove-repeated-separator.txt
filter:
  - remove_repeated:
      separator: '|'
      ignore_case: true
Finally, setting the adjacent sub-directive to false will cause all duplicates to be removed, even if not adjacent. For example, the below will turn items separated by a pipe (|) a|b|a|c into a|b|c:
url: https://example.net/remove-repeated-non-adjacent.txt
filter:
  - remove_repeated:
      separator: '|'
      adjacent: false
Optional sub-directives

separator: The string used to separate the items to be compared (default: \n, i.e. line-based); it can also be specified inline as the value of remove_repeated.
ignore_case: Ignore differences in case and in leading and/or trailing whitespace when comparing (true/false) (default: false).
adjacent: Remove only adjacent lines or items (true/false) (default: true).
Added in version 3.8.
Added in version 3.13: adjacent sub-directive.
reverse
This filter reverses the order of items (lines) without sorting:
url: https://example.com/reverse-lines.txt
filter:
- reverse
This behavior can be changed by using an optional separator string argument (e.g. items separated by a pipe (|) symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):
url: https://example.net/reverse-separator.txt
filter:
- reverse: '|'
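The separator-based reversal is equivalent to splitting on the separator, reversing, and rejoining:

```python
text = "1|4|2|3"
# Split on the separator, reverse the items, and join them back together.
print("|".join(reversed(text.split("|"))))  # 3|2|4|1
```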
Alternatively, the filter can be specified more verbosely with a dict. In this example "\n\n" is used to separate paragraphs (items that are separated by an empty line):
url: https://example.org/reverse-paragraphs.txt
filter:
- reverse:
separator: "\n\n"
Optional sub-directives
separator: The string used to separate the items whose order is to be reversed (default: \n, i.e. line-based reversing); it can also be specified inline as the value of reverse.
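The effect of reverse boils down to a split, reverse, and re-join, as in this Python sketch (a hypothetical illustration, not the library's code):

```python
def reverse_filter(data, separator="\n"):
    # Split on the separator, reverse the item order, and re-join.
    return separator.join(reversed(data.split(separator)))

print(reverse_filter("1|4|2|3", separator="|"))  # 3|2|4|1
```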
sha1sum
This filter calculates a SHA-1 hash of the contents. Useful when you want to be notified that something has changed without needing any detail of what, while avoiding storing large snapshots of the data.
name: "Calculate SHA-1 hash"
url: https://example.com/sha.html
filter:
  - sha1sum:
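Conceptually, the filter reduces the content to a fixed-length digest, roughly like this Python sketch (a hypothetical illustration using the standard library's hashlib, not webchanges' actual code):

```python
import hashlib

def sha1sum_filter(data):
    # Only the 40-character hex digest is stored and compared between runs,
    # so any change in the content shows up as a completely different hash.
    return hashlib.sha1(data.encode("utf-8")).hexdigest()

print(sha1sum_filter("<html>...</html>"))
```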
shellpipe
This filter works like execute, except that an intermediate shell process is spawned to run the command. This allows for certain corner situations that the execute filter cannot handle (e.g. relying on variables, glob patterns, and other special shell features in the command).
Danger
The execution of a shell command opens up all sorts of security issues, and the use of this filter should be avoided in favor of the execute filter.
Example:
url: https://example.net/shellpipe.html
filter:
  - shellpipe: echo TEST
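What shellpipe does is roughly equivalent to the following Python sketch (a hypothetical illustration, not the library's actual code): the filtered data is fed to the command on standard input and the command's standard output becomes the new data.

```python
import subprocess

def shellpipe_filter(data, command):
    # shell=True spawns an intermediate shell, so shell variables, globs and
    # pipes in `command` work -- which is also why this is a security risk.
    result = subprocess.run(command, shell=True, input=data,
                            capture_output=True, text=True, check=True)
    return result.stdout

print(shellpipe_filter("ignored by this command", "echo TEST"))
```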
Important
On Linux and macOS systems, for security reasons the shellpipe filter will not run unless both the jobs file and the directory it is located in are owned and writeable by only the user who is running the job (and not by its group or by other users). To set this up:

cd ~/.config/webchanges  # could be different
sudo chown $USER:$(id -g -n) . *.yaml
sudo chmod go-w . *.yaml
sudo may or may not be required. If making the change from a different account than the one you run webchanges from, replace $USER:$(id -g -n) with the username:group of the account running webchanges.
Tip
If you are running on Windows and getting a UnicodeEncodeError, make sure that you are running Python in UTF-8 mode as per the instructions here.
sort
This filter performs a line-based sort, ignoring case (i.e. case folding as per Python's implementation). If the source provides data in random order, you should sort it before the comparison in order to avoid reporting differences that are due only to changes in sequence.
name: "Sorting lines test"
url: https://example.net/sorting.txt
filter:
  - sort
The sort filter takes an optional separator parameter that defines the item separator (by default sorting is line-based), for example to sort text paragraphs (text separated by an empty line):

url: https://example.org/paragraphs.txt
filter:
  - sort:
      separator: "\n\n"
This can be combined with a boolean reverse option, which is useful for sorting and reversing with the same separator (using % as separator, this would turn 3%2%4%1 into 4%3%2%1):

url: https://example.org/sort-reverse-percent.txt
filter:
  - sort:
      separator: '%'
      reverse: true
Optional sub-directives
separator (default): The string used to separate the items to be sorted (default: \n, i.e. line-based sorting).
reverse: Whether the sorting direction is reversed (true/false) (default: false).
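The combined effect of separator and reverse can be sketched in Python (a hypothetical illustration, not the library's code); str.casefold mirrors the case-insensitive comparison described above:

```python
def sort_filter(data, separator="\n", reverse=False):
    # Case-insensitive sort via Python's case folding, then re-join.
    items = data.split(separator)
    return separator.join(sorted(items, key=str.casefold, reverse=reverse))

print(sort_filter("3%2%4%1", separator="%", reverse=True))  # 4%3%2%1
```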
strip
This filter removes leading and trailing whitespace, or characters belonging to a specified set, from the data. Whitespace includes the characters space, tab, linefeed, return, formfeed, and vertical tab.
name: "Strip leading and trailing whitespace from the block of data"
url: https://example.com/strip.html
filter:
  - strip:

name: "Strip trailing commas or periods from all lines"
url: https://example.com/strip_by_line.html
filter:
  - strip:
      chars: ',.'
      side: right
      splitlines: true

name: "Strip beginning spaces, tabs, etc. from all lines"
url: https://example.com/strip_leading_spaces.txt
filter:
  - strip:
      side: left
      splitlines: true

name: "Strip spaces, tabs etc. from both ends of all lines"
url: https://example.com/strip_each_line.html
filter:
  - strip:
      splitlines: true
Optional sub-directives
chars (default): A string specifying the set of characters to be removed instead of the default whitespace.
side: For one-sided removal: either left (strip only leading whitespace or matching characters) or right (strip only trailing whitespace or matching characters).
splitlines (true/false): Apply the filter to each line of text (default: false, i.e. apply to the entire data as a block).
Changed in version 3.5: Added optional sub-directives chars, side and splitlines.
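The interplay of the three sub-directives maps directly onto Python's own string-stripping methods, as in this hypothetical sketch (for illustration only, not webchanges' actual code):

```python
def strip_filter(data, chars=None, side=None, splitlines=False):
    # Pick str.strip / lstrip / rstrip according to the `side` sub-directive;
    # chars=None means Python's default whitespace set.
    fn = {None: str.strip, "left": str.lstrip, "right": str.rstrip}[side]
    if splitlines:
        # Apply per line instead of to the whole block.
        return "\n".join(fn(line, chars) for line in data.splitlines())
    return fn(data, chars)

print(strip_filter("line one.,\nline two,", chars=",.", side="right",
                   splitlines=True))
```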