Upgrading from urlwatch
Introduction
You can easily upgrade to webchanges from the current version of urlwatch using the same job and configuration files (see here) and benefit from many improvements, including:
AI-Powered Summaries: Summary of changes in plain text using generative AI, useful for long documents (e.g. legal);
Image Change Detection: Monitor changes to images and receive notifications with an image highlighting the differences;
Structured Data Monitoring: Track changes in JSON or XML data on an element-by-element basis;
Improved Documentation: We’ve revamped the documentation to make implementation easier;
Enhanced HTML Reports: HTML reports are now much clearer and include:
Clickable links!
Retention of most original formatting (bolding / headers, italics, underlining, lists with bullets (•), and indentation);
Added and deleted lines clearly highlighted with color and strikethrough;
Wrapping of long lines (instead of truncation);
Improved compatibility with a wider range of HTML email clients, including those that override stylesheets (e.g., Gmail);
General legibility improvements.
New Filtering Options: New filters, like additions_only, which allows you to focus on added content without the distraction of deletions;
New Command Line Arguments: New command-line arguments such as --errors, which helps you identify jobs that are no longer functioning correctly;
Increased Reliability and Stability: Testing coverage has increased by approximately 30 percentage points;
Additional Enhancements: Numerous other additions, refinements, and bug fixes have been implemented. For more information, see here.
Example enhancements to HTML reporting:
Installation
webchanges 3.36.1rc1 is derived from urlwatch and is mostly backward-compatible with the latest version of urlwatch’s job and configuration files.
Upgrading from a urlwatch setup is automatic (see more below), and gives you:
Vastly improved HTML email reporting, including:
Links that are clickable!
Formatting such as bolding / headers, italics, underlining, list bullets (•) and indentation is preserved
Use of color (compatible with Dark Mode) and strikethrough to highlight added and deleted lines
Correct wrapping of long lines
Correct rendering by email clients that override stylesheets (e.g. Gmail)
Better HTML-to-text translation with updated defaults for the html2text filter
Other legibility improvements
Improved telegram reporter that uses MarkdownV2 and preserves most formatting of HTML sites including clickable links, bolding, underlining, italics and strikethrough.
A more complete use of Playwright for browsing jobs to render JavaScript (called navigate in urlwatch), including:
Upgraded browser engine to the latest released version of Google Chrome
Higher stability by optimizing concurrency
More flexibility and control with new directives switches, wait_until, ignore_https_errors, wait_for_url, wait_for_function, wait_for_selector, wait_for_timeout, user_data_dir, initialization_url, initialization_js, block_elements, cookies, headers, referrer, http_proxy, https_proxy, and timeout, plus the implementation for this type of job of the ignore_connection_errors, ignore_timeout_errors, ignore_too_many_redirects and ignore_http_error_codes directives
Faster runs due to handling of ETags allowing servers to send a simple “HTTP 304 Not Modified” message when relevant
A new --no-headless command line argument to help with debugging
A new, more efficient indexed database that is smaller, allows for additional functionality such as rollbacks, and does not infinitely grow
Diffs (changes) that are no longer lost if webchanges is interrupted mid-execution or encounters an error with a reporter
The use of the webpage’s title as a job name if one isn’t provided
The ability to add a job note for the report
New filters such as additions_only, which makes it easier to track content that was added without the distractions of the content that was deleted
A new --errors command line argument to help catch any problems by listing all jobs that error out or have empty data after filters are applied
The support of Unicode throughout, including in filters and in the jobs and configuration YAML files
The fixing of the format-json filter from unexpectedly reordering contents of dictionaries, now controllable by the new sub-directive sort_keys
The ability to undo mistakes by rolling back the database using --rollback-database.
More reliable releases due to:
A 32 percentage point increase in code testing coverage (to ~74%)
Completely new continuous integration (CI) and continuous delivery (CD) pipeline (GitHub Actions with pre-commit)
Uses of flake8 and doc8 linters and pre-commit checks
Code security checks using bandit
Type-hinted code checked using mypy
Testing on both Linux (Ubuntu) and macOS, with Windows 10 x64 to come
A vast improvement in documentation and error text
And much more!
How-to
If you are using the latest version of urlwatch, simply install webchanges and run it. It will
find the existing urlwatch job and configuration files, and, unless you were still running lynx or have
custom code (see below), it should run just fine as is. It may complain about some directive name being changed for
clarity and other deprecations, but you will have time to make the edits if you decide
to stick around!
Tip
If you are running on Windows and getting UnicodeEncodeError, make sure that you are running Python in UTF-8
mode as per instructions here.
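A quick way to check whether the interpreter is in UTF-8 mode is sketched below. It assumes Python 3.7 or later, where sys.flags.utf8_mode is available; the function name utf8_mode_enabled is only illustrative and is not part of webchanges.

```python
import sys


def utf8_mode_enabled() -> bool:
    """Return True if this interpreter was started in UTF-8 mode."""
    # sys.flags.utf8_mode is 1 when Python runs with "-X utf8" or with the
    # PYTHONUTF8=1 environment variable set, and 0 otherwise.
    return bool(sys.flags.utf8_mode)


if __name__ == "__main__":
    if utf8_mode_enabled():
        print("UTF-8 mode is on")
    else:
        print("UTF-8 mode is off; try setting PYTHONUTF8=1 or running python -X utf8")
```

Run this with the same Python interpreter that runs webchanges, since the mode is set per interpreter start.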
However, if any of your jobs use a browser (i.e. have navigate or use_browser: true), you MUST install
Playwright:
Install the new dependencies:
uv pip install --upgrade webchanges[use_browser]
(Optional) ensure you have an up-to-date Google Chrome browser:
webchanges --install-chrome
If upgrading from urlwatch 2.27 or earlier, you can free up disk space if no other packages use Pyppeteer by, in order:
Removing the downloaded Chromium images by deleting the entire directory (and its subdirectories) shown by running:
python -c "import pathlib; from pyppeteer.chromium_downloader import DOWNLOADS_FOLDER; print(pathlib.Path(DOWNLOADS_FOLDER).parent)"
Uninstalling the Pyppeteer package by running:
pip uninstall pyppeteer
If you encounter any problems or have any suggestions please open an issue here and someone will look into it.
Note
If you are upgrading from an older version of urlwatch, before running webchanges make sure that you can run the latest version of urlwatch successfully, having implemented all urlwatch breaking changes in your job and configuration files.
For example, per urlwatch issue #600
url: https://example.com/
filters:
- html2text
no longer works in the latest version of urlwatch, and therefore in webchanges, as all filters must be specified as sub-directives like this:
url: https://example.com/
filters:
- html2text:
Upgrade details
Most everything, except the breaking changes below, should work out of the box when upgrading from a urlwatch setup for version 2.29. No changes to the files are made, so that you can switch back whenever you want.
⚠ Breaking Changes
Relative to urlwatch:
Must run on Python version 3.9 or higher.
By default a new, much improved database engine is used; run with the --database-engine minidb command line argument to preserve backwards-compatibility.
By default only 4 snapshots are kept with the new database engine, and older ones are purged after every run; run with the --max-snapshots 0 command line argument to keep the existing behavior (but beware of its infinite database growth!).
The html2text filter’s lynx method is no longer supported as it was obsoleted by Python packages; use the default method instead or, if you must, construct a custom command using the execute filter.
If you are using the shellpipe filter and are running in Windows, ensure that Python is set to UTF-8 mode to avoid getting UnicodeEncodeError.
If you’re using a hooks file (e.g. hooks.py), all imports from urlwatch need to be replaced with identical imports from webchanges.
If you are using the slack reporter you need to rename it webhook (unified reporter).
If you are using browser (navigate) jobs, see above for upgrading to Playwright.
Reporter shell imitates webchanges’ run_command and is not supported (use the run_command reporter instead).
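The hooks-file change listed above (replacing urlwatch imports with identical webchanges imports) can be sketched as a small text rewrite. This is a minimal illustration, not a tool shipped with webchanges: the function name rewrite_hooks_imports is hypothetical, and the pattern only handles import statements whose first module name is urlwatch, so review the result by hand.

```python
import re

# Matches "import urlwatch", "import urlwatch.x", "from urlwatch import ..."
# and "from urlwatch.x import ..." at the start of a (possibly indented) line.
_IMPORT_RE = re.compile(r"^([ \t]*(?:from|import)[ \t]+)urlwatch\b", flags=re.MULTILINE)


def rewrite_hooks_imports(source: str) -> str:
    """Return the hooks source with urlwatch imports pointing at webchanges."""
    # Only the module name at the start of the import is replaced; comments,
    # strings and other occurrences of "urlwatch" are left untouched.
    return _IMPORT_RE.sub(r"\1webchanges", source)


if __name__ == "__main__":
    old = "from urlwatch import filters\nimport urlwatch.reporters\n"
    print(rewrite_hooks_imports(old))
```

Working on a copy of hooks.py keeps the original usable with urlwatch if you decide to switch back.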
Additions and changes
Relative to the latest version of urlwatch:
Installation and command line
New --errors command line argument will let you know the jobs that result in an error or have empty responses after filters are applied.
--test command line argument is used to test a job (formerly --test-filter, deprecated and will be removed in a future release).
--test-differ command line argument is used to test a job’s differ (formerly --test-diff-filter, deprecated and will be removed in a future release) and display diff history.
--test-differ command line argument is no longer limited to displaying the last 10 snapshots.
Add job number(s) in command line to run a subset of jobs; for example, run webchanges 2 3 to only run jobs #2 and #3 of your jobs list (find job numbers by running webchanges --list). Negative job indices are allowed; for example, run webchanges -1 to only run the last job of your jobs list, or webchanges --test -2 to test the second to last job of your jobs list.
New --max-snapshots command line argument sets the number of snapshots to keep stored in the database; defaults to 4. If set to 0, an unlimited number of snapshots will be kept. Only works if the default sqlite3 database is being used.
New --database-engine ENGINE command line argument to specify database engine. New default sqlite3 creates a smaller database due to data compression with msgpack, higher speed due to indexing, and offers additional features and flexibility; migration from old ‘minidb’ database is done automatically and the old database preserved for manual deletion. Specify minidb to continue using the legacy database used by urlwatch.
New --rollback-database TIMESTAMP command line argument to roll back the snapshot database to a previous time, useful when you lose notifications. Does not work with database engine minidb or textfiles.
New --delete-snapshot command line argument to remove the latest saved snapshot of a job from the database; useful if a change in a website (e.g. layout) requires modifying filters, as the invalid snapshot can be deleted and webchanges rerun to create a truthful diff.
New --chromium-directory command line argument displays the directory where the downloaded Chromium executables are located to facilitate the deletion of older revisions.
New -V command line argument, as an alias to --version.
New --log-level command line argument to control the amount of logging displayed by the -v argument.
If a filename for --jobs, --config or --hooks is supplied without a path and the file is not present in the current directory, webchanges now looks for it in the default configuration directory.
If a filename for --jobs or --config is supplied without a ‘.yaml’ extension, or a filename for --hooks without a ‘.py’ extension, webchanges now also looks for one with such an extension appended to it.
In Windows, --edit defaults to using the built-in notepad.exe text editor if both the %EDITOR% and %VISUAL% environment variables are not set.
Run a subset of jobs by adding their index number(s) as command line arguments. For example, run webchanges 2 3 to only run jobs #2 and #3 of your jobs list. Run webchanges --list to find the job numbers. API is experimental and may change in the near future.
Installation of optional Python packages required by a feature or filter is now made easier with pip extras (e.g. pip install -U webchanges[ocr,pdf2text]).
html2text, markdown2 and msgpack Python packages are now installed by default, while keyring and minidb Python packages are no longer installed by default.
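The job-number arguments described in the list above (positive numbers counted from the start of the jobs list, negative ones from the end) can be illustrated with a short sketch. This is not webchanges' own code; the function resolve_job_indices and its error handling are assumptions made for illustration only.

```python
def resolve_job_indices(indices: list[int], job_count: int) -> list[int]:
    """Map command-line job indices to 1-based job numbers.

    Positive indices are 1-based job numbers; negative indices count from
    the end of the jobs list (-1 is the last job, -2 the second to last).
    """
    resolved = []
    for index in indices:
        if index == 0 or abs(index) > job_count:
            raise ValueError(f"Job index {index} is out of range for {job_count} jobs")
        # Positive indices are kept as-is; negative ones are converted.
        resolved.append(index if index > 0 else job_count + index + 1)
    return resolved


if __name__ == "__main__":
    # With 5 jobs, "webchanges 2 3" selects jobs 2 and 3, "webchanges -1" job 5.
    print(resolve_job_indices([2, 3], 5))
    print(resolve_job_indices([-1], 5))
```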
Files and location
The default name of the jobs file has been changed to jobs.yaml; for backward-compatibility if at program launch no jobs.yaml exists but urls.yaml is found, its contents are copied into a newly created jobs.yaml file and the original preserved for manual deletion.
The default name of the program configuration file has been changed to config.yaml; for backward-compatibility if at program launch no config.yaml exists but urlwatch.yaml is found, its contents are copied into a newly created config.yaml file and the original preserved for manual deletion.
In Windows, the location of the jobs and configuration files has been moved to %USERPROFILE%\Documents\webchanges, where they can be more easily edited (they are indexed there) and backed up; if at program launch jobs and configurations files are only found in the old location (such as during an upgrade), these will be copied to the new directory automatically and the old ones preserved for manual deletion.
Legacy lib/hooks.py file location is no longer supported: hooks.py needs to be in the same directory as the job and configuration files.
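The copy-if-missing behavior described above for urls.yaml and urlwatch.yaml can be sketched as follows. This is only an illustration of the rule as stated, not webchanges' implementation; the function migrate_legacy_file and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def migrate_legacy_file(directory: Path, new_name: str, legacy_name: str) -> Path:
    """Copy a legacy file to its new name if only the legacy file exists.

    The legacy file is never modified or removed, so switching back to
    urlwatch remains possible.
    """
    new_file = directory / new_name
    legacy_file = directory / legacy_name
    if not new_file.exists() and legacy_file.exists():
        # Copy the contents only; the original is preserved for manual deletion.
        shutil.copyfile(legacy_file, new_file)
    return new_file


if __name__ == "__main__":
    # Hypothetical directory; adjust to wherever your files actually live.
    print(migrate_legacy_file(Path.cwd(), "jobs.yaml", "urls.yaml"))
```

Because an existing jobs.yaml or config.yaml is never overwritten, later edits to the new files are safe from being replaced by the legacy ones.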
Directives
Navigation by full browser is now accomplished by specifying the url and adding the use_browser: true directive. The use of the navigate directive instead of the url one has been deprecated for clarity and will trigger a warning; this directive will be removed in a future release.
The html2text filter defaults to using the Python html2text package (with optimized defaults) instead of re (now renamed strip_tags for clarity).
New additions_only directive to report only added lines (useful when monitoring only new content).
New deletions_only directive to report only deleted lines.
New contextlines directive to specify the number of context lines in a unified diff.
New no_redirects job directive (for url jobs) to disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection.
New directives for use_browser: true (i.e. using Chrome) jobs to allow more flexibility and control: switches, wait_until, ignore_https_errors, wait_for_navigation, wait_for, user_data_dir, block_elements, cookies, headers, http_proxy, https_proxy, and timeout.
New note job directive to add a freetext note appearing in the report after the job header.
New sub-directives for the strip filter: chars, side and splitlines.
The html2text filter’s re method has been renamed strip_tags for clarity; the old name is deprecated and will trigger a warning.
The pdf2text filter now supports the raw and physical sub-directives, which are passed to the underlying Python package pdftotext (version 2.2.0 or higher).
New format-xml filter to pretty-print XML using the lxml Python package’s etree.tostring pretty_print function.
url directive supports ftp:// URLs.
The user_visible_url job directive now applies to all types of jobs, including command ones.
The grep filter has been renamed keep_lines_containing for clarity; the old name is deprecated and will trigger a warning; it will be removed in a future release.
The grepi filter has been renamed delete_lines_containing for clarity; the old name is deprecated and will trigger a warning; it will be removed in a future release.
Both the keep_lines_containing and delete_lines_containing filters accept text (default) in addition to re (regular expressions).
New filter execute to filter the data using an executable without invoking the shell (as shellpipe does, thereby exposing you to additional security risks).
Support for ftp:// URLs to download a file from an ftp server.
The use of the kind directive in jobs.yaml configuration files has been deprecated for simplicity (but is, for now, still used internally); it will be removed in a future release.
New browser reporter to display HTML-formatted report on a local browser.
The telegram reporter now uses MarkdownV2 and preserves most formatting of HTML sites processed by the html2text filter, e.g. clickable links, bolding, underlining, italics and strikethrough.
New sub-directive silent for telegram reporter to receive a notification with no sound.
The slack webhook reporter allows the setting of maximum report length (for, e.g., usage with Discord) using the max_message_length sub-directive.
url jobs with use_browser: true (i.e. using Chrome) now recognize data and method directives, enabling e.g. a POST HTTP request using a browser with JavaScript support.
New tz key for report in configuration file sets the timezone for the diff in reports (useful if running e.g. on a cloud server in a different timezone).
New run_command reporter to execute a command and pass the report text as its input.
New remove_repeated filter to remove repeated lines (similar to Unix’s uniq).
The execute filter (and shellpipe) sets more environment variables to allow for more flexibility.
Whenever an HTTP client error (4xx) response is received, in --verbose mode the content of the response is displayed with the error.
The user is now alerted when the job file and/or configuration file contains unrecognized directives (e.g. typo).
If a newer version of webchanges has been released to PyPI, an advisory notice is printed to stdout.
Reports
Reports are now sorted alphabetically.
If a newer version of webchanges has been released to PyPI, an advisory notice is added to the report footer (if footer is enabled).
Internals
Concurrency with use_browser: true (i.e. using Chrome) jobs takes into account amount of free memory for higher stability.
Upgraded concurrent execution loop to concurrent.futures.ThreadPoolExecutor.map.
A new, more efficient indexed database no longer requiring external Python package minidb.
Changed timing from datetime to timeit.default_timer.
Replaced custom atomic_rename function with built-in os.replace() (new in Python 3.3) that does the same thing.
Upgraded email construction from using email.mime (obsolete) to email.message.EmailMessage.
Reports’ elapsed time now always has at least 2 significant digits.
Unicode is supported throughout, including in filters and jobs and configuration YAML files.
Implemented pathlib (new in Python 3.4) for better code readability and functionality.
A 32 percentage point increase in code testing coverage (to ~74%), a completely new continuous integration (CI) and continuous delivery (CD) pipeline (GitHub Actions), and testing on Ubuntu and macOS (with Windows 10 x64 to come) increase the reliability of new releases.
Using flake8 to check PEP-8 compliance and more.
Using coverage to check unit testing coverage.
Strengthened security with bandit to catch common security issues.
Standardized code formatting with black.
Properly arranging imports with isort.
Added type hinting to the entire code and using mypy to check it.
A vast improvement in documentation and error text.
The support for Python 3.11.
Fixed
Relative to urlwatch:
Diff (change) data is no longer lost if webchanges is interrupted mid-execution or encounters an error in reporting: the permanent database is updated only at the very end (after reports are sent).
The database no longer grows unbounded to infinity. Fix only works when using the new, default, sqlite3 database engine. In this scenario only the latest 4 snapshots are kept, and older ones are purged after every run; the number is selectable with the new --max-snapshots command line argument. To keep the existing grow-to-infinity behavior, run webchanges with --max-snapshots 0.
The html2text filter’s html2text method defaults to Unicode handling.
The html2text filter’s strip_tags method is no longer returning HTML character references (e.g. &gt;, &#62;, &#x3e;) but the corresponding Unicode characters.
HTML href links ending with spaces are no longer broken by xpath replacing spaces with %20.
Initial config file no longer has directives sorted alphabetically, but are saved logically (e.g. ‘enabled’ is always the first sub-directive for a reporter).
The presence of the data directive in a job no longer forces the method to POST, allowing e.g. PUTs.
format-json filter no longer unexpectedly reorders contents of dictionaries, but the new sub-directive sort_keys allows you to set it to do so if you want to.
When using the --edit or --edit-config command line arguments to edit jobs or configuration files, symbolic file links are maintained (no longer overwritten by the file).
Jobs file (e.g. jobs.yaml) is now loaded only once per run.
Fixed various system errors and freezes when running url jobs with use_browser: true (formerly navigate jobs).
Job headers stored in the configuration file (config.yaml) are now merged correctly and case-insensitively with those present in the job (in jobs.yaml). A header in the job replaces a header by the same name if already present in the configuration file, otherwise is added to the ones present in the configuration file.
Fixed TypeError: expected string or bytes-like object error in cookiejar (called by requests module) caused by some cookies being read from the jobs YAML file in other formats.
Use same retrieval duration precision in all reports.
Fixed a rare case when html report would not correctly reconstruct a clickable link from Markdown for (an) item(s) inside an element in a list.
No longer errors out when telegram reporter’s chat_id is numeric.
--test-differ command line argument was showing historical diffs in wrong order; now showing most recent first.
An error is now raised when a url job with use_browser: true returns no data due to an HTTP error (e.g. proxy_authentication_required).
Jobs were included in email subject line even if there was nothing to report after filtering with additions_only or deletions_only.
hexdump filter now correctly formats lines with less than 16 bytes.
sha1sum and hexdump filters now accept data that is bytes (not just text).
Fixed case of wrong ETag being captured and saved when a URL redirection took place.
Rewrote most error messages for increased clarity.
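The header-merging rule stated in the list above (job headers override configuration headers of the same name, compared case-insensitively, and are otherwise added) can be sketched as below. This is an illustration of the rule only, not webchanges' actual code; merge_headers is a hypothetical name, and keeping the job's spelling of an overridden header name is an assumption.

```python
def merge_headers(config_headers: dict[str, str], job_headers: dict[str, str]) -> dict[str, str]:
    """Merge job headers over configuration headers, ignoring name case."""
    # Index configuration headers by lowercased name so lookups ignore case,
    # while remembering the spelling that was actually used.
    merged = {name.lower(): (name, value) for name, value in config_headers.items()}
    for name, value in job_headers.items():
        # A job header replaces a same-named configuration header,
        # otherwise it is simply added.
        merged[name.lower()] = (name, value)
    return dict(merged.values())


if __name__ == "__main__":
    config = {"User-Agent": "from-config", "Accept": "text/html"}
    job = {"user-agent": "from-job", "X-Test": "1"}
    print(merge_headers(config, job))
```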
Deprecations
Relative to urlwatch:
The html2text filter’s lynx method is no longer supported as it was obsoleted by Python libraries; use the default method instead or construct a custom execute command.
The following deprecations are (for now) still working but will issue a warning:
Job directive kind is unused: remove from job.
Job directive navigate is deprecated: use url and add use_browser: true.
Method pyhtml2text of filter html2text is deprecated; since that method is now the default, remove the method’s sub-directive.
Method re of filter html2text is renamed to strip_tags for clarity.
Filter grep is renamed to keep_lines_containing for clarity.
Filter grepi is renamed to delete_lines_containing for clarity.
Command line --test-filter argument is renamed to --test for clarity.
Command line --test-diff-filter argument is renamed to --test-differ for clarity.
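Some of the renames listed above can be applied mechanically to job data once it has been loaded from YAML. The sketch below covers only the kind, navigate, grep and grepi items; it assumes a job is a plain dict and a filter list holds either filter names or single-key dicts, which is an assumption about the in-memory shape and not something webchanges exposes. Both function names are hypothetical.

```python
# Deprecated filter names mapped to their replacements, as listed above.
FILTER_RENAMES = {"grep": "keep_lines_containing", "grepi": "delete_lines_containing"}


def modernize_job(job: dict) -> dict:
    """Drop the unused kind directive and replace navigate with url + use_browser."""
    updated = {key: value for key, value in job.items() if key != "kind"}
    if "navigate" in updated:
        updated["url"] = updated.pop("navigate")
        updated["use_browser"] = True
    return updated


def modernize_filters(filters: list) -> list:
    """Rename deprecated filters in a list of names or single-key dicts."""
    result = []
    for item in filters:
        if isinstance(item, dict):
            # Keep the sub-directive value, change only the filter name.
            result.append({FILTER_RENAMES.get(name, name): value for name, value in item.items()})
        else:
            result.append(FILTER_RENAMES.get(item, item))
    return result


if __name__ == "__main__":
    print(modernize_job({"kind": "browser", "navigate": "https://example.com/"}))
    print(modernize_filters([{"grep": "price"}, "grepi"]))
```

Since the deprecated names still work for now, a rewrite like this is optional and mainly useful for silencing the warnings.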
Also be aware that:
The name of the default job file has changed to jobs.yaml; if not found, legacy urls.yaml will be automatically copied into it.
The name of the default configuration file has changed to config.yaml; if not found, legacy urlwatch.yaml will be automatically copied into it.
The location of configuration and jobs files in Windows has changed to %USERPROFILE%/Documents/webchanges where they can be more easily edited and backed up.
Legal
The roots of webchanges from code of urlwatch 2.21 dated 30 July 2020 are credited throughout, and its code is appropriately copyrighted/licensed:
webchanges’ main page reads:
License
=======
Released under the `MIT License <https://opensource.org/licenses/MIT>`__ but redistributing modified source code from
`urlwatch 2.21 <https://github.com/thp/urlwatch/tree/346b25914b0418342ffe2fb0529bed702fddc01f>`__ dated 30 July 2020
licensed under a `BSD 3-Clause License
<https://raw.githubusercontent.com/thp/urlwatch/346b25914b0418342ffe2fb0529bed702fddc01f/COPYING>`__. See the
complete license `here <https://github.com/mborsetti/webchanges/blob/main/LICENSE.md>`__.
Each file with code contains this remark at the top:
# The code below is subject to the license contained in the LICENSE.md file, which is part of the source code.
Note: There is no requirement anywhere in law to spam the entire 61-line, 465-word license text on Every. Single. File. In. Every. Single. Directory; the above notice is amply sufficient.
The license file reads:
This software redistributes source code of release 2.21 of urlwatch
https://github.com/thp/urlwatch/tree/346b25914b0418342ffe2fb0529bed702fddc01f of 30 July 2021 which is subject to
the following copyright notice and license from
https://raw.githubusercontent.com/thp/urlwatch/346b25914b0418342ffe2fb0529bed702fddc01f/COPYING hereby retained
and redistributed with the source code (of which this license file is part of), in binary form, and in the
documentation. The appearance of the name of the author below does not constitute an endorsement or promotion of
this software by such author.
Copyright (c) 2008-2020 Thomas Perl <m@thp.io>
All rights reserved.
[follows full text of the urlwatch license]
While a lot of improvements have been made from urlwatch 2.21, there’s no lack of proper and complete acknowledgement of the package’s use of Thomas Perl’s code – in multiple ways, including the required full and explicit licensing language.