Differs

A differ is applied to the filtered data if it has changed from the previous run(s). A differ summarizes the changes in the data and produces the content of the report sent to you. The output of the differ can be further filtered using any of the filters listed in Filters (see Filtering the diff below).

At the moment, the following differs are available:

  • unified: (default) Compares data line-by-line, showing changed lines in a “unified format”;

  • command: Executes an outside command that acts as a differ (e.g. wdiff);

  • deepdiff: Compares structured data (JSON or XML) element-by-element.

  • table: A Python version of the unified differ where the changes are displayed as an HTML table;

In addition, the following BETA differs are available:

  • ai_google: Detects and summarizes changes using Generative AI (free API key required).

  • image: Detects changes in an image and displays them as overlay over a grayscale version of the old image.

A differ is specified using the job directive differ. To select a differ with its default directive values, assign the name of the differ as the value:

url: https://example.net/unified.html
differ: unified  # this entire line can be omitted as it's the default differ
url: https://example.net/deepdiff.html
differ: deepdiff  # use the deepdiff differ with its default values

Otherwise, the differ directive is a dictionary, and the name key contains the name of the differ:

url: https://example.net/unified_no_range.html
differ:
  name: unified
  range_info: false

unified

The default differ used when the differ job directive is not specified (except, for backward compatibility, when in the configration file the html report has the deprecated diff key set to table).

It does a line-by-line comparison, and reports lines that have been added (+), deleted (-), or changed. Changed lines are displayed twice: once marked as “deleted” (-) representing the old content, and once as “added” (+) representing the new content. Results are displayed in the unified format (the “unified diff”).

For HTML reports, webchanges colorizes the unified diff for easier legibility.

Examples:

url: https://example.net/unified.html
differ: unified  # this can also be omitted as it's the default
url: https://example.net/unified_no_range.html
differ:
  name: unified
  range_info: false

The context_lines directive causes a unified diff to have a set number of context lines that might be different from Python’s default of 3 (or 0 if the job contains additions_only: true or deletions_only: true).

Example:

url: https://example.com/#lots_of_contextlines
differ:
  name: unified
  context_lines: 5

Output:

---------------------------------------------------------------------------
CHANGED: https://example.com/#lots_of_contextlines
---------------------------------------------------------------------------
--- @   Sat, 01 Oct 2020 00:00:00 +0000
... @   Sat, 01 Oct 2020 01:00:00 +0000
@@ -1,15 +1,15 @@
 This is line 10
 This is line 11
 This is line 12
 This is line 13
 This is line 14
-This is line 15
+This is line fifteen
 This is line 16
 This is line 17
 This is line 18
 This is line 19
 This is line 20

Example (using the default, i.e. 3):

url: https://example.com/#default_contextlines

Output:

---------------------------------------------------------------------------
CHANGED: https://example.com/#default_contextlines
---------------------------------------------------------------------------
--- @   Sat, 01 Oct 2020 00:00:00 +0000
... @   Sat, 01 Oct 2020 01:00:00 +0000
@@ -1,15 +1,15 @@
 This is line 12
 This is line 13
 This is line 14
-This is line 15
+This is line fifteen
 This is line 16
 This is line 17
 This is line 18

Optional directives

  • context_lines (int): The number of lines on each side surrounding changes to include in the report (default: 3).

  • range_info (true/false): Whether to include line range information lines (those starting with @) (default: true).

Changed in version 3.21: Became a standalone differ. Added the range_info directive. Added the context_line directive, which replaces the job directive contextlines.

Added in version 3.0: contextlines (as a job directive)

ai_google

Added in version 3.21.

This differ is currently in BETA and the name and/or directives MAY change in the future, mostly because of the rapid advances in the technology and the prospect of integrating more generative AI models. Feedback welcomed here.

Prefaces a unified diff with a textual summary of changes generated by Google’s Gemini Pro 1.5 Generative AI model (in Preview) called via an API call. This is free of charge for most developers.

Important

Requires a system environment variable GOOGLE_AI_API_KEY containing the Google Cloud AI Studio API Key which you obtain here and which itself requires a Google Cloud account. To access the Gemini Pro 1.5 model during the Preview period, you may have to make a request here. Please note that starting on 2 May 2024, the use of Gemini API from a project that has billing enabled will be subject to pay-as-you-go pricing. To avoid surprises, we recommend you set up your API key on a GCP project without billing or, at a minimum, set up a budget with threshold notification.

Gemini Pro 1.5 is the first widely available model with a context window of up to 1 million tokens, which allows it to analyze changes in long documents (up to 350,000 words, or about 700 pages single-spaced) such as terms and conditions, privacy policies, etc. that other models can’t handle. For clarity, the model can handle up to 700,000 words, but to do a comparison we need up to a half of this for the old text and the rest for the new text.

To improve speed and reduce the number of tokens, by default we generate a separate, complete, unified diff which we feed to the Generative AI model to summarize. See below for a custom prompt that instead feeds both the old data and the new data to the model asking it to make the comparison.

This differ also works with less-powerful Google models, such as Gemini 1.0 Pro (see the model directive), whose access is not gated.

Warning

Generative AI can “hallucinate” (make things up), so always double-check the AI-generated summary with the accompanying unified diff.

Examples

Using the default prompt, a summary is prefaced to the unified diff:

Summary of Changes:

The provided unified diff shows a single line change:

  • Line 1: The timestamp was updated from Sat Apr 6 10:46:13 UTC 2024 to Sat Apr 6 10:55:04 UTC 2024.
--- @ Sat, 06 Apr 2024 10:46:13 +0000
+++ @ Sat, 06 Apr 2024 10:55:04 +0000
@@ -1 +1 @@
Sat Apr 6 10:46:13 UTC 2024
Sat Apr 6 10:55:04 UTC 2024
---
Summary generated by Google Generative AI (differ directive(s): model=gemini-1.5-pro-latest)

The job directive below uses a custom prompt to have the Generative AI make the comparison. This requires a lot more tokens and time, but may work better in certain cases. More information about writing input prompts for these models can be found here.

command: date
differ:
  name: ai_google
  prompt: Identify and summarize the changes:\n\n<old>\n{old_data}\n</old>\n\n<new>\n{new_data}\n</new>

Mandatory environment variable

  • GOOGLE_AI_API_KEY: Must contain your Google Cloud AI Studio API Key.

Optional directives

This differ is currently in BETA and these directives MAY change in the future.

  • model (str): A model name (default: gemini-1.5-pro-latest).

  • prompt (str): The prompt sent to the model; the strings {unified_diff}, {old_data} and {new_data} will be replaced by the respective content (default: Analyze this unified diff and create a summary listing only the changes:\n\n{unified_diff}).

  • prompt_ud_context_lines (int): Number of context lines in the unified diff sent to the model if {unified_diff} is present in the prompt (default: 999). If the resulting model prompt becomes approximately too big for the model to handle, the unified diff will be recalculated with the default number of context lines (3). Note that this unified diff is a different one than the diff included in the report itself.

  • timeout (float): The number of seconds before timing out the API call (default: 300).

  • temperature (float between 0.0 and 1.0): The model’s Temperature parameter, which controls randomness; higher values increase diversity (see note below) (default: 0.0).

  • top_k (int of 1 or greater): The model’s TopK parameter, i.e. sample from the k most likely next tokens at each step; lower k focuses on higher probability tokens (see note below) (default: model-dependent, but typically 1, see Google documentation; not available in gemini-1.5-pro-latest)

  • top_p (float between 0.0 and 1.0): The model’s TopP parameter, or the cumulative probability cutoff for token selection; lower p means sampling from a smaller, more top-weighted nucleus and reduces diversity (see note below) (default: model-dependent, but typically 0.95 or 1.0, see Google documentation)

  • token_limit (int): An override of the maximum size of the model’s context window (used for internal testing).

  • unified (dict): directives passed to unified differ, which prepares the unified diff attached to this report.

Directives for the underlying unified differ can be passed in as key called `unified, as follows:

command: date
differ:
  name: ai_google
  name: unified
    context_lines: 5
    range_info: false

Note

You can learn about Temperature, TopK and TopP parameters here. In general, temperature increases creativity and diversity in phrasing variety, while top-p and top-k influences variety of individual words with low values leading to potentially repetitive summaries. The only way to get these “right” is through experimentation with actual data, as the results are highly dependent on it and subjective to your personal preferences.

Tip

You can do “dry-runs” of this (or any) differ on an existing job by editing the differ in the job file and running e.g. webchanges --test-differ 1 --test-reporter browser. Don’t forget to revert your job file if you don’t like the new outcome!

command

Call an external differ (e.g. wdiff). The old data and new data are written to a temporary file, and the names of the two files are appended to the command. The external program will have to exit with a status of 0 if no differences were found, a status of 1 if any differences were found, or any other status for any error.

If wdiff is used, its output will be colorized when displayed on stdout (typically a screen) and for HTML reports.

For increased legibility, the differ directive name is not required (command is sufficient).

Example:

url: https://example.net/command.html
differ:
  command: wdiff

Please see important note for the file security settings required to run jobs with this differ in Linux.

Changed in version 3.21: Was previously a job sub-directive by the name of diff_tool.

deepdiff

Added in version 3.21.

Inspects structured data (JSON or XML) on an element by element basis and reports which elements have changed, using a customized report based on deepdiff’s library DeepDiff module.

Examples:

url: https://example.net/deepdiff_json.html
differ: deepdiff  # defaults to json data
url: https://example.net/deepdiff_xml_ignore_oder.html
differ:
  name: deepdiff
  data_type: xml
  ignore_order: true

Example diff:

Differ: deepdiff for json Old Sat, 13 Apr 2024 21:19:36 +0800 New Sun, 14 Apr 2024 21:24:14 +0800 ------------------------------------ • Type of ['Items'][0]['CurrentInventory'] changed from int to NoneType and value changed from "1" to None. • Type of ['Items'][0]['Description'] changed from str to NoneType and value changed from "Gadget" to None.

Optional directives

  • data_type (json or xml): The type of data being analyzed (default: json).

  • ignore_order (true/false): Whether to ignore the order in which the items have appeared (default: false).

  • ignore_string_case (true/false): Whether to be case-sensitive or not when comparing strings (default: false).

  • significant_digits (int): The number of digits AFTER the decimal point to be used in the comparison (default: no limit).

Required packages

To run jobs with this differ, you need to first install additional Python packages as follows:

pip install --upgrade webchanges[deepdiff]

image

Added in version 3.21.

This differ is currently in BETA, mostly because it’s unclear what more needs to be changed or parametrized in order to make the differ work with a vast variety of images. Feedback welcomed here.

Highlights changes in an image by overlaying them in yellow on a greyscale version of the original image. Only works with HTML reports.

url: https://example.net/image.html
differ:
  name: image
  data_type: url

Optional directives

This differ is currently in BETA and the directives may change in the future.

  • data_type (url, filename, ascii85 or base64): The type of data to process: a link to the image, the path to the file containing the image, or the image itself encoded as Ascii85 or RFC 4648 Base_64 text (default: url).

  • mse_threshold (float): The minimum mean squared error (MSE) between two images to consider them changed; requires the package numpy to be installed (default: 2.5).

Note

If you pass a url or filename to the differ, it will detect changes only if the url or filename changes, not if the image behind the url/filename does. To detect changes in an image when the url or filename doesn’t change, build a job that captures the image itself encoded in Ascii85 (preferably, see the ascii85 filter or Base64 and set data_type accordingly.

Required packages

To run jobs with this differ, you need to first install additional Python packages as follows:

pip install --upgrade webchanges[imagediff]

In addition, you can only run it with a default configuration of :program:webchanges:, which installsthe httpx HTTP Client library; requests is not supported.

table

Similar to unified, it performs a line-by-line comparison and reports lines that have been added, deleted, or changed, but the HTML table format produced by Python’s difflib.HtmlDiff. Example output:

f 1 This line is the same f 1 This line is the same
n 2 This line is in the left file but not the right n
3 Another line that is the same 2 Another line that is the same
t t 3 This line is in the right file but not the left

For backwards compatibility, this is the default differ for an html reporter with the configuration setting diff (deprecated) set to html.

url: https://example.net/table.html
differ: table

Optional directives

  • tabsize: Tab stop spacing (default: 8).

Changed in version 3.21: Became a standalone differ (previously only accessible through configuration file settings). Added the tabsize directive.