webchanges.storage module

Storage: jobs files, config files, hooks file, and cache database engines.

class webchanges.storage.BaseFileStorage(filename)

Bases: BaseStorage, ABC

Base class for file storage.

Parameters:

filename (str | Path) – The filename or directory name of the storage.

class webchanges.storage.BaseStorage

Bases: ABC

Base class for storage.

class webchanges.storage.BaseTextualFileStorage(filename)

Bases: BaseFileStorage, ABC

Base class for textual files.

Parameters:

filename (str | Path) – The filename or directory name of the storage.

abstractmethod load(*args)

Load from storage.

Parameters:

args (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

abstractmethod save(*args, **kwargs)

Save to storage.

Parameters:
  • args (Any) – Specified by the subclass.

  • kwargs (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

abstractmethod classmethod parse(filename)

Parse storage contents.

Parameters:

filename (Path) – The filename.

Returns:

Specified by the subclass.

Return type:

Any

edit()

Edit file.

Returns:

None if edit is successful, 1 otherwise.

Return type:

int

remove_remark_keys(_dict)

Recursively removes all keys starting with ‘_’ from a dict.

Parameters:

_dict (T) – The dict.

Returns:

The dict with all keys starting with ‘_’ removed.

Return type:

T
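
The recursive removal described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the function name strip_remark_keys is hypothetical, and it assumes that only dicts are descended into and that non-string keys are always kept.

```python
from typing import Any


def strip_remark_keys(value: Any) -> Any:
    """Return a copy of value with every dict key starting with '_' removed."""
    if isinstance(value, dict):
        return {
            key: strip_remark_keys(item)
            for key, item in value.items()
            # Assumption: non-string keys cannot be remarks, so they are kept.
            if not (isinstance(key, str) and key.startswith("_"))
        }
    # Anything that is not a dict is returned unchanged.
    return value


config = {
    "url": "https://example.com",
    "_note": "a remark",
    "filter": {"_why": "another remark", "css": "h1"},
}
cleaned = strip_remark_keys(config)
```

The sketch builds a new dict rather than mutating its argument, so the original mapping (remarks included) stays available to the caller.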

class webchanges.storage.BaseYamlFileStorage(filename)

Bases: BaseTextualFileStorage, ABC

Base class for YAML textual files storage.

Parameters:

filename (str | Path) – The filename or directory name of the storage.

edit()

Edit file.

Returns:

None if edit is successful, 1 otherwise.

Return type:

int

abstractmethod load(*args)

Load from storage.

Parameters:

args (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

remove_remark_keys(_dict)

Recursively removes all keys starting with ‘_’ from a dict.

Parameters:

_dict (T) – The dict.

Returns:

The dict with all keys starting with ‘_’ removed.

Return type:

T

abstractmethod save(*args, **kwargs)

Save to storage.

Parameters:
  • args (Any) – Specified by the subclass.

  • kwargs (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

classmethod parse(filename)

Return the contents of the YAML file if it exists.

Parameters:

filename (Path) – The filename Path.

Returns:

Specified by the subclass.

Return type:

Any

class webchanges.storage.JobsBaseFileStorage(filename)

Bases: BaseTextualFileStorage, ABC

Class for jobs textual files storage.

Parameters:

filename (list[Path]) – The filenames of the jobs file.

filename: list[Path]

load_secure()

Load the jobs from a text file checking that the file is secure (i.e. belongs to the current UID and only the owner can write to it - Linux only).

Returns:

List of JobBase objects.

Return type:

list[JobBase]
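
The kind of check described above (file owned by the current UID, writable only by its owner) can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: the helper name is_secure is hypothetical, and the check relies on os.getuid(), which is only available on POSIX systems.

```python
import os
import stat
import tempfile
from pathlib import Path


def is_secure(path: Path) -> bool:
    """True if path belongs to the current user and only the owner can write to it."""
    info = path.stat()
    owned_by_user = info.st_uid == os.getuid()  # POSIX only
    # Any write bit for group or others makes the file insecure.
    others_can_write = bool(info.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
    return owned_by_user and not others_can_write


with tempfile.TemporaryDirectory() as tmp:
    jobs_file = Path(tmp) / "jobs.yaml"
    jobs_file.write_text("url: https://example.com\n")

    os.chmod(jobs_file, 0o600)  # owner read/write only
    private_ok = is_secure(jobs_file)

    os.chmod(jobs_file, 0o666)  # group and others may write
    shared_ok = is_secure(jobs_file)
```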

edit()

Edit file.

Returns:

None if edit is successful, 1 otherwise.

Return type:

int

abstractmethod load(*args)

Load from storage.

Parameters:

args (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

abstractmethod classmethod parse(filename)

Parse storage contents.

Parameters:

filename (Path) – The filename.

Returns:

Specified by the subclass.

Return type:

Any

remove_remark_keys(_dict)

Recursively removes all keys starting with ‘_’ from a dict.

Parameters:

_dict (T) – The dict.

Returns:

The dict with all keys starting with ‘_’ removed.

Return type:

T

abstractmethod save(*args, **kwargs)

Save to storage.

Parameters:
  • args (Any) – Specified by the subclass.

  • kwargs (Any) – Specified by the subclass.

Returns:

Specified by the subclass.

Return type:

Any

class webchanges.storage.SsdbDirStorage(dirname)

Bases: SsdbStorage

Class for snapshots stored as individual textual files in a directory ‘dirname’.

Parameters:
  • filename – The filename or directory name of the storage.

  • dirname (str | Path)

close()
Return type:

None

get_guids()
Return type:

list[str]

load(guid)
Parameters:

guid (str)

Return type:

Snapshot

get_history_data(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

dict[str | bytes, float]

get_history_snapshots(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

list[Snapshot]

save(*args, guid, snapshot, **kwargs)
Parameters:
  • args (Any)

  • guid (str)

  • snapshot (Snapshot)

  • kwargs (Any)

Return type:

None

delete(guid)
Parameters:

guid (str)

Return type:

None

delete_latest(guid, delete_entries=1, **kwargs)

For the given ‘guid’, delete the latest entry and keep all other (older) ones.

Parameters:
  • guid (str) – The guid.

  • delete_entries (int) – The number of most recent entries to delete.

  • kwargs (Any)

Raises:

NotImplementedError – This function is not implemented for ‘textfiles’ databases.

Return type:

int

delete_all()

Delete all entries; used for testing only.

Raises:

NotImplementedError – This function is not implemented for ‘textfiles’ databases.

Return type:

int

clean(guid, keep_entries=1)

Removes all entries for guid except the latest keep_entries.

Parameters:
  • guid (str) – The guid.

  • keep_entries (int) – The number of most recent entries to keep.

Returns:

Number of records deleted.

Return type:

int

move(guid, new_guid)

Moves the data from guid to new_guid.

Parameters:
  • guid (str) – The guid.

  • new_guid (str) – The new guid.

Returns:

Number of records moved.

Return type:

int

rollback(timestamp)

Rolls back the database to timestamp.

Parameters:

timestamp (float) – The timestamp.

Returns:

Number of records deleted.

Raises:

NotImplementedError – For those classes where this method is not implemented.

Return type:

None

flushdb()

Delete all entries of the database. Use with care, there is no undo!

Return type:

None

backup()

Return the most recent entry for each ‘guid’.

Returns:

A generator of tuples, each consisting of (guid, data, timestamp, tries, etag, mime_type)

Return type:

Iterator[tuple[str, str | bytes, float, int, str, str, ErrorData]]

clean_ssdb(known_guids, keep_entries=1)

Convenience function to clean the cache.

If self.clean_all is present, runs clean_all(). Otherwise, runs clean() on all known_guids, one at a time. Prints the number of snapshots deleted.

Parameters:
  • known_guids (Iterable[str]) – An iterable of guids

  • keep_entries (int) – Number of entries to keep after deletion.

Return type:

None

gc(known_guids, keep_entries=1)

Garbage collect the database: delete all guids not included in known_guids and keep only the latest keep_entries snapshots for the others.

Parameters:
  • known_guids (Iterable[str]) – The guids to keep.

  • keep_entries (int) – Number of entries to keep after deletion for the guids to keep.

Return type:

None

restore(entries)

Save multiple entries into the database.

Parameters:

entries (Iterable[tuple[str, str | bytes, float, int, str, str, ErrorData]]) – An iterator of tuples WHERE each consists of (guid, data, timestamp, tries, etag, mime_type)

Return type:

None

class webchanges.storage.SsdbRedisStorage(filename)

Bases: SsdbStorage

Class for storing snapshots using redis.

Parameters:

filename (str | Path) – The filename or directory name of the storage.

close()
Return type:

None

get_guids()
Return type:

list[str]

load(guid)
Parameters:

guid (str)

Return type:

Snapshot

get_history_data(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

dict[str | bytes, float]

get_history_snapshots(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

list[Snapshot]

save(*args, guid, snapshot, **kwargs)
Parameters:
  • args (Any)

  • guid (str)

  • snapshot (Snapshot)

  • kwargs (Any)

Return type:

None

delete(guid)
Parameters:

guid (str)

Return type:

None

delete_latest(guid, delete_entries=1, **kwargs)

For the given ‘guid’, delete the latest ‘delete_entries’ entries and keep all other (older) ones.

Parameters:
  • guid (str) – The guid.

  • delete_entries (int) – The number of most recent entries to delete (only 1 is supported by this Redis code).

  • kwargs (Any)

Returns:

Number of records deleted.

Return type:

int

delete_all()

Delete all entries; used for testing only.

Returns:

Number of records deleted.

Return type:

int

clean(guid, keep_entries=1)

Removes all entries for guid except the latest keep_entries.

Parameters:
  • guid (str) – The guid.

  • keep_entries (int) – The number of most recent entries to keep.

Returns:

Number of records deleted.

Return type:

int

move(guid, new_guid)

Replace the guid in records matching ‘guid’ with the ‘new_guid’ value.

If there are existing records with ‘new_guid’, they will not be overwritten and the job histories will be merged.

Returns:

Number of records searched for replacement.

Parameters:
  • guid (str)

  • new_guid (str)

Return type:

int

rollback(timestamp)

Rolls back the database to timestamp.

Raises:

NotImplementedError – This function is not implemented for the ‘redis’ database engine.

Parameters:

timestamp (float)

Return type:

None

flushdb()

Delete all entries of the database. Use with care, there is no undo!

Return type:

None

backup()

Return the most recent entry for each ‘guid’.

Returns:

A generator of tuples, each consisting of (guid, data, timestamp, tries, etag, mime_type)

Return type:

Iterator[tuple[str, str | bytes, float, int, str, str, ErrorData]]

clean_ssdb(known_guids, keep_entries=1)

Convenience function to clean the cache.

If self.clean_all is present, runs clean_all(). Otherwise, runs clean() on all known_guids, one at a time. Prints the number of snapshots deleted.

Parameters:
  • known_guids (Iterable[str]) – An iterable of guids

  • keep_entries (int) – Number of entries to keep after deletion.

Return type:

None

gc(known_guids, keep_entries=1)

Garbage collect the database: delete all guids not included in known_guids and keep only the latest keep_entries snapshots for the others.

Parameters:
  • known_guids (Iterable[str]) – The guids to keep.

  • keep_entries (int) – Number of entries to keep after deletion for the guids to keep.

Return type:

None

restore(entries)

Save multiple entries into the database.

Parameters:

entries (Iterable[tuple[str, str | bytes, float, int, str, str, ErrorData]]) – An iterator of tuples WHERE each consists of (guid, data, timestamp, tries, etag, mime_type)

Return type:

None

class webchanges.storage.SsdbSQLite3Storage(filename, max_snapshots=4)

Bases: SsdbStorage

Handles storage of the snapshot as a SQLite database in the ‘filename’ file using Python’s built-in sqlite3 module and the msgpack package.

A temporary database is created by __init__, and the ‘save()’ function writes to it unless called with temporary=False. The ‘close()’ function, which is called at the end of program execution, writes this data to the permanent database.

The database contains the ‘webchanges’ table with the following columns:

  • guid: unique hash of the “location”, i.e. the URL/command; indexed

  • timestamp: the Unix timestamp of when the snapshot was taken; indexed

  • msgpack_data: a msgpack blob containing ‘data’, ‘tries’, ‘etag’ and ‘mime_type’ in a dict of keys ‘d’, ‘t’, ‘e’ and ‘m’

Parameters:
  • filename (Path) – The full filename of the database file

  • max_snapshots (int) – The maximum number of snapshots to retain in the database for each ‘guid’
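
The table layout described above can be sketched with the built-in sqlite3 module. This is an illustration only: the column types and index names are assumptions, and json is used as a stand-in for msgpack so that the example runs with the standard library alone.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types and index names are assumptions made for this sketch.
conn.execute("CREATE TABLE webchanges (guid TEXT, timestamp REAL, msgpack_data BLOB)")
conn.execute("CREATE INDEX idx_guid ON webchanges (guid)")
conn.execute("CREATE INDEX idx_timestamp ON webchanges (timestamp)")

# json stands in for msgpack here; the keys 'd', 't', 'e' and 'm' hold
# data, tries, etag and mime_type respectively.
blob = json.dumps({"d": "<h1>Hello</h1>", "t": 0, "e": "", "m": "text/html"}).encode()
conn.execute("INSERT INTO webchanges VALUES (?, ?, ?)", ("abc123", 1700000000.0, blob))
conn.commit()

# Fetch the most recent entry for a guid and decode its blob.
row = conn.execute(
    "SELECT msgpack_data FROM webchanges WHERE guid = ? ORDER BY timestamp DESC LIMIT 1",
    ("abc123",),
).fetchone()
latest = json.loads(row[0])
```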

close()

Writes the temporary database to the permanent one, purges old entries if required, and closes all database connections.

Return type:

None

get_guids()

Lists the unique ‘guid’s contained in the database.

Returns:

A list of guids.

Return type:

list[str]

load(guid)

Return the most recent entry matching a ‘guid’.

Parameters:

guid (str) – The guid.

Returns:

A tuple (data, timestamp, tries, etag) WHERE

  • data is the data;

  • timestamp is the timestamp;

  • tries is the number of tries;

  • etag is the ETag.

Return type:

Snapshot

get_history_data(guid, count=None)

Return max ‘count’ (None = all) records of data and timestamp of successful runs for a ‘guid’.

Parameters:
  • guid (str) – The guid.

  • count (int | None) – The maximum number of entries to return; if None return all.

Returns:

A dict (key: value) WHERE

  • key is the snapshot data;

  • value is the most recent timestamp for such snapshot.

Return type:

dict[str | bytes, float]
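
The shape of the returned dict can be illustrated with plain Python. This is a sketch under assumptions, not the library's query: the function name history_data is hypothetical, the input is a list of (data, timestamp) pairs from successful runs, and count is assumed to limit the number of distinct snapshots returned.

```python
from typing import Iterable, Optional


def history_data(rows: Iterable[tuple[str, float]], count: Optional[int] = None) -> dict[str, float]:
    """Map each distinct snapshot data to its most recent timestamp."""
    history: dict[str, float] = {}
    # Walk from newest to oldest so the first time a value is seen
    # is also its most recent timestamp.
    for data, timestamp in sorted(rows, key=lambda row: row[1], reverse=True):
        if data in history:
            continue
        if count is not None and len(history) >= count:
            break
        history[data] = timestamp
    return history


rows = [("v1", 1.0), ("v2", 2.0), ("v1", 3.0)]
everything = history_data(rows)
newest_only = history_data(rows, count=1)
```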

get_history_snapshots(guid, count=None)

Return max ‘count’ (None = all) entries of all data (including from error runs) saved for a ‘guid’.

Parameters:
  • guid (str) – The guid.

  • count (int | None) – The maximum number of entries to return; if None return all.

Returns:

A list of Snapshot tuples (data, timestamp, tries, etag). WHERE the values are:

  • data: The data (str, could be empty);

  • timestamp: The timestamp (float);

  • tries: The number of tries (int);

  • etag: The ETag (str, could be empty).

Return type:

list[Snapshot]

save(*args, guid, snapshot, temporary=True, **kwargs)

Save the data from a job.

By default, it is saved into the temporary database. Call close() to transfer the contents of the temporary database to the permanent one.

Note: the logic is such that any attempts that end in an exception will have tries >= 1, and we replace the data with the one from the most recent successful attempt.

Parameters:
  • guid (str) – The guid.

  • data – The data.

  • timestamp – The timestamp.

  • tries – The number of tries.

  • etag – The ETag (could be empty string).

  • temporary (bool | None) – If true, saved to temporary database (default).

  • args (Any)

  • snapshot (Snapshot)

  • kwargs (Any)

Return type:

None
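
The two-stage write described above (save to a temporary database, then transfer to the permanent one) can be sketched as follows. Everything here is hypothetical: the class TwoStageStore, its flush() and count() methods, and the use of two in-memory databases are assumptions for illustration, not the library's implementation.

```python
import sqlite3

SCHEMA = "CREATE TABLE webchanges (guid TEXT, timestamp REAL, msgpack_data BLOB)"


class TwoStageStore:
    """Hypothetical stand-in: writes go to a temporary database until flushed."""

    def __init__(self) -> None:
        # Both databases live in memory here; a real store would keep the
        # permanent one in a file.
        self.permanent = sqlite3.connect(":memory:")
        self.temporary = sqlite3.connect(":memory:")
        self.permanent.execute(SCHEMA)
        self.temporary.execute(SCHEMA)

    def save(self, guid: str, timestamp: float, blob: bytes, temporary: bool = True) -> None:
        db = self.temporary if temporary else self.permanent
        db.execute("INSERT INTO webchanges VALUES (?, ?, ?)", (guid, timestamp, blob))
        db.commit()

    def flush(self) -> int:
        """Copy every temporary row into the permanent database."""
        rows = self.temporary.execute(
            "SELECT guid, timestamp, msgpack_data FROM webchanges"
        ).fetchall()
        self.permanent.executemany("INSERT INTO webchanges VALUES (?, ?, ?)", rows)
        self.permanent.commit()
        self.temporary.execute("DELETE FROM webchanges")
        self.temporary.commit()
        return len(rows)

    def count(self) -> int:
        return self.permanent.execute("SELECT COUNT(*) FROM webchanges").fetchone()[0]


store = TwoStageStore()
store.save("abc123", 1700000000.0, b"snapshot")
before = store.count()  # the permanent database is still empty
moved = store.flush()
after = store.count()
```

Keeping new snapshots apart until the end of a run means an interrupted run leaves the permanent database untouched.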

delete(guid)

Delete all entries matching a ‘guid’.

Parameters:

guid (str) – The guid.

Return type:

None

delete_latest(guid, delete_entries=1, temporary=False, **kwargs)

For the given ‘guid’, delete the latest ‘delete_entries’ number of entries and keep all other (older) ones.

Parameters:
  • guid (str) – The guid.

  • delete_entries (int) – The number of most recent entries to delete.

  • temporary (bool | None) – If False, deleted from permanent database (default).

  • kwargs (Any)

Returns:

Number of records deleted.

Return type:

int

delete_all()

Delete all entries; used for testing only.

Returns:

Number of records deleted.

Return type:

int

clean(guid, keep_entries=1)

For the given ‘guid’, keep only the latest ‘keep_entries’ number of entries and delete all other (older) ones. To delete older entries from all guids, use clean_all() instead.

Parameters:
  • guid (str) – The guid.

  • keep_entries (int) – Number of entries to keep after deletion.

Returns:

Number of records deleted.

Return type:

int
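
Keeping only the latest entries for one guid can be expressed as a single SQL statement. The following is a sketch against a simplified table, assuming SQLite's implicit rowid; it is not the statement the library uses.

```python
import sqlite3


def clean(conn: sqlite3.Connection, guid: str, keep_entries: int = 1) -> int:
    """Delete all but the newest keep_entries rows for guid; return rows deleted."""
    cursor = conn.execute(
        "DELETE FROM webchanges WHERE guid = ? AND rowid NOT IN ("
        " SELECT rowid FROM webchanges WHERE guid = ?"
        " ORDER BY timestamp DESC LIMIT ?)",
        (guid, guid, keep_entries),
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webchanges (guid TEXT, timestamp REAL)")
conn.executemany(
    "INSERT INTO webchanges VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 1.0)],
)
deleted = clean(conn, "a", keep_entries=1)
remaining = conn.execute(
    "SELECT guid, timestamp FROM webchanges ORDER BY guid, timestamp"
).fetchall()
```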

move(guid, new_guid)

Replace the guid in records matching ‘guid’ with the ‘new_guid’ value.

If there are existing records with ‘new_guid’, they will not be overwritten and the job histories will be merged.

Returns:

Number of records searched for replacement.

Parameters:
  • guid (str)

  • new_guid (str)

Return type:

int
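
The merge behaviour described above follows naturally from a plain UPDATE: existing rows under the new guid are left alone and the renamed rows join them. A minimal sketch on a simplified table, which is an assumption rather than the library's code; note that it returns the number of rows updated, which may not be the same figure as the "records searched" count documented above.

```python
import sqlite3


def move(conn: sqlite3.Connection, guid: str, new_guid: str) -> int:
    """Relabel every row of guid as new_guid; return the number of rows updated."""
    cursor = conn.execute(
        "UPDATE webchanges SET guid = ? WHERE guid = ?", (new_guid, guid)
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webchanges (guid TEXT, timestamp REAL)")
conn.executemany(
    "INSERT INTO webchanges VALUES (?, ?)",
    [("old", 1.0), ("old", 2.0), ("new", 3.0)],
)
moved = move(conn, "old", "new")
# The history of 'new' now contains its own row plus the two moved ones.
merged = conn.execute(
    "SELECT timestamp FROM webchanges WHERE guid = 'new' ORDER BY timestamp"
).fetchall()
```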

clean_all(keep_entries=1)

Delete all older entries for each ‘guid’ (keep only keep_entries).

Returns:

Number of records deleted.

Parameters:

keep_entries (int)

Return type:

int

keep_latest(keep_entries=1)

Delete all older entries, keeping only the latest ‘keep_entries’ per guid.

Parameters:

keep_entries (int) – Number of entries to keep after deletion.

Returns:

Number of records deleted.

Return type:

int

rollback(timestamp, count=False)

Rollback database to the entries present at timestamp.

Parameters:
  • timestamp (float) – The timestamp.

  • count (bool) – If set to true, only count the number that would be deleted without doing so.

Returns:

Number of records deleted (or to be deleted).

Return type:

int
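
One way to read the rollback described above is "delete every entry newer than timestamp", with count=True acting as a dry run. The sketch below implements that reading on a simplified table; the interpretation and the SQL are assumptions, not taken from the library.

```python
import sqlite3


def rollback(conn: sqlite3.Connection, timestamp: float, count: bool = False) -> int:
    """Delete rows newer than timestamp, or only count them if count is true."""
    if count:
        # Dry run: report how many rows would be deleted.
        return conn.execute(
            "SELECT COUNT(*) FROM webchanges WHERE timestamp > ?", (timestamp,)
        ).fetchone()[0]
    cursor = conn.execute("DELETE FROM webchanges WHERE timestamp > ?", (timestamp,))
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webchanges (guid TEXT, timestamp REAL)")
conn.executemany(
    "INSERT INTO webchanges VALUES (?, ?)", [("a", 1.0), ("a", 2.0), ("b", 3.0)]
)

would_delete = rollback(conn, 1.5, count=True)
rows_after_dry_run = conn.execute("SELECT COUNT(*) FROM webchanges").fetchone()[0]
deleted = rollback(conn, 1.5)
remaining = conn.execute("SELECT guid, timestamp FROM webchanges").fetchall()
```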

backup()

Return the most recent entry for each ‘guid’.

Returns:

A generator of tuples, each consisting of (guid, data, timestamp, tries, etag, mime_type)

Return type:

Iterator[tuple[str, str | bytes, float, int, str, str, ErrorData]]

clean_ssdb(known_guids, keep_entries=1)

Convenience function to clean the cache.

If self.clean_all is present, runs clean_all(). Otherwise, runs clean() on all known_guids, one at a time. Prints the number of snapshots deleted.

Parameters:
  • known_guids (Iterable[str]) – An iterable of guids

  • keep_entries (int) – Number of entries to keep after deletion.

Return type:

None

gc(known_guids, keep_entries=1)

Garbage collect the database: delete all guids not included in known_guids and keep only the latest keep_entries snapshots for the others.

Parameters:
  • known_guids (Iterable[str]) – The guids to keep.

  • keep_entries (int) – Number of entries to keep after deletion for the guids to keep.

Return type:

None

migrate_from_minidb(minidb_filename)

Migrate the data of a legacy minidb database to the current database.

Parameters:

minidb_filename (str | Path) – The filename of the legacy minidb database.

Return type:

None

restore(entries)

Save multiple entries into the database.

Parameters:

entries (Iterable[tuple[str, str | bytes, float, int, str, str, ErrorData]]) – An iterator of tuples WHERE each consists of (guid, data, timestamp, tries, etag, mime_type)

Return type:

None

flushdb()

Delete all entries of the database. Use with care, there is no undo!

Return type:

None

class webchanges.storage.SsdbStorage(filename)

Bases: BaseFileStorage

Base class for snapshots storage.

Parameters:

filename (str | Path) – The filename or directory name of the storage.

abstractmethod close()
Return type:

None

abstractmethod get_guids()
Return type:

list[str]

abstractmethod load(guid)
Parameters:

guid (str)

Return type:

Snapshot

abstractmethod get_history_data(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

dict[str | bytes, float]

abstractmethod get_history_snapshots(guid, count=None)
Parameters:
  • guid (str)

  • count (int | None)

Return type:

list[Snapshot]

abstractmethod save(*args, guid, snapshot, **kwargs)
Parameters:
  • args (Any)

  • guid (str)

  • snapshot (Snapshot)

  • kwargs (Any)

Return type:

None

abstractmethod delete(guid)
Parameters:

guid (str)

Return type:

None

abstractmethod delete_latest(guid, delete_entries=1, **kwargs)

For the given ‘guid’, delete only the latest ‘delete_entries’ entries and keep all other (older) ones.

Parameters:
  • guid (str) – The guid.

  • delete_entries (int) – The number of most recent entries to delete.

  • kwargs (Any)

Returns:

Number of records deleted.

Return type:

int

abstractmethod delete_all()

Delete all entries; used for testing only.

Returns:

Number of records deleted.

Return type:

int

abstractmethod clean(guid, keep_entries=1)

Removes all entries for guid except the latest keep_entries.

Parameters:
  • guid (str) – The guid.

  • keep_entries (int) – The number of most recent entries to keep.

Returns:

Number of records deleted.

Return type:

int

abstractmethod move(guid, new_guid)

Replace the guid in records matching ‘guid’ with the ‘new_guid’ value.

If there are existing records with ‘new_guid’, they will not be overwritten and the job histories will be merged.

Returns:

Number of records searched for replacement.

Parameters:
  • guid (str)

  • new_guid (str)

Return type:

int

abstractmethod rollback(timestamp)

Rolls back the database to timestamp.

Parameters:

timestamp (float) – The timestamp.

Returns:

Number of records deleted.

Raises:

NotImplementedError – For those classes where this method is not implemented.

Return type:

int | None

backup()

Return the most recent entry for each ‘guid’.

Returns:

A generator of tuples, each consisting of (guid, data, timestamp, tries, etag, mime_type)

Return type:

Iterator[tuple[str, str | bytes, float, int, str, str, ErrorData]]

restore(entries)

Save multiple entries into the database.

Parameters:

entries (Iterable[tuple[str, str | bytes, float, int, str, str, ErrorData]]) – An iterator of tuples WHERE each consists of (guid, data, timestamp, tries, etag, mime_type)

Return type:

None

gc(known_guids, keep_entries=1)

Garbage collect the database: delete all guids not included in known_guids and keep only the latest keep_entries snapshots for the others.

Parameters:
  • known_guids (Iterable[str]) – The guids to keep.

  • keep_entries (int) – Number of entries to keep after deletion for the guids to keep.

Return type:

None
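
The two steps of garbage collection (drop unknown guids, trim the rest) can be sketched on a plain dict. The in-memory store mapping each guid to a list of timestamps is a hypothetical stand-in for a database, used here only to show the logic.

```python
from typing import Iterable


def gc(store: dict[str, list[float]], known_guids: Iterable[str], keep_entries: int = 1) -> None:
    """Drop guids not in known_guids and keep only the newest entries of the others."""
    known = set(known_guids)
    for guid in list(store):  # copy the keys because the dict is modified below
        if guid not in known:
            del store[guid]
        else:
            # Newest first, then truncate to the requested number of entries.
            store[guid] = sorted(store[guid], reverse=True)[:keep_entries]


store = {"a": [1.0, 2.0, 3.0], "b": [1.0], "c": [5.0, 4.0]}
gc(store, known_guids=["a", "c"], keep_entries=2)
```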

clean_ssdb(known_guids, keep_entries=1)

Convenience function to clean the cache.

If self.clean_all is present, runs clean_all(). Otherwise, runs clean() on all known_guids, one at a time. Prints the number of snapshots deleted.

Parameters:
  • known_guids (Iterable[str]) – An iterable of guids

  • keep_entries (int) – Number of entries to keep after deletion.

Return type:

None

abstractmethod flushdb()

Delete all entries of the database. Use with care, there is no undo!

Return type:

None

class webchanges.storage.YamlConfigStorage(filename)

Bases: BaseYamlFileStorage

Class for the configuration file (a YAML textual file).

Parameters:

filename (str | Path) – The filename or directory name of the storage.

config: _Config = {}

static dict_deep_difference(d1, d2, ignore_underline_keys=False)

Recursively find elements in the first dict that are not in the second.

Parameters:
  • d1 (_Config) – The first dict.

  • d2 (_Config) – The second dict.

  • ignore_underline_keys (bool) – If true, keys starting with _ are ignored (treated as remarks)

Returns:

A dict with all the elements on the first dict that are not in the second.

Return type:

_Config
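
A recursive difference of this kind can be sketched as follows. This is an illustration, not the library's code: the function name deep_difference is hypothetical, and it assumes that an element counts as "in the second dict" when its key is present there, regardless of its value.

```python
import copy
from typing import Any


def deep_difference(d1: dict, d2: dict, ignore_underline_keys: bool = False) -> dict:
    """Return the elements of d1 whose keys are missing from d2, recursively."""
    result: dict[Any, Any] = {}
    for key, value in d1.items():
        if ignore_underline_keys and isinstance(key, str) and key.startswith("_"):
            continue  # treated as a remark
        if isinstance(value, dict) and isinstance(d2.get(key), dict):
            # Both sides are dicts: compare them one level down.
            nested = deep_difference(value, d2[key], ignore_underline_keys)
            if nested:
                result[key] = nested
        elif key not in d2:
            result[key] = copy.deepcopy(value)
    return result


d1 = {"report": {"html": {"diff": "table", "colour": True}}, "_remark": "x", "jobs": 1}
d2 = {"report": {"html": {"diff": "unified"}}, "jobs": 2}

with_remarks = deep_difference(d1, d2)
without_remarks = deep_difference(d1, d2, ignore_underline_keys=True)
```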

static dict_deep_merge(source, destination)

Recursively deep merges source dict into destination dict.

Parameters:
  • source (_Config) – The first dict.

  • destination (_Config) – The second dict.

Returns:

The deep merged dict.

Return type:

_Config
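
A recursive merge of this kind can be sketched as follows. The function name deep_merge is hypothetical, and the sketch assumes that values from source override those already in destination and that destination is modified in place and returned.

```python
import copy


def deep_merge(source: dict, destination: dict) -> dict:
    """Merge source into destination in place, descending into nested dicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(destination.get(key), dict):
            # Both sides hold a dict: merge one level down.
            deep_merge(value, destination[key])
        else:
            # Assumption: the source value wins over any existing one.
            destination[key] = copy.deepcopy(value)
    return destination


source = {"report": {"html": {"diff": "table"}}, "display": {"new": False}}
destination = {"report": {"html": {"diff": "unified", "colour": True}, "text": {}}}
merged = deep_merge(source, destination)
```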

static remove_deprecated_keys(config)

Remove deprecated keys from config.

Parameters:

config (_Config)

Return type:

_Config

check_for_unrecognized_keys(config)

Test whether config has keys not in DEFAULT_CONFIG (bad keys, e.g. typos); if so, raise ValueError. Also cleans up deprecated keys in config.

Parameters:

config (_Config) – The configuration.

Raises:

ValueError – If the configuration has keys not in DEFAULT_CONFIG (bad keys, e.g. typos)

Return type:

None

static replace_none_keys(config)

Fixes None keys in loaded config that should be empty dicts instead.

Parameters:

config (_Config)

Return type:

None

load(*args)

Load configuration file from self.filename into self.config, adding missing keys from DEFAULT_CONFIG.

Parameters:

args (Any) – None used.

Return type:

None

save(*args, **kwargs)

Save self.config into self.filename using YAML.

Parameters:
  • args (Any) – None used.

  • kwargs (Any) – None used.

Return type:

None

classmethod write_default_config(filename)

Write default configuration to file.

Parameters:

filename (Path) – The filename.

Return type:

None

edit()

Edit file.

Returns:

None if edit is successful, 1 otherwise.

Return type:

int

classmethod parse(filename)

Return the contents of the YAML file if it exists.

Parameters:

filename (Path) – The filename Path.

Returns:

Specified by the subclass.

Return type:

Any

remove_remark_keys(_dict)

Recursively removes all keys starting with ‘_’ from a dict.

Parameters:

_dict (T) – The dict.

Returns:

The dict with all keys starting with ‘_’ removed.

Return type:

T

class webchanges.storage.YamlJobsStorage(filename)

Bases: BaseYamlFileStorage, JobsBaseFileStorage

Class for the jobs file (a YAML textual file).

Class for jobs textual files storage.

Parameters:

filename (list[Path]) – The filenames of the jobs file.

classmethod parse(filename)

Parse the contents of a jobs YAML file and return a list of jobs.

Parameters:

filename (Path) – The filename Path.

Returns:

A list of JobBase objects.

Return type:

list[JobBase]

load(*args)

Parse the contents of the jobs YAML file(s) and return a list of jobs.

Returns:

A list of JobBase objects.

Parameters:

args (Any)

Return type:

list[JobBase]

save(jobs)

Save jobs to the job YAML file.

Parameters:

jobs (Sized[JobBase]) – An iterable of JobBase objects to be written.

Return type:

None

edit()

Edit file.

Returns:

None if edit is successful, 1 otherwise.

Return type:

int

load_secure()

Load the jobs from a text file checking that the file is secure (i.e. belongs to the current UID and only the owner can write to it - Linux only).

Returns:

List of JobBase objects.

Return type:

list[JobBase]

remove_remark_keys(_dict)

Recursively removes all keys starting with ‘_’ from a dict.

Parameters:

_dict (T) – The dict.

Returns:

The dict with all keys starting with ‘_’ removed.

Return type:

T

filename: list[Path]