Distributed caching layer

CachedDataSet can be used to temporarily store your data on a local fast storage device (i.e. SSD), if your source data is stored on a slower device (i.e. NFS or a slower spinning disk).

class libertem.io.dataset.cached.Cache(stats: CacheStats, strategy: CacheStrategy)[source]

Cache object, to be used on a worker node. The interface used by Partitions to manage the cache. May directly remove files, directories, etc.

collect_orphans(base_path: str)[source]

Check the filesystem structure and record all partitions that are missing in the db as orphans, to be deleted on demand.

evict(cache_key: str, size: int)[source]

Make place for size bytes which will be used by the dataset identified by the cache_key.

record_hit(cache_item: CacheItem)[source]
record_miss(cache_item: CacheItem)[source]
class libertem.io.dataset.cached.CacheItem(dataset: str, partition: int, size: int, path: str)[source]

A CacheItem describes a single unit of data that is cached, in this case a partition of the CachedDataSet.

classmethod from_row(row)[source]
class libertem.io.dataset.cached.CacheStats(db_path)[source]

Return dataset cache stats as dict mapping partition ids to dicts of their properties (keys: size, last_access, hits)

maybe_orphan(orphan: OrphanItem)[source]

Create an entry for a file we don’t have any statistics about, after checking the stats table for the given path. Getting a conflict here means concurrently running maybe_orphan processes, so we can safely ignore it.

query(sql, args=None)[source]

Custom sqlite query, returns a sqlite3 Cursor object

record_eviction(cache_item: Union[CacheItem, OrphanItem])[source]
record_hit(cache_item: CacheItem)[source]
record_miss(cache_item: CacheItem)[source]
remove_orphan(path: str)[source]
class libertem.io.dataset.cached.CacheStrategy[source]
get_victim_list(cache_key: str, size: int, stats: CacheStats)[source]

Return a list of CacheItem`s that should be deleted to make a new item with size in bytes `size.

class libertem.io.dataset.cached.CachedDataSet(source_ds, cache_path, strategy, io_backend=None)[source]

Cached DataSet.

Assumes the source DataSet is significantly slower than the cache location (otherwise, it may result in memory pressure, as we don’t use direct I/O to write to the cache.)

  • source_ds (DataSet) – DataSet on slower file system

  • cache_path (str) – Where should the cache be written to? Should be a directory on a fast local device (i.e. NVMe SSD if possible, local hard drive if it is faster than network) A subdirectory will be created for each dataset.

  • strategy (CacheStrategy) – A class implementing a cache eviction strategy, for example LRUCacheStrategy


check validity of the DataSet. this will be executed (after initialize) on a worker node. should raise DataSetException in case of errors, return True otherwise.

property dtype

The “native” data type (either one matching the data on disk, or one that is closest)

classmethod get_msg_converter()[source]

Return a generator over all Partitions in this DataSet. Should only be called on the master node.


Perform possibly expensive initialization, like pre-loading metadata.

This is run on the master node, but can execute parts on workers, for example if they need to access the data stored on worker nodes, using the passed executor instance.

If you need the executor around for later operations, for example when creating the partitioning, save a reference here!

Should return the possibly modified DataSet instance (if a method running on a worker is changing self, these changes won’t automatically be transferred back to the master node)

property shape

The shape of the DataSet, as it makes sense for the application domain (for example, 4D for pixelated STEM)

class libertem.io.dataset.cached.CachedPartition(source_part, cluster_part, meta, partition_slice, cache_key, cache_strategy, db_path, idx, io_backend, decoder)[source]

returns locations where this partition is cached

get_tiles(tiling_scheme, dest_dtype='float32', roi=None)[source]
class libertem.io.dataset.cached.LRUCacheStrategy(capacity: int)[source]
get_available(stats: CacheStats)[source]

available cache capacity in bytes

get_used(stats: CacheStats)[source]

used cache capacity in bytes

get_victim_list(cache_key: str, size: int, stats: CacheStats)[source]

Return a list of CacheItem`s that should be deleted to make place for `partition.

sufficient_space_for(size: int, stats: CacheStats)[source]
class libertem.io.dataset.cached.OrphanItem(path, size)[source]

An orphan, a file in the cache structure, which we don’t know much about (only path and size)

classmethod from_row(row)[source]
class libertem.io.dataset.cached.VerboseRow[source]

sqlite3.Row with a __repr__