Cluster data set
ClusterDataSet is a distributed data set, storing individual partitions
using raw binary files.
- class libertem.io.dataset.cluster.ClusterDSPartition(path, workers, *args, **kwargs)
Delete this partition. This needs to run on all nodes that have this partition.
Get a handle to write to this partition. Current rules:
You can only write a complete partition at once, which is then immutable afterwards. If you want to change something, you need to write the whole partition again.
Once the with-block is exited successfully, the data is written to disk. If there is an error while writing the data, the partition will not be moved into its final place and it will be missing from the data set. There cannot be a “partially written” partition.
>>> with dest_part.get_write_handle() as wh:
...     for tile in wh.write_tiles(source_part.get_tiles()):
...         pass  # do something with `tile`
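The all-or-nothing rule described above is commonly achieved by writing to a temporary file and renaming it into place only on success. The following is a minimal sketch of that general pattern using only the Python standard library. It is not taken from LiberTEM, and `atomic_write_handle` is a hypothetical name introduced for illustration.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write_handle(final_path):
    """Yield a binary file object; publish it at final_path only on success.

    Hypothetical helper for illustration, not part of LiberTEM.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file next to the target so that the final
    # rename stays on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            yield f
            f.flush()
            os.fsync(f.fileno())
        # Only reached if the body of the with-block raised no exception.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Writing failed: discard the temporary file so that no partial
        # partition is ever visible under the final name.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the body of the `with` block raises, the final file name never appears, which mirrors the "missing rather than partially written" behaviour described for partitions.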
- class libertem.io.dataset.cluster.ClusterDataSet(path, structure=None, io_backend=None)
ClusterDataSet: a distributed RAW data set
to be used for the cache, for live acquisition, and for simulation integration
each node has a directory for a ClusterDataSet
the directory contains partitions, each its own raw file
information about the structure is saved as a json sidecar file
A ClusterDataSet can be incomplete, that is, whole partitions can be missing (but each partition is guaranteed to be complete once it has its final filename)
use cases for incomplete datasets:
  - each node only caches the partitions it is responsible for
  - support for partial acquisitions
  - missing partitions can be written later
file names and structure/partitioning are deterministic
assumption: all workers on a single host share the dataset
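The layout described above (one directory per node, one raw file per partition, a JSON sidecar, deterministic file names) can be sketched as follows. This is an illustration only: the file names `structure.json` and `partition-NN.raw`, and all function names, are assumptions made for this example and are not taken from LiberTEM.

```python
import json
import os


def partition_filename(index, num_partitions):
    """Deterministic file name for a partition (hypothetical naming scheme)."""
    # Zero-pad so that names sort in partition order.
    width = len(str(num_partitions - 1))
    return f"partition-{index:0{width}d}.raw"


def write_sidecar(directory, structure):
    """Save the structure description as a JSON sidecar file."""
    path = os.path.join(directory, "structure.json")  # assumed file name
    with open(path, "w") as f:
        json.dump(structure, f, sort_keys=True)
    return path


def missing_partitions(directory, num_partitions):
    """List the indices of partitions that have no file in this directory."""
    present = set(os.listdir(directory))
    return [
        i for i in range(num_partitions)
        if partition_filename(i, num_partitions) not in present
    ]
```

Because the names depend only on the partition index and the partition count, any node can work out which partitions it is missing without consulting other nodes.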
Check validity of the DataSet. This will be executed (after initialize) on a worker node. Should raise DataSetException in case of errors, return True otherwise.
- classmethod detect_params(path, executor)
Guess if path can be opened using this DataSet implementation and detect parameters.
Returns a dict of detected parameters if path matches this dataset type, or False if path is most likely not of a matching type.
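The "dict or False" contract can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: it ignores the `executor` argument, assumes the sidecar is called `structure.json`, and the keys of the returned dict are made up for the example. It does not reproduce the actual detection logic.

```python
import json
import os


def detect_params_sketch(path):
    """Return a dict of parameters if path looks like a match, else False.

    Hypothetical stand-in for illustration; not the library's implementation.
    """
    sidecar = os.path.join(path, "structure.json")  # assumed file name
    if not os.path.isdir(path) or not os.path.isfile(sidecar):
        return False
    try:
        with open(sidecar) as f:
            structure = json.load(f)
    except (OSError, ValueError):
        # Unreadable or malformed sidecar: most likely not a matching type.
        return False
    return {"path": path, "structure": structure}
```

A caller can then test the result with `if params is False:` to distinguish "no match" from a match with an empty parameter dict.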
- property dtype
The “native” data type (either one matching the data on disk, or one that is closest)
Get relevant diagnostics for this dataset, as a list of dicts with keys name, value, where value may be string or a list of dicts itself. Subclasses should override this method.
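To make the nested diagnostics format concrete, the sketch below builds a list of dicts with the keys name and value, where one value is itself a list of such dicts, and flattens it into indented text. The entries and the helper `format_diagnostics` are invented for this example and do not reflect what a real data set reports.

```python
def format_diagnostics(diagnostics, indent=0):
    """Render a (possibly nested) diagnostics list as indented text lines."""
    lines = []
    for entry in diagnostics:
        name, value = entry["name"], entry["value"]
        if isinstance(value, list):
            # A nested list of dicts: print the name, then recurse.
            lines.append(" " * indent + f"{name}:")
            lines.extend(format_diagnostics(value, indent + 2))
        else:
            lines.append(" " * indent + f"{name}: {value}")
    return lines


# Made-up example data in the documented shape.
example = [
    {"name": "Partition count", "value": "16"},
    {"name": "Nodes", "value": [
        {"name": "host-a", "value": "8 partitions"},
        {"name": "host-b", "value": "8 partitions"},
    ]},
]

for line in format_diagnostics(example):
    print(line)
```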
- classmethod get_msg_converter()
Return a generator over all Partitions in this DataSet. Should only be called on the master node.
Initialize is running on the master node, but we have access to the executor.
- property shape
The shape of the DataSet, as it makes sense for the application domain (for example, 4D for pixelated STEM)