Cluster data set

ClusterDataSet is a distributed data set, storing individual partitions using raw binary files.

class, workers, *args, **kwargs)[source]

Delete this partition. This needs to run on all nodes that have this partition.


Get a handle to write to this partition. Current rules:

  1. You can only write a complete partition at once, which is then immutable afterwards. If you want to change something, you need to write the whole partition again.

  2. Once the with-block is exited successfully, the data is written to disk. If there is an error while writing the data, the partition will not be moved into its final place and it will be missing from the data set. There cannot be a “partially written” partition.


>>> with dest_part.get_write_handle() as wh:  
...     for tile in wh.write_tiles(source_part.get_tiles()):
...         pass  # do something with `tile`
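The second rule (no partially written partitions) is commonly achieved by writing to a temporary file and renaming it into place only on success. The following is a minimal sketch of that general technique, not the library's actual implementation; `atomic_write` and the file names are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(final_path):
    """Yield a binary file object; publish it at final_path only on success."""
    dirname = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the target directory so that the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            yield f
            f.flush()
            os.fsync(f.fileno())
        # Only reached if the with-block of the caller exited without error.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Writing failed: discard the temporary file, so the final name
        # never refers to incomplete data.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the body of the `with atomic_write(...)` block raises, the final filename never appears, which matches the "missing rather than partial" behaviour described above.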
class, structure=None, io_backend=None)[source]

ClusterDataSet: a distributed RAW data set

  • to be used for the cache, for live acquisition, and for simulation integration

  • each node has a directory for a ClusterDataSet

  • the directory contains partitions, each its own raw file

  • information about the structure is saved as a json sidecar file

  • A ClusterDataSet can be incomplete, that is, whole partitions can be missing (but each partition is guaranteed to be complete once it has its final filename)

  • use cases for incomplete datasets:
    • each node only caches the partitions it is responsible for

    • support for partial acquisitions

  • missing partitions can later be written

  • file names and structure/partitioning are deterministic

  • assumption: all workers on a single host share the dataset

Parameters

  • path (str) – Absolute filesystem base path, pointing to an existing directory. Assumes a uniform setup (same absolute path used on all nodes)

  • structure (PartitionStructure) – Partitioning structure instance. Must be specified when creating a new dataset.
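The per-node directory layout described in the list above (one raw file per partition, a JSON sidecar, deterministic file names, possibly missing partitions) can be sketched as follows. The sidecar name, the naming scheme, and the JSON keys are assumptions made for illustration only; the actual on-disk format may differ.

```python
import json
import os

SIDECAR_NAME = "structure.json"  # hypothetical sidecar file name


def partition_filename(index):
    # Hypothetical naming scheme: deterministic, derived only from the index.
    return f"partition-{index:08d}.raw"


def write_sidecar(base_path, shape, dtype, num_partitions):
    """Save the structure information next to the partition files."""
    info = {
        "shape": list(shape),
        "dtype": dtype,
        "num_partitions": num_partitions,
    }
    with open(os.path.join(base_path, SIDECAR_NAME), "w") as f:
        json.dump(info, f)


def missing_partitions(base_path):
    """Return the indices of partitions that have no file on this node."""
    with open(os.path.join(base_path, SIDECAR_NAME)) as f:
        info = json.load(f)
    return [
        idx
        for idx in range(info["num_partitions"])
        if not os.path.exists(os.path.join(base_path, partition_filename(idx)))
    ]
```

Because the file names depend only on the partition index, every node can work out which partitions it still lacks without consulting the others.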


Check the validity of the DataSet. This will be executed (after initialize) on a worker node. It should raise DataSetException in case of errors and return True otherwise.

classmethod detect_params(path, executor)[source]

Guess if path can be opened using this DataSet implementation and detect parameters.

Returns a dict of detected parameters if path matches this dataset type, or False if path is most likely not of a matching type.
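The "dict or False" return convention can be illustrated with a small standalone function. This is only a sketch: it omits the `executor` argument, and the sidecar file name and the returned keys are assumptions, not the library's actual detection logic.

```python
import json
import os


def detect_params_sketch(path):
    """Return a dict of parameters if `path` looks like a match, else False."""
    sidecar = os.path.join(path, "structure.json")  # hypothetical file name
    if not os.path.isdir(path) or not os.path.exists(sidecar):
        return False
    try:
        with open(sidecar) as f:
            json.load(f)  # the sidecar must at least be valid JSON
    except (OSError, ValueError):
        return False
    # Hypothetical result: the parameters needed to open the dataset.
    return {"path": path}
```

A caller can then branch on the result with `if params is False:` and otherwise pass the dict on when opening the dataset.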

property dtype

The “native” data type (either one matching the data on disk, or one that is closest)


Get relevant diagnostics for this dataset, as a list of dicts with the keys name and value, where value may be a string or itself a list of dicts. Subclasses should override this method.
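The nested name/value structure might look like the following sketch. The entries and their values are placeholders invented for illustration, and `format_diagnostics` is a hypothetical helper showing how such a structure can be walked recursively.

```python
def get_diagnostics_sketch():
    # Placeholder entries; a real dataset would report its own values.
    return [
        {"name": "Partition count", "value": "16"},
        {"name": "Structure", "value": [
            {"name": "dtype", "value": "float32"},
            {"name": "sig dims", "value": "2"},
        ]},
    ]


def format_diagnostics(diags, indent=0):
    """Flatten the nested diagnostics into indented text lines."""
    lines = []
    for item in diags:
        value = item["value"]
        if isinstance(value, list):
            # Nested list of dicts: print the name, then recurse.
            lines.append(" " * indent + item["name"] + ":")
            lines.extend(format_diagnostics(value, indent + 2))
        else:
            lines.append(" " * indent + f"{item['name']}: {value}")
    return lines
```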

classmethod get_msg_converter()[source]

Return a generator over all Partitions in this DataSet. Should only be called on the master node.


Initialization runs on the master node, but has access to the executor.

property shape

The shape of the DataSet, as it makes sense for the application domain (for example, 4D for pixelated STEM)
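As an illustration of the 4D case, a pixelated STEM data set has two scan (navigation) dimensions followed by two detector (signal) dimensions. The sketch below uses plain tuples and made-up sizes; it does not use the library's own shape type.

```python
from math import prod

# Illustrative values only: (scan_y, scan_x, detector_y, detector_x)
shape = (64, 64, 128, 128)
sig_dims = 2  # number of trailing detector dimensions

nav_shape = shape[:-sig_dims]   # scan positions
sig_shape = shape[-sig_dims:]   # size of a single detector frame

num_frames = prod(nav_shape)    # one frame per scan position
frame_size = prod(sig_shape)    # pixels per frame
```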