# Data Set API

This API allows you to load and handle data on a distributed system efficiently. Note that you should not use most data set methods directly, but rather use the more high-level tools available, for example user-defined functions.

## Formats

### Merlin Medipix (MIB)

class libertem.io.dataset.mib.MIBDataSet(path, tileshape=None, scan_size=None, disable_glob=False)[source]

MIB data sets consist of one or more .mib files, and optionally a .hdr file. The HDR file is used to automatically set the scan_size parameter from the fields “Frames per Trigger” and “Frames in Acquisition.” When loading a MIB data set, you can either specify the path to the HDR file, or choose one of the MIB files. The MIB files are assumed to follow a naming pattern of some non-numerical prefix, and a sequential numerical suffix.

Note that, as of the current version, no gain correction or hot/cold pixel removal is performed yet: processing is done on the raw data, though you can do pre-processing in your own UDF.

Examples

>>> # both examples look for files matching /path/to/default*.mib:
>>> ds1 = ctx.load("mib", path="/path/to/default.hdr")
>>> ds2 = ctx.load("mib", path="/path/to/default1.mib", scan_size=(32, 32))

Parameters
• path (str) – Path to either the .hdr file or one of the .mib files

• scan_size (tuple of int, optional) – A tuple (y, x) that specifies the size of the scanned region. It is automatically read from the .hdr file if you specify one as path.

### Raw binary files

class libertem.io.dataset.raw.RawFileDataSet(path, scan_size, dtype, detector_size=None, enable_direct=False, detector_size_raw=None, crop_detector_to=None, tileshape=None)[source]

Read raw data from a single file of raw binary data. This reader assumes the following format:

• only raw data (no file header)

• dtype supported by numpy

Examples

>>> ds = ctx.load("raw", path=path_to_raw, scan_size=(16, 16),
...               dtype="float32", detector_size=(128, 128))

Parameters
• path (str) – Path to the file

• scan_size (tuple of int, optional) – An n-tuple that specifies the size of the scanned region ((y, x) in the common case, but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)

• dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.

• detector_size (tuple of int) – Common case: (height, width); but can be any dimensionality

• enable_direct (bool) – Enable direct I/O. This bypasses the filesystem cache and is useful for systems with very fast I/O and for data sets that are much larger than the main memory.
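To illustrate how these parameters fit together, here is a minimal sketch (with made-up sizes, not tied to any real file) of how a raw binary file can be decoded using only scan_size, detector_size and dtype — the file is just frames concatenated back to back:

```python
import numpy as np

scan_size = (4, 4)          # (y, x) of the scanned region
detector_size = (8, 8)      # (height, width) of each frame
dtype = np.dtype(">u2")     # big-endian unsigned 16 bit, as stored on disk

# Simulate the on-disk bytes of such a data set:
n_values = np.prod(scan_size) * np.prod(detector_size)
raw_bytes = np.arange(n_values, dtype=np.uint16).astype(dtype).tobytes()

# Reading back: decode with the on-disk dtype, then reshape to 4D:
data = np.frombuffer(raw_bytes, dtype=dtype).reshape(scan_size + detector_size)
print(data.shape)  # (4, 4, 8, 8)
```

The real reader handles tiling, direct I/O and cropping on top of this, but the shape and dtype logic is the same.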

### Digital Micrograph (DM3, DM4) files

class libertem.io.dataset.dm.DMDataSet(files=None, scan_size=None, same_offset=False)[source]

Reader for stacks of DM3/DM4 files. Each file should contain a single frame.

Note

This DataSet is not supported in the GUI yet, as the file dialog needs to be updated to properly handle opening series.

Note

Single-file 4D DM files are not yet supported. The use-case would be to read DM4 files from the conversion of K2 data, but those data sets are actually transposed (nav/sig are swapped).

That means the data would have to be transposed back into the usual shape, which is slow, or algorithms would have to be adapted to work directly on transposed data. As an example, applying a mask in the conventional layout corresponds to calculating a weighted sum frame along the navigation dimension in the transposed layout.

Since the transposed layout corresponds to a TEM tilt series, support for transposed 4D STEM data could have more general applications beyond supporting 4D DM4 files. Please contact us if you have a use-case for single-file 4D DM files or other applications that process stacks of TEM files, and we may add support!

Note

You can use the PyPI package natsort to sort the filenames by their numerical components; this is especially useful for filenames without leading zeros.
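A small sketch of why natural sorting matters; the natural_key helper below is hypothetical and only demonstrates the idea that the natsort package implements robustly:

```python
import re

# Plain sorted() compares character by character, so "frame10" sorts
# before "frame2" — wrong for a navigation axis:
files = ["frame10.dm4", "frame2.dm4", "frame1.dm4"]

def natural_key(name):
    # Split into text and number chunks; number chunks compare numerically.
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", name)]

print(sorted(files))                   # ['frame1.dm4', 'frame10.dm4', 'frame2.dm4']
print(sorted(files, key=natural_key))  # ['frame1.dm4', 'frame2.dm4', 'frame10.dm4']
```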

Parameters
• files (List[str]) – List of paths to the files that should be loaded. The order is important, as it determines the order in the navigation axis.

• scan_size (Tuple[int] or None) – By default, the files are loaded as a 3D stack. You can change this by specifying the scan size, which reshapes the navigation dimensions. Raises a DataSetException if the shape is incompatible with the data that is loaded.

• same_offset (bool) – When reading a stack of dm3/dm4 files, it can be expensive to read in all the metadata from all files, which we currently only use for getting the offsets and sizes of the main data in each file. If you absolutely know that the offsets and sizes are the same for all files, you can set this parameter and we will skip reading all metadata but the one from the first file.

### EMPAD

class libertem.io.dataset.empad.EMPADDataSet(path, scan_size=None)[source]

Read data from an EMPAD detector. EMPAD data sets consist of two files, one .raw and one .xml file. Note that the .xml file contains the file name of the .raw file, so if the raw file was renamed at some point, opening via the .xml file will fail.

Parameters
• path (str) – Path to either the .xml or the .raw file. If the .xml file is given, the scan_size parameter can be left out

• scan_size (tuple of int) – A tuple (y, x) that specifies the size of the scanned region. It is automatically read from the .xml file if you specify one as path.

### K2IS

class libertem.io.dataset.k2is.K2ISDataSet(path)[source]

Read raw K2IS data sets. They consist of 8 .bin files and one .gtg file.

Parameters

path (str) – Path to one of the files of the data set (either one of the .bin files or the .gtg file)

### FRMS6

class libertem.io.dataset.frms6.FRMS6DataSet(path, enable_offset_correction=True, gain_map_path=None, dest_dtype=None)[source]

Read PNDetector FRMS6 files. FRMS6 data sets consist of multiple .frms6 files and a .hdr file. The first .frms6 file (matching *_000.frms6) contains dark frames, which are subtracted if enable_offset_correction is true.

Parameters
• path (string) – Path to one of the files of the FRMS6 dataset (either .hdr or .frms6)

• enable_offset_correction (boolean) – Subtract dark frames when reading data

• gain_map_path (string) – Path to a gain map to apply (.mat format)

### BLO

class libertem.io.dataset.blo.BloDataSet(path, tileshape=None, endianess='<')[source]

Examples

>>> ds = ctx.load("blo", path="/path/to/file.blo")

Parameters
• path (str) – Path to the file

• endianess (str) – either ‘<’ or ‘>’ for little or big endian

### SER

class libertem.io.dataset.ser.SERDataSet(path, emipath=None)[source]

Parameters

path (str) – Path to the .ser file

### HDF5

class libertem.io.dataset.hdf5.H5DataSet(path, ds_path, tileshape=None, target_size=536870912, min_num_partitions=None, sig_dims=2)[source]

Read data from an HDF5 data set.

Examples

>>> ds = ctx.load("hdf5", path=path_to_hdf5, ds_path="/data")

Parameters
• path (str) – Path to the file

• ds_path (str) – Path to the HDF5 data set inside the file

• sig_dims (int) – Number of dimensions that should be considered part of the signal (for example 2 when dealing with 2D image data)

• target_size (int) – Target partition size, in bytes. Usually doesn’t need to be changed.

• min_num_partitions (int) – Minimum number of partitions, set to number of cores if not specified. Usually doesn’t need to be specified.

### Norpix SEQ

class libertem.io.dataset.seq.SEQDataSet(path: str, scan_size: Tuple[int])[source]

Read data from Norpix SEQ files.

Parameters
• path – Path to the .seq file

• scan_size – A tuple that specifies the size of the scanned region/line/…

### MRC

class libertem.io.dataset.mrc.MRCDataSet(path, sig_shape=None)[source]

Parameters

path (str) – Path to the .mrc file

### Memory data set

class libertem.io.dataset.memory.MemoryDataSet(tileshape=None, num_partitions=None, data=None, sig_dims=2, check_cast=True, tiledelay=None, datashape=None, base_shape=None, force_need_decode=False)[source]

This dataset is constructed from a NumPy array in memory for testing purposes. It is not recommended for production use since it performs poorly with a distributed executor.

Examples

>>> from libertem.io.dataset.memory import MemoryDataSet
>>>
>>> data = np.zeros((2, 2, 128, 128))
>>> ds = MemoryDataSet(data=data)

__init__(tileshape=None, num_partitions=None, data=None, sig_dims=2, check_cast=True, tiledelay=None, datashape=None, base_shape=None, force_need_decode=False)[source]

Initialize self. See help(type(self)) for accurate signature.

## Internal DataSet API

class libertem.io.dataset.base.BasePartition(meta: libertem.io.dataset.base.meta.DataSetMeta, partition_slice: libertem.common.slice.Slice, fileset: libertem.io.dataset.base.fileset.FileSet, start_frame: int, num_frames: int)[source]

Base class with default implementations

__init__(meta: libertem.io.dataset.base.meta.DataSetMeta, partition_slice: libertem.common.slice.Slice, fileset: libertem.io.dataset.base.fileset.FileSet, start_frame: int, num_frames: int)[source]
Parameters
• meta – The DataSet’s DataSetMeta instance

• partition_slice – The partition slice in non-flattened form

• fileset – The files that are part of this partition (the FileSet may also contain files from the dataset which are not part of this partition, but that may harm performance)

• start_frame – The index of the first frame of this partition (global coords)

• num_frames – How many frames this partition should contain

adjust_tileshape(tileshape)[source]

Final veto of the Partition in the tileshape negotiation process; make sure that corrections are taken into account!

get_base_shape()[source]
get_locations()[source]
get_macrotile(dest_dtype='float32', roi=None)[source]

Return a single tile for the entire partition.

This is useful to support process_partition() in UDFs and to construct dask arrays from data sets.

get_tiles(tiling_scheme, dest_dtype='float32', roi=None)[source]

Return a generator over all DataTiles contained in this Partition.

Note

The DataSet may reuse the internal buffer of a tile, so you should directly process the tile and not accumulate a number of tiles and then work on them.

Parameters
• tiling_scheme – According to this scheme the data will be tiled

• dest_dtype (numpy dtype) – convert data to this dtype when reading

• roi (numpy.ndarray) – Boolean array that matches the dataset navigation shape, limiting the region to work on. With a ROI, we yield tiles from a “compressed” navigation axis, relative to the beginning of the dataset. Compressed means that only frames which have a 1 in the ROI are considered, and the resulting tile slices are from a coordinate system that has the shape (np.count_nonzero(roi),).
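The “compressed” navigation axis can be sketched with plain NumPy (illustrative shapes, not tied to a real data set):

```python
import numpy as np

roi = np.zeros((4, 4), dtype=bool)   # navigation shape (4, 4)
roi[1, 2] = roi[3, 0] = roi[3, 3] = True

frames = np.arange(16).reshape(4, 4)  # one value per frame, for illustration
selected = frames[roi]                # frames with a 1 in the ROI, in order

print(np.count_nonzero(roi))  # 3 -> compressed navigation shape is (3,)
print(selected)               # [ 6 12 15]
```

Tile slices yielded under this ROI would use coordinates within the shape (3,) axis, not the original (4, 4) navigation shape.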

need_decode(read_dtype, roi)[source]
set_corrections(corrections: libertem.corrections.corrset.CorrectionSet)[source]
class libertem.io.dataset.base.DataSet[source]
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

check_valid()[source]

Check the validity of the DataSet. This will be executed (after initialize) on a worker node. Should raise DataSetException in case of errors, and return True otherwise.

classmethod detect_params(path, executor)[source]

Guess if path can be opened using this DataSet implementation and detect parameters.

Returns a dict of detected parameters if path matches this dataset type; returns False if path is most likely not of a matching type.

property diagnostics

Diagnostics common for all DataSet implementations

property dtype

the destination data type

get_cache_key()[source]
get_correction_data()[source]

Correction parameters that are part of this DataSet. This should only be called after the DataSet is initialized.

Returns

correction parameters that are part of this DataSet

Return type

CorrectionSet

get_diagnostics()[source]

Get relevant diagnostics for this dataset, as a list of dicts with keys name, value, where value may be string or a list of dicts itself. Subclasses should override this method.

classmethod get_msg_converter() → Type[libertem.web.messageconverter.MessageConverter][source]
get_partitions()[source]

Return a generator over all Partitions in this DataSet. Should only be called on the master node.

classmethod get_supported_extensions() → Set[str][source]

Return supported extensions as a set of strings.

Plain extensions only, no pattern!

initialize(executor)[source]

This is run on the master node, but can execute parts on workers, for example if they need to access the data stored on worker nodes, using the passed executor instance.

If you need the executor around for later operations, for example when creating the partitioning, save a reference here!

Should return the possibly modified DataSet instance (if a method running on a worker changes self, these changes won't automatically be transferred back to the master node).

partition_shape(dtype, target_size, min_num_partitions=None)[source]

Calculate partition shape for the given target_size

Parameters
• dtype (numpy.dtype or str) – data type of the dataset

• target_size (int) – target size in bytes - how large should each partition be?

• min_num_partitions (int) – minimum number of partitions desired. Defaults to the number of workers in the cluster.

Returns

the shape calculated from the given parameters

Return type

Tuple[int]
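A rough sketch of the arithmetic involved (simplified; the actual implementation may differ and the sizes are made up):

```python
import numpy as np

# Pick a number of frames per partition so that one partition is
# roughly target_size bytes:
dtype = np.dtype("float32")
sig_shape = (128, 128)
target_size = 512 * 1024 * 1024  # 512 MiB

bytes_per_frame = np.prod(sig_shape) * dtype.itemsize
frames_per_partition = max(1, target_size // bytes_per_frame)
print(frames_per_partition)  # 8192
```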

property raw_dtype

the underlying data type

set_num_cores(cores)[source]
property shape

The shape of the DataSet, as it makes sense for the application domain (for example, 4D for pixelated STEM)

exception libertem.io.dataset.base.DataSetException[source]
class libertem.io.dataset.base.DataSetMeta(shape: libertem.common.shape.Shape, raw_dtype=None, dtype=None, metadata=None)[source]
__init__(shape: libertem.common.shape.Shape, raw_dtype=None, dtype=None, metadata=None)[source]
Attributes
• shape – “native” dataset shape, can have any dimensionality

• raw_dtype (np.dtype) – dtype used internally in the data set for reading

• dtype (np.dtype) – Best-fitting output dtype. This can be different from raw_dtype, for example if there are post-processing steps done as part of reading, which need a different dtype. Assumed equal to raw_dtype if not given

• metadata – Any metadata offered by the DataSet, not specified yet

class libertem.io.dataset.base.DataTile(input_array, tile_slice, scheme_idx)[source]
property flat_data

Flatten the data.

The result is a 2D array where each row contains pixel data from a single frame. It is just a reshape, so it is a view into the original data.
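A minimal NumPy sketch of what flat_data does (illustrative shapes):

```python
import numpy as np

# A tile of shape (depth, sig_y, sig_x) flattened to (depth, sig_y * sig_x):
tile = np.zeros((2, 3, 4))
flat = tile.reshape(tile.shape[0], -1)
print(flat.shape)  # (2, 12)

# It is a view, not a copy — writing through it modifies the original data:
flat[0, 0] = 1.0
print(tile[0, 0, 0])  # 1.0
```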

reshape(shape, order='C')[source]

Returns an array containing the same data with a new shape.

Refer to numpy.reshape for full documentation.

See also: numpy.reshape() (equivalent function)

Notes

Unlike the free function numpy.reshape, this method on ndarray allows the elements of the shape parameter to be passed in as separate arguments. For example, a.reshape(10, 11) is equivalent to a.reshape((10, 11)).

class libertem.io.dataset.base.Decoder[source]
do_clear()[source]
get_decode(native_dtype, read_dtype)[source]
get_native_dtype(inp_native_dtype, read_dtype)[source]
class libertem.io.dataset.base.DtypeConversionDecoder[source]
get_decode(native_dtype, read_dtype)[source]
get_native_dtype(inp_native_dtype, read_dtype)[source]
class libertem.io.dataset.base.File(path, start_idx, end_idx, native_dtype, sig_shape, frame_footer=0, frame_header=0, file_header=0)[source]
__init__(path, start_idx, end_idx, native_dtype, sig_shape, frame_footer=0, frame_header=0, file_header=0)[source]
Parameters
• file_header (int) – Number of bytes to ignore at the beginning of the file

• frame_header (int) – Number of bytes to ignore before each frame

• frame_footer (int) – Number of bytes to ignore after each frame

close()[source]
property end_idx
property file_header_bytes
fileno()[source]
property native_dtype
property num_frames
open()[source]
property sig_shape
property start_idx
class libertem.io.dataset.base.FileSet(files: List[libertem.io.dataset.base.file.File], frame_header_bytes: int = 0, frame_footer_bytes: int = 0)[source]
__init__(files: List[libertem.io.dataset.base.file.File], frame_header_bytes: int = 0, frame_footer_bytes: int = 0)[source]
Parameters

files – files that are part of a partition or dataset

files_from(start)[source]
get_as_arr()[source]
get_for_range(start, stop)[source]

Return a new FileSet filtered for files having frames in the [start, stop) range.

get_read_ranges(start_at_frame: int, stop_before_frame: int, dtype, tiling_scheme: libertem.io.dataset.base.tiling.TilingScheme, roi: Optional[numpy.ndarray] = None)[source]
class libertem.io.dataset.base.FileTree(low: int, high: int, value: Any, idx: int, left: Union[None, FileTree], right: Union[None, FileTree])[source]
__init__(low: int, high: int, value: Any, idx: int, left: Union[None, FileTree], right: Union[None, FileTree])[source]

Construct a FileTree node

Parameters
• low – First frame contained in this file

• high – First index of the next file

• value – The corresponding file object

• idx – The index of the file object in the fileset

• left – Nodes with a lower low

• right – Nodes with a higher low

classmethod make(files)[source]

Build a balanced binary tree by bisecting the files list.

search_start(value)[source]

Search for a node that has start_idx <= value and end_idx > value.
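The lookup that search_start performs can be sketched with bisect over hypothetical per-file frame ranges (the real implementation uses the binary tree described above):

```python
import bisect

# Each file covers a frame range [start_idx, end_idx); find the file
# containing a given global frame index (made-up ranges):
starts = [0, 100, 250]  # start_idx of each file
ends = [100, 250, 400]  # end_idx of each file

def file_for_frame(frame):
    i = bisect.bisect_right(starts, frame) - 1
    assert starts[i] <= frame < ends[i]
    return i

print(file_for_frame(99))   # 0
print(file_for_frame(250))  # 2
```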

to_string(depth=0)[source]
class libertem.io.dataset.base.LocalFSMMapBackend(decoder=None, corrections: libertem.corrections.corrset.CorrectionSet = None)[source]
__init__(decoder=None, corrections: libertem.corrections.corrset.CorrectionSet = None)[source]

Initialize self. See help(type(self)) for accurate signature.

get_read_and_decode(decode)[source]
get_tiles(tiling_scheme, fileset, read_ranges, roi, native_dtype, read_dtype)[source]
need_copy(roi, native_dtype, read_dtype, tiling_scheme=None, fileset=None)[source]
preprocess(data, tile_slice)[source]
class libertem.io.dataset.base.LocalFile(path, start_idx, end_idx, native_dtype, sig_shape, frame_footer=0, frame_header=0, file_header=0)[source]
close()[source]
fileno()[source]
mmap()[source]
open()[source]
raw_mmap()[source]
class libertem.io.dataset.base.Negotiator[source]

Tile shape negotiator. The main functionality is in get_scheme, which, given a udf, partition and read_dtype will generate a TilingScheme that is compatible with both the UDF and the DataSet, possibly even optimal.

get_scheme(udfs, partition, read_dtype: numpy.dtype, roi: numpy.ndarray, corrections: libertem.corrections.corrset.CorrectionSet = None)[source]

Generate a TilingScheme instance that is compatible with both the given udf and the DataSet.

Parameters
• udfs (List[UDF]) – The concrete UDFs to optimize the tiling scheme for, depending on their method (tile, frame, partition) and preferred total input size and depth.

• partition (Partition) – The TilingScheme is created specifically for the given Partition, so it can adjust even in the face of different partition sizes/shapes.

• read_dtype – The dtype in which the data will be fed into the UDF

• roi (np.ndarray) – Region of interest

• corrections (CorrectionSet) – Correction set to consider in negotiation

validate(shape, partition, size, itemsize, base_shape)[source]
class libertem.io.dataset.base.Partition(meta: libertem.io.dataset.base.meta.DataSetMeta, partition_slice: libertem.common.slice.Slice)[source]
__init__(meta: libertem.io.dataset.base.meta.DataSetMeta, partition_slice: libertem.common.slice.Slice)[source]
Parameters
• meta – The DataSet’s DataSetMeta instance

• partition_slice – The partition slice in non-flattened form

adjust_tileshape(tileshape)[source]

Final veto of the Partition in the tileshape negotiation process; make sure that corrections are taken into account!

property dtype
get_base_shape()[source]
get_locations()[source]
get_macrotile(dest_dtype='float32', roi=None)[source]
get_tiles(tiling_scheme, dest_dtype='float32', roi=None)[source]
classmethod make_slices(shape, num_partitions)[source]

Partition a 3D dataset (“list of frames”) along the first axis, yielding the partition slice, and additionally start and stop frame indices for each partition.
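A simplified sketch of such a partitioning (pure Python, illustrative numbers):

```python
# Split num_frames frames into num_partitions contiguous [start, stop) ranges
# along the first axis, keeping the partitions roughly equal in size:
num_frames = 10
num_partitions = 3

boundaries = [num_frames * i // num_partitions for i in range(num_partitions + 1)]
slices = list(zip(boundaries[:-1], boundaries[1:]))
print(slices)  # [(0, 3), (3, 6), (6, 10)]
```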

need_decode(read_dtype, roi)[source]
set_corrections(corrections: libertem.corrections.corrset.CorrectionSet)[source]
property shape

the shape of the partition; dimensionality depends on format

validate_tiling_scheme(tiling_scheme)[source]
class libertem.io.dataset.base.PartitionStructure(shape, slices, dtype)[source]
SCHEMA = {'$id': 'http://libertem.org/PartitionStructure.schema.json', '$schema': 'http://json-schema.org/draft-07/schema#', 'properties': {'dtype': {'type': 'string'}, 'shape': {'items': {'minimum': 1, 'type': 'number'}, 'minItems': 2, 'type': 'array'}, 'sig_dims': {'type': 'number'}, 'slices': {'items': {'items': {'maxItems': 2, 'minItems': 2, 'type': 'number'}, 'type': 'array'}, 'minItems': 1, 'type': 'array'}, 'version': {'const': 1}}, 'required': ['version', 'slices', 'shape', 'sig_dims', 'dtype'], 'title': 'PartitionStructure', 'type': 'object'}
__init__(shape, slices, dtype)[source]

Structure of the dataset.

Assumed to be contiguous on the flattened navigation axis.

Parameters
• slices (List[Tuple[Int]]) – List of tuples [start_idx, end_idx) that partition the data set by the flattened navigation axis

• shape (Shape) – shape of the whole dataset

• dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.

classmethod from_ds(ds)[source]
classmethod from_json(data)[source]
serialize()[source]
class libertem.io.dataset.base.TilingScheme(slices, tileshape: libertem.common.shape.Shape, dataset_shape: libertem.common.shape.Shape, debug=None)[source]
__init__(slices, tileshape: libertem.common.shape.Shape, dataset_shape: libertem.common.shape.Shape, debug=None)[source]

Initialize self. See help(type(self)) for accurate signature.

property dataset_shape
property depth
classmethod make_for_shape(tileshape: libertem.common.shape.Shape, dataset_shape: libertem.common.shape.Shape, debug=None)[source]

Make a TilingScheme from tileshape and dataset_shape.

Note that there are border effects both in the signal and in the navigation direction, i.e. if the depth doesn’t evenly divide the number of frames in the partition (simplified; a ROI also applies…), or if the signal dimensions of tileshape don’t evenly divide the signal dimensions of the dataset_shape.

Parameters
• tileshape – Uniform shape of all tiles. Should have flat navigation axis (meaning tileshape.nav.dims == 1) and be contiguous in signal dimensions.

• dataset_shape – Shape of the whole data set. Only the signal part is used.
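The border effect in navigation direction can be illustrated with made-up numbers:

```python
# With depth 7 over a 16-frame partition, the last tile along the
# navigation axis is smaller than the others:
num_frames, depth = 16, 7
tiles = [min(depth, num_frames - start) for start in range(0, num_frames, depth)]
print(tiles)  # [7, 7, 2]
```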

property shape

The tileshape. Note that some border tiles can be smaller!

property slices

signal-only slices for all possible positions

property slices_array

Returns the slices from the scheme as a numpy ndarray a of shape (n, 2, sig_dims), where a[i, 0] are the origins and a[i, 1] are the shapes for slice i.

class libertem.io.dataset.base.WritableDataSet[source]
class libertem.io.dataset.base.WritablePartition[source]
delete()[source]
get_write_handle()[source]
libertem.io.dataset.base.decode_swap_2(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
libertem.io.dataset.base.decode_swap_4(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
libertem.io.dataset.base.default_get_read_ranges(start_at_frame, stop_before_frame, roi, depth, slices_arr, fileset_arr, sig_shape, bpp, extra=None, frame_header_bytes=0, frame_footer_bytes=0)
libertem.io.dataset.base.make_get_read_ranges(px_to_bytes=CPUDispatcher(<function _default_px_to_bytes>), read_ranges_tile_block=CPUDispatcher(<function _default_read_ranges_tile_block>)) → Tuple[numpy.ndarray, numpy.ndarray][source]

Translate the TilingScheme combined with the roi into (pixel)-read-ranges, together with their tile slices.

Parameters
• start_at_frame – Dataset-global first frame index to read

• stop_before_frame – Stop before this frame index

• tiling_scheme – Description on how the data should be tiled

• fileset_arr – Array of shape (number_of_files, 3) where the last dimension contains the following values: (start_idx, end_idx, file_idx), where [start_idx, end_idx) defines which frame indices are contained in the file.

• roi – Region of interest (for the full dataset)

Returns

read_ranges is an ndarray with shape (number_of_tiles, depth, 3) where the last dimension contains: file index, start_byte, stop_byte

Return type