Data Set API

This API allows loading and handling data efficiently on a distributed system. Note that you should not use most dataset methods directly, but rather the higher-level tools available, for example user-defined functions.

See our documentation on loading data for a high-level introduction.

Formats

Merlin Medipix (MIB)

class libertem.io.dataset.mib.MIBDataSet(path, tileshape=None, scan_size=None, disable_glob=False, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

MIB data sets consist of one or more .mib files, and optionally a .hdr file. The HDR file is used to automatically set the nav_shape parameter from the fields “Frames per Trigger” and “Frames in Acquisition”. When loading a MIB data set, you can either specify the path to the HDR file, or choose one of the MIB files. The MIB files are assumed to follow a naming pattern of a non-numerical prefix followed by a sequential numerical suffix.

Note that if you are using a per-pixel or per-scan trigger setup, LiberTEM won’t be able to deduce the x scanning dimension - in that case, you will need to specify the nav_shape yourself.

Currently, we support all integer formats and most RAW formats. In particular, the following configurations are not yet supported for RAW files:

  • Non-2x2 layouts with more than one chip

  • 24bit with more than one chip

New in version 0.9.0: Support for the raw quad format was added

Examples

>>> # both examples look for files matching /path/to/default*.mib:
>>> ds1 = ctx.load("mib", path="/path/to/default.hdr")  
>>> ds2 = ctx.load("mib", path="/path/to/default64.mib")  
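
If the scan dimensions cannot be read from the HDR file (for example with a per-pixel or per-scan trigger setup), nav_shape can be passed explicitly; this is a sketch with a hypothetical path:

>>> ds3 = ctx.load("mib", path="/path/to/default.hdr", nav_shape=(256, 256))  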
Parameters:
  • path (str) – Path to either the .hdr file or one of the .mib files

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Common case: (height, width); but can be any dimensionality

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • disable_glob (bool, default False) – Usually, MIB data sets are stored as a series of .mib files, and we can reliably guess the whole set from a single path. If you instead save your data set into a single .mib file, and have multiple of these in a single directory with the same prefix (for example, a.mib, a1.mib and a2.mib), loading a.mib would include a1.mib and a2.mib in the data set. Setting disable_glob to True will only load the single .mib file specified as path.

Raw binary files

class libertem.io.dataset.raw.RawFileDataSet(path, dtype, scan_size=None, detector_size=None, enable_direct=False, detector_size_raw=None, crop_detector_to=None, tileshape=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read data from a single file of raw binary data. This reader assumes the following format:

  • only raw data (no file header)

  • frames are stored in C-order without additional frame headers

  • dtype supported by numpy

Examples

>>> ds = ctx.load("raw", path=path_to_raw, nav_shape=(16, 16), sig_shape=(128, 128),
...               sync_offset=0, dtype="float32",)
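
Since the reader expects plain C-ordered frames without headers, a compatible file can be written from a NumPy array with tofile(); a minimal sketch reusing the hypothetical path_to_raw from above:

>>> data = np.zeros((16, 16, 128, 128), dtype="float32")  
>>> data.tofile(path_to_raw)  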
Parameters:
  • path (str) – Path to the file

  • nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int) – Common case: (height, width); but can be any dimensionality

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.

Raw binary files in sparse CSR format

class libertem.io.dataset.raw_csr.RawCSRDataSet(path: str, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int = 0, io_backend: IOBackend | None = None)[source]

Read sparse data in compressed sparse row (CSR) format from a triple of files that contain the index pointers, the coordinates and the values. See Wikipedia article on the CSR format for more information on the format.

The necessary parameters are specified in a TOML file like this:

[params]

filetype = "raw_csr"
nav_shape = [512, 512]
sig_shape = [516, 516]

[raw_csr]

indptr_file = "rowind.dat"
indptr_dtype = "<i4"

indices_file = "coords.dat"
indices_dtype = "<i4"

data_file = "values.dat"
data_dtype = "<i4"

Both the navigation and signal axes are flattened in the file, so that existing CSR libraries like scipy.sparse can be used directly by memory-mapping or reading the file contents.
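
For illustration, a triple of compatible files can be written from a scipy.sparse.csr_matrix whose rows are flattened frames; the shapes, file names and dtypes below are hypothetical and would have to match the TOML file:

>>> import numpy as np  
>>> import scipy.sparse as sp  
>>> stack = np.random.poisson(0.01, size=(4, 4, 16, 16)).astype(np.int32)  
>>> csr = sp.csr_matrix(stack.reshape(4 * 4, 16 * 16))  # one row per frame  
>>> csr.indptr.astype("<i4").tofile("rowind.dat")  
>>> csr.indices.astype("<i4").tofile("coords.dat")  
>>> csr.data.astype("<i4").tofile("values.dat")  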

Parameters:
  • path (str) – Path to the TOML file with file names and other parameters for the sparse dataset.

  • nav_shape (Tuple[int, int], optional) – A nav_shape to apply to the dataset overriding the shape value read from the TOML file, by default None. This can be used to read a subset of the data, or reshape the contained data.

  • sig_shape (Tuple[int, int], optional) – A sig_shape to apply to the dataset overriding the shape value read from the TOML file, by default None.

  • sync_offset (int, optional, by default 0) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • io_backend (IOBackend, optional) – The I/O backend to use, see I/O Backends, by default None.

Examples

>>> ds = ctx.load("raw_csr", path='./path_to.toml')  

NumPy files (NPY)

class libertem.io.dataset.npy.NPYDataSet(path: str, sig_dims: int | None = 2, nav_shape: tuple[int, int] | None = None, sig_shape: tuple[int, int] | None = None, sync_offset: int = 0, io_backend: IOBackend | None = None)[source]

New in version 0.10.0.

Read data stored in a NumPy .npy binary file. Dataset shape and dtype are inferred from the file header unless overridden by the arguments to this class.

At this time, Fortran-ordered .npy files are not supported.

Parameters:
  • path (str) – The path to the .npy file

  • sig_dims (int, optional, by default 2) – The number of dimensions from the end of the full shape to interpret as signal dimensions. If None, this will be inferred from the sig_shape argument when present.

  • nav_shape (Tuple[int, int], optional) – A nav_shape to apply to the dataset overriding the shape value read from the .npy header, by default None. This can be used to read a subset of the .npy file, or reshape the contained data. Frames are read in C-order from the beginning of the file.

  • sig_shape (Tuple[int, int], optional) – A sig_shape to apply to the dataset overriding the shape value read from the .npy header, by default None. Pixels are read in C-order from the beginning of the file.

  • sync_offset (int, optional, by default 0) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • io_backend (IOBackend, optional) – The I/O backend to use, see I/O Backends, by default None.

Raises:
  • DataSetException – If sig_dims is not an integer and cannot be inferred from sig_shape

  • DataSetException – If the supplied nav_shape + sig_shape describe an array larger than the contents of the .npy file

  • DataSetException – If the .npy file is Fortran-ordered

Examples

>>> ds = ctx.load("npy", path='./path_to_file.npy')  
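
The shape read from the .npy header can be overridden, for example to reshape the navigation grid or to read only a subset of the frames; a sketch with hypothetical shapes:

>>> ds = ctx.load("npy", path='./path_to_file.npy', nav_shape=(8, 8), sig_shape=(128, 128))  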

Digital Micrograph (DM3, DM4) files

There are currently two Digital Micrograph dataset implementations, SingleDMDataSet for a single-file, C-ordered .dm4 nD-dataset, and StackedDMDataSet for a stack of individual .dm3 or .dm4 image files which together comprise an nD-dataset.

Both forms can be created using the following call to the Context:

ctx.load('dm', ...)

and where possible the choice of reader (single-file or stacked) will be inferred from the parameters.
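
As a sketch (paths are hypothetical), a single-file dataset is typically opened via its path, while a stack is opened by passing a list of files:

>>> ds_single = ctx.load('dm', path='./dataset.dm4')  
>>> ds_stack = ctx.load('dm', files=['./frame_000.dm3', './frame_001.dm3'])  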

class libertem.io.dataset.dm_single.SingleDMDataSet(*args, **kwargs)[source]

Reader for a single DM3/DM4 file. Handles 4D-STEM, 3D-Spectrum Images, and TEM image stacks stored in a single-file format. Where possible the structure will be inferred from the file metadata.

New in version 0.11.0.

Note

Single-file DM data can be stored on disk using either normal C-ordering, which is an option in recent versions of GMS, or an alternative F/C-hybrid ordering depending on the imaging mode and dimensionality. The reading of F/C-hybrid files is currently not supported for performance reasons.

The DataSet will try to infer the ordering from the file metadata and read accordingly. If the file uses the older hybrid F/C-ordering (flat_sig, flat_nav), the dataset will raise an exception unless the force_c_order argument is set to True.

A converter for F/C-hybrid files is provided as convert_dm4_transposed().

Note

In the Web-GUI, a 2D-image or 3D-stack/spectrum image will have extra singleton navigation dimensions prepended so that it can be displayed. DM files containing multiple datasets are supported via the dataset_index argument.

While capable of reading 2D/3D files, LiberTEM is not particularly well-adapted to processing these data and the user should consider other tools. Individual spectra or vectors (1D data) are not supported.

Parameters:
  • path (PathLike) – The path to the .dm3/.dm4 file

  • nav_shape (Tuple[int, ...], optional) – Override the nav_shape provided by the file metadata. This can be used to adjust the total number of frames.

  • sig_shape (Tuple[int, ...], optional) – Override the sig_shape provided by the file metadata. Data are read sequentially in all cases; therefore this is typically only interesting if the total number of sig pixels remains constant.

  • sync_offset (int, optional, by default 0) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • io_backend (IOBackend, optional) – A specific IOBackend implementation to override the platform default.

  • force_c_order (bool, optional, by default False) – Force the data to be interpreted as a C-ordered array regardless of the tag information. This will lead to incorrect results on a hybrid C/F-ordered file.

  • dataset_index (int, optional) – In the case of a multi-dataset DM file this can be used to open a specific dataset index. Note that the datasets in a DM-file often begin with a thumbnail, which occupies dataset index 0. If not provided, the first compatible dataset found in the file is used.

class libertem.io.dataset.dm.StackedDMDataSet(*args, **kwargs)[source]

Reader for stacks of DM3/DM4 files.

Note

This DataSet is not supported in the GUI yet, as the file dialog needs to be updated to properly handle opening series.

Note

Single-file 3/4D DM datasets are supported through the SingleDMDataSet class.

Note

You can use the PyPI package natsort to sort the filenames by their numerical components; this is especially useful for filenames without leading zeros.
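
A minimal sketch of this, assuming the stack lives in a single directory (path and pattern are hypothetical):

>>> import glob  
>>> from natsort import natsorted  
>>> files = natsorted(glob.glob('./stack/frame_*.dm3'))  
>>> ds = ctx.load('dm', files=files)  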

Parameters:
  • files (List[str]) – List of paths to the files that should be loaded. The order is important, as it determines the order in the navigation axis.

  • nav_shape (Tuple[int, ...] or None) – By default, the files are loaded as a 3D stack. You can change this by specifying the nav_shape, which reshapes the navigation dimensions. Raises a DataSetException if the shape is incompatible with the data that is loaded.

  • sig_shape (Tuple[int, ...], optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • same_offset (bool) – When reading a stack of dm3/dm4 files, it can be expensive to read in all the metadata from all files, which we currently only use for getting the offsets and sizes of the main data in each file. If you absolutely know that the offsets and sizes are the same for all files, you can set this parameter and we will skip reading the metadata from all files except the first.

DM4 datasets stored in a transposed format (sig, nav) can be converted to C-ordered data compatible with LiberTEM using the contrib function convert_dm4_transposed().

EMPAD

class libertem.io.dataset.empad.EMPADDataSet(path, scan_size=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read data from an EMPAD detector. EMPAD data sets consist of two files, one .raw and one .xml file. Note that the .xml file contains the file name of the .raw file, so if the raw file was renamed at some point, opening using the .xml file will fail.

Parameters:
  • path (str) – Path to either the .xml or the .raw file. If the .xml file is given, the nav_shape parameter can be left out

  • nav_shape (tuple of int, optional) – A tuple (y, x) or (num_images,) that specifies the size of the scanned region or number of frames in the series. It is automatically read from the .xml file if you specify one as path.

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

Examples

>>> ds = ctx.load("empad", path='./path_to_file.xml', ...)  

K2IS

class libertem.io.dataset.k2is.K2ISDataSet(path, nav_shape=None, sig_shape=None, sync_offset=None, io_backend=None)[source]

Read raw K2IS data sets. They consist of 8 .bin files and one .gtg file. Currently, data acquired using the STEMx unit is supported; metadata about the nav_shape is read from the .gtg file.

Parameters:
  • path (str) – Path to one of the files of the data set (either one of the .bin files or the .gtg file)

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

Examples

>>> ds = ctx.load("k2is", path='./path_to_file.bin', ...)  

FRMS6

class libertem.io.dataset.frms6.FRMS6DataSet(path, enable_offset_correction=True, gain_map_path=None, dest_dtype=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read PNDetector FRMS6 files. FRMS6 data sets consist of multiple .frms6 files and a .hdr file. The first .frms6 file (matching *_000.frms6) contains dark frames, which are subtracted if enable_offset_correction is true.

Parameters:
  • path (string) – Path to one of the files of the FRMS6 dataset (either .hdr or .frms6)

  • enable_offset_correction (boolean) – Subtract dark frames when reading data

  • gain_map_path (string) – Path to a gain map to apply (.mat format)

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

Examples

>>> ds = ctx.load("frms6", path='./path_to_file.hdr', ...)  

BLO

class libertem.io.dataset.blo.BloDataSet(path, tileshape=None, endianess='<', nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read Nanomegas .blo files

Examples

>>> ds = ctx.load("blo", path="/path/to/file.blo")  
Parameters:
  • path (str) – Path to the file

  • endianess (str) – either ‘<’ or ‘>’ for little or big endian

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

SER

class libertem.io.dataset.ser.SERDataSet(path, emipath=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read TIA SER files.

Examples

>>> ds = ctx.load("ser", path="/path/to/file.ser")  
Parameters:
  • path (str) – Path to the .ser file

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

HDF5

class libertem.io.dataset.hdf5.H5DataSet(path, ds_path=None, tileshape=None, nav_shape=None, sig_shape=None, target_size=None, min_num_partitions=None, sig_dims=2, io_backend=None, sync_offset: int = 0)[source]

Read data from an HDF5 data set.

Examples

>>> ds = ctx.load("hdf5", path=path_to_hdf5, ds_path="/data")
Parameters:
  • path (str) – Path to the file

  • ds_path (str) – Path to the HDF5 data set inside the file

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the shape of the navigation / scan grid. By default this is inferred from the HDF5 dataset.

  • sig_shape (tuple of int, optional) – An n-tuple that specifies the shape of the signal / frame grid. This parameter is currently unsupported and will raise an error if it is provided and does not match the underlying data sig shape. By default the sig_shape is inferred from the HDF5 dataset via the sig_dims parameter.

  • sig_dims (int) – Number of dimensions that should be considered part of the signal (for example 2 when dealing with 2D image data)

  • sync_offset (int, optional, by default 0) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

  • target_size (int) – Target partition size, in bytes. Usually doesn’t need to be changed.

  • min_num_partitions (int) – Minimum number of partitions, set to number of cores if not specified. Usually doesn’t need to be specified.

Note

If the HDF5 file to be loaded contains compressed data using a custom compression filter (other than GZIP, LZF or SZIP), the associated HDF5 filter library must be imported on the workers before accessing the file. See the h5py documentation on filter pipelines for more information.

The library hdf5plugin is preloaded automatically if it is installed. Other filter libraries may have to be specified for preloading by the user.

Preloads for a local DaskJobExecutor can be specified through the preload argument of either make_local() or libertem.executor.dask.cluster_spec(). For the libertem.executor.inline.InlineJobExecutor, the plugins can simply be imported in the main script.

For the web GUI or for running LiberTEM in a cluster with existing workers (e.g. by running libertem-worker or dask-worker on nodes), necessary imports can be specified as --preload arguments to the launch command, for example libertem-server --preload hdf5plugin or libertem-worker --preload hdf5plugin tcp://scheduler_ip:port, respectively. --preload can be specified multiple times.
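
For the InlineJobExecutor case, a minimal sketch is to import the filter library in the main script before loading; the filter package name and paths below are hypothetical:

>>> import my_hdf5_filter  # hypothetical custom HDF5 filter library  
>>> ds = ctx.load("hdf5", path="./compressed.h5", ds_path="/data")  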

Norpix SEQ

class libertem.io.dataset.seq.SEQDataSet(path: str, scan_size: tuple[int, ...] | None = None, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int = 0, io_backend=None)[source]

Read data from Norpix SEQ files.

Examples

>>> ds = ctx.load("seq", path="/path/to/file.seq", nav_shape=(1024, 1024))  
Parameters:
  • path – Path to the .seq file

  • nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

Note

Dark and gain references are loaded from MRC files with the same root as the SEQ file and the extensions .dark.mrc and .gain.mrc, i.e. /path/to/file.dark.mrc and /path/to/file.gain.mrc, if they are present.

New in version 0.8.0.

Dead pixels are read from an XML file with the same root as the SEQ file and the extension .Config.Metadata.xml, i.e. /path/to/file.Config.Metadata.xml in the above example if both this file and /path/to/file.metadata are present.

See Corrections for more information on how to change or disable corrections.

FIXME find public documentation of the XML format and dark/gain maps.

MRC

class libertem.io.dataset.mrc.MRCDataSet(path, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]

Read MRC files.

Examples

>>> ds = ctx.load("mrc", path="/path/to/file.mrc")  
Parameters:
  • path (str) – Path to the .mrc file

  • nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int, optional) – Signal/detector size (height, width)

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

TVIPS

class libertem.io.dataset.tvips.TVIPSDataSet(path, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int | None = None, io_backend: IOBackend | None = None)[source]

Read data from one or more .tvips files. You can specify the path to any file that is part of a set - the whole data set will be loaded. We will try to guess nav_shape and sync_offset from the image headers for 4D STEM data, but you may need to specify these parameters in case the guessing logic fails.

New in version 0.9.0.

Examples

>>> ds = ctx.load(
...     "tvips",
...     path="./path/to/file_000.tvips",
...     nav_shape=(16, 16)
... )  
Parameters:
  • path (str) – Path to the file

  • nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region (commonly (y, x), but it can also be of length 1, for example for a line scan, or of length 3 for a data cube)

  • sig_shape (tuple of int) – Common case: (height, width); but can be any dimensionality

  • sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start. If not given, we try to automatically determine the sync_offset from the scan metadata in the image headers.

Memory data set

class libertem.io.dataset.memory.MemoryDataSet(tileshape=None, num_partitions=None, data=None, sig_dims=None, check_cast=True, tiledelay=None, datashape=None, base_shape=None, force_need_decode=False, io_backend=None, nav_shape=None, sig_shape=None, sync_offset=0, array_backends=None)[source]

This dataset is constructed from a NumPy array in memory for testing purposes. It is not recommended for production use since it performs poorly with a distributed executor.

Examples

>>> data = np.zeros((2, 2, 64, 64), dtype=np.float32)
>>> ds = ctx.load('memory', data=data, sig_dims=2)

Dask

class libertem.io.dataset.dask.DaskDataSet(dask_array, *, sig_dims, preserve_dimensions=True, min_size=None, io_backend=None)[source]

New in version 0.9.0.

Wraps a Dask.array.array such that it can be processed by LiberTEM. Partitions are created to be aligned with the array chunking. When the array chunking is not compatible with LiberTEM the wrapper merges chunks until compatibility is achieved.

The best-case scenario is for the original array to be chunked in the leftmost navigation dimension. If instead another navigation dimension is chunked, the user can set preserve_dimensions=False to re-order the navigation shape to achieve better chunking for LiberTEM. If more than one navigation dimension is chunked, the class will do its best to merge chunks without creating partitions which are too large.

LiberTEM requires that a partition contains only whole signal frames, so any signal dimension chunking is immediately merged by this class.

This wrapper is most useful when the Dask array was created using lazy I/O via dask.delayed, or via dask.array operations. The major assumption is that the chunks in the array can each be individually evaluated without having to read or compute more data than the chunk itself contains. If this is not the case then this class could perform very poorly due to read amplification, or even crash the Dask workers.

As the class performs rechunking using a merge-only strategy it will never split chunks which were present in the original array. If the array is originally very lightly chunked, then the corresponding LiberTEM partitions will be very large. In addition, overly-chunked arrays (for example one chunk per frame) can incur excessive Dask task graph overheads and should be avoided where possible.
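
For illustration, a chunking that works well is one chunk per file along the leftmost navigation axis; the following sketch uses a hypothetical read_file helper and file list built with dask.delayed:

>>> import numpy as np  
>>> import dask  
>>> import dask.array as da  
>>> @dask.delayed  
... def read_file(path):  
...     # each chunk can be evaluated from a single file, avoiding read amplification  
...     return np.fromfile(path, dtype=np.float32).reshape((16, 128, 128))  
>>> files = ['scan_000.raw', 'scan_001.raw', 'scan_002.raw', 'scan_003.raw']  
>>> stack = da.concatenate([  
...     da.from_delayed(read_file(f), shape=(16, 128, 128), dtype=np.float32)  
...     for f in files  
... ], axis=0)  # chunked along the leftmost (navigation) axis, one chunk per file  
>>> ds = ctx.load('dask', dask_array=stack, sig_dims=2)  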

Parameters:
  • dask_array (dask.array.array) – A Dask array

  • sig_dims (int) – Number of dimensions in dask_array.shape counting from the right to treat as signal dimensions

  • preserve_dimensions (bool, optional) – If False, allow optimization of the dask_array chunking by re-ordering the nav_shape to put the most chunked dimensions first. This can help when more than one nav dimension is chunked.

  • min_size (float, optional) – The minimum partition size in bytes if the array chunking allows an order-preserving merge strategy. The default min_size is 128 MiB.

  • io_backend (bool, optional) – For compatibility, accept an unused io_backend argument.

Example

>>> import dask.array as da
>>>
>>> d_arr = da.ones((4, 4, 64, 64), chunks=(2, -1, -1, -1))
>>> ds = ctx.load('dask', dask_array=d_arr, sig_dims=2)

This creates a dataset with partitions split along the zeroth dimension, following the chunking of the array.

Converters

libertem.contrib.convert_transposed.convert_dm4_transposed(dm4_path: PathLike, out_path: PathLike, ctx: Context | None = None, num_cpus: int | None = None, dataset_index: int | None = None, progress: bool = False)[source]

Convenience function to convert a transposed Gatan Digital Micrograph (.dm4) STEM dataset into a numpy (.npy) file with standard ordering for processing with LiberTEM.

Transposed .dm4 files are stored in (sig, nav) order, i.e. all frame values for a given signal pixel are stored as blocks, which means that extracting a single frame requires traversal of the whole file. LiberTEM requires (nav, sig) order for processing using the UDF interface, i.e. each frame is stored sequentially.

New in version 0.13.0.

Parameters:
  • dm4_path (PathLike) – The path to the .dm4 file

  • out_path (PathLike) – The path to the output .npy file

  • ctx (libertem.api.Context, optional) – The Context to use to perform the conversion, by default None, in which case a Dask-based Context will be created, optionally limited according to the num_cpus argument.

  • num_cpus (int, optional) – When ctx is not supplied, this argument limits the number of CPUs to perform the conversion. This can be important as conversion is a RAM-intensive operation and limiting the number of CPUs can help reduce bottlenecking.

  • dataset_index (int, optional) – If the .dm4 file contains multiple datasets, this can be used to select the dataset to convert (see SingleDMDataSet for more information).

  • progress (bool, optional) – Whether to display a progress bar during conversion, by default False
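
Examples

A sketch with hypothetical input and output paths:

>>> from libertem.contrib.convert_transposed import convert_dm4_transposed  
>>> convert_dm4_transposed('./transposed.dm4', './converted.npy', progress=True)  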

Internal DataSet API

class libertem.io.dataset.base.BasePartition(meta: DataSetMeta, partition_slice: Slice, fileset: FileSet, start_frame: int, num_frames: int, io_backend: IOBackend, decoder: Decoder | None = None)[source]

Base class with default implementations

Parameters:
  • meta – The DataSet’s DataSetMeta instance

  • partition_slice – The partition slice in non-flattened form

  • fileset – The files that are part of this partition (the FileSet may also contain files from the dataset which are not part of this partition, but that may harm performance)

  • start_frame – The index of the first frame of this partition (global coords)

  • num_frames – How many frames this partition should contain

  • io_backend – The I/O backend to use for accessing this partition

get_io_backend()[source]
get_locations()[source]
get_max_io_size()[source]
get_tiles(tiling_scheme: TilingScheme, dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]

Return a generator over all DataTiles contained in this Partition.

Note

The DataSet may reuse the internal buffer of a tile, so you should directly process the tile and not accumulate a number of tiles and then work on them.

Parameters:
  • tiling_scheme – According to this scheme the data will be tiled

  • dest_dtype (numpy dtype) – convert data to this dtype when reading

  • roi (numpy.ndarray) – Boolean array that matches the dataset navigation shape to limit the region to work on. With a ROI, we yield tiles from a “compressed” navigation axis, relative to the beginning of the dataset. Compressed means that only frames that have a 1 in the ROI are considered, and the resulting tile slices are from a coordinate system that has the shape (np.count_nonzero(roi),).

  • array_backend (ArrayBackend) –

    Specify array backend to use. By default the first entry in the list of supported backends is used.

    New in version 0.11.0.

set_corrections(corrections: CorrectionSet | None)[source]
set_worker_context(worker_context: WorkerContext)[source]
class libertem.io.dataset.base.BufferedBackend(max_buffer_size=16777216)[source]

I/O backend using a buffered reading strategy. Useful for slower media like HDDs, where seeks cause performance drops. Used by default on Windows.

This does not perform optimally on SSDs under all circumstances; for better best-case performance, try using MMapBackend instead.

Parameters:

max_buffer_size (int) – Maximum buffer size, in bytes. This is passed to the tileshape negotiation to select the right depth.

classmethod from_json(msg)[source]

Construct an instance from the already-decoded msg.

get_impl()[source]
id_: str | None = 'buffered'
class libertem.io.dataset.base.DataSet(io_backend: IOBackend | None = None)[source]
MAX_PARTITION_SIZE = 536870912
adjust_tileshape(tileshape: tuple[int, ...], roi: ndarray | None) tuple[int, ...][source]

Final veto of the DataSet in the tileshape negotiation process; make sure that corrections are taken into account!

property array_backends: Sequence[Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix']]

The array backends the dataset can return data as.

Defaults to only NumPy arrays

New in version 0.11.0.

check_valid() bool[source]

Check the validity of the DataSet. This will be executed (after initialize) on a worker node. It should raise DataSetException in case of errors and return True otherwise.

classmethod detect_params(path: str, executor: JobExecutor)[source]

Guess if path can be opened using this DataSet implementation and detect parameters.

Returns a dict of detected parameters if path matches this dataset type, or False if path is most likely not of a matching type.

property diagnostics

Diagnostics common for all DataSet implementations

property dtype: nt.DTypeLike

The “native” data type (either one matching the data on disk, or one that is closest)

get_base_shape(roi: ndarray | None) tuple[int, ...][source]
get_cache_key() str[source]
get_correction_data() CorrectionSet[source]

Correction parameters that are part of this DataSet. This should only be called after the DataSet is initialized.

Returns:

correction parameters that are part of this DataSet

Return type:

CorrectionSet

get_decoder() Decoder | None[source]
classmethod get_default_io_backend() IOBackend[source]
get_diagnostics()[source]

Get relevant diagnostics for this dataset, as a list of dicts with keys name, value, where value may be string or a list of dicts itself. Subclasses should override this method.

get_io_backend() IOBackend[source]
get_max_io_size() int | None[source]

Override this method to implement a custom maximum I/O size (in bytes)

get_min_sig_size() int[source]

minimum signal size, in number of elements

classmethod get_msg_converter() type[MessageConverter][source]
get_num_partitions() int[source]

Returns the number of partitions the dataset should be split into.

The default implementation sizes partitions such that they fit into 512MB of float data in memory, regardless of their native dtype. At least self._cores partitions are created.

get_partitions() Generator[Partition, None, None][source]

Return a generator over all Partitions in this DataSet. Should only be called on the master node.

get_slices()[source]

Return the partition slices for the dataset

classmethod get_supported_extensions() set[str][source]

Return supported extensions as a set of strings.

Plain extensions only, no pattern!

classmethod get_supported_io_backends() list[str][source]

Get the supported I/O backends as list of their IDs. Some DataSet implementations with a custom backend may return an empty list here.

get_sync_offset_info()[source]

Check the specified sync_offset and return the number of frames skipped and inserted

get_task_comm_handler() TaskCommHandler[source]
initialize(executor) DataSet[source]

Perform possibly expensive initialization, like pre-loading metadata.

This is run on the master node, but can execute parts on workers, for example if they need to access the data stored on worker nodes, using the passed executor instance.

If you need the executor around for later operations, for example when creating the partitioning, save a reference here!

Should return the possibly modified DataSet instance (if a method running on a worker is changing self, these changes won’t automatically be transferred back to the master node)

property meta: DataSetMeta | None
need_decode(read_dtype: nt.DTypeLike, roi: ndarray | None, corrections: CorrectionSet | None) bool[source]
partition_shape(dtype: nt.DTypeLike, target_size: int, min_num_partitions: int | None = None, containing_shape: Shape | None = None) tuple[int, ...][source]

Calculate partition shape for the given target_size

Parameters:
  • dtype (numpy.dtype or str) – data type of the dataset

  • target_size (int) – target size in bytes - how large should each partition be?

  • min_num_partitions (int) – minimum number of partitions desired. Defaults to the number of workers in the cluster.

Returns:

the shape calculated from the given parameters

Return type:

Tuple[int, …]

set_num_cores(cores: int) None[source]
property shape: Shape

The shape of the DataSet, as it makes sense for the application domain (for example, 4D for pixelated STEM)

supports_correction()[source]
exception libertem.io.dataset.base.DataSetException[source]
class libertem.io.dataset.base.DataSetMeta(shape: Shape, array_backends: Sequence[Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix']] | None = None, image_count: int = 0, raw_dtype: nt.DTypeLike | None = None, dtype: nt.DTypeLike | None = None, metadata: Any | None = None, sync_offset: int = 0)[source]
shape

“native” dataset shape, can have any dimensionality

array_backends: Optional[Sequence[ArrayBackend]]

raw_dtype: np.dtype

dtype used internally in the data set for reading

dtype: np.dtype

Best-fitting output dtype. This can be different from raw_dtype, for example if there are post-processing steps done as part of reading, which need a different dtype. Assumed equal to raw_dtype if not given

sync_offset: int, optional

If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.

image_count

Total number of frames in the dataset

metadata

Any metadata offered by the DataSet, not specified yet

class libertem.io.dataset.base.DataTile(data, tile_slice, scheme_idx)[source]
property c_contiguous
property data
property dtype
property flat_data: ndarray

Flatten the data.

The result is a 2D array where each row contains pixel data from a single frame. It is just a reshape, so it is a view into the original data.

property shape
property size
class libertem.io.dataset.base.Decoder[source]
do_clear()[source]
get_decode(native_dtype, read_dtype)[source]
get_native_dtype(inp_native_dtype, read_dtype)[source]
class libertem.io.dataset.base.DirectBackend(max_buffer_size=16777216)[source]

I/O backend using a direct I/O reading strategy. This currently works on Linux and Windows; Mac OS X is not yet supported.

Use this backend if your data is much larger than your RAM, and you have fast enough storage (NVMe RAID, for example). In such cases the MMapBackend or BufferedBackend is not efficient, as the system is constantly under memory pressure, and this backend can perform much better.

Parameters:

max_buffer_size (int) – Maximum buffer size, in bytes. This is passed to the tileshape negotiation to select the right depth.

classmethod from_json(msg)[source]

Construct an instance from the already-decoded msg.

get_impl()[source]
id_: str | None = 'direct'
classmethod platform_supported()[source]
class libertem.io.dataset.base.DtypeConversionDecoder[source]
get_decode(native_dtype, read_dtype)[source]
get_native_dtype(inp_native_dtype, read_dtype)[source]
class libertem.io.dataset.base.File(path, start_idx, end_idx, native_dtype, sig_shape, frame_footer=0, frame_header=0, file_header=0)[source]

A description of a file that is part of a dataset. Contains information about the internal structure, like sizes of headers, frames, frame headers, frame footers, …

Parameters:
  • path (str) – The path of the file. Interpretation may be backend-specific

  • start_idx (int) – Start index of signal elements in this file (inclusive), in the flattened navigation axis

  • end_idx (int) – End index of signal elements in this file (exclusive), in the flattened navigation axis

  • native_dtype (np.dtype) – The dtype that is used for reading the data. This may match the “real” dtype of data, or in some cases, when no direct match is possible (decoding is necessary), it falls back to bytes (np.uint8)

  • sig_shape (Shape | Tuple[int, ...]) – The shape of each signal element

  • file_header (int) – Number of bytes to ignore at the beginning of the file

  • frame_header (int) – Number of bytes to ignore before each frame

  • frame_footer (int) – Number of bytes to ignore after each frame

property end_idx: int
property file_header_bytes: int
get_array_from_memview(mem: memoryview, slicing: OffsetsSizes) ndarray[source]

Convert a memoryview of the file’s data into an ndarray, cutting away frame headers and footers as defined by start and stop parameters.

Parameters:
  • mem – The input memoryview

  • start – Cut off frame headers of this size; usually start = frame_header_bytes // itemsize

  • stop – End index; usually stop = start + prod(sig_shape)

Returns:

The output array. Should have shape (num_frames, prod(sig_shape)) and native dtype

Return type:

np.ndarray

get_offsets_sizes(size: int) OffsetsSizes[source]

Get file and frame offsets/sizes

Parameters:

size (int) – len(memoryview) for the whole file

Returns:

The file/frame slicing

Return type:

slicing

property native_dtype: dtype
property num_frames: int
property path: str
property sig_shape: tuple[int, ...]
property start_idx: int
class libertem.io.dataset.base.FileSet(files: list[File], frame_header_bytes: int = 0, frame_footer_bytes: int = 0)[source]
Parameters:

files – files that are part of a partition or dataset

files_from(start)[source]
get_as_arr()[source]
get_for_range(start, stop)[source]

return new FileSet filtered for files having frames in the [start, stop) range

get_read_ranges(start_at_frame: int, stop_before_frame: int, dtype, tiling_scheme: TilingScheme, sync_offset: int = 0, roi: ndarray | None = None)[source]
class libertem.io.dataset.base.FileTree(low: int, high: int, value: Any, idx: int, left: None | FileTree, right: None | FileTree)[source]

Construct a FileTree node

Parameters:
  • low – First frame contained in this file

  • high – First index of the next file

  • value – The corresponding file object

  • idx – The index of the file object in the fileset

  • left – Nodes with a lower low

  • right – Nodes with a higher low

classmethod make(files)[source]

build a balanced binary tree by bisecting the files list

search_start(value)[source]

search a node that has start_idx <= value && end_idx > value

to_string(depth=0)[source]
class libertem.io.dataset.base.IOBackend[source]
classmethod from_json(msg)[source]

Construct an instance from the already-decoded msg.

classmethod get_cls_by_id(id_)[source]
get_impl() IOBackendImpl[source]
classmethod get_supported()[source]
id_: str | None = None
classmethod platform_supported()[source]
registry: dict[str, type[IOBackend]] = {'buffered': <class 'libertem.io.dataset.base.backend_buffered.BufferedBackend'>, 'direct': <class 'libertem.io.dataset.base.backend_direct.DirectBackend'>, 'mmap': <class 'libertem.io.dataset.base.backend_mmap.MMapBackend'>}
class libertem.io.dataset.base.MMapBackend(enable_readahead_hints=False)[source]

I/O backend using memory mapped files. Used by default on non-Windows systems.

Parameters:

enable_readahead_hints (bool) – Linux only. Try to influence readahead behavior (experimental).

classmethod from_json(msg)[source]

Construct an instance from the already-decoded msg.

get_impl()[source]
id_: str | None = 'mmap'
class libertem.io.dataset.base.Negotiator[source]

Tile shape negotiator. The main functionality is in get_scheme, which, given a udf, dataset and read_dtype will generate a TilingScheme that is compatible with both the UDF and the DataSet, possibly even optimal.

get_scheme(udfs: Sequence[UDFProtocol], dataset, read_dtype: nt.DTypeLike, approx_partition_shape: Shape, roi: ndarray | None = None, corrections: CorrectionSet | None = None) TilingScheme[source]

Generate a TilingScheme instance that is compatible with both the given udfs and the DataSet.

Parameters:
  • udfs (Sequence[UDFProtocol]) – The concrete UDFs to optimize the tiling scheme for, depending on the method (tile, frame, partition) and preferred total input size and depth.

  • dataset (DataSet) – The DataSet instance we generate the scheme for.

  • read_dtype – The dtype in which the data will be fed into the UDF

  • approx_partition_shape – The approximate partition shape that is likely to be used

  • roi (np.ndarray) – Region of interest

  • corrections (CorrectionSet) – Correction set to consider in negotiation

validate(shape: tuple[int, ...], ds_sig_shape: tuple[int, ...], size: int, io_max_size: int, itemsize: int, base_shape: tuple[int, ...], corrections: CorrectionSet | None)[source]
class libertem.io.dataset.base.Partition(meta: DataSetMeta, partition_slice: Slice, io_backend: IOBackend, decoder: Decoder | None)[source]
Parameters:
  • meta – The DataSet’s DataSetMeta instance

  • partition_slice – The partition slice in non-flattened form

  • fileset – The files that are part of this partition (the FileSet may also contain files from the dataset which are not part of this partition, but that may harm performance)

  • io_backend – The I/O backend to use for accessing this partition

  • decoder – The decoder that needs to be used for decoding this partition’s data

property dtype
get_frame_count(roi: ndarray | None = None) int[source]
get_ident() str[source]
get_io_backend()[source]
get_locations()[source]
get_macrotile(dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]

Return a single tile for the entire partition.

This is useful to support process_partition() in UDFs and to construct dask arrays from datasets.

get_tiles(tiling_scheme, dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]
classmethod make_slices(shape, num_partitions, sync_offset=0)[source]

partition a 3D dataset (“list of frames”) along the first axis, yielding the partition slice, and additionally start and stop frame indices for each partition.

set_corrections(corrections: CorrectionSet)[source]
set_idx(idx: int)[source]
set_io_backend(backend)[source]
set_worker_context(worker_context: WorkerContext)[source]
property shape: Shape

the shape of the partition; dimensionality depends on format

validate_tiling_scheme(tiling_scheme)[source]
class libertem.io.dataset.base.PartitionStructure(shape, slices, dtype)[source]

Structure of the dataset.

Assumed to be contiguous on the flattened navigation axis.

Parameters:
  • slices (List[Tuple[Int, ...]]) – List of tuples [start_idx, end_idx) that partition the data set by the flattened navigation axis

  • shape (Shape) – shape of the whole dataset

  • dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.

SCHEMA = {'$id': 'http://libertem.org/PartitionStructure.schema.json', '$schema': 'http://json-schema.org/draft-07/schema#', 'properties': {'dtype': {'type': 'string'}, 'shape': {'items': {'minimum': 1, 'type': 'number'}, 'minItems': 2, 'type': 'array'}, 'sig_dims': {'type': 'number'}, 'slices': {'items': {'items': {'maxItems': 2, 'minItems': 2, 'type': 'number'}, 'type': 'array'}, 'minItems': 1, 'type': 'array'}, 'version': {'const': 1}}, 'required': ['version', 'slices', 'shape', 'sig_dims', 'dtype'], 'title': 'PartitionStructure', 'type': 'object'}
classmethod from_ds(ds)[source]
classmethod from_json(data)[source]
serialize()[source]
class libertem.io.dataset.base.TilingScheme(slices: list[Slice], tileshape: Shape, dataset_shape: Shape, intent: Literal['partition'] | Literal['frame'] | Literal['tile'] | None = None, debug=None)[source]
adjust_for_partition(partition: Partition) TilingScheme[source]

If the intent is per-partition processing, the tiling scheme must match the partition shape exactly. If there is a mismatch, this method returns a new scheme that matches the partition.

Parameters:

partition – The Partition we want to adjust the tiling scheme to.

Returns:

The adjusted tiling scheme, or this one, if it matches exactly

Return type:

TilingScheme

property dataset_shape
property depth
property intent: Literal['partition'] | Literal['frame'] | Literal['tile'] | None
classmethod make_for_shape(tileshape: Shape, dataset_shape: Shape, intent: Literal['partition'] | Literal['frame'] | Literal['tile'] | None = None, debug=None) TilingScheme[source]

Make a TilingScheme from tileshape and dataset_shape.

Note that there are border effects in both the signal and navigation direction, i.e. if the depth doesn’t evenly divide the number of frames in the partition (simplified, ROI also applies…), or if the signal dimensions of tileshape don’t evenly divide the signal dimensions of the dataset_shape.

Parameters:
  • tileshape – Uniform shape of all tiles. Should have flat navigation axis (meaning tileshape.nav.dims == 1) and be contiguous in signal dimensions.

  • dataset_shape – Shape of the whole data set. Only the signal part is used.

  • intent – The intent of this scheme (whole partitions, frames or tiles). Needs to be set for correct per-partition tiling!
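
A minimal sketch (shapes chosen arbitrarily) of constructing a scheme with a flat navigation axis:

>>> from libertem.common import Shape  
>>> from libertem.io.dataset.base import TilingScheme  
>>> ts = TilingScheme.make_for_shape(  
...     tileshape=Shape((4, 64, 64), sig_dims=2),  
...     dataset_shape=Shape((16, 16, 128, 128), sig_dims=2),  
... )  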

property shape

Tileshape. Note that some border tiles can be smaller!

property slices

signal-only slices for all possible positions

property slices_array

Returns the slices from the scheme as a numpy ndarray a of shape (n, 2, sig_dims), where a[i, 0] are the origins and a[i, 1] the shapes for slice i.

class libertem.io.dataset.base.WritableDataSet[source]
class libertem.io.dataset.base.WritablePartition[source]
delete()[source]
get_write_handle()[source]
libertem.io.dataset.base.decode_swap_2(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
libertem.io.dataset.base.decode_swap_4(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
libertem.io.dataset.base.default_get_read_ranges(start_at_frame, stop_before_frame, roi_nonzero, depth, slices_arr, fileset_arr, sig_shape, bpp, sync_offset=0, extra=None, frame_header_bytes=0, frame_footer_bytes=0)
libertem.io.dataset.base.get_coordinates(slice_: Slice, ds_shape: Shape, roi=None) ndarray[source]

Returns numpy.ndarray of coordinates that correspond to the frames in the actual navigation space which are part of the current tile or partition.

Parameters:
  • slice (Slice) – Describes the location within the dataset with navigation dimension flattened and reduced to the ROI.

  • ds_shape (Shape) – The original shape of the whole dataset, not influenced by the ROI

  • roi (numpy.ndarray, optional) – Array of type bool, matching the navigation shape of the dataset

libertem.io.dataset.base.make_get_read_ranges(px_to_bytes=CPUDispatcher(<function _default_px_to_bytes>), read_ranges_tile_block=CPUDispatcher(<function _default_read_ranges_tile_block>))[source]

Translate the TilingScheme combined with the roi into (pixel)-read-ranges, together with their tile slices.

Parameters:
  • start_at_frame – Dataset-global first frame index to read

  • stop_before_frame – Stop before this frame index

  • tiling_scheme – Description on how the data should be tiled

  • fileset_arr – Array of shape (number_of_files, 3) where the last dimension contains the following values: (start_idx, end_idx, file_idx), where [start_idx, end_idx) defines which frame indices are contained in the file.

  • roi – Region of interest (for the full dataset)

  • bpp (int) – Bits per pixel, including padding

Returns:

read_ranges is an ndarray with shape (number_of_tiles, depth, 3) where the last dimension contains: file index, start_byte, stop_byte

Return type:

(tile_slice, read_ranges)