Data Set API
This API allows loading and handling data efficiently on a distributed system. Note that you should not use most dataset methods directly; instead, use the higher-level tools available, for example user-defined functions.
See our documentation on loading data for a high-level introduction.
Formats
Merlin Medipix (MIB)
- class libertem.io.dataset.mib.MIBDataSet(path, tileshape=None, scan_size=None, disable_glob=False, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
MIB data sets consist of one or more .mib files, and optionally a .hdr file. The HDR file is used to automatically set the nav_shape parameter from the fields “Frames per Trigger” and “Frames in Acquisition.” When loading a MIB data set, you can either specify the path to the HDR file, or choose one of the MIB files. The MIB files are assumed to follow a naming pattern of some non-numerical prefix, and a sequential numerical suffix.
Note that if you are using a per-pixel or per-scan trigger setup, LiberTEM won’t be able to deduce the x scanning dimension - in that case, you will need to specify the nav_shape yourself.
Currently, we support all integer formats and most RAW formats. In particular, the following configurations are not yet supported for RAW files:
Non-2x2 layouts with more than one chip
24bit with more than one chip
New in version 0.9.0: Support for the raw quad format was added
Examples
>>> # both examples look for files matching /path/to/default*.mib:
>>> ds1 = ctx.load("mib", path="/path/to/default.hdr")
>>> ds2 = ctx.load("mib", path="/path/to/default64.mib")
- Parameters:
path (str) – Path to either the .hdr file or one of the .mib files
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Common case: (height, width); but can be any dimensionality
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
disable_glob (bool, default False) – Usually, MIB data sets are stored as a series of .mib files, and we can reliably guess the whole set from a single path. If you instead save your data set into a single .mib file, and have multiple of these in a single directory with the same prefix (for example, a.mib, a1.mib and a2.mib), loading a.mib would include a1.mib and a2.mib in the data set. Setting disable_glob to True will only load the single .mib file specified as path.
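If the scan dimensions cannot be read from the HDR file, for example with a per-pixel or per-scan trigger setup as noted above, the nav_shape can be passed explicitly. A sketch with placeholder path and shape:
>>> ds = ctx.load("mib", path="/path/to/default.hdr", nav_shape=(32, 32))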
Raw binary files
- class libertem.io.dataset.raw.RawFileDataSet(path, dtype, scan_size=None, detector_size=None, enable_direct=False, detector_size_raw=None, crop_detector_to=None, tileshape=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read raw data from a single file of raw binary data. This reader assumes the following format:
only raw data (no file header)
frames are stored in C-order without additional frame headers
dtype supported by numpy
Examples
>>> ds = ctx.load("raw", path=path_to_raw, nav_shape=(16, 16), sig_shape=(128, 128),
...               sync_offset=0, dtype="float32")
- Parameters:
path (str) – Path to the file
nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int) – Common case: (height, width); but can be any dimensionality
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.
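As a sketch of the dtype parameter described above, big-endian 16-bit raw data could be loaded like this (path and shapes are placeholders):
>>> ds = ctx.load("raw", path=path_to_raw, nav_shape=(16, 16), sig_shape=(128, 128),
...               dtype=">u2")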
Raw binary files in sparse CSR format
- class libertem.io.dataset.raw_csr.RawCSRDataSet(path: str, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int = 0, io_backend: IOBackend | None = None)[source]
Read sparse data in compressed sparse row (CSR) format from a triple of files that contain the index pointers, the coordinates and the values. See Wikipedia article on the CSR format for more information on the format.
The necessary parameters are specified in a TOML file like this:
[params]
filetype = "raw_csr"
nav_shape = [512, 512]
sig_shape = [516, 516]

[raw_csr]
indptr_file = "rowind.dat"
indptr_dtype = "<i4"
indices_file = "coords.dat"
indices_dtype = "<i4"
data_file = "values.dat"
data_dtype = "<i4"
Both the navigation and signal axis are flattened in the file, so that existing CSR libraries like scipy.sparse can be used directly by memory-mapping or reading the file contents.
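As a sketch of what this layout allows (not part of the LiberTEM API), the three files named in the TOML example above could be memory-mapped and wrapped into a scipy.sparse matrix directly; the file names, dtypes and shapes are taken from that example:
>>> import numpy as np
>>> import scipy.sparse
>>> indptr = np.memmap("rowind.dat", dtype="<i4", mode="r")
>>> indices = np.memmap("coords.dat", dtype="<i4", mode="r")
>>> values = np.memmap("values.dat", dtype="<i4", mode="r")
>>> csr = scipy.sparse.csr_matrix(
...     (values, indices, indptr),
...     shape=(512 * 512, 516 * 516),  # (flattened nav, flattened sig)
... )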
- Parameters:
path (str) – Path to the TOML file with file names and other parameters for the sparse dataset.
nav_shape (Tuple[int, int], optional) – A nav_shape to apply to the dataset overriding the shape value read from the TOML file, by default None. This can be used to read a subset of the data, or reshape the contained data.
sig_shape (Tuple[int, int], optional) – A sig_shape to apply to the dataset overriding the shape value read from the TOML file, by default None.
sync_offset (int, optional, by default 0) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
io_backend (IOBackend, optional) – The I/O backend to use, see I/O Backends, by default None.
Examples
>>> ds = ctx.load("raw_csr", path='./path_to.toml')
NumPy files (NPY)
- class libertem.io.dataset.npy.NPYDataSet(path: str, sig_dims: int | None = 2, nav_shape: tuple[int, int] | None = None, sig_shape: tuple[int, int] | None = None, sync_offset: int = 0, io_backend: IOBackend | None = None)[source]
New in version 0.10.0.
Read data stored in a NumPy .npy binary file. Dataset shape and dtype are inferred from the file header unless overridden by the arguments to this class.
At this time, Fortran-ordered .npy files are not supported.
- Parameters:
path (str) – The path to the .npy file
sig_dims (int, optional, by default 2) – The number of dimensions from the end of the full shape to interpret as signal dimensions. If None, this will be inferred from the sig_shape argument when present.
nav_shape (Tuple[int, int], optional) – A nav_shape to apply to the dataset overriding the shape value read from the .npy header, by default None. This can be used to read a subset of the .npy file, or reshape the contained data. Frames are read in C-order from the beginning of the file.
sig_shape (Tuple[int, int], optional) – A sig_shape to apply to the dataset overriding the shape value read from the .npy header, by default None. Pixels are read in C-order from the beginning of the file.
sync_offset (int, optional, by default 0) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
io_backend (IOBackend, optional) – The I/O backend to use, see I/O Backends, by default None.
- Raises:
DataSetException – If sig_dims is not an integer and cannot be inferred from sig_shape
DataSetException – If the supplied nav_shape + sig_shape describe an array larger than the contents of the .npy file
DataSetException – If the .npy file is Fortran-ordered
Examples
>>> ds = ctx.load("npy", path='./path_to_file.npy')
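As noted for nav_shape above, the shape from the .npy header can be overridden, for example to reshape a flat stack of frames into a scan grid; the shape here is a placeholder:
>>> ds = ctx.load("npy", path='./path_to_file.npy', nav_shape=(16, 16))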
Digital Micrograph (DM3, DM4) files
There are currently two Digital Micrograph dataset implementations: SingleDMDataSet for a single-file, C-ordered .dm4 nD-dataset, and StackedDMDataSet for a stack of individual .dm3 or .dm4 image files which together comprise an nD-dataset.
Both forms can be created using the following call to the Context:
ctx.load('dm', ...)
Where possible, the choice of reader (single-file or stacked) will be inferred from the parameters.
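As a sketch with placeholder file names, a single-file dataset is opened via its path, while a stack is opened by passing the list of files:
>>> ds_single = ctx.load('dm', path='./path/to/data.dm4')
>>> ds_stack = ctx.load('dm', files=['./frame_01.dm3', './frame_02.dm3'])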
- class libertem.io.dataset.dm_single.SingleDMDataSet(*args, **kwargs)[source]
Reader for a single DM3/DM4 file. Handles 4D-STEM, 3D-Spectrum Images, and TEM image stacks stored in a single-file format. Where possible the structure will be inferred from the file metadata.
New in version 0.11.0.
Note
Single-file DM data can be stored on disk using either normal C-ordering, which is an option in recent versions of GMS, or an alternative F/C-hybrid ordering depending on the imaging mode and dimensionality. The reading of F/C-hybrid files is currently not supported for performance reasons.
The DataSet will try to infer the ordering from the file metadata and read accordingly. If the file uses the older hybrid F/C-ordering (flat_sig, flat_nav), the dataset will raise an exception unless the force_c_order argument is set to True. A converter for F/C-hybrid files is provided as convert_dm4_transposed().
Note
In the Web-GUI a 2D-image or 3D-stack/spectrum image will have extra singleton navigation dimensions prepended to allow them to display. DM files containing multiple datasets are supported via the dataset_index argument.
While capable of reading 2D/3D files, LiberTEM is not particularly well-adapted to processing these data and the user should consider other tools. Individual spectra or vectors (1D data) are not supported.
- Parameters:
path (PathLike) – The path to the .dm3/.dm4 file
nav_shape (Tuple[int, ...], optional) – Override the nav_shape provided by the file metadata. This can be used to adjust the total number of frames.
sig_shape (Tuple[int, ...], optional) – Override the sig_shape provided by the file metadata. Data are read sequentially in all cases, therefore this is typically only useful if the total number of sig pixels remains constant.
sync_offset (int, optional, by default 0) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
io_backend (IOBackend, optional) – A specific IOBackend implementation to override the platform default.
force_c_order (bool, optional, by default False) – Force the data to be interpreted as a C-ordered array regardless of the tag information. This will lead to incorrect results on a hybrid C/F-ordered file.
dataset_index (int, optional) – In the case of a multi-dataset DM file this can be used to open a specific dataset index. Note that the datasets in a DM-file often begin with a thumbnail which occupies the 0 dataset index. If not provided the first compatible dataset found in the file is used.
- class libertem.io.dataset.dm.StackedDMDataSet(*args, **kwargs)[source]
Reader for stacks of DM3/DM4 files.
Note
This DataSet is not supported in the GUI yet, as the file dialog needs to be updated to properly handle opening series.
Note
Single-file 3/4D DM datasets are supported through the SingleDMDataSet class.
Note
You can use the PyPI package natsort to sort the filenames by their numerical components; this is especially useful for filenames without leading zeros.
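A short sketch of this suggestion; the glob pattern is a placeholder:
>>> import glob
>>> from natsort import natsorted
>>> files = natsorted(glob.glob('./path/to/stack_*.dm3'))
>>> ds = ctx.load('dm', files=files)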
- Parameters:
files (List[str]) – List of paths to the files that should be loaded. The order is important, as it determines the order in the navigation axis.
nav_shape (Tuple[int, ...] or None) – By default, the files are loaded as a 3D stack. You can change this by specifying the nav_shape, which reshapes the navigation dimensions. Raises a DataSetException if the shape is incompatible with the data that is loaded.
sig_shape (Tuple[int, ...], optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
same_offset (bool) – When reading a stack of dm3/dm4 files, it can be expensive to read in all the metadata from all files, which we currently only use for getting the offsets and sizes of the main data in each file. If you absolutely know that the offsets and sizes are the same for all files, you can set this parameter and we will skip reading the metadata from all files except the first.
DM4 datasets stored in a transposed format (sig, nav) can be converted to C-ordered data compatible with LiberTEM using the contrib function convert_dm4_transposed().
EMPAD
- class libertem.io.dataset.empad.EMPADDataSet(path, scan_size=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read data from EMPAD detector. EMPAD data sets consist of two files, one .raw and one .xml file. Note that the .xml file contains the file name of the .raw file, so if the raw file was renamed at some point, opening using the .xml file will fail.
- Parameters:
path (str) – Path to either the .xml or the .raw file. If the .xml file is given, the nav_shape parameter can be left out
nav_shape (tuple of int, optional) – A tuple (y, x) or (num_images,) that specifies the size of the scanned region or number of frames in the series. It is automatically read from the .xml file if you specify one as path.
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
Examples
>>> ds = ctx.load("empad", path='./path_to_file.xml', ...)
K2IS
- class libertem.io.dataset.k2is.K2ISDataSet(path, nav_shape=None, sig_shape=None, sync_offset=None, io_backend=None)[source]
Read raw K2IS data sets. They consist of 8 .bin files and one .gtg file. Currently, data acquired using the STEMx unit is supported; metadata about the nav_shape is read from the .gtg file.
- Parameters:
path (str) – Path to one of the files of the data set (either one of the .bin files or the .gtg file)
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
Examples
>>> ds = ctx.load("k2is", path='./path_to_file.bin', ...)
FRMS6
- class libertem.io.dataset.frms6.FRMS6DataSet(path, enable_offset_correction=True, gain_map_path=None, dest_dtype=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read PNDetector FRMS6 files. FRMS6 data sets consist of multiple .frms6 files and a .hdr file. The first .frms6 file (matching *_000.frms6) contains dark frames, which are subtracted if enable_offset_correction is true.
- Parameters:
path (string) – Path to one of the files of the FRMS6 dataset (either .hdr or .frms6)
enable_offset_correction (boolean) – Subtract dark frames when reading data
gain_map_path (string) – Path to a gain map to apply (.mat format)
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
Examples
>>> ds = ctx.load("frms6", path='./path_to_file.hdr', ...)
BLO
- class libertem.io.dataset.blo.BloDataSet(path, tileshape=None, endianess='<', nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read Nanomegas .blo files.
Examples
>>> ds = ctx.load("blo", path="/path/to/file.blo")
- Parameters:
path (str) – Path to the file
endianess (str) – Either '<' or '>' for little- or big-endian data
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
SER
- class libertem.io.dataset.ser.SERDataSet(path, emipath=None, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read TIA SER files.
Examples
>>> ds = ctx.load("ser", path="/path/to/file.ser")
- Parameters:
path (str) – Path to the .ser file
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
HDF5
- class libertem.io.dataset.hdf5.H5DataSet(path, ds_path=None, tileshape=None, nav_shape=None, sig_shape=None, target_size=None, min_num_partitions=None, sig_dims=2, io_backend=None, sync_offset: int = 0)[source]
Read data from an HDF5 data set.
Examples
>>> ds = ctx.load("hdf5", path=path_to_hdf5, ds_path="/data")
- Parameters:
path (str) – Path to the file
ds_path (str) – Path to the HDF5 data set inside the file
nav_shape (tuple of int, optional) – An n-tuple that specifies the shape of the navigation / scan grid. By default this is inferred from the HDF5 dataset.
sig_shape (tuple of int, optional) – An n-tuple that specifies the shape of the signal / frame grid. This parameter is currently unsupported and will raise an error if provided and not matching the underlying data sig shape. By default the sig_shape is inferred from the HDF5 dataset via the sig_dims parameter.
sig_dims (int) – Number of dimensions that should be considered part of the signal (for example 2 when dealing with 2D image data)
sync_offset (int, optional, by default 0) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
target_size (int) – Target partition size, in bytes. Usually doesn’t need to be changed.
min_num_partitions (int) – Minimum number of partitions, set to number of cores if not specified. Usually doesn’t need to be specified.
Note
If the HDF5 file to be loaded contains compressed data using a custom compression filter (other than GZIP, LZF or SZIP), the associated HDF5 filter library must be imported on the workers before accessing the file. See the h5py documentation on filter pipelines for more information.
The library hdf5plugin is preloaded automatically if it is installed. Other filter libraries may have to be specified for preloading by the user.
Preloads for a local DaskJobExecutor can be specified through the preload argument of either make_local() or libertem.executor.dask.cluster_spec(). For the libertem.executor.inline.InlineJobExecutor, the plugins can simply be imported in the main script.
For the web GUI, or for running LiberTEM in a cluster with existing workers (e.g. by running libertem-worker or dask-worker on nodes), the necessary imports can be specified as --preload arguments to the launch command, for example libertem-server --preload hdf5plugin or libertem-worker --preload hdf5plugin tcp://scheduler_ip:port. --preload can be specified multiple times.
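As a minimal sketch of the simplest case described above (an InlineJobExecutor, or any script where the filter library can be imported directly), assuming hdf5plugin is installed:
>>> import hdf5plugin
>>> ds = ctx.load("hdf5", path=path_to_hdf5, ds_path="/data")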
Norpix SEQ
- class libertem.io.dataset.seq.SEQDataSet(path: str, scan_size: tuple[int, ...] | None = None, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int = 0, io_backend=None)[source]
Read data from Norpix SEQ files.
Examples
>>> ds = ctx.load("seq", path="/path/to/file.seq", nav_shape=(1024, 1024))
- Parameters:
path – Path to the .seq file
nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from start. If negative, number of blank frames to insert at start.
Note
Dark and gain references are loaded from MRC files with the same root as the SEQ file and the extensions .dark.mrc and .gain.mrc, i.e. /path/to/file.dark.mrc and /path/to/file.gain.mrc if they are present.
New in version 0.8.0.
Dead pixels are read from an XML file with the same root as the SEQ file and the extension .Config.Metadata.xml, i.e. /path/to/file.Config.Metadata.xml in the above example, if both this file and /path/to/file.metadata are present.
See Corrections for more information on how to change or disable corrections.
FIXME find public documentation of the XML format and dark/gain maps.
MRC
- class libertem.io.dataset.mrc.MRCDataSet(path, nav_shape=None, sig_shape=None, sync_offset=0, io_backend=None)[source]
Read MRC files.
Examples
>>> ds = ctx.load("mrc", path="/path/to/file.mrc")
- Parameters:
path (str) – Path to the .mrc file
nav_shape (tuple of int, optional) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int, optional) – Signal/detector size (height, width)
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
TVIPS
- class libertem.io.dataset.tvips.TVIPSDataSet(path, nav_shape: tuple[int, ...] | None = None, sig_shape: tuple[int, ...] | None = None, sync_offset: int | None = None, io_backend: IOBackend | None = None)[source]
Read data from one or more .tvips files. You can specify the path to any file that is part of a set - the whole data set will be loaded. We will try to guess nav_shape and sync_offset from the image headers for 4D STEM data, but you may need to specify these parameters in case the guessing logic fails.
New in version 0.9.0.
Examples
>>> ds = ctx.load(
...     "tvips",
...     path="./path/to/file_000.tvips",
...     nav_shape=(16, 16)
... )
- Parameters:
path (str) – Path to the file
nav_shape (tuple of int) – An n-tuple that specifies the size of the navigation region ((y, x), but it can also be of length 1, for example for a line scan, or of length 3, for example for a data cube)
sig_shape (tuple of int) – Common case: (height, width); but can be any dimensionality
sync_offset (int, optional) – If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start. If not given, we try to automatically determine the sync_offset from the scan metadata in the image headers.
Memory data set
- class libertem.io.dataset.memory.MemoryDataSet(tileshape=None, num_partitions=None, data=None, sig_dims=None, check_cast=True, tiledelay=None, datashape=None, base_shape=None, force_need_decode=False, io_backend=None, nav_shape=None, sig_shape=None, sync_offset=0, array_backends=None)[source]
This dataset is constructed from a NumPy array in memory for testing purposes. It is not recommended for production use since it performs poorly with a distributed executor.
Examples
>>> data = np.zeros((2, 2, 64, 64), dtype=np.float32)
>>> ds = ctx.load('memory', data=data, sig_dims=2)
Dask
- class libertem.io.dataset.dask.DaskDataSet(dask_array, *, sig_dims, preserve_dimensions=True, min_size=None, io_backend=None)[source]
New in version 0.9.0.
Wraps a Dask.array.array such that it can be processed by LiberTEM. Partitions are created to be aligned with the array chunking. When the array chunking is not compatible with LiberTEM the wrapper merges chunks until compatibility is achieved.
The best-case scenario is for the original array to be chunked in the leftmost navigation dimension. If instead another navigation dimension is chunked, the user can set preserve_dimensions=False to re-order the navigation shape to achieve better chunking for LiberTEM. If more than one navigation dimension is chunked, the class will do its best to merge chunks without creating partitions which are too large.
LiberTEM requires that a partition contains only whole signal frames, so any signal dimension chunking is immediately merged by this class.
This wrapper is most useful when the Dask array was created using lazy I/O via dask.delayed, or via dask.array operations. The major assumption is that the chunks in the array can each be individually evaluated without having to read or compute more data than the chunk itself contains. If this is not the case then this class could perform very poorly due to read amplification, or even crash the Dask workers.
As the class performs rechunking using a merge-only strategy it will never split chunks which were present in the original array. If the array is originally very lightly chunked, then the corresponding LiberTEM partitions will be very large. In addition, overly-chunked arrays (for example one chunk per frame) can incur excessive Dask task graph overheads and should be avoided where possible.
- Parameters:
dask_array (dask.array.array) – A Dask array
sig_dims (int) – Number of dimensions in dask_array.shape counting from the right to treat as signal dimensions
preserve_dimensions (bool, optional) – If False, allow optimization of the dask_array chunking by re-ordering the nav_shape to put the most chunked dimensions first. This can help when more than one nav dimension is chunked.
min_size (float, optional) – The minimum partition size in bytes if the array chunking allows an order-preserving merge strategy. The default min_size is 128 MiB.
io_backend (bool, optional) – For compatibility, accept an unused io_backend argument.
Example
>>> import dask.array as da
>>>
>>> d_arr = da.ones((4, 4, 64, 64), chunks=(2, -1, -1, -1))
>>> ds = ctx.load('dask', dask_array=d_arr, sig_dims=2)
Will create a dataset with 5 partitions split along the zeroth dimension.
Converters
- libertem.contrib.convert_transposed.convert_dm4_transposed(dm4_path: PathLike, out_path: PathLike, ctx: Context | None = None, num_cpus: int | None = None, dataset_index: int | None = None, progress: bool = False)[source]
Convenience function to convert a transposed Gatan Digital Micrograph (.dm4) STEM dataset into a numpy (.npy) file with standard ordering for processing with LiberTEM.
Transposed .dm4 files are stored in (sig, nav) order, i.e. all frame values for a given signal pixel are stored as blocks, which means that extracting a single frame requires traversal of the whole file. LiberTEM requires (nav, sig) order for processing using the UDF interface, i.e. each frame is stored sequentially.
New in version 0.13.0.
- Parameters:
dm4_path (PathLike) – The path to the .dm4 file
out_path (PathLike) – The path to the output .npy file
ctx (libertem.api.Context, optional) – The Context to use to perform the conversion, by default None, in which case a Dask-based context will be created (optionally) following the num_cpus argument.
num_cpus (int, optional) – When ctx is not supplied, this argument limits the number of CPUs used to perform the conversion. This can be important as conversion is a RAM-intensive operation, and limiting the number of CPUs can help reduce bottlenecking.
dataset_index (int, optional) – If the .dm4 file contains multiple datasets, this can be used to select the dataset to convert (see SingleDMDataSet for more information).
progress (bool, optional) – Whether to display a progress bar during conversion, by default False
- Raises:
DataSetException – If the DM4 dataset is not stored as transposed
ValueError – If both ctx and num_cpus are supplied
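A usage sketch with placeholder file paths; the converted .npy file can then be loaded as usual:
>>> from libertem.contrib.convert_transposed import convert_dm4_transposed
>>> convert_dm4_transposed('./transposed.dm4', './converted.npy', num_cpus=4)
>>> ds = ctx.load('npy', path='./converted.npy')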
Internal DataSet API
- class libertem.io.dataset.base.BasePartition(meta: DataSetMeta, partition_slice: Slice, fileset: FileSet, start_frame: int, num_frames: int, io_backend: IOBackend, decoder: Decoder | None = None)[source]
Base class with default implementations
- Parameters:
meta – The DataSet’s DataSetMeta instance
partition_slice – The partition slice in non-flattened form
fileset – The files that are part of this partition (the FileSet may also contain files from the dataset which are not part of this partition, but that may harm performance)
start_frame – The index of the first frame of this partition (global coords)
num_frames – How many frames this partition should contain
io_backend – The I/O backend to use for accessing this partition
- get_tiles(tiling_scheme: TilingScheme, dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'scipy.sparse.coo_array', 'scipy.sparse.csr_array', 'scipy.sparse.csc_array', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]
Return a generator over all DataTiles contained in this Partition.
Note
The DataSet may reuse the internal buffer of a tile, so you should directly process the tile and not accumulate a number of tiles and then work on them.
- Parameters:
tiling_scheme – According to this scheme the data will be tiled
dest_dtype (numpy dtype) – convert data to this dtype when reading
roi (numpy.ndarray) – Boolean array that matches the dataset navigation shape to limit the region to work on. With a ROI, we yield tiles from a “compressed” navigation axis, relative to the beginning of the dataset. Compressed means, only frames that have a 1 in the ROI are considered, and the resulting tile slices are from a coordinate system that has the shape (np.count_nonzero(roi),).
array_backend (ArrayBackend) –
Specify array backend to use. By default the first entry in the list of supported backends is used.
New in version 0.11.0.
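A minimal sketch of consuming this generator; partition and tiling_scheme are assumed to exist already, and the tile data is copied because the internal buffer may be reused (see the note above):
>>> for tile in partition.get_tiles(tiling_scheme, dest_dtype="float32"):
...     frame_part = tile.data.copy()  # copy before accumulating: the buffer may be reused
...     print(tile.tile_slice)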
- set_corrections(corrections: CorrectionSet | None)[source]
- class libertem.io.dataset.base.BufferedBackend(max_buffer_size=16777216)[source]
I/O backend using a buffered reading strategy. Useful for slower media like HDDs, where seeks cause performance drops. Used by default on Windows.
This does not perform optimally on SSDs under all circumstances; for better best-case performance, try using MMapBackend instead.
- Parameters:
max_buffer_size (int) – Maximum buffer size, in bytes. This is passed to the tileshape negotiation to select the right depth.
- class libertem.io.dataset.base.DataSet(io_backend: IOBackend | None = None)[source]
- MAX_PARTITION_SIZE = 536870912
- adjust_tileshape(tileshape: tuple[int, ...], roi: ndarray | None) tuple[int, ...] [source]
Final veto of the DataSet in the tileshape negotiation process; make sure that corrections are taken into account!
- property array_backends: Sequence[Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'scipy.sparse.coo_array', 'scipy.sparse.csr_array', 'scipy.sparse.csc_array', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix']]
The array backends the dataset can return data as.
Defaults to only NumPy arrays
New in version 0.11.0.
- check_valid() bool [source]
Check the validity of the DataSet. This will be executed (after initialize) on a worker node. It should raise DataSetException in case of errors, and return True otherwise.
- classmethod detect_params(path: str, executor: JobExecutor)[source]
Guess if path can be opened using this DataSet implementation and detect parameters.
Returns a dict of detected parameters if path matches this dataset type, or False if path is most likely not of a matching type.
- property diagnostics
Diagnostics common for all DataSet implementations
- property dtype: nt.DTypeLike
The “native” data type (either one matching the data on disk, or one that is closest)
- get_correction_data() CorrectionSet [source]
Correction parameters that are part of this DataSet. This should only be called after the DataSet is initialized.
- Returns:
correction parameters that are part of this DataSet
- Return type:
CorrectionSet
- get_diagnostics()[source]
Get relevant diagnostics for this dataset, as a list of dicts with keys name, value, where value may be string or a list of dicts itself. Subclasses should override this method.
- get_max_io_size() int | None [source]
Override this method to implement a custom maximum I/O size (in bytes)
- get_num_partitions() int [source]
Returns the number of partitions the dataset should be split into.
The default implementation sizes partitions such that they fit into 512MB of float data in memory, regardless of their native dtype. At least self._cores partitions are created.
- get_partitions() Generator[Partition, None, None] [source]
Return a generator over all Partitions in this DataSet. Should only be called on the master node.
- classmethod get_supported_extensions() set[str] [source]
Return supported extensions as a set of strings.
Plain extensions only, no pattern!
- classmethod get_supported_io_backends() list[str] [source]
Get the supported I/O backends as list of their IDs. Some DataSet implementations with a custom backend may return an empty list here.
- get_sync_offset_info()[source]
Check the specified sync_offset and return the number of frames skipped and inserted
- initialize(executor) DataSet [source]
Perform possibly expensive initialization, like pre-loading metadata.
This is run on the master node, but can execute parts on workers, for example if they need to access the data stored on worker nodes, using the passed executor instance.
If you need the executor around for later operations, for example when creating the partitioning, save a reference here!
Should return the possibly modified DataSet instance (if a method running on a worker is changing self, these changes won’t automatically be transferred back to the master node)
- property meta: DataSetMeta | None
- need_decode(read_dtype: nt.DTypeLike, roi: ndarray | None, corrections: CorrectionSet | None) bool [source]
- partition_shape(dtype: nt.DTypeLike, target_size: int, min_num_partitions: int | None = None, containing_shape: Shape | None = None) tuple[int, ...] [source]
Calculate partition shape for the given target_size
- Parameters:
dtype (numpy.dtype or str) – data type of the dataset
target_size (int) – target size in bytes - how large should each partition be?
min_num_partitions (int) – minimum number of partitions desired. Defaults to the number of workers in the cluster.
- Returns:
the shape calculated from the given parameters
- Return type:
Tuple[int, …]
- class libertem.io.dataset.base.DataSetMeta(shape: Shape, array_backends: Sequence[Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'scipy.sparse.coo_array', 'scipy.sparse.csr_array', 'scipy.sparse.csc_array', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix']] | None = None, image_count: int = 0, raw_dtype: nt.DTypeLike | None = None, dtype: nt.DTypeLike | None = None, metadata: Any | None = None, sync_offset: int = 0)[source]
- shape
“native” dataset shape, can have any dimensionality
- array_backends: Optional[Sequence[ArrayBackend]]
- raw_dtype: np.dtype
dtype used internally in the data set for reading
- dtype: np.dtype
Best-fitting output dtype. This can be different from raw_dtype, for example if there are post-processing steps done as part of reading, which need a different dtype. Assumed equal to raw_dtype if not given
- sync_offset: int, optional
If positive, number of frames to skip from the start. If negative, number of blank frames to insert at the start.
- image_count
Total number of frames in the dataset
- metadata
Any metadata offered by the DataSet, not specified yet
- class libertem.io.dataset.base.DataTile(data, tile_slice, scheme_idx)[source]
- property c_contiguous
- property data
- property dtype
- property flat_data: ndarray
Flatten the data.
The result is a 2D array where each row contains pixel data from a single frame. It is just a reshape, so it is a view into the original data.
- property shape
- property size
- class libertem.io.dataset.base.DirectBackend(max_buffer_size=16777216)[source]
I/O backend using a direct I/O reading strategy. This currently works on Linux and Windows, Mac OS X is not yet supported.
Use this backend if your data is much larger than your RAM, and you have fast enough storage (NVMe RAID, for example). In these cases, the MMapBackend or BufferedBackend is not efficient, as the system is constantly under memory pressure. In that case, this backend can perform much better.
- Parameters:
max_buffer_size (int) – Maximum buffer size, in bytes. This is passed to the tileshape negotiation to select the right depth.
- class libertem.io.dataset.base.File(path, start_idx, end_idx, native_dtype, sig_shape, frame_footer=0, frame_header=0, file_header=0)[source]
A description of a file that is part of a dataset. Contains information about the internal structure, like sizes of headers, frames, frame headers, frame footers, …
- Parameters:
path (str) – The path of the file. Interpretation may be backend-specific
start_idx (int) – Start index of signal elements in this file (inclusive), in the flattened navigation axis
end_idx (int) – End index of signal elements in this file (exclusive), in the flattened navigation axis
native_dtype (np.dtype) – The dtype that is used for reading the data. This may match the “real” dtype of data, or in some cases, when no direct match is possible (decoding is necessary), it falls back to bytes (np.uint8)
sig_shape (Shape | Tuple[int, ...]) – The shape of each signal element
file_header (int) – Number of bytes to ignore at the beginning of the file
frame_header (int) – Number of bytes to ignore before each frame
frame_footer (int) – Number of bytes to ignore after each frame
- get_array_from_memview(mem: memoryview, slicing: OffsetsSizes) ndarray [source]
Convert a memoryview of the file’s data into an ndarray, cutting away frame headers and footers as defined by start and stop parameters.
- Parameters:
mem – The input memoryview
start – Cut off frame headers of this size; usually start = frame_header_bytes // itemsize
stop – End index; usually stop = start + prod(sig_shape)
- Returns:
The output array. Should have shape (num_frames, prod(sig_shape)) and native dtype
- Return type:
np.ndarray
- class libertem.io.dataset.base.FileSet(files: list[File], frame_header_bytes: int = 0, frame_footer_bytes: int = 0)[source]
- Parameters:
files – files that are part of a partition or dataset
- class libertem.io.dataset.base.FileTree(low: int, high: int, value: Any, idx: int, left: None | FileTree, right: None | FileTree)[source]
Construct a FileTree node
- Parameters:
low – First frame contained in this file
high – First index of the next file
value – The corresponding file object
idx – The index of the file object in the fileset
left – Nodes with a lower low
right – Nodes with a higher low
- class libertem.io.dataset.base.MMapBackend(enable_readahead_hints=False)[source]
I/O backend using memory mapped files. Used by default on non-Windows systems.
- Parameters:
enable_readahead_hints (bool) – Linux only. Try to influence readahead behavior (experimental).
- class libertem.io.dataset.base.Negotiator[source]
Tile shape negotiator. The main functionality is in get_scheme, which, given a udf, dataset and read_dtype will generate a TilingScheme that is compatible with both the UDF and the DataSet, possibly even optimal.
- get_scheme(udfs: Sequence[UDFProtocol], dataset, read_dtype: nt.DTypeLike, approx_partition_shape: Shape, roi: ndarray | None = None, corrections: CorrectionSet | None = None) TilingScheme [source]
Generate a TilingScheme instance that is compatible with both the given udf and the DataSet.
- Parameters:
udfs (Sequence[UDFProtocol]) – The concrete UDFs to optimize the tiling scheme for. The scheme depends on the UDF method (tile, frame, partition) and the preferred total input size and depth.
dataset (DataSet) – The DataSet instance we generate the scheme for.
read_dtype – The dtype in which the data will be fed into the UDF
approx_partition_shape – The approximate partition shape that is likely to be used
roi (np.ndarray) – Region of interest
corrections (CorrectionSet) – Correction set to consider in negotiation
- class libertem.io.dataset.base.Partition(meta: DataSetMeta, partition_slice: Slice, io_backend: IOBackend, decoder: Decoder | None)[source]
- Parameters:
meta – The DataSet’s DataSetMeta instance
partition_slice – The partition slice in non-flattened form
fileset – The files that are part of this partition (the FileSet may also contain files from the dataset which are not part of this partition, but that may harm performance)
io_backend – The I/O backend to use for accessing this partition
decoder – The decoder that needs to be used for decoding this partition’s data
- property dtype
- get_macrotile(dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'scipy.sparse.coo_array', 'scipy.sparse.csr_array', 'scipy.sparse.csc_array', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]
Return a single tile for the entire partition.
This is useful to support process_partition() in UDFs and to construct dask arrays from datasets.
- get_tiles(tiling_scheme, dest_dtype='float32', roi=None, array_backend: Literal['numpy', 'numpy.matrix', 'cuda', 'cupy', 'sparse.COO', 'sparse.GCXS', 'sparse.DOK', 'scipy.sparse.coo_matrix', 'scipy.sparse.csr_matrix', 'scipy.sparse.csc_matrix', 'scipy.sparse.coo_array', 'scipy.sparse.csr_array', 'scipy.sparse.csc_array', 'cupyx.scipy.sparse.coo_matrix', 'cupyx.scipy.sparse.csr_matrix', 'cupyx.scipy.sparse.csc_matrix'] | None = None)[source]
- classmethod make_slices(shape, num_partitions, sync_offset=0)[source]
Partition a 3D dataset (“list of frames”) along the first axis, yielding the partition slice, and additionally start and stop frame indices for each partition.
- set_corrections(corrections: CorrectionSet)[source]
- class libertem.io.dataset.base.PartitionStructure(shape, slices, dtype)[source]
Structure of the dataset.
Assumed to be contiguous on the flattened navigation axis.
- Parameters:
slices (List[Tuple[Int, ...]]) – List of tuples [start_idx, end_idx) that partition the data set by the flattened navigation axis
shape (Shape) – shape of the whole dataset
dtype (numpy dtype) – The dtype of the data as it is on disk. Can contain endian indicator, for example >u2 for big-endian 16bit data.
- SCHEMA = {'$id': 'http://libertem.org/PartitionStructure.schema.json', '$schema': 'http://json-schema.org/draft-07/schema#', 'properties': {'dtype': {'type': 'string'}, 'shape': {'items': {'minimum': 1, 'type': 'number'}, 'minItems': 2, 'type': 'array'}, 'sig_dims': {'type': 'number'}, 'slices': {'items': {'items': {'maxItems': 2, 'minItems': 2, 'type': 'number'}, 'type': 'array'}, 'minItems': 1, 'type': 'array'}, 'version': {'const': 1}}, 'required': ['version', 'slices', 'shape', 'sig_dims', 'dtype'], 'title': 'PartitionStructure', 'type': 'object'}
- class libertem.io.dataset.base.TilingScheme(slices: list[Slice], tileshape: Shape, dataset_shape: Shape, intent: Literal['partition'] | Literal['frame'] | Literal['tile'] | None = None, debug=None)[source]
- adjust_for_partition(partition: Partition) TilingScheme [source]
If the intent is per-partition processing, the tiling scheme must match the partition shape exactly. If there is a mismatch, this method returns a new scheme that matches the partition.
- Parameters:
partition – The Partition we want to adjust the tiling scheme to.
- Returns:
The adjusted tiling scheme, or this one, if it matches exactly
- Return type:
TilingScheme
- property dataset_shape
- property depth
- classmethod make_for_shape(tileshape: Shape, dataset_shape: Shape, intent: Literal['partition'] | Literal['frame'] | Literal['tile'] | None = None, debug=None) TilingScheme [source]
Make a TilingScheme from tileshape and dataset_shape.
Note that both in signal and navigation direction there are border effects, i.e. if the depth doesn’t evenly divide the number of frames in the partition (simplified, ROI also applies…), or if the signal dimensions of tileshape don’t evenly divide the signal dimensions of the dataset_shape.
- Parameters:
tileshape – Uniform shape of all tiles. Should have flat navigation axis (meaning tileshape.nav.dims == 1) and be contiguous in signal dimensions.
dataset_shape – Shape of the whole data set. Only the signal part is used.
intent – The intent of this scheme (whole partitions, frames or tiles). Needs to be set for correct per-partition tiling!
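A small sketch of constructing a scheme for a hypothetical 4D dataset; the shapes are placeholders and Shape is assumed to be importable from libertem.common:
>>> from libertem.common import Shape
>>> from libertem.io.dataset.base import TilingScheme
>>> ds_shape = Shape((16, 16, 128, 128), sig_dims=2)
>>> tileshape = Shape((8, 128, 128), sig_dims=2)  # flat navigation axis with depth 8
>>> scheme = TilingScheme.make_for_shape(tileshape=tileshape, dataset_shape=ds_shape)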
- property shape
The tileshape. Note that some border tiles can be smaller!
- property slices
Signal-only slices for all possible positions
- property slices_array
Returns the slices from the schema as a numpy ndarray a of shape (n, 2, sig_dims), where a[i, 0] are the origins and a[i, 1] are the shapes for slice i.
- libertem.io.dataset.base.decode_swap_2(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
- libertem.io.dataset.base.decode_swap_4(inp, out, idx, native_dtype, rr, origin, shape, ds_shape)[source]
- libertem.io.dataset.base.default_get_read_ranges(start_at_frame, stop_before_frame, roi_nonzero, depth, slices_arr, fileset_arr, sig_shape, bpp, sync_offset=0, extra=None, frame_header_bytes=0, frame_footer_bytes=0)
- libertem.io.dataset.base.get_coordinates(slice_: Slice, ds_shape: Shape, roi=None) ndarray [source]
Returns numpy.ndarray of coordinates that correspond to the frames in the actual navigation space which are part of the current tile or partition.
- Parameters:
slice (Slice) – Describes the location within the dataset with navigation dimension flattened and reduced to the ROI.
ds_shape (Shape) – The original shape of the whole dataset, not influenced by the ROI
roi (numpy.ndarray, optional) – Array of type bool, matching the navigation shape of the dataset
- libertem.io.dataset.base.make_get_read_ranges(px_to_bytes=CPUDispatcher(<function _default_px_to_bytes>), read_ranges_tile_block=CPUDispatcher(<function _default_read_ranges_tile_block>))[source]
Translate the TilingScheme combined with the roi into (pixel)-read-ranges, together with their tile slices.
- Parameters:
start_at_frame – Dataset-global first frame index to read
stop_before_frame – Stop before this frame index
tiling_scheme – Description on how the data should be tiled
fileset_arr – Array of shape (number_of_files, 3) where the last dimension contains the following values: (start_idx, end_idx, file_idx), where [start_idx, end_idx) defines which frame indices are contained in the file.
roi – Region of interest (for the full dataset)
bpp (int) – Bits per pixel, including padding
- Returns:
read_ranges is an ndarray with shape (number_of_tiles, depth, 3) where the last dimension contains: file index, start_byte, stop_byte
- Return type:
(tile_slice, read_ranges)