.. _`advanced udf`:

User-defined functions: advanced topics
=======================================

.. testsetup:: *

    import numpy as np

    from libertem import api
    from libertem.executor.inline import InlineJobExecutor
    from libertem.viz.base import Dummy2DPlot

    ctx = api.Context(executor=InlineJobExecutor(), plot_class=Dummy2DPlot)
    data = np.random.random((16, 16, 32, 32)).astype(np.float32)
    dataset = ctx.load("memory", data=data, sig_dims=2)
    roi = np.random.choice([True, False], dataset.shape.nav)

The UDF interface offers a wide range of features to help implement advanced
functionality and to optimize the performance of a UDF. These features are
optional, in order to keep UDFs that don't need them simple. See
:ref:`user-defined functions` for an introduction to basic topics.

.. _tiled:

Tiled processing
----------------

Many operations can be significantly optimized by working on stacks of frames.
You can often perform
`loop nest optimization <https://en.wikipedia.org/wiki/Loop_nest_optimization>`_
to improve the `locality of reference <https://en.wikipedia.org/wiki/Locality_of_reference>`_,
for example using `numba <https://numba.pydata.org/>`_, or using an optimized
NumPy function. As an example, applying a gain map and subtracting dark frames
can be up to an order of magnitude faster when properly optimized, compared to
a naive NumPy implementation. These optimizations are only possible if you
have access to data from more than one frame.

For very large frames, another problem arises: a stack of frames would be too
large to handle efficiently, as it would no longer fit even into the L3 cache,
which is the largest cache in most CPUs. For these cases, we support a tiled
reading and processing strategy. Tiled means we slice the frame into disjoint
rectangular regions. A tile is then the data from a single rectangular region
for multiple frames.

For example, in the case of K2IS data, frames have a shape of
:code:`(1860, 2048)`. When reading them with the tiled strategy, a single tile
will contain data from 16 subsequent frames, and each rectangle has a shape of
:code:`(930, 16)`, which is the natural block size for K2IS data. That means
the tiles will have a shape of :code:`(16, 930, 16)`, and processing 16 frames
from the dataset means reading 256 individual tiles. Loading a tile of this
size as float32 data still fits comfortably into usual L3 CPU caches
(~1 MB per core), and thus enables efficient processing. As a comparison, a
whole :code:`(1860, 2048)` frame is about 15 MB large, and accessing it
repeatedly means having to load data from the slower main memory.

.. note::
    You may have noticed that we talk about block sizes of 1 MB as efficient
    in the L3 cache, even though many CPUs have larger L3 caches. As the L3
    cache is shared between cores, and LiberTEM tries to use multiple cores,
    the effectively available L3 cache has to be divided by the number of
    cores.
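As a minimal sketch of the interface, the following UDF sums all frames while
applying a gain map, tile by tile. The :code:`gain_map` parameter, the class
and the buffer name are illustrative, not part of the LiberTEM API:

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class GainSumUDF(UDF):
        """Sketch: sum all frames, applying a gain map first."""

        def get_result_buffers(self):
            return {
                'corrected_sum': self.buffer(kind='sig', dtype=np.float32),
            }

        def process_tile(self, tile):
            # self.meta.slice describes where the tile sits within the
            # dataset; cut the matching region from the gain map:
            gain = self.meta.slice.get(self.params.gain_map, sig_only=True)
            # For kind='sig' buffers, the view is already sliced to match
            # the tile, so we can accumulate directly:
            self.results.corrected_sum[:] += gain * tile.sum(axis=0)

        def merge(self, dest, src):
            dest.corrected_sum[:] += src.corrected_sum


    gain_map = np.ones(tuple(dataset.shape.sig), dtype=np.float32)
    ctx.run_udf(dataset=dataset, udf=GainSumUDF(gain_map=gain_map))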
.. _`slice example`:

Real-world example
~~~~~~~~~~~~~~~~~~

The :class:`libertem_blobfinder.udf.correlation.SparseCorrelationUDF` uses
:meth:`~libertem.udf.base.UDFTileMixin.process_tile` to implement a custom
version of :class:`~libertem.udf.masks.ApplyMasksUDF` that works on
log-scaled data. The mask stack is stored in a
:class:`libertem.common.container.MaskContainer` as part of the task data.
Note how the :code:`self.meta.slice` property of type
:class:`~libertem.common.slice.Slice` is used to extract the region from the
mask stack that matches the tile, using the facilities of a
:class:`~libertem.common.container.MaskContainer`. After reshaping,
transposing and log scaling the tile data into the right memory layout, the
mask stack is applied to the data with a dot product.

The result is *added* to the buffer in order to merge it with the results of
the other tiles, because addition is the correct merge function for a dot
product. Other operations would require a different merge function here, for
example :func:`numpy.max` if a global maximum is to be calculated.

.. testsetup::

    from libertem_blobfinder.base.correlation import log_scale

.. testcode::

    def process_tile(self, tile):
        tile_slice = self.meta.slice
        c = self.task_data.mask_container
        tile_t = np.zeros(
            (np.prod(tile.shape[1:]), tile.shape[0]),
            dtype=tile.dtype
        )
        log_scale(tile.reshape((tile.shape[0], -1)).T, out=tile_t)

        sl = c.get(key=tile_slice, transpose=False)
        self.results.corr[:] += sl.dot(tile_t).T

.. _`udf post processing`:

Partition processing
--------------------

Some algorithms can benefit from processing entire partitions, for example if
they require several passes over the data. In most cases,
:ref:`tiled processing <tiled>` will be faster because it uses the L3 cache
more efficiently. For that reason, per-partition processing should only be
used if there are clear indications for it. Implementing
:meth:`~libertem.udf.base.UDFPartitionMixin.process_partition` activates
per-partition processing for a UDF.

Precedence
----------

By default, the UDF interface looks for methods in the order
:meth:`~libertem.udf.base.UDFTileMixin.process_tile`,
:meth:`~libertem.udf.base.UDFFrameMixin.process_frame`,
:meth:`~libertem.udf.base.UDFPartitionMixin.process_partition`, and uses the
first matching implementation.
:meth:`~libertem.udf.base.UDFTileMixin.process_tile` is the most general
method and should allow by-frame and by-partition processing as well.

If a UDF requires more than one implementation, depending on how it is
parametrised or on the data it is supplied with, the
:meth:`~libertem.udf.base.UDF.get_method` method can be overridden to
influence how the UDF will be run, as shown in the sketch below. This method
must return a member of the :attr:`libertem.udf.base.UDF.UDF_METHOD` enum,
and the corresponding processing method must be implemented.

.. versionchanged:: 0.13.0
    :meth:`~libertem.udf.base.UDF.get_method()` was exposed to enable choice
    of processing method at runtime
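A sketch of such a runtime choice, assuming the :code:`FRAME` and
:code:`TILE` members of the :attr:`~libertem.udf.base.UDF.UDF_METHOD` enum;
the :code:`per_frame` parameter, class and buffer names are illustrative:

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class SwitchableUDF(UDF):
        """Sketch: select the processing method from a parameter."""

        def get_result_buffers(self):
            return {
                'intensity': self.buffer(kind='sig', dtype=np.float32),
            }

        def get_method(self):
            # The returned enum member must have a matching
            # process_* implementation:
            if self.params.per_frame:
                return self.UDF_METHOD.FRAME
            return self.UDF_METHOD.TILE

        def process_frame(self, frame):
            self.results.intensity[:] += frame

        def process_tile(self, tile):
            self.results.intensity[:] += tile.sum(axis=0)

        def merge(self, dest, src):
            dest.intensity[:] += src.intensity


    ctx.run_udf(dataset=dataset, udf=SwitchableUDF(per_frame=True))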
Post-processing of partition results
------------------------------------

Post-processing allows you to perform additional processing steps once the
data of a partition is completely processed with
:meth:`~libertem.udf.base.UDFFrameMixin.process_frame`,
:meth:`~libertem.udf.base.UDFTileMixin.process_tile` or
:meth:`~libertem.udf.base.UDFPartitionMixin.process_partition`.
Post-processing is particularly relevant for tiled processing, since it
allows combining the performance benefits of tiled processing for a first
reduction step with subsequent steps that require reduced data from complete
frames or even a complete partition.

A real-world example from
:class:`libertem_blobfinder.udf.correlation.SparseCorrelationUDF`, which
evaluates the correlation maps that have been generated with the dot product
in the previous processing step and places the results in additional result
buffers:

.. testsetup::

    from libertem_blobfinder.base.correlation import evaluate_correlations

.. testcode::

    def postprocess(self):
        steps = 2 * self.params.steps + 1
        corrmaps = self.results.corr.reshape((
            -1,  # frames
            len(self.params.peaks),  # peaks
            steps,  # Y steps
            steps,  # X steps
        ))
        peaks = self.params.peaks
        (centers, refineds, peak_values, peak_elevations) = self.output_buffers()
        for f in range(corrmaps.shape[0]):
            evaluate_correlations(
                corrs=corrmaps[f], peaks=peaks,
                crop_size=self.params.steps,
                out_centers=centers[f], out_refineds=refineds[f],
                out_heights=peak_values[f], out_elevations=peak_elevations[f]
            )

The :meth:`libertem.udf.base.UDFPostprocessMixin.postprocess` method is
called for each partition on the worker process, before the results from
different partitions have been merged.

.. _`udf final post processing`:

Post-processing after merging
-----------------------------

If you want to implement a post-processing step that is run on the main node
after merging result buffers, you can override
:meth:`~libertem.udf.base.UDF.get_results`:

.. testsetup::

    from libertem.udf import UDF

.. testcode::

    class AverageUDF(UDF):
        """
        Like SumUDF, but also computes the average
        """
        def get_result_buffers(self):
            return {
                'sum': self.buffer(kind='sig', dtype=np.float32),
                'num_frames': self.buffer(kind='single', dtype=np.uint64),
                'average': self.buffer(kind='sig', dtype=np.float32, use='result_only'),
            }

        def process_frame(self, frame):
            self.results.sum[:] += frame
            self.results.num_frames[:] += 1

        def merge(self, dest, src):
            dest.sum[:] += src.sum
            dest.num_frames[:] += src.num_frames

        def get_results(self):
            return {
                # NOTE: 'sum' omitted here, will be returned unchanged
                'average': self.results.sum / self.results.num_frames,
            }

    ctx.run_udf(dataset=dataset, udf=AverageUDF())

Note that :meth:`~libertem.udf.base.UDF.get_result_buffers` returns a
placeholder entry for the :code:`average` result using
:code:`use='result_only'`, which is then filled in
:meth:`~libertem.udf.base.UDF.get_results`. We don't need to repeat those
buffers that should be returned unchanged; if you want to omit a buffer from
the results completely, you can declare it as private with
:code:`self.buffer(..., use='private')` in
:meth:`~libertem.udf.base.UDF.get_result_buffers`.

:meth:`~libertem.udf.base.UDF.get_results` should return the results as a
dictionary of numpy arrays, with the keys matching those returned by
:meth:`~libertem.udf.base.UDF.get_result_buffers`.

When returned from :meth:`~libertem.api.Context.run_udf`, all results are
wrapped into :class:`~libertem.common.buffers.BufferWrapper` instances. This
is done primarily to get convenient access to a version of the result that is
suitable for visualization using
:attr:`~libertem.common.buffers.BufferWrapper.data`, even if a :code:`roi`
was used, while still allowing access to the raw result via the
:attr:`~libertem.common.buffers.BufferWrapper.raw_data` attribute.
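For example, with a :code:`kind='nav'` result and a :code:`roi`, the two
access paths differ in shape. A small demonstration, assuming a boolean
:code:`roi` array matching the dataset's navigation shape and using
:class:`~libertem.udf.sumsigudf.SumSigUDF`:

.. code-block:: python

    from libertem.udf.sumsigudf import SumSigUDF

    res = ctx.run_udf(dataset=dataset, udf=SumSigUDF(), roi=roi)
    intensity = res['intensity']
    # `.data` has the full navigation shape; positions outside the ROI are
    # filled with NaN for floating-point buffers, convenient for plotting:
    print(intensity.data.shape)
    # `.raw_data` only holds the values inside the ROI:
    print(intensity.raw_data.shape)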
The detailed rules for buffer declarations, :code:`get_result_buffers` and
:code:`get_results` are:

1) All buffers are declared in :code:`get_result_buffers`
2) If a buffer is only computed in :code:`get_results`, it should be marked
   via :code:`use='result_only'` so it isn't allocated on workers
3) If a buffer is only used as an intermediary result, it should be marked
   via :code:`use='private'`
4) Not including a buffer in :code:`get_results` means it will either be
   passed on unchanged, or dropped if :code:`use='private'`
5) It's an error to omit a :code:`use='result_only'` buffer in
   :code:`get_results`
6) It's an error to include a :code:`use='private'` buffer in
   :code:`get_results`
7) All results are returned from :code:`Context.run_udf` as
   :code:`BufferWrapper` instances
8) By default, if :code:`get_results` is not implemented,
   :code:`use='private'` buffers are dropped, and others are passed through
   unchanged

.. versionadded:: 0.7.0
    :meth:`UDF.get_results` and the :code:`use` argument for
    :meth:`UDF.buffer` were added.

Pre-processing
--------------

Pre-processing allows you to initialize result buffers before processing or
merging. This is particularly useful to set up :code:`dtype=object` buffers,
for example for ragged arrays, or to initialize buffers for operations where
the neutral element is not 0, as shown in the sketch below.
:meth:`libertem.udf.base.UDFPreprocessMixin.preprocess` is executed after all
buffers are allocated, but before the data is processed. On the worker nodes
it is executed with views set for the whole partition masked by the current
ROI. On the central node it is executed with views set for the whole dataset
masked by the ROI.

.. versionadded:: 0.3.0

.. versionchanged:: 0.5.0
    :meth:`libertem.udf.base.UDFPreprocessMixin.preprocess` is executed on
    the main node, too. Views for aux data are set correctly on the main
    node. Previously, it was only executed on the worker nodes.
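As a minimal sketch, a per-pixel minimum has :code:`+inf` as its neutral
element, so the buffer is initialized in :code:`preprocess` (the class and
buffer names are illustrative):

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF
    from libertem.udf.base import UDFPreprocessMixin


    class MinUDF(UDFPreprocessMixin, UDF):
        """Sketch: per-pixel minimum over all frames."""

        def get_result_buffers(self):
            return {
                'minimum': self.buffer(kind='sig', dtype=np.float32),
            }

        def preprocess(self):
            # The neutral element of the minimum is +inf, not 0:
            self.results.minimum[:] = np.inf

        def process_frame(self, frame):
            np.minimum(self.results.minimum, frame, out=self.results.minimum)

        def merge(self, dest, src):
            np.minimum(dest.minimum, src.minimum, out=dest.minimum)


    ctx.run_udf(dataset=dataset, udf=MinUDF())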
.. _`udf valid data masking`:

Valid data masking
------------------

In various intermediate states, result buffers may only be partially filled
with valid data. This comes from the fact that we split the dataset into
partitions and then merge results together as they become available. When
doing live processing, we also have the constraint that not all frames have
been acquired yet, so naturally only positions where we have received and
processed data can be valid in the corresponding result buffer.

Handling partially valid data becomes important when running the UDF. For
example, live plotting needs to know which regions of the results to include.
Also, when using :meth:`~libertem.api.Context.run_udf_iter`, you might only
want to handle the already valid parts of the results.

For UDF results, you can use
:attr:`~libertem.common.buffers.BufferWrapper.valid_mask` to obtain the valid
mask for that particular buffer. You can also use
:attr:`~libertem.common.buffers.BufferWrapper.masked_data` to access the data
of the buffer as a numpy masked array. For example:

.. testsetup::

    from libertem.udf.sumsigudf import SumSigUDF

.. testcode::

    # use `run_udf_iter` to get partial results while the computation is running:
    for res in ctx.run_udf_iter(dataset=dataset, udf=SumSigUDF()):
        # get the BufferWrapper for the 'intensity' buffer of the first UDF:
        buf = res.buffers[0]['intensity']

        # boolean mask that contains True for valid elements:
        mask = buf.valid_mask

        # get a numpy masked array that is limited to the valid elements:
        masked_data = buf.masked_data

By default, the valid masks of results depend on the buffer kind:

=================== ===================================================================
Buffer :code:`kind` Default valid mask
=================== ===================================================================
:code:`'single'`    All-valid
:code:`'sig'`       All-valid
:code:`'nav'`       Parts which have been merged using :meth:`~libertem.udf.UDF.merge`
=================== ===================================================================

In your UDF, you can also customize the valid mask that should be returned.
This is useful if the dataset might contain invalid frames discovered at
runtime, or if intermediate results cannot be fully valid due to missing
information (e.g. calculating gradients across partition boundaries). Another
example is a progressively computed :code:`kind='single'` buffer:
valid-by-default might be incorrect if not enough information has been seen
yet (e.g. for statistical measures or a count of values).

Customizing the valid data mask is done in
:meth:`~libertem.udf.UDF.get_results` with the
:meth:`~libertem.udf.UDF.with_mask` method. For example:

.. testsetup::

    from libertem.udf import UDF
    from libertem.common.math import count_nonzero

.. testcode::

    class CustomValidMask(UDF):
        def get_result_buffers(self):
            nav_shape = self.meta.dataset_shape.nav
            if self.meta.roi is not None:
                nav_shape = (count_nonzero(self.meta.roi),)
            return {
                # Emulate a kind='nav' result buffer using a kind='single'
                # buffer for demonstration. In the real world, such result
                # buffers can be found in the implementations of ptychography
                # and iCoM, since they reconstruct the object in Fourier space.
                'custom_2d': self.buffer(
                    kind='single', dtype=np.float32, extra_shape=nav_shape,
                ),
            }

        def process_frame(self, frame):
            self.results.custom_2d[self.meta.coordinates] = np.sum(frame)

        def get_results(self):
            # in this case, we set the valid mask of the kind='single' buffer
            # to match what a kind='nav' buffer would do.
            valid_nav_mask = self.meta.get_valid_nav_mask()
            return {
                'custom_2d': self.with_mask(
                    self.results.custom_2d,
                    mask=valid_nav_mask.reshape(self.results.custom_2d.shape)
                ),
            }

        def merge(self, dest, src):
            dest.custom_2d += src.custom_2d

    ctx.run_udf(dataset=dataset, udf=CustomValidMask())

This example also shows the
:meth:`~libertem.udf.base.UDFMeta.get_valid_nav_mask` method, which is
available in :meth:`~libertem.udf.UDF.merge` and
:meth:`~libertem.udf.UDF.get_results`, and gives you a mask of the
already-processed navigation elements.

.. versionadded:: 0.14.0
    In previous versions, only :attr:`libertem.udf.base.UDFResults.damage`
    was available for inspecting the mask of already-processed data. This did
    not account for differences between buffers, and was only available when
    iterating over results, i.e. with
    :meth:`~libertem.api.Context.run_udf_iter`.

AUX data
--------

If a parameter is an instance of
:class:`~libertem.common.buffers.BufferWrapper` that was created using the
:meth:`~libertem.udf.base.UDF.aux_data` class method, the UDF interface will
interpret it as auxiliary data. It will set the views for each
tile/frame/partition accordingly, so that accessing the parameter returns a
view of the auxiliary data that matches the data portion currently being
processed. That way, it is possible to pass parameters individually for each
frame, or to mask the signal dimension, as shown in the sketch below.

Note that the :class:`~libertem.common.buffers.BufferWrapper` instance for
AUX data should always be created using the
:meth:`~libertem.udf.base.UDF.aux_data` class method, and not by
instantiating a :class:`~libertem.common.buffers.BufferWrapper` directly,
since :meth:`~libertem.udf.base.UDF.aux_data` ensures that it is set up
correctly.

For masks in the signal dimension that are used for dot products in
combination with per-tile processing, a
:class:`~libertem.common.container.MaskContainer` allows using more advanced
slicing and transformation methods targeted at preparing mask stacks for
optimal dot product performance.
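A minimal sketch of per-frame AUX data: each frame is weighted by its own
scale factor. The :code:`scale` parameter, class and buffer names are
illustrative:

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class WeightedSumUDF(UDF):
        """Sketch: sum frames, weighting each frame individually."""

        def get_result_buffers(self):
            return {
                'wsum': self.buffer(kind='sig', dtype=np.float32),
            }

        def process_frame(self, frame):
            # The view of the AUX buffer is set to the current frame, so
            # self.params.scale is a length-1 slice of the per-frame factors:
            self.results.wsum[:] += self.params.scale[0] * frame

        def merge(self, dest, src):
            dest.wsum[:] += src.wsum


    num_frames = int(np.prod(tuple(dataset.shape.nav)))
    scale = WeightedSumUDF.aux_data(
        data=np.linspace(0, 1, num_frames, dtype=np.float32),
        kind='nav', extra_shape=(1,), dtype=np.float32,
    )
    ctx.run_udf(dataset=dataset, udf=WeightedSumUDF(scale=scale))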
Task data
---------

A UDF can generate task-specific intermediate data on the worker nodes by
defining a :meth:`~libertem.udf.base.UDF.get_task_data` method. The result is
available as an instance of :class:`~libertem.udf.base.UDFData` in
:code:`self.task_data`. Depending on the circumstances, this can be more
efficient than making the data available as a parameter, since it avoids
pickling, network transport and unpickling.

This non-trivial example from
:class:`libertem_blobfinder.udf.correlation.SparseCorrelationUDF` creates a
:class:`~libertem.common.container.MaskContainer` based on the parameters in
:code:`self.params`. This :class:`~libertem.common.container.MaskContainer`
is then available as :code:`self.task_data.mask_container` within the
processing functions.

.. testsetup::

    import functools

    from libertem.common.container import MaskContainer
    import libertem.masks as masks

.. testcode::

    def get_task_data(self):
        match_pattern = self.params.match_pattern
        crop_size = match_pattern.get_crop_size()
        size = (2 * crop_size + 1, 2 * crop_size + 1)
        template = match_pattern.get_mask(sig_shape=size)
        steps = self.params.steps
        peak_offsetY, peak_offsetX = np.mgrid[-steps:steps + 1, -steps:steps + 1]

        offsetY = self.params.peaks[:, 0, np.newaxis, np.newaxis] + peak_offsetY - crop_size
        offsetX = self.params.peaks[:, 1, np.newaxis, np.newaxis] + peak_offsetX - crop_size

        offsetY = offsetY.flatten()
        offsetX = offsetX.flatten()

        stack = functools.partial(
            masks.sparse_template_multi_stack,
            mask_index=range(len(offsetY)),
            offsetX=offsetX,
            offsetY=offsetY,
            template=template,
            imageSizeX=self.meta.dataset_shape.sig[1],
            imageSizeY=self.meta.dataset_shape.sig[0]
        )
        # CSC matrices in combination with transposed data are fastest
        container = MaskContainer(
            mask_factories=stack, dtype=np.float32,
            use_sparse='scipy.sparse.csc'
        )

        kwargs = {
            'mask_container': container,
            'crop_size': crop_size,
        }
        return kwargs

.. testcleanup::

    from libertem_blobfinder.udf.correlation import SparseCorrelationUDF
    from libertem_blobfinder.common.patterns import RadialGradient

    class TestUDF(SparseCorrelationUDF):
        pass

    # Override methods with functions that are defined above
    TestUDF.process_tile = process_tile
    TestUDF.postprocess = postprocess
    TestUDF.get_task_data = get_task_data

    u = TestUDF(
        peaks=np.array([(8, 8)]),
        match_pattern=RadialGradient(2),
        steps=3
    )
    ctx.run_udf(dataset=dataset, udf=u)
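A simpler sketch of the same mechanism: precompute a smoothing kernel once
per task on the worker, instead of shipping it as a parameter. The class,
buffer and key names are illustrative:

.. code-block:: python

    import numpy as np
    from scipy.ndimage import convolve
    from libertem.udf import UDF


    class SmoothedSumUDF(UDF):
        """Sketch: sum frames after smoothing them with a Gaussian kernel."""

        def get_result_buffers(self):
            return {
                'smoothed': self.buffer(kind='sig', dtype=np.float32),
            }

        def get_task_data(self):
            # Computed on the worker node, avoiding pickling and transport:
            y, x = np.ogrid[-3:4, -3:4]
            kernel = np.exp(-(x**2 + y**2) / 4.0)
            return {
                'kernel': (kernel / kernel.sum()).astype(np.float32),
            }

        def process_frame(self, frame):
            self.results.smoothed[:] += convolve(frame, self.task_data.kernel)

        def merge(self, dest, src):
            dest.smoothed[:] += src.smoothed


    ctx.run_udf(dataset=dataset, udf=SmoothedSumUDF())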
Meta information
----------------

Advanced processing routines may require context information about the
processed dataset, the ROI and the current data portion being processed. This
information is available as properties of the
:attr:`libertem.udf.base.UDF.meta` attribute of type
:class:`~libertem.udf.base.UDFMeta`.

Input data shapes and types
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Common applications include allocating buffers with a :code:`dtype` or shape
that matches the dataset or partition via
:attr:`~libertem.udf.base.UDFMeta.dataset_dtype`,
:attr:`~libertem.udf.base.UDFMeta.input_dtype`,
:attr:`~libertem.udf.base.UDFMeta.dataset_shape` and
:attr:`~libertem.udf.base.UDFMeta.partition_shape`.
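For example, a result buffer can derive its :code:`dtype` from the input
dtype. A small sketch (the class and buffer name are illustrative):

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class MaxFrameValueUDF(UDF):
        """Sketch: record the maximum value of each frame."""

        def get_result_buffers(self):
            # Promote the input dtype to at least float32 for the result:
            dtype = np.result_type(self.meta.input_dtype, np.float32)
            return {
                'frame_max': self.buffer(kind='nav', dtype=dtype),
            }

        def process_frame(self, frame):
            self.results.frame_max[:] = np.max(frame)


    ctx.run_udf(dataset=dataset, udf=MaxFrameValueUDF())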
Device class
~~~~~~~~~~~~

.. versionadded:: 0.6.0

The currently used compute device class can be accessed through
:attr:`libertem.udf.base.UDFMeta.device_class`. It defaults to 'cpu' and can
be 'cuda' for UDFs that make use of :ref:`udf cuda` support.

ROI and current slice
~~~~~~~~~~~~~~~~~~~~~

For more advanced applications, the ROI and the currently processed data
portion are available as :attr:`libertem.udf.base.UDFMeta.roi` and
:attr:`libertem.udf.base.UDFMeta.slice`. This allows replacing the built-in
masking behavior of :class:`~libertem.common.buffers.BufferWrapper` for
result buffers and aux data with a custom implementation. The
:ref:`mask container for tiled processing example <slice example>` makes use
of these attributes to employ a
:class:`libertem.common.container.MaskContainer` instead of a
:code:`shape="sig"` buffer, in order to optimize dot product performance and
support sparse masks.

The slice is in the reference frame of the dataset, masked by the current
ROI, with flattened navigation dimension. This example illustrates the
behavior by implementing a custom version of the
:ref:`simple "sum over sig" example `. It allocates a custom result buffer
that matches the navigation dimension as it appears in processing:

.. testcode::

    import numpy as np
    from libertem.udf import UDF

    class PixelsumUDF(UDF):
        def get_result_buffers(self):
            if self.meta.roi is not None:
                navsize = np.count_nonzero(self.meta.roi)
            else:
                navsize = np.prod(self.meta.dataset_shape.nav)
            return {
                'pixelsum_nav_raw': self.buffer(
                    kind="single",
                    dtype=self.meta.dataset_dtype,
                    extra_shape=(navsize, ),
                )
            }

        def merge(self, dest, src):
            dest.pixelsum_nav_raw[:] += src.pixelsum_nav_raw

        def process_frame(self, frame):
            np_slice = self.meta.slice.get(nav_only=True)
            self.results.pixelsum_nav_raw[np_slice] = np.sum(frame)

.. testcleanup::

    pixelsum = PixelsumUDF()
    res = ctx.run_udf(dataset=dataset, udf=pixelsum, roi=roi)

    assert np.allclose(res['pixelsum_nav_raw'].data, dataset.data[roi].sum(axis=(1, 2)))

Coordinates
~~~~~~~~~~~

.. versionadded:: 0.6.0

The coordinates of the current frame, tile or partition within the true
dataset navigation dimension, as opposed to the current slice that is given
in flattened navigation dimensions with the ROI applied, are available
through :attr:`~libertem.udf.base.UDFMeta.coordinates`. The following UDF
simply collects the coordinate info for demonstration purposes. A real-world
example that uses the coordinates is `the UDF implementation of single side
band ptychography `_.

.. testcode::

    import numpy as np
    from libertem.udf import UDF

    class CoordUDF(UDF):
        def get_result_buffers(self):
            # Declare a buffer that fits the coordinates,
            # i.e. one int per nav axis for each nav position
            nav_dims = len(self.meta.dataset_shape.nav)
            return {
                'coords': self.buffer(
                    kind="nav",
                    dtype=int,
                    extra_shape=(nav_dims, ),
                )
            }

        def process_tile(self, tile):
            # Simply copy the coordinates into
            # the result buffer
            self.results.coords[:] = self.meta.coordinates

    my_roi = np.zeros(dataset.shape.nav, dtype=bool)
    my_roi[7, 13] = True
    my_roi[11, 3] = True

    res = ctx.run_udf(
        dataset=dataset,
        udf=CoordUDF(),
        roi=my_roi
    )

    assert np.all(
        res['coords'].raw_data == np.array([(7, 13), (11, 3)])
    )

.. _`udf dtype`:

Preferred input dtype
---------------------

.. versionadded:: 0.4.0

UDFs can override :meth:`~libertem.udf.base.UDF.get_preferred_input_dtype` to
indicate a "lowest common denominator" compatible dtype. The actual input
dtype is determined by combining the indicated preferred dtype with the input
dataset's native dtype using :func:`numpy.result_type`. The default preferred
dtype is :attr:`numpy.float32`. Returning :attr:`UDF.USE_NATIVE_DTYPE`, which
is currently identical to :code:`bool`, will switch to the dataset's native
dtype, since :code:`bool` behaves as a neutral element in
:func:`numpy.result_type`.

If a UDF requires a specific dtype rather than only preferring it, it should
override this method, and additionally check the actual input type, throw an
error when used incorrectly, and/or implement a meaningful conversion in its
processing routine, since indicating a preferred dtype doesn't enforce it.
That way, unsafe conversions are performed explicitly in the UDF rather than
indirectly in the back-end.
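A sketch of a UDF that prefers the dataset's native dtype, counting pixels
above a threshold without converting the data to float first. The
:code:`threshold` parameter and the names are illustrative:

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class ThresholdCountUDF(UDF):
        """Sketch: count pixels above a threshold, per scan position."""

        def get_preferred_input_dtype(self):
            # Combined with the dataset dtype via numpy.result_type();
            # bool is neutral there, so the native dtype is used as-is:
            return self.USE_NATIVE_DTYPE

        def get_result_buffers(self):
            return {
                'count': self.buffer(kind='nav', dtype=np.int64),
            }

        def process_frame(self, frame):
            self.results.count[:] = np.count_nonzero(
                frame > self.params.threshold
            )


    ctx.run_udf(dataset=dataset, udf=ThresholdCountUDF(threshold=0.5))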
.. _`udf cuda`:

CuPy support
------------

.. versionadded:: 0.6.0

LiberTEM can use CUDA devices through `CuPy <https://cupy.dev/>`_. Since CuPy
largely replicates the NumPy array interface, any UDF that uses NumPy for its
main processing can likely be ported to use both CPUs and CUDA devices in
parallel. Some adjustments are often necessary to account for minor
differences between NumPy and CuPy. CuPy is most beneficial for compute-heavy
tasks with good CUDA math library support, such as large Fourier transforms
or matrix products.

In order to activate CuPy processing, a UDF can overwrite the
:meth:`~libertem.udf.base.UDF.get_backends` method. By default this returns
:code:`('numpy',)`, indicating only NumPy support. By returning
:code:`('numpy', 'cupy')` or :code:`('cupy',)`, a UDF activates being run on
both CUDA and CPU workers, or exclusively on CUDA workers. Using :code:`cuda`
instead of :code:`cupy` schedules on CUDA workers, but without using the CuPy
library. This is useful for running code that uses CUDA in a different way,
for example integration of C++ CUDA code, and allows skipping the
installation of CuPy in this situation.

The :attr:`libertem.udf.base.UDF.xp` property points to the :code:`numpy` or
:code:`cupy` module, depending on which back-end is currently used. By using
:code:`self.xp` instead of the usual :code:`np` for NumPy, one can write UDFs
that use the same code for CUDA and CPU processing.

Result buffers can be declared as device arrays by setting
:code:`self.buffer(..., where='device')` in
:meth:`~libertem.udf.base.UDF.get_result_buffers`. That allows keeping data
on the device until a partition is completely processed and the result is
exported to the leader node.

The input argument for :code:`process_*()` functions is already provided as a
CuPy array instead of a NumPy array if CuPy is used.

A UDF should only use one GPU at a time. If :code:`cupy` is used, the correct
device to use is set within CuPy in the back-end and should not be modified
in the UDF itself. If :code:`cuda` is used, it is the responsibility of the
user to set the device ID to the value returned by
:meth:`libertem.common.backend.get_use_cuda`. The environment variable
:code:`CUDA_VISIBLE_DEVICES` can be set *before* any CUDA library is loaded
to control which devices are visible.

The :meth:`~libertem.api.Context.run_udf` method allows setting the
:code:`backends` parameter to :code:`('numpy',)`, :code:`('cupy',)` or
:code:`('cuda',)` to restrict execution to CPU-only or CUDA-only on a hybrid
cluster. This is mostly useful for testing.
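A sketch of a back-end-agnostic UDF using :code:`self.xp` together with a
device buffer (the class and buffer names are illustrative):

.. code-block:: python

    import numpy as np
    from libertem.udf import UDF


    class XpSumUDF(UDF):
        """Sketch: sum over all frames, on CPU or CUDA workers."""

        def get_backends(self):
            return ('numpy', 'cupy')

        def get_result_buffers(self):
            return {
                # where='device' keeps the buffer on the GPU on CUDA workers:
                'sum': self.buffer(kind='sig', dtype=np.float32, where='device'),
            }

        def process_tile(self, tile):
            # self.xp is either the numpy or the cupy module, matching `tile`:
            self.results.sum[:] += self.xp.sum(tile, axis=0)

        def merge(self, dest, src):
            # Merging happens on the main node with NumPy arrays:
            dest.sum[:] += src.sum


    ctx.run_udf(dataset=dataset, udf=XpSumUDF())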
.. _`sparse`:

Sparse arrays
-------------

.. versionadded:: 0.11.0

As an extension of :ref:`udf cuda`, LiberTEM also supports supplying UDFs
with tiles in sparse array formats, for both CPU and GPU. A UDF specifies the
supported array backends by overwriting
:meth:`~libertem.udf.base.UDF.get_backends` to return an iterable with the
supported formats in order of preference. Each array format is associated
with a device class, so that :ref:`udf cuda` works analogously for both dense
and sparse formats. Dense CPU and GPU arrays are specified with the backends
described in :ref:`udf cuda`, so that the extension to sparse arrays is
backwards-compatible.

The possible backends supported by LiberTEM are available as the following
constants:

* :attr:`libertem.udf.base.UDF.BACKEND_NUMPY`
* :attr:`~libertem.udf.base.UDF.BACKEND_CUPY`
* :attr:`~libertem.udf.base.UDF.BACKEND_CUDA`
* :attr:`~libertem.udf.base.UDF.BACKEND_SPARSE_COO`
* :attr:`~libertem.udf.base.UDF.BACKEND_SPARSE_GCXS`
* :attr:`~libertem.udf.base.UDF.BACKEND_SPARSE_DOK`
* :attr:`~libertem.udf.base.UDF.BACKEND_SCIPY_COO`
* :attr:`~libertem.udf.base.UDF.BACKEND_SCIPY_CSR`
* :attr:`~libertem.udf.base.UDF.BACKEND_SCIPY_CSC`
* :attr:`~libertem.udf.base.UDF.BACKEND_CUPY_SCIPY_COO`
* :attr:`~libertem.udf.base.UDF.BACKEND_CUPY_SCIPY_CSR`
* :attr:`~libertem.udf.base.UDF.BACKEND_CUPY_SCIPY_CSC`

:attr:`~libertem.udf.base.UDF.BACKEND_ALL` is a tuple of all backends that
are recommended for use in UDFs.

.. versionadded:: 0.12.0
    The sets :attr:`~libertem.udf.base.UDF.CPU_BACKENDS`,
    :attr:`~libertem.udf.base.UDF.CUDA_BACKENDS`, and
    :attr:`~libertem.udf.base.UDF.CUPY_BACKENDS` categorize backends by
    device class. :attr:`~libertem.udf.base.UDF.SPARSE_BACKENDS` and
    :attr:`~libertem.udf.base.UDF.DENSE_BACKENDS` distinguish between dense
    and sparse arrays.

The backends in :attr:`~libertem.udf.base.UDF.D2_BACKENDS` only support 2D
matrices, while the ones in :attr:`~libertem.udf.base.UDF.ND_BACKENDS`
support n-dimensional arrays. The frame, tile or partition is supplied with
flattened signal dimensions for 2D-only backends. Furthermore, frames include
a navigation dimension of 1 with these backends. The backend that is used for
a partition is available through
:attr:`libertem.udf.base.UDFMeta.array_backend` at runtime. Please note that
it can differ between partitions.

Internally, LiberTEM calculates an execution plan that matches the
capabilities of all UDFs in a run with the capabilities of the dataset and
the device class, so that conversion overheads are minimized. LiberTEM can
process data in sparse form from start to finish if a dataset that can
produce tiles in a sparse format, such as :ref:`raw csr`, is combined with a
set of UDFs that all support a sparse backend.

Since operations on sparse arrays often use an API modelled after NumPy, but
return data in various backends that may or may not allow direct assignment
into a result buffer, the :meth:`libertem.udf.base.UDF.forbuf` method
converts an array to a backend that is suitable for assigning into the
specified result buffer. It also takes care of reshaping from 2D to nD as
necessary.

A simplified example implementation based on
:class:`libertem.udf.sumsigudf.SumSigUDF` that demonstrates how support for
all array formats can be implemented:

.. testsetup::

    from libertem.udf import UDF

.. testcode::

    class SumSigUDF(UDF):
        def get_backends(self):
            # Support all recommended array backends in LiberTEM.
            # Please note that their APIs can differ so that comprehensive
            # tests with all supported backends are required. Also, the list
            # of supported backends in LiberTEM may grow in the future, which
            # can require adaptations in a UDF that uses this constant.
            return self.BACKEND_ALL

        def get_result_buffers(self):
            dtype = np.result_type(self.meta.input_dtype, np.float32)
            return {
                'intensity': self.buffer(
                    kind="nav", dtype=dtype, where='device'
                ),
            }

        def process_tile(self, tile):
            # Show the backend that is currently used
            print(self.meta.array_backend)
            # Note the following points:
            # * Using self.forbuf(arr, target) to make the result
            #   compatible with the result buffer.
            # * Preemptively flatten the sig dimensions so that
            #   2D and nD arrays work the same.
            # * Work around API peculiarities, such as the axis keyword
            #   that is only partially supported in cupyx.scipy.sparse.
            self.results.intensity[:] += self.forbuf(
                np.sum(
                    # Flatten and sum axis 1 for cupyx.scipy.sparse support
                    tile.reshape((tile.shape[0], -1)),
                    axis=1
                ),
                self.results.intensity
            )

    # Empty memory dataset for testing that returns SCIPY_CSR tiles
    ds = ctx.load(
        'memory',
        datashape=(23, 42, 17, 4),
        sig_dims=2,
        array_backends=(UDF.BACKEND_SCIPY_CSR, ),
        num_partitions=2,
    )

    ctx.run_udf(dataset=ds, udf=SumSigUDF())

.. testoutput::

    scipy.sparse.csr_matrix
    scipy.sparse.csr_matrix

See the implementation of :class:`libertem.udf.masks.ApplyMasksUDF` and
:class:`libertem.udf.stddev.StdDevUDF` for non-trivial examples of UDFs that
support a wide range of array formats!

.. note::
    The underlying library for array type detection and conversion is
    available independently of LiberTEM at
    https://github.com/LiberTEM/sparseconverter/.

.. _`threading`:

Threading
---------

By default, LiberTEM uses multiprocessing with one process per CPU core for
offline processing, using the class
:class:`~libertem.executor.dask.DaskJobExecutor`. In that scenario, UDFs
should only use a single thread to avoid oversubscription. However, when
running with a single-process, single-thread executor like
:class:`~libertem.executor.inline.InlineJobExecutor`, multiple threads can be
used, which might be advantageous in combination with the inline executor.
The thread count for many common numerics libraries is set automatically by
LiberTEM, see :attr:`~libertem.udf.base.UDFMeta.threads_per_worker`. For
other cases, the thread count on a worker should be set by the user according
to :code:`self.meta.threads_per_worker`.

Multithreaded executors were introduced with release 0.9.0, see
:ref:`executors`. They run UDF functions in parallel threads within the same
process. This can introduce issues with thread safety, for example shared
objects being changed concurrently by multiple threads. The LiberTEM
internals and core UDFs are tested to work with these executors, but user
code may break unexpectedly. `PyFFTW interface caching is a known issue of
this category `_. For that reason, the threaded executors should be
considered experimental for the time being.

Furthermore, setting and re-setting any global variable, for example the
thread count of an external library, should be protected with a reentrant
locking mechanism, as sketched below. The pyFFTW cache is disabled with
threaded executors because of this known bug, which can have a negative
impact on performance. For performance optimization with pyFFTW, users could
use the `builder interface of PyFFTW `_ or `use the native FFTW object
interface `_.

Running multiple LiberTEM :class:`~libertem.api.Context` objects or executors
in parallel threads is not tested and can lead to unexpected interactions.
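A sketch of such a guard, assuming a hypothetical library handle :code:`lib`
with :code:`get_num_threads()`/:code:`set_num_threads()` calls; the helper is
illustrative, not a LiberTEM API:

.. code-block:: python

    import contextlib
    import threading

    # One lock per process, shared by all UDF tasks running in its threads:
    _global_state_lock = threading.RLock()


    @contextlib.contextmanager
    def thread_count_guard(lib, num_threads):
        # Temporarily set a library-global thread count under a reentrant
        # lock, so concurrent tasks can't interleave their set/reset calls:
        with _global_state_lock:
            old = lib.get_num_threads()  # hypothetical API
            lib.set_num_threads(num_threads)
            try:
                yield
            finally:
                lib.set_num_threads(old)

Inside a :code:`process_*` method, such a guard could wrap the calls into the
library, for example with :code:`self.meta.threads_per_worker` as the thread
count.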
.. _auto UDF:

Auto UDF
--------

The :class:`~libertem.udf.AutoUDF` class and
:meth:`~libertem.api.Context.map` method allow running simple functions that
accept a frame as their only parameter, with an auto-generated
:code:`kind="nav"` result buffer, over a dataset ad-hoc without defining a
UDF class. For more advanced processing, such as custom merge functions,
post-processing or performance optimization through tiled processing,
defining a UDF class is required.

As an alternative to Auto UDF, you can use the
:meth:`~libertem.contrib.daskadapter.make_dask_array` method to create a
`dask.array <https://docs.dask.org/en/stable/array.html>`_ from a
:class:`~libertem.io.dataset.base.DataSet` to perform calculations. See
:ref:`Integration with Dask arrays` for more details.

The :class:`~libertem.udf.AutoUDF` class determines the output shape and type
by calling the function with a mock-up frame of the same type and shape as a
real detector frame and converting the return value to a NumPy array. The
:code:`extra_shape` and :code:`dtype` parameters for the result buffer are
derived automatically from this NumPy array.

Additional constant parameters can be passed to the function via
:func:`functools.partial`, for example. The return value should be much
smaller than the input size for this to work efficiently.

Example: Calculate the sum over the last signal axis.

.. testcode::

    import functools

    result = ctx.map(
        dataset=dataset,
        f=functools.partial(np.sum, axis=-1)
    )

    # or alternatively:
    from libertem.udf import AutoUDF

    udf = AutoUDF(f=functools.partial(np.sum, axis=-1))
    result = ctx.run_udf(dataset=dataset, udf=udf)

.. _udf merge_all:

One-step merge (`merge_all`)
----------------------------

.. versionadded:: 0.9.0

.. note::
    The interface described here is experimental and therefore subject to
    change. See :ref:`dask merge_all` for more information.

With the release of the
:class:`~libertem.executor.delayed.DelayedJobExecutor`, UDFs have the option
to define a method :code:`merge_all()`, which is used by this executor to
perform a one-step merge of the results from all partitions. This merge is
only applied by this executor, and as a result it is only used when computing
with :code:`dask.delayed`.

In the case of :code:`kind='nav'` result buffers only, and no custom merge
logic, the :code:`merge_all` implementation is automatically provided by the
base UDF class. If no :code:`merge_all` is provided, the standard
:code:`merge` function is used via a different mechanism.

.. note::
    When using :code:`merge_all`, no attempt is made to verify that it
    functions identically to the :code:`merge` function, which remains a
    requirement of the UDF implementation if implementing a custom merge and
    using other executors.
The :code:`merge_all` function must have the following signature:

.. code-block:: python

    def merge_all(self, ordered_results):
        ...
        return {result_name: merged_result_array, ...}

where :code:`ordered_results` is an ordered dictionary mapping the
:class:`~libertem.common.slice.Slice` for each partition to a dictionary of
partial results keyed by result name. An example :code:`merge_all` to sum all
partial results for a :code:`kind='sig'` result buffer named
:code:`'intensity'` would be the following:

.. code-block:: python

    def merge_all(self, ordered_results):
        intensity = np.stack([b.intensity for b in ordered_results.values()]).sum(axis=0)
        return {
            'intensity': intensity
        }

The ordering of the partial results is such that the
:class:`~libertem.common.slice.Slice` objects which are the dictionary keys
are sequential in the flattened navigation dimension. The user can therefore
safely concatenate the partial results for a given result buffer to get a
whole-analysis-sized array with a flat navigation dimension, as sketched
below. In the case of :code:`kind='nav'` buffers, the returned arrays must be
in this same flat navigation shape. When an ROI has been provided while
running the UDF, the number of elements in the partial results will
correspond to the number of valid ROI pixels; the concatenated result will be
reshaped into a full dataset-sized array by LiberTEM after :code:`merge_all`
has been called.

.. note::
    When using :class:`~libertem.executor.delayed.DelayedJobExecutor` and in
    particular :code:`merge_all`, the user is operating directly on
    `dask.array.Array `_ objects built for lazy computation. The Dask API is
    largely compatible with NumPy, and will lazily build a task graph from
    normal NumPy functions (e.g. :code:`np.stack(arrays).sum()` above).
    However, care must be taken to avoid triggering eager execution
    accidentally, for example by casting using Python builtins such as
    :code:`int(dask_value)`. For a degree of certainty, the user is
    encouraged to consult the `Dask Array API `_ when building
    :code:`merge_all` functions. The same advice applies to any
    post-processing applied after merging
    (:ref:`udf final post processing`).

The return value of :code:`merge_all` must be a dictionary of merged result
arrays, with the keys matching the declared result buffers. There is,
however, no requirement to return merged results for all existing buffers,
though any that are missing will not contain results from the computation and
are likely to be filled with zeros.
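A sketch of that concatenation for a :code:`kind='nav'` result buffer named
:code:`'intensity'` (the buffer name is illustrative):

.. code-block:: python

    import numpy as np

    def merge_all(self, ordered_results):
        # ordered_results is sorted by partition Slice, so concatenating
        # along the first axis yields the result in flat navigation order:
        intensity = np.concatenate(
            [b.intensity for b in ordered_results.values()]
        )
        return {
            'intensity': intensity
        }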
Dynamically updating UDF parameters
-----------------------------------

.. versionadded:: 0.12.0

This is an experimental API, which we make explicit by naming the method
:code:`update_parameters_experimental`. It is not guaranteed to work with all
UDFs.

Running UDFs in a feedback loop, for example on a live data stream using
LiberTEM-live, is not only helpful for adjusting microscope parameters on the
fly to match the reconstruction, but also for adjusting and fine-tuning UDF
parameters. Usually, this would be implemented like this:

.. code-block:: python

    while True:
        udf = YourUDF(**get_new_parameters())
        ctx.run_udf(dataset=ds, udf=udf, plots=True)

This works, but limits the feedback loop to one update per complete run,
which can be annoying, especially for longer dwell times. Instead, the
feedback loop can be shortened by updating parameters as soon as the first
results are available. This is accomplished by using
:meth:`~libertem.api.Context.run_udf_iter` instead of
:meth:`~libertem.api.Context.run_udf`, and calling the
:meth:`~libertem.api.ResultGenerator.update_parameters_experimental` method:

.. code-block:: python

    udf = YourUDF(**parameters)
    while True:
        result_iter = ctx.run_udf_iter(dataset=ds, udf=[udf])
        for part_res in result_iter:
            result_iter.update_parameters_experimental([
                {'some_parameter': get_new_value()},
            ])

Note that this works when passing multiple UDFs into
:meth:`~libertem.api.Context.run_udf_iter` as a list - we need to pass a
:code:`dict` for each of them into
:meth:`~libertem.api.ResultGenerator.update_parameters_experimental`. The
:code:`dict` only needs to contain items for the parameters you want to
update, and it can be an empty :code:`dict` if you don't want to update any
parameters for the corresponding UDF.

The call to :code:`get_new_value` here stands for any mechanism to obtain an
updated value, be it querying a graphical user interface or automatically
calculating a new value based on the latest result from the UDFs.

.. note::
    This API is implemented for the following executor types:
    :code:`'inline'`, :code:`'dask'`, :code:`'pipelined'`, and
    :code:`'threads'`, as specified in
    :func:`~libertem.api.Context.make_with`. This means that if you don't
    specify an executor, with either :class:`~libertem.api.Context` or
    :class:`~libertem_live.api.LiveContext` from LiberTEM-live, it should
    just work. Note also that different executors schedule tasks differently,
    so especially in the offline case there may be a bit more latency before
    the new parameter values are available in the UDFs.