.. _`user-defined functions`: User-defined functions (UDFs) ============================= .. testsetup:: * import numpy as np from libertem import api from libertem.executor.inline import InlineJobExecutor from libertem.udf import UDF as RealUDF # We override UDF in such a way that it can be used # without implementing all methods class UDF(RealUDF): def get_result_buffers(self): return {} def process_frame(self, frame): pass class YourUDF(RealUDF): def get_result_buffers(self): return {'buf1': self.buffer(kind="nav")} def process_frame(self, frame): self.results.buf1[:] = 42 ctx = api.Context(executor=InlineJobExecutor()) data = np.random.random((16, 16, 32, 32)).astype(np.float32) dataset = ctx.load("memory", data=data, sig_dims=2) roi = np.random.choice([True, False], dataset.shape.nav) udf = YourUDF() A common case for analyzing big EM data sets is running a reduction operation on each individual detector frame or other small subsets of a data set and then combining the results of these reductions to form the complete result. This should cover a wide range of use cases, from simple mathematical operations, for example statistics, to complex image processing and analysis, like feature extraction. The user-defined functions (UDF) interface of LiberTEM allows you to define and run your own reduction functions easily, without having to worry about parallelizing, I/O, or the details of buffer management. This corresponds to a simplified `MapReduce programming model `_, where the intermediate re-keying and shuffling step is omitted. LiberTEM ships with some :ref:`utility UDFs ` that implement general functionality: * :ref:`Sum ` * :ref:`Logsum ` * :ref:`StdDev ` * :ref:`SumSig ` * :ref:`Masks and other linear operations ` * :ref:`Pick ` Also, LiberTEM includes :ref:`ready-to-use application-specific UDFs `. It can be helpful to review :ref:`some general concepts ` before reading the following sections. Getting started --------------- The easiest way of running a function over your data is using the :meth:`~libertem.api.Context.map` method of the LiberTEM API. For example, to calculate the sum over the last signal axis: .. testcode:: autoudf import functools import numpy as np result = ctx.map( dataset=dataset, f=functools.partial(np.sum, axis=-1) ) # access the result as NumPy array: np.array(result) # or, alternatively: result.data The function specified via the :code:`f` parameter is called for each frame / diffraction pattern. See :ref:`auto UDF` for more details. This is most suited for simple functions; once you have parameters or want to re-use some data across function calls, you should create a :class:`~libertem.udf.base.UDF` subclass instead. :func:`functools.partial` is a higher-order function that allows to create a new function by wrapping an existing function and passing additional parameters to it. In this case, the resulting call to :func:`numpy.sum` within :code:`ctx.map(...)` is :code:`numpy.sum(frame, axis=-1)`. See https://docs.python.org/3/library/functools.html#functools.partial for more details. Example notebook ---------------- See the following notebook for a demonstration of basic UDF functionality. It can be downloaded `from our example collection on GitHub `_. .. toctree:: udf/introduction .. _`how UDFs work`: How UDFs works -------------- .. image:: ./images/udf-diagram.svg :width: 450 :alt: Diagram with the data flow of a UDF. To allow for parallel processing, data is first divided into partitions along the navigation axes, which are worked on by different worker processes. Then, for each frame of a partition, a user-defined function :meth:`~libertem.udf.base.UDFFrameMixin.process_frame` is called, which is free to do any imaginable processing. As a result of splitting the data set into partitions, the results then need to be merged back together. This is accomplished by calling the :meth:`~libertem.udf.base.UDF.merge` method after all frames of a partition are processed. In pseudocode, data is processed in the following way: .. code-block:: python result = empty for partition in get_partitions(dataset): partition_result = empty for frame, frame_slice in get_frames(partition): frame_result = process_frame(frame) partition_result[frame_slice] = frame_result merge(dest=result, src=partition_result) In reality, the loop over partitions is run in parallel using multiple worker processes, potentially :ref:`on multiple computers `. The loop over individual frames is run in the worker processes, and the merge function is run in the main process, accumulating the results, every time the results for a partition are available. In addition to :meth:`~libertem.udf.base.UDFFrameMixin.process_frame`, there are two more methods available for overriding, to work on larger/different units of data at the same time: :meth:`~libertem.udf.base.UDFTileMixin.process_tile` and :meth:`~libertem.udf.base.UDFPartitionMixin.process_partition`. They can be used for optimizing some operations, and are documented in the :ref:`advanced topics ` section. More about UDFs --------------- Now would be a good time to :ref:`read more about implementing UDFs ` and :ref:`advanced UDF functionality `. The :ref:`general section on debugging ` helps with resolving issues. Once you have your UDF working, you can proceed to :ref:`UDF profiling ` to gain insights into the efficiency of your UDF. .. toctree:: udf/basic udf/advanced udf/profiling .. seealso:: :ref:`udf reference` API documentation for UDFs