Profiling UDFs

To bring your UDF to production, it may be necessary to look into the performance and efficiency of your function. There may be some low-hanging fruit to optimize away, but they only really become visible once you profile your code. Just looking at code and guessing which parts are expensive can be misleading and cause you to spend time optimizing the wrong part of your code. Always measure!

Profiling means instrumenting the execution of a program and finding out which parts of your code use how much of a given resource (for example, CPU time, or memory).

Prerequisite: `InlineJobExecutor`

By default, your UDF will be run in a multi-process or multi-threaded environment. This makes profiling a challenge, since the usual profiling tools will only capture information on the main process that performs the final reduction and not the processes that do most of the work.

So, in order to use the default Python profiling mechanisms, all parts of the UDF should be executed in the same single thread. LiberTEM provides what it calls InlineJobExecutor for this purpose. Improving the execution time in a single-threaded executor will almost always improve the multi-process execution time. Nevertheless, you should confirm the impact of changes in your production environment.

To use the InlineJobExecutor, pass it to the Context on construction:

from libertem.executor.inline import InlineJobExecutor
ctx = Context(executor=InlineJobExecutor())

Then, you can continue as usual, loading data, executing your UDF, etc.

Using progress bar

A progress bar is a graphical tool that shows a far a process has progressed. It can be used to assess the partitioning of data, since the progress is indicated on a per partition basis in the following way:

res = ctx.run_udf(udf=fit_udf, dataset=ds, roi=roi, progress=True)
> 100%|██████████| 18/18 [01:17<00:00,  4.29s/it]

This run is using 18 partitions, and took 4.29s per partition.

Line profiling using line_profiler

Profiling comes in different forms, some will show you how much time each function of a whole program took, others can give you a line-by-line overview for functions you are interested in, like line_profiler. For most UDFs this will be sufficient for performance analysis.

First, you need to install the line_profiler package via pip, having your conda environment or virtualenv activated:

(libertem) $ python -m pip install line_profiler

Then the easiest way to get started is to use the IPython/Jupyter integration of line_profiler. Put %load_ext line_profiler somewhere in your notebook, then you can use the %lprun magic command to run your UDF while profiling:

%lprun -f YourUDF.get_task_data -f YourUDF.process_frame ctx.run_udf(dataset=dataset, YourUDF())

Note the repeated -f arguments - you can list any methods of your UDF you are interested in, or even other Python functions you are directly or indirectly using. If you had some complicated code in YourUDF.merge, you would include -f YourUDF.merge in the %lprun call.

This is how the output could look like:

Timer unit: 1e-06 s

Total time: 0.002283 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: get_task_data at line 145

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
                                             def get_task_data(self):
                                                 """
                                                 Updates `task_data`
   148
                                                 Returns
                                                 -------
                                                 kwargs : dict
                                                 A dictionary with the following keys:
                                                     kwargs['aperture'] : array-like
                                                     Side band filter aperture (mask)
                                                     kwargs['slice'] : slice
                                                     Slice for slicing FFT of the hologram
                                                 """
   158
       2         48.0     24.0      2.1          out_shape = self.params.out_shape
       2         51.0     25.5      2.2          sy, sx = self.meta.partition_shape.sig
       2          5.0      2.5      0.2          oy, ox = out_shape
       2          7.0      3.5      0.3          f_sampling = (1. / oy, 1. / ox)
       2        292.0    146.0     12.8          sb_size = self.params.sb_size * np.mean(f_sampling)
       2        261.0    130.5     11.4          sb_smoothness = sb_size * self.params.sb_smoothness * np.mean(f_sampling)
   165
       2       1172.0    586.0     51.3          f_freq = freq_array(out_shape)
       2        263.0    131.5     11.5          aperture = aperture_function(f_freq, sb_size, sb_smoothness)
   168
       2         64.0     32.0      2.8          y_min = int(sy / 2 - oy / 2)
       2         37.0     18.5      1.6          y_max = int(sy / 2 + oy / 2)
       2         32.0     16.0      1.4          x_min = int(sx / 2 - ox / 2)
       2         30.0     15.0      1.3          x_max = int(sx / 2 + oy / 2)
       2          8.0      4.0      0.4          slice_fft = (slice(y_min, y_max), slice(x_min, x_max))
   174
                                                 kwargs = {
       2          4.0      2.0      0.2              'aperture': aperture,
       2          6.0      3.0      0.3              'slice': slice_fft
                                                 }
       2          3.0      1.5      0.1          return kwargs

Total time: 63.748 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: process_frame at line 181

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
                                             def process_frame(self, frame):
                                                 """
                                                 Reconstructs holograms outputting results into 'wave'
   184
                                                 Parameters
                                                 ----------
                                                 frame
                                                    single frame (hologram) of the data
                                                 """
      16        154.0      9.6      0.0          if not self.params.precision:
                                                     frame = frame.astype(np.float32)
                                                 # size_x, size_y = self.params.out_shape
      16         81.0      5.1      0.0          frame_size = self.meta.partition_shape.sig
      16         58.0      3.6      0.0          sb_pos = self.params.sb_position
      16         66.0      4.1      0.0          aperture = self.task_data.aperture
      16         52.0      3.2      0.0          slice_fft = self.task_data.slice
   197
      16   59291808.0 3705738.0     93.0          fft_frame = fft2(frame) / np.prod(frame_size)
      16    2189960.0 136872.5      3.4          fft_frame = np.roll(fft_frame, sb_pos, axis=(0, 1))
   200
      16    2258700.0 141168.8      3.5          fft_frame = fftshift(fftshift(fft_frame)[slice_fft])
   202
      16        816.0     51.0      0.0          fft_frame = fft_frame * aperture
   204
      16       5957.0    372.3      0.0          wav = ifft2(fft_frame) * np.prod(frame_size)
      16        364.0     22.8      0.0          self.results.wave[:] = wav

Things to note:

get_task_data takes a very small amount of time, compared to process_frame. It does not make sense to concentrate on optimizing get_task_data at all, in this case!

In process_frame, the fft2 call takes up most time, so that is where we should direct our efforts. Improving, for example, the calls to fftshift would give us a max speed-up of a few percent - and only, if we manage to dramatically improve their execution time!

line_profiler doesn’t give information about individual expressions - sometimes you have to put expressions on their own line to see their individual contributions to the execution time. See the fft2 and np.prod calls on the hottest line in the profile!

After successfully improving on the profiled times, always re-run with profiling disabled and without InlineJobExecutor and measure the total time, for example using %%time. This makes sure that your optimizations actually work in a production environment!

The usual benchmarking rules apply - for example, try to run the profiling on an otherwise idle system, otherwise you can get noisy results.

Single-threaded execution can be quite slow compared to using LiberTEM in production - if it is too slow for your taste, you can run your UDF on a subset of your data using a region of interest.

Profiling UDFs

Prerequisite: InlineJobExecutor

Using progress bar

Line profiling using line_profiler

Prerequisite: `InlineJobExecutor`