Profiling UDFs

To bring your UDF to production, it may be necessary to look into the performance and efficiency of your function. There may be some low-hanging fruit to optimize away, but they only really become visible once you profile your code. Just looking at code and guessing which parts are expensive can be misleading and cause you to spend time optimizing the wrong part of your code. Always measure!

Profiling means instrumenting the execution of a program and finding out which parts of your code use how much of a given resource (for example, CPU time, or memory).

Prerequisite: InlineJobExecutor

By default, your UDF will be run in a multi-process or multi-threaded environment. This makes profiling a challenge, since the usual profiling tools will only capture information on the main process that performs the final reduction and not the processes that do most of the work.

So, in order to use the default Python profiling mechanisms, all parts of the UDF should be executed in the same single thread. LiberTEM provides what it calls InlineJobExecutor for this purpose. Improving the execution time in a single-threaded executor will almost always improve the multi-process execution time. Nevertheless, you should confirm the impact of changes in your production environment.

To use the InlineJobExecutor, pass it to the Context on construction:

from libertem.executor.inline import InlineJobExecutor
ctx = Context(executor=InlineJobExecutor())

Then, you can continue as usual, loading data, executing your UDF, etc.

Using progress bar

A progress bar is a graphical tool that shows a far a process has progressed. It can be used to assess the partitioning of data, since the progress is indicated on a per partition basis in the following way:

res = ctx.run_udf(udf=fit_udf, dataset=ds, roi=roi, progress=True)
> 100%|██████████| 18/18 [01:17<00:00,  4.29s/it]

This run is using 18 partitions, and took 4.29s per partition.

Line profiling using line_profiler

Profiling comes in different forms, some will show you how much time each function of a whole program took, others can give you a line-by-line overview for functions you are interested in, like line_profiler. For most UDFs this will be sufficient for performance analysis.

First, you need to install the line_profiler package via pip, having your conda environment or virtualenv activated:

(libertem) $ python -m pip install line_profiler

Then the easiest way to get started is to use the IPython/Jupyter integration of line_profiler. Put %load_ext line_profiler somewhere in your notebook, then you can use the %lprun magic command to run your UDF while profiling:

%lprun -f YourUDF.get_task_data -f YourUDF.process_frame ctx.run_udf(dataset=dataset, YourUDF())

Note the repeated -f arguments - you can list any methods of your UDF you are interested in, or even other Python functions you are directly or indirectly using. If you had some complicated code in YourUDF.merge, you would include -f YourUDF.merge in the %lprun call.

This is how the output could look like:

Timer unit: 1e-06 s

Total time: 0.002283 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: get_task_data at line 145

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   145                                               def get_task_data(self):
   146                                                   """
   147                                                   Updates `task_data`
   148
   149                                                   Returns
   150                                                   -------
   151                                                   kwargs : dict
   152                                                   A dictionary with the following keys:
   153                                                       kwargs['aperture'] : array-like
   154                                                       Side band filter aperture (mask)
   155                                                       kwargs['slice'] : slice
   156                                                       Slice for slicing FFT of the hologram
   157                                                   """
   158
   159         2         48.0     24.0      2.1          out_shape = self.params.out_shape
   160         2         51.0     25.5      2.2          sy, sx = self.meta.partition_shape.sig
   161         2          5.0      2.5      0.2          oy, ox = out_shape
   162         2          7.0      3.5      0.3          f_sampling = (1. / oy, 1. / ox)
   163         2        292.0    146.0     12.8          sb_size = self.params.sb_size * np.mean(f_sampling)
   164         2        261.0    130.5     11.4          sb_smoothness = sb_size * self.params.sb_smoothness * np.mean(f_sampling)
   165
   166         2       1172.0    586.0     51.3          f_freq = freq_array(out_shape)
   167         2        263.0    131.5     11.5          aperture = aperture_function(f_freq, sb_size, sb_smoothness)
   168
   169         2         64.0     32.0      2.8          y_min = int(sy / 2 - oy / 2)
   170         2         37.0     18.5      1.6          y_max = int(sy / 2 + oy / 2)
   171         2         32.0     16.0      1.4          x_min = int(sx / 2 - ox / 2)
   172         2         30.0     15.0      1.3          x_max = int(sx / 2 + oy / 2)
   173         2          8.0      4.0      0.4          slice_fft = (slice(y_min, y_max), slice(x_min, x_max))
   174
   175                                                   kwargs = {
   176         2          4.0      2.0      0.2              'aperture': aperture,
   177         2          6.0      3.0      0.3              'slice': slice_fft
   178                                                   }
   179         2          3.0      1.5      0.1          return kwargs

Total time: 63.748 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: process_frame at line 181

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   181                                               def process_frame(self, frame):
   182                                                   """
   183                                                   Reconstructs holograms outputting results into 'wave'
   184
   185                                                   Parameters
   186                                                   ----------
   187                                                   frame
   188                                                      single frame (hologram) of the data
   189                                                   """
   190        16        154.0      9.6      0.0          if not self.params.precision:
   191                                                       frame = frame.astype(np.float32)
   192                                                   # size_x, size_y = self.params.out_shape
   193        16         81.0      5.1      0.0          frame_size = self.meta.partition_shape.sig
   194        16         58.0      3.6      0.0          sb_pos = self.params.sb_position
   195        16         66.0      4.1      0.0          aperture = self.task_data.aperture
   196        16         52.0      3.2      0.0          slice_fft = self.task_data.slice
   197
   198        16   59291808.0 3705738.0     93.0          fft_frame = fft2(frame) / np.prod(frame_size)
   199        16    2189960.0 136872.5      3.4          fft_frame = np.roll(fft_frame, sb_pos, axis=(0, 1))
   200
   201        16    2258700.0 141168.8      3.5          fft_frame = fftshift(fftshift(fft_frame)[slice_fft])
   202
   203        16        816.0     51.0      0.0          fft_frame = fft_frame * aperture
   204
   205        16       5957.0    372.3      0.0          wav = ifft2(fft_frame) * np.prod(frame_size)
   206        16        364.0     22.8      0.0          self.results.wave[:] = wav

Things to note:

  • get_task_data takes a very small amount of time, compared to process_frame. It does not make sense to concentrate on optimizing get_task_data at all, in this case!

  • In process_frame, the fft2 call takes up most time, so that is where we should direct our efforts. Improving, for example, the calls to fftshift would give us a max speed-up of a few percent - and only, if we manage to dramatically improve their execution time!

  • line_profiler doesn’t give information about individual expressions - sometimes you have to put expressions on their own line to see their individual contributions to the execution time. See the fft2 and np.prod calls on the hottest line in the profile!

  • After successfully improving on the profiled times, always re-run with profiling disabled and without InlineJobExecutor and measure the total time, for example using %%time. This makes sure that your optimizations actually work in a production environment!

  • The usual benchmarking rules apply - for example, try to run the profiling on an otherwise idle system, otherwise you can get noisy results.

  • Single-threaded execution can be quite slow compared to using LiberTEM in production - if it is too slow for your taste, you can run your UDF on a subset of your data using a region of interest.

See also

Python Data Science Handbook

The Python Data Science Handbook has a section on profiling and timing, including line_profiler.

Official documentation for line_profiler

All information on how to use line_profiler, including using it from different contexts.

Profiling long-running tests

Information on how to profile the execution time of test cases.

Debugging UDFs or other Python code

Using the InlineJobExecutor to debug problems in your UDF.