Profiling UDFs
To bring your UDF to production, it may be necessary to look into the performance and efficiency of your function. There may be some low-hanging fruit to optimize away, but it only becomes visible once you profile your code. Reading the code and guessing which parts are expensive can be misleading and cause you to spend time optimizing the wrong part of your code. Always measure!
Profiling means instrumenting the execution of a program and finding out which parts of your code use how much of a given resource (for example, CPU time, or memory).
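Even a coarse wall-clock measurement is better than guessing, and it gives you a baseline to compare optimizations against. A minimal sketch, assuming a Context named ctx, a dataset ds and a UDF class YourUDF as set up in the following sections:
import time

t0 = time.perf_counter()
res = ctx.run_udf(dataset=ds, udf=YourUDF())
print(f"run_udf took {time.perf_counter() - t0:.2f} s")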
Prerequisite: InlineJobExecutor
By default, your UDF will be run in a multi-process or multi-threaded environment. This makes profiling a challenge, since the usual profiling tools will only capture information on the main process, which performs the final reduction, and not on the worker processes that do most of the work.
So, in order to use the default Python profiling mechanisms, all parts of the UDF should be executed in the same single thread. LiberTEM provides the InlineJobExecutor for this purpose. Improving the execution time in a single-threaded executor will almost always improve the multi-process execution time as well. Nevertheless, you should confirm the impact of your changes in your production environment.
To use the InlineJobExecutor, pass it to the Context on construction:
from libertem.api import Context
from libertem.executor.inline import InlineJobExecutor

ctx = Context(executor=InlineJobExecutor())
Then, you can continue as usual, loading data, executing your UDF, etc.
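Putting it together, a complete single-threaded setup could look like the following sketch; the file path is a placeholder, and SumUDF merely stands in for your own UDF:
from libertem.api import Context
from libertem.executor.inline import InlineJobExecutor
from libertem.udf.sum import SumUDF  # stand-in for your own UDF

# single-threaded execution, so profilers see all of the work
ctx = Context(executor=InlineJobExecutor())

# "auto" lets LiberTEM detect the file format; the path is a placeholder
ds = ctx.load("auto", path="path/to/your/data.mib")
res = ctx.run_udf(dataset=ds, udf=SumUDF())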
Using a progress bar
A progress bar is a graphical tool that shows how far a process has progressed. It can be used to assess the partitioning of data, since progress is indicated on a per-partition basis, in the following way:
res = ctx.run_udf(udf=fit_udf, dataset=ds, roi=roi, progress=True)
> 100%|██████████| 18/18 [01:17<00:00, 4.29s/it]
This run used 18 partitions and averaged about 4.29 s per partition (18 × 4.29 s ≈ 77 s, which matches the 01:17 total).
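The total count of the bar corresponds to the number of partitions. If you want to check it directly, the dataset exposes it (assuming your dataset class implements get_num_partitions, as the built-in ones do):
# the run above used 18 partitions
print(ds.get_num_partitions())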
Line profiling using line_profiler
Profiling comes in different forms: some tools show you how much time each function of a whole program took, while others, like line_profiler, can give you a line-by-line overview of the functions you are interested in. For most UDFs, the latter is sufficient for performance analysis.
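For the whole-program, per-function view, Python's built-in cProfile works without any extra installation. A minimal sketch, assuming the ctx, ds and YourUDF from above:
import cProfile

# sort the report by cumulative time to surface the most expensive call chains
cProfile.run("ctx.run_udf(dataset=ds, udf=YourUDF())", sort="cumtime")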
First, you need to install the line_profiler package via pip, with your conda environment or virtualenv activated:
(libertem) $ python -m pip install line_profiler
Then, the easiest way to get started is to use the IPython/Jupyter integration of line_profiler. Put %load_ext line_profiler somewhere in your notebook; then you can use the %lprun magic command to run your UDF while profiling:
%lprun -f YourUDF.get_task_data -f YourUDF.process_frame ctx.run_udf(dataset=dataset, udf=YourUDF())
Note the repeated -f arguments - you can list any methods of your UDF you are interested in, or even other Python functions you are directly or indirectly using. If you had some complicated code in YourUDF.merge, you would include -f YourUDF.merge in the %lprun call.
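Outside a notebook, you can use the line_profiler Python API directly. A minimal sketch, again assuming ctx, dataset and YourUDF are defined as above:
from line_profiler import LineProfiler

lp = LineProfiler()
# register the functions you want line-by-line timings for
lp.add_function(YourUDF.get_task_data)
lp.add_function(YourUDF.process_frame)
# run the UDF under the profiler, then print the per-line report
lp.runcall(ctx.run_udf, dataset=dataset, udf=YourUDF())
lp.print_stats()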
This is what the output could look like:
Timer unit: 1e-06 s
Total time: 0.002283 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: get_task_data at line 145
Line # Hits Time Per Hit % Time Line Contents
==============================================================
145 def get_task_data(self):
146 """
147 Updates `task_data`
148
149 Returns
150 -------
151 kwargs : dict
152 A dictionary with the following keys:
153 kwargs['aperture'] : array-like
154 Side band filter aperture (mask)
155 kwargs['slice'] : slice
156 Slice for slicing FFT of the hologram
157 """
158
159 2 48.0 24.0 2.1 out_shape = self.params.out_shape
160 2 51.0 25.5 2.2 sy, sx = self.meta.partition_shape.sig
161 2 5.0 2.5 0.2 oy, ox = out_shape
162 2 7.0 3.5 0.3 f_sampling = (1. / oy, 1. / ox)
163 2 292.0 146.0 12.8 sb_size = self.params.sb_size * np.mean(f_sampling)
164 2 261.0 130.5 11.4 sb_smoothness = sb_size * self.params.sb_smoothness * np.mean(f_sampling)
165
166 2 1172.0 586.0 51.3 f_freq = freq_array(out_shape)
167 2 263.0 131.5 11.5 aperture = aperture_function(f_freq, sb_size, sb_smoothness)
168
169 2 64.0 32.0 2.8 y_min = int(sy / 2 - oy / 2)
170 2 37.0 18.5 1.6 y_max = int(sy / 2 + oy / 2)
171 2 32.0 16.0 1.4 x_min = int(sx / 2 - ox / 2)
172 2 30.0 15.0 1.3 x_max = int(sx / 2 + oy / 2)
173 2 8.0 4.0 0.4 slice_fft = (slice(y_min, y_max), slice(x_min, x_max))
174
175 kwargs = {
176 2 4.0 2.0 0.2 'aperture': aperture,
177 2 6.0 3.0 0.3 'slice': slice_fft
178 }
179 2 3.0 1.5 0.1 return kwargs
Total time: 63.748 s
File: /home/clausen/source/libertem/src/libertem/udf/holography.py
Function: process_frame at line 181
Line # Hits Time Per Hit % Time Line Contents
==============================================================
181 def process_frame(self, frame):
182 """
183 Reconstructs holograms outputting results into 'wave'
184
185 Parameters
186 ----------
187 frame
188 single frame (hologram) of the data
189 """
190 16 154.0 9.6 0.0 if not self.params.precision:
191 frame = frame.astype(np.float32)
192 # size_x, size_y = self.params.out_shape
193 16 81.0 5.1 0.0 frame_size = self.meta.partition_shape.sig
194 16 58.0 3.6 0.0 sb_pos = self.params.sb_position
195 16 66.0 4.1 0.0 aperture = self.task_data.aperture
196 16 52.0 3.2 0.0 slice_fft = self.task_data.slice
197
198 16 59291808.0 3705738.0 93.0 fft_frame = fft2(frame) / np.prod(frame_size)
199 16 2189960.0 136872.5 3.4 fft_frame = np.roll(fft_frame, sb_pos, axis=(0, 1))
200
201 16 2258700.0 141168.8 3.5 fft_frame = fftshift(fftshift(fft_frame)[slice_fft])
202
203 16 816.0 51.0 0.0 fft_frame = fft_frame * aperture
204
205 16 5957.0 372.3 0.0 wav = ifft2(fft_frame) * np.prod(frame_size)
206 16 364.0 22.8 0.0 self.results.wave[:] = wav
Things to note:
- get_task_data takes a very small amount of time, compared to process_frame. In this case, it does not make sense to concentrate on optimizing get_task_data at all!
- In process_frame, the fft2 call takes up most of the time, so that is where we should direct our efforts. Improving, for example, the calls to fftshift would give us a maximum speed-up of a few percent - and only if we manage to dramatically improve their execution time!
- line_profiler doesn't give information about individual expressions - sometimes you have to put expressions on their own line to see their individual contributions to the execution time. See the fft2 and np.prod calls on the hottest line in the profile; the sketch after this list shows how to split them up.
- After successfully improving on the profiled times, always re-run with profiling disabled and without the InlineJobExecutor, and measure the total time, for example using %%time. This makes sure that your optimizations actually work in a production environment!
- The usual benchmarking rules apply - for example, try to run the profiling on an otherwise idle system, otherwise you can get noisy results.
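To illustrate the point about individual expressions, here is a hypothetical rewrite of the hottest line from the profile above; splitting the combined expression lets line_profiler attribute time to each part separately:
# combined: line_profiler reports a single time for the whole line
fft_frame = fft2(frame) / np.prod(frame_size)

# split up: each sub-expression now gets its own per-line timing
fft_frame = fft2(frame)
fft_frame = fft_frame / np.prod(frame_size)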
Single-threaded execution can be quite slow compared to using LiberTEM in production - if it is too slow for your taste, you can run your UDF on a subset of your data using a region of interest.
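A region of interest is a boolean NumPy mask over the navigation (scan) dimensions. A minimal sketch that restricts the run to the first scan row (a hypothetical choice; it assumes a 2D navigation shape and the ctx, ds and YourUDF from above):
import numpy as np

# boolean mask over the navigation (scan) dimensions
roi = np.zeros(ds.shape.nav, dtype=bool)
roi[0, :] = True  # hypothetical choice: profile only the first scan row

res = ctx.run_udf(dataset=ds, udf=YourUDF(), roi=roi)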
See also
- Python Data Science Handbook
The Python Data Science Handbook has a section on profiling and timing, including line_profiler.
- Official documentation for line_profiler
All information on how to use line_profiler, including using it from different contexts.
- Profiling long-running tests
Information on how to profile the execution time of test cases.
- Debugging UDFs or other Python code
Using the InlineJobExecutor to debug problems in your UDF.