Debugging
There are different parts of LiberTEM which can be debugged with different tools and methods.
Debugging the Web GUI
For debugging the GUI, you can use all standard debugging tools for web development. Most useful in this context are the Chrome DevTools or Firefox Developer Tools, which can be accessed by pressing F12. You can extend these with additional panels for React and for Redux.
These tools can be used for inspecting all frontend-related processes, from network traffic up to rendering behavior. Take note of the /api/events/ websocket connection, where all asynchronous notifications and results are transferred.
Note that you should always debug using the development build of the GUI, started with npm start as described in the contributing section. Otherwise the debugging experience may be painful, for example due to worse error output from React, minified source and minified component names, …
Debugging the API server
If the API server returns a server error (500), the detailed exception should be logged in the output of libertem-server. You can also try enabling the debug mode of tornado (there is currently no command line flag for this, so you need to change libertem.web.server accordingly).
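For orientation, and not as a drop-in patch, Tornado's debug mode is usually enabled by passing debug=True when the tornado.web.Application is constructed; where exactly this happens in libertem.web.server may differ, so treat this as a rough sketch:
import tornado.web

# sketch: debug=True enables Tornado's debug mode (detailed error
# pages, autoreload); the empty handler list is just a placeholder
# for the handlers that libertem.web.server actually sets up
handlers = []
app = tornado.web.Application(handlers, debug=True)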
If an analysis based on the exception alone is inconclusive, you can try to reproduce the problem using the Python API and follow the instructions below.
Debugging UDFs or other Python code
If you are trying to write a UDF, or debug other Python parts of LiberTEM, you can instruct LiberTEM to use simple single-threaded execution using the InlineJobExecutor.
from libertem import api as lt

# 'inline' selects the single-threaded InlineJobExecutor;
# dataset and udf are assumed to be defined already
ctx = lt.Context.make_with('inline')
ctx.run_udf(dataset=dataset, udf=udf)
You can then use all the usual debugging facilities, including pdb and the %pdb magic of IPython/Jupyter.
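With the inline executor you can also drop into the debugger directly from UDF code, for example using Python's built-in breakpoint(); the UDF below is purely illustrative and not part of LiberTEM:
from libertem import api as lt
from libertem.udf import UDF

class DebugSumUDF(UDF):  # illustrative UDF for demonstration only
    def get_result_buffers(self):
        return {'intensity': self.buffer(kind='nav', dtype='float32')}

    def process_frame(self, frame):
        breakpoint()  # drops into pdb when running with the inline executor
        self.results.intensity[:] = frame.sum()

ctx = lt.Context.make_with('inline')
ctx.run_udf(dataset=dataset, udf=DebugSumUDF())  # dataset assumed to be defined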
The libertem.executor.inline.InlineJobExecutor uses a single CPU core by default. It can be switched to GPU processing to test CuPy-enabled UDFs by calling libertem.common.backend.set_use_cuda() with the device ID to use. libertem.common.backend.set_use_cpu(0) switches back to CPU processing.
from libertem import api as lt
from libertem.utils.devices import detect
from libertem.common.backend import set_use_cpu, set_use_cuda

ctx = lt.Context.make_with('inline')

d = detect()
# only switch to GPU processing if a CUDA device and CuPy are available
if d['cudas'] and d['has_cupy']:
    set_use_cuda(d['cudas'][0])
ctx.run_udf(dataset=dataset, udf=udf)
set_use_cpu(0)
If a problem is only reproducible using the default executor, you will have to follow the debugging instructions of dask-distributed. As the API server can’t use the synchronous InlineJobExecutor, this also applies when debugging problems that only occur in the context of the API server.
Debugging failing test cases
When a test case fails, there are some options to find the root cause:
The --pdb command line switch of pytest can be used to automatically drop you into a pdb prompt in the failing test case, where you will either land on the failing assert statement, or at the place in the code where an exception was raised.
This does not help if the test case only fails in CI. Here, it may be easier to
use logging. Because we call pytest with the --log-level=DEBUG
parameter, the failing test case output will have a section containing the
captured logging output.
You can sprinkle the code with log.debug(…) calls that output the relevant variables. In some cases you may also leave the logging statements in the code even after the problem is fixed, depending on the overhead.
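As a minimal sketch with made-up names, such a logging call typically looks like this:
import logging

log = logging.getLogger(__name__)

def merge_results(partition_idx, result):  # hypothetical function for illustration
    log.debug("merging partition %d, result shape: %r", partition_idx, result.shape)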
Tracing using opentelemetry
New in version 0.10.0: Tracing support using opentelemetry.
Instead of sprinkling logging or print statements into your code, it is also possible to diagnose issues or gain insight into the runtime behavior of LiberTEM using opentelemetry tracing. This is also based on adding instrumentation to the code, but follows a more structured approach.
Using tracing, instead of relatively unstructured “log lines”, rich and structured information can be logged as traces, which are organized into spans. These traces can then be visualized, inspected, searched, … using different tools and databases, for example using Jaeger.
This becomes more interesting once your code goes beyond a single thread or process, when it is important to see the temporal relation between different events and functions executing concurrently. Crucially, it is possible to gather traces in distributed systems, from different nodes.
For an overview of opentelemetry, please see the official opentelemetry documentation - here, we will document the practical setup and usage. For the Python API docs, please see the opentelemetry Python API docs.
Getting tracing running
Some external services are needed to gather traces. We include a docker-compose configuration for getting these up and running quickly in the tracing directory. Please note that this configuration opens some ports by default, so be careful, as this may circumvent your device’s firewall!
To get these running, run docker-compose up in said directory. This will pull in all required docker images and start the required services, until they are stopped using Ctrl+C.
The Jaeger UI mentioned above is then available on localhost:16686. An alternative UI, called Zipkin, is available on localhost:9411. Both should now be viewable in your browser. The actual trace collection API endpoint is started on port 4317, but is only used under the hood.
In your LiberTEM virtual environment, you need to install the tracing extra, for example via pip install -e .[tracing].
The Python code then needs to be told to enable tracing, and how to connect to the trace collection API endpoint. The easiest way is to set environment variables, for example, in a notebook:
%env OTEL_ENABLE=1
%env OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
from libertem.common.tracing import maybe_setup_tracing
maybe_setup_tracing(service_name="notebook-main")
Or, for instrumenting the libertem-server:
OTEL_ENABLE=1 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 libertem-server
The same works for bringing up libertem-worker
processes:
OTEL_ENABLE=1 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 libertem-worker
Be sure to change the endpoint URL to whatever is the correct one from the perspective of the worker process in the distributed case.
Support for setting up tracing on workers is already integrated in the Dask and pipelined executors. The inline executor doesn’t need any additional setup for tracing to work.
For enabling tracing across multiple Python processes in other scenarios,
possibly on multiple nodes, set the environment variables for each of these
processes, and also call the
maybe_setup_tracing()
function on each.
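For example, a worker entry point could set the variables and initialize tracing itself; this is a sketch, and the service name is just an illustrative choice:
import os
from libertem.common.tracing import maybe_setup_tracing

# use the collector address that is reachable from this node
os.environ.setdefault("OTEL_ENABLE", "1")
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

maybe_setup_tracing(service_name="my-worker-process")  # illustrative name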
On Windows
The easiest way to get the tracing services up and running is to use the Windows Subsystem for Linux to install Linux and Docker. This allows running the tracing services as described above. Alternatively, Docker Desktop for Windows could be an option.
Clients running natively on Windows can then connect to these services:
set OTEL_ENABLE=1
set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
libertem-server
Adding your own instrumentation
By default, a minimal set of functions is already annotated with tracing
information, to be able to understand how UDFs are executed across multiple
processes. Adding tracing instrumentation to your code is similar to setting
up logging using the logging
module. At the top of your Python module,
you create and use a Tracer
object like this:
import time
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def some_function():
    with tracer.start_as_current_span("span-name"):
        time.sleep(0.1)  # do some real work here

some_function()
You can also add some more information to a span, for example events with attributes:
def some_function():
    with tracer.start_as_current_span("span-name") as span:
        for i in range(16):
            time.sleep(0.1)  # do some real work here
            span.add_event("work item done", {
                "item_id": i,  # you can add attributes to events
            })

some_function()
Attributes can also be added to spans themselves:
def some_function():
    with tracer.start_as_current_span("span-name") as span:
        time.sleep(0.1)  # do some real work here
        span.set_attributes({
            "attribute-name": "attribute-value-here",
        })

some_function()
Note that, while the tracing is quite lightweight, it is probably a good idea not to add spans and events in the innermost loops of your processing, like UDF.process_frame; spans for per-partition operations should be fine. In the future, metrics could also be collected to gain further insight into the performance characteristics.
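As a sketch of what per-partition instrumentation could look like (the UDF itself is illustrative, only the tracing calls are the point here):
from opentelemetry import trace
from libertem.udf import UDF

tracer = trace.get_tracer(__name__)

class TracedSumUDF(UDF):  # illustrative UDF, not part of LiberTEM
    def get_result_buffers(self):
        return {'intensity': self.buffer(kind='nav', dtype='float32')}

    def process_partition(self, partition):
        # one span per partition keeps the tracing overhead low
        with tracer.start_as_current_span("TracedSumUDF.process_partition"):
            self.results.intensity[:] = partition.sum(axis=(-2, -1))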
For more details, please also see the opentelemetry Python API docs.