Example of integration with HyperSpy lazy signals

This notebook shows several ways in which LiberTEM can integrate with Dask and, by extension, HyperSpy lazy signals. In particular:

  • how to apply a LiberTEM UDF to a HyperSpy lazy signal

  • how to create a HyperSpy lazy signal from a LiberTEM DataSet

  • how to create a HyperSpy lazy signal from a LiberTEM computation.

This notebook requires the hdbscan extra and uses PyXem for scanning diffraction data support in HyperSpy.

Please note that these features are rather new and not well-tested yet (added in version 0.9.0). Bug reports and improvements are much appreciated!

[1]:
%matplotlib inline
[2]:
import os
import numpy as np
import dask
import hyperspy.api as hs
import pyxem
import matplotlib.pyplot as plt
# NBVAL_IGNORE_OUTPUT
WARNING:silx.opencl.common:Unable to import pyOpenCl. Please install it from: https://pypi.org/project/pyopencl
[3]:
from libertem.api import Context
from libertem.udf.sumsigudf import SumSigUDF
from libertem.contrib.daskadapter import make_dask_array
from libertem_blobfinder.common import correlation, patterns
from libertem_blobfinder.udf import refinement
import libertem_blobfinder.common.fullmatch as fm

As always, we start with a LiberTEM Context object to act as an interface to Dask. Here we use the make_with convenience method to create a Context which will integrate with any existing Dask scheduler, such as one that might have been instantiated by HyperSpy or another upstream computation.

[4]:
ctx = Context.make_with("dask-integration")

Alternatively, we could create and set a LiberTEM Dask scheduler as the default Dask scheduler using ctx = Context.make_with("dask-make-default").

Unfortunately, the current BLO-file reader in HyperSpy is not fully compatible with the default distributed scheduler created by LiberTEM.

[5]:
data_base_path = os.environ.get("TESTDATA_BASE_PATH", "/cachedata/users/clausen/libertem-test-data/")
path = os.path.join(data_base_path, "default.blo")

From HyperSpy to LiberTEM

Load data as HyperSpy lazy signal

First we demonstrate the normal HyperSpy lazy signal workflow, where the .data backing the Signal is in fact a dask.array.Array, which is evaluated only when a result is requested (e.g. via dask.compute()). When reading from a file, the data can remain on disk until required, rather than being loaded into memory at the beginning.
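To make the lazy-evaluation principle concrete, here is a minimal sketch using a synthetic dask array (the shape and chunking mirror the BLO data below; the variable names are illustrative):

import dask.array as da

# Synthetic stand-in for the 4D dataset; no chunk is materialised yet
arr = da.ones((90, 121, 144, 144), dtype="uint8", chunks=(45, 41, 144, 144))
frame_sums = arr.sum(axis=(2, 3))  # only builds a task graph
result = frame_sums.compute()      # the actual work happens here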

[6]:
lazy_4d = hs.load(path, lazy=True)
lazy_4d
[6]:
Title:
SignalType: diffraction

             Array                Chunk
Bytes        215.35 MiB           36.49 MiB
Shape        (121, 90|144, 144)   (41, 45|144, 144)
Count        7 Tasks              6 Chunks
Type         uint8                numpy.ndarray
[7]:
# NBVAL_IGNORE_OUTPUT
lazy_4d.data
[7]:
             Array                 Chunk
Bytes        215.35 MiB            36.49 MiB
Shape        (90, 121, 144, 144)   (45, 41, 144, 144)
Dask graph   6 chunks in 2 graph layers
Data type    uint8 numpy.ndarray
[8]:
# NBVAL_IGNORE_OUTPUT
lazy_4d.plot()
[########################################] | 100% Completed | 101.54 ms
_images/hyperspy-integration_11_1.png
_images/hyperspy-integration_11_2.png

Load the HyperSpy lazy signal as a LiberTEM dataset

Starting with version 0.9.0, it is possible to load a dask.array.Array as a LiberTEM DataSet, allowing it to be processed with the existing LiberTEM UDF interface. This is achieved by loading the dask array using the Context.load method with the "dask" file type.

As before, no data is actually loaded into memory at this stage, it will only be read when computation is requested. When the data is eventually read, the HyperSpy BLO reader will be used by Dask to load the file.

[9]:
dask_ds = ctx.load("dask", lazy_4d.data, sig_dims=2, min_size=64*2**20)
WARNING:libertem.io.dataset.dask:Merging nav dimension chunks according to scheme {1: -1} as we cannot maintain continuity of frame indexing in the flattened navigation dimension. Original n_blocks: [2, 3, 1, 1]. New n_blocks: [2, 1, 1, 1].

Due to differences in the design of Dask and LiberTEM, the DataSet must sometimes modify the array chunking (i.e. how the data is split into blocks for processing), leading to the above warning. This does not modify the data itself in any way; it only changes how the computation is split over parts of the full dataset.
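If the merge warning matters for a given workload, the chunking can be adjusted before loading. A hedged sketch, assuming we want each chunk to cover complete scan rows (variable names illustrative):

# Rechunk so axis 1 is uncut; -1 requests a single chunk along that axis
rechunked = lazy_4d.data.rechunk({0: 45, 1: -1})
dask_ds_rows = ctx.load("dask", rechunked, sig_dims=2)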

We start by running some basic analyses on the dataset: loading a single diffraction frame with create_pick_analysis and summing over the signal dimensions with SumSigUDF. For both of these calls, the data is actually read from disk at this point.

[10]:
pick_a = ctx.create_pick_analysis(dataset=dask_ds, y=45, x=60)
pick_res = ctx.run(pick_a)
[11]:
sum_res = ctx.run_udf(dataset=dask_ds, udf=SumSigUDF())
[12]:
%%timeit
# NBVAL_SKIP
sum_res = ctx.run_udf(dataset=dask_ds, udf=SumSigUDF())
302 ms ± 9.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[13]:
fig, axes = plt.subplots(1, 2, figsize=(12, 8))
axes[0].imshow(pick_res['intensity'].raw_data)
axes[0].set_title('Diffraction frame')
axes[1].imshow(sum_res['intensity'].data)
axes[1].set_title('Sum over signal dimensions')
[13]:
Text(0.5, 1.0, 'Sum over signal dimensions')
_images/hyperspy-integration_18_1.png

From LiberTEM to HyperSpy: Dataset

Load data using LiberTEM

LiberTEM possesses its own BLO-file reader (used here with auto-detection of the file type). First we run the same analyses as above to demonstrate that the two approaches are equivalent.

[14]:
native_ds = ctx.load('auto', path)
native_ds.set_num_cores(2)
[15]:
sum_res_lt = ctx.run_udf(dataset=native_ds, udf=SumSigUDF())
[16]:
%%timeit
# NBVAL_SKIP
ctx.run_udf(dataset=native_ds, udf=SumSigUDF())
238 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[17]:
np.allclose(sum_res_lt['intensity'].data, sum_res['intensity'].data)
[17]:
True

Create dask array from LiberTEM dataset

Using the make_dask_array function, we can convert any LiberTEM dataset into a chunked dask.array.Array, such as those backing HyperSpy lazy signals.

[18]:
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    dask_array, workers = make_dask_array(native_ds)

The dask_array is accompanied by a dictionary of workers, which can be passed to dask.compute() at the moment of evaluation to ensure data locality and avoid transferring chunks of the dataset between processes or over the network.
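As a hedged sketch of how this could look, assuming a dask.distributed scheduler (which accepts a workers argument at compute time):

# Reduce over the signal axes while hinting which workers hold which chunks
frame_sums = dask_array.sum(axis=(2, 3))
result, = dask.compute(frame_sums, workers=workers)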

Create lazy HyperSpy signal from Dask array

We now create a HyperSpy signal from this dask_array, with the aid of pyxem to maintain its metadata as a diffraction dataset.

[19]:
libertem_4d = pyxem.signals.LazyDiffraction2D(dask_array)

This gives us access to the full range of HyperSpy interfaces on the LiberTEM dataset, such as .plot().
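For instance, other lazy HyperSpy methods work the same way and stay lazy until explicitly computed (a small sketch; the variable name is illustrative):

# Sum over the navigation axes (the HyperSpy default), still lazy
summed_pattern = libertem_4d.sum()
summed_pattern.compute()  # triggers the actual read and reduction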

[20]:
# NBVAL_IGNORE_OUTPUT
libertem_4d.plot()
[########################################] | 100% Completed | 605.67 ms
_images/hyperspy-integration_31_1.png
_images/hyperspy-integration_31_2.png

From LiberTEM to HyperSpy: UDF computation result

Starting with version 0.9.0, it is possible to obtain Dask arrays as output from LiberTEM UDF computations. This remains experimental and requires a dedicated LiberTEM Context. The computation is built on dask.delayed, which allows normal Python code, such as a UDF, to run within a Dask task graph.
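To illustrate the dask.delayed mechanism in isolation, here is a minimal sketch independent of LiberTEM:

from dask import delayed

@delayed
def add_one(x):
    # Ordinary Python, wrapped as a node in a Dask task graph
    return x + 1

node = add_one(41)     # builds the graph; nothing has run yet
print(node.compute())  # 42 -- executed by the Dask scheduler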

As an example, we use LiberTEM-blobfinder (https://libertem.github.io/LiberTEM-blobfinder/) for diffraction disk finding and pattern matching.

Set up the pattern analysis

To begin, we use the HyperSpy signal we created above and LiberTEM blobfinder to create a pattern analysis.

First we pick a reference frame on which to make initial guesses of the peak positions and lattice vectors:

[21]:
reference_frame = lazy_4d.data[50, 60].compute()

Note that we must .compute() the frame because the HyperSpy signal is lazily evaluated.

We create a search pattern based on a disk radius measured in pixels:

[22]:
r = 1.5
pattern = patterns.BackgroundSubtraction(radius=r)

We evaluate the initial peak positions in the reference frame using correlation with our search pattern:

[23]:
peaks = correlation.get_peaks(sum_result=reference_frame, match_pattern=pattern, num_peaks=50)

We interpret the detected disks as a diffraction pattern (with principal vectors etc.) using the .full_match method. This requires us to identify the transmitted beam position in pixels (the zero position):

[24]:
zero = (71, 71)
matcher = fm.FullMatcher()
all_match = matcher.full_match(peaks, zero=zero)
match = all_match[0][0]

And finally we display the results on the reference frame:

[25]:
fig, axes = plt.subplots(figsize=(10, 8))
axes.imshow(reference_frame)
plt.title("Points of significance and match in the reference frame")

for p in np.flip(match.refineds, axis=1):
    axes.add_artist(plt.Circle(p, r, color="r", fill=False))
for p in np.flip(match.calculated_refineds, axis=1):
    axes.add_artist(plt.Circle(p, r, color="g", fill=False))
plt.arrow(*np.flip(match.zero), *(np.flip(match.a)), color='g')
plt.arrow(*np.flip(match.zero), *(np.flip(match.b)), color='w')
[25]:
<matplotlib.patches.FancyArrow at 0x7fdb880eed50>
_images/hyperspy-integration_42_1.png

Per-frame refinement of the parameters we found above

All of the above was preparation for the analysis of the whole dataset based on the detected pattern. For this we will use the special 'delayed' Context, though the same analysis would run with any of the LiberTEM executors.

[26]:
delayed_ctx = Context.make_with('delayed')
[27]:
%%time
refined, indices = refinement.run_refine(ctx=delayed_ctx,
                                         dataset=native_ds,
                                         zero=match.zero,
                                         a=match.a,
                                         b=match.b,
                                         matcher=matcher,
                                         match_pattern=pattern,
                                         indices=np.mgrid[-4:5, -4:5])
# NBVAL_IGNORE_OUTPUT
CPU times: user 21.1 ms, sys: 788 µs, total: 21.9 ms
Wall time: 20.9 ms

Note that the ‘analysis’ only takes a few milliseconds. As stated previously, the computation is specified lazily using a Dask task graph, so the actual work is only done when the result is requested; we have only told the system what we want to do in the future.

The same applies to any computation after the refinement step, such as computing the lattice vector lengths or ratio:

[28]:
abs_a = np.sqrt(np.sum(refined['a'].delayed_data**2, axis=-1))
abs_b = np.sqrt(np.sum(refined['b'].delayed_data**2, axis=-1))
ratio = abs_a / abs_b

We can then wrap the result in a HyperSpy signal to benefit from the many methods and interfaces the library provides:

[29]:
ratio_hs = hs.signals.Signal2D(ratio).as_lazy()

For demonstration, we can visualize the dask task graph, showing the computation that will be performed when requested:

[30]:
# NBVAL_SKIP
ratio_hs.data.visualize(filename=None)
[30]:
_images/hyperspy-integration_51_0.png

Finally, we plot the result, which automatically triggers the computation behind the scenes:

[31]:
%time ratio_hs.plot(vmin=0.71, vmax=0.73)
# NBVAL_IGNORE_OUTPUT
CPU times: user 20 s, sys: 2.56 s, total: 22.5 s
Wall time: 18 s
_images/hyperspy-integration_53_1.png