Lazy loading#

Data can be loaded lazily by passing lazy=True, although not all formats support it. The data is then returned as a dask array instead of a numpy array.

>>> from rsciio import mrc
>>> d = mrc.file_reader("file.mrc", lazy=True)
>>> d[0]["data"]
dask.array<array, shape=(10, 20, 30), dtype=int64, chunksize=(10, 20, 30), chunktype=numpy.ndarray>
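Only the file metadata is read at this point; the data itself is read when a computation is requested. As a minimal sketch, assuming the dataset shown above, a reduction can be evaluated with the standard dask API:

>>> d[0]["data"].sum().compute()  # reads the data from disk and evaluates the sum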

Memory mapping#

Binary file formats are loaded lazily using memory mapping. The common implementation consists of passing a numpy.memmap array to dask.array.from_array(). However, this approach has some shortcomings: for example, it is not compatible with the dask distributed scheduler and it offers limited control over memory usage.
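For illustration, this common pattern amounts to something like the following sketch; the file name, offset, shape and dtype are hypothetical placeholders, not a real reader implementation:

>>> import numpy as np
>>> import dask.array as da
>>> # memory-map the raw binary data without reading it into memory
>>> mm = np.memmap("file.raw", mode="r", dtype="int16", offset=0, shape=(10, 20, 30))
>>> data = da.from_array(mm, chunks="auto")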

For supported file formats, a different implementation can be used to load data lazily in a manner that is compatible with the dask distributed scheduler and allows for better control of memory usage. This implementation uses an approach similar to that described in the dask documentation on memory mapping and is enabled with the distributed parameter (not all formats are supported):

>>> s = hs.load("file.mrc", lazy=True, distributed=True)

Chunks#

Depending on the processing intended after loading the data, it may be necessary to define the chunking manually to control memory usage or the distribution of computation. The chunking can be specified using the chunks parameter, as follows:

>>> s = hs.load("file.mrc", lazy=True, distributed=True, chunks=(5, 10, 10))

Note

Some file readers support specifying the chunks parameter with the distributed parameter set to either True or False. In both cases the reader returns a dask array with the specified chunks; however, the way the dask array is created differs significantly, and if there are issues with memory usage or slow loading, it is recommended to try the distributed implementation.