Lazy loading#
Data can be loaded lazily by using lazy=True; however, not all formats are supported.
The data will be loaded as a dask array instead of a numpy array.
>>> from rsciio import mrc
>>> d = mrc.file_reader("file.mrc", lazy=True)
>>> d[0]["data"]
dask.array<array, shape=(10, 20, 30), dtype=int64, chunksize=(10, 20, 30), chunktype=numpy.ndarray>
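Because the returned data is a dask array, any computation on it is deferred until explicitly requested, for example (a minimal sketch continuing from the example above):
>>> data = d[0]["data"]
>>> data.mean().compute()  # reading the file and computing the mean happen only here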
Memory mapping#
Binary file formats are loaded lazily using memory mapping.
The common implementation consists of passing a numpy.memmap array to dask.array.from_array().
However, this approach has some shortcomings, to name a few: it is not compatible with the dask distributed scheduler and it offers limited control over memory usage.
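As an illustration, this pattern looks roughly like the following (a minimal sketch assuming a raw binary file of known shape and dtype; the file name, shape and dtype are placeholders):
>>> import numpy as np
>>> import dask.array as da
>>> mm = np.memmap("file.raw", dtype="int16", mode="r", shape=(10, 20, 30))  # illustrative file
>>> data = da.from_array(mm, chunks=(5, 20, 30))  # the memmap object is held by the dask graph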
For supported file formats, a different implementation can be used to load data lazily in a manner that is compatible with the dask distributed scheduler and allows for better control of memory usage.
This implementation uses an approach similar to that described in the dask documentation on memory mapping and is enabled using the distributed parameter (not all formats are supported):
>>> s = hs.load("file.mrc", lazy=True, distributed=True)
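The general pattern described in the dask documentation is to create the memory map inside each task, rather than passing a single numpy.memmap object through the scheduler. The sketch below illustrates that idea only; it is not the actual reader implementation, and the file name, shape, dtype and helper function are placeholders:
>>> import numpy as np
>>> import dask
>>> import dask.array as da
>>> @dask.delayed
... def load_slice(fname, shape, dtype, sl):
...     # illustrative helper: the memmap is created inside the task, so no
...     # memmap object needs to be serialised between scheduler and workers
...     return np.asarray(np.memmap(fname, shape=shape, dtype=dtype, mode="r")[sl])
>>> blocks = [
...     da.from_delayed(
...         load_slice("file.raw", (10, 20, 30), "int16", slice(i, i + 5)),
...         shape=(5, 20, 30), dtype="int16",
...     )
...     for i in range(0, 10, 5)
... ]
>>> data = da.concatenate(blocks, axis=0)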
Chunks#
Depending on the intended processing after loading the data, it may be necessary to
define the chunking manually to control the memory usage or compute distribution.
The chunking can also be specified using the chunks parameter as follows:
>>> s = hs.load("file.mrc", lazy=True, distributed=True, chunks=(5, 10, 10))
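The resulting chunk structure can be inspected on the underlying dask array (hypothetical output, assuming the same (10, 20, 30) dataset as above loaded with HyperSpy, where the dask array is exposed as s.data):
>>> s.data.chunks
((5, 5), (10, 10), (10, 10, 10))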
Note
Some file readers support specifying the chunks parameter with the distributed parameter set to either True or False. In both cases the reader will return a dask array with the specified chunks. However, the way the dask array is created differs significantly, and if there are issues with memory usage or slow loading, it is recommended to try the distributed implementation.