Utility functions
RosettaSciIO provides utility functions that are applicable to multiple formats, e.g. for the HDF5 format on which a number of plugins are based.
HDF5 utility functions
HDF5 file inspection.
- rsciio.utils.hdf5.list_datasets_in_file(filename, dataset_key=None, hardlinks_only=False, verbose=True)
Read from a NeXus or .hdf file and return a list of the dataset paths.
This method is used to inspect the contents of an hdf5 file. It iterates through the group attributes and collects NXdata blocks and hdf datasets of size >= 2 (if they are not already part of an NXdata block), returning a list of the entries. This is a convenience method to list the datasets present in a file rather than loading all of them as signals.
- Parameters:
  - filename : str or pathlib.Path
    Filename of the file to read or corresponding pathlib.Path.
  - dataset_key : str, list of str or None, default=None
    If a string or list of strings is provided, only return items whose path contains the string(s). For example, dataset_key=["instrument", "Fe"] will only return hdf entries with "instrument" or "Fe" somewhere in their hdf path.
  - hardlinks_only : bool, default=False
    If True, any links (soft or external) will be ignored when loading.
  - verbose : bool, default=True
    Print the results to screen.
- Returns:
  - list
    List of paths to datasets.

See also

rsciio.utils.hdf5.read_metadata_from_file : Convenience function to read metadata present in a file.
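Examples

A minimal sketch; the filename "experiment.nxs" is a hypothetical example:

>>> from rsciio.utils.hdf5 import list_datasets_in_file
>>> list_datasets_in_file("experiment.nxs")
>>> # only list entries whose hdf path contains "instrument" or "Fe"
>>> list_datasets_in_file("experiment.nxs", dataset_key=["instrument", "Fe"], verbose=False)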
- rsciio.utils.hdf5.read_metadata_from_file(filename, lazy=False, metadata_key=None, verbose=False, skip_array_metadata=False)
Read the metadata from a NeXus or .hdf file.
This method iterates through the hdf5 file and returns a dictionary of the entries. This is a convenience method to inspect a file for a value rather than loading the file as a signal.
- Parameters:
  - filename : str or pathlib.Path
    Filename of the file to read or corresponding pathlib.Path.
  - lazy : bool, default=False
    Whether to open the file lazily or not. The file will stay open until closed in compute() or closed manually. get_file_handle() can be used to access the file handle and close it manually.
  - metadata_key : None, str or list of str, default=None
    None will return all datasets found, including linked data. Providing a string or list of strings will only return items which contain the string(s). For example, metadata_key=["instrument", "Fe"] will return hdf entries with "instrument" or "Fe" in their hdf path.
  - verbose : bool, default=False
    Pretty print the results to screen.
  - skip_array_metadata : bool, default=False
    Whether to skip loading array metadata. This is useful as many large arrays may be present in the metadata, which is redundant with the dataset itself.
- Returns:
  - dict
    Metadata dictionary.

See also

rsciio.utils.hdf5.list_datasets_in_file : Convenience function to list datasets present in a file.
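Examples

A minimal sketch; the filename "experiment.nxs" is a hypothetical example:

>>> from rsciio.utils.hdf5 import read_metadata_from_file
>>> meta = read_metadata_from_file("experiment.nxs", skip_array_metadata=True)
>>> # only return entries with "instrument" in their hdf path
>>> read_metadata_from_file("experiment.nxs", metadata_key="instrument", verbose=True)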
Generic utility functions
- rsciio.utils.tools.get_file_handle(data, warn=True)
Return the file handle of a dask array when possible. Currently, only hdf5 and tiff files are supported.
- Parameters:
  - data : dask.array.Array
    The dask array from which the file handle will be retrieved.
  - warn : bool
    Whether to warn or not when the file handle can't be retrieved. Default is True.
- Returns:
  - File handle or None
    The file handle of the file when possible.
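Examples

A minimal sketch of retrieving and closing the handle of a lazily loaded hdf5-based file; the hspy plugin and the filename "data.hspy" are illustrative choices:

>>> from rsciio.hspy import file_reader
>>> from rsciio.utils.tools import get_file_handle
>>> signal_dict = file_reader("data.hspy", lazy=True)[0]
>>> fh = get_file_handle(signal_dict["data"])
>>> fh.close()  # close the underlying file manually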
Distributed utility functions
- rsciio.utils.distributed.get_arbitrary_chunk_slice(positions, shape, chunks='auto', block_size_limit=None, dtype=None)
Get chunk slices for the rsciio.utils.distributed.slice_memmap() function from arbitrary positions given by a list of x, y coordinates.
- Parameters:
  - positions : array_like
    A numpy array in the form [[x1, y1], [x2, y2], ...] where x, y map the frame to the real space coordinate of the data.
  - shape : tuple
    Shape of the signal data.
  - chunks : tuple, optional
    Chunk shape. The default is "auto".
  - block_size_limit : int, optional
    Maximum size of a block in bytes. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
  - dtype : numpy.dtype, optional
    Data type. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
- Returns:
  - dask.array.Array
    Dask array of the slices.
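Examples

A hedged sketch; the positions and shape values are illustrative assumptions:

>>> import numpy as np
>>> from rsciio.utils.distributed import get_arbitrary_chunk_slice
>>> # four frames mapped onto a 2x2 real-space grid
>>> positions = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> chunk_slices = get_arbitrary_chunk_slice(positions, shape=(2, 2, 128, 128), dtype=np.uint16)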
- rsciio.utils.distributed.get_chunk_slice(shape, chunks='auto', block_size_limit=None, dtype=None)
Get chunk slices for the rsciio.utils.distributed.slice_memmap() function.
Takes a shape and chunks and returns a dask array of the slices to be used with the rsciio.utils.distributed.slice_memmap() function. This is useful for loading data from a memmapped file in a distributed manner.
- Parameters:
  - shape : tuple
    Shape of the data.
  - chunks : tuple or str, optional
    Define the chunk shape. This argument is passed to dask.array.core.normalize_chunks(). The default is "auto".
  - block_size_limit : int, optional
    Maximum size of a block in bytes. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
  - dtype : numpy.dtype, optional
    Data type. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
- Returns:
  - dask.array.Array
    Dask array of the slices.
  - tuple
    Tuple of the chunks.
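Examples

A minimal sketch; the shape and dtype are illustrative assumptions:

>>> import numpy as np
>>> from rsciio.utils.distributed import get_chunk_slice
>>> # slices and normalized chunks for a hypothetical (16, 64, 64) uint16 dataset
>>> chunk_slices, chunks = get_chunk_slice(shape=(16, 64, 64), dtype=np.uint16)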
- rsciio.utils.distributed.memmap_distributed(filename, dtype, positions=None, offset=0, shape=None, order='C', chunks='auto', block_size_limit=None, key=None)
Drop-in replacement for numpy.memmap allowing for distributed loading of data.
This always loads the data using dask, which can be beneficial in many cases but may not be ideal in others. The chunks and block_size_limit arguments describe an ideal chunk shape and size, as defined using the dask.array.core.normalize_chunks() function.
- Parameters:
  - filename : str
    Path to the file.
  - dtype : numpy.dtype
    Data type of the data for the memmap function.
  - positions : array_like, optional
    A numpy array in the form [[x1, y1], [x2, y2], ...] where x, y map the frame to the real space coordinate of the data. The default is None.
  - offset : int, optional
    Offset in bytes. The default is 0.
  - shape : tuple, optional
    Shape of the data to be read. The default is None.
  - order : str, optional
    Order of the data. The default is "C"; see numpy.memmap for more details.
  - chunks : tuple, optional
    Chunk shape. The default is "auto".
  - block_size_limit : int, optional
    Maximum size of a block in bytes. The default is None.
  - key : None or str
    For structured dtype only. Specify the key of the structured dtype to use.
- Returns:
  - dask.array.Array
    Dask array of the data from the memmapped file, with the specified chunks.

Notes

Currently dask.array.map_blocks() does not allow for multiple outputs. As a result, in the case of a structured dtype, the key of the structured dtype needs to be specified. For example, with dtype = (("data", int, (128, 128)), ("sec", "<u4", 512)), either "data" or "sec" will need to be specified.
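Examples

A minimal sketch; the file "stack.raw" and its shape are illustrative assumptions:

>>> import numpy as np
>>> from rsciio.utils.distributed import memmap_distributed
>>> # write a small raw file so the example is self-contained
>>> np.zeros((16, 64, 64), dtype=np.uint16).tofile("stack.raw")
>>> data = memmap_distributed("stack.raw", dtype=np.uint16, shape=(16, 64, 64))
>>> data.sum().compute()  # work is deferred until compute() is called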
- rsciio.utils.distributed.slice_memmap(slices, file, dtypes, shape, key=None, positions=False, **kwargs)
Slice a memory-mapped file using a tuple of slices.
This is useful for loading data from a memory-mapped file in a distributed manner. The function first creates a memory-mapped array of the entire dataset and then uses the slices to slice the memory-mapped array. The slices can be used to build a dask array, as each slice translates to one chunk of the dask array.
- Parameters:
  - slices : array_like of int
    An array of the slices to use. The dimensions of the array should be (n, 2), where n is the number of dimensions of the data. The first column is the start of the slice and the second column is the stop of the slice.
  - file : str
    Path to the file.
  - dtypes : numpy.dtype
    Data type of the data for the numpy.memmap function.
  - shape : tuple
    Shape of the entire dataset. Passed to the numpy.memmap function.
  - key : None or str
    For structured dtype only. Specify the key of the structured dtype to use.
  - positions : bool, optional
    If True, the slices include indexes for positions, which are then used to create a custom scan pattern. The default is False.
  - **kwargs : dict
    Additional keyword arguments to pass to the numpy.memmap function.
- Returns:
  - numpy.ndarray
    Array of the data from the memory-mapped file, sliced using the provided slices.
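Examples

A minimal sketch of calling the function directly; in practice it is usually driven by dask together with get_chunk_slice(), and the file and shapes here are illustrative assumptions:

>>> import numpy as np
>>> from rsciio.utils.distributed import slice_memmap
>>> np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8).tofile("frames.raw")
>>> # take frames 1:3 and the top-left 4x4 corner of each frame
>>> slices = np.array([[1, 3], [0, 4], [0, 4]])
>>> block = slice_memmap(slices, "frames.raw", np.float32, shape=(4, 8, 8))  # shape (2, 4, 4)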
Logging
- rsciio.set_log_level(level)
Convenience function to set the log level of all rsciio modules.
Note: The log level of all other modules is left untouched.
- Parameters:
  - level : int or str
    The log level to set. Any value accepted by logging.Logger.setLevel() is valid, e.g. 'CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG' or 'NOTSET'.
Examples

For normal logging of rsciio functions, you can set the log level like this:

>>> import rsciio
>>> rsciio.set_log_level('INFO')
>>> from rsciio.digitalmicrograph import file_reader
>>> file_reader('my_file.dm3')
INFO:rsciio.digital_micrograph:DM version: 3
INFO:rsciio.digital_micrograph:size 4796607 B
INFO:rsciio.digital_micrograph:Is file Little endian? True
INFO:rsciio.digital_micrograph:Total tags in root group: 15
Test utility functions
- rsciio.tests.registry_utils.download_all(pooch_object=None, ignore_hash=None, show_progressbar=True)
Download all test data if they are not already locally available in the rsciio.tests.data folder.
- Parameters:
  - pooch_object : pooch.Pooch or None, default=None
    The registry to be used. If None, a RosettaSciIO registry will be used.
  - ignore_hash : bool or None, default=None
    Don't compare the hash of the downloaded file with the corresponding hash in the registry. On Windows, the hash comparison will fail for non-binary files because of differences in line endings. If None, the comparison will only be used on Unix systems.
  - show_progressbar : bool, default=True
    Whether to show the progressbar or not.
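Examples

Fetching all test data into the local cache without a progress bar:

>>> from rsciio.tests.registry_utils import download_all
>>> download_all(show_progressbar=False)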
- rsciio.tests.registry_utils.make_registry(directory, output, recursive=True, exclude_pattern=None)
Make a registry of files and hashes for the given directory.
This is helpful if you have many files in your test dataset, as it keeps you from needing to manually update the registry.
- Parameters:
  - directory : str
    Directory of the test data to put in the registry. All file names in the registry will be relative to this directory.
  - output : str
    Name of the output registry file.
  - recursive : bool
    If True, recursively look for files in subdirectories of directory.
  - exclude_pattern : list or None
    List of patterns to exclude.

Notes

Adapted from fatiando/pooch (BSD-3-Clause).
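Examples

A minimal sketch; the directory and output paths are illustrative assumptions:

>>> from rsciio.tests.registry_utils import make_registry
>>> make_registry("rsciio/tests/data", "registry.txt", exclude_pattern=[".git"])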
- rsciio.tests.registry_utils.update_registry()
Update the rsciio.tests.registry.txt file, which is required after adding or updating test data files.
Unix systems only. This is not supported on Windows, because the hash comparison will fail for non-binary files due to differences in line endings.
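Examples

After adding or updating test data files, the update is a single call (Unix systems only):

>>> from rsciio.tests.registry_utils import update_registry
>>> update_registry()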