Utility functions#
RosettaSciIO provides utility functions that are applicable to multiple formats, e.g. for the HDF5 format on which a number of plugins are based.
HDF5 utility functions#
HDF5 file inspection.
- rsciio.utils.hdf5.list_datasets_in_file(filename, dataset_key=None, hardlinks_only=False, verbose=True)#
Read from a NeXus or .hdf file and return a list of the dataset paths.
This method is used to inspect the contents of an hdf5 file. It iterates through the group attributes and returns NXdata or hdf datasets of size >= 2 if they are not already NXdata blocks. This is a convenience method to list the datasets present in a file rather than loading all of them as signals.
- Parameters:
- filename : str, pathlib.Path
Filename of the file to read or corresponding pathlib.Path.
- dataset_key : str, list of str, None, default=None
If a str or list of strings is provided, only return items whose path contains the strings. For example, dataset_key = ["instrument", "Fe"] will only return hdf entries with "instrument" or "Fe" somewhere in their hdf path.
- hardlinks_only : bool, default=False
If True, any links (soft or external) will be ignored when loading.
- verbose : bool, default=True
Print the results to screen.
- Returns:
list
List of paths to datasets.
See also
rsciio.utils.hdf5.read_metadata_from_file
Convenience function to read metadata present in a file.
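A minimal usage sketch; the filename here is a placeholder for an existing NeXus/HDF5 file, not one shipped with RosettaSciIO:
>>> from rsciio.utils.hdf5 import list_datasets_in_file
>>> # List only entries with "instrument" or "Fe" in their hdf path
>>> paths = list_datasets_in_file("experiment.nxs", dataset_key=["instrument", "Fe"])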
- rsciio.utils.hdf5.read_metadata_from_file(filename, lazy=False, metadata_key=None, verbose=False, skip_array_metadata=False)#
Read the metadata from a NeXus or .hdf file.
This method iterates through the hdf5 file and returns a dictionary of the entries. This is a convenience method to inspect a file for a value rather than loading the file as a signal.
- Parameters:
- filename : str, pathlib.Path
Filename of the file to read or corresponding pathlib.Path.
- lazy : bool, default=False
Whether to open the file lazily or not. The file will stay open until closed in compute() or closed manually. get_file_handle() can be used to access the file handle and close it manually.
- metadata_key : None, str, list of str, default=None
None will return all datasets found, including linked data. Providing a string or list of strings will only return items which contain the string(s). For example, metadata_key = ["instrument", "Fe"] will return hdf entries with "instrument" or "Fe" in their hdf path.
- verbose : bool, default=False
Pretty-print the results to screen.
- skip_array_metadata : bool, default=False
Whether to skip loading array metadata. This is useful as many large arrays may be present in the metadata, and they are redundant with the dataset itself.
- Returns:
dict
Metadata dictionary.
See also
rsciio.utils.hdf5.list_datasets_in_file
Convenience function to list datasets present in a file.
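A minimal usage sketch along the same lines (placeholder filename; skip_array_metadata avoids pulling large arrays into the returned dictionary):
>>> from rsciio.utils.hdf5 import read_metadata_from_file
>>> metadata = read_metadata_from_file("experiment.nxs", skip_array_metadata=True)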
Generic utility functions#
- rsciio.utils.tools.get_file_handle(data, warn=True)#
Return the file handle of a dask array when possible. Currently, only hdf5 and tiff files are supported.
- Parameters:
- data : dask.array.Array
The dask array from which the file handle will be retrieved.
- warn : bool, default=True
Whether to warn when the file handle cannot be retrieved.
- Returns:
- File handle or None
The file handle of the file when possible.
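A short sketch, assuming darr is a dask array backed by an hdf5 or tiff file (for example, data returned by a lazy file_reader call):
>>> from rsciio.utils.tools import get_file_handle
>>> fh = get_file_handle(darr, warn=False)
>>> if fh is not None:
...     fh.close()  # close the underlying file manually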
Distributed utility functions#
- rsciio.utils.distributed.get_chunk_slice(shape, chunks='auto', block_size_limit=None, dtype=None)#
Get chunk slices for the rsciio.utils.distributed.slice_memmap() function.
Takes a shape and chunks and returns a dask array of the slices to be used with the rsciio.utils.distributed.slice_memmap() function. This is useful for loading data from a memory-mapped file in a distributed manner.
- Parameters:
- shape : tuple
Shape of the data.
- chunks : tuple, optional
Chunk shape. The default is "auto".
- block_size_limit : int, optional
Maximum size of a block in bytes. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
- dtype : numpy.dtype, optional
Data type. The default is None. This is passed to the dask.array.core.normalize_chunks() function when chunks == "auto".
- Returns:
dask.array.Array
Dask array of the slices.
tuple
Tuple of the chunks.
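A short sketch with an illustrative shape and dtype; the returned slices feed directly into rsciio.utils.distributed.slice_memmap():
>>> import numpy as np
>>> from rsciio.utils.distributed import get_chunk_slice
>>> slices, chunks = get_chunk_slice(shape=(64, 64, 128, 128), dtype=np.uint16)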
- rsciio.utils.distributed.memmap_distributed(filename, dtype, offset=0, shape=None, order='C', chunks='auto', block_size_limit=None, key=None)#
Drop-in replacement for numpy.memmap allowing for distributed loading of data.
This always loads the data using dask, which can be beneficial in many cases, but may not be ideal in others. The chunks and block_size_limit arguments describe an ideal chunk shape and size, as defined using the dask.array.core.normalize_chunks() function.
- Parameters:
- filename : str
Path to the file.
- dtype : numpy.dtype
Data type of the data for the memmap function.
- offset : int, optional
Offset in bytes. The default is 0.
- shape : tuple, optional
Shape of the data to be read. The default is None.
- order : str, optional
Order of the data. The default is "C"; see numpy.memmap for more details.
- chunks : tuple, optional
Chunk shape. The default is "auto".
- block_size_limit : int, optional
Maximum size of a block in bytes. The default is None.
- key : None, str
For structured dtype only. Specify the key of the structured dtype to use.
- Returns:
dask.array.Array
Dask array of the data from the memory-mapped file, with the specified chunks.
Notes
Currently, dask.array.map_blocks() does not allow for multiple outputs. As a result, in the case of a structured dtype, the key of the structured dtype needs to be specified. For example, with dtype = (("data", int, (128, 128)), ("sec", "<u4", 512)), either "data" or "sec" will need to be specified.
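A minimal sketch; "raw_data.bin", the shape and the dtype are placeholders for a real raw binary file:
>>> import numpy as np
>>> from rsciio.utils.distributed import memmap_distributed
>>> data = memmap_distributed("raw_data.bin", dtype=np.uint16, shape=(128, 128, 256, 256))
>>> frame = data[0, 0].compute()  # materialize one 256 x 256 frame of the lazy array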
- rsciio.utils.distributed.slice_memmap(slices, file, dtypes, shape, key=None, **kwargs)#
Slice a memory-mapped file using a tuple of slices.
This is useful for loading data from a memory-mapped file in a distributed manner. The function first creates a memory-mapped array of the entire dataset and then uses the slices to slice the memory-mapped array. The slices can be used to build a dask array, as each slice translates to one chunk of the dask array.
- Parameters:
- slices : array_like of int
An array of the slices to use. The dimensions of the array should be (n, 2), where n is the number of dimensions of the data. The first column is the start of the slice and the second column is the stop of the slice.
- file : str
Path to the file.
- dtypes : numpy.dtype
Data type of the data for the numpy.memmap function.
- shape : tuple
Shape of the entire dataset. Passed to the numpy.memmap function.
- key : None, str
For structured dtype only. Specify the key of the structured dtype to use.
- **kwargs : dict
Additional keyword arguments to pass to the numpy.memmap function.
- Returns:
numpy.ndarray
Array of the data from the memory-mapped file, sliced using the provided slices.
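A minimal sketch reading one block out of a placeholder raw file; the (n, 2) slices array gives the start/stop per dimension:
>>> import numpy as np
>>> from rsciio.utils.distributed import slice_memmap
>>> # Read the top-left 64 x 64 block of a 512 x 512 uint16 raw file
>>> block = slice_memmap(np.array([[0, 64], [0, 64]]), "raw_data.bin", dtypes=np.uint16, shape=(512, 512))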
Logging#
- rsciio.set_log_level(level)#
Convenience function to set the log level of all rsciio modules.
Note: The log level of all other modules is left untouched.
- Parameters:
- level : int, str
The log level to set. Any values that logging.Logger.setLevel() accepts are valid.
Examples
For normal logging of rsciio functions, you can set the log level like this:
>>> import rsciio
>>> rsciio.set_log_level('INFO')
>>> from rsciio.digitalmicrograph import file_reader
>>> file_reader('my_file.dm3')
INFO:rsciio.digital_micrograph:DM version: 3
INFO:rsciio.digital_micrograph:size 4796607 B
INFO:rsciio.digital_micrograph:Is file Little endian? True
INFO:rsciio.digital_micrograph:Total tags in root group: 15
Test utility functions#
- rsciio.tests.registry_utils.download_all(pooch_object=None, ignore_hash=None, show_progressbar=True)#
Download all test data if they are not already locally available in the rsciio.tests.data folder.
- Parameters:
- pooch_object : pooch.Pooch or None, default=None
The registry to be used. If None, a RosettaSciIO registry will be used.
- ignore_hash : bool or None, default=None
Don't compare the hash of the downloaded file with the corresponding hash in the registry. On Windows, the hash comparison will fail for non-binary files because of differences in line endings. If None, the comparison will only be used on unix systems.
- show_progressbar : bool, default=True
Whether to show the progress bar or not.
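A minimal sketch; this downloads any missing test files into the rsciio.tests.data folder:
>>> from rsciio.tests.registry_utils import download_all
>>> download_all(show_progressbar=False)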
- rsciio.tests.registry_utils.make_registry(directory, output, recursive=True, exclude_pattern=None)#
Make a registry of files and hashes for the given directory.
This is helpful if you have many files in your test dataset as it keeps you from needing to manually update the registry.
- Parameters:
- directory : str
Directory of the test data to put in the registry. All file names in the registry will be relative to this directory.
- output : str
Name of the output registry file.
- recursive : bool
If True, recursively look for files in subdirectories of directory.
- exclude_pattern : list or None
List of patterns to exclude.
Notes
Adapted from fatiando/pooch (BSD-3-Clause).
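A minimal sketch with placeholder paths:
>>> from rsciio.tests.registry_utils import make_registry
>>> # Writes file names and hashes, relative to the given directory, into "registry.txt"
>>> make_registry("rsciio/tests/data", "registry.txt", exclude_pattern=[".git"])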
- rsciio.tests.registry_utils.update_registry()#
Update the rsciio.tests.registry.txt file, which is required after adding or updating test data files.
Unix systems only. This is not supported on Windows, because the hash comparison will fail for non-binary files because of differences in line endings.
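A minimal sketch; run from a clone of the RosettaSciIO repository on a unix system:
>>> from rsciio.tests.registry_utils import update_registry
>>> update_registry()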