rsciio.utils#

RosettaSciIO provides a number of utility functions that are applicable to multiple formats, e.g. for the HDF5 format, on which several plugins are based.

rsciio.utils.file

Utility functions for file handling.

rsciio.utils.hdf5

Utility functions for HDF5 file inspection.

rsciio.utils.path

Utility functions for path handling.

rsciio.utils.rgb

Utility functions for RGB array handling.

rsciio.utils.xml

Utility functions for XML handling.

File#

rsciio.utils.file.get_file_handle(data[, warn])

Return file handle of a dask array when possible.

rsciio.utils.file.inspect_npy_bytes(file_)

Inspect a .npy byte stream to extract metadata such as data offset, shape, and dtype.

rsciio.utils.file.memmap_distributed(...[, ...])

Drop-in replacement for numpy.memmap allowing for distributed loading of data.

Utility functions for file handling.

rsciio.utils.file.get_file_handle(data, warn=True)#

Return the file handle of a dask array when possible. Currently, only HDF5 and TIFF files are supported.

Parameters:
data : dask.array.Array

The dask array from which the file handle will be retrieved.

warn : bool

Whether to warn or not when the file handle can’t be retrieved. Default is True.

Returns:
File handle or None

The file handle of the file when possible.

rsciio.utils.file.inspect_npy_bytes(file_: IO[bytes]) → tuple[int, tuple[int, ...], str]#

Inspect a .npy byte stream to extract metadata such as data offset, shape, and dtype.

Warning

After calling this function, the file_ stream position is advanced to the start of the data.

Parameters:
file_ : file handle

The .npy byte stream to inspect.

Returns:
tuple

A tuple containing:

  • offset (int): The byte offset where the data starts.

  • shape (tuple): The shape of the array stored in the file.

  • dtype (str): The data type of the array elements.

Examples

>>> with open('example.npy', 'rb') as f:
...     offset, shape, dtype = inspect_npy_bytes(f)
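The same header fields can also be recovered with NumPy's own numpy.lib.format helpers; the following is a sketch of that inspection, not the RosettaSciIO implementation:

```python
import io

import numpy as np
from numpy.lib import format as npy_format

# Serialize a small array to an in-memory .npy byte stream.
buffer = io.BytesIO()
np.save(buffer, np.arange(6, dtype=np.float64).reshape(2, 3))
buffer.seek(0)

# Read the magic string and the version-1.0 header, as np.load would.
npy_format.read_magic(buffer)
shape, fortran_order, dtype = npy_format.read_array_header_1_0(buffer)

# The stream now points at the raw data, so tell() gives the data offset.
offset = buffer.tell()
print(offset, shape, dtype)
```

As with inspect_npy_bytes(), note that the stream position ends up at the start of the data.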
rsciio.utils.file.memmap_distributed(filename, dtype, positions=None, offset=0, shape=None, order='C', chunks='auto', block_size_limit=None, key=None)#

Drop-in replacement for numpy.memmap allowing for distributed loading of data.

This always loads the data using dask, which can be beneficial in many cases but may not be ideal in others. The chunks and block_size_limit arguments describe an ideal chunk shape and size, as defined by the dask.array.core.normalize_chunks() function.

Parameters:
filename : str

Path to the file.

dtype : numpy.dtype

Data type of the data for memmap function.

positions : array_like, optional

A numpy array of the form [[x1, y1], [x2, y2], …], where x, y map each frame to its real-space coordinate in the data. The default is None.

offset : int, optional

Offset in bytes. The default is 0.

shape : tuple, optional

Shape of the data to be read. The default is None.

order : str, optional

Order of the data. The default is “C”; see numpy.memmap for more details.

chunks : tuple, optional

Chunk shape. The default is “auto”.

block_size_limit : int, optional

Maximum size of a block in bytes. The default is None.

key : None or str

For structured dtype only. Specify the key of the structured dtype to use.

Returns:
dask.array.Array

Dask array of the data from the memory-mapped file, with the specified chunks.

Notes

Currently, dask.array.map_blocks() does not allow multiple outputs. As a result, for a structured dtype, the key of the structured dtype needs to be specified. For example, with dtype = [(“data”, int, (128, 128)), (“sec”, “<u4”, 512)], either “data” or “sec” will need to be specified.
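For the structured-dtype case described above, the plain-NumPy equivalent (without the distributed dask layer) looks roughly like this; the file name and the small record dtype are illustrative only:

```python
import os
import tempfile

import numpy as np

# Hypothetical structured record: a small "frame" plus a timestamp field.
dt = np.dtype([("data", np.uint16, (4, 4)), ("sec", "<u4")])

records = np.zeros(10, dtype=dt)
records["sec"] = np.arange(10)

path = os.path.join(tempfile.mkdtemp(), "frames.raw")
records.tofile(path)

# Memory-map the raw file with the same dtype, then select one key,
# analogous to what the `key` argument selects per block.
mm = np.memmap(path, dtype=dt, mode="r")
seconds = np.asarray(mm["sec"])
print(seconds)  # [0 1 2 3 4 5 6 7 8 9]
```

memmap_distributed() wraps this pattern in dask so each chunk is read lazily on the workers.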

HDF5#

rsciio.utils.hdf5.list_datasets_in_file(filename)

Read from a NeXus or .hdf file and return a list of the dataset paths.

rsciio.utils.hdf5.read_metadata_from_file(...)

Read the metadata from a NeXus or .hdf file.

Utility functions for HDF5 file inspection.

rsciio.utils.hdf5.list_datasets_in_file(filename, dataset_key=None, hardlinks_only=False, verbose=True)#

Read from a NeXus or .hdf file and return a list of the dataset paths.

This method is used to inspect the contents of an HDF5 file. It iterates through the group attributes and returns NXdata blocks or hdf datasets of size >= 2 that are not already part of NXdata blocks, as a list of entries. This is a convenience method to list the datasets present in a file rather than loading them all as signals.

Parameters:
filename : str or pathlib.Path

Filename of the file to read or corresponding pathlib.Path.

dataset_key : str, list of str or None, default=None

If a str or list of str is provided, only items whose path contains one of the strings are returned. For example, dataset_key = [“instrument”, “Fe”] will only return hdf entries with “instrument” or “Fe” somewhere in their hdf path.

hardlinks_only : bool, default=False

If True, any links (soft or external) will be ignored when loading.

verbose : bool, default=True

Prints the results to screen.

Returns:
list

List of paths to datasets.

See also

rsciio.utils.hdf5.read_metadata_from_file

Convenience function to read metadata present in a file.
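The core of such an inspection can be sketched with h5py directly; the file and dataset names below are made up for illustration:

```python
import os
import tempfile

import h5py

# Build a small HDF5 file with nested groups and two datasets.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("entry/instrument/data", data=[[1, 2], [3, 4]])
    f.create_dataset("entry/title", data=b"test")

# Walk the file and collect the paths of all datasets.
dataset_paths = []
with h5py.File(path, "r") as f:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            dataset_paths.append("/" + name)
    f.visititems(visit)

print(dataset_paths)
```

list_datasets_in_file() adds the NXdata-aware filtering and dataset_key matching on top of this traversal.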

rsciio.utils.hdf5.read_metadata_from_file(filename, lazy=False, metadata_key=None, verbose=False, skip_array_metadata=False)#

Read the metadata from a NeXus or .hdf file.

This method iterates through the hdf5 file and returns a dictionary of the entries. This is a convenience method to inspect a file for a value rather than loading the file as a signal.

Parameters:
filename : str or pathlib.Path

Filename of the file to read or corresponding pathlib.Path.

lazy : bool, default=False

Whether to open the file lazily or not. The file will stay open until it is closed in compute() or closed manually. get_file_handle() can be used to access the file handle and close it manually.

metadata_key : None, str or list of str, default=None

None returns all datasets found, including linked data. Providing a string or list of strings will only return items which contain the string(s). For example, metadata_key = [“instrument”, “Fe”] will return hdf entries with “instrument” or “Fe” in their hdf path.

verbose : bool, default=False

Pretty print the results to screen.

skip_array_metadata : bool, default=False

Whether to skip loading array metadata. This is useful because many large arrays may be present in the metadata, and they are redundant with the dataset itself.

Returns:
dict

Metadata dictionary.

See also

rsciio.utils.hdf5.list_datasets_in_file

Convenience function to list datasets present in a file.
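A minimal sketch of what such a metadata walk involves, using h5py alone to collect HDF5 attributes into a flat dict (the group and attribute names are illustrative):

```python
import os
import tempfile

import h5py

# Build a small HDF5 file carrying one metadata attribute.
path = os.path.join(tempfile.mkdtemp(), "meta.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("entry/instrument")
    grp.attrs["beam_energy"] = 200.0

# Collect every attribute of every group/dataset into one dictionary.
metadata = {}
with h5py.File(path, "r") as f:
    def collect(name, obj):
        for key, value in obj.attrs.items():
            metadata[f"{name}/{key}"] = value
    f.visititems(collect)

print(metadata)
```

read_metadata_from_file() returns a nested dictionary rather than this flat one, and additionally handles datasets, links, and the metadata_key filtering.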

Path#

rsciio.utils.path.append2pathname(filename, ...)

Append a string to a path name.

rsciio.utils.path.ensure_directory(path)

Check if the path exists and, if it does not, create the directory.

rsciio.utils.path.incremental_filename(filename)

If a file with the same file name exists, return a new filename that does not exist.

rsciio.utils.path.overwrite(filename)

If file 'filename' exists, ask whether to overwrite and return True or False; otherwise return True.

Utility functions for path handling.

rsciio.utils.path.append2pathname(filename, to_append)#

Append a string to a path name.

Parameters:
filename : str

The original file name.

to_append : str

The string to append to the file name.

Returns:
pathlib.Path

The new file name with the appended string.

rsciio.utils.path.ensure_directory(path)#

Check if the path exists and, if it does not, create the directory.

Parameters:
path : str or pathlib.Path

The path to check and create if it does not exist.

rsciio.utils.path.incremental_filename(filename, i=1)#

If a file with the same file name exists, return a new filename that does not exist.

The new file name is created by appending -n (where n is an integer) to the path name.

Parameters:
filename : str

The original file name.

i : int

The number to be appended.

Returns:
pathlib.Path

The new file name with the appended number.
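The “-n” appending behaviour can be sketched with pathlib alone; this is a hypothetical reimplementation for illustration, not the actual code:

```python
import tempfile
from pathlib import Path

def incremental_filename_sketch(filename, i=1):
    # Append "-i" before the suffix, incrementing i until the name is free.
    p = Path(filename)
    while True:
        candidate = p.with_name(f"{p.stem}-{i}{p.suffix}")
        if not candidate.exists():
            return candidate
        i += 1

workdir = Path(tempfile.mkdtemp())
(workdir / "spectrum.msa").touch()
new_name = incremental_filename_sketch(workdir / "spectrum.msa")
print(new_name.name)  # spectrum-1.msa
```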

rsciio.utils.path.overwrite(filename)#

If file ‘filename’ exists, ask whether to overwrite and return True or False; otherwise return True.

Parameters:
filename : str or pathlib.Path

File to check for overwriting.

Returns:
bool

Whether to overwrite file.

RGB#

rsciio.utils.rgb.is_rgb(array)

Check if the array is an RGB structured numpy array.

rsciio.utils.rgb.is_rgba(array)

Check if the array is an RGBA structured numpy array.

rsciio.utils.rgb.is_rgbx(array)

Check if the array is an RGB or RGBA structured numpy array.

rsciio.utils.rgb.regular_array2rgbx(data)

Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.

rsciio.utils.rgb.rgbx2regular_array(data[, ...])

Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.

rsciio.utils.rgb.RGB_DTYPES

Mapping of RGB color space names to their corresponding numpy structured dtypes.

Utility functions for RGB array handling.

rsciio.utils.rgb.RGB_DTYPES#

Mapping of RGB color space names to their corresponding numpy structured dtypes.

rsciio.utils.rgb.is_rgb(array)#

Check if the array is an RGB structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGB, False otherwise.

rsciio.utils.rgb.is_rgba(array)#

Check if the array is an RGBA structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGBA, False otherwise.

rsciio.utils.rgb.is_rgbx(array)#

Check if the array is an RGB or RGBA structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGB or RGBA, False otherwise.
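These checks boil down to inspecting the field names of a structured dtype. A minimal sketch, assuming the fields are named “R”, “G”, “B” (plus “A” for RGBA); the real checks also constrain the field dtypes:

```python
import numpy as np

def is_rgbx_sketch(array):
    # Assumed criterion: structured dtype whose fields are exactly
    # (R, G, B) or (R, G, B, A). A plain dtype has names == None.
    names = array.dtype.names
    return names in (("R", "G", "B"), ("R", "G", "B", "A"))

rgb = np.zeros((2, 2), dtype=[("R", "u1"), ("G", "u1"), ("B", "u1")])
plain = np.zeros((2, 2), dtype=np.uint8)
print(is_rgbx_sketch(rgb), is_rgbx_sketch(plain))  # True False
```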

rsciio.utils.rgb.regular_array2rgbx(data)#

Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.

Parameters:
data : numpy.ndarray or dask.array.Array

The regular array to be transformed.

Returns:
numpy.ndarray or dask.array.Array

The transformed RGBx structured array.
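Under typical assumptions, this conversion amounts to a dtype view: the trailing colour-channel axis is reinterpreted as one structured element per pixel. A sketch with plain NumPy, where the field names and layout are assumed rather than taken from the actual implementation:

```python
import numpy as np

# Assumed 8-bit RGB structured dtype: three unsigned bytes per pixel.
rgb8 = np.dtype([("R", "u1"), ("G", "u1"), ("B", "u1")])

# A 2 x 2 image with an explicit colour-channel axis, all-red.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[..., 0] = 255

# View the 3 trailing bytes of each pixel as one structured element,
# then drop the now-redundant length-1 axis.
structured = np.ascontiguousarray(image).view(rgb8).reshape(image.shape[:-1])
print(structured.shape, structured["R"][0, 0])  # (2, 2) 255
```

Because it is a view, no pixel data is copied.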

rsciio.utils.rgb.rgbx2regular_array(data, plot_friendly=False, show_progressbar=True)#

Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.

Parameters:
data : numpy.ndarray or dask.array.Array

The RGB array to be transformed.

plot_friendly : bool

If True, change the dtype to float when the dtype is not uint8, and normalize the array so that it is ready to be plotted by matplotlib.

show_progressbar : bool, default=True

Whether to show the progressbar or not.

Returns:
numpy.ndarray or dask.array.Array

The transformed array with an additional dimension for the color channel.

XML#

rsciio.utils.xml.XmlToDict([...])

Customisable XML to Python dict- and list-based hierarchical tree translator.

rsciio.utils.xml.convert_xml_to_dict(xml_object)

Convert XML object to a DTBox object.

rsciio.utils.xml.sanitize_msxml_float(...)

Replace commas with dots in floating-point numbers in a given raw XML string.

rsciio.utils.xml.xml2dtb(et, dictree)

Convert XML ElementTree node to DTBox object.

Utility functions for XML handling.

class rsciio.utils.xml.XmlToDict(dub_attr_pre_str='@', dub_text_str='#value', tags_to_flatten=None, interchild_text_parsing='first')#

Customisable XML to Python dict- and list-based hierarchical tree translator.

Parameters:
dub_attr_pre_str : str

String to be prepended to the attribute name when building the dictionary tree, in case a child element with the same name is present. Default is “@”.

dub_text_str : str

String to use as the key when an element contains both text and children tags. Default is “#value”.

tags_to_flatten : None, str or list of str

Define tag names which should be flattened/skipped, placing the children of such tags one level shallower in the constructed Python structure. This is useful when OEM-generated XML is machine-, programming-language- or framework-generated and painfully verbose, rather than human-designed. See the example below. Default is None, which means no tags are flattened.

interchild_text_parsing : str

Must be one of “skip”, “first”, “cat” or “list”. This sets the behaviour when both .text and children tags are present under the same element-tree node:

  • “skip” - do not try to retrieve any .text value from such a node.

  • “first” - return only the string under the node’s .text attribute.

  • “cat” - return the concatenated string from the node’s .text and its children’s .tail values.

  • “list” - like “cat”, but return the results in a list without concatenation.

Default is “first”, which is the most common case.

Examples

Consider such redundant tree structure:

DetectorHeader
|-ClassInstances
    |-ClassInstance
    |-Type
    |-Window
    ...

It can be sanitized/simplified by setting the tags_to_flatten keyword to [“ClassInstances”, “ClassInstance”], eliminating the redundant levels of the tree with those tag names:

DetectorHeader
|-Type
|-Window
...

The produced dict/list structures are then good enough to be returned as part of the original metadata without making any more copies.

Set up the parser:

>>> from rsciio.utils.xml import XmlToDict
>>> xml_to_dict = XmlToDict(
...     dub_attr_pre_str="XmlClass",
...     tags_to_flatten=[
...         "ClassInstance", "ChildrenClassInstance", "JustAnotherRedundantTag"
...     ]
... )

Use parser:

>>> pytree = xml_to_dict.dictionarize(etree_node)
dictionarize(et_node)#

Take an etree XML node and return its conversion into a pythonic dict/list representation of that XML tree, with some sanitization.

Parameters:
et_node : xml.etree.ElementTree.Element

XML node to be converted.

Returns:
dict

Dictionary representation of the XML node.

static eval(string)#

Interpret a string and return it cast to a Python object of the appropriate type.

Parameters:
string : str

String to be interpreted.

Returns:
object

The interpreted value, cast to an appropriate type.

Notes

If this does not return the desired type, consider subclassing and reimplementing this method, for example:

class SubclassedXmlToDict(XmlToDict):
    @staticmethod
    def eval(string):
        if ...:  # condition to catch the special case
            ...
        elif ...:  # further special cases as needed
            ...
        else:
            return XmlToDict.eval(string)
rsciio.utils.xml.convert_xml_to_dict(xml_object)#

Convert XML object to a DTBox object.

Parameters:
xml_object : str or xml.etree.ElementTree.Element

XML object to be converted. It can be a string or an ElementTree node.

Returns:
DTBox

A DTBox object containing the converted XML data.

rsciio.utils.xml.sanitize_msxml_float(xml_b_string)#

Replace commas with dots in floating-point numbers in the given raw XML string.

Parameters:
xml_b_string : str

Raw binary string representing the xml to be parsed.

Returns:
str

Binary string with commas used as decimal marks replaced with dots to adhere to XML standard.

Notes

Some OEM software running on MS Windows directly uses the system's built-in MSXML library, which does not comply with XML standards. When the OS is set to a locale that uses a comma as the decimal separator, such software can produce non-interoperable XML, leading to wrong interpretation of its content. This sanitizer searches for and corrects such values; it should be applied to the raw string before it is fed to the fromstring() function of the element tree.
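The effect can be illustrated with a stdlib sketch; the regex below is a deliberately naive stand-in (it would also mangle genuine comma-separated lists, which the real sanitizer must avoid):

```python
import re
import xml.etree.ElementTree as ET

# A locale-broken value: "200,0" instead of "200.0".
raw = b'<Data><Value Unit="kV">200,0</Value></Data>'

# Naive illustration: turn a comma between digits into a dot.
fixed = re.sub(rb"(?<=\d),(?=\d)", b".", raw)

root = ET.fromstring(fixed)
print(float(root.find("Value").text))  # 200.0
```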

rsciio.utils.xml.xml2dtb(et, dictree)#

Convert XML ElementTree node to DTBox object. This is a recursive function that traverses the XML tree and populates the DTBox object with the data from the XML node.

Parameters:
et : xml.etree.ElementTree.Element

XML node to be converted.

dictree : DTBox

Box object to be populated.
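A simplified analogue of this recursion, using a plain dict instead of a DTBox and ignoring attributes and repeated tags for brevity (a sketch, not the actual implementation):

```python
import xml.etree.ElementTree as ET

def xml_to_plain_dict(node):
    # Leaf nodes contribute their text; inner nodes recurse over children.
    if len(node) == 0:
        return node.text
    return {child.tag: xml_to_plain_dict(child) for child in node}

root = ET.fromstring("<a><b><c>1</c></b><d>x</d></a>")
print(xml_to_plain_dict(root))  # {'b': {'c': '1'}, 'd': 'x'}
```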