rsciio.utils#

RosettaSciIO provides a number of utility functions that are applicable to multiple formats, e.g. for the HDF5 format, on which several plugins are based.

rsciio.utils.file

Utility functions for file handling.

rsciio.utils.hdf5

Utility functions for HDF5 file inspection.

rsciio.utils.path

Utility functions for path handling.

rsciio.utils.rgb

Utility functions for RGB array handling.

rsciio.utils.xml

Utility functions for XML handling.

File#

rsciio.utils.file.get_file_handle(data[, warn])

Return file handle of a dask array when possible.

rsciio.utils.file.inspect_npy_bytes(file_)

Inspect a .npy byte stream to extract metadata such as data offset, shape, and dtype.

rsciio.utils.file.memmap_distributed(...[, ...])

Drop-in replacement for numpy.memmap allowing for distributed loading of data.

Utility functions for file handling.

rsciio.utils.file.get_file_handle(data, warn=True)#

Return the file handle of a dask array when possible. Currently, only HDF5 and TIFF files are supported.

Parameters:
data : dask.array.Array

The dask array from which the file handle will be retrieved.

warn : bool

Whether to warn or not when the file handle can’t be retrieved. Default is True.

Returns:
File handle or None

The file handle of the file when possible.

rsciio.utils.file.inspect_npy_bytes(file_: IO[bytes]) → tuple[int, tuple[int, ...], str]#

Inspect a .npy byte stream to extract metadata such as data offset, shape, and dtype.

Warning

After calling this function, the file_ stream position is advanced to the start of the data.

Parameters:
file_ : file handle

The .npy byte stream to inspect.

Returns:
tuple

A tuple containing:

  • offset (int): The byte offset where the data starts.

  • shape (tuple): The shape of the array stored in the file.

  • dtype (str): The data type of the array elements.

Examples

>>> with open('example.npy', 'rb') as f:
...     offset, shape, dtype = inspect_npy_bytes(f)
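The same header fields can also be recovered with NumPy's own numpy.lib.format helpers; the following is a sketch of that inspection, not the RosettaSciIO implementation:

```python
import io

import numpy as np
from numpy.lib import format as npy_format

# Serialize a small array to an in-memory .npy byte stream.
buffer = io.BytesIO()
np.save(buffer, np.arange(6, dtype=np.float64).reshape(2, 3))
buffer.seek(0)

# Read the magic string and the version-1.0 header, as np.load would.
npy_format.read_magic(buffer)
shape, fortran_order, dtype = npy_format.read_array_header_1_0(buffer)

# The stream now points at the raw data, so tell() gives the data offset.
offset = buffer.tell()
print(offset, shape, dtype)
```

As with inspect_npy_bytes(), note that the stream position ends up at the start of the data.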
rsciio.utils.file.memmap_distributed(filename, dtype, positions=None, offset=0, shape=None, order='C', chunks='auto', block_size_limit=None, key=None)#

Drop-in replacement for numpy.memmap allowing for distributed loading of data.

This always loads the data using dask, which can be beneficial in many cases but may not be ideal in others. The chunks and block_size_limit arguments describe an ideal chunk shape and size, as defined by the dask.array.core.normalize_chunks() function.

Parameters:
filename : str

Path to the file.

dtype : numpy.dtype

Data type of the data for memmap function.

positions : array_like, optional

A numpy array of the form [[x1, y1], [x2, y2], …], where x, y map each frame to its real-space coordinate in the data. The default is None.

offset : int, optional

Offset in bytes. The default is 0.

shape : tuple, optional

Shape of the data to be read. The default is None.

order : str, optional

Order of the data. The default is “C”; see numpy.memmap for more details.

chunks : tuple, optional

Chunk shape. The default is “auto”.

block_size_limit : int, optional

Maximum size of a block in bytes. The default is None.

key : None or str

For structured dtype only. Specify the key of the structured dtype to use.

Returns:
dask.array.Array

Dask array of the data from the memory-mapped file, with the specified chunks.

Notes

Currently, dask.array.map_blocks() does not allow multiple outputs. As a result, for a structured dtype, the key of the structured dtype needs to be specified. For example, with dtype = [(“data”, int, (128, 128)), (“sec”, “<u4”, 512)], either “data” or “sec” will need to be specified.
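For the structured-dtype case described above, the plain-NumPy equivalent (without the distributed dask layer) looks roughly like this; the file name and the small record dtype are illustrative only:

```python
import os
import tempfile

import numpy as np

# Hypothetical structured record: a small "frame" plus a timestamp field.
dt = np.dtype([("data", np.uint16, (4, 4)), ("sec", "<u4")])

records = np.zeros(10, dtype=dt)
records["sec"] = np.arange(10)

path = os.path.join(tempfile.mkdtemp(), "frames.raw")
records.tofile(path)

# Memory-map the raw file with the same dtype, then select one key,
# analogous to what the `key` argument selects per block.
mm = np.memmap(path, dtype=dt, mode="r")
seconds = np.asarray(mm["sec"])
print(seconds)  # [0 1 2 3 4 5 6 7 8 9]
```

memmap_distributed() wraps this pattern in dask so each chunk is read lazily on the workers.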

HDF5#

rsciio.utils.hdf5.list_datasets_in_file(filename)

Read from a NeXus or .hdf file and return a list of the dataset paths.

rsciio.utils.hdf5.read_metadata_from_file(...)

Read the metadata from a NeXus or .hdf file.

Utility functions for HDF5 file inspection.

rsciio.utils.hdf5.list_datasets_in_file(filename, dataset_key=None, hardlinks_only=False, verbose=True)#

Read from a NeXus or .hdf file and return a list of the dataset paths.

This method is used to inspect the contents of an HDF5 file. It iterates through the group attributes and returns NXdata blocks or hdf datasets of size >= 2 that are not already part of NXdata blocks, as a list of entries. This is a convenience method to list the datasets present in a file rather than loading them all as signals.

Parameters:
filename : str or pathlib.Path

Filename of the file to read or corresponding pathlib.Path.

dataset_key : str, list of str or None, default=None

If a str or list of str is provided, only items whose path contains one of the strings are returned. For example, dataset_key = [“instrument”, “Fe”] will only return hdf entries with “instrument” or “Fe” somewhere in their hdf path.

hardlinks_only : bool, default=False

If True, any links (soft or external) will be ignored when loading.

verbose : bool, default=True

Prints the results to screen.

Returns:
list

List of paths to datasets.

See also

rsciio.utils.hdf5.read_metadata_from_file

Convenience function to read metadata present in a file.
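The core of such an inspection can be sketched with h5py directly; the file and dataset names below are made up for illustration:

```python
import os
import tempfile

import h5py

# Build a small HDF5 file with nested groups and two datasets.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("entry/instrument/data", data=[[1, 2], [3, 4]])
    f.create_dataset("entry/title", data=b"test")

# Walk the file and collect the paths of all datasets.
dataset_paths = []
with h5py.File(path, "r") as f:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            dataset_paths.append("/" + name)
    f.visititems(visit)

print(dataset_paths)
```

list_datasets_in_file() adds the NXdata-aware filtering and dataset_key matching on top of this traversal.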

rsciio.utils.hdf5.read_metadata_from_file(filename, lazy=False, metadata_key=None, verbose=False, skip_array_metadata=False)#

Read the metadata from a NeXus or .hdf file.

This method iterates through the hdf5 file and returns a dictionary of the entries. This is a convenience method to inspect a file for a value rather than loading the file as a signal.

Parameters:
filename : str or pathlib.Path

Filename of the file to read or corresponding pathlib.Path.

lazy : bool, default=False

Whether to open the file lazily or not. The file will stay open until it is closed in compute() or closed manually. get_file_handle() can be used to access the file handle and close it manually.

metadata_key : None, str or list of str, default=None

None returns all datasets found, including linked data. Providing a string or list of strings will only return items which contain the string(s). For example, metadata_key = [“instrument”, “Fe”] will return hdf entries with “instrument” or “Fe” in their hdf path.

verbose : bool, default=False

Pretty print the results to screen.

skip_array_metadata : bool, default=False

Whether to skip loading array metadata. This is useful because many large arrays may be present in the metadata, and they are redundant with the dataset itself.

Returns:
dict

Metadata dictionary.

See also

rsciio.utils.hdf5.list_datasets_in_file

Convenience function to list datasets present in a file.
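A minimal sketch of what such a metadata walk involves, using h5py alone to collect HDF5 attributes into a flat dict (the group and attribute names are illustrative):

```python
import os
import tempfile

import h5py

# Build a small HDF5 file carrying one metadata attribute.
path = os.path.join(tempfile.mkdtemp(), "meta.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("entry/instrument")
    grp.attrs["beam_energy"] = 200.0

# Collect every attribute of every group/dataset into one dictionary.
metadata = {}
with h5py.File(path, "r") as f:
    def collect(name, obj):
        for key, value in obj.attrs.items():
            metadata[f"{name}/{key}"] = value
    f.visititems(collect)

print(metadata)
```

read_metadata_from_file() returns a nested dictionary rather than this flat one, and additionally handles datasets, links, and the metadata_key filtering.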

Path#

rsciio.utils.path.append2pathname(filename, ...)

Append a string to a path name.

rsciio.utils.path.ensure_directory(path)

Check if the path exists and, if it does not, create the directory.

rsciio.utils.path.incremental_filename(filename)

If a file with the same file name exists, return a new filename that does not exist.

rsciio.utils.path.overwrite(filename)

If file 'filename' exists, ask whether to overwrite and return True or False; otherwise return True.

Utility functions for path handling.

rsciio.utils.path.append2pathname(filename, to_append)#

Append a string to a path name.

Parameters:
filename : str

The original file name.

to_append : str

The string to append to the file name.

Returns:
pathlib.Path

The new file name with the appended string.

rsciio.utils.path.ensure_directory(path)#

Check if the path exists and, if it does not, create the directory.

Parameters:
path : str or pathlib.Path

The path to check and create if it does not exist.

rsciio.utils.path.incremental_filename(filename, i=1)#

If a file with the same file name exists, return a new filename that does not exist.

The new file name is created by appending -n (where n is an integer) to the path name.

Parameters:
filename : str

The original file name.

i : int

The number to be appended.

Returns:
pathlib.Path

The new file name with the appended number.
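The “-n” appending behaviour can be sketched with pathlib alone; this is a hypothetical reimplementation for illustration, not the actual code:

```python
import tempfile
from pathlib import Path

def incremental_filename_sketch(filename, i=1):
    # Append "-i" before the suffix, incrementing i until the name is free.
    p = Path(filename)
    while True:
        candidate = p.with_name(f"{p.stem}-{i}{p.suffix}")
        if not candidate.exists():
            return candidate
        i += 1

workdir = Path(tempfile.mkdtemp())
(workdir / "spectrum.msa").touch()
new_name = incremental_filename_sketch(workdir / "spectrum.msa")
print(new_name.name)  # spectrum-1.msa
```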

rsciio.utils.path.overwrite(filename)#

If file ‘filename’ exists, ask whether to overwrite and return True or False; otherwise return True.

Parameters:
filename : str or pathlib.Path

File to check for overwriting.

Returns:
bool

Whether to overwrite file.

RGB#

rsciio.utils.rgb.is_rgb(array)

Check if the array is an RGB structured numpy array.

rsciio.utils.rgb.is_rgba(array)

Check if the array is an RGBA structured numpy array.

rsciio.utils.rgb.is_rgbx(array)

Check if the array is an RGB or RGBA structured numpy array.

rsciio.utils.rgb.regular_array2rgbx(data)

Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.

rsciio.utils.rgb.rgbx2regular_array(data[, ...])

Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.

rsciio.utils.rgb.RGB_DTYPES

Mapping of RGB color space names to their corresponding numpy structured dtypes.

Utility functions for RGB array handling.

rsciio.utils.rgb.RGB_DTYPES#

Mapping of RGB color space names to their corresponding numpy structured dtypes.

rsciio.utils.rgb.is_rgb(array)#

Check if the array is an RGB structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGB, False otherwise.

rsciio.utils.rgb.is_rgba(array)#

Check if the array is an RGBA structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGBA, False otherwise.

rsciio.utils.rgb.is_rgbx(array)#

Check if the array is an RGB or RGBA structured numpy array.

Parameters:
array : numpy.ndarray

The array to check.

Returns:
bool

True if the array is RGB or RGBA, False otherwise.
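These checks boil down to inspecting the field names of a structured dtype. A minimal sketch, assuming the fields are named “R”, “G”, “B” (plus “A” for RGBA); the real checks also constrain the field dtypes:

```python
import numpy as np

def is_rgbx_sketch(array):
    # Assumed criterion: structured dtype whose fields are exactly
    # (R, G, B) or (R, G, B, A). A plain dtype has names == None.
    names = array.dtype.names
    return names in (("R", "G", "B"), ("R", "G", "B", "A"))

rgb = np.zeros((2, 2), dtype=[("R", "u1"), ("G", "u1"), ("B", "u1")])
plain = np.zeros((2, 2), dtype=np.uint8)
print(is_rgbx_sketch(rgb), is_rgbx_sketch(plain))  # True False
```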

rsciio.utils.rgb.regular_array2rgbx(data)#

Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.

Parameters:
data : numpy.ndarray or dask.array.Array

The regular array to be transformed.

Returns:
numpy.ndarray or dask.array.Array

The transformed RGBx structured array.
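Under typical assumptions, this conversion amounts to a dtype view: the trailing colour-channel axis is reinterpreted as one structured element per pixel. A sketch with plain NumPy, where the field names and layout are assumed rather than taken from the actual implementation:

```python
import numpy as np

# Assumed 8-bit RGB structured dtype: three unsigned bytes per pixel.
rgb8 = np.dtype([("R", "u1"), ("G", "u1"), ("B", "u1")])

# A 2 x 2 image with an explicit colour-channel axis, all-red.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[..., 0] = 255

# View the 3 trailing bytes of each pixel as one structured element,
# then drop the now-redundant length-1 axis.
structured = np.ascontiguousarray(image).view(rgb8).reshape(image.shape[:-1])
print(structured.shape, structured["R"][0, 0])  # (2, 2) 255
```

Because it is a view, no pixel data is copied.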

rsciio.utils.rgb.rgbx2regular_array(data, plot_friendly=False, show_progressbar=True)#

Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.

Parameters:
data : numpy.ndarray or dask.array.Array

The RGB array to be transformed.

plot_friendly : bool

If True, change the dtype to float when the dtype is not uint8, and normalize the array so that it is ready to be plotted by matplotlib.

show_progressbar : bool, default=True

Whether to show the progressbar or not.

Returns:
numpy.ndarray or dask.array.Array

The transformed array with an additional dimension for the color channel.

XML#

rsciio.utils.xml.XmlToDict([...])

Customisable XML to Python dict- and list-based hierarchical tree translator.

rsciio.utils.xml.convert_xml_to_dict(xml_object)

Convert XML object to a DTBox object.

rsciio.utils.xml.sanitize_msxml_float(...)

Replace commas with dots in floating-point numbers in a given raw XML string.

rsciio.utils.xml.xml2dtb(et, dictree)

Convert XML ElementTree node to DTBox object.

Utility functions for XML handling.

class rsciio.utils.xml.XmlToDict(dub_attr_pre_str='@', dub_text_str='#value', tags_to_flatten=None, interchild_text_parsing='first')#

Customisable XML to Python dict- and list-based hierarchical tree translator.

Parameters:
dub_attr_pre_str : str

String to be prepended to the attribute name when building the dictionary tree, in case a child element with the same name is present. Default is “@”.

dub_text_str : str

String to use as the key when an element contains both text and children tags. Default is “#value”.

tags_to_flatten : None, str or list of str

Define tag names which should be flattened/skipped, placing the children of such tags one level shallower in the constructed Python structure. This is useful when OEM-generated XML is machine-, programming-language- or framework-generated and painfully verbose, rather than human-designed. See the example below. Default is None, which means no tags are flattened.

interchild_text_parsing : str

Must be one of “skip”, “first”, “cat” or “list”. This sets the behaviour when both .text and children tags are present under the same element-tree node:

  • “skip” - do not try to retrieve any .text value from such a node.

  • “first” - return only the string under the node’s .text attribute.

  • “cat” - return the concatenated string from the node’s .text and its children’s .tail values.

  • “list” - like “cat”, but return the results in a list without concatenation.

Default is “first”, which is the most common case.

Examples

Consider such redundant tree structure:

DetectorHeader
|-ClassInstances
    |-ClassInstance
    |-Type
    |-Window
    ...

It can be sanitized/simplified by setting the tags_to_flatten keyword to [“ClassInstances”, “ClassInstance”], eliminating the redundant levels of the tree with those tag names:

DetectorHeader
|-Type
|-Window
...

The produced dict/list structures are then good enough to be returned as part of the original metadata without making any more copies.

Set up the parser:

>>> from rsciio.utils.xml import XmlToDict
>>> xml_to_dict = XmlToDict(
...     dub_attr_pre_str="XmlClass",
...     tags_to_flatten=[
...         "ClassInstance", "ChildrenClassInstance", "JustAnotherRedundantTag"
...     ]
... )

Use parser:

>>> pytree = xml_to_dict.dictionarize(etree_node)
dictionarize(et_node)#

Take an etree XML node and return its conversion into a pythonic dict/list representation of that XML tree, with some sanitization.

Parameters:
et_node : xml.etree.ElementTree.Element

XML node to be converted.

Returns:
dict

Dictionary representation of the XML node.

static eval(string)#

Interpret a string and return it cast to a Python object of the appropriate type.

Parameters:
string : str

String to be interpreted.

Returns:
object

The interpreted value, cast to an appropriate type.

Notes

If this does not return the desired type, consider subclassing and reimplementing this method, for example:

class SubclassedXmlToDict(XmlToDict):
    @staticmethod
    def eval(string):
        if ...:  # condition to catch the special case
            ...
        elif ...:  # further special cases as needed
            ...
        else:
            return XmlToDict.eval(string)
rsciio.utils.xml.convert_xml_to_dict(xml_object)#

Convert XML object to a DTBox object.

Parameters:
xml_object : str or xml.etree.ElementTree.Element

XML object to be converted. It can be a string or an ElementTree node.

Returns:
DTBox

A DTBox object containing the converted XML data.

rsciio.utils.xml.sanitize_msxml_float(xml_b_string)#

Replace commas with dots in floating-point numbers in the given raw XML string.

Parameters:
xml_b_string : str

Raw binary string representing the xml to be parsed.

Returns:
str

Binary string with commas used as decimal marks replaced with dots to adhere to XML standard.

Notes

Some OEM software running on MS Windows directly uses the system's built-in MSXML library, which does not comply with XML standards. When the OS is set to a locale that uses a comma as the decimal separator, such software can produce non-interoperable XML, leading to wrong interpretation of its content. This sanitizer searches for and corrects such values; it should be applied to the raw string before it is fed to the fromstring() function of the element tree.
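The effect can be illustrated with a stdlib sketch; the regex below is a deliberately naive stand-in (it would also mangle genuine comma-separated lists, which the real sanitizer must avoid):

```python
import re
import xml.etree.ElementTree as ET

# A locale-broken value: "200,0" instead of "200.0".
raw = b'<Data><Value Unit="kV">200,0</Value></Data>'

# Naive illustration: turn a comma between digits into a dot.
fixed = re.sub(rb"(?<=\d),(?=\d)", b".", raw)

root = ET.fromstring(fixed)
print(float(root.find("Value").text))  # 200.0
```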

rsciio.utils.xml.xml2dtb(et, dictree)#

Convert XML ElementTree node to DTBox object. This is a recursive function that traverses the XML tree and populates the DTBox object with the data from the XML node.

Parameters:
et : xml.etree.ElementTree.Element

XML node to be converted.

dictree : DTBox

Box object to be populated.
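A simplified analogue of this recursion, using a plain dict instead of a DTBox and ignoring attributes and repeated tags for brevity (a sketch, not the actual implementation):

```python
import xml.etree.ElementTree as ET

def xml_to_plain_dict(node):
    # Leaf nodes contribute their text; inner nodes recurse over children.
    if len(node) == 0:
        return node.text
    return {child.tag: xml_to_plain_dict(child) for child in node}

root = ET.fromstring("<a><b><c>1</c></b><d>x</d></a>")
print(xml_to_plain_dict(root))  # {'b': {'c': '1'}, 'd': 'x'}
```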