rsciio.utils#
RosettaSciIO provides utility functions that are applicable to multiple formats, e.g. for the HDF5 format, on which a number of plugins are based.
- `rsciio.utils.file`: Utility functions for file handling.
- `rsciio.utils.hdf5`: Utility functions for HDF5 file inspection.
- `rsciio.utils.path`: Utility functions for path handling.
- `rsciio.utils.rgb`: Utility functions for RGB array handling.
- `rsciio.utils.xml`: Utility functions for XML handling.
File#
- `get_file_handle`: Return the file handle of a dask array when possible.
- `inspect_npy_bytes`: Inspect a .npy byte stream to extract metadata such as data offset, shape, and dtype.
- `memmap_distributed`: Drop-in replacement for `numpy.memmap` allowing for distributed loading of data.

Utility functions for file handling.
- rsciio.utils.file.get_file_handle(data, warn=True)#
Return the file handle of a dask array when possible. Currently only hdf5 and tiff files are supported.
- Parameters:
  - data (dask.array.Array): The dask array from which the file handle will be retrieved.
  - warn (bool): Whether to warn or not when the file handle cannot be retrieved. Default is True.
- Returns:
  - File handle or None: The file handle of the file when possible.
- rsciio.utils.file.inspect_npy_bytes(file_: IO[bytes]) -> tuple[int, tuple[int, ...], str]#
Inspect a .npy byte stream to extract metadata such as the data offset, shape, and dtype.

Warning: after calling this function, the file_ stream position is advanced to the start of the data.

- Parameters:
  - file_ (file handle): The .npy byte stream to inspect.
- Returns:
  - tuple: A tuple containing:
    - offset (int): The byte offset where the data starts.
    - shape (tuple): The shape of the array stored in the file.
    - dtype (str): The data type of the array elements.

Examples

>>> with open('example.npy', 'rb') as f:
...     offset, shape, dtype = inspect_npy_bytes(f)
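The version-1.0 .npy header layout that such an inspection relies on is simple enough to sketch with the standard library alone. The following is an illustrative re-implementation, not RosettaSciIO's actual code, and it handles version 1.x headers only: the magic string, a 2-byte little-endian header length, then a Python dict literal describing descr, fortran_order, and shape.

```python
import ast
import io
import struct

def inspect_npy_header(f):
    # Parse a version-1.x .npy header from a byte stream and return
    # (offset, shape, dtype_str). Like inspect_npy_bytes, this leaves
    # the stream positioned at the start of the raw data.
    if f.read(6) != b"\x93NUMPY":
        raise ValueError("not a .npy stream")
    major, minor = f.read(2)
    if major != 1:
        raise ValueError("this sketch only handles version 1.x headers")
    (header_len,) = struct.unpack("<H", f.read(2))
    header = ast.literal_eval(f.read(header_len).decode("latin1"))
    offset = 6 + 2 + 2 + header_len  # magic + version + length field + header
    return offset, header["shape"], header["descr"]

# Build a tiny .npy stream by hand: three little-endian int64 zeros.
header = "{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
raw = (b"\x93NUMPY\x01\x00"
       + struct.pack("<H", len(header))
       + header.encode("latin1")
       + b"\x00" * 24)
stream = io.BytesIO(raw)
offset, shape, dtype = inspect_npy_header(stream)
```

Note that, as with the real function, the stream is left positioned at the first data byte after the call.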
- rsciio.utils.file.memmap_distributed(filename, dtype, positions=None, offset=0, shape=None, order='C', chunks='auto', block_size_limit=None, key=None)#
Drop-in replacement for numpy.memmap allowing for distributed loading of data.

This always loads the data using dask, which can be beneficial in many cases but may not be ideal in others. The chunks and block_size_limit arguments describe an ideal chunk shape and size, as defined using the dask.array.core.normalize_chunks() function.

- Parameters:
  - filename (str): Path to the file.
  - dtype (numpy.dtype): Data type of the data for the memmap function.
  - positions (array_like, optional): A numpy array in the form [[x1, y1], [x2, y2], ...] where x, y map the frame to the real-space coordinate of the data. The default is None.
  - offset (int, optional): Offset in bytes. The default is 0.
  - shape (tuple, optional): Shape of the data to be read. The default is None.
  - order (str, optional): Order of the data. The default is "C"; see numpy.memmap for more details.
  - chunks (tuple, optional): Chunk shape. The default is "auto".
  - block_size_limit (int, optional): Maximum size of a block in bytes. The default is None.
  - key (None, str): For structured dtype only. Specify the key of the structured dtype to use.
- Returns:
  - dask.array.Array: Dask array of the data from the memmapped file, with the specified chunks.

Notes

Currently dask.array.map_blocks() does not allow multiple outputs. As a result, in case of a structured dtype, the key of the structured dtype needs to be specified. For example, with dtype = (("data", int, (128, 128)), ("sec", "<u4", 512)), either "data" or "sec" will need to be specified.
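The structured-dtype situation in the note can be illustrated with a plain numpy.memmap. This is only a sketch of the data layout, it does not use dask or memmap_distributed itself, and the record sizes are shrunk for brevity:

```python
import os
import tempfile
import numpy as np

# A record holding a small "frame" plus a per-frame counter, echoing the
# ("data", ...), ("sec", ...) structured dtype from the note above.
record = np.dtype([("data", np.int32, (4, 4)), ("sec", "<u4")])

path = os.path.join(tempfile.mkdtemp(), "frames.raw")
frames = np.zeros(5, dtype=record)
frames["sec"] = np.arange(5)
frames.tofile(path)

# Memory-map the same layout; selecting a single field here mirrors what
# the `key` argument selects when the dtype is structured.
mm = np.memmap(path, dtype=record, mode="r", shape=(5,))
data_only = mm["data"]  # one 4x4 frame per record
sec_only = mm["sec"]    # one counter per record
```

Because only one field can be returned as a single dask array, memmap_distributed asks for the key up front rather than returning both fields at once.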
HDF5#
- `list_datasets_in_file`: Read from a NeXus or .hdf file and return a list of the dataset paths.
- `read_metadata_from_file`: Read the metadata from a NeXus or .hdf file.

Utility functions for HDF5 file inspection.
- rsciio.utils.hdf5.list_datasets_in_file(filename, dataset_key=None, hardlinks_only=False, verbose=True)#
Read from a NeXus or .hdf file and return a list of the dataset paths.

This method is used to inspect the contents of an hdf5 file. It iterates through the group attributes and returns NXdata or hdf datasets of size >= 2 if they are not already NXdata blocks, returning a list of the entries. This is a convenience method to list the datasets present in a file rather than loading them all as signals.

- Parameters:
  - filename (str, pathlib.Path): Filename of the file to read or corresponding pathlib.Path.
  - dataset_key (str, list of str, None, default=None): If a str or list of strings is provided, only return items whose path contains the strings. For example, dataset_key = ["instrument", "Fe"] will only return hdf entries with "instrument" or "Fe" somewhere in their hdf path.
  - hardlinks_only (bool, default=False): If True, any links (soft or external) will be ignored when loading.
  - verbose (bool, default=True): Prints the results to screen.
- Returns:
  - list: List of paths to datasets.

See also

rsciio.utils.hdf5.read_metadata_from_file: Convenience function to read metadata present in a file.
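A rough idea of what such an inspection does can be sketched with h5py directly (assuming h5py is available; this simplified version only filters on dataset size and path substrings, with none of the NXdata handling of the real function):

```python
import os
import tempfile
import h5py
import numpy as np

def list_datasets(filename, dataset_key=None):
    # Return paths of datasets with at least 2 elements, optionally
    # filtered on path substrings (loose sketch of list_datasets_in_file).
    if isinstance(dataset_key, str):
        dataset_key = [dataset_key]
    found = []

    def visitor(path, node):
        if isinstance(node, h5py.Dataset) and node.size >= 2:
            if dataset_key is None or any(k in path for k in dataset_key):
                found.append(path)

    with h5py.File(filename, "r") as f:
        f.visititems(visitor)
    return found

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("entry/instrument/Fe_map", data=np.ones((4, 4)))
    f.create_dataset("entry/scalar", data=1.0)  # size 1, skipped
```

Here `list_datasets(path, dataset_key="Fe")` would return only the paths containing "Fe", mirroring the dataset_key filter described above.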
- rsciio.utils.hdf5.read_metadata_from_file(filename, lazy=False, metadata_key=None, verbose=False, skip_array_metadata=False)#
Read the metadata from a NeXus or .hdf file.

This method iterates through the hdf5 file and returns a dictionary of the entries. This is a convenience method to inspect a file for a value rather than loading the file as a signal.

- Parameters:
  - filename (str, pathlib.Path): Filename of the file to read or corresponding pathlib.Path.
  - lazy (bool, default=False): Whether to open the file lazily or not. The file will stay open until closed in compute() or closed manually. get_file_handle() can be used to access the file handle and close it manually.
  - metadata_key (None, str, list of str, default=None): None will return all datasets found, including linked data. Providing a string or list of strings will only return items which contain the string(s). For example, metadata_key = ["instrument", "Fe"] will return hdf entries with "instrument" or "Fe" in their hdf path.
  - verbose (bool, default=False): Pretty print the results to screen.
  - skip_array_metadata (bool, default=False): Whether to skip loading array metadata. This is useful as many large arrays may be present in the metadata, and they are redundant with the dataset itself.
- Returns:
  - dict: Metadata dictionary.

See also

rsciio.utils.hdf5.list_datasets_in_file: Convenience function to list the datasets present in a file.
Path#
- `append2pathname`: Append a string to a path name.
- `ensure_directory`: Check if the path exists and, if it does not, create the directory.
- `incremental_filename`: If a file with the same file name exists, return a new filename that does not exist.
- `overwrite`: If the file 'filename' exists, ask for overwriting and return True or False; else return True.

Utility functions for path handling.
- rsciio.utils.path.append2pathname(filename, to_append)#
Append a string to a path name.
- Parameters:
  - filename (str or pathlib.Path): The path name to append to.
  - to_append (str): The string to append.
- Returns:
  - pathlib.Path: The new file name with the appended string.
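As an illustration of the behaviour (a pathlib sketch, not the library's code; the helper name is made up here), the string is inserted between the file stem and its suffix:

```python
from pathlib import Path

def append_to_stem(filename, to_append):
    # Sketch: insert the string between the file stem and its suffix,
    # e.g. "scan.hdf5" + "-copy" -> "scan-copy.hdf5".
    p = Path(filename)
    return p.with_name(p.stem + to_append + p.suffix)

result = append_to_stem("data/scan.hdf5", "-copy")
```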
- rsciio.utils.path.ensure_directory(path)#
Check if the path exists and, if it does not, create the directory.
- Parameters:
  - path (str or pathlib.Path): The path to check and create if it does not exist.
- rsciio.utils.path.incremental_filename(filename, i=1)#
If a file with the same file name exists, return a new filename that does not exist.

The new file name is created by appending -n (where n is an integer) to the path name.
- Parameters:
  - filename (str or pathlib.Path): The file name to check.
  - i (int, default=1): The number to start incrementing from.
- Returns:
  - pathlib.Path: The new file name with the appended number.
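The appending logic can be sketched with pathlib alone (an illustrative loop under the assumption that taken names are probed in order, not the library's implementation):

```python
import tempfile
from pathlib import Path

def next_free_filename(filename, i=1):
    # Sketch of the incremental_filename idea: append -i to the stem
    # and keep incrementing i until the candidate does not exist.
    p = Path(filename)
    while True:
        candidate = p.with_name(f"{p.stem}-{i}{p.suffix}")
        if not candidate.exists():
            return candidate
        i += 1

folder = Path(tempfile.mkdtemp())
(folder / "scan-1.npy").touch()  # "-1" is already taken
new_name = next_free_filename(folder / "scan.npy")
```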
- rsciio.utils.path.overwrite(filename)#
If the file 'filename' exists, ask for overwriting and return True or False; else return True.
- Parameters:
  - filename (str or pathlib.Path): File to check for overwriting.
- Returns:
  - bool: Whether to overwrite the file.
RGB#
- `is_rgb`: Check if the array is an RGB structured numpy array.
- `is_rgba`: Check if the array is an RGBA structured numpy array.
- `is_rgbx`: Check if the array is an RGB or RGBA structured numpy array.
- `regular_array2rgbx`: Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.
- `rgbx2regular_array`: Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.
- `RGB_DTYPES`: Mapping of RGB color space names to their corresponding numpy structured dtypes.

Utility functions for RGB array handling.
- rsciio.utils.rgb.RGB_DTYPES#
Mapping of RGB color space names to their corresponding numpy structured dtypes.
- rsciio.utils.rgb.is_rgb(array)#
Check if the array is an RGB structured numpy array.
- Parameters:
  - array (numpy.ndarray): The array to check.
- Returns:
  - bool: True if the array is RGB, False otherwise.
- rsciio.utils.rgb.is_rgba(array)#
Check if the array is an RGBA structured numpy array.
- Parameters:
  - array (numpy.ndarray): The array to check.
- Returns:
  - bool: True if the array is RGBA, False otherwise.
- rsciio.utils.rgb.is_rgbx(array)#
Check if the array is an RGB or RGBA structured numpy array.
- Parameters:
  - array (numpy.ndarray): The array to check.
- Returns:
  - bool: True if the array is RGB or RGBA, False otherwise.
- rsciio.utils.rgb.regular_array2rgbx(data)#
Transform a regular numpy array with an additional dimension for the color channel into an RGBx structured numpy array.
- Parameters:
  - data (numpy.ndarray or dask.array.Array): The regular array to be transformed.
- Returns:
  - numpy.ndarray or dask.array.Array: The transformed RGBx structured array.
- rsciio.utils.rgb.rgbx2regular_array(data, plot_friendly=False, show_progressbar=True)#
Transform an RGBx structured numpy array into a standard one with an additional dimension for the color channel.
- Parameters:
  - data (numpy.ndarray or dask.array.Array): The RGB array to be transformed.
  - plot_friendly (bool): If True, change the dtype to float when the dtype is not uint8, and normalize the array so that it is ready to be plotted by matplotlib.
  - show_progressbar (bool, default=True): Whether to show the progressbar or not.
- Returns:
  - numpy.ndarray or dask.array.Array: The transformed array with an additional dimension for the color channel.
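Both conversions boil down to numpy dtype views; the following minimal sketch covers 8-bit RGB only (the structured dtype here is an assumption standing in for an rgb8-style entry of RGB_DTYPES, and this is not the library's implementation):

```python
import numpy as np

# 8-bit RGB structured dtype: three named uint8 fields per pixel.
rgb8 = np.dtype([("R", "u1"), ("G", "u1"), ("B", "u1")])

# Regular (..., 3) channel-axis array -> structured RGB view; viewing
# with a 3-byte dtype collapses the trailing channel axis.
regular = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3)
as_rgb = np.ascontiguousarray(regular).view(rgb8)[..., 0]  # shape (2, 2)

# Structured RGB -> back to a regular array with a channel axis.
back = np.ascontiguousarray(as_rgb).view(np.uint8).reshape(2, 2, 3)
```

Because these are views (plus a contiguity copy at most), no per-pixel loop is needed in either direction.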
XML#
- `XmlToDict`: Customisable XML to Python dict and list based hierarchical tree translator.
- `convert_xml_to_dict`: Convert an XML object to a DTBox object.
- `sanitize_msxml_float`: Replace commas with dots in floating point numbers in a given raw XML string.
- `xml2dtb`: Convert an XML ElementTree node to a DTBox object.

Utility functions for XML handling.
- class rsciio.utils.xml.XmlToDict(dub_attr_pre_str='@', dub_text_str='#value', tags_to_flatten=None, interchild_text_parsing='first')#
Customisable XML to Python dict and list based hierarchical tree translator.
- Parameters:
  - dub_attr_pre_str (str): String to be prepended to an attribute name when creating the dictionary tree, if a child element with the same name is present. Default is "@".
  - dub_text_str (str): String to use as the key in case an element contains both text and children tags. Default is "#value".
  - tags_to_flatten (None, str or list of str): Tag names which should be flattened/skipped, placing the children of such tags one level shallower in the constructed Python structure. This is useful when OEM-generated XML is not human designed but machine/framework generated and painfully verbose. See the example below. Default is None, which means no tags are flattened.
  - interchild_text_parsing (str): Must be one of ("skip", "first", "cat", "list"). This sets the behaviour when both .text and children tags are present under the same element tree node:
    - "skip": do not try to retrieve any .text values from such a node.
    - "first": only the string under the .text attribute will be returned.
    - "cat": return the concatenated string from the .text of the node and the .tail's of its children nodes.
    - "list": similar to "cat", but return the result as a list without concatenation.
    Default is "first", which is the most common case.

Examples

Consider such a redundant tree structure:

DetectorHeader
 |-ClassInstances
   |-ClassInstance
     |-Type
     |-Window
     ...

It can be sanitized/simplified by setting the tags_to_flatten keyword to ["ClassInstances", "ClassInstance"] to eliminate the redundant levels of the tree with those tag names:

DetectorHeader
 |-Type
 |-Window
 ...

The produced dict/list structures are then good enough to be returned as part of the original metadata without making any more copies.

Set up the parser:

>>> from rsciio.utils.xml import XmlToDict
>>> xml_to_dict = XmlToDict(
...     dub_attr_pre_str="XmlClass",
...     tags_to_flatten=[
...         "ClassInstance", "ChildrenClassInstance", "JustAnotherRedundantTag"
...     ]
... )

Use the parser:

>>> pytree = xml_to_dict.dictionarize(etree_node)
- dictionarize(et_node)#
Take an etree XML node and return its conversion into a pythonic dict/list representation of that XML tree, with some sanitization.
- Parameters:
  - et_node (xml.etree.ElementTree.Element): XML node to be converted.
- Returns:
  - dict: Dictionary representation of the XML node.
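The core idea, including tag flattening, can be sketched with a small recursive stdlib-only converter (illustrative only: it handles attributes with an "@" prefix and flattened tags, but none of the text-parsing modes or duplicate handling of the real class):

```python
import xml.etree.ElementTree as ET

def dictionarize(node, tags_to_flatten=()):
    # Recursively convert an Element to a dict; attributes get an "@"
    # prefix, and children of flattened tags are hoisted one level up.
    out = {}
    for name, value in node.attrib.items():
        out["@" + name] = value
    for child in node:
        converted = dictionarize(child, tags_to_flatten)
        if child.tag in tags_to_flatten and isinstance(converted, dict):
            out.update(converted)
        else:
            out[child.tag] = converted
    if not out:  # leaf node: just return its text
        return (node.text or "").strip()
    return out

xml = """<DetectorHeader>
  <ClassInstance><Type>SDD</Type><Window>Be</Window></ClassInstance>
</DetectorHeader>"""
tree = dictionarize(ET.fromstring(xml), tags_to_flatten=("ClassInstance",))
```

With "ClassInstance" flattened, Type and Window land directly under the top-level dict, as in the DetectorHeader example above.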
- static eval(string)#
Interpret any string and return it cast to a Python object of the appropriate type.

Notes

If this does not return the desired type, consider subclassing and reimplementing this method like this:

class SubclassedXmlToDict(XmlToDict):
    @staticmethod
    def eval(string):
        if condition:  # check to catch the special case
            ...
        elif ...:
            ...
        else:
            return XmlToDict.eval(string)
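A concrete (and deliberately simple) example of the kind of casting such an eval performs; this is a standalone sketch with an assumed precedence order, not the actual implementation:

```python
def eval_string(string):
    # Try progressively stricter interpretations of the string:
    # booleans first, then int, then float, falling back to str.
    lowered = string.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    for caster in (int, float):
        try:
            return caster(string)
        except ValueError:
            pass
    return string
```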
- rsciio.utils.xml.convert_xml_to_dict(xml_object)#
Convert an XML object to a DTBox object.
- Parameters:
  - xml_object (str or xml.etree.ElementTree.Element): XML object to be converted. It can be a string or an ElementTree node.
- Returns:
  - DTBox: A DTBox object containing the converted XML data.
- rsciio.utils.xml.sanitize_msxml_float(xml_b_string)#
Replace commas with dots in floating point numbers in the given raw XML string.
- Parameters:
  - xml_b_string (bytes): Raw binary string representing the XML to be parsed.
- Returns:
  - bytes: Binary string with commas used as decimal marks replaced with dots, to adhere to the XML standard.

Notes

When OEM software runs on MS Windows and directly uses the system built-in MSXML library, which does not comply with XML standards, and the OS locale uses a comma as the decimal separator, the software can produce non-interoperable XML, which leads to wrong interpretation of the content. This sanitizer searches for and corrects that; it should be applied to the raw string before it is fed to ElementTree's .fromstring.
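The fix can be sketched as a byte-level regular expression that only rewrites a comma sitting between two digits (an over-approximation of the real sanitizer, which may be more targeted):

```python
import re

def sanitize_decimal_commas(xml_bytes):
    # Replace a comma used as a decimal mark (digit,digit) with a dot,
    # leaving commas between non-digits untouched.
    return re.sub(rb"(?<=\d),(?=\d)", b".", xml_bytes)
```

The lookaround assertions keep the match zero-width on both sides, so only the comma itself is replaced.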
- rsciio.utils.xml.xml2dtb(et, dictree)#
Convert an XML ElementTree node to a DTBox object. This is a recursive function that traverses the XML tree and populates the DTBox object with the data from the XML node.
- Parameters:
  - et (xml.etree.ElementTree.Element): XML node to be converted.
  - dictree (DTBox): Box object to be populated.