NeXus data format

Note

To read this format, the optional dependency h5py is required.

Background

NeXus is a common data format originally developed by the neutron and X-ray science communities. It is still being developed as an international standard by scientists and programmers representing major scientific facilities in order to facilitate greater cooperation in the analysis and visualization of data. NeXus uses a variety of classes to record data, values, units and other experimental metadata associated with an experiment. For specific types of experiments an Application Definition may exist, which defines an agreed common layout that facilities can adhere to.

NeXus metadata and data are stored in Hierarchical Data Format (HDF5) files with a .nxs extension, although standard HDF5 extensions are sometimes used.

Note

In HyperSpy, files must use the .nxs file extension in order to default to the NeXus loader. If your file has an HDF5 extension, you can also explicitly set the NeXus file reader:

# Load a NeXus file with a .h5 extension
>>> import hyperspy.api as hs
>>> s = hs.load("filename.h5", reader="nxs")

The loader follows version 3 of the NeXus data rules. The signal type, Signal1D or Signal2D, is inferred from the interpretation attribute in the NXdata description if it is set to spectrum or image. If the interpretation attribute is not set, the loader returns a BaseSignal, which must then be converted to the appropriate signal type. Following the NeXus data rules, if a default dataset is not defined, the loader will load NXdata and HDF datasets according to the keyword options in the reader. A number of the NeXus examples from large facilities do not use NXdata or use older versions of the NeXus implementation. Data can still be loaded from these files, but information or associations may be missing. This missing information can, however, be recovered from the original_metadata, which contains the overall structure of the entry.
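
The interpretation rule described above can be summarised in a few lines of plain Python. This is only a sketch of the documented behaviour, not the loader's code; the function name is hypothetical and the signal classes are represented by their names as strings.

```python
def infer_signal_class(interpretation=None):
    """Hypothetical helper: map an NXdata interpretation to a signal class name."""
    mapping = {"spectrum": "Signal1D", "image": "Signal2D"}
    # Any other value, or a missing attribute, falls back to BaseSignal.
    return mapping.get(interpretation, "BaseSignal")


for value in ("spectrum", "image", None):
    print(value, "->", infer_signal_class(value))
```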

As the NeXus format is built on HDF5 and needs to read both data and metadata structured in different ways, the loader is written to be quite flexible and can also be used to inspect any HDF5-based file.

Differences with respect to HSpy

The HyperSpy metadata structure stores arrays as hdf datasets without attributes and stores floats, ints and strings as attributes. The NeXus format uses hdf dataset attributes to store additional information such as an indication of the units for an axis or the NX_class which the dataset structure follows. The metadata, HyperSpy or original_metadata, therefore needs to be able to indicate the values and attributes of a dataset. To implement this structure the value and attrs of a dataset can also be defined. The value of a dataset is set using a value key. The attributes of a dataset are defined by an attrs key.

For example, to store an array called axis_x, with a units attribute within original_metadata, the following structure would be used:

├──original_metadata
│   ├── axis_x
│   │   ├── value : array([1.0,2.0,3.0,4.0,5.0])
│   │   ├── attrs
│   │   │   ├── units : mm

>>> original_metadata.set_item("axis_x.value", [1.0, 2.0, 3.0, 4.0, 5.0])
>>> original_metadata.set_item("axis_x.attrs.units", "mm")

To access the axis information:

>>> original_metadata.axis_x.value
>>> original_metadata.axis_x.attrs.units

To modify the axis information:

>>> original_metadata.axis_x.value = [2.0,3.0,4.0,5.0,6.0]
>>> original_metadata.axis_x.attrs.units = "um"
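
As a rough illustration of the same layout in plain Python, the sketch below builds the tree as a nested dictionary and reads it back. The helper get_item is hypothetical and only mimics dotted-path access on a dictionary; it is not part of HyperSpy or RosettaSciIO.

```python
# Plain-dictionary sketch of the value/attrs layout shown above.
original_metadata = {
    "axis_x": {
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
        "attrs": {"units": "mm"},
    }
}


def get_item(tree, dotted_path):
    """Hypothetical helper: follow a dotted path through nested dicts."""
    node = tree
    for key in dotted_path.split("."):
        node = node[key]
    return node


axis_values = get_item(original_metadata, "axis_x.value")
axis_units = get_item(original_metadata, "axis_x.attrs.units")
print(axis_values, axis_units)
```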

To store data in a NeXus monochromator format, value and attrs keys can define additional attributes:

├── monochromator
│   ├── energy
│   │   ├── value : 12.0
│   │   ├── attrs
│   │   │   ├── units : keV
│   │   │   ├── NX_class : NXmonochromator

The attrs key can also be used to define NeXus structures that describe the relationships between data:

├── mydata
│   ├── attrs
│   │   ├── NX_class : "NXdata"
│   │   ├── axes : ["x","."]
│   ├── data
│   │   ├──value : [[30,23...110]
│   ├── x
│   │   ├──value : [1,2.....100]
│   │   ├── attrs
│   │   │   ├── units : "mm"

The use of attrs or value to set values within the metadata is optional and metadata values can also be set, read or modified in the normal way.

>>> original_metadata.monochromator.energy = 12.5

HyperSpy metadata is stored within the NeXus file and should be automatically restored when a signal is loaded from a previously saved NeXus file.

Note

Altering the standard metadata structure of a signal using attrs or value keywords is not recommended.

Also see the HDF5 utility functions for inspecting HDF5 files.

API functions

rsciio.nexus.file_reader(filename, lazy=False, dataset_key=None, dataset_path=None, metadata_key=None, skip_array_metadata=False, nxdata_only=False, hardlinks_only=False, use_default=False, mapping=None)

Read NXdata class or hdf datasets from a file and return signal(s).

Parameters:
filename : str, pathlib.Path

Filename of the file to read or corresponding pathlib.Path.

lazy : bool, default=False

Whether to open the file lazily or not.

dataset_key : None, str, list of str, default=None

If None, all datasets are returned. If a string or list of strings is provided, only items whose path contains the string(s) are returned. For example, dataset_key = ["instrument", "Fe"] will return data entries with instrument or Fe in their HDF path.

dataset_path : None, str, list of str, default=None

If None, no absolute path is searched. If a string or list of strings is provided, items with the specified absolute paths are returned. For example, dataset_path = ['/data/spectrum/Mn'] returns the exact dataset with this path. This selection is not filtered by dataset_key, i.e. with dataset_key = ['Fe'], the specific dataset at '/data/spectrum/Mn' is still returned. The result is empty if no dataset matches the absolute path provided.

metadata_key : None, str, list of str, default=None

Only return items from the original metadata whose path contains the string(s), e.g. metadata_key = ["instrument", "Fe"] will return all metadata entries with "instrument" or "Fe" in their HDF path.

skip_array_metadata : bool, default=False

Whether to skip loading metadata with an array entry. This is useful as metadata may contain large arrays that are redundant with the data.

nxdata_only : bool, default=False

If True, only NXdata will be converted into a signal. If False, NXdata and any HDF datasets will be loaded as signals.

hardlinks_only : bool, default=False

If True, any links (soft or external) will be ignored when loading.

use_default : bool, default=False

If True and a default NXdata is defined in the file, load this as a signal. This will ignore the other keyword options. If True and no default is defined, the file will be loaded according to the keyword options.

mapping : None or dict

Define the mapping from the original metadata to the returned metadata.

Returns:
list of dict

List of dictionaries containing the following fields:

  • ‘data’ – multidimensional numpy.ndarray or dask.array.Array

  • ‘axes’ – list of dictionaries describing the axes containing the fields ‘name’, ‘units’, ‘index_in_array’, and either ‘size’, ‘offset’, and ‘scale’ or a numpy array ‘axis’ containing the full axes vector

  • ‘metadata’ – dictionary containing the parsed metadata

  • ‘original_metadata’ – dictionary containing the full metadata tree from the input file

When the file contains several datasets, each dataset will be loaded as a separate dictionary.

See also

rsciio.utils.hdf5.list_datasets_in_file

Convenience function to list datasets present in a file.

rsciio.utils.hdf5.read_metadata_from_file

Convenience function to read metadata present in a file.

Notes

Loading all datasets can result in a large number of signals. Please review your datasets and use dataset_key to target the datasets of interest. "keys" is a special keyword and is prepended with "fix" in the metadata structure to avoid any issues.

Datasets are all arrays with size>2 (arrays, lists)
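
The 'axes' entries returned by file_reader describe an axis either by 'size', 'offset' and 'scale' or by a full 'axis' vector. As a minimal sketch, assuming the usual convention that the value at index i of a uniform axis is offset + i * scale, the hypothetical helper below expands either form into a list of values. It uses plain lists rather than numpy arrays to stay self-contained, and the example dictionary is made up.

```python
def axis_vector(axis):
    """Hypothetical helper: return the full list of axis values."""
    if "axis" in axis:
        # Full-vector description: the values are stored directly.
        return list(axis["axis"])
    # Uniform description: value at index i is offset + i * scale.
    return [axis["offset"] + i * axis["scale"] for i in range(axis["size"])]


uniform = {"name": "x", "units": "mm", "index_in_array": 0,
           "size": 4, "offset": 1.0, "scale": 0.5}
print(axis_vector(uniform))  # [1.0, 1.5, 2.0, 2.5]
```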

rsciio.nexus.file_writer(filename, signal, save_original_metadata=True, skip_metadata_key=None, use_default=False, compression='gzip')

Write the signal and metadata as a NeXus file.

This will save the signal in NXdata format in the file. As the form of the metadata can vary and is not validated, it will be stored as an NXcollection (an unvalidated collection).

Parameters:
filename : str, pathlib.Path

Filename of the file to write to or corresponding pathlib.Path.

signal : dict

Dictionary containing the signal object. Should contain the following fields:

  • ‘data’ – multidimensional numpy array

  • ‘axes’ – list of dictionaries describing the axes containing the fields ‘name’, ‘units’, ‘index_in_array’, and either ‘size’, ‘offset’, and ‘scale’ or a numpy array ‘axis’ containing the full axes vector

  • ‘metadata’ – dictionary containing the metadata tree

save_original_metadata : bool, default=True

Option to save hyperspy.original_metadata with the signal. A loaded NeXus file may contain a large amount of data in its original metadata, which you may wish to omit on saving.

skip_metadata_key : str or list of str, default=None

The key(s) to skip when saving original metadata. This is useful when some metadata keys should be ignored.

use_default : bool, default=False

Option to define the default dataset in the file. If set to True the signal or first signal in the list of signals will be defined as the default (following NeXus v3 data rules).

compression : None, 'gzip', 'szip', 'lzf', default='gzip'

Compression can significantly slow down saving. If file size is not an issue, it can be disabled by setting compression=None. RosettaSciIO uses h5py for reading and writing HDF5 files and, therefore, it supports all compression filters supported by h5py. The default is 'gzip'. Also see notes below.

See also

rsciio.utils.hdf5.list_datasets_in_file

Convenience function to list datasets present in a file.

rsciio.utils.hdf5.read_metadata_from_file

Convenience function to read metadata present in a file.

Notes

It is possible to enable other compression filters such as blosc by installing e.g. hdf5plugin. Similarly, the availability of 'szip' depends on the HDF5 installation. If a filter is not available, an error will be raised. Be aware that loading those files will require installing the package providing the compression filter, so it may not be possible to load them on some platforms. Only compression=None and compression='gzip' are available on all platforms. For more details, see the h5py documentation.

Reading examples

NeXus files can contain multiple datasets within the same file, but the ordering of datasets can vary depending on the setup of an experiment or processing step when the data was collected. For example, in one experiment Fe, Ca, P, Pb were collected but in the next experiment Ca, P, K, Fe, Pb were collected. RosettaSciIO supports reading one or more datasets and returns a list of signals, but in this example the position of a given element in that list differs between the two experiments. To control which data or metadata are loaded, and in what order, some additional loading arguments are provided.

Note

Given that HDF5 files can accommodate very large datasets, setting lazy=True is strongly recommended if the content of the HDF5 file is not known a priori. This prevents issues with loading datasets far larger than memory.

Also note that setting lazy=True leaves the file handle to the HDF5 file open. In HyperSpy, it can be closed with _signals.lazy.LazySignal.close_file or when using _signals.lazy.LazySignal.compute with close_file=True.

We can load a specific dataset using the dataset_path keyword argument. Setting it to the absolute path of the desired dataset will cause the single dataset to be loaded:

>>> from rsciio.nexus import file_reader
>>> # Loading a specific dataset
>>> file_reader("sample.nxs", dataset_path="/entry/experiment/EDS/data")

We can also choose to load datasets based on a search key using the dataset_key keyword argument. This can also be used to load NXdata that falls outside the default version 3 rules. Instead of an absolute path, a string is provided, and all datasets whose path contains this string are returned. The previous example could also be written as:

>>> # Loading datasets containing the string "EDS"
>>> file_reader("sample.nxs", dataset_key="EDS")

The difference between dataset_path and dataset_key is illustrated here:

>>> # Only the dataset /entry/experiment/EDS/data will be loaded
>>> file_reader("sample.nxs", dataset_path="/entry/experiment/EDS/data")
>>> # All datasets whose path contains the entire string "/entry/experiment/EDS/data" will be loaded
>>> file_reader("sample.nxs", dataset_key="/entry/experiment/EDS/data")
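
The selection rules can be sketched in plain Python. This is only an illustration of the behaviour described above (substring matching for dataset_key, exact matching for dataset_path, with dataset_path not filtered by dataset_key), not RosettaSciIO's actual implementation. The function name and the example paths are made up.

```python
def select_datasets(paths, dataset_key=None, dataset_path=None):
    """Hypothetical sketch of how dataset_key and dataset_path pick paths."""
    keys = [dataset_key] if isinstance(dataset_key, str) else dataset_key
    exact = [dataset_path] if isinstance(dataset_path, str) else dataset_path

    selected = []
    for path in paths:
        if exact and path in exact:
            # Exact paths are returned regardless of dataset_key.
            selected.append(path)
        elif keys is None and exact is None:
            # No filter at all: every dataset is returned.
            selected.append(path)
        elif keys and any(key in path for key in keys):
            # Substring match anywhere in the path.
            selected.append(path)
    return selected


paths = [
    "/entry/experiment/EDS/data",
    "/entry/experiment/EDS/data_raw",
    "/entry/experiment/XRF/Fe",
]
by_path = select_datasets(paths, dataset_path="/entry/experiment/EDS/data")
by_key = select_datasets(paths, dataset_key="/entry/experiment/EDS/data")
print(by_path)  # only the exact path
print(by_key)   # every path containing the string
```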

Multiple datasets can be loaded by providing a number of keys:

>>> # Loading datasets matching any of several keys
>>> file_reader("sample.nxs", dataset_key=["EDS", "Fe", "Ca"])

Metadata can also be filtered in the same way using metadata_key:

>>> # Load data with metadata matching metadata_key
>>> file_reader("sample.nxs", metadata_key="entry/instrument")

Note

The NeXus loader removes any NXdata blocks from the metadata.

Metadata that are arrays can be skipped by using skip_array_metadata:

>>> # Load data while skipping metadata that are arrays
>>> file_reader("sample.nxs", skip_array_metadata=True)

Note

NeXus files also support parameters or dimensions that have been varied non-linearly. Since the reading of non-uniform axes is not yet implemented for the NeXus plugin, such non-linear information would be lost in the axes manager and replaced with indices.
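
To judge whether a particular axis would be affected, one option is to check whether its values are evenly spaced. The following is a minimal sketch with a hypothetical helper; the tolerances are arbitrary choices and the axis values are made up.

```python
import math


def is_uniform(values, rel_tol=1e-9):
    """Hypothetical check: True if consecutive values are evenly spaced."""
    if len(values) < 3:
        return True
    step = values[1] - values[0]
    # Compare every later step against the first one.
    return all(
        math.isclose(b - a, step, rel_tol=rel_tol, abs_tol=1e-12)
        for a, b in zip(values[1:], values[2:])
    )


linear_scan = [0.0, 0.5, 1.0, 1.5]
log_scan = [1.0, 10.0, 100.0, 1000.0]
print(is_uniform(linear_scan), is_uniform(log_scan))
```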

Note

NeXus and HDF can result in large metadata structures with large datasets within the loaded original_metadata. If lazy loading is used this may not be a concern but care must be taken when saving the data. To control whether large datasets are loaded or saved, use the metadata_key to load only the most relevant information. Alternatively, set skip_array_metadata to True to avoid loading those large datasets in original_metadata.

Writing examples

Using the file_writer function will store the NeXus file with the following structure:

├── entry1
│   ├── signal_name
│   │   ├── auxiliary
│   │   │   ├── original_metadata
│   │   │   ├── hyperspy_metadata
│   │   │   ├── learning_results
│   │   ├── signal_data
│   │   │   ├── data and axes (NXdata format)

To save multiple signals, the file_writer can be called directly passing a list of signals.

>>> from rsciio.nexus import file_writer
>>> file_writer("test.nxs", [signal1, signal2])

When saving multiple signals, a default signal can be defined. This can be used when storing associated data or processing steps along with a final result. All signals can be saved but a single signal can be marked as the default for easier loading using RosettaSciIO or plotting with NeXus tools. The default signal is selected as the first signal in the list:

>>> from rsciio.nexus import file_writer
>>> file_writer("test.nxs", [signal1, signal2], use_default=True)

The output will be arranged by signal name:

├── entry1 (NXentry)
│   ├── signal_name (NXentry)
│   │   ├── auxiliary (NXentry)
│   │   │   ├── original_metadata (NXcollection)
│   │   │   ├── hyperspy_metadata (NXcollection)
│   │   │   ├── learning_results (NXcollection)
│   │   ├── signal_data (NXdata)
│   │   │   ├── data and axes
├── entry2 (NXentry)
│   ├── signal_name (NXentry)
│   │   ├── auxiliary (NXentry)
│   │   │   ├── original_metadata (NXcollection)
│   │   │   ├── hyperspy_metadata (NXcollection)
│   │   │   ├── learning_results (NXcollection)
│   │   ├── signal_data (NXdata)
│   │   │   ├── data and axes
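
The layout above can be turned into a list of group paths, which may be handy when choosing a dataset_path for reloading. This is a sketch that assumes the tree maps directly onto HDF5 group paths, with one numbered entry per signal. The helper and the signal names are hypothetical, and the paths should be checked against a real file.

```python
AUXILIARY_GROUPS = ("original_metadata", "hyperspy_metadata", "learning_results")


def expected_groups(signal_names):
    """Hypothetical helper: list the group paths implied by the tree above."""
    groups = []
    for index, name in enumerate(signal_names, start=1):
        base = f"/entry{index}/{name}"
        groups.append(f"{base}/signal_data")
        for aux in AUXILIARY_GROUPS:
            groups.append(f"{base}/auxiliary/{aux}")
    return groups


for group in expected_groups(["EDS_map", "XRF_map"]):
    print(group)
```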

Note

Signals saved as .nxs by this plugin can be loaded normally in HyperSpy and the original_metadata, signal data, axes, metadata and learning_results will be restored. Model information is not currently stored. NeXus does not store how the data should be displayed. To preserve the signal details an additional navigation attribute is added to each axis to indicate if it is a navigation axis.