NeXus data format#
Background#
NeXus is a common data format originally developed by the neutron and x-ray science communities. It is still being developed as an international standard by scientists and programmers representing major scientific facilities in order to facilitate greater cooperation in the analysis and visualization of data. NeXus uses a variety of classes to record data, values, units and other experimental metadata associated with an experiment. For specific types of experiments an Application Definition may exist, which defines an agreed common layout that facilities can adhere to.
NeXus metadata and data are stored in Hierarchical Data Format (HDF5) files with a .nxs extension, although standard HDF5 extensions are sometimes used.
Note
In HyperSpy, files must use the .nxs file extension in order to default to the NeXus loader. If your file has an HDF5 extension, you can also explicitly set the NeXus file reader:
# Load a NeXus file with a .h5 extension
>>> import hyperspy.api as hs
>>> s = hs.load("filename.h5", reader="nxs")
The loader will follow version 3 of the NeXus data rules. The signal type, Signal1D or Signal2D, will be inferred from the interpretation attribute in the NXdata description, if this is set to spectrum or image. If the interpretation attribute is not set, the loader will return a BaseSignal, which must then be converted to the appropriate signal type. Following the NeXus data rules, if a default dataset is not defined, the loader will load NXdata and HDF datasets according to the keyword options in the reader.
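If a BaseSignal is returned, one possible way to convert it is to set the signal axes explicitly. This is a minimal sketch using HyperSpy's generic transpose mechanism, not something specific to this reader, and the file name is hypothetical:
>>> import hyperspy.api as hs
>>> s = hs.load("sample.nxs")  # returns a BaseSignal if interpretation is not set
>>> s = s.transpose(signal_axes=1)  # reinterpret the last axis as the signal axis (Signal1D)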
A number of the NeXus examples from large facilities do not use NXdata or use older versions of the NeXus implementation. Data can still be loaded from these files, but information or associations may be missing. However, this missing information can be recovered from within the original_metadata, which contains the overall structure of the entry.
As the NeXus format uses HDF5 and needs to read both data and metadata structured in different ways, the loader is written to be quite flexible and can also be used to inspect any HDF5-based file.
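For a quick look at what a file contains before loading it, the HDF5 utility functions referenced further below can be used. A sketch, with a hypothetical file name:
>>> from rsciio.utils.hdf5 import list_datasets_in_file
>>> list_datasets_in_file("sample.nxs")  # list the NXdata sets and hdf datasets found in the file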
Differences with respect to HSpy#
The HyperSpy metadata structure stores arrays as hdf datasets without attributes, and stores floats, ints and strings as attributes. The NeXus format uses hdf dataset attributes to store additional information, such as the units for an axis or the NX_class which the dataset structure follows. The metadata, whether HyperSpy metadata or original_metadata, therefore needs to be able to represent both the value and the attributes of a dataset. To implement this structure, the value and attrs of a dataset can both be defined: the value of a dataset is set using a value key, and the attributes of a dataset are defined by an attrs key.
For example, to store an array called axis_x, with a units attribute, within original_metadata, the following structure would be used:
├──original_metadata
│ ├── axis_x
│ │ ├── value : array([1.0,2.0,3.0,4.0,5.0])
│ │ ├── attrs
│ │ │ ├── units : mm
>>> original_metadata.set_item("axis_x.value", [1.0, 2.0, 3.0, 4.0, 5.0])
>>> original_metadata.set_item("axis_x.attrs.units", "mm")
To access the axis information:
>>> original_metadata.axis_x.value
>>> original_metadata.axis_x.attrs.units
To modify the axis information:
>>> original_metadata.axis_x.value = [2.0,3.0,4.0,5.0,6.0]
>>> original_metadata.axis_x.attrs.units = "um"
To store data in a NeXus monochromator format, the value and attrs keys can define additional attributes:
├── monochromator
│ ├── energy
│ │ ├── value : 12.0
│ │ ├── attrs
│ │ │ ├── units : keV
│ │ │ ├── NX_class : NXmonochromator
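A sketch of how this structure could be built with set_item, following the pattern above (the values are those shown in the tree):
>>> original_metadata.set_item("monochromator.energy.value", 12.0)
>>> original_metadata.set_item("monochromator.energy.attrs.units", "keV")
>>> original_metadata.set_item("monochromator.energy.attrs.NX_class", "NXmonochromator")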
The attrs key can also be used to define NeXus structures and the relationships between data:
├── mydata
│ ├── attrs
│ │ ├── NX_class : "NXdata"
│ │ ├── axes : ["x","."]
│ ├── data
│ │ ├──value : [[30,23...110]
│ ├── x
│ │ ├──value : [1,2.....100]
│ │ ├── attrs
│ │ │ ├── unit : "mm"
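A sketch of how such an NXdata block could be assembled with set_item (the array values here are illustrative placeholders, not the elided data shown above):
>>> import numpy as np
>>> original_metadata.set_item("mydata.attrs.NX_class", "NXdata")
>>> original_metadata.set_item("mydata.attrs.axes", ["x", "."])
>>> original_metadata.set_item("mydata.data.value", np.random.random((10, 100)))  # placeholder data
>>> original_metadata.set_item("mydata.x.value", np.arange(100))  # placeholder axis values
>>> original_metadata.set_item("mydata.x.attrs.unit", "mm")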
The use of attrs or value to set values within the metadata is optional, and metadata values can also be set, read or modified in the normal way.
>>> original_metadata.monochromator.energy = 12.5
HyperSpy metadata is stored within the NeXus file and should be automatically restored when a signal is loaded from a previously saved NeXus file.
Note
Altering the standard metadata structure of a signal using attrs or value keywords is not recommended.
Also see the HDF5 utility functions for inspecting HDF5 files.
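For example, the metadata tree of a file can be inspected without loading the data. A sketch, with a hypothetical file name:
>>> from rsciio.utils.hdf5 import read_metadata_from_file
>>> read_metadata_from_file("sample.nxs")  # returns the metadata found in the file as a dictionary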
API functions#
- rsciio.nexus.file_reader(filename, lazy=False, dataset_key=None, dataset_path=None, metadata_key=None, skip_array_metadata=False, nxdata_only=False, hardlinks_only=False, use_default=False, mapping=None)#
Read NXdata class or hdf datasets from a file and return signal(s).
- Parameters:
  - filename : str, pathlib.Path
    Filename of the file to read or corresponding pathlib.Path.
  - lazy : bool, default=False
    Whether to open the file lazily or not.
  - dataset_key : None, str, list of str, default=None
    If None, all datasets are returned. If a string or list of strings is provided, only items whose path contains the string(s) are returned. For example, dataset_key=["instrument", "Fe"] will return data entries with "instrument" or "Fe" in their hdf path.
  - dataset_path : None, str, list of str, default=None
    If None, no absolute path is searched. If a string or list of strings is provided, items with the specified absolute paths are returned. For example, dataset_path=["/data/spectrum/Mn"] returns the exact dataset with this path. It is not filtered by dataset_key, i.e. with dataset_key=["Fe"] it still returns the specific dataset at "/data/spectrum/Mn". The result is empty if no dataset matching the absolute path is present.
  - metadata_key : None, str, list of str, default=None
    Only return items from the original metadata whose path contains the string(s), e.g. metadata_key=["instrument", "Fe"] will return all metadata entries with "instrument" or "Fe" in their hdf path.
  - skip_array_metadata : bool, default=False
    Whether to skip loading metadata with an array entry. This is useful as metadata may contain large arrays that are redundant with the data.
  - nxdata_only : bool, default=False
    If True, only NXdata will be converted into signals; if False, NXdata and any hdf datasets will be loaded as signals.
  - hardlinks_only : bool, default=False
    If True, any links (soft or external) will be ignored when loading.
  - use_default : bool, default=False
    If True and a default NXdata is defined in the file, load this as a signal; the other keyword options are then ignored. If True and no default is defined, the file will be loaded according to the keyword options.
  - mapping : None or dict
    Define the mapping from the original metadata to the returned metadata.
- Returns:
  list of dict
    List of dictionaries containing the following fields:
    - 'data' – multidimensional numpy.ndarray or dask.array.Array
    - 'axes' – list of dictionaries describing the axes, containing the fields 'name', 'units', 'index_in_array', and either 'size', 'offset', and 'scale' or a numpy array 'axis' containing the full axes vector
    - 'metadata' – dictionary containing the parsed metadata
    - 'original_metadata' – dictionary containing the full metadata tree from the input file
See also
rsciio.utils.hdf5.list_datasets_in_file
Convenience function to list datasets present in a file.
rsciio.utils.hdf5.read_metadata_from_file
Convenience function to read metadata present in a file.
Notes
Loading all datasets can result in a large number of signals. Please review your datasets and use dataset_key to target the datasets of interest. "keys" is a special keyword and is prepended with "fix" in the metadata structure to avoid any issues.
Datasets are all arrays with size > 2 (arrays, lists).
- rsciio.nexus.file_writer(filename, signal, save_original_metadata=True, skip_metadata_key=None, use_default=False, compression='gzip')#
Write the signal and metadata as a NeXus file.
This will save the signal in NXdata format in the file. As the form of the metadata can vary and is not validated, it will be stored as an NXcollection (an unvalidated collection).
- Parameters:
  - filename : str, pathlib.Path
    Filename of the file to write to or corresponding pathlib.Path.
  - signal : dict
    Dictionary containing the signal object. Should contain the following fields:
    - 'data' – multidimensional numpy array
    - 'axes' – list of dictionaries describing the axes, containing the fields 'name', 'units', 'index_in_array', and either 'size', 'offset', and 'scale' or a numpy array 'axis' containing the full axes vector
    - 'metadata' – dictionary containing the metadata tree
  - save_original_metadata : bool, default=True
    Option to save hyperspy.original_metadata with the signal. A loaded NeXus file may contain a large amount of original metadata, which you may wish to omit on saving.
  - skip_metadata_key : str, list of str, default=None
    The key(s) to skip when saving original metadata. This is useful when some metadata keys should be ignored.
  - use_default : bool, default=False
    Option to define the default dataset in the file. If set to True, the signal or first signal in the list of signals will be defined as the default (following NeXus v3 data rules).
  - compression : None, 'gzip', 'szip', 'lzf', default='gzip'
    Disabling compression can significantly increase the saving speed; if file size is not an issue, it can be disabled by setting compression=None. RosettaSciIO uses h5py for reading and writing HDF5 files and, therefore, it supports all compression filters supported by h5py. The default is 'gzip'. Also see notes below.
See also
rsciio.utils.hdf5.list_datasets_in_file
Convenience function to list datasets present in a file.
rsciio.utils.hdf5.read_metadata_from_file
Convenience function to read metadata present in a file.
Notes
It is possible to enable other compression filters such as blosc by installing e.g. hdf5plugin. Similarly, the availability of 'szip' depends on the HDF5 installation. If not available, an error will be raised. Be aware that loading those files will require installing the package providing the compression filter, and it may thus not be possible to load them on some platforms. Only compression=None and compression='gzip' are available on all platforms. For more details, see the h5py documentation.
Reading examples#
NeXus files can contain multiple datasets within the same file, but the ordering of datasets can vary depending on the setup of an experiment or processing step when the data was collected. For example, in one experiment Fe, Ca, P, Pb were collected, but in the next experiment Ca, P, K, Fe, Pb were collected. RosettaSciIO supports reading one or more datasets and returns a list of signals, but in this example the indexing of the returned list would differ between the two experiments. To control which data or metadata is loaded, and in what order, some additional loading arguments are provided.
Note
Given that HDF5 files can accommodate very large datasets, setting lazy=True is strongly recommended if the content of the HDF5 file is not known a priori. This prevents issues with regard to loading datasets far larger than memory. Also note that setting lazy=True leaves the file handle to the HDF5 file open. In HyperSpy, it can be closed with _signals.lazy.LazySignal.close_file or when using _signals.lazy.LazySignal.compute with close_file=True.
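A minimal sketch of a lazy load followed by releasing the file handle (the file name is hypothetical):
>>> import hyperspy.api as hs
>>> s = hs.load("sample.nxs", lazy=True)
>>> s.compute(close_file=True)  # compute the result and close the HDF5 file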
We can load a specific dataset using the dataset_path
keyword argument.
Setting it to the absolute path of the desired dataset will cause
the single dataset to be loaded:
>>> from rsciio.nexus import file_reader
>>> # Loading a specific dataset
>>> file_reader("sample.nxs", dataset_path="/entry/experiment/EDS/data")
We can also choose to load datasets based on a search key using the dataset_key keyword argument. This can also be used to load NXdata that does not follow the default version 3 rules. Instead of an absolute path, a search string can be provided, and all datasets whose paths contain this string will be returned. The previous example could also be written as:
>>> # Loading datasets containing the string "EDS"
>>> file_reader("sample.nxs", dataset_key="EDS")
The difference between dataset_path
and dataset_key
is illustrated
here:
>>> # Only the dataset /entry/experiment/EDS/data will be loaded
>>> file_reader("sample.nxs", dataset_path="/entry/experiment/EDS/data")
>>> # All datasets whose path contains the string "/entry/experiment/EDS/data" will be loaded
>>> file_reader("sample.nxs", dataset_key="/entry/experiment/EDS/data")
Multiple datasets can be loaded by providing a number of keys:
>>> # Loading datasets containing "EDS", "Fe" or "Ca" in their path
>>> file_reader("sample.nxs", dataset_key=["EDS", "Fe", "Ca"])
Metadata can also be filtered in the same way using metadata_key
:
>>> # Load data with metadata matching metadata_key
>>> file_reader("sample.nxs", metadata_key="entry/instrument")
Note
The NeXus loader removes any NXdata blocks from the metadata.
Metadata that are arrays can be skipped by using skip_array_metadata
:
>>> # Load data while skipping metadata that are arrays
>>> file_reader("sample.nxs", skip_array_metadata=True)
Note
NeXus files also support parameters or dimensions that have been varied non-linearly. Since the reading of non-uniform axes is not yet implemented for the NeXus plugin, such non-linear information would be lost in the axes manager and replaced with indices.
Note
NeXus and HDF files can result in large metadata structures, with large datasets within the loaded original_metadata. If lazy loading is used this may not be a concern, but care must be taken when saving the data. To control whether large datasets are loaded or saved, use metadata_key to load only the most relevant information. Alternatively, set skip_array_metadata to True to avoid loading those large datasets into original_metadata.
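A sketch combining these options to keep the loaded metadata small (the file name and key are hypothetical):
>>> # Load lazily, keep only instrument metadata and skip array-valued metadata
>>> file_reader("sample.nxs", lazy=True, metadata_key="instrument", skip_array_metadata=True)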
Writing examples#
Using the file_writer function will store the NeXus file with the following structure:
├── entry1
│ ├── signal_name
│ │ ├── auxiliary
│ │ │ ├── original_metadata
│ │ │ ├── hyperspy_metadata
│ │ │ ├── learning_results
│ │ ├── signal_data
│ │ │ ├── data and axes (NXdata format)
To save multiple signals, the file_writer
can be called directly passing a
list of signals.
>>> from rsciio.nexus import file_writer
>>> file_writer("test.nxs",[signal1,signal2])
When saving multiple signals, a default signal can be defined. This can be used when storing associated data or processing steps along with a final result. All signals can be saved but a single signal can be marked as the default for easier loading using RosettaSciIO or plotting with NeXus tools. The default signal is selected as the first signal in the list:
>>> from rsciio.nexus import file_writer
>>> file_writer("test.nxs", [signal1, signal2], use_default = True)
The output will be arranged by signal name:
├── entry1 (NXentry)
│ ├── signal_name (NXentry)
│ │ ├── auxiliary (NXentry)
│ │ │ ├── original_metadata (NXcollection)
│ │ │ ├── hyperspy_metadata (NXcollection)
│ │ │ ├── learning_results (NXcollection)
│ │ ├── signal_data (NXdata format)
│ │ │ ├── data and axes
├── entry2 (NXentry)
│ ├── signal_name (NXentry)
│ │ ├── auxiliary (NXentry)
│ │ │ ├── original_metadata (NXcollection)
│ │ │ ├── hyperspy_metadata (NXcollection)
│ │ │ ├── learning_results (NXcollection)
│ │ ├── signal_data (NXdata)
│ │ │ ├── data and axes
Note
Signals saved as .nxs
by this plugin can be loaded normally in HyperSpy
and the original_metadata, signal data, axes, metadata and learning_results
will be restored. Model information is not currently stored.
NeXus does not store how the data should be displayed.
To preserve the signal details an additional navigation attribute
is added to each axis to indicate if it is a navigation axis.
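Given an existing HyperSpy signal s, a minimal sketch of such a round trip (the file name is hypothetical):
>>> import hyperspy.api as hs
>>> s.save("roundtrip.nxs")  # the .nxs extension selects this plugin
>>> s2 = hs.load("roundtrip.nxs")
>>> s2.metadata  # HyperSpy metadata restored
>>> s2.original_metadata  # original metadata restored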