hyperspy.learn.mlpca module

hyperspy.learn.mlpca.mlpca(X, varX, output_dimension, svd_solver='auto', tol=1e-10, max_iter=50000, **kwargs)

Performs maximum likelihood PCA with missing data and/or heteroskedastic noise.

Standard PCA based on a singular value decomposition (SVD) approach assumes that the data is corrupted with Gaussian, or homoskedastic noise. For many applications, this assumption does not hold. For example, count data from EDS-TEM experiments is corrupted by Poisson noise, where the noise variance depends on the underlying pixel value. Rather than scaling or transforming the data to approximately “normalize” the noise, MLPCA instead uses estimates of the data variance to perform the decomposition.

This function is a transcription of a MATLAB code obtained from [Andrews1997].

Read more in the User Guide.

Parameters:
  • X (numpy array, shape (m, n)) – Matrix of observations.

  • varX (numpy array) – Matrix of variances associated with X (zeros for missing measurements).

  • output_dimension (int) – The model dimensionality.

  • svd_solver ({"auto", "full", "arpack", "randomized"}, default "auto") –

    If auto:

    The solver is selected by a default policy based on data.shape and output_dimension: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient “randomized” method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.

    If full:

    run exact SVD, calling the standard LAPACK solver via scipy.linalg.svd(), and select the components by postprocessing

    If arpack:

    use truncated SVD, calling ARPACK solver via scipy.sparse.linalg.svds(). It requires strictly 0 < output_dimension < min(data.shape)

    If randomized:

    use truncated SVD, calling sklearn.utils.extmath.randomized_svd() to estimate a limited number of components

  • tol (float) – Tolerance of the stopping condition.

  • max_iter (int) – Maximum number of iterations before exiting without convergence.

Returns:

  • U, S, V (numpy array) – The pseudo-SVD parameters.

  • s_obj (float) – Value of the objective function.

References

[Andrews1997]

Darren T. Andrews and Peter D. Wentzell, “Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer”, Analytica Chimica Acta 350, no. 3 (September 19, 1997): 341-352.