Fitting Schemes

class clease.regression.LinearRegression[source]

fit(X: ndarray, y: ndarray) → ndarray[source]

Fit a linear model by performing ordinary least squares

y = Xc

Parameters:

X – Design matrix (NxM)
y – Data points (vector of length N)
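
As an illustration of the relation y = Xc, the following is a minimal NumPy sketch of an ordinary least squares fit. It does not call CLEASE; the helper name `ols_fit` and the synthetic data are hypothetical and only serve to show the shapes of X (NxM) and y (length N).

```python
import numpy as np


def ols_fit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the coefficient vector c that minimizes ||X c - y||^2."""
    # lstsq returns (solution, residuals, rank, singular values)
    coeff, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeff


# Synthetic, noise-free data generated from y = 1 + 2*x
x = np.linspace(0.0, 1.0, 5)
X = np.column_stack([np.ones_like(x), x])  # design matrix with N=5, M=2
y = 1.0 + 2.0 * x

coeff = ols_fit(X, y)
```

Because the data are noise free, the fitted coefficients reproduce the intercept and slope used to generate y.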

class clease.regression.Tikhonov(alpha: float | ndarray = 1e-05, penalize_bias_term: bool = False, normalize: bool = True)[source]

Ridge regularization.

Parameters:

alpha –
regularization term
- float: A single regularization coefficient is used for all features.
  Tikhonov matrix is T = alpha * I (I = identity matrix).
- 1D array: Regularization coefficient is defined for each feature.
  Tikhonov matrix is T = diag(alpha) (the alpha values are put on the diagonal). The length of array should match the number of features.
- 2D array: Full Tikhonov matrix supplied by a user.
  The dimensions of the matrix should be M * M where M is the number of features.
normalize – If True, each feature will be normalized before fitting
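
The three accepted forms of alpha can be illustrated with a short NumPy sketch. The helpers `tikhonov_matrix` and `ridge_fit` are hypothetical names, and the solve step uses the textbook Tikhonov normal equations (X^T X + T^T T) c = X^T y, which is an assumption here and may differ in detail from what the class does internally (for example with respect to normalization or the bias term).

```python
import numpy as np


def tikhonov_matrix(alpha, num_features: int) -> np.ndarray:
    """Build the Tikhonov matrix T from a float, a 1D array or a 2D array."""
    alpha = np.asarray(alpha, dtype=float)
    if alpha.ndim == 0:
        # Single coefficient: T = alpha * I
        return float(alpha) * np.identity(num_features)
    if alpha.ndim == 1:
        # One coefficient per feature: T = diag(alpha)
        if alpha.shape[0] != num_features:
            raise ValueError("1D alpha must have one entry per feature")
        return np.diag(alpha)
    # Full user-supplied matrix, expected to be M x M
    if alpha.shape != (num_features, num_features):
        raise ValueError("2D alpha must have shape (M, M)")
    return alpha


def ridge_fit(X: np.ndarray, y: np.ndarray, alpha) -> np.ndarray:
    """Solve (X^T X + T^T T) c = X^T y (textbook form, an assumption)."""
    T = tikhonov_matrix(alpha, X.shape[1])
    return np.linalg.solve(X.T @ X + T.T @ T, X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])

c_scalar = ridge_fit(X, y, 1e-5)
c_vector = ridge_fit(X, y, np.full(3, 1e-5))
c_matrix = ridge_fit(X, y, 1e-5 * np.identity(3))
```

In this example the three calls describe the same Tikhonov matrix, so they yield the same coefficients.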

fit(X: ndarray, y: ndarray) → ndarray[source]: Fit coefficients based on Ridge regularization.

precision_matrix(X: ndarray) → ndarray[source]: Calculate the precision matrix.

class clease.regression.Lasso(alpha: float = 1e-05, max_iter: int = 1000000)[source]

LASSO regularization.

Parameters:

alpha – regularization coefficient
max_iter – (int) Maximum number of iterations.

fit(X: ndarray, y: ndarray) → ndarray[source]: Fit coefficients based on LASSO regularization.
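
To show how the alpha and max_iter parameters act in a LASSO fit, here is a minimal coordinate-descent sketch in NumPy. It is not the implementation used by this class. The function names are hypothetical, and the objective (1/(2N)) * ||y - Xc||^2 + alpha * ||c||_1 is an assumed convention.

```python
import numpy as np


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value towards zero by threshold (the L1 proximal operator)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_cd(X, y, alpha=1e-5, max_iter=1000, tol=1e-10):
    """Minimize (1/(2N))*||y - Xc||^2 + alpha*||c||_1 by coordinate descent."""
    n, m = X.shape
    coeff = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(max_iter):
        max_update = 0.0
        for j in range(m):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot contribute
            # Residual with the contribution of feature j removed
            partial_resid = y - X @ coeff + X[:, j] * coeff[j]
            rho = X[:, j] @ partial_resid / n
            new_value = soft_threshold(rho, alpha) / col_sq[j]
            max_update = max(max_update, abs(new_value - coeff[j]))
            coeff[j] = new_value
        if max_update < tol:
            break  # converged before reaching max_iter
    return coeff


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
true_coeff = np.array([3.0, 0.0, 0.0, -2.0])
y = X @ true_coeff

# A tiny alpha stays close to the least squares solution
coeff_weak = lasso_cd(X, y, alpha=1e-8)

# Above max|X^T y|/N every coefficient is thresholded to zero
alpha_max = np.max(np.abs(X.T @ y)) / X.shape[0]
coeff_strong = lasso_cd(X, y, alpha=1.01 * alpha_max)
```

The two extreme values of alpha show the trade-off: a small alpha reproduces the unregularized fit, while a sufficiently large alpha removes every feature.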

class clease.regression.ga_fit.GAFit(cf_matrix, e_dft, mutation_prob=0.001, elitism=1, fname='ga_fit.csv', num_individuals='auto', max_num_in_init_pool=None, cost_func='aicc')[source]

Genetic Algorithm for selecting relevant clusters.

Parameters:

cf_matrix: np.ndarray: Design matrix of the linear regression (nxm) where n is the number of data points and m is the number of features
e_dft: list: Array of length n with DFT energies
elitism: int: Number of best structures that will be passed unaltered on to the next generation
fname: str: File name used to backup the population. If this file exists, the next run will load the population from the file and start from there. Another file named ‘fname’_cf_names.txt is created to store the names of selected clusters.
num_individuals: int or str: Number of individuals in the population, or “auto”, in which case 10 times the number of candidate clusters is used
max_num_in_init_pool: int: If given, the maximum number of clusters included in any individual of the initial population. For example, with max_num_in_init_pool=150, each solution in the initial pool contains at most 150 clusters.
cost_func: str: Cost function whose inverse is used as the fitness measure. Possible cost functions:
- bic: Bayesian Information Criterion
- aic: Akaike Information Criterion
- aicc: Corrected Akaike Information Criterion (tends to avoid overfitting better than aic)
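
For orientation, the three criteria can be written in their textbook least squares forms, shown in the sketch below. The function name `information_criteria` is hypothetical, additive constants are dropped, and the exact expressions used by GAFit are not stated in this documentation, so treat this as an assumption.

```python
import math


def information_criteria(rss: float, n: int, k: int) -> dict:
    """Textbook least squares forms of AIC, AICc and BIC.

    rss: residual sum of squares of the fit
    n:   number of data points
    k:   number of fitted coefficients (selected clusters)
    """
    log_term = n * math.log(rss / n)
    aic = 2 * k + log_term
    # Small-sample correction; requires n > k + 1
    aicc = aic + (2 * k**2 + 2 * k) / (n - k - 1)
    bic = k * math.log(n) + log_term
    return {"aic": aic, "aicc": aicc, "bic": bic}


# Same residual, but the second model uses many more coefficients
few = information_criteria(rss=1.0, n=30, k=5)
many = information_criteria(rss=1.0, n=30, k=20)
```

The correction term in aicc grows rapidly as k approaches n, which is why it penalizes models with many clusters more strongly than aic does.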

check_valid()[source]: Check that the current population is valid.

create_new_generation()[source]: Create a new generation.

design_matrix(individual)[source]: Return the corresponding design matrix.

evaluate_fitness()[source]: Evaluate fitness of all species.

static flip_one_mutation(individual)[source]: Apply mutation where one bit flips.

get_eci(individual)[source]: Calculate the LOOCV for the current individual.

index_of_selected_clusters(individual)[source]

Return the indices of the selected clusters

Parameters:

individual: int: Index of the individual

static make_valid(individual)[source]: Make sure that there are at least two active ECIs.
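
The one-bit mutation and the validity rule can be sketched as follows. This assumes that an individual is represented as an array of 0/1 flags, one per candidate cluster, which this documentation does not state explicitly. The function names `flip_one` and `ensure_two_active` are hypothetical and are not the GAFit methods themselves.

```python
import numpy as np

rng = np.random.default_rng(42)


def flip_one(individual: np.ndarray) -> np.ndarray:
    """Return a copy of the individual with one randomly chosen bit flipped."""
    mutated = individual.copy()
    idx = rng.integers(len(mutated))
    mutated[idx] = 1 - mutated[idx]
    return mutated


def ensure_two_active(individual: np.ndarray) -> np.ndarray:
    """Return a copy of the individual with at least two active bits."""
    fixed = individual.copy()
    missing = 2 - int(fixed.sum())
    if missing > 0:
        # Activate randomly chosen inactive bits until two are active
        inactive = np.flatnonzero(fixed == 0)
        fixed[rng.choice(inactive, size=missing, replace=False)] = 1
    return fixed


parent = np.array([1, 0, 0, 0, 0], dtype=int)
mutant = flip_one(parent)
child = ensure_two_active(mutant)
```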

mutate()[source]: Introduce mutations.

plot_evolution()[source]: Create a plot of the evolution.

population_diversity()[source]: Check the diversity of the population.

run(gen_without_change=100, min_change=0.01, save_interval=100)[source]

Run the genetic algorithm.

Return a list consisting of the names of selected clusters at the end of the run.

Parameters:

gen_without_change: int: Terminate if this many generations are created without sufficient improvement
min_change: float: Changes larger than this value are considered “sufficient” improvement
save_interval: int: Rate at which all the populations are backed up in a file

class clease.regression.physical_ridge.PhysicalRidge(lamb_size: float = 1e-06, lamb_dia: float = 1e-06, size_decay: str | Callable[[int], float] = 'linear', dia_decay: str | Callable[[int], float] = 'linear', normalize: bool = True, cf_names: list[str] | None = None)[source]

Physical Ridge is a special ridge regression scheme that enforces a convergent series. The choice of prior distributions is motivated by the expectation that interaction strengths decay with both the number of atoms in the cluster and the diameter of the cluster. See for instance

Cao, L., Li, C. and Mueller, T., 2018. The use of cluster expansions to predict the structures and properties of surfaces and nanostructured materials. Journal of chemical information and modeling, 58(12), pp.2401-2413.

This fitting scheme uses Gaussian priors on the coefficients of the model

P(M) = P_size(M)*P_dia(M), where

P_size(M) = prod_i exp(-lamb_size*size_decay(size)*coeff_i^2)
P_dia(M) = prod_i exp(-lamb_dia*dia_decay(dia)*coeff_i^2)

where size_decay and dia_decay are monotonically increasing functions of the size and diameter, respectively. The product runs over all coefficients in the model M.
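
With Gaussian priors of this form, maximizing the posterior amounts to a ridge fit with one penalty per coefficient. The sketch below illustrates this in NumPy under stated assumptions: the noise variance is absorbed into lamb_size and lamb_dia, both decay functions are taken as linear, and the name `physical_ridge_sketch` is hypothetical. It is not the class implementation.

```python
import numpy as np


def physical_ridge_sketch(X, y, sizes, diameters,
                          lamb_size=1e-6, lamb_dia=1e-6,
                          size_decay=lambda s: s,
                          dia_decay=lambda d: d):
    """Minimize ||y - Xc||^2 + sum_i p_i * c_i^2 with per-coefficient p_i."""
    penalty = np.array([
        lamb_size * size_decay(s) + lamb_dia * dia_decay(d)
        for s, d in zip(sizes, diameters)
    ])
    # Normal equations of the penalized least squares problem
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)


rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
true_coeff = np.array([1.0, 0.5, 0.1])
y = X @ true_coeff

sizes = [2, 3, 4]            # number of atoms in each cluster
diameters = [3.0, 4.0, 5.0]  # cluster diameters (arbitrary units)

weak = physical_ridge_sketch(X, y, sizes, diameters)
strong = physical_ridge_sketch(X, y, sizes, diameters,
                               lamb_size=1e3, lamb_dia=1e3)
```

Larger prefactors shrink all coefficients, and because the decay functions increase with size and diameter, coefficients of large clusters are penalized most.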

Parameters:

lamb_size – Prefactor in front of the size penalization
lamb_dia – Prefactor in front of the diameter penalization
size_decay – The size_decay function in the priors explained above. It can be one of [‘linear’, ‘exponential’, ‘polyN’], where N is any integer, or a callable function with the signature f(size), where size is the number of atoms in the cluster. If polyN is given, the penalization is proportional to size**N
dia_decay – The dia_decay function in the priors explained above. It can be one of [‘linear’, ‘exponential’, ‘polyN’], where N is any integer, or a callable function with the signature f(dia), where dia is the diameter. If polyN is given, the penalization is proportional to dia**N
normalize –
If True the data will be normalized to unit variance and zero mean before fitting.

NOTE: Normalization works only when the first column in X corresponds to a constant. If the X matrix contains several simultaneous fits (e.g. energy, pressure, bulk moduli), there will typically be different columns that correspond to the bias term for the different groups. It is recommended to set normalize=False in such cases.
cf_names – List of strings, used to initialize the size and diameters which will be used.

add_constraint(A: ndarray, c: ndarray) → None[source]

Adds a constraint that the coefficients (ECIs) have to obey: A.dot(coeff) = c

Parameters:

A – Matrix describing the linear constraint
c – Vector representing the right hand side of constraint equations
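
One standard way to impose a linear equality constraint A.dot(coeff) = c on a ridge fit is to solve the corresponding KKT system with Lagrange multipliers. The NumPy sketch below illustrates that technique. The name `constrained_ridge` and the uniform penalty alpha are hypothetical, and how PhysicalRidge enforces the constraint internally is not stated in this documentation.

```python
import numpy as np


def constrained_ridge(X, y, A, c_rhs, alpha=1e-6):
    """Minimize ||Xc - y||^2 + alpha*||c||^2 subject to A c = c_rhs."""
    m = X.shape[1]
    k = A.shape[0]
    # KKT system: stationarity of the Lagrangian plus the constraint rows
    kkt = np.block([
        [2.0 * (X.T @ X + alpha * np.identity(m)), A.T],
        [A, np.zeros((k, k))],
    ])
    rhs = np.concatenate([2.0 * X.T @ y, c_rhs])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:m]  # drop the Lagrange multipliers


rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([0.7, 0.2, -0.4])

# Require that the first two coefficients sum to one
A = np.array([[1.0, 1.0, 0.0]])
c_rhs = np.array([1.0])

coeff = constrained_ridge(X, y, A, c_rhs)
```

Each row of A together with the matching entry of c defines one constraint equation, so A has as many columns as there are coefficients.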

diameters_from_names(names: list[str]) → None[source]

Extract the diameters from a list of correlation function names

Parameters:

names – List of cluster names. The length of the list has to match the number of columns in the X matrix passed to the fit method. Ex: [‘c0’, ‘c1_1’, ‘c2_d0000_0_00’]

fit(X: ndarray, y: ndarray) → ndarray[source]

Fit ECIs

Parameters:

X – Design matrix with correlation functions. The shape is N x M, where N is the number of data points and M is the number of correlation functions
y – Vector with target values. The length of this vector is N (i.e. equal to the number of rows in X)

fit_data(X: ndarray, y: ndarray) → tuple[ndarray, ndarray][source]

If normalize is True, a normalized version of the passed data is returned. Otherwise, X and y are returned as they were passed.

Parameters:

X – Design matrix
y – Target data
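
A rough sketch of what normalizing to zero mean and unit variance can look like is given below. It assumes that the constant first column is dropped and that y is only centered. These details are assumptions, not statements about the method's actual return values, and `normalize_data` is a hypothetical name.

```python
import numpy as np


def normalize_data(X: np.ndarray, y: np.ndarray):
    """Return centered and scaled features (constant column dropped) and centered y."""
    features = X[:, 1:]  # assumes X[:, 0] is the constant (bias) column
    mean = features.mean(axis=0)
    std = features.std(axis=0, ddof=1)  # a zero std here would indicate another constant column
    X_norm = (features - mean) / std
    y_norm = y - y.mean()
    return X_norm, y_norm


rng = np.random.default_rng(3)
X = np.column_stack([np.ones(25), rng.normal(loc=2.0, scale=5.0, size=(25, 2))])
y = rng.normal(size=25)

X_norm, y_norm = normalize_data(X, y)
```

A constant column has zero variance and cannot be scaled, which is consistent with the requirement that the bias term sits in the first column when normalize is True.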

sizes_from_names(names: list[str]) → None[source]

Extract the sizes from a list of correlation function names

Parameters:

names – List of cluster names. The length of the list has to match the number of columns in the X matrix passed to the fit method. Ex: [‘c0’, ‘c1_1’, ‘c2_d0000_0_00’]
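
The example names suggest that the integer following the leading "c" is the number of atoms in the cluster. Under that assumption, the sizes could be parsed as in the sketch below, where `size_from_name` is a hypothetical helper and not part of the class.

```python
import re


def size_from_name(name: str) -> int:
    """Parse the cluster size from a name such as 'c2_d0000_0_00'."""
    match = re.match(r"c(\d+)", name)
    if match is None:
        raise ValueError(f"Unrecognized cluster name: {name}")
    return int(match.group(1))


names = ["c0", "c1_1", "c2_d0000_0_00"]
sizes = [size_from_name(name) for name in names]
```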

class clease.regression.bayesian_compressive_sensing.BayesianCompressiveSensing(shape_var=0.5, rate_var=0.5, shape_lamb=0.5, lamb_opt_start=200, variance_opt_start=100, fname='bayes_compr_sens.json', maxiter=100000, output_rate_sec=2, select_strategy='max_increase', noise=0.1, init_lamb=0.0, penalty=1e-08)[source]

Fit a sparse CE model to data. Based on the method described in

Babacan, S. Derin, Rafael Molina, and Aggelos K. Katsaggelos. “Bayesian compressive sensing using Laplace priors.” IEEE Transactions on Image Processing 19.1 (2010): 53-63.

Different quantities have different priors:

- For the ECIs a normal distribution is assumed (the i-th ECI is: eci_i ~ N(J | 0, var_i))
- The inverse variance of each ECI is gamma distributed (i.e. 1/var_i ~ gamma(x | 1, lambda/2))
- The lambda parameter above is also gamma distributed (i.e. lamb ~ gamma(x | shape_lamb/2, shape_lamb/2))
- The noise parameter is uniformly distributed on the positive axis (i.e. noise ~ uniform(x | 0, inf))

Parameters:

shape_var: float: Shape parameter for the gamma distribution for the inverse variance (1/var ~ gamma(x | shape_var/2, rate_var/2))
rate_var: float: Rate parameter for the gamma distribution for the inverse variance (1/var ~ gamma(x | shape_var/2, rate_var/2))
shape_lamb: float: Shape parameter for gamma distribution for the lambda parameter (lambda ~ gamma(x | 1, shape_lamb))
variance_opt_start: int: Optimization of inverse variance starts after this number of iterations
lamb_opt_start: int: Optimization of lambda and shape_lamb starts after this number of iterations. If this number is set very high, lambda will be kept at zero, making the algorithm effectively a Relevance Vector Machine (RVM)
fname: str: Backup file for parameters
maxiter: int: Maximum number of iterations
output_rate_sec: int: Interval in seconds between status messages
select_strategy: str: Strategy for selecting new correlation function for each iteration. If ‘max_increase’ it will select the basis function that leads to the largest increase in likelihood value. If ‘random’ correlation functions are selected at random
noise: float: Initial estimate of the noise in the data
init_lamb: float: Initial value for the lambda parameter
penalty: float: Penalization value added to the diagonal of matrices to avoid singular matrices

estimate_loocv()[source]: Return an estimate of the LOOCV.

fit(X, y)[source]

Fit ECIs to the data

Parameters:

X: np.ndarray: Design matrix (NxM: N number of datapoints, M number of correlation functions)
y: np.ndarray: Array of length N with the energies

get_basis_function_index(select_strategy) → int[source]: Select a new correlation function.

log_likelihood_for_each_gamma(gammas)[source]

Log likelihood value for all gammas.

Parameters:

gammas (np.ndarray) – Value for all the gammas

mu()[source]: Calculate the expectation value for the ECIs

optimal_gamma(indx)[source]

Return the gamma value that maximizes the likelihood

Parameters:

indx: int: Index of the selected correlation function

optimal_inv_variance()[source]: Calculate the optimal value for the inverse variance

optimal_lamb()[source]: Calculate the optimal value for the lambda parameter.

optimal_shape_lamb()[source]: Calculate the optimal value for the shape parameter for lambda.

precision_matrix(X)[source]: Return the precision matrix needed by the Evaluate class. Only contributions from the correlation functions with gamma > 0 are included.

rmse()[source]: Return root mean square error.

save()[source]: Save the results to file.

show_shape_parameter()[source]: Show a plot of the transient equation for the optimal shape parameter for lambda.

todict()[source]: Convert all parameters to a dictionary.

update_quantities()[source]: Update helper parameters needed for the next iteration.

update_sigma_mu()[source]: Update sigma and mu.

class clease.regression.sequential_cluster_ridge.SequentialClusterRidge(min_alpha=1e-10, max_alpha=10.0, num_alpha=20, verbose: bool = False)[source]

SequentialClusterRidge is a fit method that optimizes the leave-one-out cross-validation (LOOCV) score over the regularization parameter as well as the cluster support. The method adds features from the design matrix X (see the fit method) column by column. For each set of columns it performs a fit for every value in a log-spaced set of regularization parameters. The returned coefficients are those of the model with the smallest LOOCV.

Parameters:

min_alpha: float: Minimum value of the regularization parameter alpha
max_alpha: float: Maximum value of the regularization parameter alpha
num_alpha: int: Number of alpha values
verbose: bool: Print information about fit after completion

fit(X, y)[source]

Performs the fitting

Parameters:

X: np.ndarray: Design matrix of size (N x M). During the CV optimization, columns of X are added one by one, starting with a model consisting of the first two columns.
y: np.ndarray: Vector of length N
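
The procedure described for this class can be sketched in NumPy as follows. This is an illustration of the idea, not the library code: the function names are hypothetical, plain ridge with a uniform alpha is assumed, the LOOCV is computed with the closed-form hat-matrix shortcut for ridge regression, and coefficients of columns that are not included are set to zero.

```python
import numpy as np


def loocv_ridge(X, y, alpha):
    """Return (coeff, LOOCV) for a ridge fit with regularization alpha."""
    precision = np.linalg.inv(X.T @ X + alpha * np.identity(X.shape[1]))
    coeff = precision @ X.T @ y
    # Diagonal of the hat matrix H = X (X^T X + alpha I)^-1 X^T
    hat_diag = np.einsum("ij,jk,ik->i", X, precision, X)
    loo_resid = (y - X @ coeff) / (1.0 - hat_diag)
    return coeff, np.sqrt(np.mean(loo_resid ** 2))


def sequential_ridge(X, y, min_alpha=1e-10, max_alpha=10.0, num_alpha=20):
    """Scan the number of leading columns and alpha, keep the lowest LOOCV."""
    alphas = np.logspace(np.log10(min_alpha), np.log10(max_alpha), num_alpha)
    best_cv, best_coeff = np.inf, None
    for num_cols in range(2, X.shape[1] + 1):  # start with the first two columns
        for alpha in alphas:
            coeff, cv = loocv_ridge(X[:, :num_cols], y, alpha)
            if cv < best_cv:
                best_cv = cv
                best_coeff = np.zeros(X.shape[1])
                best_coeff[:num_cols] = coeff
    return best_coeff, best_cv


rng = np.random.default_rng(4)
X = rng.normal(size=(40, 6))
# Only the first three columns carry signal; the rest are noise features
y = X[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.01 * rng.normal(size=40)

best_coeff, best_cv = sequential_ridge(X, y)
```

Because columns are added in order, the ordering of the columns in X determines which cluster supports are considered.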