Fitting Schemes
- class clease.regression.Tikhonov(alpha: float | ndarray = 1e-05, penalize_bias_term: bool = False, normalize: bool = True)
Ridge regularization.
- Parameters:
alpha –
regularization term
- float: A single regularization coefficient is used for all features.
Tikhonov matrix is T = alpha * I (I = identity matrix).
- 1D array: Regularization coefficient is defined for each feature.
Tikhonov matrix is T = diag(alpha) (the alpha values are put on the diagonal). The length of the array must match the number of features.
- 2D array: Full Tikhonov matrix supplied by a user.
The dimensions of the matrix must be M x M, where M is the number of features.
normalize – If True, each feature is normalized before fitting.
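The three accepted forms of alpha can be illustrated with a small NumPy sketch. This is not the CLEASE implementation: it assumes the textbook Tikhonov objective ||X w - y||^2 + ||T w||^2, takes T exactly as described above, and ignores the penalize_bias_term and normalize options. The helper names tikhonov_matrix and tikhonov_fit are hypothetical.

```python
import numpy as np


def tikhonov_matrix(alpha, num_features):
    """Build the Tikhonov matrix T from a scalar, a 1D array or a 2D array."""
    alpha = np.asarray(alpha, dtype=float)
    if alpha.ndim == 0:
        # Single coefficient shared by all features: T = alpha * I
        return float(alpha) * np.eye(num_features)
    if alpha.ndim == 1:
        if alpha.shape[0] != num_features:
            raise ValueError("1D alpha must have one entry per feature")
        # One coefficient per feature: T = diag(alpha)
        return np.diag(alpha)
    if alpha.shape != (num_features, num_features):
        raise ValueError("2D alpha must be an M x M matrix")
    # Full user-supplied Tikhonov matrix
    return alpha


def tikhonov_fit(X, y, alpha):
    """Minimize ||X w - y||^2 + ||T w||^2 (assumed textbook form)."""
    T = tikhonov_matrix(alpha, X.shape[1])
    return np.linalg.solve(X.T @ X + T.T @ T, X.T @ y)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w_scalar = tikhonov_fit(X, y, 1e-5)
w_diag = tikhonov_fit(X, y, np.full(3, 1e-5))
w_full = tikhonov_fit(X, y, 1e-5 * np.eye(3))
```

In this sketch the three calls are equivalent by construction, since a scalar alpha, a constant 1D array and alpha times the identity all describe the same matrix T.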
- class clease.regression.Lasso(alpha: float = 1e-05, max_iter: int = 1000000)
LASSO regularization.
- Parameters:
alpha – regularization coefficient
max_iter – (int) Maximum number of iterations.
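The roles of alpha and max_iter can be seen in a minimal coordinate-descent sketch of LASSO. It assumes the objective (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1, which is the scaling used by scikit-learn's Lasso; whether clease.regression.Lasso uses the same scaling or solver is not stated here. The function names and the tol argument are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha=1e-5, max_iter=1000, tol=1e-10):
    """Minimize (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1."""
    n, m = X.shape
    w = np.zeros(m)
    col_norm = (X ** 2).sum(axis=0) / n
    for _ in range(max_iter):
        w_old = w.copy()
        for j in range(m):
            if col_norm[j] == 0.0:
                continue
            # Residual with the contribution of feature j removed
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_norm[j]
        # Stop early once a full sweep no longer changes the coefficients
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0])

w_small = lasso_coordinate_descent(X, y, alpha=1e-5)
w_large = lasso_coordinate_descent(X, y, alpha=100.0)
```

A small alpha leaves the fit close to ordinary least squares, while a sufficiently large alpha drives every coefficient to zero.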
- class clease.regression.ga_fit.GAFit(cf_matrix, e_dft, mutation_prob=0.001, elitism=1, fname='ga_fit.csv', num_individuals='auto', max_num_in_init_pool=None, cost_func='aicc')
Genetic Algorithm for selecting relevant clusters.
Parameters:
- cf_matrix: np.ndarray
Design matrix of the linear regression (nxm) where n is the number of data points and m is the number of features
- e_dft: list
Array of length n with DFT energies
- elitism: int
Number of best individuals that are passed unaltered to the next generation
- fname: str
File name used to backup the population. If this file exists, the next run will load the population from the file and start from there. Another file named ‘fname’_cf_names.txt is created to store the names of selected clusters.
- num_individuals: int or str
Integer with the number of individuals, or the string “auto”, in which case 10 times the number of candidate clusters is used
- max_num_in_init_pool: int
If given, this is the maximum number of clusters included in any solution of the initial population. For example, if max_num_in_init_pool=150, the initial pool contains only solutions with at most 150 clusters.
- cost_func: str
Cost function whose inverse is used as the fitness measure. Possible cost functions: bic (Bayesian Information Criterion), aic (Akaike Information Criterion), aicc (Modified Akaike Information Criterion; tends to avoid overfitting better than aic)
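For orientation, the three cost functions can be written down in their common least-squares form. The sketch below assumes Gaussian errors and drops additive constants; the exact expressions used inside GAFit may differ, and the function name information_criteria is hypothetical.

```python
import math


def information_criteria(rss, num_data, num_features):
    """Return AIC, AICc and BIC for a least-squares fit with Gaussian errors.

    rss is the residual sum of squares; additive constants are dropped.
    """
    n, k = num_data, num_features
    if n - k - 1 <= 0:
        raise ValueError("AICc requires num_data > num_features + 1")
    goodness_of_fit = n * math.log(rss / n)
    aic = goodness_of_fit + 2 * k
    # Small-sample correction that penalizes extra features more strongly
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = goodness_of_fit + k * math.log(n)
    return {"aic": aic, "aicc": aicc, "bic": bic}


scores = information_criteria(rss=0.02, num_data=40, num_features=5)
```

The correction term in aicc grows quickly as the number of selected clusters approaches the number of data points, which is consistent with the remark that aicc tends to avoid overfitting better than aic.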
- index_of_selected_clusters(individual)
Return the indices of the selected clusters
Parameters:
- individual: int
Index of the individual
- run(gen_without_change=100, min_change=0.01, save_interval=100)
Run the genetic algorithm.
Return a list consisting of the names of selected clusters at the end of the run.
Parameters:
- gen_without_change: int
Terminate if gen_without_change generations are created without sufficient improvement
- min_change: float
Changes larger than this value are considered “sufficient” improvement
- save_interval: int
Rate at which all the populations are backed up in a file
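The interplay of gen_without_change and min_change can be sketched as a simple stopping rule. This is an illustration only: it assumes that improvement is measured as the absolute increase of the best fitness relative to the last accepted value, which the description above does not specify. The function name first_stopping_generation is hypothetical.

```python
def first_stopping_generation(best_scores, gen_without_change=100, min_change=0.01):
    """Return the generation at which the stop rule fires, or None.

    best_scores holds the best fitness of each generation (higher is better).
    """
    stale = 0
    reference = best_scores[0]
    for generation, score in enumerate(best_scores[1:], start=1):
        if score - reference > min_change:
            # Sufficient improvement: reset the counter and move the reference
            reference = score
            stale = 0
        else:
            stale += 1
        if stale >= gen_without_change:
            return generation
    return None


history = [1.0, 1.5, 1.505, 1.506, 1.507, 1.508]
stop_at = first_stopping_generation(history, gen_without_change=3, min_change=0.01)
```

In this example the jump from 1.0 to 1.5 counts as an improvement, while the following three generations change by less than min_change, so the rule fires at generation 4.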
- class clease.regression.physical_ridge.PhysicalRidge(lamb_size: float = 1e-06, lamb_dia: float = 1e-06, size_decay: str | Callable[[int], float] = 'linear', dia_decay: str | Callable[[int], float] = 'linear', normalize: bool = True, cf_names: List[str] | None = None)
Physical Ridge is a special ridge regression scheme that enforces a convergent series. The choice of prior distributions is motivated by the physical expectation that interaction strengths decay with both the number of atoms in the cluster and the diameter of the cluster. See for instance
Cao, L., Li, C. and Mueller, T., 2018. The use of cluster expansions to predict the structures and properties of surfaces and nanostructured materials. Journal of chemical information and modeling, 58(12), pp.2401-2413.
This fitting scheme uses Gaussian priors on the coefficients of the model
P(M) = P_size(M)*P_dia(M), where
P_size(M) = prod_i exp(-lamb_size*size_decay(size)*coeff_i^2)
P_dia(M) = prod_i exp(-lamb_dia*dia_decay(dia)*coeff_i^2)
where size_decay and dia_decay are monotonically increasing functions of the size and diameter, respectively. The product runs over all coefficients in the model M.
- Parameters:
lamb_size – Prefactor in front of the size penalization
lamb_dia – Prefactor in front of the diameter penalization
size_decay – The size_decay function in the priors explained above. It can be one of [‘linear’, ‘exponential’, ‘polyN’], where N is any integer, or a callable function with the signature f(size), where size is the number of atoms in the cluster. If polyN is given the penalization is proportional to size**N
dia_decay – The dia_decay function in the priors explained above. It can be one of [‘linear’, ‘exponential’, ‘polyN’], where N is any integer, or a callable function with the signature f(dia), where dia is the diameter. If polyN is given the penalization is proportional to dia**N
normalize –
If True the data will be normalized to unit variance and zero mean before fitting.
NOTE: Normalization works only when the first column in X corresponds to a constant. If the X matrix contains several simultaneous fits (e.g. energy, pressure, bulk moduli) there will typically be different columns that correspond to the bias term for the different groups. It is recommended to set normalize=False in such cases.
cf_names – List of strings, used to initialize the size and diameters which will be used.
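Taking the negative logarithm of the Gaussian priors described above turns them into a diagonal ridge penalty, lamb_size*size_decay(size_i) + lamb_dia*dia_decay(dia_i), on each coefficient. The sketch below illustrates that idea with NumPy. It is not the CLEASE implementation: it assumes the objective ||X w - y||^2 + sum_i penalty_i * w_i^2 with the noise variance absorbed into the prefactors, it skips normalization, and the form chosen for 'exponential' is a guess. The names decay_function and physical_ridge_fit are hypothetical.

```python
import numpy as np


def decay_function(spec):
    """Turn 'linear', 'exponential', 'polyN' or a callable into a function."""
    if callable(spec):
        return spec
    if spec == "linear":
        return lambda x: x
    if spec == "exponential":
        # Assumed form; the exact exponential used by CLEASE is not stated
        return lambda x: np.exp(x)
    if spec.startswith("poly"):
        power = int(spec[4:])
        return lambda x: x ** power
    raise ValueError(f"Unknown decay specification: {spec}")


def physical_ridge_fit(X, y, sizes, diameters, lamb_size=1e-6, lamb_dia=1e-6,
                       size_decay="linear", dia_decay="linear"):
    """Ridge fit with a per-coefficient penalty built from size and diameter."""
    f_size = decay_function(size_decay)
    f_dia = decay_function(dia_decay)
    penalty = np.array([
        lamb_size * f_size(s) + lamb_dia * f_dia(d)
        for s, d in zip(sizes, diameters)
    ])
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)


rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.5, 0.3, 0.2])
sizes = [0, 1, 2, 3]              # number of atoms in each cluster
diameters = [0.0, 0.0, 3.0, 4.5]  # illustrative cluster diameters

w_weak = physical_ridge_fit(X, y, sizes, diameters)
w_strong = physical_ridge_fit(X, y, sizes, diameters, lamb_size=1e3,
                              size_decay="poly2")
w_callable = physical_ridge_fit(X, y, sizes, diameters, lamb_size=1e3,
                                size_decay=lambda s: s ** 2)
```

With a large lamb_size and a quadratic size decay, the coefficient of the largest cluster is suppressed much more strongly than that of the empty cluster, which is the convergent behaviour the scheme is meant to enforce.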
- add_constraint(A: ndarray, c: ndarray) → None
Adds a constraint that the coefficients (ECI) have to obey, A.dot(coeff) = c
- Parameters:
A – Matrix describing the linear constraint
c – Vector representing the right hand side of constraint equations
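One standard way to impose a linear equality constraint A.dot(coeff) = c on a ridge fit is to solve the Karush-Kuhn-Tucker (KKT) system of the constrained problem. The sketch below shows that approach with NumPy; how PhysicalRidge handles constraints internally is not described here, so this is an assumption, and the function name constrained_ridge is hypothetical.

```python
import numpy as np


def constrained_ridge(X, y, penalty, A, c):
    """Minimize ||X w - y||^2 + sum_i penalty_i w_i^2 subject to A w = c."""
    m = X.shape[1]
    num_constraints = A.shape[0]
    # KKT system: stationarity of the Lagrangian plus the constraint rows
    kkt = np.block([
        [2.0 * (X.T @ X + np.diag(penalty)), A.T],
        [A, np.zeros((num_constraints, num_constraints))],
    ])
    rhs = np.concatenate([2.0 * X.T @ y, c])
    solution = np.linalg.solve(kkt, rhs)
    # The first m entries are the coefficients, the rest are multipliers
    return solution[:m]


rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.01 * rng.normal(size=25)

A = np.array([[1.0, 1.0, 1.0]])  # require the coefficients to sum to zero
c = np.array([0.0])
w = constrained_ridge(X, y, np.full(3, 1e-6), A, c)
```

Each row of A together with the matching entry of c defines one constraint equation, so several constraints can be stacked in the same call.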
- diameters_from_names(names: List[str]) → None
Extract the diameters from a list of correlation function names
- Parameters:
names – List of cluster names. The length of the list has to match the number of columns in the X matrix passed to the fit method. Ex: [‘c0’, ‘c1_1’, ‘c2_d0000_0_00’]
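The example names suggest a pattern of the form cN, optionally followed by _dXXXX and further indices. The sketch below parses names under that assumption only. Reading N as the cluster size and the digits after d as an integer diameter label is a guess based on the example, not a documented rule, and size_and_diameter_label is a hypothetical helper rather than part of CLEASE.

```python
import re

# Assumed pattern: 'c<size>' optionally followed by '_d<label>'
NAME_PATTERN = re.compile(r"^c(\d+)(?:_d(\d+))?")


def size_and_diameter_label(name):
    """Return (size, diameter label) parsed from a cluster name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized cluster name: {name}")
    size = int(match.group(1))
    # Names without a 'd' part (e.g. 'c0', 'c1_1') get label 0 here
    diameter_label = int(match.group(2)) if match.group(2) else 0
    return size, diameter_label


names = ["c0", "c1_1", "c2_d0000_0_00"]
parsed = [size_and_diameter_label(name) for name in names]
```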
- fit(X: ndarray, y: ndarray) → ndarray
Fit ECIs
- Parameters:
X – Design matrix with correlation functions. The shape is N x M, where N is the number of data points and M is the number of correlation functions
y – Vector with target values. The length of this vector is N (i.e. equal to the number of rows in X)
- class clease.regression.bayesian_compressive_sensing.BayesianCompressiveSensing(shape_var=0.5, rate_var=0.5, shape_lamb=0.5, lamb_opt_start=200, variance_opt_start=100, fname='bayes_compr_sens.json', maxiter=100000, output_rate_sec=2, select_strategy='max_increase', noise=0.1, init_lamb=0.0, penalty=1e-08)
Fit a sparse CE model to data. Based on the method described in
Babacan, S. Derin, Rafael Molina, and Aggelos K. Katsaggelos. “Bayesian compressive sensing using Laplace priors.” IEEE Transactions on Image Processing 19.1 (2010): 53-63.
Different quantities have different priors.
- For the ECIs a normal distribution is assumed
(the i-th ECI is distributed as eci_i ~ N(J | 0, var_i))
- The inverse variance of each ECI is gamma distributed
(i.e. 1/var_i ~ gamma(x | 1, lambda/2))
- The lambda parameter above is also gamma distributed
(i.e. lamb ~ gamma(x | shape_lamb/2, shape_lamb/2))
- The noise parameter is uniformly distributed on the
positive axis (i.e. noise ~ uniform(x | 0, inf))
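To make the hierarchy of priors concrete, the sketch below draws one sample from it with NumPy, following the distributions exactly as listed above. It assumes that the second argument of each gamma distribution is a rate, so it is inverted to obtain NumPy's scale parameter. The improper uniform prior on the noise is not sampled. The function name sample_eci_prior is hypothetical.

```python
import numpy as np


def sample_eci_prior(num_eci, shape_lamb=0.5, seed=0):
    """Draw lambda, the inverse variances and the ECIs from the priors."""
    rng = np.random.default_rng(seed)
    # NumPy's gamma takes (shape, scale) with scale = 1 / rate
    lamb = rng.gamma(shape_lamb / 2.0, 2.0 / shape_lamb)
    # 1/var_i ~ gamma(x | 1, lambda/2), rate lambda/2 -> scale 2/lambda
    inv_var = rng.gamma(1.0, 2.0 / lamb, size=num_eci)
    # eci_i ~ N(0, var_i); normal() expects the standard deviation
    eci = rng.normal(0.0, np.sqrt(1.0 / inv_var))
    return lamb, inv_var, eci


lamb, inv_var, eci = sample_eci_prior(num_eci=5)
```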
Parameters:
- shape_var: float
Shape parameter for the gamma distribution for the inverse variance (1/var ~ gamma(x | shape_var/2, rate_var/2))
- rate_var: float
Rate parameter for the gamma distribution for the inverse variance (1/var ~ gamma(x | shape_var/2, rate_var/2))
- shape_lamb: float
Shape parameter for gamma distribution for the lambda parameter (lambda ~ gamma(x | 1, shape_lamb))
- variance_opt_start: int
Optimization of inverse variance starts after this number of iterations
- lamb_opt_start: int
Optimization of lambda and shape_lamb starts after this number of iterations. If this number is set very high, lambda will be kept at zero, making the algorithm effectively a Relevance Vector Machine (RVM)
- fname: str
Backup file for parameters
- maxiter: int
Maximum number of iterations
- output_rate_sec: int
Interval in seconds between status messages
- select_strategy: str
Strategy for selecting a new correlation function in each iteration. If ‘max_increase’, the basis function that leads to the largest increase in likelihood value is selected. If ‘random’, correlation functions are selected at random
- noise: float
Initial estimate of the noise in the data
- init_lamb: float
Initial value for the lambda parameter
- penalty: float
Penalization value added to the diagonal of matrices to avoid singular matrices
- fit(X, y)
Fit ECIs to the data
Parameters:
- X: np.ndarray
Design matrix (NxM: N number of datapoints, M number of correlation functions)
- y: np.ndarray
Array of length N with the energies
- log_likelihood_for_each_gamma(gammas)
Log likelihood value for all gammas.
- Parameters:
gammas (np.ndarray) – Value for all the gammas
- optimal_gamma(indx)
Return the gamma value that maximizes the likelihood
Parameters:
- indx: int
Index of the selected correlation function
- precision_matrix(X)
Return the precision matrix needed by the Evaluate class. Only contributions from the correlation functions with gamma > 0 are included.
- class clease.regression.sequential_cluster_ridge.SequentialClusterRidge(min_alpha=1e-10, max_alpha=10.0, num_alpha=20, verbose: bool = False)
SequentialClusterRidge is a fit method that optimizes the leave-one-out cross-validation (LOOCV) score over the regularization parameter as well as the cluster support. The method adds features from the design matrix X (see fit method) one column at a time. For each set of columns it performs a fit for each value in a log-spaced set of regularization parameters. The returned coefficients are those of the model with the smallest LOOCV score.
Parameters:
- min_alpha: float
Minimum value of the regularization parameter alpha
- max_alpha: float
Maximum value of the regularization parameter alpha
- num_alpha: int
Number of alpha values
- verbose: bool
Print information about fit after completion
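The procedure described for SequentialClusterRidge can be sketched in a few lines of NumPy. The sketch assumes a plain ridge penalty alpha * I, uses the closed-form leave-one-out residuals of ridge regression, and pads the coefficients of unused columns with zeros. These are illustrative choices rather than confirmed details of the CLEASE implementation, and the function names are hypothetical.

```python
import numpy as np


def ridge_loocv(X, y, alpha):
    """Ridge fit plus its leave-one-out CV score from the hat matrix."""
    gram_inv = np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1]))
    coeff = gram_inv @ X.T @ y
    # Diagonal of the hat matrix H = X (X^T X + alpha I)^-1 X^T
    hat_diag = np.einsum("ij,jk,ik->i", X, gram_inv, X)
    loo_residual = (y - X @ coeff) / (1.0 - hat_diag)
    return coeff, np.sqrt(np.mean(loo_residual ** 2))


def sequential_ridge(X, y, min_alpha=1e-10, max_alpha=10.0, num_alpha=20):
    """Scan column counts and log-spaced alphas; keep the lowest LOOCV."""
    alphas = np.logspace(np.log10(min_alpha), np.log10(max_alpha), num_alpha)
    best_cv, best_coeff = np.inf, np.zeros(X.shape[1])
    for num_cols in range(1, X.shape[1] + 1):
        for alpha in alphas:
            coeff, cv = ridge_loocv(X[:, :num_cols], y, alpha)
            if cv < best_cv:
                best_cv = cv
                best_coeff = np.zeros(X.shape[1])
                best_coeff[:num_cols] = coeff
    return best_coeff, best_cv


rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = X[:, :2] @ np.array([1.0, -0.5]) + 0.01 * rng.normal(size=30)

coeff, cv = sequential_ridge(X, y)
```

Because columns are added in order, the result depends on how the columns of X are sorted; placing the most relevant features first lets the scan find a compact model early.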