Fitting ECIs
The Evaluate Class
- class clease.evaluate.Evaluate(settings, prop='energy', cf_names=None, select_cond=None, parallel=False, num_core='all', fitting_scheme='ridge', alpha=1e-05, max_cluster_size=None, max_cluster_dia=None, scoring_scheme='loocv', min_weight=1.0, nsplits=10, num_repetitions=1, normalization_symbols: Sequence[str] | None = None)
Evaluate RMSE/MAE of the fit and CV scores.
- Parameters:
settings – ClusterExpansionSettings object
prop – str User-defined property for the fit. The property should exist in the database as a key-value pair. Default is energy.
cf_names – list Names of clusters to include in the evaluation. If None, all of the possible clusters are included.
select_cond – tuple or list of tuples (optional) Custom selection condition specified by the user. Default only includes “converged=True” and “struct_type=’initial’”.
max_cluster_size – int Maximum number of atoms in a cluster to include in the fit. If None, no restriction on the number of atoms is imposed.
max_cluster_dia – float or int Maximum diameter of a cluster (in angstrom) to include in the fit. If None, no restriction on the diameter is imposed. Note that this is the diameter of the circumscribed sphere, which is slightly different from the meaning of max_cluster_dia in ClusterExpansionSettings, where it refers to the maximum internal distance between any of the atoms in the cluster.
scoring_scheme – str Should be one of ‘loocv’, ‘loocv_fast’ or ‘k-fold’.
min_weight – float Weight given to the data point furthest away from any structure on the convex hull. An exponential weighting function is used and the decay rate is calculated as
decay = log(min_weight)/min(sim_measure)
where sim_measure is a similarity measure used to assess how different the structure is from structures on the convex hull.
nsplits – int Number of splits to use when partitioning the dataset into training and validation data. Only used when scoring_scheme=’k-fold’
num_repetitions – int Number of repetitions to use when calculating the k-fold cross validation. The partitioning is repeated num_repetitions times and the resulting value is the average of the k-fold cross validation scores obtained in the individual runs.
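The interplay of nsplits and num_repetitions can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the CLEASE implementation: the helper name repeated_kfold_rmse, the ridge solver, and the way the squared errors are pooled within one repetition are all choices made for this example.

```python
import numpy as np


def repeated_kfold_rmse(X, y, alpha=1e-5, nsplits=10, num_repetitions=1, seed=0):
    """Average k-fold RMSE of a ridge fit over several random partitionings.

    Hypothetical helper for illustration; not part of CLEASE.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(num_repetitions):
        # Draw a fresh random partitioning for every repetition
        folds = np.array_split(rng.permutation(n), nsplits)
        sq_err = 0.0
        for val_idx in folds:
            train = np.setdiff1d(np.arange(n), val_idx)
            # Ridge solution on the training part only
            A = X[train].T @ X[train] + alpha * np.eye(X.shape[1])
            coeff = np.linalg.solve(A, X[train].T @ y[train])
            # Accumulate squared errors on the held-out part
            sq_err += np.sum((X[val_idx] @ coeff - y[val_idx]) ** 2)
        scores.append(np.sqrt(sq_err / n))
    # Final score: mean over all repetitions
    return float(np.mean(scores))


rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.25])  # noise-free synthetic target
score = repeated_kfold_rmse(X, y, nsplits=5, num_repetitions=3)
```

Because the synthetic target is noise-free, the resulting score is close to zero; with real data, increasing num_repetitions mainly reduces the dependence of the score on one particular random partitioning.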
- alpha_CV(alpha_min=1e-07, alpha_max=1.0, num_alpha=10, scale='log', logfile=None, fitting_schemes=None)
Calculate CV for a given range of alpha.
In addition to calculating CV with respect to alpha, a logfile can be used to extend the range of alpha or to add more alpha values in a given range.
Returns a list of alpha values, and a list of CV scores.
Parameters:
- alpha_min: int or float
minimum value of regularization parameter alpha.
- alpha_max: int or float
maximum value of regularization parameter alpha.
- num_alpha: int
number of alpha values to be used in the plot.
- scale: str
‘log’(default): alpha values are evenly spaced on a log scale.
‘linear’: alpha values are evenly spaced on a linear scale.
- logfile: file object, str or None.
None: logging is disabled
str: a file with that name will be opened. If ‘-’, stdout used.
file object: use the file object for logging
- fitting_schemes: None or array of instances of LinearRegression.
- Note: If a file with the same name exists, the method first checks which alpha values are already present in the logfile and evaluates the CV only for the alpha values that are absent. The newly evaluated CVs are appended to the existing file.
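The two scale options can be pictured with NumPy's linspace and logspace. The sketch below assumes that both end points are included in the grid; the function name alpha_grid is made up for this example and is not a CLEASE API.

```python
import numpy as np


def alpha_grid(alpha_min=1e-7, alpha_max=1.0, num_alpha=10, scale="log"):
    """Return num_alpha regularization values between alpha_min and alpha_max.

    Hypothetical helper for illustration; not part of CLEASE.
    """
    if scale == "log":
        # Evenly spaced exponents, i.e. a constant ratio between neighbours
        return np.logspace(np.log10(alpha_min), np.log10(alpha_max), num_alpha)
    if scale == "linear":
        # Evenly spaced values, i.e. a constant difference between neighbours
        return np.linspace(alpha_min, alpha_max, num_alpha)
    raise ValueError("scale must be 'log' or 'linear'")


log_alphas = alpha_grid(1e-7, 1.0, num_alpha=8, scale="log")
lin_alphas = alpha_grid(0.0, 1.0, num_alpha=5, scale="linear")
```

A log scale is usually the more useful choice when alpha spans several orders of magnitude, since a linear grid would place almost all points near alpha_max.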
- property atomic_concentrations
The actual atomic concentration (including background lattices) normalised against the total number of atoms
- property concentrations
The internal concentrations normalised against the ‘active’ sublattices
- cv_for_alpha(alphas: List[float]) -> None
Calculate the CV scores for alphas using the fitting scheme specified in the Evaluate object.
- Parameters:
alphas – List of alpha values to get CV scores
- export_dataset(fname)
Export the dataset used to fit a model y = Xc, where y is typically the DFT energy per atom and c is the vector of unknown ECIs. This function exports the data to a csv file with the following format:
# ECIname_1, ECIname_2, …, ECIname_n, E_DFT
0.1, 0.4, …, -0.6, -2.0
0.3, 0.2, …, -0.9, -2.3
Thus each row in the file contains the correlation function values and the corresponding DFT energy value.
Parameter:
- fname: str
Filename to write to. Typically this should end with .csv
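A file in this layout can be read back with NumPy. The snippet below is a sketch that uses an in-memory string in place of a real export; the column names and numbers are invented for illustration, and it assumes the header is a single comment line starting with #.

```python
import io

import numpy as np

# Hypothetical file content in the layout described above (three ECI columns)
csv_text = """# c0, c1_0, c2_d0000_0_00, E_DFT
1.0, 0.4, -0.6, -2.0
1.0, 0.2, -0.9, -2.3
"""

with io.StringIO(csv_text) as fh:
    # First line: comment header holding the column names
    header = fh.readline().lstrip("#").strip()
    names = [name.strip() for name in header.split(",")]
    # Remaining lines: one structure per row
    data = np.loadtxt(fh, delimiter=",")

X = data[:, :-1]  # correlation function values
y = data[:, -1]   # target property (last column)
```

With X and y recovered this way, the linear model y = Xc can be fitted with any external regression tool.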
- fit() -> None
Determine the ECI with the given regressor.
This will always calculate a new fit.
- generalization_error(validation_id: List[int])
Estimate the generalization error to new datapoints.
- Parameters:
validation_id – List with IDs to leave out of the dataset
- get_eci() -> ndarray
Determine and return ECIs for a given alpha. Raises a ValueError if no fit has been performed yet.
- Returns:
A 1D array of floats with all ECI values.
- Return type:
np.ndarray
- get_eci_by_size() -> Dict[str, Dict[str, list]]
Classify distance, eci and cf_name according to cluster body size
- Returns:
Dictionary which contains
Key: body size of cluster
Value: A dictionary with the following entries:
”distance” : distance of the cluster
”eci” : eci of the cluster
”name” : name of the cluster
”radius” : Radius of the cluster in Ångstrom.
- get_eci_dict(cutoff_tol: float = 1e-14) -> Dict[str, float]
Determine cluster names and their corresponding ECI value and return them in a dictionary format.
- Parameters:
cutoff_tol (float, optional) – Cutoff value below which the absolute ECI value is considered to be 0. Defaults to 1e-14.
- Returns:
Dictionary with the CF names and the corresponding ECI value.
- Return type:
Dict[str, float]
- get_energy_predict(normalize: bool = True) -> ndarray
Perform matrix multiplication of eci and cf_matrix
- Returns:
Energy predicted using ECIs
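The matrix multiplication mentioned above amounts to one dot product per structure. A minimal sketch with invented numbers follows; the variable names mirror the description, but the values are not taken from any real fit and the effect of the normalize flag is not modelled.

```python
import numpy as np

# Hypothetical correlation functions: one row per structure, one column per cluster
cf_matrix = np.array([
    [1.0, 0.5, 0.25],
    [1.0, -0.5, 0.25],
])
# Hypothetical ECI values, in the same order as the columns of cf_matrix
eci = np.array([-2.0, 0.1, 0.04])

# Predicted property of each structure: sum over clusters of cf * eci
energy_predict = cf_matrix @ eci
```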
- load_eci(fname='eci.json') -> None
Read in ECI values stored in a json file.
Note: this doesn’t load the scheme or the alpha value, so it will not prevent a new fit from being performed if requested, as the loaded values may be incompatible with the current fitting scheme.
- load_eci_dict(eci_dict: Dict[str, float]) -> None
Load the ECI’s from a dictionary. Any ECI’s which are missing from the internal cf_names list are assumed to be 0.
Note: this doesn’t load the scheme or the alpha value, so it will not prevent a new fit from being performed if requested, as the loaded values may be incompatible with the current fitting scheme.
- loocv_fast()
CV score based on the method in J. Phase Equilib. 23, 348 (2002).
This method has a computational complexity of order n^1.
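For linear least squares there is a well-known shortcut that gives the leave-one-out score from a single fit, using the diagonal of the hat matrix H = X (X^T X)^-1 X^T. The sketch below compares it with the brute-force approach on synthetic data. It illustrates the general identity only; whether loocv_fast evaluates exactly this expression is an assumption, and both function names are invented for this example.

```python
import numpy as np


def loocv_brute_force(X, y):
    """Refit n times, each time leaving one data point out."""
    n = len(y)
    sq_err = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coeff, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        sq_err += (X[i] @ coeff - y[i]) ** 2
    return float(np.sqrt(sq_err / n))


def loocv_hat_matrix(X, y):
    """Same score from one fit: residual_i / (1 - H_ii)."""
    # For full column rank, pinv(X) equals (X^T X)^-1 X^T
    H = X @ np.linalg.pinv(X)
    residual = y - H @ y
    return float(np.sqrt(np.mean((residual / (1.0 - np.diag(H))) ** 2)))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=30)

cv_slow = loocv_brute_force(X, y)
cv_fast = loocv_hat_matrix(X, y)
```

The two numbers agree to floating-point precision, while the shortcut avoids refitting the model once per data point.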
- plot_CV(alpha_min=1e-07, alpha_max=1.0, num_alpha=10, scale='log', logfile=None, fitting_schemes=None, savefig=False, fname=None)
Plot CV for a given range of alpha.
In addition to plotting CV with respect to alpha, logfile can be used to extend the range of alpha or add more alpha values in a given range. Returns an alpha value that leads to the minimum CV score within the pool of evaluated alpha values.
Parameters:
- alpha_min: int or float
minimum value of regularization parameter alpha.
- alpha_max: int or float
maximum value of regularization parameter alpha.
- num_alpha: int
number of alpha values to be used in the plot.
- scale: str
‘log’(default): alpha values are evenly spaced on a log scale.
‘linear’: alpha values are evenly spaced on a linear scale.
- logfile: file object, str or None
None: logging is disabled
str: a file with that name will be opened. If ‘-’, stdout used.
file object: use the file object for logging
- fitting_schemes: None or array of instances of LinearRegression
- savefig: bool
True: Save the plot with the file name specified in ‘fname’. This option does not display the figure.
False: Display the figure without saving.
- fname: str
file name of the figure (only used when savefig = True)
- Note: If a file with the same name exists, the method first checks which alpha values are already present in the logfile and evaluates the CV only for the alpha values that are absent. The newly evaluated CVs are appended to the existing file.
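Selecting the alpha with the lowest CV score from a pool of candidates can be sketched as follows. The example uses a ridge fit on synthetic data and a leave-one-out score computed from the ridge hat matrix; the helper ridge_loocv and all numbers are assumptions made for illustration and do not reproduce the CLEASE internals.

```python
import numpy as np


def ridge_loocv(X, y, alpha):
    """Leave-one-out RMSE of a ridge fit (hypothetical helper, not CLEASE)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    # Ridge hat matrix: H = X (X^T X + alpha I)^-1 X^T
    H = X @ np.linalg.solve(A, X.T)
    residual = y - H @ y
    return float(np.sqrt(np.mean((residual / (1.0 - np.diag(H))) ** 2)))


rng = np.random.default_rng(1)
X = rng.normal(size=(25, 6))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=25)

# Pool of candidate alpha values, evenly spaced on a log scale
alphas = np.logspace(-7, 0, 10)
cv_scores = [ridge_loocv(X, y, a) for a in alphas]

# The alpha that gives the smallest CV score within the evaluated pool
best_alpha = float(alphas[int(np.argmin(cv_scores))])
```

Note that the result is only the best value within the evaluated pool; if the minimum sits at either end of the range, extending alpha_min or alpha_max is worth considering.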
- plot_ECI(ignore_sizes=(0,), interactive=True)
Plot all the ECIs.
Parameters:
- ignore_sizes: list of ints
Sizes listed in this list will not be plotted. Default is to ignore the empty cluster.
- interactive: bool
If True, one can interact with the plot using the mouse.
- plot_fit(interactive=False, savefig=False, fname=None, show_hull=True)
Plot calculated (DFT) and predicted energies for a given alpha.
Parameters:
- alpha: int or float
regularization parameter.
- savefig: bool
True: Save the plot with the file name specified in ‘fname’. Only works when interactive=False. This option does not display the figure.
False: Display the figure without saving.
- fname: str
file name of the figure (only used when savefig = True)
- show_hull: bool
whether or not to show convex hull.
- print_coverage_report(file=sys.stdout) -> None
Print a report of how large a fraction of the possible variation in each cluster is covered by the dataset.
- Parameters:
file – a file-like object (stream); defaults to the current sys.stdout.
- save_eci(fname='eci.json', **kwargs)
Save a dictionary of cluster names and their corresponding ECI value in JSON file format.
Parameters:
- fname: str
json filename. If no extension is given, .json is added
- kwargs:
Extra keywords are passed on to the get_eci_dict() method.
- set_normalization(normalization_symbols: Sequence[str] | None = None) -> None
Set the energy normalization factor, e.g. to normalize the final energy reports in energy per metal atom, rather than energy per atom (i.e. every atom).
- Parameters:
normalization_symbols – A list of symbols which should be included in the counting. If this is None, then the default of normalizing to energy per every atom is maintained.
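The idea of normalizing to a subset of atoms can be illustrated with a small counting exercise. This sketch assumes the energy is available per atom and rescales it to energy per counted atom; the helper normalization_factor and the example composition are hypothetical, and the actual bookkeeping inside CLEASE may differ.

```python
from collections import Counter


def normalization_factor(symbols, normalization_symbols=None):
    """Factor converting energy per atom into energy per counted atom.

    Hypothetical helper for illustration; not part of CLEASE.
    """
    if normalization_symbols is None:
        # Default: keep energy per atom (every atom is counted)
        return 1.0
    counts = Counter(symbols)
    n_counted = sum(counts[s] for s in normalization_symbols)
    if n_counted == 0:
        raise ValueError("None of the normalization symbols are present.")
    return len(symbols) / n_counted


# Invented example cell: 4 metal atoms and 4 oxygen atoms
symbols = ["Li", "Li", "Mn", "Mn", "O", "O", "O", "O"]
energy_per_atom = -5.0

# Energy per metal atom: total energy divided by the number of Li and Mn atoms
energy_per_metal = energy_per_atom * normalization_factor(symbols, ["Li", "Mn"])
```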
Fitting ECI’s to Non-Energy Properties
Note
It is currently only possible to fit to values stored as key-value pairs in the database, i.e. the property cannot yet be the default built-in fmax or a similar property. To get around this, store the desired property as a key-value pair with a (slightly) different name.
Note
The desired target property should be stored in the row belonging to the final structure.
It is possible to fit ECI’s to non-energy properties by using values stored as key-value pairs instead. To do this, use the prop keyword in the Evaluate class. As an example, say we already have a database of completed DFT calculations, and we want to fit to the average magnetic moment (why would you want to do that, you ask? In this case, for the sake of demonstration!).
Let’s assume that this quantity has not already been calculated from our database, so we first loop through our final structures, find the average magnetic moment, and insert that quantity back into the database as a key-value pair.
from ase.db import connect
import numpy as np
db = connect("clease.db") # We assume our database is called 'clease.db'
# Select all the final structures
for row in db.select(struct_type="final"):
    atoms = row.toatoms()
    avg_magmom = np.mean(atoms.get_magnetic_moments())
    # Insert the new quantity as a key-value pair.
    db.update(row.id, avg_magmom=avg_magmom)
Now we have calculated the average magnetic moment of all our final structures. We can do a fit on this new property with our Evaluate class, Evaluate(..., prop='avg_magmom'), and then proceed as normal.