Fitting ECIs

The Evaluate Class

class clease.evaluate.Evaluate(settings, prop='energy', cf_names=None, select_cond=None, parallel=False, num_core='all', fitting_scheme='ridge', alpha=1e-05, max_cluster_size=None, max_cluster_dia=None, scoring_scheme='loocv', min_weight=1.0, nsplits=10, num_repetitions=1, normalization_symbols: Sequence[str] | None = None)[source]

Evaluate RMSE/MAE of the fit and CV scores.

Parameters:
  • settings – ClusterExpansionSettings object

  • prop – str User defined property for the fit. The property should exist in database as key-value pairs. Default is energy.

  • cf_names – list Names of clusters to include in the evaluation. If None, all of the possible clusters are included.

  • select_cond – tuple or list of tuples (optional) Custom selection condition specified by user. Default only includes “converged=True” and “struct_type=’initial’”.

  • max_cluster_size – int maximum number of atoms in the cluster to include in the fit. If None, no restriction on the number of atoms will be imposed.

  • max_cluster_dia – float or int maximum diameter of the cluster (in angstrom) to include in the fit. If None, no restriction on the diameter. Note that this is the diameter of the circumscribed sphere, which is slightly different from the meaning of max_cluster_dia in ClusterExpansionSettings, where it refers to the maximum internal distance between any of the atoms in the cluster.

  • scoring_scheme – str should be one of ‘loocv’, ‘loocv_fast’ or ‘k-fold’

  • min_weight – float Weight given to the data point furthest away from any structure on the convex hull. An exponential weighting function is used and the decay rate is calculated as

    decay = log(min_weight)/min(sim_measure)

    where sim_measure is a similarity measure used to assess how different the structure is from structures on the convex hull.

  • nsplits – int Number of splits to use when partitioning the dataset into training and validation data. Only used when scoring_scheme=’k-fold’

  • num_repetitions – int Number of repetitions to use when calculating k-fold cross validation. The partitioning is repeated num_repetitions times and the resulting value is the average of the k-fold cross validation scores obtained in each of the runs.
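The decay rate given for min_weight can be sketched with NumPy as follows. This is only an illustration: the helper name hull_weights, the sample similarity values, and the functional form exp(decay * sim_measure) are assumptions and are not taken from the CLEASE source.

```python
import numpy as np


def hull_weights(sim_measure, min_weight):
    """Exponential weights where the least similar structure gets min_weight."""
    sim_measure = np.asarray(sim_measure, dtype=float)
    # Decay rate exactly as written in the min_weight description.
    decay = np.log(min_weight) / np.min(sim_measure)
    # ASSUMPTION: the weight of each structure is exp(decay * sim_measure).
    return np.exp(decay * sim_measure)


# Hypothetical similarity values: 0 on the hull, more negative further away.
sim = [0.0, -0.5, -2.0]
weights = hull_weights(sim, min_weight=0.1)
uniform = hull_weights(sim, min_weight=1.0)
```

With the default min_weight=1.0 the decay rate is log(1) = 0, so under this assumed form every structure receives the same weight.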

alpha_CV(alpha_min=1e-07, alpha_max=1.0, num_alpha=10, scale='log', logfile=None, fitting_schemes=None)[source]

Calculate CV for a given range of alpha.

In addition to calculating CV with respect to alpha, a logfile can be used to extend the range of alpha or to add more alpha values in a given range.

Returns a list of alpha values, and a list of CV scores.

Parameters:

alpha_min: int or float

minimum value of regularization parameter alpha.

alpha_max: int or float

maximum value of regularization parameter alpha.

num_alpha: int

number of alpha values to be used in the plot.

scale: str
  • ‘log’(default): alpha values are evenly spaced on a log scale.

  • ‘linear’: alpha values are evenly spaced on a linear scale.

logfile: file object, str or None.
  • None: logging is disabled

  • str: a file with that name will be opened. If ‘-’, stdout is used.

  • file object: use the file object for logging

fitting_schemes: None or array of instance of LinearRegression.

Note: If a file with the same name exists, it first checks if the alpha value already exists in the logfile and evaluates the CV of the alpha values that are absent. The newly evaluated CVs are appended to the existing file.
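As a rough illustration of the two scale options, the alpha values could be generated as below. The helper alpha_grid is hypothetical, and it is an assumption that both end points are included; the actual grid built by alpha_CV may differ.

```python
import numpy as np


def alpha_grid(alpha_min=1e-7, alpha_max=1.0, num_alpha=10, scale="log"):
    """Return num_alpha regularization values between alpha_min and alpha_max."""
    if scale == "log":
        # Evenly spaced exponents, i.e. evenly spaced on a log axis.
        return np.logspace(np.log10(alpha_min), np.log10(alpha_max), num_alpha)
    if scale == "linear":
        return np.linspace(alpha_min, alpha_max, num_alpha)
    raise ValueError("scale must be 'log' or 'linear'")


log_alphas = alpha_grid(num_alpha=8)
lin_alphas = alpha_grid(scale="linear", num_alpha=5)
```

A log scale is usually the more useful choice when alpha spans several orders of magnitude, as it does with the default range.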

property atomic_concentrations

The actual atomic concentration (including background lattices) normalised against the total number of atoms

property concentrations

The internal concentrations normalised against the ‘active’ sublattices

cv_for_alpha(alphas: List[float]) None[source]

Calculate the CV scores for alphas using the fitting scheme specified in the Evaluate object.

Parameters:

alphas – List of alpha values to get CV scores

export_dataset(fname)[source]

Export the dataset used to fit a model y = Xc where y is typically the DFT energy per atom and c is the unknown ECIs. This function exports the data to a csv file with the following format

# ECIname_1, ECIname_2, …, ECIname_n, E_DFT
0.1, 0.4, …, -0.6, -2.0
0.3, 0.2, …, -0.9, -2.3

thus each row in the file contains the correlation function values and the corresponding DFT energy value.

Parameter:

fname: str

Filename to write to. Typically this should end with .csv
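A file in the format described above can be read back with NumPy. The sketch below uses made-up in-memory contents with three correlation-function columns in place of a real exported file; the column names and values are placeholders.

```python
import io

import numpy as np

# Hypothetical file contents following the documented layout:
# a '#' header line, then one row per structure.
csv_text = """# ECIname_1, ECIname_2, ECIname_3, E_DFT
0.1, 0.4, -0.6, -2.0
0.3, 0.2, -0.9, -2.3
"""

# np.loadtxt skips lines starting with '#' by default.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
X = data[:, :-1]  # correlation function values
y = data[:, -1]   # target property (last column)
```

For a file on disk, the filename passed to export_dataset can be given to np.loadtxt instead of the StringIO object.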

fit() None[source]

Determine the ECI with the given regressor.

This will always calculate a new fit.
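For orientation, a ridge fit of the model y = Xc (the default fitting_scheme is 'ridge') can be written in a few lines of NumPy. This is a generic textbook sketch on synthetic data, not the regressor that CLEASE uses internally; ridge_fit and all values below are illustrative assumptions.

```python
import numpy as np


def ridge_fit(X, y, alpha=1e-5):
    """Solve min ||Xc - y||^2 + alpha * ||c||^2 for c."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


# Synthetic, noise-free data: 20 "structures", 4 "clusters".
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 4))
c_true = np.array([-2.0, 0.3, -0.1, 0.05])
y = X @ c_true

c_fit = ridge_fit(X, y, alpha=1e-8)
```

Larger alpha values shrink the coefficients more strongly, which is why the CV score is scanned as a function of alpha.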

fit_required() bool[source]

Check whether we need to calculate the ECI values.

generalization_error(validation_id: List[int])[source]

Estimate the generalization error to new datapoints

Parameters:

validation_id – List with IDs to leave out of the dataset

get_cv_score()[source]

Calculate the CV score according to the selected scheme

get_eci() ndarray[source]

Determine and return ECIs for a given alpha. Raises a ValueError if no fit has been performed yet.

Returns:

A 1D array of floats with all ECI values.

Return type:

np.ndarray

get_eci_by_size() Dict[str, Dict[str, list]][source]

Classify distance, eci and cf_name according to cluster body size

Returns:

Dictionary which contains

  • Key: body size of cluster

  • Value: A dictionary with the following entries:

    • ”distance” : distance of the cluster

    • ”eci” : eci of the cluster

    • ”name” : name of the cluster

    • ”radius” : Radius of the cluster in Ångstrom.

get_eci_dict(cutoff_tol: float = 1e-14) Dict[str, float][source]

Determine cluster names and their corresponding ECI value and return them in a dictionary format.

Parameters:

cutoff_tol (float, optional) – Cutoff value below which the absolute ECI value is considered to be 0. Defaults to 1e-14.

Returns:

Dictionary with the CF names and the corresponding ECI value.

Return type:

Dict[str, float]

get_energy_predict(normalize: bool = True) ndarray[source]

Perform matrix multiplication of eci and cf_matrix

Returns:

Energy predicted using ECIs

k_fold_cv()[source]

Determine the k-fold cross validation.
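The roles of nsplits and num_repetitions can be illustrated with a minimal k-fold loop. The sketch assumes a ridge model and an RMSE-type score pooled over all validation points; how CLEASE defines and scales its score is not specified here, so treat the function below as a stand-in.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def k_fold_cv(X, y, alpha=1e-5, nsplits=10, num_repetitions=1, seed=0):
    """Average k-fold score over num_repetitions random partitionings."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_repetitions):
        # Shuffle the indices and cut them into nsplits validation folds.
        folds = np.array_split(rng.permutation(len(y)), nsplits)
        sq_err = 0.0
        for val_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[val_idx] = False
            c = ridge_fit(X[train], y[train], alpha)
            sq_err += np.sum((X[val_idx] @ c - y[val_idx]) ** 2)
        scores.append(np.sqrt(sq_err / len(y)))
    return float(np.mean(scores))


# Synthetic data for demonstration only.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 3))
y_clean = X @ np.array([-2.0, 0.3, -0.1])
y_noisy = y_clean + rng.normal(scale=0.05, size=30)

score_clean = k_fold_cv(X, y_clean, nsplits=5, num_repetitions=3)
score_noisy = k_fold_cv(X, y_noisy, nsplits=5, num_repetitions=3)
```

Repeating the partitioning reduces the dependence of the score on one particular random split.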

load_eci(fname='eci.json') None[source]

Read in ECI values stored to a json file.

Note: this doesn’t load the scheme or the alpha value, so it will not prevent a new fit from being performed if requested, as it may be incompatible with the current fitting scheme.

load_eci_dict(eci_dict: Dict[str, float]) None[source]

Load the ECI’s from a dictionary. Any ECI’s which are missing from the internal cf_names list are assumed to be 0.

Note: this doesn’t load the scheme or the alpha value, so it will not prevent a new fit from being performed if requested, as it may be incompatible with the current fitting scheme.
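The rule that missing ECIs are taken to be 0 amounts to a dictionary lookup with a default. The cluster names and values below are invented for illustration, and the snippet only mimics the documented behaviour rather than reproducing the method itself.

```python
import numpy as np

# Hypothetical internal list of correlation function names.
cf_names = ["c0", "c1_0", "c2_d0000_0_00", "c2_d0001_0_00"]

# Dictionary that only contains some of the names.
eci_dict = {"c0": -2.1, "c2_d0000_0_00": 0.05}

# Names absent from eci_dict fall back to 0.0.
eci = np.array([eci_dict.get(name, 0.0) for name in cf_names])
```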

loocv()[source]

Determine the CV score for the Leave-One-Out case.

loocv_fast()[source]

CV score based on the method in J. Phase Equilib. 23, 348 (2002).

This method has a computational complexity of order n^1.
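The idea behind a fast leave-one-out score is that, for a linear least-squares or ridge model with fixed alpha, the left-out residuals follow from a single fit via the diagonal of the hat matrix. The sketch below compares that shortcut with an explicit refit per structure. It assumes a ridge model and an RMSE-type score on synthetic data; the function names are local to the example and do not call CLEASE.

```python
import numpy as np


def loocv_explicit(X, y, alpha):
    """Refit once per left-out data point."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        A = X[keep].T @ X[keep] + alpha * np.eye(X.shape[1])
        c = np.linalg.solve(A, X[keep].T @ y[keep])
        errors[i] = y[i] - X[i] @ c
    return np.sqrt(np.mean(errors**2))


def loocv_shortcut(X, y, alpha):
    """Single fit; left-out residual is e_i / (1 - H_ii)."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    H = X @ np.linalg.solve(A, X.T)  # hat matrix, y_pred = H @ y
    residuals = y - H @ y
    return np.sqrt(np.mean((residuals / (1.0 - np.diag(H))) ** 2))


rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(25, 4))
y = X @ np.array([-2.0, 0.3, -0.1, 0.05]) + rng.normal(scale=0.05, size=25)

cv_explicit = loocv_explicit(X, y, alpha=1e-3)
cv_shortcut = loocv_shortcut(X, y, alpha=1e-3)
```

Both functions give the same number for this model, but the shortcut avoids refitting once per data point.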

mae()[source]

Calculate mean absolute error (MAE) of the fit.

plot_CV(alpha_min=1e-07, alpha_max=1.0, num_alpha=10, scale='log', logfile=None, fitting_schemes=None, savefig=False, fname=None)[source]

Plot CV for a given range of alpha.

In addition to plotting CV with respect to alpha, logfile can be used to extend the range of alpha or add more alpha values in a given range. Returns an alpha value that leads to the minimum CV score within the pool of evaluated alpha values.

Parameters:

alpha_min: int or float

minimum value of regularization parameter alpha.

alpha_max: int or float

maximum value of regularization parameter alpha.

num_alpha: int

number of alpha values to be used in the plot.

scale: str
  • ‘log’(default): alpha values are evenly spaced on a log scale.

  • ‘linear’: alpha values are evenly spaced on a linear scale.

logfile: file object, str or None
  • None: logging is disabled

  • str: a file with that name will be opened. If ‘-’, stdout is used.

  • file object: use the file object for logging

fitting_schemes: None or array of instance of LinearRegression

savefig: bool
  • True: Save the plot with a file name specified in ‘fname’. This option does not display the figure.

  • False: Display figure without saving.

fname: str

file name of the figure (only used when savefig = True)

Note: If a file with the same name exists, it first checks if the alpha value already exists in the logfile and evaluates the CV of the alpha values that are absent. The newly evaluated CVs are appended to the existing file.

plot_ECI(ignore_sizes=(0,), interactive=True)[source]

Plot all the ECIs.

Parameters:

ignore_sizes: list of ints

Sizes listed in this list will not be plotted. Default is to ignore the empty cluster.

interactive: bool

If True, one can interact with the plot using mouse.

plot_fit(interactive=False, savefig=False, fname=None, show_hull=True)[source]

Plot calculated (DFT) and predicted energies for a given alpha.

Parameters:

alpha: int or float

regularization parameter.

savefig: bool
  • True: Save the plot with a file name specified in ‘fname’. Only works when interactive=False. This option does not display the figure.

  • False: Display figure without saving.

fname: str

file name of the figure (only used when savefig = True)

show_hull: bool

whether or not to show convex hull.

print_coverage_report(file=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None[source]

Prints a report of how large a fraction of the possible variation in each cluster is covered by the dataset

Parameters:

file – a file-like object (stream); defaults to the current sys.stdout.

rmse()[source]

Calculate root-mean-square error (RMSE) of the fit.
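For reference, the two error measures reported by rmse() and mae() are defined as follows for a set of reference and predicted values. The numbers are made up, and any unit conversion applied by CLEASE is not shown.

```python
import numpy as np

# Hypothetical reference (DFT) and predicted energies.
e_dft = np.array([-2.0, -2.3, -1.9, -2.1])
e_pred = np.array([-2.1, -2.2, -1.9, -2.3])

residuals = e_pred - e_dft
rmse = np.sqrt(np.mean(residuals**2))  # penalizes large deviations more
mae = np.mean(np.abs(residuals))       # average absolute deviation
```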

save_eci(fname='eci.json', **kwargs)[source]

Save a dictionary of cluster names and their corresponding ECI value in JSON file format.

Parameters:

fname: str

json filename. If no extension is given, .json is added

kwargs:

Extra keywords are passed on to the get_eci_dict() method.

set_normalization(normalization_symbols: Sequence[str] | None = None) None[source]

Set the energy normalization factor, e.g. to normalize the final energy reports in energy per metal atom, rather than energy per atom (i.e. every atom).

Parameters:

normalization_symbols – A list of symbols which should be included in the counting. If this is None, then the default of normalizing to energy per every atom is maintained.
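The effect of choosing normalization symbols can be pictured as changing the atom count used as the divisor. The helper below is a hypothetical stand-alone illustration with invented numbers and does not reproduce how Evaluate stores or applies the factor.

```python
from collections import Counter


def energy_per_counted_atom(total_energy, symbols, normalization_symbols=None):
    """Divide total_energy by the number of atoms that should be counted."""
    if normalization_symbols is None:
        n_counted = len(symbols)  # default: every atom counts
    else:
        counts = Counter(symbols)
        n_counted = sum(counts[s] for s in normalization_symbols)
    return total_energy / n_counted


# Hypothetical cell with 2 metal atoms and 4 oxygen atoms.
symbols = ["Ti", "Ti", "O", "O", "O", "O"]
per_atom = energy_per_counted_atom(-48.0, symbols)
per_metal = energy_per_counted_atom(-48.0, symbols, ["Ti"])
```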

Fitting ECI’s to Non-Energy Properties

Note

It is currently only possible to fit to values stored as key-value pairs in the database, i.e. it cannot be the default built-in fmax or similar properties, yet. To get around this, store the desired property as a key-value pair with a (slightly) different name.

Note

The desired target property should be stored in the row belonging to the final structure.

It is possible to fit ECI’s to non-energy properties, and instead use values stored as key-value pairs. To do this, use the prop keyword in the Evaluate class. As an example, say we already have a database of completed DFT calculations, and we want to fit to the average magnetic moment (why would you want to do that, you ask? In this case, for the sake of demonstration!).

Let’s assume that this quantity has not already been calculated from our database, so we first loop through our final structures, find the average magnetic moment, and insert that quantity back in the database as a key-value pair.

from ase.db import connect
import numpy as np

db = connect("clease.db")  # We assume our database is called 'clease.db'
# Select all the final structures
for row in db.select(struct_type="final"):
    atoms = row.toatoms()
    avg_magmom = np.mean(atoms.get_magnetic_moments())
    # Insert the new quantity as a key-value pair.
    db.update(row.id, avg_magmom=avg_magmom)

Now we have calculated the average magnetic moment of all our final structures. We can do a fit on this new property with our Evaluate class, Evaluate(..., prop='avg_magmom'), and then proceed as normal.