Si PRX GAP example
This example will be used to highlight some of the more advanced features of the
Dataset class using the popular Si GAP dataset.
It is suggested that you go through the basic example first. The complete
code will not be shown in this example (for the complete code, see the Jupyter
notebook at colabfit/examples/Si_PRX_GAP/si_prx_gap.ipynb); instead, only the
additional features will be discussed here.
Note that this example assumes that the raw data has already been downloaded using the following commands:
$ mkdir si_prx_gap
$ wget -O si_prx_gap/Si_PRX_GAP.zip 'https://www.repository.cam.ac.uk/bitstream/handle/1810/317974/Si_PRX_GAP.zip?sequence=1&isAllowed=y'
$ unzip si_prx_gap/Si_PRX_GAP.zip -d si_prx_gap
Loading from a file
This example uses load_data() to load the data from an existing Extended XYZ
file. Note that the raw data includes the config_type field, which is used to
generate the names of the loaded Configurations. A default_name is also
provided to handle the configurations that do not have a config_type field.
verbose=True is used here since the dataset is large enough to warrant a
progress bar.
# The result is named `images` because it is passed to insert_data() later on
images = load_data(
    file_path='./si_prx_gap/gp_iter6_sparse9k.xml.xyz',
    file_format='xyz',
    name_field='config_type',    # key in Configuration.info to use as the Configuration name
    elements=['Si'],
    default_name='Si_PRX_GAP',   # default name when `name_field` is not found
    verbose=True
)
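The interplay between name_field and default_name can be pictured with a small stand-alone sketch. The helper pick_name() below is hypothetical and is not part of colabfit; it only assumes that the name is read from the info dictionary when the key is present and that default_name is used otherwise. The two info dictionaries are made up for illustration.

```python
def pick_name(info, name_field, default_name):
    """Return info[name_field] if present, else default_name (hypothetical helper)."""
    return info.get(name_field, default_name)


# Two made-up info dictionaries: one with a config_type, one without.
with_type = {'config_type': 'dia'}
without_type = {'energy': -1.0}

names = [
    pick_name(with_type, 'config_type', 'Si_PRX_GAP'),
    pick_name(without_type, 'config_type', 'Si_PRX_GAP'),
]
print(names)
```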
Cleaning data
Some of the Configurations loaded in by load_data()
need to be cleaned before they
are ready to be used. Specifically:
1. The 'per-atom' field should be added to each configuration
2. Some fields are inconsistently named, using both '-' and '_'
3. Some fields need to be converted from strings to floats
4. Stress vectors should be reshaped to have size (3, 3)
We will address this by writing a function for modifying the Configurations in-place.
# Data stored on atoms needs to be cleaned
def tform(img):
    img.info['per-atom'] = False
    # Renaming some fields to be consistent
    info_items = list(img.info.items())
    for key, v in info_items:
        if key in ['_name', '_labels', '_constraints']:
            continue
        del img.info[key]
        img.info[key.replace('_', '-').lower()] = v
    arrays_items = list(img.arrays.items())
    for key, v in arrays_items:
        del img.arrays[key]
        img.arrays[key.replace('_', '-').lower()] = v
    # Converting some string values to floats
    for k in [
        'md-temperature', 'md-cell-t', 'smearing-width', 'md-delta-t',
        'md-ion-t', 'cut-off-energy', 'elec-energy-tol',
    ]:
        if k in img.info:
            try:
                img.info[k] = float(img.info[k].split(' ')[0])
            except (AttributeError, ValueError):
                # Leave values that are not strings or cannot be parsed
                pass
    # Reshaping shape (9,) stress vector to (3, 3) to match definition
    if 'dft-virial' in img.info:
        img.info['dft-virial'] = img.info['dft-virial'].reshape((3, 3))
    if 'gap-virial' in img.info:
        img.info['gap-virial'] = img.info['gap-virial'].reshape((3, 3))
The tform() function can be passed to insert_data() using the transform
argument, which will call tform() on each Configuration before doing any
additional processing.
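To see what the cleaning steps do without loading the dataset, the same logic can be tried on a plain dictionary. This is only a sketch: clean_info() is a hypothetical stand-in that mirrors the steps of tform(), it skips the special handling of the '_name', '_labels' and '_constraints' keys, it uses nested lists instead of arrays, and the input values are made up.

```python
def clean_info(info):
    """Return a cleaned copy of a plain dict, following the steps of tform()."""
    cleaned = {}
    for key, value in info.items():
        # Use '-' consistently and lower-case every key.
        cleaned[key.replace('_', '-').lower()] = value

    # Values such as '300 K' become the float 300.0.
    for key in ('md-temperature', 'md-delta-t'):
        if key in cleaned:
            try:
                cleaned[key] = float(cleaned[key].split(' ')[0])
            except (AttributeError, ValueError):
                pass  # keep values that are not strings or cannot be parsed

    # Row-major split of a flat 9-vector into 3 rows of 3 entries.
    if 'dft-virial' in cleaned:
        flat = list(cleaned['dft-virial'])
        cleaned['dft-virial'] = [flat[3 * i:3 * i + 3] for i in range(3)]
    return cleaned


raw = {
    'MD_temperature': '300 K',   # made-up values for illustration
    'md-delta-t': 1.0,
    'dft_virial': [1, 2, 3, 4, 5, 6, 7, 8, 9],
}
cleaned = clean_info(raw)
print(cleaned)
```

The row-major split matches the default ordering of a NumPy reshape((3, 3)) call, which is what tform() relies on.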
Handling different property settings
This Dataset contains the common energy/forces/virial data, but also includes a
large amount of additional data/information for each calculation which can be
stored as PropertySettings objects. This Dataset also has energy/forces/virial
data computed using multiple methods (DFT and a trained GAP model). In this
section we will discuss how to use the property_map argument with the
insert_data() function.
To begin with, we first write a property definition for storing computed energy/forces/virial data. Note that this same definition will be used for both the DFT-computed and the GAP-computed data.
base_definition = {
    'property-id': 'energy-forces-stress',
    'property-title': 'Basic outputs from a static calculation',
    'property-description':
        'Energy, forces, and stresses from a calculation of a '\
        'static configuration. Energies must be specified to be '\
        'per-atom or supercell. If a reference energy has been '\
        'used, this must be specified as well.',
    'energy': {
        'type': 'float',
        'has-unit': True,
        'extent': [],
        'required': False,
        'description':
            'The potential energy of the system.'
    },
    'forces': {
        'type': 'float',
        'has-unit': True,
        'extent': [":", 3],
        'required': False,
        'description':
            'The [x,y,z] components of the force on each particle.'
    },
    'stress': {
        'type': 'float',
        'has-unit': True,
        'extent': [3, 3],
        'required': False,
        'description':
            'The full Cauchy stress tensor of the simulation cell'
    },
    'per-atom': {
        'type': 'bool',
        'has-unit': False,
        'extent': [],
        'required': True,
        'description':
            'If True, "energy" is the total energy of the system, '\
            'and has NOT been divided by the number of atoms in the '\
            'configuration.'
    },
    'reference-energy': {
        'type': 'float',
        'has-unit': True,
        'extent': [],
        'required': False,
        'description':
            'If provided, then "energy" is the energy (either of '\
            'the whole system, or per-atom) LESS the energy of '\
            'a reference configuration (E = E_0 - E_reference). '\
            'Note that "reference-energy" is just provided for '\
            'documentation, and that "energy" should already have '\
            'this value subtracted off. The reference energy must '\
            'have the same units as "energy".'
    },
}
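The 'extent' entries describe the expected shape of each value. As a rough illustration, the sketch below checks nested lists against an extent, assuming that ':' stands for a dimension of arbitrary length (as the forces entry suggests) and that [] denotes a scalar. Both helpers are hypothetical and are not how colabfit validates data.

```python
def shape_of(value):
    """Return the shape of a rectangular nested list; [] for a scalar."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape


def matches_extent(value, extent):
    """Check a value against an extent such as [], [3, 3] or [':', 3]."""
    shape = shape_of(value)
    if len(shape) != len(extent):
        return False
    return all(e == ':' or e == n for e, n in zip(extent, shape))


# Made-up values for illustration.
energy = -1.5
forces = [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
flat_stress = [0.0] * 9
stress = [flat_stress[3 * i:3 * i + 3] for i in range(3)]

print(matches_extent(energy, []))
print(matches_extent(forces, [':', 3]))
print(matches_extent(flat_stress, [3, 3]))
print(matches_extent(stress, [3, 3]))
```

The flat 9-vector fails the [3, 3] check, which is why the virial fields are reshaped during cleaning.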
We will then prepare two separate maps, starting with one for loading any DFT-computed properties:
dft_map = {
    # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
    'energy': {'field': 'dft-energy', 'units': 'eV'},
    'forces': {'field': 'dft-force', 'units': 'eV/Ang'},
    'stress': {'field': 'dft-virial', 'units': 'GPa'},
    'per-atom': {'field': 'per-atom', 'units': None},
}
And a separate one for loading GAP-computed properties:
gap_map = {
    # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
    'energy': {'field': 'gap-energy', 'units': 'eV'},
    'forces': {'field': 'gap-force', 'units': 'eV/Ang'},
    'stress': {'field': 'gap-virial', 'units': 'GPa'},
    'per-atom': {'field': 'per-atom', 'units': None},
}
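The two maps use the same property definition but point at different fields of the same Configuration. The sketch below illustrates that idea on a plain dictionary. resolve_property() is a hypothetical helper, not the way insert_data() is implemented, and the energy values are made up.

```python
def resolve_property(info, pmap):
    """Collect the fields named in pmap from info (hypothetical helper)."""
    resolved = {}
    for name, spec in pmap.items():
        field = spec['field']
        if field in info:
            resolved[name] = {'value': info[field], 'units': spec['units']}
    return resolved


# Made-up cleaned info dictionary for a single configuration.
info = {'dft-energy': -163.2, 'gap-energy': -163.1, 'per-atom': False}

# Trimmed-down versions of the DFT and GAP maps.
dft_energy_map = {
    'energy': {'field': 'dft-energy', 'units': 'eV'},
    'per-atom': {'field': 'per-atom', 'units': None},
}
gap_energy_map = {
    'energy': {'field': 'gap-energy', 'units': 'eV'},
    'per-atom': {'field': 'per-atom', 'units': None},
}

dft_prop = resolve_property(info, dft_energy_map)
gap_prop = resolve_property(info, gap_energy_map)
print(dft_prop)
print(gap_prop)
```

One configuration therefore yields two property instances, one per method, which is the reason for keeping the maps separate.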
Next, we will create a list of all of the fields that should be stored on a PropertySettings object rather than on a Property:
settings_keys = [
    'mix-history-length',
    'castep-file-name',
    'grid-scale',
    'popn-calculate',
    'n-neighb',
    'oldpos',
    'i-step',
    'md-temperature',
    'positions',
    'task',
    'data-distribution',
    'avg-ke',
    'force-nlpot',
    'continuation',
    'castep-run-time',
    'calculate-stress',
    'minim-hydrostatic-strain',
    'avgpos',
    'frac-pos',
    'hamiltonian',
    'md-cell-t',
    'cutoff-factor',
    'momenta',
    'elec-energy-tol',
    'mixing-scheme',
    'minim-lattice-fix',
    'in-file',
    'travel',
    'thermostat-region',
    'time',
    'temperature',
    'kpoints-mp-grid',
    'cutoff',
    'xc-functional',
    'smearing-width',
    'pressure',
    'reuse',
    'fix-occupancy',
    'map-shift',
    'md-num-iter',
    'damp-mask',
    'opt-strategy',
    'spin-polarized',
    'nextra-bands',
    'fine-grid-scale',
    'masses',
    'iprint',
    'finite-basis-corr',
    'enthalpy',
    'opt-strategy-bias',
    'force-ewald',
    'num-dump-cycles',
    'velo',
    'md-delta-t',
    'md-ion-t',
    'force-locpot',
    'numbers',
    'max-scf-cycles',
    'mass',
    'minim-constant-volume',
    'cut-off-energy',
    'virial',
    'nneightol',
    'max-charge-amp',
    'md-thermostat',
    'md-ensemble',
    'acc',
]
We will also specify any units on the fields:
units = {
    'energy': 'eV',
    'forces': 'eV/Ang',
    'virial': 'GPa',
    'oldpos': 'Ang',
    'md-temperature': 'K',
    'positions': 'Ang',
    'avg-ke': 'eV',
    'force-nlpot': 'eV/Ang',
    'castep-run-time': 's',
    'avgpos': 'Ang',
    'md-cell-t': 'ps',
    'time': 's',
    'temperature': 'K',
    'gap-force': 'eV/Ang',
    'gap-energy': 'eV',
    'cutoff': 'Ang',
    'smearing-width': 'eV',
    'pressure': 'GPa',
    'gap-virial': 'GPa',
    'masses': '_amu',
    'enthalpy': 'eV',
    'force-ewald': 'eV/Ang',
    'velo': 'Ang/s',
    'md-delta-t': 'fs',
    'md-ion-t': 'ps',
    'force-locpot': 'eV/Ang',
    'mass': 'g',
    'cut-off-energy': 'eV',
}
We will also create dictionaries for constructing the DFT settings:
dft_settings_map = {
    # Fields without an entry in `units` get None
    k: {'field': k, 'units': units.get(k)} for k in settings_keys
}
dft_settings_map['_method'] = 'CASTEP'
dft_settings_map['_description'] = 'DFT calculations using the CASTEP software'
dft_settings_map['_files'] = None
dft_settings_map['_labels'] = ['Monkhorst-Pack']
And the GAP settings:
gap_settings_map = dict(dft_settings_map)
gap_settings_map['_method'] = 'GAP'
gap_settings_map['_description'] = 'Predictions using a trained GAP potential'
gap_settings_map['_files'] = None
gap_settings_map['_labels'] = ['GAP', 'classical']
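Copying the DFT settings map with dict() produces a shallow copy. The stand-alone sketch below, which uses a trimmed-down map, shows what that means in practice: reassigning top-level keys such as '_method' leaves the original untouched, while the nested per-field dictionaries are shared between the two maps.

```python
# Trimmed-down settings map for illustration.
dft_settings_map = {
    'cutoff': {'field': 'cutoff', 'units': 'Ang'},
    '_method': 'CASTEP',
}

# dict(...) builds a new outer dictionary that refers to the same inner objects.
gap_settings_map = dict(dft_settings_map)
gap_settings_map['_method'] = 'GAP'

# Reassigning a top-level key in the copy does not affect the original.
print(dft_settings_map['_method'])

# The nested dictionaries are the very same objects in both maps.
print(gap_settings_map['cutoff'] is dft_settings_map['cutoff'])
```

This is sufficient here because only top-level keys are overwritten. If the nested entries ever needed to differ between the two maps, copy.deepcopy() would give fully independent copies.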
Each of these settings maps will be attached to its corresponding property map:
dft_map['_settings'] = dft_settings_map
gap_map['_settings'] = gap_settings_map
Finally, they will both be merged into a single map, which will be passed
directly to insert_data():
property_map = {
    'energy-forces-stress': [
        dft_map,
        gap_map,
    ]
}
ids = client.insert_data(
    images,
    property_map=property_map,
    transform=tform,
    verbose=True
)
Manually constructed ConfigurationSets
Since this dataset was manually constructed by its authors, a large amount of additional information has been provided to better identify the Configurations (see Table I. in the original paper). In order to retain this information, we define ConfigurationSets by regex matching on the Configuration names (see Building configuration sets for more details).
configuration_set_regexes = {
    'isolated_atom': 'Reference atom',
    'bt': 'Beta-tin',
    'dia': 'Diamond',
    'sh': 'Simple hexagonal',
    'hex_diamond': 'Hexagonal diamond',
    'bcc': 'Body-centered-cubic',
    'bc8': 'BC8',
    'fcc': 'Face-centered-cubic',
    'hcp': 'Hexagonal-close-packed',
    'st12': 'ST12',
    'liq': 'Liquid',
    'amorph': 'Amorphous',
    'surface_001': 'Diamond surface (001)',
    'surface_110': 'Diamond surface (110)',
    'surface_111': 'Diamond surface (111)',
    'surface_111_pandey': 'Pandey reconstruction of diamond (111) surface',
    'surface_111_3x3_das': 'Dimer-adatom-stacking-fault (DAS) reconstruction',
    '111adatom': 'Configurations with adatom on (111) surface',
    'crack_110_1-10': 'Small (110) crack tip',
    'crack_111_1-10': 'Small (111) crack tip',
    'decohesion': 'Decohesion of diamond-structure Si along various directions',
    'divacancy': 'Diamond divacancy configurations',
    'interstitial': 'Diamond interstitial configurations',
    'screw_disloc': 'Si screw dislocation core',
    'sp': 'sp bonded configurations',
    'sp2': 'sp2 bonded configurations',
    'vacancy': 'Diamond vacancy configurations'
}
cs_ids = []
for i, (regex, desc) in enumerate(configuration_set_regexes.items()):
    co_ids = client.get_data(
        'configurations',
        fields='_id',
        query={'names': {'$regex': regex}},
        ravel=True
    ).tolist()
    print(f'Configuration set {i}', f'({regex}):'.rjust(22), f'{len(co_ids)}'.rjust(7))
    cs_id = client.insert_configuration_set(co_ids, description=desc, verbose=True)
    cs_ids.append(cs_id)
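Because the patterns are not anchored, a short pattern can also match longer names. The sketch below illustrates this with Python's re.search(), which, like MongoDB's $regex operator, matches anywhere in the string. The list of names is an assumption made for illustration (it reuses some of the pattern keys above), so the overlaps in the real dataset may differ.

```python
import re

# Illustrative names only; they reuse some of the pattern keys above.
names = ['sp', 'sp2', 'vacancy', 'divacancy', 'surface_111', 'surface_111_pandey']


def matching(pattern):
    """Names matched by an unanchored search."""
    return [n for n in names if re.search(pattern, n)]


loose_sp = matching('sp')            # also picks up 'sp2'
strict_sp = matching('^sp$')         # anchored: exact match only
loose_vacancy = matching('vacancy')  # also picks up 'divacancy'

print(loose_sp)
print(strict_sp)
print(loose_vacancy)
```

If overlapping ConfigurationSets are not wanted, anchoring the pattern with ^ and $ is one way to restrict a set to exact name matches.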
Manually applied Configuration labels
Similarly, additional knowledge provided by the authors about the types of Configurations and Properties in the dataset can be used to apply metadata labels to the Configurations, which makes the data easier for future users to query. See Applying configuration labels for more details.
The labels below are applied to the Configurations based on this author-provided information.
configuration_label_regexes = {
    'isolated_atom': 'isolated_atom',
    'bt': 'a5',
    'dia': 'diamond',
    'sh': 'sh',
    'hex_diamond': 'lonsdaleite',
    'bcc': 'bcc',
    'bc8': 'bc8',
    'fcc': 'fcc',
    'hcp': 'hcp',
    'st12': 'st12',
    'liq': 'liquid',
    'amorph': 'amorphous',
    'surface_001': ['surface', '001'],
    'surface_110': ['surface', '110'],
    'surface_111': ['surface', '111'],
    'surface_111_pandey': ['surface', '111'],
    'surface_111_3x3_das': ['surface', '111', 'das'],
    '111adatom': ['surface', '111', 'adatom'],
    'crack_110_1-10': ['crack', '110'],
    'crack_111_1-10': ['crack', '111'],
    'decohesion': ['diamond', 'decohesion'],
    'divacancy': ['diamond', 'vacancy', 'divacancy'],
    'interstitial': ['diamond', 'interstitial'],
    'screw_disloc': ['screw', 'dislocation'],
    'sp': 'sp',
    'sp2': 'sp2',
    'vacancy': ['diamond', 'vacancy']
}
for regex, labels in configuration_label_regexes.items():
    client.apply_labels(
        dataset_id=ds_id,
        collection_name='configurations',
        query={'names': {'$regex': regex}},
        labels=labels,
        verbose=True
    )
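Since several patterns can match the same Configuration name, a Configuration may collect labels from more than one entry. The sketch below works out the combined labels for a name, assuming that labels from successive calls accumulate rather than replace each other and that a bare string counts as a single label. labels_for() is a hypothetical helper that operates on plain strings, not on the database.

```python
import re

# A subset of the label patterns above.
label_regexes = {
    'divacancy': ['diamond', 'vacancy', 'divacancy'],
    'vacancy': ['diamond', 'vacancy'],
    'sp': 'sp',
    'sp2': 'sp2',
}


def labels_for(name):
    """Union of the labels from every pattern that matches name."""
    collected = set()
    for pattern, labels in label_regexes.items():
        if re.search(pattern, name):
            # A bare string is treated as a single label.
            collected.update([labels] if isinstance(labels, str) else labels)
    return sorted(collected)


divacancy_labels = labels_for('divacancy')
sp2_labels = labels_for('sp2')
sp_labels = labels_for('sp')

print(divacancy_labels)
print(sp2_labels)
print(sp_labels)
```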