Si PRX GAP example

This example highlights some of the more advanced features of the Dataset class using the popular Si GAP dataset. We suggest working through the basic example first. Only the additional features are discussed here; for the complete code, see the Jupyter notebook at colabfit/examples/Si_PRX_GAP/si_prx_gap.ipynb.

Note that this example assumes that the raw data has already been downloaded using the following commands:

$ mkdir si_prx_gap
$ cd si_prx_gap && wget -O Si_PRX_GAP.zip "https://www.repository.cam.ac.uk/bitstream/handle/1810/317974/Si_PRX_GAP.zip?sequence=1&isAllowed=y"
$ unzip Si_PRX_GAP.zip && cd ..

Loading from a file

This example uses load_data() to load the data from an existing Extended XYZ file. Note that the raw data includes the config_type field, which is used to generate the names of the loaded Configurations. A default_name is also provided to handle the configurations that do not have a config_type field. verbose=True is used here since the dataset is large enough to warrant a progress bar.

images = load_data(
        file_path='./si_prx_gap/gp_iter6_sparse9k.xml.xyz',
        file_format='xyz',
        name_field='config_type',  # key in Configuration.info to use as the Configuration name
        elements=['Si'],
        default_name='Si_PRX_GAP',  # default name used when `name_field` is not found
        verbose=True
)
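
To make the roles of name_field and default_name concrete, here is a minimal sketch of the naming rule described above. It works on plain dictionaries rather than real Configuration objects. The helper pick_name() and the example values are hypothetical and are not part of ColabFit; the sketch only mirrors the behaviour described in the text.

```python
def pick_name(info, name_field='config_type', default_name='Si_PRX_GAP'):
    """Return info[name_field] if present, otherwise fall back to default_name."""
    return info.get(name_field, default_name)


# Two made-up info dictionaries: one with a config_type, one without.
example_infos = [
    {'config_type': 'dia', 'dft_energy': -1305.4},
    {'dft_energy': -163.2},
]

names = [pick_name(info) for info in example_infos]
print(names)
```

The second entry has no config_type, so it receives the default name.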

Cleaning data

Some of the Configurations loaded in by load_data() need to be cleaned before they are ready to be used. Specifically:

1. The ‘per-atom’ field should be added to each configuration
2. Some fields are inconsistently named, using both ‘-’ and ‘_’
3. Some fields need to be converted from strings to floats
4. Stress vectors should be reshaped to have size (3, 3)

We will address this by writing a function for modifying the Configurations in-place.

# Data stored on atoms needs to be cleaned
def tform(img):
        img.info['per-atom'] = False

        # Renaming some fields to be consistent
        info_items = list(img.info.items())

        for key, v in info_items:
                if key in ['_name', '_labels', '_constraints']:
                        continue

                del img.info[key]
                img.info[key.replace('_', '-').lower()] = v

        arrays_items = list(img.arrays.items())
        for key, v in arrays_items:
                del img.arrays[key]
                img.arrays[key.replace('_', '-').lower()] = v

        # Converting some string values to floats
        for k in [
                'md-temperature', 'md-cell-t', 'smearing-width', 'md-delta-t',
                'md-ion-t', 'cut-off-energy', 'elec-energy-tol',
                ]:
                if k in img.info:
                        try:
                                img.info[k] = float(img.info[k].split(' ')[0])
                        except (AttributeError, ValueError):
                                # Leave values that are not 'number unit' strings unchanged
                                pass

        # Reshaping shape (9,) stress vector to (3, 3) to match definition
        if 'dft-virial' in img.info:
                img.info['dft-virial'] = img.info['dft-virial'].reshape((3,3))

        if 'gap-virial' in img.info:
                img.info['gap-virial'] = img.info['gap-virial'].reshape((3,3))

The tform() function can be passed to insert_data() using the transform argument, which will call tform() on each Configuration before doing any additional processing.
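
The renaming and string-to-float steps in tform() can be tried in isolation on a plain dictionary. The sketch below is only an illustration: the helper names normalize_key() and leading_float() are hypothetical, and the raw values are made up in the 'number unit' form that the split(' ') call in tform() suggests.

```python
# Keys (after renaming) whose values should be converted to floats.
FLOAT_KEYS = {'md-temperature', 'cut-off-energy'}


def normalize_key(key):
    """Replace underscores with hyphens and lower-case the key."""
    return key.replace('_', '-').lower()


def leading_float(value):
    """Parse the first space-separated token of a string as a float.

    Values that are not strings, or that do not start with a number,
    are returned unchanged.
    """
    try:
        return float(value.split(' ')[0])
    except (AttributeError, ValueError):
        return value


raw = {'MD_temperature': '300.0 K', 'cut_off_energy': '250 eV', 'task': 'md'}

clean = {}
for key, value in raw.items():
    new_key = normalize_key(key)
    clean[new_key] = leading_float(value) if new_key in FLOAT_KEYS else value

print(clean)
```

Catching only AttributeError and ValueError keeps unexpected failures visible while still tolerating values that are already numeric or carry no leading number.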

Handling different property settings

This Dataset contains the common energy/forces/virial data, but also includes a large amount of additional data/information for each calculation which can be stored as PropertySettings objects. This Dataset also has energy/forces/virial data computed using multiple methods (DFT and a trained GAP model). In this section we will discuss how to use the property_map argument of the insert_data() function.

To begin with, we first write a property definition for storing computed energy/forces/virial data. Note that this same definition will be used for both the DFT-computed and the GAP-computed data.

base_definition = {
        'property-id': 'energy-forces-stress',
        'property-title': 'Basic outputs from a static calculation',
        'property-description':
                'Energy, forces, and stresses from a calculation of a '\
                'static configuration. Energies must be specified to be '\
                'per-atom or supercell. If a reference energy has been '\
                'used, this must be specified as well.',

        'energy': {
                'type': 'float',
                'has-unit': True,
                'extent': [],
                'required': False,
                'description':
                        'The potential energy of the system.'
        },
        'forces': {
                'type': 'float',
                'has-unit': True,
                'extent': [":", 3],
                'required': False,
                'description':
                        'The [x,y,z] components of the force on each particle.'
        },
        'stress': {
                'type': 'float',
                'has-unit': True,
                'extent': [3, 3],
                'required': False,
                'description':
                        'The full Cauchy stress tensor of the simulation cell'
        },

        'per-atom': {
                'type': 'bool',
                'has-unit': False,
                'extent': [],
                'required': True,
                'description':
                        'If True, "energy" is the per-atom energy of the system, '\
                        'and has been divided by the number of atoms in the '\
                        'configuration.'
        },
        'reference-energy': {
                'type': 'float',
                'has-unit': True,
                'extent': [],
                'required': False,
                'description':
                        'If provided, then "energy" is the energy (either of '\
                        'the whole system, or per-atom) LESS the energy of '\
                        'a reference configuration (E = E_0 - E_reference). '\
                        'Note that "reference-energy" is just provided for '\
                        'documentation, and that "energy" should already have '\
                        'this value subtracted off. The reference energy must '\
                        'have the same units as "energy".'
        },
}
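
The 'extent' entries in the definition describe the expected shape of each field. As a rough illustration, the sketch below checks nested Python lists against an extent, assuming that ":" stands for a dimension of arbitrary length (here, the number of atoms). The helpers shape_of() and matches_extent() are hypothetical and are not how ColabFit validates data.

```python
def shape_of(value):
    """Return the shape of a scalar or a regular nested list as a list of ints."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape


def matches_extent(value, extent):
    """Check a value against an extent such as [], [":", 3], or [3, 3]."""
    shape = shape_of(value)
    if len(shape) != len(extent):
        return False
    return all(e == ':' or e == s for e, s in zip(extent, shape))


energy = -163.2                                # scalar, extent []
forces = [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]   # two atoms, extent [":", 3]
flat_stress = [0.0] * 9                        # shape (9,), does not fit [3, 3]

print(matches_extent(energy, []))
print(matches_extent(forces, [':', 3]))
print(matches_extent(flat_stress, [3, 3]))
```

The last check fails, which is why tform() reshapes the length-9 virial vectors to (3, 3) before they are mapped to the 'stress' field.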

We will then prepare two separate maps. The first loads any DFT-computed properties:

dft_map = {
        # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
        'energy': {'field': 'dft-energy', 'units': 'eV'},
        'forces': {'field': 'dft-force',  'units': 'eV/Ang'},
        'stress': {'field': 'dft-virial', 'units': 'GPa'},
        'per-atom': {'field': 'per-atom', 'units': None},
}

The second loads the GAP-computed properties:

gap_map = {
        # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
        'energy': {'field': 'gap-energy', 'units': 'eV'},
        'forces': {'field': 'gap-force',  'units': 'eV/Ang'},
        'stress': {'field': 'gap-virial', 'units': 'GPa'},
        'per-atom': {'field': 'per-atom', 'units': None},
}
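
Because both maps target the same property definition and differ only in the source fields they read, one configuration can yield two property instances. The sketch below illustrates that idea on a plain dictionary. The function apply_map() and the numeric values are hypothetical; this is not ColabFit's implementation of insert_data().

```python
def apply_map(info, pmap):
    """Collect the fields named in pmap from info, skipping missing ones."""
    out = {}
    for prop_field, spec in pmap.items():
        if spec['field'] in info:
            out[prop_field] = {'value': info[spec['field']], 'units': spec['units']}
    return out


example_dft_map = {
    'energy': {'field': 'dft-energy', 'units': 'eV'},
    'per-atom': {'field': 'per-atom', 'units': None},
}
example_gap_map = {
    'energy': {'field': 'gap-energy', 'units': 'eV'},
    'per-atom': {'field': 'per-atom', 'units': None},
}

# Made-up values for a single cleaned configuration.
example_info = {'dft-energy': -163.2, 'gap-energy': -163.1, 'per-atom': False}

dft_result = apply_map(example_info, example_dft_map)
gap_result = apply_map(example_info, example_gap_map)
print(dft_result)
print(gap_result)
```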

Next, we will create a list of all of the fields that should be stored on a PropertySettings object rather than on a Property:

settings_keys = [
        'mix-history-length',
        'castep-file-name',
        'grid-scale',
        'popn-calculate',
        'n-neighb',
        'oldpos',
        'i-step',
        'md-temperature',
        'positions',
        'task',
        'data-distribution',
        'avg-ke',
        'force-nlpot',
        'continuation',
        'castep-run-time',
        'calculate-stress',
        'minim-hydrostatic-strain',
        'avgpos',
        'frac-pos',
        'hamiltonian',
        'md-cell-t',
        'cutoff-factor',
        'momenta',
        'elec-energy-tol',
        'mixing-scheme',
        'minim-lattice-fix',
        'in-file',
        'travel',
        'thermostat-region',
        'time',
        'temperature',
        'kpoints-mp-grid',
        'cutoff',
        'xc-functional',
        'smearing-width',
        'pressure',
        'reuse',
        'fix-occupancy',
        'map-shift',
        'md-num-iter',
        'damp-mask',
        'opt-strategy',
        'spin-polarized',
        'nextra-bands',
        'fine-grid-scale',
        'masses',
        'iprint',
        'finite-basis-corr',
        'enthalpy',
        'opt-strategy-bias',
        'force-ewald',
        'num-dump-cycles',
        'velo',
        'md-delta-t',
        'md-ion-t',
        'force-locpot',
        'numbers',
        'max-scf-cycles',
        'mass',
        'minim-constant-volume',
        'cut-off-energy',
        'virial',
        'nneightol',
        'max-charge-amp',
        'md-thermostat',
        'md-ensemble',
        'acc',
]

We will also specify any units on the fields:

units = {
        'energy': 'eV',
        'forces': 'eV/Ang',
        'virial': 'GPa',
        'oldpos': 'Ang',
        'md-temperature': 'K',
        'positions': 'Ang',
        'avg-ke': 'eV',
        'force-nlpot': 'eV/Ang',
        'castep-run-time': 's',
        'avgpos': 'Ang',
        'md-cell-t': 'ps',
        'time': 's',
        'temperature': 'K',
        'gap-force': 'eV/Ang',
        'gap-energy': 'eV',
        'cutoff': 'Ang',
        'smearing-width': 'eV',
        'pressure': 'GPa',
        'gap-virial': 'GPa',
        'masses': '_amu',
        'enthalpy': 'eV',
        'force-ewald': 'eV/Ang',
        'velo': 'Ang/s',
        'md-delta-t': 'fs',
        'md-ion-t': 'ps',
        'force-locpot': 'eV/Ang',
        'mass': 'g',
        'cut-off-energy': 'eV',
}

We will also create dictionaries for constructing the DFT settings:

dft_settings_map = {
        k: {'field': k, 'units': units.get(k)} for k in settings_keys
}

dft_settings_map['_method'] = 'CASTEP'
dft_settings_map['_description'] = 'DFT calculations using the CASTEP software'
dft_settings_map['_files'] = None
dft_settings_map['_labels'] = ['Monkhorst-Pack']

And the GAP settings:

gap_settings_map = dict(dft_settings_map)

gap_settings_map['_method'] = 'GAP'
gap_settings_map['_description'] = 'Predictions using a trained GAP potential'
gap_settings_map['_files'] = None
gap_settings_map['_labels'] = ['GAP', 'classical']
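
Note that dict(dft_settings_map) makes a shallow copy. This is sufficient here because only top-level keys such as '_method' are reassigned. The small sketch below, using made-up two-entry versions of the dictionaries, shows what is and is not shared between the two maps.

```python
example_units = {'cutoff': 'Ang'}
example_keys = ['cutoff', 'task']

dft_settings = {
    k: {'field': k, 'units': example_units.get(k)} for k in example_keys
}
dft_settings['_method'] = 'CASTEP'

gap_settings = dict(dft_settings)   # shallow copy of the top-level dictionary
gap_settings['_method'] = 'GAP'     # rebinds a key in the copy only

print(dft_settings['_method'])      # the original is unchanged
print(gap_settings['_method'])

# The nested per-field dictionaries are the same objects in both maps.
print(gap_settings['cutoff'] is dft_settings['cutoff'])
```

If the per-field entries ever needed to differ between the two maps, copy.deepcopy() from the standard library would avoid modifying the shared nested dictionaries.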

Each of these settings maps will be attached to its corresponding property map:

dft_map['_settings'] = dft_settings_map
gap_map['_settings'] = gap_settings_map

Finally, they will both be merged into a single map, which will be passed directly to insert_data():

property_map = {
        'energy-forces-stress': [
                dft_map,
                gap_map,
        ]
}

ids = client.insert_data(
        images,
        property_map=property_map,
        transform=tform,
        verbose=True
)

Manually constructed ConfigurationSets

Since this dataset was manually constructed by its authors, a large amount of additional information has been provided to better identify the Configurations (see Table I in the original paper). To retain this information, we define ConfigurationSets by regex matching on the Configuration names (see Building configuration sets for more details).

configuration_set_regexes = {
    'isolated_atom': 'Reference atom',
    'bt': 'Beta-tin',
    'dia': 'Diamond',
    'sh': 'Simple hexagonal',
    'hex_diamond': 'Hexagonal diamond',
    'bcc': 'Body-centered-cubic',
    'bc8': 'BC8',
    'fcc': 'Face-centered-cubic',
    'hcp': 'Hexagonal-close-packed',
    'st12': 'ST12',
    'liq': 'Liquid',
    'amorph': 'Amorphous',
    'surface_001': 'Diamond surface (001)',
    'surface_110': 'Diamond surface (110)',
    'surface_111': 'Diamond surface (111)',
    'surface_111_pandey': 'Pandey reconstruction of diamond (111) surface',
    'surface_111_3x3_das': 'Dimer-adatom-stacking-fault (DAS) reconstruction',
    '111adatom': 'Configurations with adatom on (111) surface',
    'crack_110_1-10': 'Small (110) crack tip',
    'crack_111_1-10': 'Small (111) crack tip',
    'decohesion': 'Decohesion of diamond-structure Si along various directions',
    'divacancy': 'Diamond divacancy configurations',
    'interstitial': 'Diamond interstitial configurations',
    'screw_disloc': 'Si screw dislocation core',
    'sp': 'sp bonded configurations',
    'sp2': 'sp2 bonded configurations',
    'vacancy': 'Diamond vacancy configurations'
}

cs_ids = []

for i, (regex, desc) in enumerate(configuration_set_regexes.items()):
        co_ids = client.get_data(
                'configurations',
                fields='_id',
                query={'names': {'$regex': regex}},
                ravel=True
        ).tolist()

        print(f'Configuration set {i}', f'({regex}):'.rjust(22), f'{len(co_ids)}'.rjust(7))

        cs_id = client.insert_configuration_set(co_ids, description=desc, verbose=True)

        cs_ids.append(cs_id)
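
A MongoDB $regex query matches anywhere in the string unless the pattern is anchored, so some of the short patterns above also select names belonging to other groups. The sketch below reproduces this with Python's re.search() on a hypothetical list of names. Whether the overlap is intended for this dataset is not stated here; the example only shows how to check for it.

```python
import re

# Hypothetical Configuration names, taken from the keys of the dictionary above.
example_names = ['sp', 'sp2', 'dia', 'hex_diamond', 'vacancy', 'divacancy']


def matching(pattern, names):
    """Return the names that contain a match for pattern (unanchored search)."""
    return [name for name in names if re.search(pattern, name)]


print(matching('sp', example_names))       # also selects 'sp2'
print(matching('dia', example_names))      # also selects 'hex_diamond'
print(matching('vacancy', example_names))  # also selects 'divacancy'
print(matching('^sp$', example_names))     # anchored: exact match only
```

Anchoring a pattern with ^ and $ restricts it to exact matches if overlapping sets are not wanted.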

Manually applied Configuration labels

Similarly, additional knowledge provided by the authors about the types of Configurations and Properties in the dataset can be used to apply metadata labels to the Configurations, which helps future users query the data. See Applying configuration labels for more details.

configuration_label_regexes = {
    'isolated_atom': 'isolated_atom',
    'bt': 'a5',
    'dia': 'diamond',
    'sh': 'sh',
    'hex_diamond': 'lonsdaleite',
    'bcc': 'bcc',
    'bc8': 'bc8',
    'fcc': 'fcc',
    'hcp': 'hcp',
    'st12': 'st12',
    'liq': 'liquid',
    'amorph': 'amorphous',
    'surface_001': ['surface', '001'],
    'surface_110': ['surface', '110'],
    'surface_111': ['surface', '111'],
    'surface_111_pandey': ['surface', '111'],
    'surface_111_3x3_das': ['surface', '111', 'das'],
    '111adatom': ['surface', '111', 'adatom'],
    'crack_110_1-10': ['crack', '110'],
    'crack_111_1-10': ['crack', '111'],
    'decohesion': ['diamond', 'decohesion'],
    'divacancy': ['diamond', 'vacancy', 'divacancy'],
    'interstitial': ['diamond', 'interstitial'],
    'screw_disloc': ['screw', 'dislocation'],
    'sp': 'sp',
    'sp2': 'sp2',
    'vacancy': ['diamond', 'vacancy']
}

for regex, labels in configuration_label_regexes.items():
    client.apply_labels(
        dataset_id=ds_id,
        collection_name='configurations',
        query={'names': {'$regex': regex}},
        labels=labels,
        verbose=True
    )
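
The values in the dictionary above are either a single string or a list of strings. When experimenting with such a mapping outside the database, it helps to normalize both forms to a list first. The sketch below does this and accumulates labels for a few hypothetical names; as_label_list() is an illustrative helper and not part of ColabFit.

```python
import re


def as_label_list(labels):
    """Wrap a single string label in a list; copy any other iterable to a list."""
    return [labels] if isinstance(labels, str) else list(labels)


example_label_regexes = {
    'surface_111': ['surface', '111'],
    'surface_111_pandey': ['surface', '111'],
    'vacancy': ['diamond', 'vacancy'],
    'sp2': 'sp2',
}
example_names = ['surface_111_pandey', 'vacancy', 'sp2']

# Collect labels per name; a set avoids duplicates from overlapping patterns.
collected = {name: set() for name in example_names}
for pattern, labels in example_label_regexes.items():
    for name in example_names:
        if re.search(pattern, name):
            collected[name].update(as_label_list(labels))

for name, labels in collected.items():
    print(name, sorted(labels))
```

Without the normalization step, calling set.update() with the bare string 'sp2' would add the individual characters rather than the label itself.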