==================
Si PRX GAP example
==================

This example highlights some of the more advanced features of the Dataset class
using the popular `Si GAP dataset `_. It is suggested that you go through the
:ref:`basic example ` first.

Only the additional features are discussed here. For the complete code, see the
Jupyter notebook at :code:`colabfit/examples/Si_PRX_GAP/si_prx_gap.ipynb`.

Note that this example assumes that the raw data has already been downloaded
using the following commands:

.. code-block:: console

    $ mkdir si_prx_gap
    $ cd si_prx_gap
    $ wget -O Si_PRX_GAP.zip "https://www.repository.cam.ac.uk/bitstream/handle/1810/317974/Si_PRX_GAP.zip?sequence=1&isAllowed=y"
    $ unzip Si_PRX_GAP.zip
    $ cd ..

Loading from a file
===================

This example uses :meth:`~colabfit.tools.database.load_data` to load the data
from an existing Extended XYZ file. Note that the raw data includes the
:code:`config_type` field, which is used to generate the names of the loaded
Configurations. A :attr:`default_name` is also provided to handle the
configurations that do not have a :code:`config_type` field.
:code:`verbose=True` is used here since the dataset is large enough to warrant a
progress bar.

.. code-block:: python

    from colabfit.tools.database import load_data

    # `images` is passed to insert_data() later in this example
    images = load_data(
        file_path='./si_prx_gap/gp_iter6_sparse9k.xml.xyz',
        file_format='xyz',
        name_field='config_type',  # key in Configuration.info to use as the Configuration name
        elements=['Si'],
        default_name='Si_PRX_GAP',  # default name when `name_field` is not found
        verbose=True
    )

Cleaning data
=============

Some of the Configurations loaded in by :meth:`load_data` need to be cleaned
before they are ready to be used. Specifically:

1. The :code:`'per-atom'` field should be added to each configuration
2. Some fields are inconsistently named, using both :code:`'-'` and :code:`'_'`
3. Some fields need to be converted from strings to floats
4. Stress vectors should be reshaped to have shape :code:`(3, 3)`

We will address this by writing a function that modifies the Configurations
in-place.

.. code-block:: python

    # Data stored on atoms needs to be cleaned
    def tform(img):
        img.info['per-atom'] = False

        # Renaming some fields to be consistent
        info_items = list(img.info.items())
        for key, v in info_items:
            if key in ['_name', '_labels', '_constraints']:
                continue

            del img.info[key]
            img.info[key.replace('_', '-').lower()] = v

        arrays_items = list(img.arrays.items())
        for key, v in arrays_items:
            del img.arrays[key]
            img.arrays[key.replace('_', '-').lower()] = v

        # Converting some string values to floats
        for k in [
            'md-temperature', 'md-cell-t', 'smearing-width', 'md-delta-t',
            'md-ion-t', 'cut-off-energy', 'elec-energy-tol',
        ]:
            if k in img.info:
                try:
                    img.info[k] = float(img.info[k].split(' ')[0])
                except (AttributeError, ValueError):
                    # Value is not a string or cannot be parsed; leave it as-is
                    pass

        # Reshaping shape (9,) stress vector to (3, 3) to match definition
        if 'dft-virial' in img.info:
            img.info['dft-virial'] = img.info['dft-virial'].reshape((3, 3))

        if 'gap-virial' in img.info:
            img.info['gap-virial'] = img.info['gap-virial'].reshape((3, 3))

The :meth:`tform` function can be passed to :meth:`insert_data` using the
:code:`transform` argument, which will call :meth:`tform` on each Configuration
before doing any additional processing.

Handling different property settings
====================================

This Dataset contains the common energy/forces/virial data, but also includes a
large amount of additional information for each calculation, which can be stored
as PropertySettings objects. This Dataset also has energy/forces/virial data
computed using multiple methods (DFT and a trained GAP model). In this section
we will discuss how to use the :code:`property_map` argument with the
:meth:`insert_data` function.

To begin with, we first write a property definition for storing computed
energy/forces/virial data.
Note that this same definition will be used for both the DFT-computed and the
GAP-computed data.

.. code-block:: python

    base_definition = {
        'property-id': 'energy-forces-stress',
        'property-title': 'Basic outputs from a static calculation',
        'property-description':
            'Energy, forces, and stresses from a calculation of a '\
            'static configuration. Energies must be specified to be '\
            'per-atom or supercell. If a reference energy has been '\
            'used, this must be specified as well.',

        'energy': {
            'type': 'float',
            'has-unit': True,
            'extent': [],
            'required': False,
            'description': 'The potential energy of the system.'
        },
        'forces': {
            'type': 'float',
            'has-unit': True,
            'extent': [":", 3],
            'required': False,
            'description': 'The [x,y,z] components of the force on each particle.'
        },
        'stress': {
            'type': 'float',
            'has-unit': True,
            'extent': [3, 3],
            'required': False,
            'description': 'The full Cauchy stress tensor of the simulation cell'
        },

        'per-atom': {
            'type': 'bool',
            'has-unit': False,
            'extent': [],
            'required': True,
            'description':
                'If True, "energy" is a per-atom energy, and has already '\
                'been divided by the number of atoms in the '\
                'configuration. If False, "energy" is the total energy '\
                'of the system.'
        },
        'reference-energy': {
            'type': 'float',
            'has-unit': True,
            'extent': [],
            'required': False,
            'description':
                'If provided, then "energy" is the energy (either of '\
                'the whole system, or per-atom) LESS the energy of '\
                'a reference configuration (E = E_0 - E_reference). '\
                'Note that "reference-energy" is just provided for '\
                'documentation, and that "energy" should already have '\
                'this value subtracted off. The reference energy must '\
                'have the same units as "energy".'
        },
    }

We will then prepare two separate maps. The first is for loading any
DFT-computed properties:

.. code-block:: python

    dft_map = {
        # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
        'energy': {'field': 'dft-energy', 'units': 'eV'},
        'forces': {'field': 'dft-force', 'units': 'eV/Ang'},
        'stress': {'field': 'dft-virial', 'units': 'GPa'},
        'per-atom': {'field': 'per-atom', 'units': None},
    }

The second is for loading GAP-computed properties:

.. code-block:: python

    gap_map = {
        # Property Definition field: {'field': ASE field, 'units': ASE-readable units}
        'energy': {'field': 'gap-energy', 'units': 'eV'},
        'forces': {'field': 'gap-force', 'units': 'eV/Ang'},
        'stress': {'field': 'gap-virial', 'units': 'GPa'},
        'per-atom': {'field': 'per-atom', 'units': None},
    }

Next, we will create a list of all of the fields that should be stored on a
PropertySettings object rather than on a Property:

.. code-block:: python

    settings_keys = [
        'mix-history-length',
        'castep-file-name',
        'grid-scale',
        'popn-calculate',
        'n-neighb',
        'oldpos',
        'i-step',
        'md-temperature',
        'positions',
        'task',
        'data-distribution',
        'avg-ke',
        'force-nlpot',
        'continuation',
        'castep-run-time',
        'calculate-stress',
        'minim-hydrostatic-strain',
        'avgpos',
        'frac-pos',
        'hamiltonian',
        'md-cell-t',
        'cutoff-factor',
        'momenta',
        'elec-energy-tol',
        'mixing-scheme',
        'minim-lattice-fix',
        'in-file',
        'travel',
        'thermostat-region',
        'time',
        'temperature',
        'kpoints-mp-grid',
        'cutoff',
        'xc-functional',
        'smearing-width',
        'pressure',
        'reuse',
        'fix-occupancy',
        'map-shift',
        'md-num-iter',
        'damp-mask',
        'opt-strategy',
        'spin-polarized',
        'nextra-bands',
        'fine-grid-scale',
        'masses',
        'iprint',
        'finite-basis-corr',
        'enthalpy',
        'opt-strategy-bias',
        'force-ewald',
        'num-dump-cycles',
        'velo',
        'md-delta-t',
        'md-ion-t',
        'force-locpot',
        'numbers',
        'max-scf-cycles',
        'mass',
        'minim-constant-volume',
        'cut-off-energy',
        'virial',
        'nneightol',
        'max-charge-amp',
        'md-thermostat',
        'md-ensemble',
        'acc',
    ]

We will also specify the units of the fields that have them:

.. code-block:: python

    units = {
        'energy': 'eV',
        'forces': 'eV/Ang',
        'virial': 'GPa',
        'oldpos': 'Ang',
        'md-temperature': 'K',
        'positions': 'Ang',
        'avg-ke': 'eV',
        'force-nlpot': 'eV/Ang',
        'castep-run-time': 's',
        'avgpos': 'Ang',
        'md-cell-t': 'ps',
        'time': 's',
        'temperature': 'K',
        'gap-force': 'eV/Ang',
        'gap-energy': 'eV',
        'cutoff': 'Ang',
        'smearing-width': 'eV',
        'pressure': 'GPa',
        'gap-virial': 'GPa',
        'masses': '_amu',
        'enthalpy': 'eV',
        'force-ewald': 'eV/Ang',
        'velo': 'Ang/s',
        'md-delta-t': 'fs',
        'md-ion-t': 'ps',
        'force-locpot': 'eV/Ang',
        'mass': 'g',
        'cut-off-energy': 'eV',
    }

We will also create dictionaries for constructing the DFT settings:

.. code-block:: python

    dft_settings_map = {
        k: {'field': k, 'units': units.get(k)} for k in settings_keys
    }

    dft_settings_map['_method'] = 'CASTEP'
    dft_settings_map['_description'] = 'DFT calculations using the CASTEP software'
    dft_settings_map['_files'] = None
    dft_settings_map['_labels'] = ['Monkhorst-Pack']

And the GAP settings:

.. code-block:: python

    gap_settings_map = dict(dft_settings_map)

    gap_settings_map['_method'] = 'GAP'
    gap_settings_map['_description'] = 'Predictions using a trained GAP potential'
    gap_settings_map['_files'] = None
    gap_settings_map['_labels'] = ['GAP', 'classical']

Each of these settings maps will be attached to its corresponding property map:

.. code-block:: python

    dft_map['_settings'] = dft_settings_map
    gap_map['_settings'] = gap_settings_map

Finally, they will both be merged into a single map, which will be passed
directly to :meth:`insert_data`:

.. code-block:: python

    property_map = {
        'energy-forces-stress': [
            dft_map,
            gap_map,
        ]
    }

    ids = client.insert_data(
        images,
        property_map=property_map,
        transform=tform,
        verbose=True
    )

Manually constructed ConfigurationSets
======================================

Since this dataset was manually constructed by its authors, a large amount of
additional information has been provided to better identify the Configurations
(see Table I. in `the original paper `_).

In order to retain this information, we define ConfigurationSets by regex
matching on the Configuration names (see :ref:`Building configuration sets` for
more details).

.. code-block:: python

    configuration_set_regexes = {
        'isolated_atom': 'Reference atom',
        'bt': 'Beta-tin',
        'dia': 'Diamond',
        'sh': 'Simple hexagonal',
        'hex_diamond': 'Hexagonal diamond',
        'bcc': 'Body-centered-cubic',
        'bc8': 'BC8',
        'fcc': 'Face-centered-cubic',
        'hcp': 'Hexagonal-close-packed',
        'st12': 'ST12',
        'liq': 'Liquid',
        'amorph': 'Amorphous',
        'surface_001': 'Diamond surface (001)',
        'surface_110': 'Diamond surface (110)',
        'surface_111': 'Diamond surface (111)',
        'surface_111_pandey': 'Pandey reconstruction of diamond (111) surface',
        'surface_111_3x3_das': 'Dimer-adatom-stacking-fault (DAS) reconstruction',
        '111adatom': 'Configurations with adatom on (111) surface',
        'crack_110_1-10': 'Small (110) crack tip',
        'crack_111_1-10': 'Small (111) crack tip',
        'decohesion': 'Decohesion of diamond-structure Si along various directions',
        'divacancy': 'Diamond divacancy configurations',
        'interstitial': 'Diamond interstitial configurations',
        'screw_disloc': 'Si screw dislocation core',
        'sp': 'sp bonded configurations',
        'sp2': 'sp2 bonded configurations',
        'vacancy': 'Diamond vacancy configurations'
    }

.. code-block:: python

    cs_ids = []

    for i, (regex, desc) in enumerate(configuration_set_regexes.items()):
        co_ids = client.get_data(
            'configurations',
            fields='_id',
            query={'names': {'$regex': regex}},
            ravel=True
        ).tolist()

        print(f'Configuration set {i}', f'({regex}):'.rjust(22), f'{len(co_ids)}'.rjust(7))

        cs_id = client.insert_configuration_set(co_ids, description=desc, verbose=True)

        cs_ids.append(cs_id)

Manually applied Configuration labels
=====================================

Similarly, additional knowledge provided by the authors about the types of
Configurations and Properties in the dataset can be used to apply metadata
labels to the Configurations, which makes the data easier for future users to
query. See :ref:`Applying configuration labels` for more details.

The labels below are applied to the Configurations based on author-provided
information.

.. code-block:: python

    configuration_label_regexes = {
        'isolated_atom': 'isolated_atom',
        'bt': 'a5',
        'dia': 'diamond',
        'sh': 'sh',
        'hex_diamond': 'lonsdaleite',
        'bcc': 'bcc',
        'bc8': 'bc8',
        'fcc': 'fcc',
        'hcp': 'hcp',
        'st12': 'st12',
        'liq': 'liquid',
        'amorph': 'amorphous',
        'surface_001': ['surface', '001'],
        'surface_110': ['surface', '110'],
        'surface_111': ['surface', '111'],
        'surface_111_pandey': ['surface', '111'],
        'surface_111_3x3_das': ['surface', '111', 'das'],
        '111adatom': ['surface', '111', 'adatom'],
        'crack_110_1-10': ['crack', '110'],
        'crack_111_1-10': ['crack', '111'],
        'decohesion': ['diamond', 'decohesion'],
        'divacancy': ['diamond', 'vacancy', 'divacancy'],
        'interstitial': ['diamond', 'interstitial'],
        'screw_disloc': ['screw', 'dislocation'],
        'sp': 'sp',
        'sp2': 'sp2',
        'vacancy': ['diamond', 'vacancy']
    }

.. code-block:: python

    for regex, labels in configuration_label_regexes.items():
        client.apply_labels(
            dataset_id=ds_id,
            collection_name='configurations',
            query={'names': {'$regex': regex}},
            labels=labels,
            verbose=True
        )

.. Filtering based on XC-functional
.. ================================

.. In the Si GAP dataset, some of the data was computed using a PBE functional,
.. and some was computed using a PW91 functional. This information is stored in the
.. :code:`xc_functional` field of the :attr:`Configuration.info` array.

.. .. code-block:: python

..     set(
..         client.get_data(
..             'properties',
..             'si-prx-gap-data.xc-functional',
..             ravel=True
..         )
..     )

..     # Output: {'PBE', 'PW91'}

.. A user may want to only work with subsets of the data that were computed with
.. the exact same DFT settings. To facilitate this, we break the original Dataset
.. into three separate datasets using the
.. :meth:`~colabfit.tools.dataset.Dataset.filter` function (see :ref:`Filtering a
.. Dataset` for more details).

.. .. code-block:: python

..     no_xc_config_sets, no_xc_pr_ids = client.filter_on_properties(
..         ds_id,
..         query={'si-prx-gap-data.xc-functional.source-value': {'$exists': False}},
..     )

..     new_cs_ids = []
..     for cs in no_xc_config_sets:
..         new_cs_ids.append(client.insert_configuration_set(cs.configuration_ids, cs.description, verbose=True))

..     no_xc_ds_id = client.insert_dataset(
..         cs_ids=new_cs_ids,
..         pr_ids=no_xc_pr_ids,
..         name='Si_PRX_GAP-no-xc',
..         authors=dataset.authors,
..         links=dataset.links,
..         description="A subset of the Si_PRX_GAP dataset that only contains data without a specified XC functional",
..         resync=True,
..         verbose=True,
..     )

.. .. code-block:: python

..     pbe_config_sets, pbe_pr_ids = client.filter_on_properties(
..         ds_id,
..         query={'si-prx-gap-data.xc-functional.source-value': 'PBE'},
..     )

..     new_cs_ids = []
..     for cs in pbe_config_sets:
..         if cs.configuration_ids:
..             new_cs_ids.append(client.insert_configuration_set(cs.configuration_ids, cs.description, verbose=True))

..     pbe_ds_id = client.insert_dataset(
..         cs_ids=new_cs_ids,
..         pr_ids=pbe_pr_ids,
..         name='Si_PRX_GAP-pbe',
..         authors=dataset.authors,
..         links=dataset.links,
..         description="A subset of the Si_PRX_GAP dataset that only contains data computed using the PBE XC functional",
..         resync=True,
..         verbose=True,
..     )

.. .. code-block:: python

..     pw91_config_sets, pw91_pr_ids = client.filter_on_properties(
..         ds_id,
..         query={'si-prx-gap-data.xc-functional.source-value': 'PW91'},
..     )

..     new_cs_ids = []
..     for cs in pw91_config_sets:
..         if cs.configuration_ids:
..             new_cs_ids.append(client.insert_configuration_set(cs.configuration_ids, cs.description, verbose=True))

..     pw91_ds_id = client.insert_dataset(
..         cs_ids=new_cs_ids,
..         pr_ids=pw91_pr_ids,
..         name='Si_PRX_GAP-pw91',
..         authors=dataset.authors,
..         links=dataset.links,
..         description="A subset of the Si_PRX_GAP dataset that only contains data computed using the PW91 XC functional",
..         resync=True,
..         verbose=True,
..     )
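
The :code:`$regex` queries used for the ConfigurationSets and labels in this
example are not anchored, so a short pattern may also match longer Configuration
names. The sketch below illustrates this with Python's :code:`re` module. It
assumes that an unanchored :code:`$regex` query behaves like
:code:`re.search`, and that the Configuration names are exactly the
:code:`config_type` strings; the list of names and both helper functions are
hypothetical and are not part of the colabfit API.

.. code-block:: python

    import re

    # Hypothetical Configuration names, taken from the regex keys in this example
    names = [
        'sp', 'sp2', 'vacancy', 'divacancy',
        'surface_111', 'surface_111_pandey',
    ]

    def matching_names(pattern, names):
        """Return the names matched by an unanchored regex search."""
        return [n for n in names if re.search(pattern, n)]

    def matching_names_exact(pattern, names):
        """Return only the names that equal the (escaped) pattern."""
        anchored_pattern = f'^{re.escape(pattern)}$'
        return [n for n in names if re.search(anchored_pattern, n)]

    patterns = ['sp', 'vacancy', 'surface_111']

    unanchored = {p: matching_names(p, names) for p in patterns}
    anchored = {p: matching_names_exact(p, names) for p in patterns}

    for p in patterns:
        print(p, unanchored[p], anchored[p])

With these names, the unanchored :code:`'sp'` pattern also selects
:code:`'sp2'`, and :code:`'vacancy'` also selects :code:`'divacancy'`, while the
anchored variant selects only the exact name. For the labels this overlap may
well be intended (a divacancy Configuration also receiving the
:code:`'vacancy'` label), but for ConfigurationSets it is worth checking the
printed counts against the expected values.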