Basic example

This example corresponds to the colabfit/examples/basic_example.ipynb Jupyter notebook in the GitHub repo, which can be run in Google Colab.

Initialize the database

The MongoDatabase opens a connection to a running Mongo server and attaches to a database with the given name.

from colabfit.tools.database import MongoDatabase

client = MongoDatabase('colabfit_database')

Attaching a property definition

Property definitions are data structures used to concretely define the material properties stored in the Database. Note that the property-id field must be unique across all property definitions in the Database.

Ideally, an existing Property Definition from the OpenKIM Property Definition list should be used by passing the “Property Definition ID” listed on a Property Definition page to the insert_property_definition() function. For example:

client.insert_property_definition(
    'tag:staff@noreply.openkim.org,2014-04-15:property/bulk-modulus-isothermal-cubic-crystal-npt'
)

However, if none of the existing properties seem appropriate, a custom property definition can be provided by passing a valid dictionary instead. See the OpenKIM documentation for more details on how to write a valid property definition.

client.insert_property_definition({
    'property-id': 'energy-forces',
    'property-title': 'A default property for storing energies and forces',
    'property-description': 'Energies and forces computed using DFT',
    'energy': {'type': 'float', 'has-unit': True, 'extent': [], 'required': True, 'description': 'Cohesive energy'},
    'forces': {'type': 'float', 'has-unit': True, 'extent': [':',3], 'required': True, 'description': 'Atomic forces'},
})
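As a quick illustration of the extent notation used above (this helper is hypothetical, not part of the colabfit API): an extent of [] describes a scalar, while [':', 3] describes an array with any number of rows and exactly three columns.

```python
import numpy as np

# Illustration only: check whether a value's shape matches an OpenKIM-style
# 'extent' declaration, where ':' means "any length" along that dimension.
def matches_extent(value, extent):
    shape = np.shape(value)
    if len(shape) != len(extent):
        return False
    return all(e == ':' or s == e for s, e in zip(shape, extent))

energy = 1.23                   # extent [] -> a scalar
forces = np.zeros((5, 3))       # extent [':', 3] -> an (N, 3) array

print(matches_extent(energy, []))         # True
print(matches_extent(forces, [':', 3]))   # True
print(matches_extent(forces, [':', 2]))   # False
```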

Adding Data

Configurations and Properties can be inserted into a database by passing a list of Configurations and a property map (used for parsing the data) to insert_data(). Note that the corresponding property definition must already have been attached to the Database (see above).

Loading Configurations

Load the configurations by either manually constructing the Configurations or using load_data(), which calls a pre-made BaseConverter and returns a list of Configuration objects.

# Manually

import numpy as np
from ase import Atoms

images = []
for i in range(1, 1000):
    atoms = Atoms('H'*i, positions=np.random.random((i, 3)))

    atoms.info['_name'] = 'configuration_' + str(i)

    atoms.info['dft_energy'] = i*i
    atoms.arrays['dft_forces'] = np.random.normal(size=(i, 3))

    images.append(atoms)

Note that when using load_data(), the file_format must be specified and a name_field (or None) should be provided to specify the names of the loaded configurations.

from ase.io import write

write('/tmp/example.extxyz', images)
from colabfit.tools.database import load_data

images = load_data(
    file_path='/tmp/example.extxyz',
    file_format='xyz',
    name_field='_name',
    elements=['H'],
    default_name=None,
)

Defining a property_map

A property map is used to specify how to parse a Property instance from a Configuration. Below, we define a property map that extracts the ‘energy’ and ‘forces’ keys of the ‘energy-forces’ property defined above from the ‘dft_energy’ and ‘dft_forces’ fields in the info and arrays attributes of a given Configuration.

property_map = {
    # property name
    'energy-forces': {
        # property field: {'field': configuration info/arrays field, 'units': field units}
        'energy': {'field': 'dft_energy', 'units': 'eV'},
        'forces': {'field': 'dft_forces', 'units': 'eV/Ang'},
    }
}
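To make the mapping concrete, here is a plain-Python sketch of how a property map connects Configuration fields to property fields. The extract() helper and the dictionary stand-in for a Configuration are hypothetical; they only illustrate the lookup performed during parsing, not the client's internal implementation.

```python
import numpy as np

# A stand-in for a Configuration's info/arrays attributes
config = {
    'info':   {'dft_energy': 4.0},
    'arrays': {'dft_forces': np.zeros((2, 3))},
}

property_map = {
    'energy-forces': {
        'energy': {'field': 'dft_energy', 'units': 'eV'},
        'forces': {'field': 'dft_forces', 'units': 'eV/Ang'},
    }
}

# Hypothetical extraction: look up each mapped field in info, then in arrays
def extract(config, field_map):
    out = {}
    for prop_field, spec in field_map.items():
        src = spec['field']
        value = config['info'].get(src, config['arrays'].get(src))
        out[prop_field] = {'source-value': value, 'source-unit': spec['units']}
    return out

parsed = extract(config, property_map['energy-forces'])
print(parsed['energy'])  # {'source-value': 4.0, 'source-unit': 'eV'}
```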

Inserting the data

insert_data() takes in a list of Configurations and adds each Configuration into the 'configurations' collection of the Database. It also uses property_map to parse the Properties from each Configuration and add them into the 'properties' collection. insert_data() will return a list of tuples of (<configuration_id>, <property_id>), which can be useful for accessing and manipulating the new data.

from colabfit.tools.property_settings import PropertySettings

ids = list(client.insert_data(
    images,
    property_map=property_map,
    property_settings={
        'energy-forces': PropertySettings(
                            method='VASP',
                            description='A basic VASP calculation',
                            files=None,
                            labels=['PBE', 'GGA'],
                        ),
    },
    generator=False,
    verbose=True
))

all_co_ids, all_pr_ids = list(zip(*ids))

Note that the optional property_settings argument used above specifies a dictionary of PropertySettings objects, which attach additional metadata to the Properties being loaded.

Creating a ConfigurationSet

A ConfigurationSet can be used to create groups of configurations for organizational purposes.

First, use get_data() to extract the _id fields for all of the configurations with fewer than 100 atoms. A Mongo query is passed in as the query argument (see Mongo usage for more details).

co_ids = client.get_data(
    'configurations',
    fields='_id',
    query={'_id': {'$in': all_co_ids}, 'nsites': {'$lt': 100}},
    ravel=True
).tolist()
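The query above combines the Mongo operators $in (membership in the given list of IDs) and $lt (less than). Purely as an illustration of those operator semantics (an in-memory sketch with made-up documents, not a client call), the equivalent filter looks like:

```python
# Hypothetical documents standing in for the 'configurations' collection
docs = [
    {'_id': 'co_1', 'nsites': 10},
    {'_id': 'co_2', 'nsites': 250},
    {'_id': 'co_3', 'nsites': 99},
    {'_id': 'co_4', 'nsites': 100},
]
all_co_ids = ['co_1', 'co_2', 'co_3']

# Equivalent of query={'_id': {'$in': all_co_ids}, 'nsites': {'$lt': 100}}
co_ids = [d['_id'] for d in docs
          if d['_id'] in all_co_ids and d['nsites'] < 100]

print(co_ids)  # ['co_1', 'co_3']
```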

Then use insert_configuration_set() to add the ConfigurationSet into the Database, specifying the list of Configuration IDs to include, and a description of the ConfigurationSet.

cs_id = client.insert_configuration_set(
    co_ids,
    description='Configurations with fewer than 100 atoms'
)

Note that insert_configuration_set() returns the ID of the inserted ConfigurationSet, which can be used to retrieve the newly-added ConfigurationSet:

cs = client.get_configuration_set(cs_id)['configuration_set']

print(cs.description)

A ConfigurationSet aggregates some key information from its linked Configurations upon insertion.

for k,v in cs.aggregated_info.items():
    print(k, v)

Creating a Dataset

A Dataset can be constructed by providing a list of ConfigurationSet and Property IDs to insert_dataset().

First, define two ConfigurationSets:

co_ids1 = client.get_data(
    'configurations',
    fields='_id',
    query={'_id': {'$in': all_co_ids}, 'nsites': {'$lt': 100}},
    ravel=True
).tolist()

co_ids2 = client.get_data(
    'configurations',
    fields='_id',
    query={'_id': {'$in': all_co_ids}, 'nsites': {'$gte': 100}},
    ravel=True
).tolist()

# Note: CS IDs depend upon the description, so cs_id1 will not match cs_id
# from above
cs_id1 = client.insert_configuration_set(co_ids1, 'Small configurations')
cs_id2 = client.insert_configuration_set(co_ids2, 'Big configurations')

Then extract the Property IDs that are linked to the given Configurations.

pr_ids = client.get_data(
    'properties',
    fields='_id',
    query={
        'relationships.configurations': {'$elemMatch': {'$in': co_ids1+co_ids2}}
    },
    ravel=True
).tolist()

Finally, add the Dataset into the Database

ds_id = client.insert_dataset(
    cs_ids=[cs_id1, cs_id2],
    pr_ids=pr_ids,
    name='basic_example',
    authors=['ColabFit User'],
    links=['https://colabfit.openkim.org/'],
    description="This is an example dataset",
    resync=True
)

Just as a ConfigurationSet aggregates information from a list of Configurations, a Dataset aggregates information from a list of ConfigurationSets and a list of Properties. Note the use of resync=True. This ensures that the ConfigurationSets re-aggregate all of their data from their linked Configurations before the Dataset aggregates information from the ConfigurationSets.

ds = client.get_dataset(ds_id)['dataset']

for k,v in ds.aggregated_info.items():
    print(k, v)

Applying labels to configurations

Additional metadata can be applied to individual configurations or properties using apply_labels(). This function queries on the specified collection and attaches the given labels to any matching entries.

client.apply_labels(
    dataset_id=ds_id,
    collection_name='configurations',
    query={'nsites': {'$lt': 100}},
    labels={'small'},
    verbose=True
)

Note the use of dataset_id=ds_id, which ensures that the labels are only applied to the entries attached to the specified Dataset.

When extracting a ConfigurationSet whose linked Configurations have been modified, resync=True should be used to ensure that all necessary information (such as Configuration labels) is re-aggregated.

cs = client.get_configuration_set(cs_id, resync=True)['configuration_set']

Exploring the dataset

Use get_statistics() to see basic statistics about a selection of property fields:

ds = client.get_dataset(ds_id)['dataset']

client.get_statistics(
    # For getting statistics about all property fields on the dataset
    ds.aggregated_info['property_fields'],
    # For getting statistics only about the properties attached to the
    # dataset
    ids=ds.property_ids
)

Use plot_histograms() to quickly visualize the property fields.

client.plot_histograms(
    ds.aggregated_info['property_fields'],
    ids=ds.property_ids
)

Applying transformations to properties

apply_transformation() can be used to modify Properties that have already been loaded into the Database.

First, define the functions that will be used to update the Properties. In this example, we will show how to convert supercell energies to per-atom energies.

def transform(current_val, pr_doc):
    """
    Update functions MUST take exactly two arguments.

    Args:
        current_val (object):
            The value to update. This will be the contents of the
            'source-value' field of a Property field.

        pr_doc (dict):
            The full Mongo Property document, with the Configuration
            specified by the `configuration_ids` argument to
            `apply_transformation()` attached as the 'configuration' field.
            This argument provides access to any necessary related data.

    Returns:
        The new value to assign to the 'source-value' field of the given
        Property field.
    """

    return current_val/pr_doc['configuration']['nsites']
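As a quick standalone check of this per-atom conversion (a sketch, not a client call; the logic is restated here so the snippet runs on its own, and the sample document is hypothetical):

```python
# Standalone sketch of the per-atom conversion performed by the update function
def transform(current_val, pr_doc):
    # Divide a supercell energy by the number of sites in the
    # attached Configuration
    return current_val / pr_doc['configuration']['nsites']

# Hypothetical Property document with its Configuration attached,
# as apply_transformation() would provide it
sample_doc = {'configuration': {'nsites': 8}}

print(transform(40.0, sample_doc))  # 5.0
```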

Then, pass the transformation functions into apply_transformation(). Note the use of ds_id and all_pr_ids to limit the operations to only act on the properties in the given Dataset with the given Property IDs. Also note the use of configuration_ids to specify which Configurations should be attached to which Properties for use in the transformation function (see code block above).

all_co_ids, all_pr_ids = list(zip(*ids))

# Convert to per-atom energies
client.apply_transformation(
    dataset_id=ds_id,
    property_ids=all_pr_ids,
    update_map={
        'energy-forces.energy':
            lambda current_val, pr_doc: current_val/pr_doc['configuration']['nsites']
    },
    configuration_ids=all_co_ids,
)

Filtering

The filter_on_properties() and filter_on_configurations() methods can be used to filter lists of ConfigurationSets and Properties based on arbitrary criteria. This is useful for obtaining subsets of a Dataset.

Here we show an example of using filter_on_properties(). filter_on_configurations() works in a similar manner; see the documentation for more details.

First, define a function to use for filtering the Properties. This function should take a single argument (the Property document) and return True if the Property should be included in the filtered data. Note that the ConfigurationSets will be filtered by only including Configurations that are linked to at least one Property for which the filter function returned True.

def ff(pr_doc):
    emax = np.max(np.abs(pr_doc['energy-forces']['energy']['source-value']))
    fmax = np.max(np.abs(pr_doc['energy-forces']['forces']['source-value']))
    return (emax < 100) and (fmax < 3)
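For example, applied to hypothetical Property documents (ff is restated here so the snippet runs on its own), the filter keeps only entries whose energy and force magnitudes fall within both thresholds:

```python
import numpy as np

def ff(pr_doc):
    emax = np.max(np.abs(pr_doc['energy-forces']['energy']['source-value']))
    fmax = np.max(np.abs(pr_doc['energy-forces']['forces']['source-value']))
    return (emax < 100) and (fmax < 3)

# Hypothetical Property documents: one within both thresholds, one with an
# energy magnitude above the cutoff
good_doc = {'energy-forces': {
    'energy': {'source-value': 12.0},
    'forces': {'source-value': [[0.1, -0.2, 0.3]]},
}}
bad_doc = {'energy-forces': {
    'energy': {'source-value': 250.0},
    'forces': {'source-value': [[0.1, -0.2, 0.3]]},
}}

print(ff(good_doc), ff(bad_doc))  # True False
```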

Next, get the filtered ConfigurationSets and Properties.

clean_config_sets, clean_property_ids = client.filter_on_properties(
    ds_id,
    filter_fxn=ff,
    fields=['energy-forces.energy', 'energy-forces.forces'],
    verbose=True
)

Add the newly-filtered ConfigurationSets into the Database

new_cs_ids = []
for cs in clean_config_sets:
    if len(cs.configuration_ids):
        new_cs_ids.append(
            client.insert_configuration_set(
                cs.configuration_ids,
                cs.description, verbose=True
            )
        )

And finally, define a new Dataset with the filtered data

ds_id_clean = client.insert_dataset(
    cs_ids=new_cs_ids,
    pr_ids=clean_property_ids,
    name='basic_example_filtered',
    authors=['ColabFit'],
    links=[],
    description="A dataset generated during a basic filtering example",
    resync=True,
    verbose=True,
)