Database

class colabfit.tools.database.MongoDatabase(database_name, nprocs=1, drop_database=False, user=None, pwrd=None, port=27017)

A MongoDatabase stores all of the data in Mongo documents, and provides additional functionality like filtering and optimized queries.

The Mongo database has the following structure:

/configurations
    _id
    atomic_numbers
    positions
    cell
    pbc
    names
    labels
    elements
    nelements
    elements_ratios
    chemical_formula_reduced
    chemical_formula_anonymous
    chemical_formula_hill
    nsites
    dimension_types
    nperiodic_dimensions
    lattice_vectors
    last_modified
    relationships
        properties
        configuration_sets

/property_definitions
    _id
    definition

/properties
    _id
    type
    property_name
        each field in the property definition
    methods
    labels
    last_modified
    relationships
        property_settings
        configurations

/property_settings
    _id
    method
    description
    labels
    files
        file_name
        file_contents
    relationships
        properties

/configuration_sets
    _id
    last_modified
    aggregated_info
        (from configurations)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        labels
        labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types
    relationships
        configurations
        datasets

/datasets
    _id
    last_modified
    aggregated_info
        (from configuration sets)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        configuration_labels
        configuration_labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types

        (from properties)
        property_types
        property_fields
        methods
        methods_counts
        property_labels
        property_labels_counts
    relationships
        properties
        configuration_sets

database_name

The name of the Mongo database

Type: str

configurations

A Mongo collection of configuration documents

Type: Collection

properties

A Mongo collection of property documents

Type: Collection

property_definitions

A Mongo collection of property definitions

Type: Collection

property_settings

A Mongo collection of property setting documents

Type: Collection

configuration_sets

A Mongo collection of configuration set documents

Type: Collection

datasets

A Mongo collection of dataset documents

Type: Collection

aggregate_configuration_info(ids, verbose=False)

Gathers the following information from a collection of configurations:

nconfigurations: the total number of configurations
nsites: the total number of sites
nelements: the total number of unique element types
elements: the element types
individual_elements_ratios: a set of elements ratios generated by looping over each configuration, extracting its concentration of each element, and adding the tuple of concentrations to the set
total_elements_ratios: the ratio of the total count of atoms of each element type over nsites
labels: the union of all configuration labels
labels_counts: the total count of each label
chemical_formula_reduced: the set of all reduced chemical formulae
chemical_formula_anonymous: the set of all anonymous chemical formulae
chemical_formula_hill: the set of all Hill chemical formulae
nperiodic_dimensions: the set of all numbers of periodic dimensions
dimension_types: the set of all periodic boundary choices

Parameters

verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)
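
As a rough illustration of the aggregation described above, the sketch below computes several of these quantities over plain Python dictionaries. It is not the library's implementation: the helper name `aggregate_info_sketch` and the input layout (a `'symbols'` list with one element symbol per site, plus a `'labels'` list) are assumptions made for this example, and it expects at least one non-empty configuration.

```python
from collections import Counter


def aggregate_info_sketch(configurations):
    """Aggregate summary info over a list of configuration-like dicts.

    Hypothetical input layout: each dict holds 'symbols' (one element symbol
    per site) and 'labels' (a list of strings).
    """
    atom_counts = Counter()    # total number of atoms of each element
    label_counts = Counter()   # number of configurations carrying each label
    individual_ratios = set()  # one tuple of (element, ratio) pairs per configuration
    nsites = 0

    for config in configurations:
        counts = Counter(config['symbols'])
        n = len(config['symbols'])
        nsites += n
        atom_counts.update(counts)
        label_counts.update(set(config['labels']))
        individual_ratios.add(
            tuple((el, counts[el] / n) for el in sorted(counts))
        )

    elements = sorted(atom_counts)
    labels = sorted(label_counts)
    return {
        'nconfigurations': len(configurations),
        'nsites': nsites,
        'nelements': len(elements),
        'elements': elements,
        'individual_elements_ratios': individual_ratios,
        'total_elements_ratios': [atom_counts[el] / nsites for el in elements],
        'labels': labels,
        'labels_counts': [label_counts[name] for name in labels],
    }


info = aggregate_info_sketch([
    {'symbols': ['Si', 'Si', 'O', 'O'], 'labels': ['bulk']},
    {'symbols': ['Si', 'Si'], 'labels': ['bulk', 'strained']},
])
```

Storing the per-configuration ratios as tuples in a set means that configurations with identical compositions contribute a single entry.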

aggregate_configuration_set_info(cs_ids, resync=False, verbose=False)

Aggregates the following information from a list of configuration sets:

nconfigurations

nsites

chemical_systems

nelements

elements

individual_elements_ratios

total_elements_ratios

labels

labels_counts

chemical_formula_reduced

chemical_formula_anonymous

chemical_formula_hill

nperiodic_dimensions

dimension_types

Parameters

cs_ids (list or str) – The IDs of the configuration sets to aggregate information from
resync (bool, default=False) – If True, re-synchronizes each configuration set before aggregating the information.
verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)

aggregate_dataset_info(ds_ids)

Aggregates information from a list of datasets.

NOTE: this will face all of the same challenges as aggregate_configuration_set_info(): you need to find the overlap of configurations (COs) and properties (PRs).

aggregate_property_info(pr_ids, verbose=False)

Aggregates the following information from a list of properties:

types

labels

labels_counts

Parameters

pr_ids (list or str) – The IDs of the properties to aggregate information from
verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)

apply_labels(dataset_id, collection_name, query, labels, verbose=False)

Applies the given labels to all objects in the specified collection that match the query and are linked to the given dataset.

Parameters

dataset_id (str) – The ID of the dataset. Used as a safety measure to only update entries for the given dataset.
collection_name (str) – One of ‘configurations’ or ‘properties’.
query (dict) – A Mongo-style query for filtering the collection. For example: query = {'nsites': {'$lt': 100}}.
labels (set or str) – A set of labels to apply to the matching entries.
verbose (bool) – If True, prints progress bar.

Pseudocode:

Get the IDs of the configurations that match the query
Use updateMany to update the MongoDB
Iterate over the HDF5 entries.
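
The first two pseudocode steps can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `build_label_update` is hypothetical, and the use of `$addToSet` with `$each` (standard MongoDB update operators) is an assumption about how duplicate labels might be avoided. The function only builds the filter and update documents, so it runs without a server.

```python
def build_label_update(linked_ids, query, labels):
    """Return (filter, update) documents suitable for an update_many call.

    linked_ids: IDs of the objects already linked to the dataset.
    query: the user's Mongo-style query.
    labels: a set of labels, or a single label given as a string.
    """
    if isinstance(labels, str):
        labels = {labels}  # accept a single label as well as a set

    # Restrict the user's query to objects linked to the dataset.
    # Note: this overwrites any '_id' condition already in the query.
    mongo_filter = dict(query)
    mongo_filter['_id'] = {'$in': list(linked_ids)}

    # $addToSet with $each appends each label only if it is not present yet.
    update = {'$addToSet': {'labels': {'$each': sorted(labels)}}}
    return mongo_filter, update


mongo_filter, update = build_label_update(
    linked_ids=['co_1', 'co_2'],
    query={'nsites': {'$lt': 100}},
    labels={'small', 'bulk'},
)
# With a live server and pymongo: collection.update_many(mongo_filter, update)
```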

concatenate_configurations(): Concatenates the atomic_numbers, positions, cells, and pbcs groups in /configurations.

dataset_from_markdown(html_file_path, generator=False, verbose=False)

Loads a Dataset from a markdown file.

Parameters

html_file_path (str) – The full path to the markdown file
generator (bool, default=False) – If True, uses a generator when inserting data.
verbose (bool, default=False) – If True, prints progress bars

Returns

The Dataset object after adding it to the Database

Return type

dataset (Dataset)

dataset_to_markdown(ds_id, base_folder, html_file_name, data_file_name, data_format, name_field='_name', histogram_fields=None, yscale='linear')

Saves a Dataset and writes a properly formatted markdown file. In the case of a Dataset that has child Dataset objects, each child Dataset is written to a separate sub-folder.

Parameters

ds_id (str) – The ID of the dataset.
base_folder (str) – Top-level folder in which to save the markdown and data files
html_file_name (str) – Name of file to save markdown to
data_file_name (str) – Name of file to save configuration and properties to
data_format (str, default='mongo') – Format to use for data file. If ‘mongo’, does not save the configurations to a new file, and instead adds the ID of the Dataset in the Mongo Database.
name_field (str) – The name of the field that should be used to generate configuration names
histogram_fields (list, default=None) – The property fields to include in the histogram plot. If None, plots all fields.
yscale (str, default='linear') – Scaling to use for histogram plotting

filter_on_configurations(ds_id, query, verbose=False)

Searches the configuration sets of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.

The returned configuration sets will only include configurations that return True for the filter

The returned property IDs will only include properties that point to a configuration that returned True for the filter.

Parameters

ds_id (str) – The ID of the dataset to filter
query (dict) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.
verbose (bool, default=False) – If True, prints progress bars

Returns

configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter
property_ids (list) – A list of property IDs that satisfy the filter

Return type

configuration_sets (list), property_ids (list)

filter_on_properties(ds_id, filter_fxn=None, query=None, fields=None, verbose=False)

Searches the properties of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.

The returned configuration sets will only include configurations that are pointed to by a property that returned True for the filter

The returned property IDs will only include properties that returned True for the filter function.

Example:

configuration_sets, property_ids = database.filter_on_properties(
    ds_id=...,
    # '<field>' is a placeholder for the property field to test
    filter_fxn=lambda x: np.max(np.abs(x['<field>']))
)

Parameters

ds_id (str) – The ID of the dataset to filter
filter_fxn (callable, default=None) – A callable function to use as filter(filter_fxn, cursor) where cursor is a Mongo cursor over all of the property documents in the given dataset. If filter_fxn is None, must specify query.
query (dict, default=None) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.
fields (str or list, default=None) – The fields required by filter_fxn. Providing the minimum number of necessary fields can improve query performance.
verbose (bool, default=False) – If True, prints progress bars

Returns

configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter
property_ids (list) – A list of property IDs that satisfy the filter

Return type

configuration_sets (list), property_ids (list)
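
The filtering logic described above can be sketched on in-memory documents. This is a simplified illustration under stated assumptions: the helper name `filter_properties_sketch` is hypothetical, the documents use a flat `'energy'` field for brevity, and the sketch returns configuration IDs rather than pruned configuration set objects.

```python
def filter_properties_sketch(property_docs, filter_fxn):
    """Return (configuration_ids, property_ids) for documents passing filter_fxn."""
    # Same pattern as filter(filter_fxn, cursor), applied to a plain list.
    kept = list(filter(filter_fxn, property_docs))
    property_ids = [doc['_id'] for doc in kept]

    # Keep only configurations pointed to by a property that passed the filter.
    configuration_ids = set()
    for doc in kept:
        configuration_ids.update(doc['relationships']['configurations'])
    return configuration_ids, property_ids


docs = [
    {'_id': 'pr_1', 'energy': -3.2, 'relationships': {'configurations': ['co_1']}},
    {'_id': 'pr_2', 'energy': 1.5, 'relationships': {'configurations': ['co_2']}},
    {'_id': 'pr_3', 'energy': -0.1, 'relationships': {'configurations': ['co_1', 'co_3']}},
]
configuration_ids, property_ids = filter_properties_sketch(
    docs, filter_fxn=lambda doc: doc['energy'] < 0
)
```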

get_configuration(i, property_ids=None, attach_properties=False)

Returns a single configuration by calling get_configurations().

get_configuration_set(cs_id, resync=False)

Returns the configuration set with the given ID.

Parameters

cs_id (str) – The ID of the configuration set to return
resync (bool) – If True, re-aggregates the configuration set information before returning. Default is False.

Returns

A dictionary with two keys:

‘last_modified’: a datetime string
‘configuration_set’: the configuration set object

Return type

dict

get_configurations(configuration_ids, property_ids=None, attach_properties=False, attach_settings=False, generator=False, verbose=False)

A generator that returns in-memory Configuration objects one at a time by loading the atomic numbers, positions, cells, and PBCs.

Parameters

configuration_ids (list or 'all') – A list of string IDs specifying which Configurations to return. If ‘all’, returns all of the configurations in the database.
property_ids (list, default=None) – A list of Property IDs. Used for limiting searches when attach_properties==True. If None, attach_properties will attach all linked Properties. Note that this only attaches one property per Configuration, so if multiple properties point to the same Configuration, that Configuration will be returned multiple times.
attach_properties (bool, default=False) – If True, attaches all the data of any linked properties from property_ids. The property data will either be added to the arrays dictionary on a Configuration (if it can be converted to a matrix where the first dimension is the same as the number of atoms in the Configuration) or the info dictionary (if it wasn’t added to arrays). Property fields are stored in a list to accommodate the possibility of multiple properties of the same type pointing to the same configuration. WARNING: don’t use this option if multiple properties of the same type point to the same Configuration, but the properties don’t have values for all of their fields.
attach_settings (bool, default=False) – If True, attaches all of the fields of the property settings that are linked to the attached property instances. If attach_settings=True, must also have attach_properties=True.
generator (bool, default=False) – If True, this function returns a generator of the configurations. This is useful if the configurations can’t all fit in memory at the same time.
verbose (bool) – If True, prints progress bar

Returns

A list or generator of the re-constructed configurations

Return type

configurations (iterable)
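
The arrays-versus-info rule described for attach_properties can be illustrated with a small sketch. The helper name `attach_property_sketch` is hypothetical and the check is deliberately simple (a list or tuple with one entry per atom); the library's actual test for "can be converted to a matrix" may differ.

```python
def attach_property_sketch(natoms, arrays, info, name, value):
    """Store value in arrays if it has one entry per atom, otherwise in info."""
    per_atom = isinstance(value, (list, tuple)) and len(value) == natoms
    target = arrays if per_atom else info
    # Values are kept in a list so that several properties of the same type
    # can point to the same configuration.
    target.setdefault(name, []).append(value)
    return 'arrays' if per_atom else 'info'


arrays, info = {}, {}
where_forces = attach_property_sketch(
    2, arrays, info, 'forces', [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
)
where_energy = attach_property_sketch(2, arrays, info, 'energy', -7.4)
```

Because every field is stored as a list, two energy properties attached to the same configuration would appear as a two-element list under the same key.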

get_data(collection_name, fields, query=None, ids=None, keep_ids=False, concatenate=False, vstack=False, ravel=False, unpack_properties=True, verbose=False)

Queries the database and returns the requested fields as a list or an array of values. Returns the results in memory.

Example:

data = database.get_data(
    collection_name='properties',
    query={'_id': {'$in': <list_of_property_IDs>}},
    fields=['property_name_1.energy', 'property_name_1.forces'],
)

Parameters

collection_name (str) – The name of a collection in the database.
fields (list or str) – The fields to return from the documents. Sub-fields can be returned by providing names separated by periods (‘.’)
query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.
ids (list) – The list of IDs to return the data for. If None, returns the data for the entire collection. Note that this information can also be provided using the query argument.
keep_ids (bool, default=False) – If True, includes the ‘_id’ field as one of the returned values.
concatenate (bool, default=False) – If True, concatenates the data before returning.
vstack (bool, default=False) – If True, calls np.vstack on data before returning.
ravel (bool, default=False) – If True, concatenates and ravels the data before returning.
unpack_properties (bool, default=True) – If True, returns only the contents of the 'source-value' key for each field in fields (assuming 'source-value' exists). Users who wish to return the full dictionaries for fields should set unpack_properties=False.
verbose (bool, default=False) – If True, prints a progress bar

Returns

A dictionary with one key for each entry in fields; each value is the in-memory data for that field

Return type

data (dict)
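
To make the dotted sub-field syntax and the unpack_properties option concrete, here is a small sketch operating on a single in-memory document. The helper name `extract_field_sketch` and the `'source-unit'` key in the sample document are assumptions for illustration; only `'source-value'` is mentioned in the parameter description above.

```python
def extract_field_sketch(doc, field, unpack_properties=True):
    """Follow a dotted field name into a nested document."""
    value = doc
    for key in field.split('.'):
        value = value[key]  # descend one level per dotted component
    # Optionally return only the contents of 'source-value', if present.
    if unpack_properties and isinstance(value, dict) and 'source-value' in value:
        return value['source-value']
    return value


doc = {
    '_id': 'pr_1',
    'property_name_1': {
        'energy': {'source-value': -3.2, 'source-unit': 'eV'},
    },
}
energy = extract_field_sketch(doc, 'property_name_1.energy')
full = extract_field_sketch(doc, 'property_name_1.energy', unpack_properties=False)
```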

get_dataset(ds_id, resync=False, verbose=False)

Returns the dataset with the given ID.

Parameters

ds_id (str) – The ID of the dataset to return
resync (bool) – If True, re-aggregates the configuration set and property information before returning. Default is False.
verbose (bool, default=False) – If True, prints a progress bar. Only used if resync=True.

Returns

A dictionary with two keys:

‘last_modified’: a datetime string
‘dataset’: the dataset object

Return type

dict

get_property_definition(name)

get_property_settings(pso_id)

get_statistics(fields, query=None, ids=None, verbose=False)

Queries the database and returns summary statistics (average, standard deviation, minimum, maximum, and average absolute value) for each of the specified fields. Returns the results in memory.

Example:

results = database.get_statistics(
    fields=['property_name_1.energy', 'property_name_1.forces'],
    query={'_id': {'$in': <list_of_property_IDs>}}
)

Parameters

fields (list or str) – The fields to return from the documents. Sub-fields can be returned by providing names separated by periods (‘.’)
query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.
ids (list) – The list of IDs to return the data for. If None, returns the data for the entire collection. Note that this information can also be provided using the query argument.
verbose (bool, default=False) – If True, prints a progress bar during data extraction

Returns

results (dict):

{
    f:  {
        'average': np.average(data),
        'std': np.std(data),
        'min': np.min(data),
        'max': np.max(data),
        'average_abs': np.average(np.abs(data))
    } for f in fields
}
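
The per-field dictionary above can be reproduced for a flat list of numbers using only the standard library. This is a sketch for illustration, with the hypothetical helper name `field_statistics_sketch`; it uses `statistics.pstdev` because `np.std` computes the population standard deviation by default.

```python
import statistics


def field_statistics_sketch(data):
    """Compute the same five statistics for a flat list of numbers."""
    return {
        'average': statistics.fmean(data),
        'std': statistics.pstdev(data),  # population std, like np.std's default
        'min': min(data),
        'max': max(data),
        'average_abs': statistics.fmean(abs(x) for x in data),
    }


stats = field_statistics_sketch([-2.0, 0.0, 2.0])
```

The example shows why average_abs is reported separately: for data that is symmetric about zero, such as force components, the plain average is zero while the average magnitude is not.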

insert_configuration_set(ids, description='', verbose=False)

Inserts the configuration set of IDs to the database.

Parameters

ids (list or str) – The IDs of the configurations to include in the configuration set.
description (str, optional) – A human-readable description of the configuration set.
verbose (bool, default=False) – If True, prints a progress bar

insert_data(configurations, property_map=None, transform=None, generator=False, verbose=True)

A wrapper to Database.insert_data() which also adds important queryable metadata about the configurations into the Client’s server.

Note that when adding the data, the Mongo server will store the bi-directional relationships between the data. For example, a property will point to its configurations, but those configurations will also point back to any linked properties.

Parameters

configurations (list or Configuration) – The list of configurations to be added.

property_map (dict) –

A dictionary that is used to specify how to load a defined property off of a configuration. Note that the top-level keys in the map must be the names of properties that have been previously defined using insert_property_definition().

Example

property_map = {
    'energy-forces-stress': {
        # ColabFit name: {'field': ASE field name, 'units': str}
        'energy':   {'field': 'energy',  'units': 'eV'},
        'forces':   {'field': 'forces',  'units': 'eV/Ang'},
        'stress':   {'field': 'virial',  'units': 'GPa'},
        'per-atom': {'field': 'per-atom', 'units': None},

        '_settings': {
            '_method': 'VASP',
            '_description': 'A static VASP calculation',
            '_files': None,
            '_labels': ['Monkhorst-Pack'],

            'xc-functional': {'field': 'xcf', 'units': None}
        }
    }
}

If None, only loads the configuration information (atomic numbers, positions, lattice vectors, and periodic boundary conditions).

The ‘_settings’ key is a special key that can be used to specify the contents of a PropertySettings object that will be constructed and linked to each associated property instance.

transform (callable, default=None) – If provided, transform will be called on each configuration in configurations as transform(configuration). Note that this happens before anything else is done. transform should modify the Configuration in-place.
generator (bool, default=False) – If True, returns a generator of the results; otherwise returns a list. If True, uses update_one instead of bulk_write to avoid having to store update documents in memory.
verbose (bool, default=True) – If True, prints a progress bar

Returns

A list of (config_id, property_id) tuples of the inserted data. If no properties were inserted, then property_id will be None.

Return type

ids (list)
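
The role of property_map can be illustrated with a sketch that reads fields off an ASE-style info dictionary. This is not the library's loader: the helper name `extract_property_sketch` is hypothetical, and the `'source-value'`/`'source-unit'` output layout is an assumption modeled on the OpenKIM property format.

```python
def extract_property_sketch(info, property_map):
    """Build one property instance per property name from an info dict."""
    results = {}
    for property_name, fields in property_map.items():
        instance = {}
        for colabfit_name, spec in fields.items():
            if colabfit_name == '_settings':
                continue  # the special settings key is not a property field
            if spec['field'] not in info:
                continue  # this configuration does not carry the field
            instance[colabfit_name] = {
                'source-value': info[spec['field']],
                'source-unit': spec['units'],
            }
        results[property_name] = instance
    return results


property_map = {
    'energy-forces-stress': {
        'energy': {'field': 'energy', 'units': 'eV'},
        'stress': {'field': 'virial', 'units': 'GPa'},
        '_settings': {'_method': 'VASP'},
    }
}
loaded = extract_property_sketch({'energy': -7.4}, property_map)
```

Note how the ColabFit name (`'stress'`) and the source field name (`'virial'`) are allowed to differ; the map is what connects them.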

insert_dataset(cs_ids, pr_ids, name, authors=None, links=None, description='', resync=False, verbose=False)

Inserts a dataset into the database.

Parameters

cs_ids (list or str) – The IDs of the configuration sets to link to the dataset.
pr_ids (list or str) – The IDs of the properties to link to the dataset
name (str) – The name of the dataset
authors (list or str or None) – The names of the authors of the dataset. If None, then no authors are added.
links (list or str or None) – External links (e.g., journal articles, Git repositories, …) to be associated with the dataset. If None, then no links are added.
description (str or None) – A human-readable description of the dataset. If None, then no description is added.
resync (bool) – If True, re-synchronizes the configuration sets and properties before adding to the dataset. Default is False.
verbose (bool, default=False) – If True, prints a progress bar

Returns

The ID of the inserted dataset

Return type

ds_id (str)

insert_property_definition(definition)

Inserts a new property definition into the database. Checks that definition is valid, then builds all necessary groups in /root/properties. Throws an error if the property already exists.

Parameters: definition (dict or string) – The map defining the property. See the example below, or the OpenKIM Properties Framework for more details. If a string is provided, it must be the name of an existing property definition from the OpenKIM Properties List.

Example definition:

property_definition = {
    'property-id': 'default',
    'property-title': 'A default property used for testing',
    'property-description': 'A description of the property',
    'energy': {'type': 'float', 'has-unit': True, 'extent': [], 'required': True, 'description': 'empty'},
    'stress': {'type': 'float', 'has-unit': True, 'extent': [6], 'required': True, 'description': 'empty'},
    'name': {'type': 'string', 'has-unit': False, 'extent': [], 'required': True, 'description': 'empty'},
    'nd-same-shape': {'type': 'float', 'has-unit': True, 'extent': [2,3,5], 'required': True, 'description': 'empty'},
    'nd-diff-shape': {'type': 'float', 'has-unit': True, 'extent': [":", ":", ":"], 'required': True, 'description': 'empty'},
    'forces': {'type': 'float', 'has-unit': True, 'extent': [":", 3], 'required': True, 'description': 'empty'},
    'nd-same-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', 2, 3], 'required': True, 'description': 'empty'},
    'nd-diff-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', ':', ':'], 'required': True, 'description': 'empty'},
}
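
The 'extent' entries in the example definition describe the expected shape of each field. A minimal sketch of a shape check is shown below, assuming that ':' stands for a dimension of arbitrary length and that an empty extent denotes a scalar, as in the OpenKIM convention; the helper name `matches_extent` is hypothetical and is not part of the library.

```python
def matches_extent(shape, extent):
    """Return True if a data shape is compatible with an 'extent' list.

    ':' is treated as a wildcard for a dimension of any length (assumption).
    """
    if len(shape) != len(extent):
        return False  # number of dimensions must agree
    return all(e == ':' or e == s for s, e in zip(shape, extent))


scalar_ok = matches_extent((), [])               # e.g. 'energy'
stress_ok = matches_extent((6,), [6])            # e.g. 'stress'
forces_ok = matches_extent((10, 3), [':', 3])    # e.g. 'forces' for 10 atoms
forces_bad = matches_extent((10, 2), [':', 3])   # wrong second dimension
```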

insert_property_settings(ps_object)

Inserts a new property settings object into the database by creating and populating the necessary groups in /root/property_settings.

Parameters: ps_object (PropertySettings) – The PropertySettings object to insert into the database.
Returns: The ID of the inserted property settings object. Equals the hash of the object.
Return type: ps_id (str)

plot_histograms(fields=None, query=None, ids=None, verbose=False, nbins=100, xscale='linear', yscale='linear', method='matplotlib')

Generates histograms of the given fields.

Parameters

fields (list or str) – The names of the fields to plot
query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.
ids (list or str) – The IDs of the objects to plot the data for
verbose (bool, default=False) – If True, prints progress bar
nbins (int) – Number of bins per histogram
xscale (str) – Scaling for x-axes. One of [‘linear’, ‘log’].
yscale (str) – Scaling for y-axes. One of [‘linear’, ‘log’].
method (str, default='matplotlib') – Package to use for plotting. ‘plotly’ or ‘matplotlib’.

Returns

Returns the figure object.

resync_configuration_set(cs_id, verbose=False)

Re-synchronizes the configuration set by re-aggregating the information from the configurations.

Parameters

cs_id (str) – The ID of the configuration set to update
verbose (bool, default=False) – If True, prints a progress bar

Returns

None; updates the configuration set document in-place

resync_dataset(ds_id, verbose=False)

Re-synchronizes the dataset by aggregating all necessary data from properties and configuration sets. Note that this also calls colabfit.tools.client.resync_configuration_set()

Parameters

ds_id (str) – The ID of the dataset to update
verbose (bool, default=False) – If True, prints a progress bar

Returns

None; updates the dataset document in-place

colabfit.tools.database.load_data(file_path, file_format, name_field, elements, default_name='', labels_field=None, reader=None, glob_string=None, generator=True, verbose=False, **kwargs)

Loads a list of Configuration objects.

Parameters

file_path (str) – Path to the file or folder containing the data
file_format (str) – A string for specifying the type of Converter to use when loading the configurations. Allowed values are ‘xyz’, ‘extxyz’, ‘cfg’, or ‘folder’.
name_field (str) – Key name to use to access ase.Atoms.info[<name_field>] to obtain the name of a configuration once the atoms have been loaded from the data file. Note that if file_format == ‘folder’, name_field will be set to ‘name’.
elements (list) – A list of strings of element types
default_name (str) – Default name to be used if name_field==None.
labels_field (str) – Key name to use to access ase.Atoms.info[<labels_field>] to obtain the labels that should be applied to the configuration. This field should contain a comma-separated list of strings.
reader (callable) – An optional function for loading configurations from a file. Only used for file_format == ‘folder’
glob_string (str) – A string to use with Path(file_path).rglob(glob_string) to generate a list of files to be passed to self.reader. Only used for file_format == ‘folder’.
generator (bool, default=True) – If True, returns a generator of Configurations. If False, returns a list.
verbose (bool) – If True, prints progress bar.

All other keyword arguments will be passed with converter.load(…, **kwargs)
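
Two of the parameters above lend themselves to a short sketch: the comma-separated labels field and the recursive file search used for file_format == ‘folder’. The helper names `parse_labels_sketch` and `find_files_sketch` are hypothetical; only `Path(file_path).rglob(glob_string)` is taken from the description above.

```python
import tempfile
from pathlib import Path


def parse_labels_sketch(raw):
    """Split a comma-separated labels string into a set of clean labels."""
    return {label.strip() for label in raw.split(',') if label.strip()}


def find_files_sketch(file_path, glob_string):
    """Recursively collect the files under file_path matching glob_string."""
    return sorted(Path(file_path).rglob(glob_string))


# Build a small throwaway folder tree to search.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'run_1').mkdir()
    (Path(tmp) / 'run_1' / 'a.xyz').write_text('')
    (Path(tmp) / 'b.xyz').write_text('')
    (Path(tmp) / 'notes.txt').write_text('')
    found = [p.name for p in find_files_sketch(tmp, '*.xyz')]

labels = parse_labels_sketch('bulk, strained ,')
```

Because rglob searches sub-folders as well, a pattern such as '*.xyz' picks up files at any depth below file_path.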