Database

class colabfit.tools.database.MongoDatabase(database_name, nprocs=1, drop_database=False, user=None, pwrd=None, port=27017)

A MongoDatabase stores all of the data in Mongo documents, and provides additional functionality like filtering and optimized queries.

The Mongo database has the following structure:

/configurations
    _id
    atomic_numbers
    positions
    cell
    pbc
    names
    labels
    elements
    nelements
    elements_ratios
    chemical_formula_reduced
    chemical_formula_anonymous
    chemical_formula_hill
    nsites
    dimension_types
    nperiodic_dimensions
    lattice_vectors
    last_modified
    relationships
        properties
        configuration_sets

/property_definitions
    _id
    definition

/properties
    _id
    type
    property_name
        each field in the property definition
    methods
    labels
    last_modified
    relationships
        property_settings
        configurations

/property_settings
    _id
    method
    description
    labels
    files
        file_name
        file_contents
    relationships
        properties

/configuration_sets
    _id
    last_modified
    aggregated_info
        (from configurations)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        labels
        labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types
    relationships
        configurations
        datasets

/datasets
    _id
    last_modified
    aggregated_info
        (from configuration sets)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        configuration_labels
        configuration_labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types

        (from properties)
        property_types
        property_fields
        methods
        methods_counts
        property_labels
        property_labels_counts
    relationships
        properties
        configuration_sets

database_name

The name of the Mongo database

Type

str

configurations

A Mongo collection of configuration documents

Type

Collection

properties

A Mongo collection of property documents

Type

Collection

property_definitions

A Mongo collection of property definitions

Type

Collection

property_settings

A Mongo collection of property setting documents

Type

Collection

configuration_sets

A Mongo collection of configuration set documents

Type

Collection

datasets

A Mongo collection of dataset documents

Type

Collection

aggregate_configuration_info(ids, verbose=False)

Gathers the following information from a collection of configurations:

  • nconfigurations: the total number of configurations

  • nsites: the total number of sites

  • nelements: the total number of unique element types

  • elements: the element types

  • individual_elements_ratios: a set of elements ratios generated by looping over each configuration, extracting its concentration of each element, and adding the tuple of concentrations to the set

  • total_elements_ratios: the ratio of the total count of atoms of each element type over nsites

  • labels: the union of all configuration labels

  • labels_counts: the total count of each label

  • chemical_formula_reduced: the set of all reduced chemical formulae

  • chemical_formula_anonymous: the set of all anonymous chemical formulae

  • chemical_formula_hill: the set of all hill chemical formulae

  • nperiodic_dimensions: the set of all numbers of periodic dimensions

  • dimension_types: the set of all periodic boundary choices

Parameters
  • ids – The IDs of the configurations to aggregate information from

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)
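
The aggregation described above can be sketched in plain Python over a list of configuration documents. This is only an illustration under stated assumptions, not the library's implementation: the helper name `aggregate_configs` and the sample documents are hypothetical, only a subset of the fields is computed, and per-element atom counts are recovered from `elements_ratios` and `nsites`.

```python
from collections import Counter

def aggregate_configs(docs):
    """Aggregate a few summary fields from configuration documents (plain dicts)."""
    element_counts = Counter()   # total number of atoms of each element
    labels_counts = Counter()    # number of configurations carrying each label
    individual_ratios = set()    # one tuple of concentrations per configuration
    nsites = 0

    for doc in docs:
        nsites += doc['nsites']
        for element, ratio in zip(doc['elements'], doc['elements_ratios']):
            # Recover the integer atom count from the per-configuration ratio
            element_counts[element] += round(ratio * doc['nsites'])
        individual_ratios.add(tuple(doc['elements_ratios']))
        labels_counts.update(doc['labels'])

    elements = sorted(element_counts)
    labels = sorted(labels_counts)
    return {
        'nconfigurations': len(docs),
        'nsites': nsites,
        'nelements': len(elements),
        'elements': elements,
        'individual_elements_ratios': individual_ratios,
        'total_elements_ratios': [element_counts[e] / nsites for e in elements],
        'labels': labels,
        'labels_counts': [labels_counts[label] for label in labels],
    }

# Hypothetical documents using field names from the /configurations schema
docs = [
    {'nsites': 2, 'elements': ['Al', 'Cu'], 'elements_ratios': [0.5, 0.5], 'labels': ['bulk']},
    {'nsites': 4, 'elements': ['Al'], 'elements_ratios': [1.0], 'labels': ['bulk', 'fcc']},
]
info = aggregate_configs(docs)
```

Here `labels_counts` is kept as a list aligned with `labels`; whether the database stores it that way is an assumption of this sketch.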

aggregate_configuration_set_info(cs_ids, resync=False, verbose=False)

Aggregates the following information from a list of configuration sets:

  • nconfigurations

  • nsites

  • chemical_systems

  • nelements

  • elements

  • individual_elements_ratios

  • total_elements_ratios

  • labels

  • labels_counts

  • chemical_formula_reduced

  • chemical_formula_anonymous

  • chemical_formula_hill

  • nperiodic_dimensions

  • dimension_types

Parameters
  • cs_ids (list or str) – The IDs of the configuration sets to aggregate information from

  • resync (bool, default=False) – If True, re-synchronizes each configuration set before aggregating the information.

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)

aggregate_dataset_info(ds_ids)

Aggregates information from a list of datasets.

NOTE: this faces all of the same challenges as aggregate_configuration_set_info():

  • you need to find the overlap of configurations (COs) and properties (PRs).
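
The overlap problem noted above arises when the same configuration or property is linked to more than one set: summing per-set counts then counts shared entries twice. A minimal sketch of the difference, using hypothetical IDs and a hypothetical helper name (this is not how the library resolves the issue):

```python
def count_unique_configurations(configuration_sets):
    """Count configurations across sets without double counting shared IDs."""
    unique_ids = set()
    for co_ids in configuration_sets.values():
        unique_ids.update(co_ids)
    return len(unique_ids)

# Hypothetical configuration sets; 'CO_c' belongs to both
configuration_sets = {
    'CS_1': ['CO_a', 'CO_b', 'CO_c'],
    'CS_2': ['CO_c', 'CO_d'],
}
naive_total = sum(len(ids) for ids in configuration_sets.values())
unique_total = count_unique_configurations(configuration_sets)
```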

aggregate_property_info(pr_ids, verbose=False)

Aggregates the following information from a list of properties:

  • types

  • labels

  • labels_counts

Parameters
  • pr_ids (list or str) – The IDs of the properties to aggregate information from

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

All of the aggregated info

Return type

aggregated_info (dict)

apply_labels(dataset_id, collection_name, query, labels, verbose=False)

Applies the given labels to all objects in the specified collection that match the query and are linked to the given dataset.

Parameters
  • dataset_id (str) – The ID of the dataset. Used as a safety measure to only update entries for the given dataset.

  • collection_name (str) – One of ‘configurations’ or ‘properties’.

  • query (dict) – A Mongo-style query for filtering the collection. For example: query = {'nsites': {'$lt': 100}}.

  • labels (set or str) – A set of labels to apply to the matching entries.

  • verbose (bool) – If True, prints progress bar.

Pseudocode:
  • Get the IDs of the configurations that match the query

  • Use updateMany to update the MongoDB

  • Iterate over the HDF5 entries.
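
The first two pseudocode steps can be sketched as building a filter and an update document for a Mongo `update_many` call. This is an assumption-laden illustration: the helper name `build_label_update` and the sample IDs are hypothetical, and the use of `$addToSet` with `$each` is one reasonable way to add labels without duplicates, not necessarily what the library does.

```python
def build_label_update(query, linked_ids, labels):
    """Return (filter, update) documents for a Mongo update_many call."""
    if isinstance(labels, str):
        labels = {labels}          # accept a single label given as a string
    # Restrict the user query to documents linked to the dataset
    mongo_filter = dict(query)
    mongo_filter['_id'] = {'$in': list(linked_ids)}
    # $addToSet with $each appends only the labels that are not already present
    update = {'$addToSet': {'labels': {'$each': sorted(labels)}}}
    return mongo_filter, update

mongo_filter, update = build_label_update(
    query={'nsites': {'$lt': 100}},
    linked_ids=['CO_1', 'CO_2'],   # hypothetical IDs linked to the dataset
    labels={'small', 'bulk'},
)
# With pymongo, these could then be passed as:
# collection.update_many(mongo_filter, update)
```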

concatenate_configurations()

Concatenates the atomic_numbers, positions, cells, and pbcs groups in /configurations.

dataset_from_markdown(html_file_path, generator=False, verbose=False)

Loads a Dataset from a markdown file.

Parameters
  • html_file_path (str) – The full path to the markdown file

  • generator (bool, default=False) – If True, uses a generator when inserting data.

  • verbose (bool, default=False) – If True, prints progress bars

Returns

The Dataset object after adding it to the Database

Return type

dataset (Dataset)

dataset_to_markdown(ds_id, base_folder, html_file_name, data_file_name, data_format, name_field='_name', histogram_fields=None, yscale='linear')

Saves a Dataset and writes a properly formatted markdown file. In the case of a Dataset that has child Dataset objects, each child Dataset is written to a separate sub-folder.

Parameters
  • ds_id (str) – The ID of the dataset.

  • base_folder (str) – Top-level folder in which to save the markdown and data files

  • html_file_name (str) – Name of file to save markdown to

  • data_file_name (str) – Name of file to save configuration and properties to

  • data_format (str, default='mongo') – Format to use for data file. If ‘mongo’, does not save the configurations to a new file, and instead adds the ID of the Dataset in the Mongo Database.

  • name_field (str) – The name of the field that should be used to generate configuration names

  • histogram_fields (list, default=None) – The property fields to include in the histogram plot. If None, plots all fields.

  • yscale (str, default='linear') – Scaling to use for histogram plotting

filter_on_configurations(ds_id, query, verbose=False)

Searches the configuration sets of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.

  • The returned configuration sets will only include configurations that return True for the filter

  • The returned property IDs will only include properties that point to a configuration that returned True for the filter.

Parameters
  • ds_id (str) – The ID of the dataset to filter

  • query (dict) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.

  • verbose (bool, default=False) – If True, prints progress bars

Returns

  • configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter

  • property_ids (list) – A list of property IDs that satisfy the filter
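
The pruning behavior described for filter_on_configurations can be illustrated with plain dictionaries. In this sketch the helper name `prune_by_configurations`, the ID values, and the representation of sets and properties as ID lists are all assumptions made for illustration.

```python
def prune_by_configurations(configuration_sets, properties, matching_ids):
    """Keep only configurations in matching_ids, and properties that point to them."""
    matching = set(matching_ids)
    pruned_sets = {
        cs_id: [co for co in co_ids if co in matching]
        for cs_id, co_ids in configuration_sets.items()
    }
    property_ids = [
        pr_id for pr_id, co_ids in properties.items()
        if matching.intersection(co_ids)
    ]
    return pruned_sets, property_ids

# Hypothetical links: set -> configurations, property -> configurations
configuration_sets = {'CS_1': ['CO_a', 'CO_b'], 'CS_2': ['CO_c']}
properties = {'PR_1': ['CO_a'], 'PR_2': ['CO_b'], 'PR_3': ['CO_c']}
pruned_sets, property_ids = prune_by_configurations(
    configuration_sets, properties, matching_ids=['CO_a', 'CO_c']
)
```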

filter_on_properties(ds_id, filter_fxn=None, query=None, fields=None, verbose=False)

Searches the properties of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.

  • The returned configuration sets will only include configurations that are pointed to by a property that returned True for the filter

  • The returned property IDs will only include properties that returned True for the filter function.

Example:

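configuration_sets, property_ids = database.filter_on_properties(
    ds_id=...,
    # Illustrative only: the field path and the threshold are placeholders
    filter_fxn=lambda x: np.max(np.abs(x['property_name_1']['forces']['source-value'])) < 1.0,
    fields=['property_name_1.forces'],
)

Parameters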
  • ds_id (str) – The ID of the dataset to filter

  • filter_fxn (callable, default=None) – A callable function to use as filter(filter_fxn, cursor) where cursor is a Mongo cursor over all of the property documents in the given dataset. If filter_fxn is None, must specify query.

  • query (dict, default=None) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.

  • fields (str or list, default=None) – The fields required by filter_fxn. Providing the minimum number of necessary fields can improve query performance.

  • verbose (bool, default=False) – If True, prints progress bars

Returns

  • configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter

  • property_ids (list) – A list of property IDs that satisfy the filter
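
The role of filter_fxn can be sketched with Python's built-in `filter` applied to a list of property documents standing in for a Mongo cursor. The helper name `filter_properties` and the flattened `energy` field are hypothetical simplifications; real property documents nest their fields under the property name.

```python
def filter_properties(property_docs, filter_fxn):
    """Apply filter_fxn to property documents; return kept property and configuration IDs."""
    kept = list(filter(filter_fxn, property_docs))
    property_ids = [doc['_id'] for doc in kept]
    configuration_ids = set()
    for doc in kept:
        # Collect every configuration that a surviving property points to
        configuration_ids.update(doc['relationships']['configurations'])
    return property_ids, configuration_ids

# Hypothetical, simplified property documents
property_docs = [
    {'_id': 'PR_1', 'energy': -3.2, 'relationships': {'configurations': ['CO_a']}},
    {'_id': 'PR_2', 'energy': 12.0, 'relationships': {'configurations': ['CO_b']}},
]
property_ids, configuration_ids = filter_properties(
    property_docs, filter_fxn=lambda doc: abs(doc['energy']) < 5.0
)
```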

get_configuration(i, property_ids=None, attach_properties=False)

Returns a single configuration by calling get_configurations()

get_configuration_set(cs_id, resync=False)

Returns the configuration set with the given ID.

Parameters
  • cs_id (str) – The ID of the configuration set to return

  • resync (bool) – If True, re-aggregates the configuration set information before returning. Default is False.

Returns

A dictionary with two keys:

  • ‘last_modified’: a datetime string

  • ‘configuration_set’: the configuration set object

Return type

dict

get_configurations(configuration_ids, property_ids=None, attach_properties=False, attach_settings=False, generator=False, verbose=False)

A generator that returns in-memory Configuration objects one at a time by loading the atomic numbers, positions, cells, and PBCs.

Parameters
  • configuration_ids (list or 'all') – A list of string IDs specifying which Configurations to return. If ‘all’, returns all of the configurations in the database.

  • property_ids (list, default=None) – A list of Property IDs. Used for limiting searches when attach_properties==True. If None, attach_properties will attach all linked Properties. Note that this only attaches one property per Configuration, so if multiple properties point to the same Configuration, that Configuration will be returned multiple times.

  • attach_properties (bool, default=False) – If True, attaches all the data of any linked properties from property_ids. The property data will either be added to the arrays dictionary on a Configuration (if it can be converted to a matrix where the first dimension is the same as the number of atoms in the Configuration) or the info dictionary (if it wasn’t added to arrays). Property fields are stored in a list to accommodate the possibility of multiple properties of the same type pointing to the same configuration. WARNING: don’t use this option if multiple properties of the same type point to the same Configuration, but the properties don’t have values for all of their fields.

  • attach_settings (bool, default=False) – If True, attaches all of the fields of the property settings that are linked to the attached property instances. If attach_settings=True, must also have attach_properties=True.

  • generator (bool, default=False) – If True, this function returns a generator of the configurations. This is useful if the configurations can’t all fit in memory at the same time.

  • verbose (bool) – If True, prints progress bar

Returns

A list or generator of the re-constructed configurations

Return type

configurations (iterable)
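
A function that returns either a list or a generator depending on a flag is typically written with the `yield` in an inner function. The sketch below shows that pattern only; `load_items` and the dictionary stand-ins are hypothetical and do not reconstruct real Configuration objects.

```python
def load_items(ids, generator=False):
    """Return a list of loaded items, or a lazy generator if generator=True."""
    def _iterate():
        for item_id in ids:
            # Stand-in for reconstructing one Configuration from the database
            yield {'_id': item_id}

    # The yield lives in the inner function so that the outer function can
    # return either a generator or a fully built list.
    return _iterate() if generator else list(_iterate())

as_list = load_items(['CO_1', 'CO_2'])
as_gen = load_items(['CO_1', 'CO_2'], generator=True)
first = next(as_gen)   # only the first item has been built at this point
```

The generator form keeps a single item in memory at a time, which matches the use case described for generator=True.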

get_data(collection_name, fields, query=None, ids=None, keep_ids=False, concatenate=False, vstack=False, ravel=False, unpack_properties=True, verbose=False)

Queries the database and returns the fields specified by fields as a list or an array of values. Returns the results in memory.

Example:

data = database.get_data(
    collection_name='properties',
    query={'_id': {'$in': <list_of_property_IDs>}},
    fields=['property_name_1.energy', 'property_name_1.forces'],
)

Parameters
  • collection_name (str) – The name of a collection in the database.

  • fields (list or str) – The fields to return from the documents. Sub-fields can be returned by providing names separated by periods (‘.’)

  • query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.

  • ids (list) – The list of IDs to return the data for. If None, returns the data for the entire collection. Note that this information can also be provided using the query argument.

  • keep_ids (bool, default=False) – If True, includes the ‘_id’ field as one of the returned values.

  • concatenate (bool, default=False) – If True, concatenates the data before returning.

  • vstack (bool, default=False) – If True, calls np.vstack on data before returning.

  • ravel (bool, default=False) – If True, concatenates and ravels the data before returning.

  • unpack_properties (bool, default=True) – If True, returns only the contents of the 'source-value' key for each field in fields (assuming 'source-value' exists). Users who wish to return the full dictionaries for fields should set unpack_properties=False.

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

A dictionary with one key for each entry in fields; each value is the in-memory data for that field.

Return type

data (dict)
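
Period-separated field names and the unpack_properties option can be illustrated on a single nested document. This is a sketch under assumptions: `get_field` is a hypothetical helper, and the sample document (including the 'source-unit' key) is invented for illustration.

```python
def get_field(doc, dotted_name, unpack_properties=True):
    """Follow a period-separated field name through nested dictionaries."""
    value = doc
    for key in dotted_name.split('.'):
        value = value[key]
    # Optionally return only the contents of 'source-value', if present
    if unpack_properties and isinstance(value, dict) and 'source-value' in value:
        return value['source-value']
    return value

# Hypothetical property document
doc = {
    '_id': 'PR_1',
    'property_name_1': {
        'energy': {'source-value': -3.2, 'source-unit': 'eV'},
    },
}
energy = get_field(doc, 'property_name_1.energy')
full = get_field(doc, 'property_name_1.energy', unpack_properties=False)
```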

get_dataset(ds_id, resync=False, verbose=False)

Returns the dataset with the given ID.

Parameters
  • ds_id (str) – The ID of the dataset to return

  • resync (bool) – If True, re-aggregates the configuration set and property information before returning. Default is False.

  • verbose (bool, default=False) – If True, prints a progress bar. Only used if resync=True.

Returns

A dictionary with two keys:

  • ‘last_modified’: a datetime string

  • ‘dataset’: the dataset object

Return type

dict

get_property_definition(name)

get_property_settings(pso_id)

get_statistics(fields, query=None, ids=None, verbose=False)

Queries the database and returns summary statistics (average, standard deviation, minimum, maximum, and average absolute value) for each of the given fields. Computes the results in memory.

Example:

results = database.get_statistics(
    fields=['property_name_1.energy', 'property_name_1.forces'],
    query={'_id': {'$in': <list_of_property_IDs>}},
)

Parameters
  • fields (list or str) – The fields to return from the documents. Sub-fields can be returned by providing names separated by periods (‘.’)

  • query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.

  • ids (list) – The list of IDs to return the data for. If None, returns the data for the entire collection. Note that this information can also be provided using the query argument.

  • verbose (bool, default=False) – If True, prints a progress bar during data extraction

Returns

results (dict):
{
    f:  {
        'average': np.average(data),
        'std': np.std(data),
        'min': np.min(data),
        'max': np.max(data),
        'average_abs': np.average(np.abs(data))
    } for f in fields
}
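
For a single field, the five statistics in the returned dictionary can be reproduced with the standard library. This sketch assumes the data is a flat list of numbers; `summarize` is a hypothetical helper, and `statistics.pstdev` is used because it matches the population standard deviation that `np.std` computes by default.

```python
import statistics

def summarize(data):
    """Compute the five summary statistics for a flat list of numbers."""
    return {
        'average': statistics.fmean(data),
        'std': statistics.pstdev(data),   # population std, like np.std's default
        'min': min(data),
        'max': max(data),
        'average_abs': statistics.fmean([abs(x) for x in data]),
    }

stats = summarize([-2.0, 0.0, 2.0, 4.0])
```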

insert_configuration_set(ids, description='', verbose=False)

Inserts a configuration set containing the given configuration IDs into the database.

Parameters
  • ids (list or str) – The IDs of the configurations to include in the configuration set.

  • description (str, optional) – A human-readable description of the configuration set.

  • verbose (bool, default=False) – If True, prints a progress bar

insert_data(configurations, property_map=None, transform=None, generator=False, verbose=True)

A wrapper to Database.insert_data() which also adds important queryable metadata about the configurations into the Client’s server.

Note that when adding the data, the Mongo server will store the bi-directional relationships between the data. For example, a property will point to its configurations, but those configurations will also point back to any linked properties.

Parameters
  • configurations (list or Configuration) – The list of configurations to be added.

  • property_map (dict) –

    A dictionary that is used to specify how to load a defined property off of a configuration. Note that the top-level keys in the map must be the names of properties that have been previously defined using add_property_definition().

    Example

    property_map = {
        'energy-forces-stress': {
            # ColabFit name: {'field': ASE field name, 'units': str}
            'energy':   {'field': 'energy',  'units': 'eV'},
            'forces':   {'field': 'forces',  'units': 'eV/Ang'},
            'stress':   {'field': 'virial',  'units': 'GPa'},
            'per-atom': {'field': 'per-atom', 'units': None},
    
            '_settings': {
                '_method': 'VASP',
                '_description': 'A static VASP calculation',
                '_files': None,
                '_labels': ['Monkhorst-Pack'],
    
                'xc-functional': {'field': 'xcf', 'units': None}
            }
        }
    }
    

    If None, only loads the configuration information (atomic numbers, positions, lattice vectors, and periodic boundary conditions).

    The ‘_settings’ key is a special key that can be used to specify the contents of a PropertySettings object that will be constructed and linked to each associated property instance.

  • transform (callable, default=None) – If provided, transform will be called on each configuration in configurations as transform(configuration). Note that this happens before anything else is done. transform should modify the Configuration in-place.

  • generator (bool, default=False) – If True, returns a generator of the results; otherwise returns a list. If True, uses update_one instead of bulk_write to avoid having to store update documents in memory.

  • verbose (bool, default=True) – If True, prints a progress bar

Returns

A list of (config_id, property_id) tuples of the inserted data. If no properties were inserted, then property_id will be None.

Return type

ids (list)
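
One way to picture how a property_map entry is applied is as a lookup from ColabFit field names to the fields stored on a configuration. The sketch below is hypothetical throughout: `extract_property`, the 'source-value'/'source-unit' output layout, and the choice to skip missing fields are assumptions for illustration, not the library's documented behavior.

```python
def extract_property(info, field_map):
    """Build a property dict from a configuration's info using a property map entry."""
    prop = {}
    for colabfit_name, spec in field_map.items():
        if colabfit_name.startswith('_'):
            continue                      # skip special keys such as '_settings'
        if spec['field'] not in info:
            continue                      # field absent on this configuration
        prop[colabfit_name] = {
            'source-value': info[spec['field']],
            'source-unit': spec['units'],
        }
    return prop

property_map = {
    'energy-forces-stress': {
        'energy': {'field': 'energy', 'units': 'eV'},
        'stress': {'field': 'virial', 'units': 'GPa'},
        '_settings': {'_method': 'VASP'},
    }
}
info = {'energy': -3.2}   # hypothetical data stored on one configuration
prop = extract_property(info, property_map['energy-forces-stress'])
```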

insert_dataset(cs_ids, pr_ids, name, authors=None, links=None, description='', resync=False, verbose=False)

Inserts a dataset into the database.

Parameters
  • cs_ids (list or str) – The IDs of the configuration sets to link to the dataset.

  • pr_ids (list or str) – The IDs of the properties to link to the dataset

  • name (str) – The name of the dataset

  • authors (list or str or None) – The names of the authors of the dataset. If None, then no authors are added.

  • links (list or str or None) – External links (e.g., journal articles, Git repositories, …) to be associated with the dataset. If None, then no links are added.

  • description (str or None) – A human-readable description of the dataset. If None, then no description is added.

  • resync (bool) – If True, re-synchronizes the configuration sets and properties before adding to the dataset. Default is False.

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

The ID of the inserted dataset

Return type

ds_id (str)

insert_property_definition(definition)

Inserts a new property definition into the database. Checks that definition is valid, then builds all necessary groups in /root/properties. Throws an error if the property already exists.

Parameters

definition (dict or string) – The map defining the property. See the example below, or the OpenKIM Properties Framework for more details. If a string is provided, it must be the name of an existing property definition from the OpenKIM Properties List.

Example definition:

property_definition = {
    'property-id': 'default',
    'property-title': 'A default property used for testing',
    'property-description': 'A description of the property',
    'energy': {'type': 'float', 'has-unit': True, 'extent': [], 'required': True, 'description': 'empty'},
    'stress': {'type': 'float', 'has-unit': True, 'extent': [6], 'required': True, 'description': 'empty'},
    'name': {'type': 'string', 'has-unit': False, 'extent': [], 'required': True, 'description': 'empty'},
    'nd-same-shape': {'type': 'float', 'has-unit': True, 'extent': [2,3,5], 'required': True, 'description': 'empty'},
    'nd-diff-shape': {'type': 'float', 'has-unit': True, 'extent': [":", ":", ":"], 'required': True, 'description': 'empty'},
    'forces': {'type': 'float', 'has-unit': True, 'extent': [":", 3], 'required': True, 'description': 'empty'},
    'nd-same-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', 2, 3], 'required': True, 'description': 'empty'},
    'nd-diff-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', ':', ':'], 'required': True, 'description': 'empty'},
}
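
A basic structural check of a definition like the one above might look as follows. This is not the OpenKIM validation performed by the library: `find_definition_problems` is a hypothetical helper, and the required keys are inferred only from the example definition.

```python
REQUIRED_KEYS = {'type', 'has-unit', 'extent', 'required', 'description'}
TOP_LEVEL_KEYS = {'property-id', 'property-title', 'property-description'}

def find_definition_problems(definition):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f'missing top-level key: {key}'
                for key in sorted(TOP_LEVEL_KEYS - set(definition))]
    for name, field in definition.items():
        if name in TOP_LEVEL_KEYS:
            continue
        missing = REQUIRED_KEYS - set(field)
        if missing:
            problems.append(f'{name}: missing {sorted(missing)}')
    return problems

definition = {
    'property-id': 'default',
    'property-title': 'A default property used for testing',
    'property-description': 'A description of the property',
    'energy': {'type': 'float', 'has-unit': True, 'extent': [],
               'required': True, 'description': 'empty'},
    # Deliberately incomplete field to show what the check reports
    'forces': {'type': 'float', 'has-unit': True, 'extent': [':', 3]},
}
problems = find_definition_problems(definition)
```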

insert_property_settings(ps_object)

Inserts a new property settings object into the database by creating and populating the necessary groups in /root/property_settings.

Parameters

ps_object (PropertySettings) – The PropertySettings object to insert into the database.

Returns

The ID of the inserted property settings object. Equals the hash of the object.

Return type

ps_id (str)
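
Using a hash of the object as its ID means that identical settings map to the same ID. The source does not say which hash function is used, so the sketch below is only an assumption: `content_hash` is a hypothetical helper that hashes a canonical JSON serialization with SHA-256.

```python
import hashlib
import json

def content_hash(document):
    """Return a deterministic hex digest of a JSON-serializable document."""
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(document, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

# Same content, different key order
settings_a = {'method': 'VASP', 'labels': ['Monkhorst-Pack']}
settings_b = {'labels': ['Monkhorst-Pack'], 'method': 'VASP'}
id_a = content_hash(settings_a)
id_b = content_hash(settings_b)
```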

plot_histograms(fields=None, query=None, ids=None, verbose=False, nbins=100, xscale='linear', yscale='linear', method='matplotlib')

Generates histograms of the given fields.

Parameters
  • fields (list or str) – The names of the fields to plot

  • query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.

  • ids (list or str) – The IDs of the objects to plot the data for

  • verbose (bool, default=False) – If True, prints progress bar

  • nbins (int) – Number of bins per histogram

  • xscale (str) – Scaling for x-axes. One of [‘linear’, ‘log’].

  • yscale (str) – Scaling for y-axes. One of [‘linear’, ‘log’].

  • method (str, default='matplotlib') – Package to use for plotting. ‘plotly’ or ‘matplotlib’.

Returns

The figure object.

resync_configuration_set(cs_id, verbose=False)

Re-synchronizes the configuration set by re-aggregating the information from the configurations.

Parameters
  • cs_id (str) – The ID of the configuration set to update

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

None; updates the configuration set document in-place

resync_dataset(ds_id, verbose=False)

Re-synchronizes the dataset by aggregating all necessary data from properties and configuration sets. Note that this also calls colabfit.tools.client.resync_configuration_set()

Parameters
  • ds_id (str) – The ID of the dataset to update

  • verbose (bool, default=False) – If True, prints a progress bar

Returns

None; updates the dataset document in-place

colabfit.tools.database.load_data(file_path, file_format, name_field, elements, default_name='', labels_field=None, reader=None, glob_string=None, generator=True, verbose=False, **kwargs)

Loads a list of Configuration objects.

Parameters
  • file_path (str) – Path to the file or folder containing the data

  • file_format (str) – A string for specifying the type of Converter to use when loading the configurations. Allowed values are ‘xyz’, ‘extxyz’, ‘cfg’, or ‘folder’.

  • name_field (str) – Key name to use to access ase.Atoms.info[<name_field>] to obtain the name of a configuration once the atoms have been loaded from the data file. Note that if file_format == ‘folder’, name_field will be set to ‘name’.

  • elements (list) – A list of strings of element types

  • default_name (str) – Default name to be used if name_field is None.

  • labels_field (str) – Key name to use to access ase.Atoms.info[<labels_field>] to obtain the labels that should be applied to the configuration. This field should contain a comma-separated list of strings

  • reader (callable) – An optional function for loading configurations from a file. Only used for file_format == ‘folder’

  • glob_string (str) – A string to use with Path(file_path).rglob(glob_string) to generate a list of files to be passed to reader. Only used for file_format == ‘folder’.

  • generator (bool, default=True) – If True, returns a generator of Configurations. If False, returns a list.

  • verbose (bool) – If True, prints progress bar.

All other keyword arguments will be passed to converter.load(…, **kwargs)
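
The ‘folder’ behavior (a glob_string used with Path(file_path).rglob and a user-supplied reader) and the comma-separated labels field can be sketched with the standard library. `read_folder` and `parse_labels` are hypothetical helpers, and the temporary files exist only so the sketch runs on its own.

```python
from pathlib import Path
import tempfile

def read_folder(file_path, glob_string, reader):
    """Yield reader(path) for every file under file_path matching glob_string."""
    for path in sorted(Path(file_path).rglob(glob_string)):
        yield reader(path)

def parse_labels(raw):
    """Split a comma-separated labels string into a set of stripped labels."""
    return {label.strip() for label in raw.split(',') if label.strip()}

# Build a small throwaway folder so the sketch can run on its own
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'sub').mkdir()
    (Path(tmp) / 'a.xyz').write_text('first')
    (Path(tmp) / 'sub' / 'b.xyz').write_text('second')
    (Path(tmp) / 'notes.txt').write_text('ignored')
    contents = list(read_folder(tmp, '*.xyz', reader=lambda p: p.read_text()))

labels = parse_labels('bulk, fcc,strained')
```

Because `rglob` searches recursively, files in sub-folders are picked up as well; the paths are sorted here only to make the order deterministic.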