Database
- class colabfit.tools.database.MongoDatabase(database_name, nprocs=1, drop_database=False, user=None, pwrd=None, port=27017)
A MongoDatabase stores all of its data in Mongo documents and provides additional functionality such as filtering and optimized queries.
The Mongo database has the following structure:
/configurations
    _id
    atomic_numbers
    positions
    cell
    pbc
    names
    labels
    elements
    nelements
    elements_ratios
    chemical_formula_reduced
    chemical_formula_anonymous
    chemical_formula_hill
    nsites
    dimension_types
    nperiodic_dimensions
    latice_vectors
    last_modified
    relationships
        properties
        configuration_sets
/property_definitions
    _id
    definition
/properties
    _id
    type
    property_name
        each field in the property definition
    methods
    labels
    last_modified
    relationships
        property_settings
        configurations
/property_settings
    _id
    method
    decription
    labels
    files
        file_name
        file_contents
    relationships
        properties
/configuration_sets
    _id
    last_modified
    aggregated_info
        (from configurations)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        labels
        labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types
    relationships
        configurations
        datasets
/datasets
    _id
    last_modified
    aggregated_info
        (from configuration sets)
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        configuration_labels
        configuration_labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types
        (from properties)
        property_types
        property_fields
        methods
        methods_counts
        property_labels
        property_labels_counts
    relationships
        properties
        configuration_sets
- database_name
The name of the Mongo database
- Type
str
- configurations
A Mongo collection of configuration documents
- Type
Collection
- properties
A Mongo collection of property documents
- Type
Collection
- property_definitions
A Mongo collection of property definitions
- Type
Collection
- property_settings
A Mongo collection of property setting documents
- Type
Collection
- configuration_sets
A Mongo collection of configuration set documents
- Type
Collection
- datasets
A Mongo collection of dataset documents
- Type
Collection
- aggregate_configuration_info(ids, verbose=False)
Gathers the following information from a collection of configurations:
nconfigurations: the total number of configurations
nsites: the total number of sites
nelements: the total number of unique element types
elements: the element types
individual_elements_ratios: a set of elements ratios generated by looping over each configuration, extracting its concentration of each element, and adding the tuple of concentrations to the set
total_elements_ratios: the ratio of the total count of atoms of each element type over nsites
labels: the union of all configuration labels
labels_counts: the total count of each label
chemical_formula_reduced: the set of all reduced chemical formulae
chemical_formula_anonymous: the set of all anonymous chemical formulae
chemical_formula_hill: the set of all Hill chemical formulae
nperiodic_dimensions: the set of all numbers of periodic dimensions
dimension_types: the set of all periodic boundary choices
- Parameters
verbose (bool, default=False) – If True, prints a progress bar
- Returns
All of the aggregated info
- Return type
aggregated_info (dict)
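The aggregation described above can be illustrated with a minimal, database-free sketch. It assumes each configuration is a plain dict with hypothetical `symbols` and `labels` keys; the real method works on Mongo documents and may compute these fields differently.

```python
from collections import Counter


def aggregate(configs):
    """Aggregate a few summary fields over in-memory configuration dicts."""
    atom_counts = Counter()
    labels_counts = Counter()
    individual_ratios = set()
    for config in configs:
        counts = Counter(config["symbols"])
        n = len(config["symbols"])
        # Per-configuration concentrations as a sorted tuple of (element, ratio)
        individual_ratios.add(tuple(sorted((el, c / n) for el, c in counts.items())))
        atom_counts.update(counts)
        labels_counts.update(config["labels"])
    nsites = sum(atom_counts.values())
    elements = sorted(atom_counts)
    labels = sorted(labels_counts)
    return {
        "nconfigurations": len(configs),
        "nsites": nsites,
        "nelements": len(elements),
        "elements": elements,
        "individual_elements_ratios": individual_ratios,
        "total_elements_ratios": [atom_counts[el] / nsites for el in elements],
        "labels": labels,
        "labels_counts": [labels_counts[label] for label in labels],
    }


# Hypothetical input: one water molecule and one O2 molecule
configs = [
    {"symbols": ["H", "H", "O"], "labels": ["water"]},
    {"symbols": ["O", "O"], "labels": ["gas", "water"]},
]
info = aggregate(configs)
```

Note that `total_elements_ratios` is computed from the pooled atom counts, whereas `individual_elements_ratios` keeps one entry per distinct per-configuration composition.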
- aggregate_configuration_set_info(cs_ids, resync=False, verbose=False)
Aggregates the following information from a list of configuration sets:
nconfigurations
nsites
chemical_systems
nelements
elements
individual_elements_ratios
total_elements_ratios
labels
labels_counts
chemical_formula_reduced
chemical_formula_anonymous
chemical_formula_hill
nperiodic_dimensions
dimension_types
- Parameters
cs_ids (list or str) – The IDs of the configuration sets to aggregate information from
resync (bool, default=False) – If True, re-synchronizes each configuration set before aggregating the information.
verbose (bool, default=False) – If True, prints a progress bar
- Returns
All of the aggregated info
- Return type
aggregated_info (dict)
- aggregate_dataset_info(ds_ids)
Aggregates information from a list of datasets.
NOTE: this will face all of the same challenges as aggregate_configuration_set_info(): you need to find the overlap of COs and PRs.
- aggregate_property_info(pr_ids, verbose=False)
Aggregates the following information from a list of properties:
types
labels
labels_counts
- Parameters
pr_ids (list or str) – The IDs of the properties to aggregate information from
verbose (bool, default=False) – If True, prints a progress bar
- Returns
All of the aggregated info
- Return type
aggregated_info (dict)
- apply_labels(dataset_id, collection_name, query, labels, verbose=False)
Applies the given labels to all objects in the specified collection that match the query and are linked to the given dataset.
- Parameters
dataset_id (str) – The ID of the dataset. Used as a safety measure to only update entries for the given dataset.
collection_name (str) – One of ‘configurations’ or ‘properties’.
query (dict) – A Mongo-style query for filtering the collection. For example: query = {'nsites': {'$lt': 100}}.
labels (set or str) – A set of labels to apply to the matching entries.
verbose (bool) – If True, prints progress bar.
- Pseudocode:
Get the IDs of the configurations that match the query
Use updateMany to update the MongoDB
Iterate over the HDF5 entries.
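As a rough illustration of the first two pseudocode steps, the sketch below builds the filter and update documents that a bulk label update could use. The function name, the example IDs, and the choice of the `$addToSet`/`$each` operators are assumptions for illustration; the library's actual update may be constructed differently.

```python
def build_label_update(linked_ids, query, labels):
    """Return (filter, update) documents for a Mongo-style bulk label update."""
    if isinstance(labels, str):
        labels = {labels}
    # Restrict the user's query to documents already linked to the dataset
    restricted = dict(query)
    restricted["_id"] = {"$in": list(linked_ids)}
    # $addToSet with $each adds every label without creating duplicates
    update = {"$addToSet": {"labels": {"$each": sorted(labels)}}}
    return restricted, update


# Hypothetical configuration IDs linked to the dataset
filter_doc, update_doc = build_label_update(
    ["CO_1", "CO_2"], {"nsites": {"$lt": 100}}, "small"
)
```

With PyMongo, such a pair would typically be passed to `Collection.update_many(filter_doc, update_doc)`.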
- concatenate_configurations()
Concatenates the atomic_numbers, positions, cells, and pbcs groups in /configurations.
- dataset_from_markdown(html_file_path, generator=False, verbose=False)
Loads a Dataset from a markdown file.
- Parameters
html_file_path (str) – The full path to the markdown file
generator (bool, default=False) – If True, uses a generator when inserting data.
verbose (bool, default=False) – If True, prints progress bars
- Returns
The Dataset object after adding it to the Database
- Return type
dataset (Dataset)
- dataset_to_markdown(ds_id, base_folder, html_file_name, data_file_name, data_format, name_field='_name', histogram_fields=None, yscale='linear')
Saves a Dataset and writes a properly formatted markdown file. In the case of a Dataset that has child Dataset objects, each child Dataset is written to a separate sub-folder.
- Parameters
ds_id (str) – The ID of the dataset.
base_folder (str) – Top-level folder in which to save the markdown and data files
html_file_name (str) – Name of file to save markdown to
data_file_name (str) – Name of file to save configuration and properties to
data_format (str, default='mongo') – Format to use for data file. If ‘mongo’, does not save the configurations to a new file, and instead adds the ID of the Dataset in the Mongo Database.
name_field (str) – The name of the field that should be used to generate configuration names
histogram_fields (list, default=None) – The property fields to include in the histogram plot. If None, plots all fields.
yscale (str, default='linear') – Scaling to use for histogram plotting
- filter_on_configurations(ds_id, query, verbose=False)
Searches the configuration sets of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.
The returned configuration sets will only include configurations that return True for the filter
The returned property IDs will only include properties that point to a configuration that returned True for the filter.
- Parameters
ds_id (str) – The ID of the dataset to filter
query (dict) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.
verbose (bool, default=False) – If True, prints progress bars
- Returns
configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter
property_ids (list) – A list of property IDs that satisfy the filter
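The pruning behavior can be sketched without a database. The example below assumes configuration sets are plain lists of configuration IDs and that each property document carries a hypothetical `configurations` list; it only illustrates the rule stated above, not the library's implementation.

```python
def prune_to_matches(configuration_sets, properties, matching_ids):
    """Keep only matching configurations, and properties that point to one."""
    matching = set(matching_ids)
    pruned_sets = {
        cs_id: [co_id for co_id in co_ids if co_id in matching]
        for cs_id, co_ids in configuration_sets.items()
    }
    property_ids = [
        pr["_id"] for pr in properties
        if matching.intersection(pr["configurations"])
    ]
    return pruned_sets, property_ids


# Hypothetical IDs; CO_1 and CO_3 are the configurations that passed the query
configuration_sets = {"CS_1": ["CO_1", "CO_2"], "CS_2": ["CO_3"]}
properties = [
    {"_id": "PR_1", "configurations": ["CO_1"]},
    {"_id": "PR_2", "configurations": ["CO_2"]},
    {"_id": "PR_3", "configurations": ["CO_3"]},
]
pruned_sets, property_ids = prune_to_matches(
    configuration_sets, properties, ["CO_1", "CO_3"]
)
```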
- filter_on_properties(ds_id, filter_fxn=None, query=None, fields=None, verbose=False)
Searches the properties of a given dataset, and returns configuration sets and properties that have been filtered based on the given criterion.
The returned configuration sets will only include configurations that are pointed to by a property that returned True for the filter
The returned property IDs will only include properties that returned True for the filter function.
Example:
configuration_sets, property_ids = database.filter_on_properties( ds_id=..., filter_fxn=lambda x: np.max(np.abs(x['])) )
- Parameters
ds_id (str) – The ID of the dataset to filter
filter_fxn (callable, default=None) – A callable function to use as filter(filter_fxn, cursor), where cursor is a Mongo cursor over all of the property documents in the given dataset. If filter_fxn is None, must specify query.
query (dict, default=None) – A Mongo query that will return the desired objects. Note that the key-value pair {'_id': {'$in': ...}} will be included automatically to filter on only the objects that are already linked to the given dataset.
fields (str or list, default=None) – The fields required by filter_fxn. Providing the minimum number of necessary fields can improve query performance.
verbose (bool, default=False) – If True, prints progress bars
- Returns
configuration_sets (list) – A list of configuration sets that have been pruned to only include configurations that satisfy the filter
property_ids (list) – A list of property IDs that satisfy the filter
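To show how a filter function is applied as filter(filter_fxn, cursor), here is a small sketch in which a plain list stands in for the Mongo cursor. The document layout (a flat `forces` field and a `configurations` list) and the 1.0 threshold are assumptions made for the example.

```python
# A list of dicts stands in for the cursor over property documents
property_docs = [
    {"_id": "PR_1", "forces": [[0.1, -0.2, 0.0]], "configurations": ["CO_1"]},
    {"_id": "PR_2", "forces": [[3.5, 0.0, -0.4]], "configurations": ["CO_2"]},
]


def max_abs_force(doc):
    """Largest absolute force component in a property document."""
    return max(abs(x) for row in doc["forces"] for x in row)


# Keep only the properties whose largest force component is below the threshold
kept = list(filter(lambda doc: max_abs_force(doc) < 1.0, property_docs))
kept_property_ids = [doc["_id"] for doc in kept]
# Configurations survive only if a kept property points to them
kept_config_ids = sorted({co for doc in kept for co in doc["configurations"]})
```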
- get_configuration(i, property_ids=None, attach_properties=False)
Returns a single configuration by calling get_configurations().
- get_configuration_set(cs_id, resync=False)
Returns the configuration set with the given ID.
- Parameters
cs_id (str) – The ID of the configuration set to return
resync (bool) – If True, re-aggregates the configuration set information before returning. Default is False.
- Returns
A dictionary with two keys:
‘last_modified’: a datetime string
‘configuration_set’: the configuration set object
- Return type
dict
- get_configurations(configuration_ids, property_ids=None, attach_properties=False, attach_settings=False, generator=False, verbose=False)
A generator that returns in-memory Configuration objects one at a time by loading the atomic numbers, positions, cells, and PBCs.
- Parameters
configuration_ids (list or 'all') – A list of string IDs specifying which Configurations to return. If ‘all’, returns all of the configurations in the database.
property_ids (list, default=None) – A list of Property IDs. Used for limiting searches when attach_properties==True. If None, attach_properties will attach all linked Properties. Note that this only attaches one property per Configuration, so if multiple properties point to the same Configuration, that Configuration will be returned multiple times.
attach_properties (bool, default=False) – If True, attaches all the data of any linked properties from property_ids. The property data will either be added to the arrays dictionary on a Configuration (if it can be converted to a matrix where the first dimension is the same as the number of atoms in the Configuration) or the info dictionary (if it wasn’t added to arrays). Property fields are stored in a list to accommodate the possibility of multiple properties of the same type pointing to the same configuration. WARNING: don’t use this option if multiple properties of the same type point to the same Configuration, but the properties don’t have values for all of their fields.
attach_settings (bool, default=False) – If True, attaches all of the fields of the property settings that are linked to the attached property instances. If attach_settings=True, must also have attach_properties=True.
generator (bool, default=False) – If True, this function returns a generator of the configurations. This is useful if the configurations can’t all fit in memory at the same time.
verbose (bool) – If True, prints progress bar
- Returns
A list or generator of the re-constructed configurations
- Return type
configurations (iterable)
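The rule for where attached property data ends up (arrays versus info) can be sketched as follows. This is a simplified stand-in that uses nested lists instead of arrays and a hypothetical `attach` helper; the library's own shape check may be stricter.

```python
def attach(natoms, property_fields):
    """Split property fields into per-atom 'arrays' and everything-else 'info'."""
    arrays, info = {}, {}
    for name, value in property_fields.items():
        # Per-atom data: the first dimension matches the number of atoms
        per_atom = isinstance(value, list) and len(value) == natoms
        target = arrays if per_atom else info
        # Values are stored in a list so that several properties of the same
        # type can be attached to the same configuration
        target.setdefault(name, []).append(value)
    return arrays, info


# Hypothetical two-atom configuration with one energy/forces property
arrays, info = attach(
    2, {"energy": -1.0, "forces": [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]}
)
```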
- get_data(collection_name, fields, query=None, ids=None, keep_ids=False, concatenate=False, vstack=False, ravel=False, unpack_properties=True, verbose=False)
Queries the database and returns the fields specified by keys as a list or an array of values. Returns the results in memory.
Example:
    data = database.get_data(
        collection_name='properties',
        query={'_id': {'$in': <list_of_property_IDs>}},
        fields=['property_name_1.energy', 'property_name_1.forces'],
    )
- Parameters
collection_name (str) – The name of a collection in the database.
fields (list or str) – The fields to return from the documents. Sub-fields can be returned by providing names separated by periods (‘.’)
query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.
ids (list) – The list of IDs to return the data for. If None, returns the data for the entire collection. Note that this information can also be provided using the query argument.
keep_ids (bool, default=False) – If True, includes the ‘_id’ field as one of the returned values.
concatenate (bool, default=False) – If True, concatenates the data before returning.
vstack (bool, default=False) – If True, calls np.vstack on data before returning.
ravel (bool, default=False) – If True, concatenates and ravels the data before returning.
unpack_properties (bool, default=True) – If True, returns only the contents of the 'source-value' key for each field in fields (assuming 'source-value' exists). Users who wish to return the full dictionaries for fields should set unpack_properties=False.
verbose (bool, default=False) – If True, prints a progress bar
- Returns
key = k for k in keys. val = in-memory data
- Return type
data (dict)
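The dotted-field lookup and the effect of unpack_properties can be illustrated with a small sketch on a plain dict. The `get_field` helper and the example document (including its 'source-unit' key) are assumptions for illustration, not the library's internals.

```python
def get_field(document, dotted_name, unpack_properties=True):
    """Follow a period-separated field name into a nested document."""
    value = document
    for key in dotted_name.split("."):
        value = value[key]
    # Optionally return only the 'source-value' entry when it exists
    if unpack_properties and isinstance(value, dict) and "source-value" in value:
        return value["source-value"]
    return value


# Hypothetical property document
doc = {"property_name_1": {"energy": {"source-value": -3.2, "source-unit": "eV"}}}

energy = get_field(doc, "property_name_1.energy")
full = get_field(doc, "property_name_1.energy", unpack_properties=False)
```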
- get_dataset(ds_id, resync=False, verbose=False)
Returns the dataset with the given ID.
- Parameters
ds_id (str) – The ID of the dataset to return
resync (bool) – If True, re-aggregates the configuration set and property information before returning. Default is False.
verbose (bool, default=False) – If True, prints a progress bar. Only used if resync=False.
- Returns
A dictionary with two keys:
‘last_modified’: a datetime string
‘dataset’: the dataset object
- Return type
dict
- get_property_definition(name)
- get_property_settings(pso_id)
- get_statistics(fields, query=None, ids=None, verbose=False)
Queries the database and returns summary statistics (average, standard deviation, minimum, maximum, and average absolute value) for each of the given fields. Returns the results in memory.
Example:
    statistics = database.get_statistics(
        fields=['property_name_1.energy', 'property_name_1.forces'],
        query={'_id': {'$in': <list_of_property_IDs>}},
    )
- Parameters
fields (list or str) – The fields to compute statistics for. Sub-fields can be specified by providing names separated by periods (‘.’)
query (dict, default=None) – A Mongo query dictionary. If None, uses the data for all of the documents in the collection.
ids (list) – The list of IDs to use the data for. If None, uses the data for the entire collection. Note that this information can also be provided using the query argument.
verbose (bool, default=False) – If True, prints a progress bar during data extraction
- Returns
results (dict):
    {
        f: {
            'average': np.average(data),
            'std': np.std(data),
            'min': np.min(data),
            'max': np.max(data),
            'average_abs': np.average(np.abs(data))
        } for f in fields
    }
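For readers without NumPy at hand, the same five statistics can be reproduced with the standard library. This is only a sketch of the arithmetic for a single field's flattened data; `summarize` is a hypothetical helper, and `statistics.pstdev` is used because `np.std` defaults to the population standard deviation.

```python
import statistics


def summarize(data):
    """Compute the statistics listed above for one field's flattened values."""
    data = list(data)
    return {
        "average": statistics.fmean(data),
        "std": statistics.pstdev(data),  # population std, like np.std's default
        "min": min(data),
        "max": max(data),
        "average_abs": statistics.fmean([abs(x) for x in data]),
    }


stats = summarize([-2.0, 0.0, 2.0])
```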
- insert_configuration_set(ids, description='', verbose=False)
Inserts the configuration set of IDs to the database.
- Parameters
ids (list or str) – The IDs of the configurations to include in the configuration set.
description (str, optional) – A human-readable description of the configuration set.
verbose (bool, default=False) – If True, prints a progress bar
- insert_data(configurations, property_map=None, transform=None, generator=False, verbose=True)
A wrapper to Database.insert_data() which also adds important queryable metadata about the configurations into the Client’s server.
Note that when adding the data, the Mongo server will store the bi-directional relationships between the data. For example, a property will point to its configurations, but those configurations will also point back to any linked properties.
- Parameters
configurations (list or Configuration) – The list of configurations to be added.
property_map (dict) – A dictionary that is used to specify how to load a defined property off of a configuration. Note that the top-level keys in the map must be the names of properties that have been previously defined using insert_property_definition().
Example:
    property_map = {
        'energy-forces-stress': {
            # ColabFit name: {'field': ASE field name, 'units': str}
            'energy': {'field': 'energy', 'units': 'eV'},
            'forces': {'field': 'forces', 'units': 'eV/Ang'},
            'stress': {'field': 'virial', 'units': 'GPa'},
            'per-atom': {'field': 'per-atom', 'units': None},
            '_settings': {
                '_method': 'VASP',
                '_description': 'A static VASP calculation',
                '_files': None,
                '_labels': ['Monkhorst-Pack'],
                'xc-functional': {'field': 'xcf', 'units': None}
            }
        }
    }
If None, only loads the configuration information (atomic numbers, positions, lattice vectors, and periodic boundary conditions).
The ‘_settings’ key is a special key that can be used to specify the contents of a PropertySettings object that will be constructed and linked to each associated property instance.
transform (callable, default=None) – If provided, transform will be called on each configuration in configurations as transform(configuration). Note that this happens before anything else is done. transform should modify the Configuration in-place.
generator (bool, default=False) – If True, returns a generator of the results; otherwise returns a list. If True, uses update_one instead of bulk_write to avoid having to store update documents in memory.
verbose (bool, default=True) – If True, prints a progress bar
- Returns
A list of (config_id, property_id) tuples of the inserted data. If no properties were inserted, then property_id will be None.
- Return type
ids (list)
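To make the role of property_map more concrete, the sketch below applies one property's field map to a configuration's info dictionary. The `extract_property` helper and the 'source-value'/'source-unit' output layout are assumptions for illustration; the real insertion logic also handles the ‘_settings’ key, arrays, and hashing.

```python
def extract_property(info, field_map):
    """Build a property dict from a configuration's info using a field map."""
    prop = {}
    for colabfit_name, spec in field_map.items():
        if colabfit_name.startswith("_"):
            continue  # special keys such as '_settings' are handled separately
        if spec["field"] in info:
            prop[colabfit_name] = {
                "source-value": info[spec["field"]],
                "source-unit": spec["units"],
            }
    return prop


field_map = {
    "energy": {"field": "energy", "units": "eV"},
    "stress": {"field": "virial", "units": "GPa"},
    "_settings": {"_method": "VASP"},
}
# Hypothetical info dictionary as loaded from a data file
info = {"energy": -1.5, "virial": [0.1, 0.1, 0.1, 0.0, 0.0, 0.0]}
prop = extract_property(info, field_map)
```

The ColabFit-side name (for example 'stress') becomes the key in the result, while the value is read from the differently named source field ('virial').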
- insert_dataset(cs_ids, pr_ids, name, authors=None, links=None, description='', resync=False, verbose=False)
Inserts a dataset into the database.
- Parameters
cs_ids (list or str) – The IDs of the configuration sets to link to the dataset.
pr_ids (list or str) – The IDs of the properties to link to the dataset
name (str) – The name of the dataset
authors (list or str or None) – The names of the authors of the dataset. If None, then no authors are added.
links (list or str or None) – External links (e.g., journal articles, Git repositories, …) to be associated with the dataset. If None, then no links are added.
description (str or None) – A human-readable description of the dataset. If None, then no description is added.
resync (bool) – If True, re-synchronizes the configuration sets and properties before adding to the dataset. Default is False.
verbose (bool, default=False) – If True, prints a progress bar
- Returns
The ID of the inserted dataset
- Return type
ds_id (str)
- insert_property_definition(definition)
Inserts a new property definition into the database. Checks that definition is valid, then builds all necessary groups in /root/properties. Throws an error if the property already exists.
- Parameters
definition (dict or string) – The map defining the property. See the example below, or the OpenKIM Properties Framework for more details. If a string is provided, it must be the name of an existing property definition from the OpenKIM Properties List.
Example definition:
    property_definition = {
        'property-id': 'default',
        'property-title': 'A default property used for testing',
        'property-description': 'A description of the property',
        'energy': {'type': 'float', 'has-unit': True, 'extent': [], 'required': True, 'description': 'empty'},
        'stress': {'type': 'float', 'has-unit': True, 'extent': [6], 'required': True, 'description': 'empty'},
        'name': {'type': 'string', 'has-unit': False, 'extent': [], 'required': True, 'description': 'empty'},
        'nd-same-shape': {'type': 'float', 'has-unit': True, 'extent': [2,3,5], 'required': True, 'description': 'empty'},
        'nd-diff-shape': {'type': 'float', 'has-unit': True, 'extent': [":", ":", ":"], 'required': True, 'description': 'empty'},
        'forces': {'type': 'float', 'has-unit': True, 'extent': [":", 3], 'required': True, 'description': 'empty'},
        'nd-same-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', 2, 3], 'required': True, 'description': 'empty'},
        'nd-diff-shape-arr': {'type': 'float', 'has-unit': True, 'extent': [':', ':', ':'], 'required': True, 'description': 'empty'},
    }
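The 'extent' entries in the example definition describe the expected shape of each field, with ":" standing for a dimension of any length. A minimal sketch of such a shape check on nested lists is shown below; `shape_of` and `matches_extent` are hypothetical helpers, and this is not the validation the library itself performs.

```python
def shape_of(value):
    """Return the shape of a (regular) nested list; scalars have shape []."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape


def matches_extent(value, extent):
    """Check a value's shape against an extent, where ':' matches any length."""
    shape = shape_of(value)
    if len(shape) != len(extent):
        return False
    return all(e == ":" or e == n for e, n in zip(extent, shape))


forces = [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
scalar_ok = matches_extent(-1.5, [])             # scalar field such as 'energy'
forces_ok = matches_extent(forces, [":", 3])     # any number of atoms, 3 columns
stress_bad = matches_extent([1.0, 2.0, 3.0], [6])  # wrong length for 'stress'
```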
- insert_property_settings(ps_object)
Inserts a new property settings object into the database by creating and populating the necessary groups in /root/property_settings.
- Parameters
ps_object (PropertySettings) – The PropertySettings object to insert into the database.
- Returns
The ID of the inserted property settings object. Equals the hash of the object.
- Return type
ps_id (str)
- plot_histograms(fields=None, query=None, ids=None, verbose=False, nbins=100, xscale='linear', yscale='linear', method='matplotlib')
Generates histograms of the given fields.
- Parameters
fields (list or str) – The names of the fields to plot
query (dict, default=None) – A Mongo query dictionary. If None, returns the data for all of the documents in the collection.
ids (list or str) – The IDs of the objects to plot the data for
verbose (bool, default=False) – If True, prints progress bar
nbins (int) – Number of bins per histogram
xscale (str) – Scaling for x-axes. One of [‘linear’, ‘log’].
yscale (str) – Scaling for y-axes. One of [‘linear’, ‘log’].
method (str, default='matplotlib') – Package to use for plotting. ‘plotly’ or ‘matplotlib’.
- Returns
Returns the figure object.
- resync_configuration_set(cs_id, verbose=False)
Re-synchronizes the configuration set by re-aggregating the information from the configurations.
- Parameters
cs_id (str) – The ID of the configuration set to update
verbose (bool, default=False) – If True, prints a progress bar
- Returns
None; updates the configuration set document in-place
- resync_dataset(ds_id, verbose=False)
Re-synchronizes the dataset by aggregating all necessary data from properties and configuration sets. Note that this also calls colabfit.tools.client.resync_configuration_set().
- Parameters
ds_id (str) – The ID of the dataset to update
verbose (bool, default=False) – If True, prints a progress bar
- Returns
None; updates the dataset document in-place
- colabfit.tools.database.load_data(file_path, file_format, name_field, elements, default_name='', labels_field=None, reader=None, glob_string=None, generator=True, verbose=False, **kwargs)
Loads a list of Configuration objects.
- Parameters
file_path (str) – Path to the file or folder containing the data
file_format (str) – A string for specifying the type of Converter to use when loading the configurations. Allowed values are ‘xyz’, ‘extxyz’, ‘cfg’, or ‘folder’.
name_field (str) – Key name to use to access ase.Atoms.info[<name_field>] to obtain the name of a configuration once the atoms have been loaded from the data file. Note that if file_format == ‘folder’, name_field will be set to ‘name’.
elements (list) – A list of strings of element types
default_name (str) – Default name to be used if name_field==None.
labels_field (str) – Key name to use to access ase.Atoms.info[<labels_field>] to obtain the labels that should be applied to the configuration. This field should contain a comma-separated list of strings.
reader (callable) – An optional function for loading configurations from a file. Only used for file_format == ‘folder’
glob_string (str) – A string to use with Path(file_path).rglob(glob_string) to generate a list of files to be passed to self.reader. Only used for file_format == ‘folder’.
generator (bool, default=True) – If True, returns a generator of Configurations. If False, returns a list.
verbose (bool) – If True, prints progress bar.
All other keyword arguments will be passed with converter.load(…, **kwargs)
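Since labels_field is expected to hold a comma-separated list of strings, a loader needs to split that value into individual labels. The sketch below shows one plausible way to do so on a plain info dictionary; `parse_labels` and the example field name are hypothetical and not part of the documented API.

```python
def parse_labels(info, labels_field):
    """Split a comma-separated labels entry into a set of clean label strings."""
    raw = info.get(labels_field, "")
    # Strip whitespace and drop empty entries (e.g. from a trailing comma)
    return {label.strip() for label in raw.split(",") if label.strip()}


# Hypothetical info dictionary, as might be found on an ase.Atoms object
labels = parse_labels({"tags": "bulk, strained ,fcc,"}, "tags")
no_labels = parse_labels({}, "tags")
```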