Mongo overview
MongoDB structure
The structure of the underlying Mongo database is described below. Lines preceded by a forward slash (/) denote Mongo collections. All other lines denote queryable fields. Indentation denotes sub-fields.
/configurations
    _id: a unique string identifier
    atomic_numbers: the atomic numbers for each atom
    positions: the Cartesian coordinates of each atom
    cell: the cell lattice vectors
    pbc: periodic boundary conditions along each cell vector
    names: human-readable names for the Configuration
    labels: labels applied to the Configuration to improve queries
    elements: the set of element types in the Configuration
    nelements: the number of unique element types in the Configuration
    elements_ratios: the relative concentrations of each element type
    chemical_formula_reduced: a reduced chemical formula
    chemical_formula_anonymous: an anonymous chemical formula (without specific element types)
    chemical_formula_hill: the chemical formula in Hill notation
    nsites: the number of sites (atoms) in the Configuration
    dimension_types: same as pbc
    nperiodic_dimensions: the number of periodic dimensions
    latice_vectors: same as cell
    last_modified: timestamp of when the entry was last modified
    relationships: pointers to linked entries
        properties: IDs of linked Properties
        configuration_sets: IDs of linked ConfigurationSets
/properties
    _id: a unique string identifier
    type: the property type
    <property_name>: the property, with the same name as the contents of type
        field_name: the values of each field in the Property definition
    methods: duplication of the method field of any linked PropertySettings
    labels: duplication of the labels field of any linked PropertySettings
    last_modified: timestamp of when the entry was last modified
    relationships: pointers to linked entries
        property_settings: IDs of linked PropertySettings
        configurations: IDs of linked Configurations
/property_definitions
    _id: a unique string identifier
    definition: the full contents of a Property definition
/property_settings
    _id: a unique string identifier
    method: the method used in the calculation/experiment
    description: a human-readable description of the calculation/experiment
    labels: labels to improve queries
    files: linked files
        file_name: the name of the file
        file_contents: the contents of the file
    relationships: pointers to linked entries
        properties: IDs of linked Properties
/configuration_sets
    _id: a unique string identifier
    description: a human-readable description
    last_modified: timestamp of when the entry was last modified
    aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        labels
        labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
    relationships: pointers to linked entries
        configurations: IDs of linked Configurations
        datasets: IDs of linked Datasets
/datasets
    _id: a unique string identifier
    name: the name of the Dataset
    authors: the authors of the Dataset
    description: a human-readable description of the Dataset
    links: external links associated with the Dataset
    last_modified: timestamp of when the entry was last modified
    aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations and Properties
        nconfigurations
        nsites
        nelements
        chemical_systems
        elements
        individual_elements_ratios
        total_elements_ratios
        configuration_labels
        configuration_labels_counts
        chemical_formula_reduced
        chemical_formula_anonymous
        chemical_formula_hill
        nperiodic_dimensions
        dimension_types
        property_types
        property_types_counts
        property_fields
        property_fields_counts
        methods
        methods_counts
        property_labels
        property_labels_counts
    relationships: pointers to linked entries
        properties: IDs of linked Properties
        configuration_sets: IDs of linked ConfigurationSets
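To make the layout above concrete, the sketch below builds a hypothetical Configuration document as a plain Python dict and resolves a dotted path such as 'relationships.properties', which is how Mongo queries address sub-fields. All IDs and values are invented for illustration, only a subset of the fields is shown, and the get_path helper is not part of colabfit or pymongo; it only mimics the lookup in memory.

```python
# Hypothetical Configuration document; every value here is invented.
example_configuration = {
    '_id': 'CO_example',
    'atomic_numbers': [8, 1, 1],
    'pbc': [False, False, False],
    'elements': ['H', 'O'],
    'nelements': 2,
    'nsites': 3,
    'relationships': {
        'properties': ['PI_example_1', 'PI_example_2'],
        'configuration_sets': ['CS_example'],
    },
}


def get_path(document, dotted_path):
    """Follow a dotted path through nested dicts (illustrative helper only)."""
    value = document
    for key in dotted_path.split('.'):
        value = value[key]
    return value


linked_properties = get_path(example_configuration, 'relationships.properties')
print(linked_properties)
```

Each indentation level in the listing corresponds to one more component in the dotted path, so a sub-field of relationships is addressed as 'relationships.<sub-field>' in a query.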
Mongo usage
This section provides examples of how to perform various operations on the Database using Mongo. For more details, consult the MongoDB documentation.
Queries
Being able to formulate at least basic Mongo queries is essential for working with the Database. If you are new to Mongo, the query tutorials in the official Mongo manual are a good place to start.
Structure
Recall that when opening a connection to the Database, for example with the following code:
from colabfit.tools.database import MongoDatabase
client = MongoDatabase('colabfit_database')
the client object is a Mongo Client connected to the 'colabfit_database' Database in a running Mongo server. This Database will have the following collections: 'configurations', 'properties', 'property_settings', 'configuration_sets', and 'datasets', which are accessible as attributes. See MongoDB structure for more details.
Find one
Get an example of a single document in a collection that satisfies the given query.
# Find a Property document that is linked to the Dataset with an ID of ds_id
client.properties.find_one({'relationships.datasets': ds_id})
Count documents
Count the number of documents in a collection.
# Count the number of Configurations in the Database
client.configurations.count_documents({})
Get all documents
Get a list of all of the Datasets in the Database, then sort by name.
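sorted(
    list(
        # the projection returns only the 'name' field (plus '_id')
        client.datasets.find({}, {'name': 1})
    ),
    # sort case-insensitively by Dataset name
    key=lambda x: x['name'].lower()
)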
Check for multiple links
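Similar to what is done in detecting duplicates, the 'relationships' field can be useful for finding documents that are linked to multiple other documents. The key 'relationships.datasets.1' refers to the second element of the array, so it exists only when the array has at least two entries.
For example, to count how many ConfigurationSets are linked to more than one Dataset: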
client.configuration_sets.count_documents(
    {'relationships.datasets.1': {'$exists': True}}
)
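The same test can be sketched in plain Python to show what the query matches. The sample documents and the has_multiple_links helper below are hypothetical and run entirely in memory; they are not part of colabfit or pymongo.

```python
# Hypothetical ConfigurationSet documents; IDs are invented for illustration.
configuration_sets = [
    {'_id': 'CS_1', 'relationships': {'datasets': ['DS_1', 'DS_2']}},
    {'_id': 'CS_2', 'relationships': {'datasets': ['DS_1']}},
    {'_id': 'CS_3', 'relationships': {'datasets': []}},
]


def has_multiple_links(document, field='datasets'):
    """Mimic {'relationships.<field>.1': {'$exists': True}} for one document."""
    links = document.get('relationships', {}).get(field, [])
    # Index 1 exists only if the array holds at least two entries.
    return isinstance(links, list) and len(links) > 1


n_multi_linked = sum(has_multiple_links(cs) for cs in configuration_sets)
print(n_multi_linked)
```

Changing the index in the key (for example '.2') raises the threshold in the same way: the query then matches only documents with at least three linked entries.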
Get distinct fields
Get a set of all existing values of a given field:
# Get a list of the unique property types in the Database
client.properties.distinct('type')
Count occurrences
Aggregation pipelines can be extremely useful, but may be more difficult to understand for new users of Mongo. The example below shows how to use aggregation to count the occurrences of each Configuration label.
cursor = client.configurations.aggregate([
    # with no $match stage, the pipeline runs over all documents in the collection
    # $unwind: create one new document for each value in the 'labels' field
    {'$unwind': '$labels'},
    # $group: group the documents by their label and count each group
    {'$group': {'_id': '$labels', 'count': {'$sum': 1}}}
])
sorted(cursor, key=lambda x: x['count'], reverse=True)
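For readers new to aggregation, the following in-memory sketch reproduces what the $unwind and $group stages compute, using only the Python standard library. The sample Configurations are hypothetical, and this illustrates the logic only; it is not how Mongo executes the pipeline.

```python
from collections import Counter

# Hypothetical Configuration documents; labels are invented for illustration.
configurations = [
    {'_id': 'CO_a', 'labels': ['bulk', 'fcc']},
    {'_id': 'CO_b', 'labels': ['bulk']},
    {'_id': 'CO_c', 'labels': []},
]

# $unwind: one entry per (document, label) pair. Documents whose 'labels'
# array is empty or missing contribute nothing, as in Mongo's default behavior.
unwound = [label for doc in configurations for label in doc.get('labels', [])]

# $group with {'$sum': 1}: count how many times each label occurs.
counts = Counter(unwound)

# Same shape as the aggregation output, sorted by descending count.
results = [{'_id': label, 'count': n} for label, n in counts.most_common()]
print(results)
```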
Get Datasets linked to ConfigurationSets
The example below shows how to use aggregation to obtain a list of all ConfigurationSets in the Database, with the names of their linked Datasets.
cursor = client.configuration_sets.aggregate([
    # $project: only return the requested fields for each document
    {'$project': {'relationships.datasets': 1}},
    # $unwind: create a new document for each element in an array
    {'$unwind': '$relationships.datasets'},
    # $project: only return the renamed field
    {'$project': {'ds_id': '$relationships.datasets'}},
    # $lookup: pull the Dataset document with the given ID
    {'$lookup': {
        # pull from the 'datasets' collection
        'from': 'datasets',
        # match the local field 'ds_id' to the '_id' field in 'datasets'
        'localField': 'ds_id',
        'foreignField': '_id',
        # attach the matching Dataset documents as an array named 'linked_ds'
        'as': 'linked_ds'
    }},
    # $project: only return the name of the linked Dataset
    {'$project': {'ds_name': '$linked_ds.name'}}
])
sorted(list(cursor), key=lambda x: x['ds_name'][0].lower())
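The join performed by $lookup can be sketched in plain Python as follows. The documents are hypothetical and the code runs in memory, so it only illustrates the shape of the result: one row per (ConfigurationSet, Dataset) link, with 'ds_name' holding a list because $lookup always returns an array of matches.

```python
# Hypothetical documents; IDs and names are invented for illustration.
datasets = [
    {'_id': 'DS_1', 'name': 'beta dataset'},
    {'_id': 'DS_2', 'name': 'Alpha dataset'},
]
configuration_sets = [
    {'_id': 'CS_1', 'relationships': {'datasets': ['DS_1', 'DS_2']}},
    {'_id': 'CS_2', 'relationships': {'datasets': ['DS_1']}},
]

# Index the Dataset names by ID, playing the role of the 'datasets' collection.
names_by_id = {ds['_id']: ds['name'] for ds in datasets}

rows = []
for cs in configuration_sets:
    # $unwind: one row per linked Dataset ID
    for ds_id in cs['relationships']['datasets']:
        # $lookup + $project: list of matching names (empty if the ID is unknown)
        matches = [names_by_id[ds_id]] if ds_id in names_by_id else []
        rows.append({'_id': cs['_id'], 'ds_name': matches})

# Skip rows without a match before indexing ds_name[0], then sort by name.
rows_sorted = sorted(
    (row for row in rows if row['ds_name']),
    key=lambda row: row['ds_name'][0].lower(),
)
print(rows_sorted)
```

Filtering out rows with an empty 'ds_name' before sorting is a defensive choice in this sketch: indexing element 0 would raise an IndexError for any link that points to a Dataset ID with no matching document.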