Mongo overview

MongoDB structure

The structure of the underlying Mongo database is described below. Lines preceded by a forward slash (/) denote Mongo collections. All other lines denote queryable fields. Indentation denotes sub-fields.

/configurations
- _id: a unique string identifier
- atomic_numbers: the atomic numbers for each atom
- positions: the Cartesian coordinates of each atom
- cell: the cell lattice vectors
- pbc: periodic boundary conditions along each cell vector
- names: human-readable names for the Configuration
- labels: labels applied to the Configuration to improve queries
- elements: the set of element types in the Configuration
- nelements: the number of unique element types in the Configuration
- elements_ratios: the relative concentrations of each element type
- chemical_formula_reduced: a reduced chemical formula
- chemical_formula_anonymous: an anonymous chemical formula (without specific element types)
- chemical_formula_hill: the chemical formula in Hill notation
- nsites: the number of sites (atoms) in the Configuration
- dimension_types: same as pbc
- nperiodic_dimensions: the number of periodic dimensions
- lattice_vectors: same as cell
- last_modified: timestamp of when the entry was modified last
- relationships: pointers to linked entries
  
  properties: IDs of linked Properties
  
  configuration_sets: IDs of linked ConfigurationSets
/properties
- _id: a unique string identifier
- type: the property type
- <property_name>: the property, with the same name as the contents of type
  
  field_name: the values of each field in the Property definition
- methods: duplication of the method field of any linked PropertySettings
- labels: duplication of the labels field of any linked PropertySettings
- last_modified: timestamp of when the entry was modified last
- relationships: pointers to linked entries
  
  property_settings: IDs of linked PropertySettings
  
  configurations: IDs of linked Configurations
/property_definitions
- _id: a unique string identifier
- definition: the full contents of a Property definition
/property_settings
- _id: a unique string identifier
- method: the method used in the calculation/experiment
- description: a human-readable description of the calculation/experiment
- labels: labels to improve queries
- files: linked files
  
  file_name: the name of the file
  
  file_contents: the contents of the file
- relationships: pointers to linked entries
  
  properties: IDs of linked Properties
/configuration_sets
- _id: a unique string identifier
- description: a human-readable description
- last_modified: timestamp of when the entry was modified last
- aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations
  
  nconfigurations
  
  nsites
  
  nelements
  
  chemical_systems
  
  elements
  
  individual_elements_ratios
  
  total_elements_ratios
  
  labels
  
  labels_counts
  
  chemical_formula_reduced
  
  chemical_formula_anonymous
  
  chemical_formula_hill
- relationships: pointers to linked entries
  
  configurations: IDs of linked Configurations
  
  datasets: IDs of linked Datasets
/datasets
- _id: a unique string identifier
- name: the name of the Dataset
- authors: the authors of the Dataset
- description: a human-readable description of the Dataset
- links: external links associated with the Dataset
- last_modified: timestamp of when the entry was modified last
- aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations and Properties
  
  nconfigurations
  
  nsites
  
  nelements
  
  chemical_systems
  
  elements
  
  individual_elements_ratios
  
  total_elements_ratios
  
  configuration_labels
  
  configuration_labels_counts
  
  chemical_formula_reduced
  
  chemical_formula_anonymous
  
  chemical_formula_hill
  
  nperiodic_dimensions
  
  dimension_types
  
  property_types
  
  property_types_counts
  
  property_fields
  
  property_fields_counts
  
  methods
  
  methods_counts
  
  property_labels
  
  property_labels_counts
- relationships: pointers to linked entries
  
  properties: IDs of linked Properties
  
  configuration_sets: IDs of linked ConfigurationSets
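
Several of the /configurations fields listed above are derived from the raw atomic data. The sketch below illustrates how they relate to one another. It is not the ColabFit implementation: the helper derived_fields and the SYMBOLS table are hypothetical, and the alphabetical ordering of 'elements' is an assumption.

```python
from collections import Counter

# Hypothetical lookup table covering only the elements used in this example
SYMBOLS = {1: 'H', 8: 'O'}


def derived_fields(atomic_numbers, pbc):
    """Compute a few derived Configuration fields from the raw atomic data."""
    symbols = [SYMBOLS[z] for z in atomic_numbers]
    counts = Counter(symbols)
    # Assumption: element types are stored in alphabetical order
    elements = sorted(counts)
    nsites = len(symbols)
    return {
        'elements': elements,
        'nelements': len(elements),
        # Relative concentration of each element, in the same order as 'elements'
        'elements_ratios': [counts[e] / nsites for e in elements],
        'nsites': nsites,
        # Number of cell vectors along which the system is periodic
        'nperiodic_dimensions': sum(bool(p) for p in pbc),
    }


# A non-periodic water molecule: one oxygen atom and two hydrogen atoms
fields = derived_fields([8, 1, 1], [False, False, False])
```

Because these values are stored directly on each document, they can be queried without recomputing them from atomic_numbers.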

Mongo usage

This section provides examples of how to perform various operations on the Database using Mongo. For more details, see the MongoDB documentation.

Queries

Being able to formulate at least basic Mongo queries is essential for working with the Database. If you are new to Mongo, the query tutorials in the official Mongo manual are a good place to start.

Structure

Recall that when opening a connection to the Database, for example with the following code:

from colabfit.tools.database import MongoDatabase

client = MongoDatabase('colabfit_database')

the client object is a Mongo client connected to the 'colabfit_database' Database on a running Mongo server. This Database has the following collections, which are accessible as attributes of client: 'configurations', 'properties', 'property_settings', 'configuration_sets', and 'datasets'. See MongoDB structure for more details.

Find one

Get an example of a single document in a collection that satisfies the given query.

# Find a Property document that is linked to the Dataset with an ID of ds_id
client.properties.find_one({'relationships.datasets': ds_id})
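
Note that find_one returns None when no document matches the query, so it can be worth guarding against that case before using the result. A minimal sketch, assuming properties is a collection object such as client.properties; the helper name first_linked_property is hypothetical and not part of colabfit:

```python
def first_linked_property(properties, ds_id):
    """Return one Property document linked to ds_id, or raise if none exists.

    `properties` is assumed to be a collection object that provides
    find_one(), for example client.properties.
    """
    doc = properties.find_one({'relationships.datasets': ds_id})
    if doc is None:
        # find_one signals "no match" with None rather than an exception
        raise LookupError(f'no Property is linked to Dataset {ds_id!r}')
    return doc
```

Raising an explicit error here avoids a less informative TypeError later, when the code tries to index into None.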

Count documents

Count the number of documents in a collection.

# Count the number of Configurations in the Database
client.configurations.count_documents({})
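
count_documents accepts the same filter syntax as find, so the count can be restricted to a subset of the collection. As a sketch, the hypothetical helper below builds a filter for Configurations whose element set is exactly a given chemical system. It assumes that 'elements' is stored as an array of element symbols and 'nelements' as an integer, as described in MongoDB structure.

```python
def exact_chemical_system(symbols):
    """Build a filter matching documents whose elements are exactly `symbols`."""
    unique = sorted(set(symbols))
    return {
        # $all: the 'elements' array must contain every listed symbol
        'elements': {'$all': unique},
        # ...and must not contain any other element type
        'nelements': len(unique),
    }


query = exact_chemical_system(['Si', 'O'])
# Assumed usage with the client from above:
# client.configurations.count_documents(query)
```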

Get all documents

Get a list of all of the Datasets in the Database, then sort by name.

sorted(
    # Project only the 'name' field ('_id' is always returned as well)
    client.datasets.find({}, {'name': 1}),
    # Sort case-insensitively
    key=lambda x: x['name'].lower()
)

Check for multiple links

Similar to what is done in detecting duplicates, the 'relationships' field can be useful for finding documents that are linked to multiple other documents.

For example, to count how many ConfigurationSets are linked to more than one Dataset:

client.configuration_sets.count_documents(
    {'relationships.datasets.1': {'$exists': True}}
)
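
The query above relies on dot notation with a numeric component, which addresses an array index: 'relationships.datasets.1' exists only if the array has a second element. The hypothetical helper below generalizes the same idea to "at least n links"; its name and signature are illustrative only.

```python
def at_least_n_links(link_field, n):
    """Build a filter for documents with at least n entries in a relationships array."""
    if n < 1:
        raise ValueError('n must be at least 1')
    # Array indices are zero-based, so the n-th element lives at index n - 1
    return {f'relationships.{link_field}.{n - 1}': {'$exists': True}}


query = at_least_n_links('datasets', 2)
# Assumed usage with the client from above:
# client.configuration_sets.count_documents(query)
```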

Get distinct fields

Get a list of all distinct values of a given field:

# Get a list of the unique property types in the Database
client.properties.distinct('type')

Count occurrences

Aggregation pipelines can be extremely useful, but may be more difficult to understand for new users of Mongo. The example below shows how to use aggregation to count the occurrences of each Configuration label.

cursor = client.configurations.aggregate([
    # With no $match stage, the pipeline runs over every document in the
    # collection
    # $unwind: emit one document for each value in the 'labels' field
    {'$unwind': '$labels'},
    # $group: group the documents by label, and count each group
    {'$group': {'_id': '$labels', 'count': {'$sum': 1}}}
])

sorted(cursor, key=lambda x: x['count'], reverse=True)
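
For readers new to aggregation, it may help to see what the pipeline computes when written as plain Python over a list of documents. This is only an illustration of the semantics, not a replacement for the server-side pipeline; the function name count_labels is hypothetical, and 'labels' is assumed to always be a list when present.

```python
from collections import Counter


def count_labels(configurations):
    """Emulate the $unwind + $group pipeline on an in-memory list of documents."""
    counts = Counter()
    for doc in configurations:
        # $unwind: one row per label; documents with a missing or empty
        # 'labels' array contribute nothing
        for label in doc.get('labels', []):
            # $group with {'$sum': 1}: add one per row to that label's count
            counts[label] += 1
    # Return the same shape as the pipeline, most frequent label first
    return [{'_id': label, 'count': n} for label, n in counts.most_common()]


docs = [{'labels': ['a', 'b']}, {'labels': ['a']}, {'labels': []}, {}]
result = count_labels(docs)
```

Here result holds one entry per distinct label, mirroring the documents that the cursor yields.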

Get Datasets linked to ConfigurationSets

The example below shows how to use aggregation to obtain a list of all ConfigurationSets in the Database, with the names of their linked Datasets.

cursor = client.configuration_sets.aggregate([
    # $project: only return the requested fields for each document
    {'$project': {'relationships.datasets': 1}},
    # $unwind: create a new document for each element in an array
    {'$unwind': '$relationships.datasets'},
    # $project: only return the renamed field
    {'$project': {'ds_id': '$relationships.datasets'}},
    # $lookup: pull the Dataset document with the given ID
    {'$lookup': {
        # pull from the 'datasets' collection
        'from': 'datasets',
        # match the local field 'ds_id' to the '_id' field in 'datasets'
        'localField': 'ds_id',
        'foreignField': '_id',
        # attach the Dataset document under the name 'linked_ds'
        'as': 'linked_ds'
    }},
    # $project: only return the name of the linked Dataset
    {'$project': {'ds_name': '$linked_ds.name'}}
])

sorted(list(cursor), key=lambda x: x['ds_name'][0].lower())
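
The shape of the result can be easier to see in a plain-Python emulation of the pipeline. The sketch below operates on in-memory lists of documents; the function name linked_dataset_names is hypothetical. It shows that the pipeline yields one row per (ConfigurationSet, Dataset) link, and that 'ds_name' is a list, because $lookup always attaches an array.

```python
def linked_dataset_names(configuration_sets, datasets):
    """Emulate the $unwind + $lookup + $project pipeline on in-memory documents."""
    names_by_id = {ds['_id']: ds['name'] for ds in datasets}
    rows = []
    for cs in configuration_sets:
        # $unwind: one row per linked Dataset ID; ConfigurationSets with no
        # links produce no rows
        for ds_id in cs.get('relationships', {}).get('datasets', []):
            # $lookup: the joined result is a list, empty if nothing matches
            matches = [names_by_id[ds_id]] if ds_id in names_by_id else []
            # The final $project keeps '_id' (the default) and 'ds_name'
            rows.append({'_id': cs['_id'], 'ds_name': matches})
    return rows
```

One consequence is that a link to a Dataset ID with no matching document produces a row with an empty 'ds_name' list, for which x['ds_name'][0] raises an IndexError. If dangling links are possible, filtering out such rows before sorting avoids the error.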