Mongo overview

MongoDB structure

The structure of the underlying Mongo database is described below. Lines preceded by a forward slash (/) denote Mongo collections. All other lines denote queryable fields. Indentation denotes sub-fields.

  • /configurations
    • _id: a unique string identifier

    • atomic_numbers: the atomic numbers for each atom

    • positions: the Cartesian coordinates of each atom

    • cell: the cell lattice vectors

    • pbc: periodic boundary conditions along each cell vector

    • names: human-readable names for the Configuration

    • labels: labels applied to the Configuration to improve queries

    • elements: the set of element types in the Configuration

    • nelements: the number of unique element types in the Configuration

    • elements_ratios: the relative concentrations of each element type

    • chemical_formula_reduced: a reduced chemical formula

    • chemical_formula_anonymous: an anonymous chemical formula (without specific element types)

    • chemical_formula_hill: the chemical formula in Hill notation

    • nsites: the number of sites (atoms) in the Configuration

    • dimension_types: same as pbc

    • nperiodic_dimensions: the number of periodic dimensions

    • lattice_vectors: same as cell

    • last_modified: timestamp of when the entry was modified last

    • relationships: pointers to linked entries
      • properties: IDs of linked Properties

      • configuration_sets: IDs of linked ConfigurationSets

  • /properties
    • _id: a unique string identifier

    • type: the property type

    • <property_name>: the property, with the same name as the contents of type
      • field_name: the value of each field in the Property definition

    • methods: duplication of the method field of any linked PropertySettings

    • labels: duplication of the labels field of any linked PropertySettings

    • last_modified: timestamp of when the entry was modified last

    • relationships: pointers to linked entries
      • property_settings: IDs of linked PropertySettings

      • configurations: IDs of linked Configurations

  • /property_definitions
    • _id: a unique string identifier

    • definition: the full contents of a Property definition

  • /property_settings
    • _id: a unique string identifier

    • method: the method used in the calculation/experiment

    • description: a human-readable description of the calculation/experiment

    • labels: labels to improve queries

    • files: linked files
      • file_name: the name of the file

      • file_contents: the contents of the file

    • relationships: pointers to linked entries
      • properties: IDs of linked Properties

  • /configuration_sets
    • _id: a unique string identifier

    • description: a human-readable description

    • last_modified: timestamp of when the entry was modified last

    • aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations
      • nconfigurations

      • nsites

      • nelements

      • chemical_systems

      • elements

      • individual_elements_ratios

      • total_elements_ratios

      • labels

      • labels_counts

      • chemical_formula_reduced

      • chemical_formula_anonymous

      • chemical_formula_hill

    • relationships: pointers to linked entries
      • configurations: IDs of linked Configurations

      • datasets: IDs of linked Datasets

  • /datasets
    • _id: a unique string identifier

    • name: the name of the Dataset

    • authors: the authors of the Dataset

    • description: a human-readable description of the Dataset

    • links: external links associated with the Dataset

    • last_modified: timestamp of when the entry was modified last

    • aggregated_info: information gathered by aggregating the corresponding fields from the linked Configurations and Properties
      • nconfigurations

      • nsites

      • nelements

      • chemical_systems

      • elements

      • individual_elements_ratios

      • total_elements_ratios

      • configuration_labels

      • configuration_labels_counts

      • chemical_formula_reduced

      • chemical_formula_anonymous

      • chemical_formula_hill

      • nperiodic_dimensions

      • dimension_types

      • property_types

      • property_types_counts

      • property_fields

      • property_fields_counts

      • methods

      • methods_counts

      • property_labels

      • property_labels_counts

    • relationships: pointers to linked entries
      • properties: IDs of linked Properties

      • configuration_sets: IDs of linked ConfigurationSets
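
Because sub-fields are addressed with MongoDB dot notation, the layout above maps directly onto query filters. The sketch below builds a few filter documents as plain Python dicts. All values (element symbols, thresholds, the placeholder ID) are hypothetical, and the first filter assumes that elements is stored as an array of element symbols.

```python
# Filter documents are plain dicts; sub-fields use dot notation.
# All values below are hypothetical and only illustrate the layout.

# Configurations with exactly two element types that include both H and O
# (assumes 'elements' is stored as an array of element symbols)
binary_h_o = {'nelements': 2, 'elements': {'$all': ['H', 'O']}}

# Configurations that are periodic in all three dimensions
fully_periodic = {'nperiodic_dimensions': 3}

# Datasets that aggregate more than 1000 Configurations
large_datasets = {'aggregated_info.nconfigurations': {'$gt': 1000}}

# ConfigurationSets linked to a given Dataset ID (placeholder value)
ds_id = 'DS_placeholder'
sets_in_dataset = {'relationships.datasets': ds_id}
```

Each of these dicts can be passed as the filter argument of find, find_one, or count_documents on the corresponding collection.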

Mongo usage

This section provides examples of how to perform various operations on the Database using Mongo. For more details, see the MongoDB documentation.

Queries

Working with the Database requires being able to formulate at least basic Mongo queries. If you are new to Mongo, a good place to start is the query tutorials in the official Mongo manual.

Structure

Recall that when opening a connection to the Database, for example with the following code:

from colabfit.tools.database import MongoDatabase

client = MongoDatabase('colabfit_database')

the client object is a Mongo Client connected to the 'colabfit_database' Database in a running Mongo server. This Database will have the following collections: 'configurations', 'properties', 'property_settings', 'configuration_sets', and 'datasets', which are accessible as attributes. See MongoDB structure for more details.
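
As a minimal sketch of attribute access, the helper below looks up each collection by name on the client and counts its documents. The function name collection_sizes is hypothetical and not part of colabfit; it only assumes that client exposes the collections listed above as attributes.

```python
def collection_sizes(client, names=('configurations', 'properties',
                                    'property_settings',
                                    'configuration_sets', 'datasets')):
    """Return {collection name: number of documents} for each collection."""
    sizes = {}
    for name in names:
        # Each collection is exposed as an attribute of the client
        collection = getattr(client, name)
        # An empty filter matches every document in the collection
        sizes[name] = collection.count_documents({})
    return sizes
```

Calling collection_sizes(client) right after connecting is a quick way to check that the Database contains what you expect.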

Find one

Get an example of a single document in a collection that satisfies the given query.

# Find a Property document that is linked to the Dataset with an ID of ds_id
client.properties.find_one({'relationships.datasets': ds_id})
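
Note that find_one returns None, rather than raising an error, when no document matches. A small wrapper such as the one sketched below makes that case explicit; find_one_or_raise is a hypothetical helper name, not part of colabfit or pymongo.

```python
def find_one_or_raise(collection, query, projection=None):
    """Return one matching document, or raise if nothing matches."""
    # find_one returns None (rather than raising) when no document matches
    doc = collection.find_one(query, projection)
    if doc is None:
        raise LookupError(f'no document matches {query!r}')
    return doc
```

For example, find_one_or_raise(client.properties, {'relationships.configurations': co_id}) would return a Property linked to the Configuration with ID co_id, where co_id is a placeholder for an ID you already have.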

Count documents

Count the number of documents in a collection.

# Count the number of Configurations in the Database
client.configurations.count_documents({})
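
Passing a non-empty filter restricts the count. The sketch below counts Configurations that carry a given label, assuming labels is stored as an array (in MongoDB, comparing an array field against a single value matches if any element equals it). The function name is hypothetical.

```python
def count_configurations_with_label(client, label):
    """Count the Configurations whose 'labels' array contains `label`."""
    # A scalar compared against an array field matches any equal element
    return client.configurations.count_documents({'labels': label})
```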

Get all documents

Get a list of all of the Datasets in the Database, then sort by name.

# Return only the 'name' field (plus '_id') of every Dataset, sorted
# case-insensitively by name
sorted(
    client.datasets.find({}, {'name': 1}),
    key=lambda x: x['name'].lower()
)
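
If only the names themselves are needed, the projection can also exclude '_id'. The following is a minimal sketch under the same assumptions as the example above; dataset_names is a hypothetical helper name.

```python
def dataset_names(client):
    """Return the names of all Datasets, sorted case-insensitively."""
    # {'name': 1, '_id': 0} keeps 'name' and drops the default '_id' field
    cursor = client.datasets.find({}, {'name': 1, '_id': 0})
    return sorted((doc['name'] for doc in cursor), key=str.lower)
```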

Get distinct fields

Get a list of the distinct values of a given field:

# Get a list of the unique property types in the Database
client.properties.distinct('type')
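
distinct also accepts a filter as its second argument, which restricts the documents that are considered. The sketch below lists the property types computed with a given method, assuming the methods field on Properties holds the method names copied from the linked PropertySettings. The function name is hypothetical.

```python
def property_types_for_method(client, method):
    """Return the sorted property types of Properties that use `method`."""
    # The second argument restricts which documents distinct() looks at
    return sorted(client.properties.distinct('type', {'methods': method}))
```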

Count occurrences

Aggregation pipelines can be extremely useful, but may be more difficult to understand for new users of Mongo. The example below shows how to use aggregation to count the occurrences of each Configuration label.

cursor = client.configurations.aggregate([
    # With no $match stage, the pipeline runs over every document in the
    # collection
    # $unwind: emit one copy of each document per value in its 'labels' field
    {'$unwind': '$labels'},
    # $group: group the documents by label and count each group
    {'$group': {'_id': '$labels', 'count': {'$sum': 1}}}
])

sorted(cursor, key=lambda x: x['count'], reverse=True)
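
The sorting can also be done on the server by appending $sort and $limit stages. The sketch below builds such a pipeline and, to show what the $unwind and $group stages compute, includes a pure-Python equivalent that works on a list of documents. Both function names are hypothetical, and the local version assumes labels is a list of strings.

```python
from collections import Counter


def label_count_pipeline(top_n=10):
    """Build a pipeline returning the `top_n` most common labels."""
    return [
        {'$unwind': '$labels'},
        {'$group': {'_id': '$labels', 'count': {'$sum': 1}}},
        # $sort: most frequent labels first
        {'$sort': {'count': -1}},
        # $limit: keep only the first top_n results
        {'$limit': top_n},
    ]


def count_labels_locally(documents):
    """Pure-Python equivalent of the $unwind and $group stages."""
    counts = Counter()
    for doc in documents:
        # Documents without labels contribute nothing, as with $unwind
        counts.update(doc.get('labels', []))
    return counts
```

The pipeline would be used as client.configurations.aggregate(label_count_pipeline(5)).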

Get Datasets linked to ConfigurationSets

The example below shows how to use aggregation to obtain a list of all ConfigurationSets in the Database, with the names of their linked Datasets.

cursor = client.configuration_sets.aggregate([
    # $project: only return the requested fields for each document
    {'$project': {'relationships.datasets': 1}},
    # $unwind: create a new document for each element in an array
    {'$unwind': '$relationships.datasets'},
    # $project: only return the renamed field
    {'$project': {'ds_id': '$relationships.datasets'}},
    # $lookup: pull the Dataset document with the given ID
    {'$lookup': {
        # pull from the 'datasets' collection
        'from': 'datasets',
        # match the local field 'ds_id' to the '_id' field in 'datasets'
        'localField': 'ds_id',
        'foreignField': '_id',
        # attach the Dataset document under the name 'linked_ds'
        'as': 'linked_ds'
    }},
    # $project: only return the name of the linked Dataset
    {'$project': {'ds_name': '$linked_ds.name'}}
])

sorted(list(cursor), key=lambda x: x['ds_name'][0].lower())
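
In the pipeline above, $lookup always attaches an array, so ds_name is a list and has to be indexed with [0]. One possible variant, sketched below, adds a second $unwind so that ds_name becomes a plain string. The function name is hypothetical, and the sketch matches directly on the dotted local field instead of renaming it first.

```python
def configuration_set_dataset_pipeline():
    """Build a pipeline pairing each ConfigurationSet with a Dataset name."""
    return [
        {'$project': {'relationships.datasets': 1}},
        {'$unwind': '$relationships.datasets'},
        {'$lookup': {
            'from': 'datasets',
            'localField': 'relationships.datasets',
            'foreignField': '_id',
            'as': 'linked_ds',
        }},
        # $unwind: replace the 'linked_ds' array with its element; entries
        # whose Dataset ID matched nothing are dropped at this stage
        {'$unwind': '$linked_ds'},
        {'$project': {'ds_name': '$linked_ds.name'}},
    ]
```

The result of client.configuration_sets.aggregate(configuration_set_dataset_pipeline()) could then be sorted with key=lambda x: x['ds_name'].lower().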