=====
Usage
=====

This section describes the core usage of the :code:`colabfit-tools` package.

Parsing data
============

Properties can be added to the Database by creating or loading a Configuration
object, attaching the Property data to the :attr:`info` and/or :attr:`arrays`
dictionaries of the Configuration (see
:ref:`here <Configuration info and arrays fields>` for more details), then
calling the :meth:`~colabfit.tools.database.MongoDatabase.insert_data` method.

In addition to the Configuration with attached Property information,
:meth:`~colabfit.tools.database.MongoDatabase.insert_data` also requires a
property map describing how to map the attached Property information onto an
existing property definition (see :ref:`Property definitions`).

A property map should have the following structure:

.. code-block:: python

    {
        <property_name>: {
            <property_field>: {
                'field': <key_for_info_or_arrays>,
                'units': <ase_readable_units>
            }
        }
    }

See below for the definitions of each of the above keys/values:

* :attr:`<property_name>` should be one of the following:

  1. The name of an OpenKIM Property Definition from the list of approved
     OpenKIM Property Definitions
  2. The name of a locally-defined property (see :ref:`Property definitions`)
     that has been added using
     :meth:`~colabfit.tools.database.MongoDatabase.insert_property_definition`

* :attr:`<property_field>` should be the name of a field from a property
  definition
* :attr:`<key_for_info_or_arrays>` should be a key for indexing the
  :attr:`info` or :attr:`arrays` dictionaries on a Configuration (see
  :ref:`Configuration info and arrays fields`)
* :attr:`'field'` is used to specify the key for extracting the property from
  :attr:`Configuration.info` or :attr:`Configuration.arrays`.
* :attr:`'units'` should be a string matching one of the unit names in
  :code:`ase.units`.

Note that :meth:`insert_data` will attempt to load every Property specified in
:code:`property_map` for each Configuration. This means that if there are
:code:`P` properties in :code:`property_map` and :code:`C` Configurations, a
maximum of :code:`P*C` Properties will be loaded in total.
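
The property map structure and the :code:`P*C` bound can be sketched with plain
Python dictionaries. This is only an illustration and does not call
:code:`colabfit-tools`: the property names (:code:`total-energy`,
:code:`atomic-displacements`), the keys, the toy Configurations, and the helper
:code:`loadable_properties` are all hypothetical, and the rule that a Property
needs every one of its keys to be present is an assumption made for the sketch.

```python
import warnings

# Hypothetical property map; the property and field names are illustrative only.
property_map = {
    'total-energy': {
        'energy': {'field': 'energy', 'units': 'eV'},
    },
    'atomic-displacements': {
        'displacements': {'field': 'disp', 'units': 'Ang'},
    },
}


def loadable_properties(config, property_map):
    """Return the names of the properties whose keys all exist on `config`.

    `config` is a plain dict with 'info' and 'arrays' entries, standing in
    for a Configuration. This helper is not part of colabfit-tools.
    """
    available = set(config['info']) | set(config['arrays'])
    loadable = []
    for name, fields in property_map.items():
        missing = [
            spec['field'] for spec in fields.values()
            if spec['field'] not in available
        ]
        if missing:
            warnings.warn(f"Skipping {name!r}: missing keys {missing}")
        else:
            loadable.append(name)
    return loadable


# Two toy Configurations; the second one has no 'disp' array.
configs = [
    {'info': {'energy': -3.2}, 'arrays': {'disp': [[0.0, 0.0, 0.1]]}},
    {'info': {'energy': -1.7}, 'arrays': {}},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    loaded = [loadable_properties(c, property_map) for c in configs]

n_loaded = sum(len(names) for names in loaded)
upper_bound = len(property_map) * len(configs)  # P * C
print(n_loaded, upper_bound, len(caught))
```

In this toy case the number of loaded Properties stays below the :code:`P*C`
bound because the second Configuration lacks the :code:`disp` array.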
If a Configuration does not have the necessary data for loading a given
Property, that Property is skipped for the given Configuration and a warning is
raised.

Detecting duplicates
====================

All entities in the Database (Configuration, Property, PropertySetting,
ConfigurationSet, and Dataset) are stored with a unique ID that is generated by
hashing the corresponding entity (see the documentation for their respective
hash functions for more detail). Because of this, two identical entities are
stored only once in the Database. This is useful when removing duplicate data
from a Dataset, as it ensures that no duplicates are added during
:meth:`~colabfit.tools.database.MongoDatabase.insert_data`.

For example, this enables a user to take advantage of the Mongo
:meth:`count_documents` function to count the Configurations that are linked to
more than one Property:

.. code-block:: python

    # Array indices in Mongo dot notation are zero-based, so index 1 exists
    # only when a Configuration is linked to at least two Properties
    client.configurations.count_documents(
        {'relationships.properties.1': {'$exists': True}}
    )

Similarly, the :code:`'relationships'` fields of entities can be used to check
if two Datasets have any Configurations in common.

Synchronizing a Dataset
=======================

When working with a Dataset, it is important to make sure that the Dataset has
been "synchronized", so that all of its data (configuration labels,
configuration sets, aggregated metadata, ...) reflect any recent changes to the
Dataset.
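
To see why synchronization matters, consider a toy in-memory model of an
aggregated summary going stale. The class :code:`ToyConfigurationSet` and its
:code:`resync` method are hypothetical stand-ins written for this sketch; they
are not the :code:`colabfit-tools` classes and do not touch a database.

```python
class ToyConfigurationSet:
    """In-memory stand-in for a ConfigurationSet (not the colabfit-tools class)."""

    def __init__(self, configurations):
        self.configurations = configurations
        self.aggregated_labels = set()
        self.resync()

    def resync(self):
        """Recompute the cached label summary from the current Configurations."""
        self.aggregated_labels = set().union(
            *(c['labels'] for c in self.configurations)
        )


# Two toy Configurations, stored as plain dicts
configs = [
    {'nsites': 8, 'labels': set()},
    {'nsites': 216, 'labels': set()},
]
cs = ToyConfigurationSet(configs)

# Label the small Configuration after the summary was computed
configs[0]['labels'].add('small')
stale = set(cs.aggregated_labels)  # the cached summary has not changed

# Re-aggregating brings the summary back in line with the Configurations
cs.resync()
fresh = set(cs.aggregated_labels)

print(stale, fresh)
```

The cached summary only picks up the new label once the aggregation step is
run again.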
There are three points in the Database where data aggregation is performed (and
thus where care must be taken to ensure synchronization):

* ConfigurationSets aggregating Configuration information
  (:meth:`~colabfit.tools.database.MongoDatabase.aggregate_configuration_info`)
* Datasets aggregating ConfigurationSet information
  (:meth:`~colabfit.tools.database.MongoDatabase.aggregate_configuration_set_info`)
* Datasets aggregating Property information
  (:meth:`~colabfit.tools.database.MongoDatabase.aggregate_property_info`)

Synchronization can be performed by using the :code:`resync=True` argument when
calling :meth:`~colabfit.tools.database.MongoDatabase.get_configuration_set` and
:meth:`~colabfit.tools.database.MongoDatabase.get_dataset`. Aggregation is
automatically performed when inserting ConfigurationSets. Aggregation is *not*
automatically performed when inserting Datasets, in order to avoid
re-aggregating ConfigurationSet information unnecessarily; therefore
:code:`resync=True` may also be used for
:meth:`~colabfit.tools.database.MongoDatabase.insert_dataset`.

In order to re-synchronize the entities without using the :meth:`get_*` methods,
call the :meth:`aggregate_*` methods directly. A common scenario where this may
be necessary is after using
:meth:`~colabfit.tools.database.MongoDatabase.apply_labels`, in order to make
sure that the changes are reflected in the ConfigurationSets and Datasets.

Applying configuration labels
=============================

Configuration labels should be applied using the
:meth:`~colabfit.tools.database.MongoDatabase.apply_labels` method. For example,
the following call applies the label :code:`'small'` to every Configuration in
the Dataset that has fewer than 100 atoms:

.. code-block:: python

    client.apply_labels(
        dataset_id=ds_id,
        collection_name='configurations',
        query={'nsites': {'$lt': 100}},
        labels={'small'},
        verbose=True
    )

See the :ref:`Si PRX GAP example` for a more complete example.

Building configuration sets
===========================

There are two steps to building a ConfigurationSet.

First, extract the IDs of the Configurations that should be included in the
ConfigurationSet:

.. code-block:: python

    co_ids = client.get_data(
        'configurations',
        fields='_id',
        query={
            '_id': {'$in': <dataset_configuration_ids>},
            'nsites': {'$lt': 100}
        },
        ravel=True
    ).tolist()

Second, call
:meth:`~colabfit.tools.database.MongoDatabase.insert_configuration_set`:

.. code-block:: python

    cs_id = client.insert_configuration_set(
        co_ids,
        description='Configurations with fewer than 100 atoms'
    )

Note that in the first step, :code:`'_id': {'$in': <dataset_configuration_ids>}`
is used to limit the query to the Configurations within a given Dataset, rather
than all of the Configurations in the entire Database.

See the :ref:`Si PRX GAP example` for a more complete example.

Attaching property settings
===========================

A :class:`~colabfit.tools.property_settings.PropertySettings` object can be
attached to a Property by specifying the :code:`property_settings` argument in
:meth:`~colabfit.tools.database.MongoDatabase.insert_data`.

.. code-block:: python

    from colabfit.tools.property_settings import PropertySettings

    pso = PropertySettings(
        method='VASP',
        description='A basic VASP calculation',
        files=None,
        labels=['PBE', 'GGA'],
    )

    client.insert_data(
        images,
        property_map=...,
        property_settings={
            <property_name>: pso,
        }
    )

Data exploration
================

The :meth:`~colabfit.tools.database.MongoDatabase.get_data`,
:meth:`~colabfit.tools.database.MongoDatabase.plot_histograms`, and
:meth:`~colabfit.tools.database.MongoDatabase.get_statistics` functions are
useful for quickly visualizing your data and detecting outliers.

.. code-block:: python

    energies = client.get_data('properties', '<property_name>.energy', ravel=True)
    forces = client.get_data('properties', '<property_name>.forces', concatenate=True)

.. code-block:: python

    # From the QM9 example
    client.get_statistics(
        ['qm9-property.a', 'qm9-property.b', 'qm9-property.c'],
        ids=dataset.property_ids,
        verbose=True
    )

    client.plot_histograms(
        ['qm9-property.a', 'qm9-property.b', 'qm9-property.c'],
        ids=dataset.property_ids
    )

.. image:: qm9_histograms.png
    :align: center

See the :ref:`QM9 example` and the :ref:`Si PRX GAP example` to further explore
the benefits of these functions.

Since Configurations inherit from :class:`ase.Atoms`, they work seamlessly with
ASE's visualization tools.

.. code-block:: python

    # Run inside of a Jupyter Notebook
    from ase.visualize import view

    configurations = client.get_configurations(configuration_ids)

    # Creates a Jupyter Widget; may require `pip install nglview` first
    view(configurations, viewer='nglview')

.. image:: db_viewer_example.gif
    :align: center

Filtering a Dataset
===================

Datasets can be filtered to remove unwanted entries or to extract subsets of
interest. Filtering can be done using the
:meth:`~colabfit.tools.database.MongoDatabase.filter_on_properties` or
:meth:`~colabfit.tools.database.MongoDatabase.filter_on_configurations` methods.

.. code-block:: python

    # From the QM9 example
    clean_config_sets, clean_property_ids = client.filter_on_properties(
        ds_id=ds_id,
        filter_fxn=lambda x: (
            x['qm9-property']['a']['source-value'] < 20
            and x['qm9-property']['b']['source-value'] < 10
        ),
        fields=['qm9-property.a.source-value', 'qm9-property.b.source-value'],
        verbose=True
    )

Note the use of the :code:`ds_id` argument, which makes sure that the returned
ConfigurationSet IDs and Property IDs are only those that are contained within
the given Dataset.

Data transformations
====================

It is often necessary to transform the data in a Dataset in order to improve
performance when fitting models to the data, or to convert the data into a
different format. This can be done using the :code:`transform` argument of the
:meth:`~colabfit.tools.database.MongoDatabase.insert_data` function.
The :code:`transform` argument should be a callable that modifies the
Configuration in-place:

.. code-block:: python

    def per_atom(c):
        # Convert the total energy into an energy per atom
        c.info['energy'] /= len(c)

    client.insert_data(
        configurations,
        property_map=property_map,
        property_settings=property_settings,
        transform=per_atom,
    )

Supported file formats
======================

Ideally, raw data should be stored in Extended XYZ format. This is the default
format used by :code:`colabfit-tools`, and should be suitable for almost all use
cases. CFG files (used by Moment Tensor Potentials) are also supported, but are
not recommended.

Data that is in a custom format (e.g., JSON, HDF5, ...) that cannot be easily
read by :code:`ase.io.read` will require the use of a
:class:`~colabfit.tools.converters.FolderConverter` instance, which needs to be
supplied with a custom :meth:`reader` function for parsing the data.
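
As a rough idea of what a custom reader could look like, the sketch below
parses a JSON file using only the standard library. Everything in it is an
assumption made for illustration: the file layout (a list of records with
:code:`symbols`, :code:`positions`, and :code:`energy` keys), the function name
:code:`json_reader`, the made-up values, and the choice to yield plain
dictionaries. The signature and return type that
:class:`~colabfit.tools.converters.FolderConverter` expects from its
:meth:`reader` are not specified in this section and should be checked in its
documentation; a real reader would most likely build Configuration (or
:class:`ase.Atoms`) objects instead of dictionaries.

```python
import json
import tempfile
from pathlib import Path


def json_reader(file_path):
    """Yield one plain-dict record per structure stored in a JSON file.

    Assumes a hypothetical layout: a top-level list of objects with
    'symbols', 'positions', and 'energy' keys.
    """
    with open(file_path) as f:
        records = json.load(f)

    for record in records:
        yield {
            'symbols': record['symbols'],
            'positions': record['positions'],
            # Mirrors the `info` dictionary used to hold Property data
            'info': {'energy': record['energy']},
        }


# Write a tiny example file (made-up values) and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'structures.json'
    path.write_text(json.dumps([
        {
            'symbols': ['H', 'H'],
            'positions': [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
            'energy': -1.17,
        },
    ]))
    parsed = list(json_reader(path))

print(parsed[0]['info'])
```

Writing the reader as a generator means large files can be processed one record
at a time.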