MofasaDB
MofasaDB is a publicly available dataset containing 200,000+ de novo generated MOF (metal-organic framework) structures sampled from Mofasa (trained on QMOF structures of up to 170 atoms), along with their geometry-relaxed counterparts. The database is released alongside the paper Mofasa: A Step Change in Metal-Organic Framework Generation. A web interface for search and discovery is available at https://mofux.ai/.
Database Overview
The database contains unconditionally generated MOF structures from Mofasa, along with their geometry-relaxed counterparts.
Files
| File | Description |
|---|---|
| samples.db | Original generated MOF structures |
| relaxed.db | Geometry-relaxed versions of the samples |
| sample_latents/ | ORB latent embeddings for samples |
| relaxed_latents/ | ORB latent embeddings for relaxed structures |
Data Alignment
The databases are row-aligned: row i in samples.db corresponds to row i in relaxed.db.
Indexing:
- ASE databases are 1-indexed: the first row is db.get(1)
- NumPy arrays are 0-indexed: the first element is array[0]
- Therefore: latent[i] corresponds to db.get(i + 1)
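The off-by-one relationship above can be captured in two small helpers. This is a minimal sketch; the function names are hypothetical and are not part of any tooling shipped with the dataset.

```python
def row_id_to_index(row_id: int) -> int:
    """Map a 1-indexed ASE row id to the matching 0-indexed NumPy position."""
    if row_id < 1:
        raise ValueError("ASE row ids start at 1")
    return row_id - 1


def index_to_row_id(index: int) -> int:
    """Map a 0-indexed NumPy position to the matching 1-indexed ASE row id."""
    if index < 0:
        raise ValueError("array indices start at 0")
    return index + 1
```

Routing every lookup through helpers like these keeps the conversion in one place instead of scattering `+ 1` and `- 1` through analysis code.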
Quick Start
Load a Structure
from ase.db import connect
db = connect("samples.db")
row_id = 1
row = db.get(row_id) # Get first structure (1-indexed)
atoms = row.toatoms() # Convert to ASE Atoms object
print(atoms.get_chemical_formula())
Access Properties
# Get energy per atom
energy = row.data['properties']['orb_properties']['orb_energy_per_atom']
# Get largest cavity diameter (LCD, in Å)
lcd = row.data['properties']['pyzeo_geometric_properties']['lcd']
# Get topology (top-level property)
topology = row.data['topology']
Load ORB Latent Embeddings
import numpy as np
latents = np.load("sample_latents/orb_latent_4_graph.npy")
latent = latents[row_id - 1] # Convert 1-indexed row to 0-indexed array
Compare Sample and Relaxed
from ase.db import connect
sample_db = connect("samples.db")
relaxed_db = connect("relaxed.db")
# Row i in both databases corresponds to the same structure
row_id = 100
sample_atoms = sample_db.get(row_id).toatoms()
relaxed_atoms = relaxed_db.get(row_id).toatoms()
print(f"Sample formula: {sample_atoms.get_chemical_formula()}")
print(f"Relaxed formula: {relaxed_atoms.get_chemical_formula()}")
Handle Missing Data
Not all properties are available for every structure. Common causes include:
- MOFID failure: If MOFID cannot identify the MOF building blocks (nodes, linkers, topology), these properties are set to "UNKNOWN", "ERROR", or empty lists for missing SMILES strings.
- Zeo++ non-porous: If Zeo++ determines a structure has insufficient porosity for probe access, geometric properties (lcd, pld, accessible volume/surface area) may be missing, zero, or None.
- Component absence: Latent embeddings for bound_solvent and free_solvent are zero vectors when structures contain no solvent molecules.
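One way to deal with these markers is to normalize them before analysis. The sketch below is an assumption about how a consumer might do that: `is_missing` and `clean` are hypothetical helper names, and treating NaN as missing is an extra guard that the dataset description does not mention. Zero is deliberately not treated as missing, since a non-porous structure can legitimately have a zero pore descriptor.

```python
import math

# Sentinel strings listed above for failed MOFID lookups.
MISSING_SENTINELS = {"UNKNOWN", "ERROR"}


def is_missing(value) -> bool:
    """Return True if a property value should be treated as absent."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == "" or value in MISSING_SENTINELS
    if isinstance(value, float) and math.isnan(value):
        return True  # assumption: NaN is handled like None
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0  # e.g. empty SMILES lists
    return False


def clean(value, default=None):
    """Return `value`, or `default` when the value is a missing-data marker."""
    return default if is_missing(value) else value
```

For example, `clean(row.data.get('topology'))` would yield None for a structure whose topology is stored as "UNKNOWN" or "ERROR".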
Property Reference
Properties are stored in row.data with nested paths. Some examples:
PROPERTY_PATHS = {
# ORB model properties
'orb_energy_per_atom': 'properties.orb_properties.orb_energy_per_atom',
'orb_max_force': 'properties.orb_properties.orb_max_force',
# Zeo++ geometric properties
'lcd': 'properties.pyzeo_geometric_properties.lcd',
'pld': 'properties.pyzeo_geometric_properties.pld',
'dif': 'properties.pyzeo_geometric_properties.dif',
'av_volume_fraction': 'properties.pyzeo_geometric_properties.av_volume_fraction',
'av_cm3_per_g': 'properties.pyzeo_geometric_properties.av_cm3_per_g',
'nav_volume_fraction': 'properties.pyzeo_geometric_properties.nav_volume_fraction',
'asa_m2_per_g': 'properties.pyzeo_geometric_properties.asa_m2_per_g',
'number_of_channels': 'properties.pyzeo_geometric_properties.number_of_channels',
'number_of_pockets': 'properties.pyzeo_geometric_properties.number_of_pockets',
# Crystal symmetry
'spacegroup_number': 'properties.crystal_symmetry.symprec_0.01/spacegroup_number',
'pointgroup': 'properties.crystal_symmetry.symprec_0.01/pointgroup',
# MOFID properties
'mofid': 'mofid',
'mofkey': 'mofkey',
'topology': 'topology',
'smiles_nodes': 'smiles_nodes',
'smiles_linkers': 'smiles_linkers',
'cat': 'cat',
# MOFChecker
'mofchecker': 'properties.mofchecker',
'mofchecker_valid': 'properties.mofchecker.mofchecker_valid',
}
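Some keys in these paths contain a dot themselves (for example symprec_0.01/spacegroup_number), so splitting a path on every "." would break them. The following sketch resolves a dotted path against a nested dictionary by trying the longest candidate key first. `resolve_path` is a hypothetical helper, and the layout of the test dictionary is assumed from the paths listed above.

```python
def resolve_path(data, path, default=None):
    """Follow a dotted path through nested dicts, allowing keys that contain dots."""
    parts = path.split(".")
    node = data
    i = 0
    while i < len(parts):
        if not isinstance(node, dict):
            return default
        # Try the longest candidate key first, so "symprec_0.01/pointgroup"
        # is matched as one key rather than as the fragment "symprec_0".
        for j in range(len(parts), i, -1):
            key = ".".join(parts[i:j])
            if key in node:
                node = node[key]
                i = j
                break
        else:
            return default  # no candidate key matched at this level
    return node
```

With a loaded row, `resolve_path(row.data, PROPERTY_PATHS['lcd'])` would return the largest cavity diameter, or None when any level of the path is absent.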
Structural Properties
Lattice Parameters
| Key | Type | Description |
|---|---|---|
| lattice_a | float | Unit cell length along the a-axis (Å) |
| lattice_b | float | Unit cell length along the b-axis (Å) |
| lattice_c | float | Unit cell length along the c-axis (Å) |
| lattice_alpha | float | Angle between b and c axes (degrees) |
| lattice_beta | float | Angle between a and c axes (degrees) |
| lattice_gamma | float | Angle between a and b axes (degrees) |
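The six lattice parameters determine the unit cell volume through the standard triclinic formula. A minimal sketch follows; `cell_volume` is a hypothetical helper that takes the values as plain numbers, so it makes no assumption about where they are stored in a row.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in Å^3 from lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # General triclinic expression; it reduces to a*b*c when all angles are 90 degrees.
    factor = 1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg
    if factor <= 0.0:
        raise ValueError("angles do not describe a valid unit cell")
    return a * b * c * math.sqrt(factor)
```

Comparing this value for a sample and its relaxed counterpart gives a quick measure of how much the cell changed during relaxation.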
Chemical Composition
| Key | Type | Description |
|---|---|---|
| reduced_formula | str | Empirical (reduced) chemical formula of the structure |
MOFID Properties
MOFID is a standardized identifier for MOF structures that encodes topology, nodes, linkers, and catenation information.
| Key | Type | Description |
|---|---|---|
| mofid | str | Full MOFID identifier string. Format: {nodes}.{linkers} MOFid-v1.{topology}.cat{n}. |
| mofkey | str | MOFKey identifier (a hash-based representation of the MOF structure). Format: {hash}.{topology}.MOFkey-v1.{short_code}. |
| smiles_nodes | str | Concatenated SMILES strings of all distinct metal nodes (.-separated). |
| smiles_linkers | str | Concatenated SMILES strings of all distinct organic linkers (.-separated). |
| topology | str | Three-letter RCSR topology code (e.g., "pcu", "dia", "fcu"). |
| topology_v2 | str | Alternative topology assignment (may differ from primary if ambiguous) |
| cat | int | Catenation number (degree of interpenetration). 0 = non-catenated, n = n-fold catenated |
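The mofid format described above can be split back into its parts with plain string operations. This is a sketch under stated assumptions: `parse_mofid` is a hypothetical helper, the example string in the test is illustrative rather than taken from the database, and the handling of a trailing ";name" suffix is an assumption based on MOFid strings produced by the upstream tool. Node and linker SMILES cannot be told apart from the string alone, so the helper returns all fragments together; use smiles_nodes and smiles_linkers when that distinction matters.

```python
def parse_mofid(mofid: str) -> dict:
    """Split a string of the form '{smiles} MOFid-v1.{topology}.cat{n}'."""
    marker = " MOFid-v1."
    if marker not in mofid:
        raise ValueError("not a MOFid-v1 string")
    smiles, tail = mofid.split(marker, 1)
    tail = tail.split(";", 1)[0]  # assumption: drop an optional ';name' suffix
    topology, _, cat_field = tail.rpartition(".cat")
    if not topology:  # no '.cat' separator present
        topology, cat_field = tail, ""
    # Failed lookups may carry a non-numeric catenation field; report None then.
    catenation = int(cat_field) if cat_field.isdigit() else None
    fragments = [s for s in smiles.split(".") if s]
    return {"fragments": fragments, "topology": topology, "catenation": catenation}
```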
Crystal Symmetry
Computed using pymatgen's SpacegroupAnalyzer.
| Key | Type | Description |
|---|---|---|
| spacegroup | str | Crystal system from space group analysis at symprec=0.01 (e.g., "cubic", "triclinic") |
| spacegroup_v2 | str | Crystal system from space group analysis at symprec=0.1 (more tolerant symmetry detection) |
Detailed Crystal Symmetry (nested under properties.crystal_symmetry)
| Key | Type | Description |
|---|---|---|
| symprec_0.01/pointgroup | str | Point group symbol (Hermann-Mauguin notation) |
| symprec_0.01/spacegroup | str | Space group symbol (Hermann-Mauguin notation) |
| symprec_0.01/spacegroup_number | int | International Tables space group number (1-230) |
| symprec_0.01/spacegroup_crystal | str | Crystal system name |
| symprec_0.1/pointgroup | str | Point group symbol (at looser tolerance) |
| symprec_0.1/spacegroup | str | Space group symbol (at looser tolerance) |
| symprec_0.1/spacegroup_number | int | Space group number (at looser tolerance) |
| symprec_0.1/spacegroup_crystal | str | Crystal system name (at looser tolerance) |
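The space group number and the crystal system are linked by the standard International Tables ranges, which makes it easy to cross-check the two fields or to bin structures by symmetry. A minimal sketch, with `crystal_system` as a hypothetical helper name:

```python
import bisect

# Inclusive upper bound of each crystal system's space group number range.
_UPPER_BOUNDS = [2, 15, 74, 142, 167, 194, 230]
_SYSTEMS = [
    "triclinic", "monoclinic", "orthorhombic", "tetragonal",
    "trigonal", "hexagonal", "cubic",
]


def crystal_system(spacegroup_number: int) -> str:
    """Return the crystal system for an International Tables number (1-230)."""
    if not 1 <= spacegroup_number <= 230:
        raise ValueError("space group number must be between 1 and 230")
    # bisect_left finds the first range whose upper bound is >= the number.
    return _SYSTEMS[bisect.bisect_left(_UPPER_BOUNDS, spacegroup_number)]
```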
Zeo++ Geometric Properties
Computed using Zeo++ via the pyzeo wrapper. These properties characterize the pore geometry and accessibility using a spherical probe (default: N₂ probe radius of 1.86 Å).
Pore Descriptors
| Key | Type | Unit | Description |
|---|---|---|---|
| lcd | float | Å | Largest Cavity Diameter – Diameter of the largest sphere that can fit in the pore without overlapping framework atoms |
| pld | float | Å | Pore Limiting Diameter – Diameter of the largest sphere that can percolate through the framework (i.e., the narrowest point along the largest channel) |
| dif | float | Å | Diameter of Included sphere along Free path – Diameter of the largest sphere that can diffuse along the accessible path |
| number_of_channels | int | - | Number of distinct connected channel systems in the framework |
| number_of_pockets | int | - | Number of isolated pores (inaccessible to the probe molecule) |
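These descriptors lend themselves to simple screening rules. The sketch below shows two: a check that a spherical guest fits through the pore limiting diameter, and a sanity check on the ordering lcd >= dif >= pld that follows from the definitions above. Both function names are hypothetical, and missing values (None) are treated as failing.

```python
def admits_guest(pld, guest_diameter):
    """True if a spherical guest of the given diameter (Å) can pass the PLD."""
    if pld is None or guest_diameter <= 0:
        return False
    return pld >= guest_diameter


def descriptors_consistent(lcd, pld, dif, tol=1e-6):
    """Check the expected ordering lcd >= dif >= pld; missing values fail."""
    if None in (lcd, pld, dif):
        return False
    return lcd + tol >= dif and dif + tol >= pld
```

For instance, `admits_guest(pld, 3.3)` screens for frameworks whose narrowest window is at least as wide as the kinetic diameter of CO2.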
Volume Properties
| Key | Type | Unit | Description |
|---|---|---|---|
| av_volume_fraction | float | - | Fraction of unit cell volume that is accessible to the probe |
| av_cm3_per_g | float | cm³/g | Accessible pore volume per gram of framework |
| nav_volume_fraction | float | - | Fraction of unit cell volume that is non-accessible (pocket volume) |
| nav_cm3_per_g | float | cm³/g | Non-accessible volume per gram of framework |
| channel_volume_fraction | float | - | Fraction of total void volume that belongs to channels |
| pocket_volume_fraction | float | - | Fraction of total void volume that belongs to pockets |
Surface Area Properties
| Key | Type | Unit | Description |
|---|---|---|---|
| asa_m2_per_cm3 | float | m²/cm³ | Accessible surface area per unit volume |
| asa_m2_per_g | float | m²/g | Accessible Surface Area per gram (comparable to BET surface area) |
| nasa_m2_per_cm3 | float | m²/cm³ | Non-accessible surface area per unit volume |
| nasa_m2_per_g | float | m²/g | Non-accessible surface area per gram |
| channel_surface_area_fraction | float | - | Fraction of total surface area belonging to channels |
| pocket_surface_area_fraction | float | - | Fraction of total surface area belonging to pockets |
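The volumetric and gravimetric surface areas differ only by the crystal density, since m²/cm³ divided by m²/g gives g/cm³. The sketch below recovers that density from the two accessible surface area values. `density_from_asa` is a hypothetical helper, and it assumes both values were computed for the same structure.

```python
def density_from_asa(asa_m2_per_cm3, asa_m2_per_g):
    """Crystal density in g/cm^3 implied by the two accessible surface areas."""
    if not asa_m2_per_cm3 or not asa_m2_per_g:
        return None  # zero or missing surface area carries no density information
    return asa_m2_per_cm3 / asa_m2_per_g
```

As a worked example, 1800 m²/cm³ and 3000 m²/g imply a density of 0.6 g/cm³.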
ORB Properties
Properties computed using the ORB machine-learned interatomic potential.
Energy and Forces
| Key | Type | Unit | Description |
|---|---|---|---|
| orb_energy_per_atom | float | eV/atom | Total predicted potential energy divided by number of atoms |
| orb_max_force | float | eV/Å | Maximum force magnitude on any atom in the structure |
ORB Latent Embeddings
ORB latent embeddings are stored as NumPy files in the sample_latents/ and relaxed_latents/ directories.
File naming: orb_latent_{layer}_{component}.npy
| File Pattern | Shape | Description |
|---|---|---|
| orb_latent_{0-4}_graph | (N, 256) | Graph-level pooled latent |
| orb_latent_{0-4}_nodes_and_bridges | (N, 256) | Mean-pooled over metal nodes |
| orb_latent_{0-4}_linkers | (N, 256) | Mean-pooled over organic linkers |
| orb_latent_{0-4}_bound_solvent | (N, 256) | Mean-pooled over bound solvents |
| orb_latent_{0-4}_free_solvent | (N, 256) | Mean-pooled over free solvents |
- Layers 0-4 correspond to different depths in the ORB GNN (layer 4 = final layer)
- Zero vectors indicate missing data (e.g., structures without solvents)
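The naming scheme and the zero-vector convention can be wrapped in two small helpers: one builds a file name from a layer and a component, the other flags which rows hold a real embedding. This is a sketch; `latent_filename` and `present_mask` are hypothetical names, and the test uses a tiny synthetic array in place of a real (N, 256) file.

```python
import numpy as np

COMPONENTS = ("graph", "nodes_and_bridges", "linkers", "bound_solvent", "free_solvent")


def latent_filename(layer: int, component: str) -> str:
    """Build 'orb_latent_{layer}_{component}.npy' after validating both parts."""
    if layer not in range(5):
        raise ValueError("layer must be between 0 and 4")
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    return f"orb_latent_{layer}_{component}.npy"


def present_mask(latents: np.ndarray) -> np.ndarray:
    """Boolean mask of rows whose embedding is not an all-zero placeholder."""
    return np.any(latents != 0, axis=1)
```

Applying the mask before computing distances or fitting a model keeps placeholder rows from being mistaken for genuine embeddings near the origin.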
MOFChecker Properties
Computed using MOFChecker, a tool for validating MOF structures. All keys are prefixed with mofchecker_.
Validity Checks (Binary)
These descriptors are used to determine overall MOF validity. True indicates a problem (except where noted).
| Key | Type | Description |
|---|---|---|
| mofchecker_valid | bool | Overall validity flag. True if structure passes all validity checks. |
| mofchecker_no_carbon | bool | True if structure contains no carbon atoms (invalid for organic-based MOFs) |
| mofchecker_no_hydrogen | bool | True if structure contains no hydrogen atoms |
| mofchecker_no_metal | bool | True if structure contains no metal atoms |
| mofchecker_has_atomic_overlaps | bool | True if any atoms are too close together |
| mofchecker_has_lone_molecule | bool | True if structure contains disconnected molecular fragments |
| mofchecker_has_overcoordinated_c | bool | True if any carbon has too many bonds |
| mofchecker_has_overcoordinated_n | bool | True if any nitrogen has too many bonds |
| mofchecker_has_overcoordinated_h | bool | True if any hydrogen has too many bonds |
| mofchecker_has_undercoordinated_c | bool | True if any carbon has too few bonds |
| mofchecker_has_undercoordinated_n | bool | True if any nitrogen has too few bonds |
| mofchecker_has_undercoordinated_rare_earth | bool | True if any rare earth metal is undercoordinated |
| mofchecker_has_undercoordinated_alkali_alkaline | bool | True if any alkali/alkaline earth metal is undercoordinated |
| mofchecker_has_suspicious_terminal_oxo | bool | True if structure has potentially incorrect terminal oxo groups on metals |
| mofchecker_has_geometrically_exposed_metal | bool | True if any metal has unusual coordination geometry |
| mofchecker_has_high_charges | bool | True if computed partial charges are unusually high |
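When a structure is flagged as invalid, it is often useful to see which individual checks fired. The sketch below lists the problem flags that are True in a MOFChecker result dictionary. `failed_checks` is a hypothetical helper, and it is not a claim about how mofchecker_valid itself is derived; it only reports the flags as stored.

```python
# Flags that describe the structure without indicating a problem.
INFORMATIVE = {"mofchecker_has_oms", "mofchecker_has_3d_connected_graph"}


def failed_checks(mofchecker: dict) -> list:
    """Names of the problem flags that are True in a MOFChecker result dict."""
    return sorted(
        key
        for key, value in mofchecker.items()
        if key not in INFORMATIVE
        and key.startswith(("mofchecker_has_", "mofchecker_no_"))
        and value is True
    )
```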
Informative Checks (Binary, not used for validity)
| Key | Type | Description |
|---|---|---|
| mofchecker_has_oms | bool | True if structure has Open Metal Sites (coordinatively unsaturated metals) |
| mofchecker_has_3d_connected_graph | bool | True if the framework is 3D-connected (expected for MOFs) |
Structure Hashes
| Key | Type | Description |
|---|---|---|
| mofchecker_graph_hash | str | Hash of the full structure graph (atoms + bonds) |
| mofchecker_undecorated_graph_hash | str | Hash of graph with hydrogen atoms removed |
| mofchecker_decorated_scaffold_hash | str | Hash of framework scaffold with decorations |
| mofchecker_undecorated_scaffold_hash | str | Hash of bare framework scaffold |
| mofchecker_symmetry_hash | str | Hash encoding symmetry information |
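Structure hashes are a natural key for finding duplicates among generated samples. The sketch below groups 1-indexed row ids by hash and keeps only the groups with more than one member. The helper names are hypothetical, and the input is assumed to be a list of hash strings in database row order, with None or an empty string where no hash is available.

```python
from collections import defaultdict


def group_by_hash(hashes):
    """Map each hash to the 1-indexed row ids that share it."""
    groups = defaultdict(list)
    for row_id, h in enumerate(hashes, start=1):
        if h:  # skip missing or empty hashes
            groups[h].append(row_id)
    return dict(groups)


def duplicate_groups(hashes):
    """Keep only the hashes that occur in more than one row."""
    return {h: ids for h, ids in group_by_hash(hashes).items() if len(ids) > 1}
```

Which hash to use depends on the question: the full graph hash treats structures as duplicates only when atoms and bonds match, while the scaffold hashes group structures more loosely.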
MOF Fragment Properties
Properties of the decomposed MOF components (nodes, linkers, solvents). Stored under properties.mof_fragments.
Component Types
MOF structures are decomposed into four component types:
- nodes_and_bridges: Metal nodes and bridging groups
- linkers: Organic linker molecules
- bound_solvent: Solvent molecules coordinated to metal centers
- free_solvent: Unbound solvent molecules in pores
Fragment Formulas
| Key | Type | Description |
|---|---|---|
| {component}_formulas | List[str] | Chemical formulas of each fragment of this component type |
Example: nodes_and_bridges_formulas = ["Zn4O", "Zn4O"] for a structure with two identical zinc nodes
Linker SMILES
| Key | Type | Description |
|---|---|---|
| linkers_smiles | List[str] | Full SMILES strings for each linker fragment, including stereochemistry and charges where applicable |
| linkers_simple_smiles | List[str] | Simplified SMILES (scaffold only, no stereochemistry). More robust for parsing but less chemically accurate |
Linker Properties
Molecular descriptors and fingerprints for organic linker molecules. Stored under properties.linker_properties.
Morgan Fingerprints
Morgan (circular) fingerprints are stored as NumPy files. For similarity search, use the standardized versions.
| File | Description |
|---|---|
| linkers_morgan_ecfp4.npy | ECFP4 (radius=2), 2048-bit |
| linkers_morgan_ecfp6.npy | ECFP6 (radius=3), 2048-bit |
| linkers_morgan_ecfp4_standardized.npy | ECFP4 from standardized molecules |
| linkers_morgan_ecfp6_standardized.npy | ECFP6 from standardized molecules |
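Similarity search over fingerprints is commonly done with the Tanimoto coefficient, the ratio of shared on-bits to the union of on-bits. The sketch below implements it for two fingerprint vectors. `tanimoto` is a hypothetical helper, and because the exact array layout of the .npy files is not documented here, it takes two 1-D vectors and binarizes them first, which is an assumption if the stored fingerprints hold counts.

```python
import numpy as np


def tanimoto(fp_a, fp_b) -> float:
    """Tanimoto similarity between two fingerprint vectors, treated as bits."""
    a = np.asarray(fp_a).astype(bool)  # any non-zero entry counts as an on-bit
    b = np.asarray(fp_b).astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # both fingerprints are empty
    return float(np.logical_and(a, b).sum() / union)
```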
Scalar metadata:
| Key | Type | Description |
|---|---|---|
| linkers_smiles_used | List[str] | Which SMILES string was successfully parsed for each linker (original, fixed, or simple) |
| linkers_smiles_standardized | List[str] | Chemically standardized SMILES (neutralized, canonical tautomer) |
| linkers_morgan_count_sum | List[int] | Sum of Morgan fingerprint bit counts (molecular complexity proxy) |
| linkers_morgan_count_sum_max | List[int] | Maximum count in Morgan fingerprint (indicates highly represented substructures) |
| linkers_morgan_count_sum_standardized | List[int] | Sum of counts for standardized fingerprints |
| linkers_morgan_count_sum_max_standardized | List[int] | Maximum count for standardized fingerprints |
Molecular Descriptors
Computed on standardized molecules using RDKit.
| Key | Type | Description |
|---|---|---|
| linkers_rotatable_bonds | List[int] | Number of rotatable bonds per linker (flexibility metric) |
| linkers_ring_count | List[int] | Number of rings per linker |
Coordination Site Descriptors
Counts of metal-coordinating functional groups (computed on as-parsed molecules).
| Key | Type | Description |
|---|---|---|
| linkers_coordination_site_count | List[int] | Total number of potential metal coordination sites per linker |
| linkers_coordination_site_breakdown | List[Dict] | Breakdown by coordination site type |
| linkers_carboxylate_count | List[int] | Number of carboxylate groups (-COO⁻/-COOH) |
| linkers_pyridine_count | List[int] | Number of aromatic nitrogen sites |
| linkers_imidazole_n_count | List[int] | Number of imidazole/triazole NH groups |
| linkers_primary_amine_count | List[int] | Number of primary amine groups (-NH₂) |
| linkers_secondary_amine_count | List[int] | Number of secondary amine groups (-NH-) |
| linkers_tertiary_amine_count | List[int] | Number of tertiary amine groups (-N<) |
| linkers_phosphonate_count | List[int] | Number of phosphonate groups |
| linkers_sulfonate_count | List[int] | Number of sulfonate groups |
| linkers_phenolic_oh_count | List[int] | Number of phenolic hydroxyl groups |
| linkers_alcoholic_oh_count | List[int] | Number of alcoholic hydroxyl groups |
| linkers_thiol_count | List[int] | Number of thiol groups (-SH) |
| linkers_nitrile_count | List[int] | Number of nitrile groups (-C≡N) |
Validation Metrics
Binary metrics used to assess structure quality.
| Key | Type | Description |
|---|---|---|
| no_atom_too_close | bool | True if all interatomic distances are physically reasonable |
| smact_valid | bool | True if composition passes SMACT electronegativity/charge balance checks |
| reconstruction_failed | bool | True if structure reconstruction from latent space failed |
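Note that the three flags do not share a polarity: True is the good outcome for the first two and the bad outcome for the third. A combined filter therefore has to invert one of them, as in the sketch below. `passes_validation` is a hypothetical helper, the input is assumed to be a dictionary holding these three keys, and a missing flag is conservatively treated as a failure.

```python
def passes_validation(metrics: dict) -> bool:
    """True only if all three validation metrics report a good structure."""
    return (
        metrics.get("no_atom_too_close") is True
        and metrics.get("smact_valid") is True
        and metrics.get("reconstruction_failed") is False
    )
```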
License
References
- MOFID: Bucior, B. J., et al. (2019). Identification Schemes for Metal-Organic Frameworks...
- Zeo++: Willems, T. F., et al. (2012). Algorithms and tools for high-throughput geometry-based analysis...
- MOFChecker: Ongari, D., et al. (2019). Building a Consistent and Reproducible Database for Adsorption Evaluation...
- QMOF: Rosen, A. S., et al. (2021). Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery. The corresponding dataset is released on GitHub.
- ORB: Orbital ORB v3 Force Field
- RDKit Morgan Fingerprints: RDKit Documentation