# Data underlying the publication: ChemSpaX: Exploration of Chemical Space by Automated Functionalization of Molecular Scaffold
Authors: Adarsh V. Kalikadien, Evgeny A. Pidko and Vivek Sinha  
  
The data is contained in three zip archives.
* **full_datasets.zip** contains thermochemical and electronic 
properties calculated with GFN2-xTB and DFT for all use cases shown in the publication. 
* **geometry_files.zip** contains .mol or .xyz files for geometries generated by ChemSpaX. 
GFN2-xTB or DFT optimized geometries are also included.
* **Coulomb_matrix_HL_gap_Mn_pincers.zip** contains names, Coulomb matrix representations, and DFT computed HOMO-LUMO gaps of Mn-CNC and Mn-PCP complexes, based on both FF geometries and DFT optimized geometries. 

All data was parsed from xtb.out or DFT.out files using either Bash or Python scripts.
These datasets will be explained in more detail below.

## geometry_files.zip
This dataset contains five types of folders, each holding 
coordinates for a geometry in the form of .mol or .xyz files. For functionalized structures, 
the name of the skeleton used forms the first part of the filename.
* **FF**: These are geometries generated by ChemSpaX after functionalization of a skeleton. 
Each run corresponds to a new set of substituents placed on a skeleton. 
* **GFN2**: These are GFN2-xTB optimized geometries.
* **DFT**: These are DFT optimized geometries.
* **skeletons**: Skeletons used to create functionalized geometries using ChemSpaX. 
* **manually_made**: Geometries that were made manually by replacing a single atom. 
They should not be used without first being optimized.
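
As a rough illustration of how the .xyz geometries could be read, the sketch below parses the standard XYZ layout (atom count, comment line, then one `element x y z` line per atom). The function name `parse_xyz` and the water coordinates are illustrative assumptions and are not taken from the dataset or from ChemSpaX.

```python
from typing import List, Tuple

# One atom: (element symbol, x, y, z), with coordinates in Angstrom.
Atom = Tuple[str, float, float, float]


def parse_xyz(text: str) -> Tuple[str, List[Atom]]:
    """Parse the text of a standard .xyz file into (comment, atoms).

    Hypothetical helper for illustration; not part of ChemSpaX.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])  # first line: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # second line: free text
    atoms: List[Atom] = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Illustrative input only; these coordinates are not from geometry_files.zip.
example = """3
water (illustrative coordinates)
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
"""

comment, atoms = parse_xyz(example)
print(len(atoms), "atoms:", [symbol for symbol, *_ in atoms])
```

For a file extracted from the archive, the same function could be applied to the result of `open(path).read()`.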

## full_datasets.zip
### RuPNP
Where available, the run, the name of the geometry file of the complex, and the substituent on each 
R or L group of the geometry are indicated.
* **energies_of_reaction_dft**: This file contains the electronic energy (delta_e) or Gibbs free energy (delta_g) of 
reaction per geometry in kcal/mol, calculated using DFT.  
Columns M (delta_g_bp86_thf) to S (delta_e_gfn2_gas) show the results of (free) energy single-point (SP; _thf) or full frequency (_gas) calculations using 
various methods (\_bp86\_, \_pbe1pbe\_, \_gfn2\_).  
Columns T (delta_e_ff_start_bp86_thf) to V (abs_diff_E_ff_bp86) contain SP calculations along the DFT optimization trajectory of a 
ChemSpaX generated geometry. Here \_start\_ indicates an SP on the starting geometry (ChemSpaX generated and FF optimized) 
and \_opt\_ an SP on the fully optimized geometry (BP86(GAS) DFT optimization with frequency calculations). The \_abs\_diff\_ column shows the 
energy difference between the SPs at the start and end of the trajectory.  
Columns W (delta_e_xtbopt_start_bp86_thf) to Y (abs_diff_E_xtb_bp86) show the same kind of energy difference along the optimization trajectory, 
except that the trajectory now starts from a GFN2-xTB optimized geometry. \_start\_ is thus a DFT SP on a GFN2-xTB optimized geometry and 
\_opt\_ is the SP on the geometry obtained by BP86(GAS) DFT optimization of that starting geometry, analogous to columns T to V. 
* **energies_of_reaction_xtb**: This file contains the electronic energy (delta_e) or Gibbs free energy (delta_g) of 
reaction per geometry in kcal/mol, calculated using GFN2-xTB. Here, \_thf in the column name indicates that 
GBSA was used during the GFN2-xTB optimization.
* **h_l_gaps**: This file contains HOMO-LUMO gaps in eV calculated using GFN2-xTB and DFT.  
Column C (h_l_gap_gfn2_xtb_thf) contains the HOMO-LUMO gap calculated using GFN2-xTB. The other columns contain 
HOMO-LUMO gaps calculated using DFT, with the same nomenclature as above for the functional (\_bp86\_, \_pbe1pbe\_) and for SP (\_thf) 
or full frequency (\_gas) calculations.
* **rmsds_vs_dft**: This file contains (h)RMSDs in Angstrom. The column name indicates which optimization methods are compared. 
Here ff denotes ChemSpaX generated geometries, xtb denotes GFN2-xTB optimized geometries and bp86 denotes BP86(GAS) DFT optimized geometries. 
* **rmsds_vs_xtb**: See the explanation for rmsds_vs_dft.
* **substituents_placed_per_run**: This file lists the 12 substituents (those currently included in the source code of ChemSpaX) 
that were placed on a skeleton for each functionalization run of ChemSpaX.
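
The column naming described for energies_of_reaction_dft can be decoded mechanically. The sketch below is a hypothetical helper (the name `describe_column` and the label strings are assumptions) that applies only the conventions stated above. It treats \_thf as an SP calculation, which holds for the DFT file; in energies_of_reaction_xtb the same suffix instead indicates GBSA during the GFN2-xTB optimization.

```python
# Labels follow the conventions described for energies_of_reaction_dft.
QUANTITIES = {
    "delta_e": "electronic energy of reaction (kcal/mol)",
    "delta_g": "Gibbs free energy of reaction (kcal/mol)",
    "abs_diff": "absolute energy difference between start and end of the trajectory",
}
METHODS = {"bp86": "BP86", "pbe1pbe": "PBE1PBE", "gfn2": "GFN2-xTB"}
CALCULATIONS = {"thf": "single point (SP)", "gas": "full frequency calculation"}


def describe_column(name: str) -> dict:
    """Split a column name into the parts described in this README.

    Hypothetical helper for illustration; fields that cannot be
    recognised are returned as None.
    """
    lowered = name.lower()
    tokens = lowered.split("_")
    quantity = next(
        (label for prefix, label in QUANTITIES.items() if lowered.startswith(prefix)),
        None,
    )
    method = next((METHODS[t] for t in tokens if t in METHODS), None)
    calculation = next((CALCULATIONS[t] for t in tokens if t in CALCULATIONS), None)
    # _start_ / _opt_ mark the two ends of the optimization trajectory.
    trajectory_point = next((t for t in tokens if t in ("start", "opt")), None)
    return {
        "quantity": quantity,
        "method": method,
        "calculation": calculation,
        "trajectory_point": trajectory_point,
    }


for column in ("delta_g_bp86_thf", "delta_e_gfn2_gas", "delta_e_ff_start_bp86_thf"):
    print(column, "->", describe_column(column))
```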

### Mn_pincers
Where available, the name of the geometry file of the complex and the substituent on each site of the geometry are indicated.
* **additional_data_dft_optimization_trajectory**: This file contains various data for DFT optimized geometries, 
where energy_first and energy_last are in hartree, HOMO and LUMO in hartree, the HOMO-LUMO gap in eV and the CPU optimization time in seconds. 
A separate tab is made in the Excel file for each adduct, where pristine indicates that no adduct is present. energy_first and energy_last are the 
energies of the starting geometry (ChemSpaX generated) and the fully DFT optimized (BP86(GAS)) geometry, respectively. These energies 
are used in the delta_e tab to calculate the difference between the start and end of the optimization trajectory.  
All energies in the delta_e tab are in kcal/mol. The first part of the column name indicates the adduct, followed by whether the value belongs to the 
start or end of the optimization trajectory. A separate column gives the absolute difference between the starting and ending energy of reaction of the optimization trajectory.
* **energies_of_reaction_xtb_and_dft**: This file contains the electronic energy (delta_e) or Gibbs free energy (delta_g) in kcal/mol.  
The first part of the column name indicates the adduct and \_sp\_ or \_thf\_ in the column name indicates that a SP calculation was done. 
The functional or calculation method used is also contained in the column name (xtb/GFN2-xtb, bp86, pbe1pbe). 
* **h_l_gaps_xtb_and_dft**: This file contains HOMO-LUMO gaps in eV calculated using GFN2-xTB and DFT. 
The first part of the column name indicates the adduct. bp86\_thf indicates that a DFT SP on a BP86(GAS) optimized geometry was used. 
\_xtb\_thf indicates that a GFN2-xTB optimization using GBSA was done.
* **rmsds**: See explanation for RuPNP.
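
Because energy_first and energy_last are stored in hartree while the delta_e tab is in kcal/mol, a unit conversion is involved. The sketch below shows only that conversion and the absolute-difference step. The function names and the example energies are assumptions, and the reference species entering the actual energies of reaction are not described in this README, so they are not modelled here.

```python
# 1 hartree expressed in kcal/mol (CODATA-based value, rounded).
HARTREE_TO_KCAL_PER_MOL = 627.509474


def hartree_to_kcal_per_mol(energy_hartree: float) -> float:
    """Convert an energy from hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


def abs_trajectory_difference(energy_first: float, energy_last: float) -> float:
    """Absolute energy change (kcal/mol) between the start and end of an
    optimization trajectory, given both energies in hartree.

    Hypothetical helper; the dataset's own columns were produced by the
    authors' scripts, which are not reproduced here.
    """
    return abs(hartree_to_kcal_per_mol(energy_last - energy_first))


# Made-up example values, not taken from the dataset.
example_first = -100.000  # hartree, starting geometry
example_last = -100.010   # hartree, optimized geometry
change = abs_trajectory_difference(example_first, example_last)
print(f"{change:.2f} kcal/mol")  # prints 6.28 kcal/mol
```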

### Co_Por
Where available, the name of the geometry file of the complex, the substituent on each site and the index ("How many functionalizations are done?") are indicated.
* **co_por_xtb_and_dft_data**: This file contains various data, with a separate tab for each functionalization run. 
All runs contain the (h)RMSDs in Angstrom of ChemSpaX generated (FF) geometries versus GFN2-xTB optimized geometries, and the number of atoms contained in a geometry.  
For run2 a DFT SP was done, so the HOMO-LUMO gap in eV for alpha and beta spin calculated by DFT is shown in columns I (dft_h_l_gap_alpha) and 
J (dft_h_l_gap_beta). The next columns show the xtb (GFN2-xTB) and DFT (BP86(THF)) calculated HOMO-LUMO gaps in eV, followed by a column with the difference in eV between these 
calculations. The last column shows a classification within a cluster (low, medium, high) of the hRMSD (between ChemSpaX generated 
geometries and GFN2-xTB optimized geometries).
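
To give a feel for what an (h)RMSD value measures, the sketch below computes a plain RMSD between two geometries with identical atom ordering. It makes several assumptions that this README does not confirm: the structures are taken as already superimposed (no alignment step), and the h in hRMSD is read as restricting the comparison to non-hydrogen atoms. The function name `rmsd` and the coordinates are illustrative only.

```python
import math
from typing import List, Tuple

# One atom: (element symbol, x, y, z), with coordinates in Angstrom.
Atom = Tuple[str, float, float, float]


def rmsd(atoms_a: List[Atom], atoms_b: List[Atom], heavy_only: bool = False) -> float:
    """Root-mean-square deviation in Angstrom between two geometries.

    Assumes identical atom ordering and pre-aligned structures.
    With heavy_only=True, hydrogen atoms are skipped (assumed meaning of hRMSD).
    """
    if len(atoms_a) != len(atoms_b):
        raise ValueError("geometries must contain the same number of atoms")
    pairs = []
    for a, b in zip(atoms_a, atoms_b):
        if a[0] != b[0]:
            raise ValueError("atom ordering differs between the geometries")
        if heavy_only and a[0] == "H":
            continue
        pairs.append((a, b))
    if not pairs:
        raise ValueError("no atoms left to compare")
    squared = sum((a[i] - b[i]) ** 2 for a, b in pairs for i in (1, 2, 3))
    return math.sqrt(squared / len(pairs))


# Made-up two-atom geometries, not taken from the dataset.
ff_geometry = [("C", 0.0, 0.0, 0.0), ("H", 1.0, 0.0, 0.0)]
xtb_geometry = [("C", 0.0, 0.0, 0.0), ("H", 1.0, 0.0, 2.0)]

print(rmsd(ff_geometry, xtb_geometry))                   # all atoms
print(rmsd(ff_geometry, xtb_geometry, heavy_only=True))  # heavy atoms only
```

In practice an alignment step is usually applied before the deviation is computed; it is omitted here to keep the example short.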



### M2L4 cage
* **rmsds**: See explanation for RuPNP.