### What is this data deposit?

This data deposit is associated with the manuscript *"A data-driven impact-based analysis 
stemming from first responders reports to predict wind damage to urban trees"*, currently
(Oct 24) under revision for [Frontiers in Climate](https://www.frontiersin.org/journals/climate). We chose to publish our
manuscript in an Open Access journal (i.e. under a [CC-BY license](https://creativecommons.org/licenses/by/4.0/))
because the authors endorse Open Science and fundamental principles such as [FAIR](https://www.go-fair.org/fair-principles/).
The contents of the deposit should suffice to reproduce the figures presented in the published paper.
We have prepared this dataset with care, but problems might still arise. If you have read
this documentation and are still unable to run the examples, drop us a line at garciamarti [at] 
knmi.nl, and we can help.

### Description of the research

In this research, we combine storm damage reports with an array of weather and environmental
variables, including high-resolution tree data. We model this enriched dataset of damage reports
with one-class machine learning methods (i.e. One-Class SVM) to learn the hyperlocal conditions triggering
damage to urban trees. To do so, we wrap a model selection process within a sensitivity analysis.
The model selection part systematically explores the parameter space, computing a performance metric
and helping to visually identify optimal models for the problem. The sensitivity analysis consists of
**four experiments** that include or exclude some of the high-resolution features of the
dataset. This allows us to map the hourly suitability for wind damage with models that are more
informed (or, in contrast, poorly informed) of the local conditions, and to explore the side effects
of predicting for unseen regions of the geographic space. The current methodology, which uses
damage observations from public services (i.e. safety regions; in Dutch "Veiligheidsregio's"),
shows potential for mapping fine-grained effects of severe weather conditions and opens the door
to creating hyperlocal climate services.
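
As a rough illustration of the model-selection idea, the sketch below sweeps a small grid of One-Class SVM hyperparameters with scikit-learn and records a score for each combination. The synthetic data, the grid values, and the score (the fraction of held-out samples accepted as inliers) are assumptions made for this example; they are not the features, parameter ranges, or performance metric used in the paper.

```python
import itertools

import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the damage-report features
# (assumption: 300 samples with 4 features each).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
X_train, X_test = X[:200], X[200:]

# Hypothetical parameter grid; the ranges explored in the paper may differ.
nus = [0.05, 0.1, 0.2]
gammas = [0.01, 0.1, 1.0]

results = []
for nu, gamma in itertools.product(nus, gammas):
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
    # predict() returns +1 for inliers and -1 for outliers.
    accepted = float(np.mean(model.predict(X_test) == 1))
    results.append({"nu": nu, "gamma": gamma, "accepted": accepted})

# Rank the combinations by the illustrative score.
for row in sorted(results, key=lambda r: r["accepted"], reverse=True):
    print(row)
```

The actual selection procedure lives in `00_model_selection.py`; this sketch only shows the general shape of a parameter sweep for a one-class model.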

### Content of the data deposit

This section describes the file structure of the data deposit (i.e. what to find in each 
of the folders) and gives some information on how to run the examples. Note that all the code in this 
data deposit is written in `python` and has been run on a consumer-grade computer. The user may 
need to install some of the packages listed below. 

#### Folder structure

The data deposit has the following structure:
```
mod-storm-dmg/
├── code
│   ├── 00_model_selection.py
│   ├── 01_training_ensemble_oc_bootstrap.py
│   ├── 02_mapping_predictions.py
│   └── utils.py
└── data
    ├── features
    ├── geospatial
    ├── grid
    ├── params
    └── predictions
```

Here is a brief description of each folder:

| Folder               | Description |
|----------------------|-------------|
| `code`               | Contains four Python scripts: three of them reproduce the results in the submitted paper, and the fourth is a utility file used by the others. |
| `data/features`      | Here the user can find the central file used during the research. It contains a list of damage reports, provided by the two participating Dutch safety regions, characterized with the array of weather and environmental variables described in the paper (including the high-resolution tree data from the company Cobra Groeninzicht). The one-class models used in this research are trained on this dataset and subsequently applied to the geographic space, characterized in `data/grid`. |
| `data/geospatial`    | This folder contains two publicly available Dutch datasets on roads (i.e. `NWB_wegen`) and railways (i.e. `NATREG_SpoorWegen`) that are used for visualization purposes. For more information, these datasets can be found through the Dutch Spatial Data Infrastructure portal, [PDOK.nl](https://www.pdok.nl/). |
| `data/grid`          | In this folder we provide a representation of the study area at two spatial resolutions (i.e. 1000m and 5000m). For each grid cell at each spatial resolution, we have applied the same feature engineering process as in `data/features`. We provide the table of features associated with the two hourly frames found in the paper (i.e. 18th Feb. 2022 03:00, 26th Feb. 2022 03:00). Once the model has been trained with the storm damage reports, it is applied to these characterizations of the geographic space. |
| `data/params`        | Contains a file with a detailed description of each of the 200 models that make up the ensemble used in this research. Here the user can check the hyperparameters of each model for each of the four selected experiments. |
| `data/predictions`   | After applying the models to the geographic space (at two resolutions), the predicted values are stored in this folder. The predictions are originally stored as `csv` and then transformed to `nc` (raster format) and `png` to ease the preparation of visualizations. Note that for each of the four experiments and two spatial resolutions we create two files (in three formats). |
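
To make the bookkeeping of `data/predictions` concrete, the sketch below enumerates the expected outputs, assuming that the two files per experiment and resolution correspond to the two hourly frames. The experiment labels and the file-name pattern are hypothetical placeholders, not the names used in the deposit.

```python
import itertools
from datetime import datetime

experiments = [1, 2, 3, 4]        # hypothetical labels for the four experiments
resolutions_m = [1000, 5000]      # grid resolutions in metres
frames = [datetime(2022, 2, 18, 3), datetime(2022, 2, 26, 3)]  # hourly frames in the paper
formats = ["csv", "nc", "png"]

# Hypothetical naming pattern, used only to illustrate the combinations.
names = [
    f"exp{e}_{r}m_{t:%Y%m%dT%H}.{ext}"
    for e, r, t, ext in itertools.product(experiments, resolutions_m, frames, formats)
]

# 4 experiments x 2 resolutions x 2 frames x 3 formats
print(len(names))  # → 48
print(names[0])    # → exp1_1000m_20220218T03.csv
```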




#### Requirements

This research has been run within a `conda` environment using `Python 3.11.9`. The code
examples were executed using PyCharm 2024.2.2 (Community Edition) as IDE. The user 
might need to install one or more of the following `python` packages: `cartopy`, 
`geocube`, `geopandas`, `matplotlib`, `numpy`, `pandas`, `rioxarray`, `shutup`, 
`sklearn` (installed as `scikit-learn`), `xarray`. The scripts also use `collections`, 
`datetime`, `itertools`, and `json`, which are part of the Python standard library and need 
no installation. We also provide a basic utility file containing a few functions used in 
this research: `utils.py`.
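
Before running the scripts, it can help to check which of the third-party packages are importable in the active environment. The following is a minimal sketch using only the standard library; the helper `missing_packages` is a name introduced here for illustration and is not part of `utils.py`.

```python
import importlib.util

# Third-party packages imported by the scripts (import names, as listed above).
third_party = [
    "cartopy", "geocube", "geopandas", "matplotlib", "numpy",
    "pandas", "rioxarray", "shutup", "sklearn", "xarray",
]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(third_party)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Note that the import name and the install name can differ: `sklearn` is distributed as `scikit-learn`.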

#### Running the examples

Assuming a `python` installation with the above-mentioned packages, running the examples should
be as easy as executing each of the files, via the console or your favorite IDE. However, there 
are a few details to consider, mainly for files `01_training_ensemble_oc_bootstrap.py` and 
`02_mapping_predictions.py`:

- These files are prepared to run at the two selected spatial resolutions right away. To run only
one resolution, find the list initialization `spaces=[1000, 5000]` and remove one element.
- Similarly, you can tune the start and end dates of the analysis by modifying the variables
`sd` and `ed` in these files. Note, however, that this data deposit only provides two hourly time
slices, so the code will not yield results outside these time windows. 
- When running `00_model_selection.py`, the figures may open in a maximized window and then
close right away. You can prevent the closing by toggling the boolean `block` parameter of the
function `maximize()` in `utils.py`.
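
The first two points can be pictured with the sketch below. Only the names `spaces`, `sd`, and `ed` come from the scripts; their types, the `available_frames` set, the `hourly_range` helper, and the loop are assumptions made for illustration, so check the scripts for the actual definitions.

```python
from datetime import datetime, timedelta

spaces = [1000]                  # keep one element to run a single resolution
sd = datetime(2022, 2, 18, 0)    # assumed type: start of the analysis window
ed = datetime(2022, 2, 18, 6)    # assumed type: end of the analysis window

# The deposit ships features for these two hourly frames only.
available_frames = {datetime(2022, 2, 18, 3), datetime(2022, 2, 26, 3)}


def hourly_range(start, end):
    """Yield hourly timestamps from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(hours=1)


# Only frames inside [sd, ed] that exist in the deposit can produce results.
runs = [
    (space, frame)
    for space in spaces
    for frame in hourly_range(sd, ed)
    if frame in available_frames
]
print(runs)
```

With the window above, only the 18th Feb. 2022 03:00 frame at 1000m would be processed; widening `ed` to cover 26th Feb. 2022 03:00 would add the second frame.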






