Feature selection¶

Author: Nele Albers

Date: June 2024

Here we select three of the base state features.

Required files:

  • Data/all_states
  • Data/data_rl_samples.csv

Authored by Nele Albers, Francisco S. Melo, Mark A. Neerincx, Olya Kudina, and Willem-Paul Brinkman.

Import packages¶

Let's first import the packages we need.

In [5]:
import numpy as np
import pandas as pd
import pickle
import random

# For RL-related computations
import compute_dynamics_feat_sel as dyn

Constants¶

And we define constants we use throughout.

In [6]:
NUM_ACTIONS = 2

Load data¶

Let's load the data. This includes the dataframe with the non-abstracted RL samples as well as a file that contains all non-abstracted states (i.e., also the states of people who did not arrive in session 2). We need the latter to compute percentiles for the state features.

In [8]:
def eval_with_nan(x):
    """Parse a list-valued string, turning literal 'nan' entries into np.nan."""
    x = x.replace('nan', '"nan"')  # Quote bare 'nan' tokens so the string is a valid expression
    lst = pd.eval(x)
    lst = [np.nan if i == 'nan' else i for i in lst]  # Map the placeholder strings back to np.nan
    return lst

with open("Data/all_states", "rb") as f:
    all_states_for_abstraction = pickle.load(f)

data = pd.read_csv("Data/data_rl_samples.csv", 
                    converters={'s0': eval_with_nan, 's1': eval_with_nan})

data.head()
Out[8]:
rand_id session s0 a effort s1 activity dropout_response s0_imp s0_se ... s0_diff s0_session s1_imp s1_se s1_hs s1_energy s1_diff s1_session cons_id Prev_Feedback_Count
0 P622 1 [4, 4, 5, 5, 1.75, 0] 0 8 [6.0, 5.0, 9.0, 7.0, 2.6319444444444446, 1] 4 0 4 4 ... 1.750000 0 6.0 5.0 9.0 7.0 2.631944 1 0 0
1 P904 1 [1, 4, 0, 6, 1.8125, 0] 0 0 [3.0, 4.0, -1.0, 8.0, 2.25, 1] 15 0 1 4 ... 1.812500 0 3.0 4.0 -1.0 8.0 2.250000 1 2 0
2 P665 1 [8, 9, 9, 3, 2.2916666666666665, 0] 0 8 [9.0, 8.0, 10.0, 7.0, 1.8125, 1] 29 3 8 9 ... 2.291667 0 9.0 8.0 10.0 7.0 1.812500 1 1 0
3 P991 1 [9, 9, 10, 5, 1.8611111111111112, 0] 0 8 [10.0, 8.0, 10.0, 5.0, 0.5902777777777778, 1] 36 0 9 9 ... 1.861111 0 10.0 8.0 10.0 5.0 0.590278 1 3 0
4 P239 1 [9, 6, 9, 6, 1.0138888888888888, 0] 0 9 [10.0, 8.0, 10.0, 7.0, 1.7430555555555556, 1] 22 4 9 6 ... 1.013889 0 10.0 8.0 10.0 7.0 1.743056 1 4 0

5 rows × 22 columns

Below we give an overview of the number of samples we have and the number of people they are from.

In [9]:
all_people = list(data['rand_id'].unique())
NUM_PEOPLE = len(all_people)

print("Total number of samples: " + str(len(data)) + ".")
print("Total number of people: " + str(NUM_PEOPLE) + ".")
Total number of samples: 2326.
Total number of people: 679.
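Since people can drop out between sessions, the number of samples per person varies. A small sketch of how one could inspect that distribution with a pandas `groupby` (the toy dataframe here is illustrative, not the study data):

```python
import pandas as pd

# Illustrative stand-in for the RL-sample dataframe: one row per transition
toy = pd.DataFrame({
    "rand_id": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "session": [1, 2, 3, 1, 1, 2],
})

# Number of samples contributed by each person
samples_per_person = toy.groupby("rand_id").size()
print(samples_per_person.to_dict())            # {'P1': 3, 'P2': 1, 'P3': 2}
print(round(samples_per_person.mean(), 2))     # 2.0
```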

Feature selection¶

Now we select base state features.

We have these candidate features:

0: importance of preparing for quitting smoking/vaping
1: self-efficacy for preparing for quitting smoking/vaping
2: appreciation of human feedback
3: energy
4: difficulty of assigned activity
5: phase of the intervention
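When interpreting the selected indices later on, it can help to keep a lookup from index to feature name. The dict below is just an illustration based on the list above, not something the analysis code requires:

```python
# Hypothetical helper: map candidate-feature indices to human-readable names
FEATURE_NAMES = {
    0: "importance of preparing for quitting smoking/vaping",
    1: "self-efficacy for preparing for quitting smoking/vaping",
    2: "appreciation of human feedback",
    3: "energy",
    4: "difficulty of assigned activity",
    5: "phase of the intervention",
}

feat_sel = [0, 2, 1]  # example selection, matching the output further below
print([FEATURE_NAMES[f] for f in feat_sel])
```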
In [10]:
MAX_TRAIN_SESSION = 4  # 4th session
MIN_TRAIN_SESSION = 1 # First session
NUM_FEAT_TO_SELECT = 3  # Number of features to select
CANDIDATE_FEATURES = [0, 1, 2, 3, 4, 5]  # Features to select from
VALS_PER_FEAT_TO_SELECT = [3, 2, 2]  # Number of values per selected feature

# Use only data on specified sessions as training data
data_train = data.copy(deep=True)
data_train = data_train[(data_train['session'] <= MAX_TRAIN_SESSION) & (data_train["session"] >= MIN_TRAIN_SESSION)]

# Mean effort spent
effort_mean = data_train["effort"].mean()
print("Average effort response: " + str(round(effort_mean, 2)))

# Select state features
random.seed(1)  # For reproducibility
feat_sel, _ = dyn.feature_selection_notabstracted(data_train[["s0", "s1", "a", "effort"]].values.tolist(), 
                                                  effort_mean,
                                                  CANDIDATE_FEATURES,
                                                  vals_per_feat_to_select = VALS_PER_FEAT_TO_SELECT, 
                                                  all_states_for_abstraction = all_states_for_abstraction,
                                                  num_act = NUM_ACTIONS)

# Map the returned indices back to the original feature indices, in case
# features other than the last were removed from the candidate list.
feat_sel = [CANDIDATE_FEATURES[feat_sel[i]] for i in range(NUM_FEAT_TO_SELECT)]

print("\nChosen features:", feat_sel)
Average effort response: 5.74
First feature selected: 0 First feature -> min. p-value: 0.0007
Feature selected: 2
Criterion: Min p-value: 0.0
Feature selected: 1
Criterion: Min p-value: 0.1444

Chosen features: [0, 2, 1]