Author: Nele Albers
Date: January 2025
Let's compare the effects of different human feedback costs over time. In doing so, we also reproduce the policies from Table 2, Figure 3, Supplementary Figure 7, and Supplementary Figure 8.
We perform two types of simulations. First, we run simulations to reproduce Figure 3. Second, we run simulations of our hypothetical live application described for RQ3, reproducing Supplementary Figure 8.
Required files:
- Data/all_abstract_states_with_session.csv
- Data/data_rl_samples_abstracted[0, 1, 2][3, 2, 2].csv
- Intermediate_Results/_qvals_012_0.85_[3, 2, 2]
- Intermediate_Results/_reward_func_012_0.85_[3, 2, 2]
- Intermediate_Results/_trans_func_012_0.85_[3, 2, 2]
- for each cost in [0.02, 0.03, 0.05, 0.055, 0.07, 0.09, 0.1, 0.102, 0.12, 0.18]:
- Intermediate_Results/_qvals_012_0.85_[3, 2, 2]_cost<cost>
Created files:
- Figures/initial_state_distribution_study.pdf (Supplementary Figure 7)
- Figures/policy_comp_stable_population_feedback_amount.pdf (Supplementary Figure 8b)
- Figures/policy_comp_stable_population_reward.pdf (Supplementary Figure 8a)
- Figures/policy_comp_with_costs_cost.pdf (Figure 3b)
- Figures/policy_comp_with_costs_reward.pdf (Figure 3a)
Authored by Nele Albers, Francisco S. Melo, Mark A. Neerincx, Olya Kudina, and Willem-Paul Brinkman.
Let's load the packages that we need.
from copy import deepcopy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import random
import seaborn as sns
And we define some variables we use throughout.
FEAT_SEL = [0, 1, 2]
NUM_VALS_PER_FEATURE = [3, 2, 2]
DISCOUNT_FACTOR = 0.85
COSTS = [0.02, 0.03, 0.05, 0.055, 0.07, 0.09, 0.1, 0.102, 0.12, 0.18] # human feedback costs
PATH = "Intermediate_Results/" # pre-fix for path for storing results
path_to_save = str(str(FEAT_SEL[0]) + str(FEAT_SEL[1]) + str(FEAT_SEL[2]) + "_" + str(DISCOUNT_FACTOR) + "_" + str(NUM_VALS_PER_FEATURE))
states = [[i, j, k] for i in range(NUM_VALS_PER_FEATURE[0]) for j in range(NUM_VALS_PER_FEATURE[1]) for k in range(NUM_VALS_PER_FEATURE[2])]
num_states = len(states)
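The list comprehension enumerates the 12 abstract states in lexicographic order, which is also the order of the "000", "001", ... labels used for the figures later on. As an optional sanity check (not part of the original analysis), one could verify this explicitly:
# Optional check (assumption: each label is the concatenation of the state's feature values)
state_labels = ["".join(str(v) for v in state) for state in states]
assert state_labels == ["000", "001", "010", "011", "100", "101",
                        "110", "111", "200", "201", "210", "211"]
print(num_states, "states:", state_labels)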
And we set the styling for our figures.
sns.set()
sns.set_style("white")
med_fontsize = 22
small_fontsize = 18
extrasmall_fontsize = 15
sns.set_context("paper", rc={"font.size":small_fontsize,"axes.titlesize":med_fontsize,"axes.labelsize":med_fontsize,
'xtick.labelsize':small_fontsize, 'ytick.labelsize':small_fontsize,
'legend.fontsize':extrasmall_fontsize,'legend.title_fontsize': extrasmall_fontsize})
# Colors for different human feedback costs
colors = ["#d0e7ca", "#b5ddd8",
"#9bd2e1", "#81c4e7",
"#7eb2e4", "#FFC20A",
"#E66100", "#9398d2",
"#9d7db2", "#906388",
"#684957"]
Let's load the data on the transitions.
data = pd.read_csv("Data/data_rl_samples_abstracted" + str(FEAT_SEL) + str(NUM_VALS_PER_FEATURE) + ".csv",
converters={'s0': eval, 's1': eval})
And we also load the data on all states, which includes the first states of people with no data from session 2.
df_all_states = pd.read_csv("Data/all_abstract_states_with_session.csv",
converters={'state': eval})
We load the previously computed reward function, transition function, and Q-values.
with open(PATH + "_reward_func_" + path_to_save, "rb") as f:
reward_func = pickle.load(f)
with open(PATH + "_trans_func_" + path_to_save, "rb") as f:
trans_func = pickle.load(f)
with open(PATH + "_qvals_" + path_to_save, "rb") as f:
q_vals = pickle.load(f)
Now we compute the policies and resulting transitions and rewards for different human feedback costs. This reproduces policies from Table 2 in the paper.
reward_func_policiescost = []
trans_func_policiescost = []
policies_cost = []
opt_policy = np.array([np.argmax(q_vals[state]) for state in range(len(q_vals))])
policies_cost.append(opt_policy)
trans_func_policiescost.append(np.array([trans_func[state][opt_policy[state]] for state in range(num_states)]))
reward_func_policiescost.append(np.array([reward_func[state][opt_policy[state]] for state in range(num_states)]))
for cost in COSTS:
with open(PATH + "_qvals_" + path_to_save + "_cost" + str(cost), "rb") as f:
q_vals_cost = pickle.load(f)
policycost = np.array([np.argmax(q_vals_cost[state]) for state in range(len(q_vals_cost))])
policies_cost.append(policycost)
trans_func_policiescost.append(np.array([trans_func[state][policycost[state]] for state in range(num_states)]))
reward_func_policiescost.append(np.array([reward_func[state][policycost[state]] for state in range(num_states)]))
print("States with human feedback for cost " + str(cost) + ":")
for state_idx, state in enumerate(states):
if policycost[state_idx] == 1:
print(state)
States with human feedback for cost 0.02: [0, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [2, 0, 0], [2, 1, 1]
States with human feedback for cost 0.03: [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [2, 0, 0], [2, 1, 1]
States with human feedback for cost 0.05: [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1], [2, 1, 1]
States with human feedback for cost 0.055: [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0], [1, 1, 1], [2, 1, 1]
States with human feedback for cost 0.07: [0, 1, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1], [2, 1, 1]
States with human feedback for cost 0.09: [0, 1, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]
States with human feedback for cost 0.1: [0, 1, 0], [1, 1, 0], [1, 1, 1]
States with human feedback for cost 0.102: [1, 1, 0], [1, 1, 1]
States with human feedback for cost 0.12: [1, 1, 1]
States with human feedback for cost 0.18: (none)
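To inspect these policies side by side (roughly the layout of Table 2), one could collect them into a single table. Below is a minimal sketch, assuming the policies_cost list built above, with 1 meaning human feedback is given in that state and 0 meaning it is not:
# Optional overview of all computed policies: one row per abstract state,
# one column per human feedback cost (1 = give human feedback, 0 = no human feedback)
policy_overview = pd.DataFrame(
    np.array(policies_cost).T,
    index=["".join(str(v) for v in state) for state in states],
    columns=["cost 0"] + ["cost " + str(cost) for cost in COSTS])
print(policy_overview)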
As the starting population for our simulations, we use the distribution of people across the base states that we observed in the first session of our study.
all_states_s1 = df_all_states[df_all_states['session'] == 1]
all_states_s1 = all_states_s1.reset_index(drop=True)
all_states_count = np.zeros(num_states)
for p in range(len(all_states_s1)):
state = list(np.take(all_states_s1.iloc[p]["state"], FEAT_SEL))
state_idx = states.index(state)
all_states_count[state_idx] += 1
all_states_frac = all_states_count/sum(all_states_count)
print("Fraction of people in each state in session 1:", np.round(all_states_frac, 2))
Fraction of people in each state in session 1: [0.34 0.17 0.06 0.05 0.07 0.05 0.02 0.04 0.03 0.07 0.02 0.07]
And let's also plot this starting population. This reproduces Supplementary Figure 7.
fig, ax = plt.subplots(figsize=(10,5))
x_vals = np.arange(num_states)
plt.bar(x_vals, all_states_frac, color = '#bbccee')
ax.set_ylim([0, 1])
ax.set_ylabel("Fraction of people")
ax.set_xlabel("Combination of state feature values")
plt.xticks(x_vals, ["000", "001", "010", "011", "100", "101", "110", "111", "200", "201", "210", "211"])
plt.savefig("Figures/initial_state_distribution_study.pdf", dpi=1500,
bbox_inches='tight', pad_inches=0)
Next, we compare the policies over time: starting from the observed session-1 state distribution, we repeatedly apply the transition function under each policy and track the mean reward per transition as well as the mean fraction of previous activity assignments with human feedback for different numbers of time steps. The resulting curves reproduce Figure 3.
num_steps_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
initial_pop = all_states_frac
initial_pop_size = sum(initial_pop)
policy_names = ["0"] + COSTS
rew_policy_lists = []
cost_policy_lists = []
# For the optimal policy and all cost-adapted policies
for policy in range(len(COSTS) + 1):
print("\nOptimal policy based on cost", policy_names[policy])
rew_policy_list = []
cost_policy_list = []
for num_steps in num_steps_list:
print("\n Number of time steps:", num_steps)
pop_policy = initial_pop
rew_policy = 0
cost_policy = 0
for t in range(num_steps):
cost_policy += sum(np.array(pop_policy) * np.array(policies_cost[policy]))
trans_time_rew = trans_func_policiescost[policy] * reward_func_policiescost[policy] # element-wise multiplication
rew_policy += sum((trans_time_rew.T).dot(pop_policy)) # total reward for transitions
pop_policy = (trans_func_policiescost[policy].T).dot(pop_policy) # new population
print(" -> mean reward:", round(rew_policy/(num_steps * initial_pop_size), 2))
rew_policy_list.append(rew_policy/(num_steps * initial_pop_size))
print(" -> mean fraction with human feedback:", round(cost_policy/(num_steps * initial_pop_size), 2))
cost_policy_list.append(cost_policy/(num_steps * initial_pop_size))
rew_policy_lists.append(rew_policy_list)
cost_policy_lists.append(cost_policy_list)
Mean reward per transition and mean fraction of previous activity assignments with human feedback, for 1, 2, 3, 5, 10, 20, 30, 50, and 100 time steps (columns, left to right):
Optimal policy based on cost 0
  mean reward:    0.55  0.56  0.57  0.59  0.61  0.63  0.64  0.65  0.66
  mean feedback:  0.91  0.91  0.90  0.89  0.89  0.89  0.89  0.89  0.89
Optimal policy based on cost 0.02
  mean reward:    0.55  0.56  0.57  0.59  0.61  0.63  0.64  0.65  0.65
  mean feedback:  0.74  0.74  0.74  0.75  0.76  0.78  0.79  0.79  0.80
Optimal policy based on cost 0.03
  mean reward:    0.55  0.56  0.57  0.58  0.60  0.62  0.63  0.64  0.64
  mean feedback:  0.40  0.42  0.44  0.47  0.52  0.57  0.60  0.62  0.63
Optimal policy based on cost 0.05
  mean reward:    0.55  0.56  0.57  0.58  0.60  0.62  0.63  0.64  0.64
  mean feedback:  0.37  0.39  0.41  0.44  0.49  0.55  0.57  0.59  0.61
Optimal policy based on cost 0.055
  mean reward:    0.55  0.56  0.56  0.58  0.60  0.62  0.63  0.63  0.64
  mean feedback:  0.31  0.33  0.35  0.38  0.44  0.50  0.53  0.55  0.56
Optimal policy based on cost 0.07
  mean reward:    0.55  0.55  0.56  0.57  0.59  0.61  0.62  0.63  0.63
  mean feedback:  0.25  0.27  0.29  0.33  0.39  0.45  0.48  0.51  0.53
Optimal policy based on cost 0.09
  mean reward:    0.54  0.54  0.55  0.56  0.57  0.59  0.59  0.60  0.60
  mean feedback:  0.18  0.17  0.17  0.16  0.16  0.16  0.16  0.16  0.16
Optimal policy based on cost 0.1
  mean reward:    0.53  0.54  0.55  0.55  0.57  0.58  0.59  0.60  0.60
  mean feedback:  0.12  0.13  0.13  0.13  0.13  0.13  0.13  0.13  0.13
Optimal policy based on cost 0.102
  mean reward:    0.53  0.54  0.54  0.55  0.56  0.58  0.58  0.59  0.59
  mean feedback:  0.07  0.07  0.08  0.08  0.08  0.09  0.09  0.09  0.09
Optimal policy based on cost 0.12
  mean reward:    0.53  0.54  0.54  0.55  0.56  0.57  0.58  0.58  0.59
  mean feedback:  0.04  0.05  0.05  0.05  0.06  0.06  0.06  0.06  0.07
Optimal policy based on cost 0.18
  mean reward:    0.53  0.53  0.53  0.54  0.55  0.56  0.56  0.57  0.57
  mean feedback:  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
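Note that the loop above propagates the population analytically instead of sampling individual trajectories. Below is a minimal sketch of a single propagation step, mirroring the loop body above (assuming pop is the population vector, actions the 0/1 action vector of a policy, and trans and rew the corresponding entries of trans_func_policiescost and reward_func_policiescost):
def propagate_one_step(pop, actions, trans, rew):
    # Expected number of people who receive human feedback in this step
    feedback = np.sum(np.array(pop) * np.array(actions))
    # Expected total reward accumulated over the transitions in this step
    reward = np.sum((trans * rew).T.dot(pop))
    # Expected population per state after the transitions
    new_pop = trans.T.dot(pop)
    return feedback, reward, new_pop
For example, propagate_one_step(initial_pop, policies_cost[0], trans_func_policiescost[0], reward_func_policiescost[0]) returns the quantities added in the first iteration for the optimal policy without feedback costs.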
Let's plot the mean reward per transition. This reproduces Figure 3a.
fig = plt.figure(figsize=(10,10))
fig_lower_bound = 0 # y-axis lower limit for figure
fig_higher_bound = 1 # y-axis upper limit for figure
x_vals = np.arange(len(num_steps_list))
ax1 = fig.add_subplot(111)
# Plot average reward for policies
for policy in range(len(COSTS) + 1):
plt.plot(x_vals, rew_policy_lists[policy], color = colors[policy], label = policy_names[policy])
ax1.legend(ncols = 5, bbox_to_anchor = (0.985, 1.15))
ax1.set_ylabel("Mean reward per previous activity assignment")
ax1.set_xlabel("Number of time steps")
ax1.set_ylim([fig_lower_bound, fig_higher_bound])
ax1.set_xticks(x_vals, num_steps_list)
# Create zoomed-in plot
anchor_x = .20
anchor_y = .2
height_ax2 = 0.2
width_ax2 = 0.376
ax2 = plt.axes([anchor_x, anchor_y, width_ax2, height_ax2])
for policy in range(len(COSTS) + 1):
ax2.plot(x_vals[4:], rew_policy_lists[policy][4:], color = colors[policy])
plt.setp(ax2, xticks=[], yticks=[])
# Highlight where zoomed-in plot is
buffer_y = 0.02
buffer_x = 0.1
ax1.plot([4 - buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[-1][4] - buffer_y], color = 'black')
ax1.plot([4 - buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[0][-1] + buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
ax1.plot([4 - buffer_x, 4 - buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
ax1.plot([x_vals[-1] + buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
# Dashed lines
ax1.plot([4 - buffer_x, 0.45],
[rew_policy_lists[0][-1] + buffer_y, anchor_y + height_ax2 - 0.025],
color = 'black',
linestyle = 'dashed')
ax1.plot([x_vals[-1] + buffer_x, 4.8],
[rew_policy_lists[-1][4] - buffer_y, anchor_y + height_ax2 - height_ax2 - 0.08],
color = 'black',
linestyle = 'dashed')
plt.savefig("Figures/policy_comp_with_costs_reward.pdf", dpi=1500,
bbox_inches='tight', pad_inches=0)
And we also plot the mean fraction of previous activity assignments with feedback. This reproduces Figure 3b.
plt.figure(figsize=(10,10))
fig_lower_bound = -0.05 # y-axis lower limit for figure
fig_higher_bound = 1 # y-axis upper limit for figure
x_vals = np.arange(len(num_steps_list))
# Plot mean cost per policy
for policy in range(len(COSTS) + 1):
plt.plot(x_vals, cost_policy_lists[policy], color = colors[policy], label = policy_names[policy],
linewidth = 2)
plt.legend(ncols = 5, bbox_to_anchor = (0.985, 1.15))
plt.ylabel("Mean fraction of previous\nactivity assignments with feedback")
plt.xlabel("Number of time steps")
plt.ylim([fig_lower_bound, fig_higher_bound])
plt.xticks(x_vals, num_steps_list)
plt.savefig("Figures/policy_comp_with_costs_cost.pdf", dpi=1500,
bbox_inches='tight', pad_inches=0)
Next, we define the function that simulates our hypothetical live application for RQ3: new people enter whenever there is capacity, complete at most a fixed number of sessions, and may drop out after each session.
def simulate(num_timesteps,
dropout,
capacity,
max_add_at_once,
max_timesteps,
num_starting_people,
opt_policy,
num_sessions_per_waitingtime,
all_states_frac,
trans_func,
reward_func,
all_states,
num_vals_feat_waitingtime,
random_seed = 1):
"""
num_timesteps: max number of sessions per person
dropout: ratio of people dropping out randomly each session
capacity: max number of people that can be in the application at a time
max_add_at_once: max. number of new people we add at each time step to fill available spots
max_timesteps: max. number of timesteps we want to track the application for
num_starting_people: number of people starting in first session
opt_policy: policy to use
num_sessions_per_waitingtime: number of sessions per waiting time feature value
all_states_frac: fraction of people in each state at start
trans_func: transition function
reward_func: reward function
all_states: all possible states
num_vals_feat_waitingtime: number of values for the waiting time state feature
random_seed: random seed
"""
random.seed(random_seed)
# Create a starting population of participants
pop = random.choices(all_states, k = num_starting_people, weights = all_states_frac)
# Set waiting time feature to 0
for p in range(len(pop)):
pop[p] = deepcopy(pop[p]) + [0]
pop_all = [pop]
sessions_curr = [0] * num_starting_people # keeps track of current session of each person
waitingtimes_curr = [0] * num_starting_people # keeps track of waiting times of each person since last human support
action_sum = [] # number of human involvement per time step
reward_sum = [] # total reward per time step
num_people_prev = num_starting_people
t = 0 # time step
# while there are people left
while len(sessions_curr) > 0 and t < max_timesteps:
pop_new = []
action_sum.append(0)
reward_sum.append(0)
# add new people if there is capacity left and we can still finish those people
if len(sessions_curr) < capacity and t < (max_timesteps - num_timesteps + 1):
num_to_add = min(capacity - len(sessions_curr), max_add_at_once)
pop_added = random.choices(all_states, k = num_to_add, weights = all_states_frac)
# Set waiting time state feature to 0
for p in range(len(pop_added)):
pop_added[p] = deepcopy(pop_added[p]) + [0]
pop_all[t] += pop_added
# also keep track of current session and of waiting times for new people
sessions_curr = sessions_curr + [0 for i in range(num_to_add)]
waitingtimes_curr = waitingtimes_curr + [0 for i in range(num_to_add)]
num_people_prev += num_to_add
# get action and next state and reward for each person
for p in range(num_people_prev):
action = opt_policy[all_states.index(pop_all[t][p][0:3])]
action_sum[t] += action
# get next state
base_state_new = random.choices(all_states, k = 1, weights = trans_func[all_states.index(pop_all[t][p][0:3])][action])
# add waiting time feature
state_new = deepcopy(base_state_new[0]) + [0]
pop_new.append(state_new)
reward_sum[t] += reward_func[all_states.index(pop_all[t][p][0:3])][action]
# set waiting time to 0 when human is involved
if action == 1:
pop_new[p][3] = 0
waitingtimes_curr[p] = 0
# increase waiting time when no human is involved
elif action == 0:
# move to next waiting phase when user is in last session of a not-last waiting time value
if waitingtimes_curr[p] in [i * num_sessions_per_waitingtime - 1 for i in range(1, num_vals_feat_waitingtime)]:
pop_new[p][3] += 1
waitingtimes_curr[p] += 1
# increase number of sessions completed per person
sessions_curr = [i + 1 for i in sessions_curr]
# remove people who have had all sessions
pop_new = [pop_new[p_idx] for p_idx in range(len(pop_new)) if sessions_curr[p_idx] < num_timesteps]
waitingtimes_curr = [waitingtimes_curr[p_idx] for p_idx in range(len(waitingtimes_curr)) if sessions_curr[p_idx] < num_timesteps]
sessions_curr = [p for p in sessions_curr if p < num_timesteps]
num_people_prev = len(sessions_curr)
# some people drop out after the session, i.e. do not arrive in next session
# number of people to drop out
num_people_dropout = round(num_people_prev* dropout)
# randomly choose people drop out
dropout_ids = random.sample(range(0, len(pop_new)), num_people_dropout)
# remove indices of those people
pop_new = [pop_new[p_idx] for p_idx in range(len(pop_new)) if not p_idx in dropout_ids]
sessions_curr = [sessions_curr[p_idx] for p_idx in range(len(sessions_curr)) if not p_idx in dropout_ids]
waitingtimes_curr = [waitingtimes_curr[p_idx] for p_idx in range(len(waitingtimes_curr)) if not p_idx in dropout_ids]
pop_all.append(pop_new)
num_people_prev = num_people_prev - num_people_dropout
t += 1
return action_sum, reward_sum
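Below is a minimal usage sketch of this function with deliberately small, hypothetical parameters (not the ones used for the paper), just to illustrate the two returned lists:
# Hypothetical toy parameters purely for illustration
toy_action_sum, toy_reward_sum = simulate(num_timesteps = 9,
                                          dropout = 0.15,
                                          capacity = 20,
                                          max_add_at_once = 10,
                                          max_timesteps = 30,
                                          num_starting_people = 10,
                                          opt_policy = policies_cost[0],
                                          num_sessions_per_waitingtime = 3,
                                          all_states_frac = all_states_frac,
                                          trans_func = trans_func,
                                          reward_func = reward_func,
                                          all_states = states,
                                          num_vals_feat_waitingtime = 3,
                                          random_seed = 0)
# One entry per simulated time step: number of people with human feedback and total reward
print(len(toy_action_sum), "time steps simulated")
print("People with human feedback at the first time step:", toy_action_sum[0])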
To define a dropout rate for our simulation, we look at the return likelihood rating from our study. We assume that people with a negative return likelihood would have dropped out.
total = 0
for session in range(1, 5):
data_s = data[data["session"] == session]
data_return_negative_s = data_s[data_s["dropout_response"] < 0]
fraction = len(data_return_negative_s)/len(data_s)
total += fraction
print("Potential dropout after session", session, ":", len(data_return_negative_s), "out of", len(data_s), "->", round(fraction * 100, 2), "%")
avg_dropout_rate = total/4
print("\nAverage:", round(avg_dropout_rate * 100, 2), "%")
Potential dropout after session 1 : 112 out of 679 -> 16.49 % Potential dropout after session 2 : 94 out of 599 -> 15.69 % Potential dropout after session 3 : 90 out of 544 -> 16.54 % Potential dropout after session 4 : 66 out of 504 -> 13.1 % Average: 15.46 %
And now we perform the simulations of the hypothetical live application.
capacity = 166 # number of people we can have in the application at the same time
max_add_at_once = 100 # max. number of people we add at once
max_timesteps = 365 # max. number of timesteps we want to follow the application
num_starting_people = max_add_at_once # number of people to start with
num_sessions_per_waitingtime = 3
num_vals_feat_waitingtime = 3
max_sessions_per_person = num_sessions_per_waitingtime * num_vals_feat_waitingtime
repetitions = 10 # number of times to repeat the simulation
action_sums_policies = []
reward_sums_policies = []
for cost_idx in range(len(COSTS) + 1):
print("... simulating policy with cost:", policy_names[cost_idx])
action_sums = np.zeros((repetitions, max_timesteps))
reward_sums = np.zeros((repetitions, max_timesteps))
for repetition in range(repetitions):
action_sum, reward_sum = simulate(max_sessions_per_person,
avg_dropout_rate,
capacity,
max_add_at_once,
max_timesteps,
num_starting_people,
policies_cost[cost_idx],
num_sessions_per_waitingtime,
all_states_frac,
trans_func,
reward_func,
states,
num_vals_feat_waitingtime,
random_seed = repetition)
for i in range(len(action_sum), max_timesteps):
action_sum.append(0)
action_sums[repetition] = np.array(action_sum)
for i in range(len(reward_sum), max_timesteps):
reward_sum.append(0)
reward_sums[repetition] = np.array(reward_sum)
action_sums_policies.append(action_sums)
reward_sums_policies.append(reward_sums)
... simulating policy with cost: 0
... simulating policy with cost: 0.02
... simulating policy with cost: 0.03
... simulating policy with cost: 0.05
... simulating policy with cost: 0.055
... simulating policy with cost: 0.07
... simulating policy with cost: 0.09
... simulating policy with cost: 0.1
... simulating policy with cost: 0.102
... simulating policy with cost: 0.12
... simulating policy with cost: 0.18
Let's create combined figures.
First, we plot the percentage of people getting human feedback per day. This reproduces Supplementary Figure 8b.
plt.figure(figsize=(10,10))
for policy_idx in range(len(COSTS) + 1):
actions_sums_mean = np.mean(action_sums_policies[policy_idx], axis = 0)/capacity * 100
error = np.array([np.std(action_sums_policies[policy_idx][:, i]/capacity * 100) for i in range(max_timesteps)])
plt.plot([i for i in range(max_timesteps)], actions_sums_mean, color = colors[policy_idx], label = policy_names[policy_idx])
plt.fill_between([i for i in range(max_timesteps)], actions_sums_mean - error, actions_sums_mean + error, alpha = 0.1,
color = colors[policy_idx])
plt.ylabel("Percentage of people\ngetting human feedback")
plt.xlabel("Day")
plt.legend(ncols = 4, bbox_to_anchor = (0.95, 1.15))
plt.savefig("Figures/policy_comp_stable_population_feedback_amount.pdf", dpi=1500,
bbox_inches='tight', pad_inches=0)
And we also create a figure for the mean reward per day. This reproduces Supplementary Figure 8a.
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(111)
for policy_idx in range(len(COSTS) + 1):
reward_sums_mean = np.mean(reward_sums_policies[policy_idx], axis = 0)/capacity
error = np.array([np.std(reward_sums_policies[policy_idx][:, i]/capacity) for i in range(max_timesteps)])
ax1.plot([i for i in range(max_timesteps)], reward_sums_mean,
color = colors[policy_idx], label = policy_names[policy_idx])
ax1.fill_between([i for i in range(max_timesteps)], reward_sums_mean - error, reward_sums_mean + error, alpha = 0.1,
color = colors[policy_idx])
ax1.set_ylabel("Mean reward")
ax1.set_xlabel("Day")
ax1.legend(ncols = 4, bbox_to_anchor = (0.95, 1.15))
ax1.set_ylim(0, 0.8)
# Create zoomed-in plot
anchor_x = .20
anchor_y = .15
height_ax2 = 0.376
width_ax2 = 0.376
ax2 = plt.axes([anchor_x, anchor_y, width_ax2, height_ax2])
ax2_val_cutoff_1 = 50
ax2_val_cutoff_2 = 100
for policy_idx in range(len(COSTS) + 1):
reward_sums_mean = (np.mean(reward_sums_policies[policy_idx], axis = 0)/capacity)[ax2_val_cutoff_1:ax2_val_cutoff_2]
error = np.array([np.std(reward_sums_policies[policy_idx][:, i]/capacity) for i in range(max_timesteps)])[ax2_val_cutoff_1:ax2_val_cutoff_2]
ax2.plot([i for i in range(max_timesteps)][ax2_val_cutoff_1:ax2_val_cutoff_2], reward_sums_mean,
color = colors[policy_idx], label = policy_names[policy_idx])
ax2.fill_between([i for i in range(max_timesteps)][ax2_val_cutoff_1:ax2_val_cutoff_2], reward_sums_mean - error, reward_sums_mean + error, alpha = 0.1,
color = colors[policy_idx])
plt.setp(ax2, xticks=[], yticks=[])
# Highlight where zoomed-in plot is
buffer_y = 0.031
buffer_x = 0
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_2 + buffer_x],
[(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y],
color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_2 + buffer_x],
[(np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y],
color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_1 - buffer_x],
[(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y],
color = 'black')
ax1.plot([ax2_val_cutoff_2 + buffer_x, ax2_val_cutoff_2 + buffer_x],
[(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y],
color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, 20],
[(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, anchor_y + height_ax2 - 0.095],
color = 'black',
linestyle = 'dashed')
ax1.plot([ax2_val_cutoff_2 + buffer_x, 216],
[(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, anchor_y + height_ax2 - 0.095],
color = 'black',
linestyle = 'dashed')
plt.savefig("Figures/policy_comp_stable_population_reward.pdf", dpi=1500,
bbox_inches='tight', pad_inches=0)