Long-term effect of limited human feedback¶

Author: Nele Albers

Date: January 2025

Let's compare the effects of different human feedback costs over time. In doing so, we also reproduce the policies from Table 2 as well as Figure 3, Supplementary Figure 7, and Supplementary Figure 8.

We perform two types of simulations. First, we simulate a single group of people who all start at the same time, which reproduces Figure 3. Then we simulate the hypothetical live application described for RQ3, which reproduces Supplementary Figure 8.

Required files:

- Data/all_abstract_states_with_session.csv
- Data/data_rl_samples_abstracted[0, 1, 2][3, 2, 2].csv
- Intermediate_Results/_qvals_012_0.85_[3, 2, 2]
- Intermediate_Results/_reward_func_012_0.85_[3, 2, 2]
- Intermediate_Results/_trans_func_012_0.85_[3, 2, 2]
- for each cost in [0.02, 0.03, 0.05, 0.055, 0.07, 0.09, 0.1, 0.102, 0.12, 0.18]:
    - Intermediate_Results/_qvals_012_0.85_[3, 2, 2]_cost<cost>


Created files:

- Figures/initial_state_distribution_study.pdf (Supplementary Figure 7)
- Figures/policy_comp_stable_population_feedback_amount.pdf (Supplementary Figure 8b)
- Figures/policy_comp_stable_population_reward.pdf (Supplementary Figure 8a)
- Figures/policy_comp_with_costs_cost.pdf (Figure 3b)
- Figures/policy_comp_with_costs_reward.pdf (Figure 3a)

Authored by Nele Albers, Francisco S. Melo, Mark A. Neerincx, Olya Kudina, and Willem-Paul Brinkman.

Setup¶

Load packages¶

Let's load the packages that we need.

In [1]:
from copy import deepcopy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import random
import seaborn as sns

Define variables¶

And we define some variables we use throughout.

In [2]:
FEAT_SEL = [0, 1, 2]
NUM_VALS_PER_FEATURE = [3, 2, 2]
DISCOUNT_FACTOR = 0.85
COSTS = [0.02, 0.03, 0.05, 0.055, 0.07, 0.09, 0.1, 0.102, 0.12, 0.18]  # human feedback costs

PATH = "Intermediate_Results/" # pre-fix for path for storing results
path_to_save =  str(str(FEAT_SEL[0]) + str(FEAT_SEL[1]) + str(FEAT_SEL[2]) + "_" + str(DISCOUNT_FACTOR) + "_" + str(NUM_VALS_PER_FEATURE))

states = [[i, j, k] for i in range(NUM_VALS_PER_FEATURE[0]) for j in range(NUM_VALS_PER_FEATURE[1]) for k in range(NUM_VALS_PER_FEATURE[2])]
num_states = len(states)

Figure styling¶

And we set the styling for our figures.

In [3]:
sns.set()
sns.set_style("white")

med_fontsize = 22
small_fontsize = 18
extrasmall_fontsize = 15
sns.set_context("paper", rc={"font.size":small_fontsize,"axes.titlesize":med_fontsize,"axes.labelsize":med_fontsize,
                            'xtick.labelsize':small_fontsize, 'ytick.labelsize':small_fontsize,
                            'legend.fontsize':extrasmall_fontsize,'legend.title_fontsize': extrasmall_fontsize})   

# Colors for different human feedback costs
colors = ["#d0e7ca", "#b5ddd8", 
          "#9bd2e1", "#81c4e7",
          "#7eb2e4", "#FFC20A", 
          "#E66100", "#9398d2", 
          "#9d7db2", "#906388", 
          "#684957"]

Load transition data¶

Let's load the data on the transitions.

In [4]:
data = pd.read_csv("Data/data_rl_samples_abstracted" + str(FEAT_SEL) + str(NUM_VALS_PER_FEATURE) + ".csv", 
                   converters={'s0': eval, 's1': eval})

Load data on all states¶

And we also load the data on all states, which also includes the first states of people for whom we have no data from session 2.

In [5]:
df_all_states = pd.read_csv("Data/all_abstract_states_with_session.csv",
                            converters={'state': eval})

Load previously computed dynamics and Q-values¶

We load the previously computed reward function, transition function, and Q-values.

In [6]:
with open(PATH + "_reward_func_" + path_to_save, "rb") as f:
    reward_func = pickle.load(f)
with open(PATH + "_trans_func_" + path_to_save, "rb") as f:
    trans_func = pickle.load(f)
with open(PATH + "_qvals_" + path_to_save, "rb") as f:
    q_vals = pickle.load(f)

Compute policies for different human feedback costs¶

Now we compute the policies for different human feedback costs, along with the transition and reward functions that result from following each policy. This reproduces the policies from Table 2 in the paper.
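
Each policy is simply greedy with respect to the corresponding Q-values: writing $Q_c$ for the Q-values previously computed with human feedback cost $c$ (with $c = 0$ the original Q-values) and $a = 1$ for involving a human,

$$\pi_c(s) = \arg\max_{a \in \{0, 1\}} Q_c(s, a).$$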

In [7]:
reward_func_policiescost = []
trans_func_policiescost = []
policies_cost = []

opt_policy = np.array([np.argmax(q_vals[state]) for state in range(len(q_vals))])
policies_cost.append(opt_policy)
trans_func_policiescost.append(np.array([trans_func[state][opt_policy[state]] for state in range(num_states)]))
reward_func_policiescost.append(np.array([reward_func[state][opt_policy[state]] for state in range(num_states)]))

for cost in COSTS:

    with open(PATH + "_qvals_" + path_to_save + "_cost" + str(cost), "rb") as f:
        q_vals_cost = pickle.load(f)
    policycost = np.array([np.argmax(q_vals_cost[state]) for state in range(len(q_vals_cost))])
    policies_cost.append(policycost)
    trans_func_policiescost.append(np.array([trans_func[state][policycost[state]] for state in range(num_states)]))
    reward_func_policiescost.append(np.array([reward_func[state][policycost[state]] for state in range(num_states)]))
    
    print("States with human feedback for cost " + str(cost) + ":")
    for state_idx, state in enumerate(states):
        if policycost[state_idx] == 1:
            print(state)
States with human feedback for cost 0.02:
[0, 0, 0]
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 0, 0]
[2, 1, 1]
States with human feedback for cost 0.03:
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 0, 0]
[2, 1, 1]
States with human feedback for cost 0.05:
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 1]
States with human feedback for cost 0.055:
[0, 1, 0]
[0, 1, 1]
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]
[2, 1, 1]
States with human feedback for cost 0.07:
[0, 1, 0]
[0, 1, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 1]
States with human feedback for cost 0.09:
[0, 1, 0]
[0, 1, 1]
[1, 1, 0]
[1, 1, 1]
States with human feedback for cost 0.1:
[0, 1, 0]
[1, 1, 0]
[1, 1, 1]
States with human feedback for cost 0.102:
[1, 1, 0]
[1, 1, 1]
States with human feedback for cost 0.12:
[1, 1, 1]
States with human feedback for cost 0.18:

Starting population¶

As the starting population for our simulations, we use the distribution of people across the base states that we observed in the first session of our study.
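
Concretely, with $n_s$ the number of session-1 participants observed in base state $s$, the cell below computes the empirical distribution

$$\hat{p}_0(s) = \frac{n_s}{\sum_{s'} n_{s'}}.$$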

In [8]:
all_states_s1 = df_all_states[df_all_states['session'] == 1]
all_states_s1 = all_states_s1.reset_index(drop=True)

all_states_count = np.zeros(num_states)

for p in range(len(all_states_s1)):
    state = list(np.take(all_states_s1.iloc[p]["state"], FEAT_SEL))
    state_idx = states.index(state)
    all_states_count[state_idx] += 1
    
all_states_frac = all_states_count/sum(all_states_count)

print("Fraction of people in each state in session 1:", np.round(all_states_frac, 2))
Fraction of people in each state in session 1: [0.34 0.17 0.06 0.05 0.07 0.05 0.02 0.04 0.03 0.07 0.02 0.07]

And let's also plot this starting population. This reproduces Supplementary Figure 7.

In [9]:
fig, ax = plt.subplots(figsize=(10,5))

x_vals = np.arange(num_states)
plt.bar(x_vals, all_states_frac, color = '#bbccee')

ax.set_ylim([0, 1])
ax.set_ylabel("Fraction of people")
ax.set_xlabel("Combination of state feature values")

plt.xticks(x_vals, ["000", "001", "010", "011", "100", "101", "110", "111", "200", "201", "210", "211"])

plt.savefig("Figures/initial_state_distribution_study.pdf", dpi=1500,
            bbox_inches='tight', pad_inches=0)

Simulate based on one group of people starting at the same time¶

Here we perform simulations based on a single group of people that all start at the same time.

Simulate mean cost and reward per transition¶

Now we simulate, for each policy, the mean reward per transition and the mean fraction of previous activity assignments with human feedback, over different numbers of time steps.
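
The simulation is exact rather than sampled: writing $p_t$ for the distribution of people over states at time step $t$, $P_\pi$ for the transition matrix under a policy $\pi$, and $\pi(s) \in \{0, 1\}$ for whether human feedback is given in state $s$, each step of the cell below computes

$$\text{cost}_t = p_t^{\top} \pi, \qquad p_{t+1} = P_\pi^{\top} p_t,$$

accumulates the expected reward of the transitions from $P_\pi$ and the reward function in the same way, and reports the totals divided by the number of time steps $T$ (the initial population sums to 1).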

In [10]:
num_steps_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
initial_pop = all_states_frac
initial_pop_size = sum(initial_pop)

policy_names = ["0"] + COSTS
rew_policy_lists = []
cost_policy_lists = []

# For the optimal policy and all cost-adapted policies
for policy in range(len(COSTS) + 1):
    
    print("\nOptimal policy based on cost", policy_names[policy])
    
    rew_policy_list = []
    cost_policy_list = []

    for num_steps in num_steps_list:
        print("\n   Number of time steps:", num_steps)

        pop_policy = initial_pop

        rew_policy = 0
        cost_policy = 0

        for t in range(num_steps):

            cost_policy += sum(np.array(pop_policy) * np.array(policies_cost[policy]))
            trans_time_rew = trans_func_policiescost[policy] * reward_func_policiescost[policy]  # element-wise multiplication
            rew_policy += sum((trans_time_rew.T).dot(pop_policy))  # total reward for transitions
            pop_policy = (trans_func_policiescost[policy].T).dot(pop_policy)  # new population


        print("    -> mean reward:", round(rew_policy/(num_steps * initial_pop_size), 2))
        rew_policy_list.append(rew_policy/(num_steps * initial_pop_size))
        print("    -> mean fraction with human feedback:", round(cost_policy/(num_steps * initial_pop_size), 2))
        cost_policy_list.append(cost_policy/(num_steps * initial_pop_size))
    
    rew_policy_lists.append(rew_policy_list)
    cost_policy_lists.append(cost_policy_list)
Optimal policy based on cost 0

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.91

   Number of time steps: 2
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.91

   Number of time steps: 3
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.9

   Number of time steps: 5
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.89

   Number of time steps: 10
    -> mean reward: 0.61
    -> mean fraction with human feedback: 0.89

   Number of time steps: 20
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.89

   Number of time steps: 30
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.89

   Number of time steps: 50
    -> mean reward: 0.65
    -> mean fraction with human feedback: 0.89

   Number of time steps: 100
    -> mean reward: 0.66
    -> mean fraction with human feedback: 0.89

Optimal policy based on cost 0.02

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.74

   Number of time steps: 2
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.74

   Number of time steps: 3
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.74

   Number of time steps: 5
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.75

   Number of time steps: 10
    -> mean reward: 0.61
    -> mean fraction with human feedback: 0.76

   Number of time steps: 20
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.78

   Number of time steps: 30
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.79

   Number of time steps: 50
    -> mean reward: 0.65
    -> mean fraction with human feedback: 0.79

   Number of time steps: 100
    -> mean reward: 0.65
    -> mean fraction with human feedback: 0.8

Optimal policy based on cost 0.03

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.4

   Number of time steps: 2
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.42

   Number of time steps: 3
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.44

   Number of time steps: 5
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.47

   Number of time steps: 10
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.52

   Number of time steps: 20
    -> mean reward: 0.62
    -> mean fraction with human feedback: 0.57

   Number of time steps: 30
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.6

   Number of time steps: 50
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.62

   Number of time steps: 100
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.63

Optimal policy based on cost 0.05

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.37

   Number of time steps: 2
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.39

   Number of time steps: 3
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.41

   Number of time steps: 5
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.44

   Number of time steps: 10
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.49

   Number of time steps: 20
    -> mean reward: 0.62
    -> mean fraction with human feedback: 0.55

   Number of time steps: 30
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.57

   Number of time steps: 50
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.59

   Number of time steps: 100
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.61

Optimal policy based on cost 0.055

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.31

   Number of time steps: 2
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.33

   Number of time steps: 3
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.35

   Number of time steps: 5
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.38

   Number of time steps: 10
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.44

   Number of time steps: 20
    -> mean reward: 0.62
    -> mean fraction with human feedback: 0.5

   Number of time steps: 30
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.53

   Number of time steps: 50
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.55

   Number of time steps: 100
    -> mean reward: 0.64
    -> mean fraction with human feedback: 0.56

Optimal policy based on cost 0.07

   Number of time steps: 1
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.25

   Number of time steps: 2
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.27

   Number of time steps: 3
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.29

   Number of time steps: 5
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.33

   Number of time steps: 10
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.39

   Number of time steps: 20
    -> mean reward: 0.61
    -> mean fraction with human feedback: 0.45

   Number of time steps: 30
    -> mean reward: 0.62
    -> mean fraction with human feedback: 0.48

   Number of time steps: 50
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.51

   Number of time steps: 100
    -> mean reward: 0.63
    -> mean fraction with human feedback: 0.53

Optimal policy based on cost 0.09

   Number of time steps: 1
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.18

   Number of time steps: 2
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.17

   Number of time steps: 3
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.17

   Number of time steps: 5
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.16

   Number of time steps: 10
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.16

   Number of time steps: 20
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.16

   Number of time steps: 30
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.16

   Number of time steps: 50
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.16

   Number of time steps: 100
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.16

Optimal policy based on cost 0.1

   Number of time steps: 1
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.12

   Number of time steps: 2
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.13

   Number of time steps: 3
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.13

   Number of time steps: 5
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.13

   Number of time steps: 10
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.13

   Number of time steps: 20
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.13

   Number of time steps: 30
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.13

   Number of time steps: 50
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.13

   Number of time steps: 100
    -> mean reward: 0.6
    -> mean fraction with human feedback: 0.13

Optimal policy based on cost 0.102

   Number of time steps: 1
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.07

   Number of time steps: 2
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.07

   Number of time steps: 3
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.08

   Number of time steps: 5
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.08

   Number of time steps: 10
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.08

   Number of time steps: 20
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.09

   Number of time steps: 30
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.09

   Number of time steps: 50
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.09

   Number of time steps: 100
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.09

Optimal policy based on cost 0.12

   Number of time steps: 1
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.04

   Number of time steps: 2
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.05

   Number of time steps: 3
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.05

   Number of time steps: 5
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.05

   Number of time steps: 10
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.06

   Number of time steps: 20
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.06

   Number of time steps: 30
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.06

   Number of time steps: 50
    -> mean reward: 0.58
    -> mean fraction with human feedback: 0.06

   Number of time steps: 100
    -> mean reward: 0.59
    -> mean fraction with human feedback: 0.07

Optimal policy based on cost 0.18

   Number of time steps: 1
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.0

   Number of time steps: 2
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.0

   Number of time steps: 3
    -> mean reward: 0.53
    -> mean fraction with human feedback: 0.0

   Number of time steps: 5
    -> mean reward: 0.54
    -> mean fraction with human feedback: 0.0

   Number of time steps: 10
    -> mean reward: 0.55
    -> mean fraction with human feedback: 0.0

   Number of time steps: 20
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.0

   Number of time steps: 30
    -> mean reward: 0.56
    -> mean fraction with human feedback: 0.0

   Number of time steps: 50
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.0

   Number of time steps: 100
    -> mean reward: 0.57
    -> mean fraction with human feedback: 0.0

Create figure for reward¶

Let's plot the mean reward per transition.

In [11]:
fig = plt.figure(figsize=(10,10))

fig_lower_bound = 0  # y-axis lower limit for figure
fig_higher_bound = 1  # y-axis upper limit for figure

x_vals = np.arange(len(num_steps_list))

ax1 = fig.add_subplot(111)

# Plot average reward for policies
for policy in range(len(COSTS) + 1):
    plt.plot(x_vals, rew_policy_lists[policy], color = colors[policy], label = policy_names[policy])
    
ax1.legend(ncols = 5, bbox_to_anchor = (0.985, 1.15))
ax1.set_ylabel("Mean reward per previous activity assignment")
ax1.set_xlabel("Number of time steps")
ax1.set_ylim([fig_lower_bound, fig_higher_bound])
ax1.set_xticks(x_vals, num_steps_list)

# Create zoomed-in plot
anchor_x = .20
anchor_y = .2
height_ax2 = 0.2
width_ax2 = 0.376
ax2 = plt.axes([anchor_x, anchor_y, width_ax2, height_ax2])
for policy in range(len(COSTS) + 1):
    ax2.plot(x_vals[4:], rew_policy_lists[policy][4:], color = colors[policy])
plt.setp(ax2, xticks=[], yticks=[])

# Highlight where zoomed-in plot is
buffer_y = 0.02
buffer_x = 0.1
ax1.plot([4 - buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[-1][4] - buffer_y], color = 'black')
ax1.plot([4 - buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[0][-1] + buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
ax1.plot([4 - buffer_x, 4 - buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
ax1.plot([x_vals[-1] + buffer_x, x_vals[-1] + buffer_x], [rew_policy_lists[-1][4] - buffer_y, rew_policy_lists[0][-1] + buffer_y], color = 'black')
# Dashed lines
ax1.plot([4 - buffer_x, 0.45], 
         [rew_policy_lists[0][-1] + buffer_y,  anchor_y + height_ax2 - 0.025],
         color = 'black',
         linestyle = 'dashed')
ax1.plot([x_vals[-1] + buffer_x, 4.8], 
         [rew_policy_lists[-1][4] - buffer_y,  anchor_y + height_ax2 - height_ax2 - 0.08],
         color = 'black',
         linestyle = 'dashed')


plt.savefig("Figures/policy_comp_with_costs_reward.pdf", dpi=1500,
            bbox_inches='tight', pad_inches=0)

Create figure for fraction of activity assignments with feedback¶

And we also plot the mean fraction of previous activity assignments with feedback.

In [12]:
plt.figure(figsize=(10,10))

fig_lower_bound = -0.05  # y-axis lower limit for figure
fig_higher_bound = 1  # y-axis upper limit for figure

x_vals = np.arange(len(num_steps_list))

# Plot mean cost per policy
for policy in range(len(COSTS) + 1):
    plt.plot(x_vals, cost_policy_lists[policy], color = colors[policy], label = policy_names[policy],
            linewidth = 2)

plt.legend(ncols = 5, bbox_to_anchor = (0.985, 1.15))

plt.ylabel("Mean fraction of previous\nactivity assignments with feedback")
plt.xlabel("Number of time steps")

plt.ylim([fig_lower_bound, fig_higher_bound])
plt.xticks(x_vals, num_steps_list)

plt.savefig("Figures/policy_comp_with_costs_cost.pdf", dpi=1500,
            bbox_inches='tight', pad_inches=0)

Simulate hypothetical live application¶

Now we perform simulations with our hypothetical live application described for RQ3.

Simulation function¶

Here we define our simulation function.

In [13]:
def simulate(num_timesteps, 
             dropout, 
             capacity, 
             max_add_at_once, 
             max_timesteps, 
             num_starting_people, 
             opt_policy,
             num_sessions_per_waitingtime,
             all_states_frac, 
             trans_func, 
             reward_func,
             all_states,
             num_vals_feat_waitingtime,
             random_seed = 1):
    """
    num_timesteps: max number of sessions per person
    dropout: ratio of people dropping out randomly each session
    capacity: max number of people that can be in the application at a time
    max_add_at_once: max. number of new people we add at each time step to fill available spots
    max_timesteps: max. number of timesteps we want to track the application for
    num_starting_people: number of people starting in first session
    opt_policy: policy to use
    num_sessions_per_waitingtime: number of sessions per waiting time feature value
    all_states_frac: fraction of people in each state at start
    trans_func: transition function
    reward_func: reward function
    all_states: all possible states
    num_vals_feat_waitingtime: number of values for the waiting time state feature
    random_seed: random seed
    """
    
    random.seed(random_seed)

    # Create a starting population of participants
    pop = random.choices(all_states, k = num_starting_people, weights = all_states_frac)
    # Set waiting time feature to 0
    for p in range(len(pop)):
        pop[p] = deepcopy(pop[p]) + [0]

    pop_all = [pop]
    sessions_curr = [0] * num_starting_people  # keeps track of current session of each person
    waitingtimes_curr = [0] * num_starting_people  # keeps track of waiting times of each person since last human support

    action_sum = []  # number of human feedback actions per time step
    reward_sum = []  # total reward per time step
    
    num_people_prev = num_starting_people
    t = 0  # time step

    # while there are people left
    while len(sessions_curr) > 0 and t < max_timesteps:

        pop_new = []
        action_sum.append(0)
        reward_sum.append(0)

        # add new people if there is capacity left and they can still complete all their sessions
        if len(sessions_curr) < capacity and t < (max_timesteps - num_timesteps + 1):
            
            num_to_add = min(capacity - len(sessions_curr), max_add_at_once)
            pop_added = random.choices(all_states, k = num_to_add, weights = all_states_frac)
            # Set waiting time state feature to 0
            for p in range(len(pop_added)):
                pop_added[p] = deepcopy(pop_added[p]) + [0]
                
            pop_all[t] += pop_added

            # also keep track of current session and of waiting times for new people
            sessions_curr = sessions_curr + [0 for i in range(num_to_add)]
            waitingtimes_curr = waitingtimes_curr + [0 for i in range(num_to_add)]

            num_people_prev += num_to_add

        # get action and next state and reward for each person
        for p in range(num_people_prev):
            action = opt_policy[all_states.index(pop_all[t][p][0:3])]
            action_sum[t] += action
            
            # get next state
            base_state_new = random.choices(all_states, k = 1, weights = trans_func[all_states.index(pop_all[t][p][0:3])][action])
            # add waiting time feature
            state_new = deepcopy(base_state_new[0]) + [0]
            
            pop_new.append(state_new)
            
            reward_sum[t] += reward_func[all_states.index(pop_all[t][p][0:3])][action]
                
            # set waiting time to 0 when human is involved
            if action == 1:
                pop_new[p][3] = 0
                waitingtimes_curr[p] = 0
            # increase waiting time when no human is involved
            elif action == 0:
                # move to the next waiting-time value when the user completes the last session of a non-final waiting-time value
                if waitingtimes_curr[p] in [i * num_sessions_per_waitingtime - 1 for i in range(1, num_vals_feat_waitingtime)]:
                    pop_new[p][3] += 1
                waitingtimes_curr[p] += 1

        # increase number of sessions completed per person
        sessions_curr = [i + 1 for i in sessions_curr]

        # remove people who have had all sessions
        pop_new = [pop_new[p_idx] for p_idx in range(len(pop_new)) if sessions_curr[p_idx] < num_timesteps]
        waitingtimes_curr = [waitingtimes_curr[p_idx] for p_idx in range(len(waitingtimes_curr)) if sessions_curr[p_idx] < num_timesteps]
        sessions_curr = [p for p in sessions_curr if p < num_timesteps]
        num_people_prev = len(sessions_curr)

        # some people drop out after the session, i.e., they do not return for the next session
        # number of people to drop out
        num_people_dropout = round(num_people_prev * dropout)
        # randomly choose the people who drop out
        dropout_ids = random.sample(range(0, len(pop_new)), num_people_dropout)
        # remove indices of those people
        pop_new = [pop_new[p_idx] for p_idx in range(len(pop_new)) if not p_idx in dropout_ids]
        sessions_curr = [sessions_curr[p_idx] for p_idx in range(len(sessions_curr)) if not p_idx in dropout_ids]
        waitingtimes_curr = [waitingtimes_curr[p_idx] for p_idx in range(len(waitingtimes_curr)) if not p_idx in dropout_ids]

        pop_all.append(pop_new)
        num_people_prev = num_people_prev - num_people_dropout

        t += 1
        
    return action_sum, reward_sum
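
As a quick, optional smoke test, we can run simulate once with small illustrative settings (the values below are arbitrary and differ from the configuration used for the actual simulations further down).

In [ ]:
# Smoke test with small illustrative settings (assumed values): 20 starting people,
# capacity for 30, at most 9 sessions per person, followed for 50 time steps,
# using the policy computed without a human feedback cost.
test_actions, test_rewards = simulate(num_timesteps=9,
                                      dropout=0.15,
                                      capacity=30,
                                      max_add_at_once=10,
                                      max_timesteps=50,
                                      num_starting_people=20,
                                      opt_policy=policies_cost[0],
                                      num_sessions_per_waitingtime=3,
                                      all_states_frac=all_states_frac,
                                      trans_func=trans_func,
                                      reward_func=reward_func,
                                      all_states=states,
                                      num_vals_feat_waitingtime=3,
                                      random_seed=0)
print(len(test_actions), "time steps simulated; total human feedback actions:", sum(test_actions))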

Determine empirical study dropout¶

To define a dropout rate for our simulation, we look at the return likelihood ratings from our study. We assume that people with a negative return likelihood rating would have dropped out.

In [14]:
total = 0

for session in range(1, 5):
    data_s = data[data["session"] == session]
    data_return_negative_s = data_s[data_s["dropout_response"] < 0]
    
    fraction = len(data_return_negative_s)/len(data_s)
    total += fraction

    print("Potential dropout after session", session, ":", len(data_return_negative_s), "out of", len(data_s), "->", round(fraction * 100, 2), "%")

avg_dropout_rate = total/4

print("\nAverage:", round(avg_dropout_rate * 100, 2), "%")
Potential dropout after session 1 : 112 out of 679 -> 16.49 %
Potential dropout after session 2 : 94 out of 599 -> 15.69 %
Potential dropout after session 3 : 90 out of 544 -> 16.54 %
Potential dropout after session 4 : 66 out of 504 -> 13.1 %

Average: 15.46 %
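
As a quick check of the printed average:

$$\tfrac{1}{4}\left(\tfrac{112}{679} + \tfrac{94}{599} + \tfrac{90}{544} + \tfrac{66}{504}\right) \approx 0.155,$$

so we use a dropout rate of about 15.5% per session in the simulations below.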

Perform simulations¶

And now we run the simulations: for each policy, we repeat the simulation 10 times with different random seeds.

In [15]:
capacity = 166  # number of people we can have in the application at the same time
max_add_at_once = 100  # max. number of people we add at once
max_timesteps = 365 # max. number of timesteps we want to follow the application
num_starting_people = max_add_at_once  # number of people to start with
num_sessions_per_waitingtime = 3
num_vals_feat_waitingtime = 3
max_sessions_per_person = num_sessions_per_waitingtime * num_vals_feat_waitingtime

repetitions = 10  # number of times to repeat the simulation

action_sums_policies = []
reward_sums_policies = []

for cost_idx in range(len(COSTS) + 1):
    
    print("... simulating policy with cost:", policy_names[cost_idx])
    
    action_sums = np.zeros((repetitions, max_timesteps))
    reward_sums = np.zeros((repetitions, max_timesteps))

    for repetition in range(repetitions):
        action_sum, reward_sum = simulate(max_sessions_per_person, 
                                          avg_dropout_rate, 
                                          capacity, 
                                          max_add_at_once, 
                                          max_timesteps, 
                                          num_starting_people, 
                                          policies_cost[cost_idx],
                                          num_sessions_per_waitingtime, 
                                          all_states_frac,
                                          trans_func, 
                                          reward_func, 
                                          states,
                                          num_vals_feat_waitingtime,
                                          random_seed = repetition)

        for i in range(len(action_sum), max_timesteps):
            action_sum.append(0)
        action_sums[repetition] = np.array(action_sum)

        for i in range(len(reward_sum), max_timesteps):
            reward_sum.append(0)
        reward_sums[repetition] = np.array(reward_sum)

    action_sums_policies.append(action_sums)
    reward_sums_policies.append(reward_sums)
... simulating policy with cost: 0
... simulating policy with cost: 0.02
... simulating policy with cost: 0.03
... simulating policy with cost: 0.05
... simulating policy with cost: 0.055
... simulating policy with cost: 0.07
... simulating policy with cost: 0.09
... simulating policy with cost: 0.1
... simulating policy with cost: 0.102
... simulating policy with cost: 0.12
... simulating policy with cost: 0.18

Create combined figures¶

Let's create combined figures comparing all policies.

First, we plot the percentage of people getting human feedback per day; the shaded bands show one standard deviation across the 10 simulation repetitions.

In [16]:
plt.figure(figsize=(10,10))

for policy_idx in range(len(COSTS) + 1):
    actions_sums_mean = np.mean(action_sums_policies[policy_idx], axis = 0)/capacity * 100
    error = np.array([np.std(action_sums_policies[policy_idx][:, i]/capacity * 100) for i in range(max_timesteps)])
    
    plt.plot([i for i in range(max_timesteps)], actions_sums_mean, color = colors[policy_idx], label = policy_names[policy_idx])
    plt.fill_between([i for i in range(max_timesteps)], actions_sums_mean - error, actions_sums_mean + error, alpha = 0.1,
                     color = colors[policy_idx])

plt.ylabel("Percentage of people\ngetting human feedback")
plt.xlabel("Day")
plt.legend(ncols = 4, bbox_to_anchor = (0.95, 1.15))

plt.savefig("Figures/policy_comp_stable_population_feedback_amount.pdf", dpi=1500,
            bbox_inches='tight', pad_inches=0)

And we also create a figure for the mean reward per day, with a zoomed-in inset for days 50 to 100; the shaded bands again show one standard deviation across repetitions.

In [17]:
fig = plt.figure(figsize=(10,10))

ax1 = fig.add_subplot(111)

for policy_idx in range(len(COSTS) + 1):

    reward_sums_mean = np.mean(reward_sums_policies[policy_idx], axis = 0)/capacity
    error = np.array([np.std(reward_sums_policies[policy_idx][:, i]/capacity) for i in range(max_timesteps)])
    ax1.plot([i for i in range(max_timesteps)], reward_sums_mean, 
            color = colors[policy_idx], label = policy_names[policy_idx])
    ax1.fill_between([i for i in range(max_timesteps)], reward_sums_mean - error, reward_sums_mean + error, alpha = 0.1, 
                    color = colors[policy_idx])

ax1.set_ylabel("Mean reward")        
ax1.set_xlabel("Day")
ax1.legend(ncols = 4, bbox_to_anchor = (0.95, 1.15))
ax1.set_ylim(0, 0.8)

# Create zoomed-in plot
anchor_x = .20
anchor_y = .15
height_ax2 = 0.376
width_ax2 = 0.376
ax2 = plt.axes([anchor_x, anchor_y, width_ax2, height_ax2])
ax2_val_cutoff_1 = 50
ax2_val_cutoff_2 = 100
for policy_idx in range(len(COSTS) + 1):
    
    reward_sums_mean = (np.mean(reward_sums_policies[policy_idx], axis = 0)/capacity)[ax2_val_cutoff_1:ax2_val_cutoff_2]
    error = np.array([np.std(reward_sums_policies[policy_idx][:, i]/capacity) for i in range(max_timesteps)])[ax2_val_cutoff_1:ax2_val_cutoff_2]
    ax2.plot([i for i in range(max_timesteps)][ax2_val_cutoff_1:ax2_val_cutoff_2], reward_sums_mean, 
            color = colors[policy_idx], label = policy_names[policy_idx])
    ax2.fill_between([i for i in range(max_timesteps)][ax2_val_cutoff_1:ax2_val_cutoff_2], reward_sums_mean - error, reward_sums_mean + error, alpha = 0.1, 
                    color = colors[policy_idx])
plt.setp(ax2, xticks=[], yticks=[])

# Highlight where zoomed-in plot is
buffer_y = 0.031
buffer_x = 0
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_2 + buffer_x], 
         [(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y], 
         color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_2 + buffer_x], 
         [(np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y], 
         color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, ax2_val_cutoff_1 - buffer_x], 
         [(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y], 
         color = 'black')
ax1.plot([ax2_val_cutoff_2 + buffer_x, ax2_val_cutoff_2 + buffer_x], 
         [(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y, (np.mean(reward_sums_policies[0], axis = 0)/capacity)[ax2_val_cutoff_1] + buffer_y], 
         color = 'black')
ax1.plot([ax2_val_cutoff_1 - buffer_x, 20], 
         [(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y,  anchor_y + height_ax2 - 0.095],
         color = 'black',
         linestyle = 'dashed')
ax1.plot([ax2_val_cutoff_2 + buffer_x, 216], 
         [(np.mean(reward_sums_policies[-1], axis = 0)/capacity)[ax2_val_cutoff_1] - buffer_y,  anchor_y + height_ax2 - 0.095],
         color = 'black',
         linestyle = 'dashed')

plt.savefig("Figures/policy_comp_stable_population_reward.pdf", dpi=1500,
            bbox_inches='tight', pad_inches=0)