Author: Nele Albers
Date: January 2025
Let's show the policies obtained when using the return likelihood as the basis for the reward. This reproduces Supplementary Table 2. We use the state features that were selected for the effort-based reward.
Required files:
- Intermediate_Results_Return_Effortfeatures/_qvals_012_0.85_[3, 2, 2]
- For each cost in [0.01, 0.02, 0.05, 0.06, 0.09, 0.137, 0.14, 0.18, 0.22]:
  - Intermediate_Results_Return_Effortfeatures/_qvals_012_0.85_[3, 2, 2]_cost<cost>
Authored by Nele Albers, Francisco S. Melo, Mark A. Neerincx, Olya Kudina, and Willem-Paul Brinkman.
Let's load the packages that we need.
import numpy as np
import pickle
And we define some variables we use throughout.
FEAT_SEL = [0, 1, 2]  # indices of the selected state features
NUM_VALS_PER_FEATURE = [3, 2, 2]  # number of discretized values per selected feature
DISCOUNT_FACTOR = 0.85
COSTS = [0.01, 0.02, 0.05, 0.06, 0.09, 0.137, 0.14, 0.18, 0.22]  # human feedback costs
PATH = "Intermediate_Results_Return_Effortfeatures/"  # prefix of the path where results are stored

# Suffix used in the filenames of the stored results.
path_to_save = str(FEAT_SEL[0]) + str(FEAT_SEL[1]) + str(FEAT_SEL[2]) + "_" + str(DISCOUNT_FACTOR) + "_" + str(NUM_VALS_PER_FEATURE)

# All possible states (one discretized value per selected feature).
states = [[i, j, k] for i in range(NUM_VALS_PER_FEATURE[0]) for j in range(NUM_VALS_PER_FEATURE[1]) for k in range(NUM_VALS_PER_FEATURE[2])]
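To make the indexing concrete, the snippet below (an illustration added here, not part of the original notebook) prints the filename suffix that `path_to_save` evaluates to and enumerates the 12 states, so the state indices used for the Q-value arrays later are explicit.

# Illustration only: show the filename suffix and the state enumeration used for indexing.
print("Filename suffix:", path_to_save)  # "012_0.85_[3, 2, 2]", matching the required files above
print("Number of states:", len(states))
for state_idx, state in enumerate(states):
    print("State index", state_idx, "->", state)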
We load the previously computed Q-values for a cost of 0.
with open(PATH + "_qvals_" + path_to_save, "rb") as f:
    q_vals = pickle.load(f)
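The loaded object is indexed per state in the next cell, with one Q-value per action. As an optional sanity check (an addition, not part of the original notebook), we can confirm that it contains one entry per state:

# Optional sanity check (assumption: q_vals holds one set of action Q-values per state,
# indexed in the same order as `states`).
print("Number of state entries in q_vals:", len(q_vals))
print("Q-values for the first state:", q_vals[0])
assert len(q_vals) == len(states), "Expected one Q-value entry per state"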
Now we compute the policies for different human feedback costs.
policies_cost = []  # one policy (best action per state) for each human feedback cost
opt_policy = np.array([np.argmax(q_vals[state]) for state in range(len(q_vals))])
print("Optimal policy:", opt_policy)
for cost in COSTS:
    with open(PATH + "_qvals_" + path_to_save + "_cost" + str(cost), "rb") as f:
        q_vals_cost = pickle.load(f)

    # Policy for this cost: the action with the highest Q-value in each state.
    policycost = np.array([np.argmax(q_vals_cost[state]) for state in range(len(q_vals_cost))])
    policies_cost.append(policycost)

    # Print the states in which the policy asks for human feedback (action 1).
    print("States with human feedback for cost " + str(cost) + ":")
    for state_idx, state in enumerate(states):
        if policycost[state_idx] == 1:
            print(state)
Optimal policy: [1 0 0 1 1 1 1 1 1 0 1 1]
States with human feedback for cost 0.01:
[0, 0, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
[2, 1, 1]
States with human feedback for cost 0.02:
[0, 0, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.05:
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.06:
[0, 1, 1]
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.09:
[0, 1, 1]
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]
States with human feedback for cost 0.137:
[0, 1, 1]
[1, 0, 0]
[1, 1, 1]
States with human feedback for cost 0.14:
[1, 0, 0]
[1, 1, 1]
States with human feedback for cost 0.18:
[1, 1, 1]
States with human feedback for cost 0.22:
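To mirror the layout of Supplementary Table 2 more directly, the sketch below (a convenience added here, not part of the original notebook) combines the per-cost policies collected in `policies_cost` into a single state-by-cost overview, where 1 means the policy asks for human feedback (action 1) in that state.

# Illustration only: state-by-cost overview of where human feedback is requested.
# Relies on `states`, `COSTS`, and the `policies_cost` list filled in the loop above.
feedback_table = np.array([[int(policy[state_idx] == 1) for policy in policies_cost]
                           for state_idx in range(len(states))])
print("Costs:", COSTS)
for state_idx, state in enumerate(states):
    print(state, feedback_table[state_idx])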