Policies for different human feedback costs with the return likelihood as reward signal and the effort-based state features¶

Author: Nele Albers

Date: January 2025

Let's show the policies when using the return likelihood as the basis for the reward. This reproduces Supplementary Table 2. We use the state features that were selected for the effort-based reward.

Required files:

- Intermediate_Results_Return_Effortfeatures/_qvals_012_0.85_[3, 2, 2]
- for each cost in [0.01, 0.02, 0.05, 0.06, 0.09, 0.137, 0.14, 0.18, 0.22]:
    - Intermediate_Results_Return_Effortfeatures/_qvals_012_0.85_[3, 2, 2]_cost<cost>

Authored by Nele Albers, Francisco S. Melo, Mark A. Neerincx, Olya Kudina, and Willem-Paul Brinkman.

Setup¶

Load packages¶

Let's load the packages that we need.

In [3]:
import numpy as np
import pickle

Define variables¶

And we define some variables we use throughout.

In [4]:
FEAT_SEL = [0, 1, 2]
NUM_VALS_PER_FEATURE = [3, 2, 2]
DISCOUNT_FACTOR = 0.85
COSTS = [0.01, 0.02, 0.05, 0.06, 0.09, 0.137, 0.14, 0.18, 0.22]

PATH = "Intermediate_Results_Return_Effortfeatures/" # pre-fix for path for storing results
path_to_save =  str(str(FEAT_SEL[0]) + str(FEAT_SEL[1]) + str(FEAT_SEL[2]) + "_" + str(DISCOUNT_FACTOR) + "_" + str(NUM_VALS_PER_FEATURE))

# All possible combinations of the state feature values
states = [[i, j, k] for i in range(NUM_VALS_PER_FEATURE[0]) for j in range(NUM_VALS_PER_FEATURE[1]) for k in range(NUM_VALS_PER_FEATURE[2])]
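
Since this notebook only reads pre-computed Q-values, it can help to check up front that the required files listed above are present. The cell below is an optional minimal sketch (not part of the original analysis) that uses only the standard library and the variables defined above.

In [ ]:
import os

# Optional sanity check (sketch): verify that the pre-computed Q-value files exist.
expected_files = [PATH + "_qvals_" + path_to_save]
expected_files += [PATH + "_qvals_" + path_to_save + "_cost" + str(cost) for cost in COSTS]
missing = [f for f in expected_files if not os.path.isfile(f)]
print("Missing files:", missing if missing else "none")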

Load previously computed Q-values¶

We load the previously computed Q-values for a human feedback cost of 0.

In [5]:
with open(PATH + "_qvals_" + path_to_save, "rb") as f:
    q_vals = pickle.load(f)
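
Before computing policies, we can quickly confirm the structure of the loaded Q-values. The cell below is a minimal optional sketch; it assumes, based on how q_vals is used in the next cell, that q_vals is indexable by state with one Q-value per action (no human feedback vs. human feedback).

In [ ]:
# Quick inspection of the loaded Q-values (sketch; assumes one row of
# action values per state, as used in the policy computation below).
print("Number of states:", len(q_vals))
print("Q-values for the first state:", q_vals[0])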

Compute policies for different human feedback costs¶

Now we compute the policies for different human feedback costs.

In [6]:
policies_cost = []

# Optimal policy for a human feedback cost of 0
opt_policy = np.array([np.argmax(q_vals[state]) for state in range(len(q_vals))])
print("Optimal policy:", opt_policy)

for cost in COSTS:

    # Load the Q-values that were previously computed for this human feedback cost
    with open(PATH + "_qvals_" + path_to_save + "_cost" + str(cost), "rb") as f:
        q_vals_cost = pickle.load(f)
    policy_cost = np.array([np.argmax(q_vals_cost[state]) for state in range(len(q_vals_cost))])
    policies_cost.append(policy_cost)

    # Print the states in which the optimal action is to ask for human feedback (action 1)
    print("States with human feedback for cost " + str(cost) + ":")
    for state_idx, state in enumerate(states):
        if policy_cost[state_idx] == 1:
            print(state)
Optimal policy: [1 0 0 1 1 1 1 1 1 0 1 1]
States with human feedback for cost 0.01:
[0, 0, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
[2, 1, 1]
States with human feedback for cost 0.02:
[0, 0, 0]
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.05:
[0, 1, 1]
[1, 0, 0]
[1, 0, 1]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.06:
[0, 1, 1]
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]
[2, 1, 0]
States with human feedback for cost 0.09:
[0, 1, 1]
[1, 0, 0]
[1, 1, 0]
[1, 1, 1]
States with human feedback for cost 0.137:
[0, 1, 1]
[1, 0, 0]
[1, 1, 1]
States with human feedback for cost 0.14:
[1, 0, 0]
[1, 1, 1]
States with human feedback for cost 0.18:
[1, 1, 1]
States with human feedback for cost 0.22:
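
No state receives human feedback for the highest cost of 0.22, which is why the last list above is empty. To obtain the layout of Supplementary Table 2 directly, the per-cost policies can be collected into a single table with one row per state and one column per cost. The cell below is a minimal sketch that assumes pandas is available; it uses the policies_cost list filled in the loop above, and a 1 marks a state in which human feedback is given.

In [ ]:
import pandas as pd

# Build a state-by-cost table of the human feedback decisions (sketch,
# assuming pandas is available); 1 = human feedback is given in that state.
feedback_table = pd.DataFrame(
    {str(cost): policy for cost, policy in zip(COSTS, policies_cost)},
    index=[str(state) for state in states])
print(feedback_table)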