# Controlled Yet Natural: A Hybrid BDI-LLM Conversational Agent for Child Helpline Training
Authors: Mohammed Al Owayyed & Adarsh Denga
Contact: M.AlOwayyed@tudelft.nl

This dataset was created as part of an evaluation study of a virtual, LLM-integrated system for training child helpline counsellors. The design of the study was pre-registered with the Open Science Framework (OSF) registries and is publicly available at https://osf.io/6g7e2. The dataset contains participants' survey responses for our five measures (human-like behaviour, natural behaviour, engagement, attitude and overall performance) for both the LLM-integrated and rule-based systems. It also includes a non-inferiority test comparing LLM-generated responses to human-generated responses. The R Markdown files contain data-specific information on how to use this dataset. Participants were recruited through the online platform Prolific, and the data was collected through an online survey hosted on Qualtrics. The questionnaires for our measures are taken from the Artificial Social Agent Questionnaire, available at https://ii.tudelft.nl/evalquest/web/node/1.

All statistical analyses were performed in R. This work is licensed under CC BY 4.0.
For easier reproducibility, you can run the code with the provided Dockerfile. Assuming the image built from the Dockerfile is tagged `analysis_bay`, start the container with `docker run -d -p 8787:8787 -v <path_to_this_directory>:/home/rstudio/analysis -e PASSWORD=<some_password> analysis_bay`.


The original data files obtained from the experiment contain personally identifiable data, so only a cleaned and anonymised version is published.

## Files


Analysis files:
- Experiment Analysis.Rmd: The R Markdown file explaining the data and outlining the statistical tests performed as part of the experiment.
- Experiment-Analysis.pdf: The knitted PDF file from Experiment Analysis.Rmd.
- Non-inferiorityTest.Rmd: The R Markdown file explaining the non-inferiority tests.
- Non-inferiorityTest.pdf: The knitted PDF file from Non-inferiorityTest.Rmd.

Python Scripts:
- construct_split.py: The Python script used to transform the raw experiment data in data_raw_llm.csv and data_raw_rbs.csv into constructs_averaged_llm.csv and constructs_averaged_rbs.csv, respectively. The latter two files are used for the analyses in R.



CSV Files:
- data_raw_llm.csv: The raw experiment data from our participants after using the LLM-based system.
- data_raw_rbs.csv: The raw experiment data from our participants after using the rule-based system.
- constructs_averaged_llm.csv: Participants' data from the LLM-based system split into the averages for our five measures.
- constructs_averaged_rbs.csv: Participants' data from the rule-based system split into the averages for our five measures. 
- qual.csv: Participants' qualitative responses from using both the LLM-based and rule-based systems. The file also includes the themes assigned by each coder.


Supplementary materials:
- prompts.pdf: includes the prompts we used in the system and graphical explanations.
- Screenshot - experiment interface.png: shows a screenshot of the tool we used in the experiment.
- Study 1 Materials.pdf: the materials and scripts given to the participants in study 1.


## File Structure

### data_raw_(llm/rbs).csv:
Both of the raw data files contain the following columns:
- PROLIFIC_PID: Anonymised participant ID
- Q1-Q34: Feedback for the 34 quantitative questions from the questionnaire

The columns Q1-Q34 correspond to our measures as follows:
- Human-Like Behaviour: Q2, Q25-Q28
- Natural Behaviour: Q4, Q29-Q30
- Engagement: Q13, Q31-Q32
- Attitude: Q19, Q33-Q34
- Overall Performance: Q1-Q24
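
The mapping above can be illustrated with a minimal Python sketch. This is not the contents of construct_split.py: the names `CONSTRUCT_ITEMS`, `q_range` and `average_constructs` are hypothetical, the output keys simply reuse the column names of the averaged files, and the sketch assumes every Q column holds a numeric value with no reverse-scored or missing items.

```python
from statistics import mean


def q_range(start, end):
    """Return the column names Q<start>..Q<end>, inclusive."""
    return [f"Q{i}" for i in range(start, end + 1)]


# Item-to-measure mapping copied from the list above (hypothetical constant name).
CONSTRUCT_ITEMS = {
    "BELIEVABILITY1": ["Q2"] + q_range(25, 28),  # human-like behaviour
    "BELIEVABILITY2": ["Q4"] + q_range(29, 30),  # natural behaviour
    "Engagement": ["Q13"] + q_range(31, 32),
    "Attitude": ["Q19"] + q_range(33, 34),
    "OVERALL": q_range(1, 24),                   # overall performance
}


def average_constructs(row):
    """Average one participant's items per measure.

    `row` maps column names to values, e.g. a row from csv.DictReader.
    Assumes all items are numeric and scored in the same direction.
    """
    scores = {"PROLIFIC_PID": row["PROLIFIC_PID"]}
    for construct, items in CONSTRUCT_ITEMS.items():
        scores[construct] = mean(float(row[item]) for item in items)
    return scores
```

Applying `average_constructs` to each row of a raw data file would yield one row per participant in the shape described for the averaged files.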

### constructs_averaged_(llm/rbs).csv:
Both of the averaged data files contain the following columns:
- PROLIFIC_PID: Anonymised participant ID
- OVERALL: Overall performance score, averaged from the corresponding columns described above
- BELIEVABILITY1: Human-like behaviour score, averaged from the corresponding columns described above
- BELIEVABILITY2: Natural behaviour score, averaged from the corresponding columns described above
- Engagement: Engagement score, averaged from the corresponding columns described above
- Attitude: Attitude score, averaged from the corresponding columns described above
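
As a rough illustration of how the two averaged files can be compared, the following Python sketch pairs scores by PROLIFIC_PID and computes per-participant differences. It assumes that participants appear in both files under the same ID; the function names are hypothetical, and the statistical tests themselves are in the R Markdown files, not here.

```python
import csv


def load_scores(path, column):
    """Read one score column from an averaged CSV file, keyed by PROLIFIC_PID."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["PROLIFIC_PID"]: float(row[column]) for row in csv.DictReader(f)}


def paired_differences(llm_scores, rbs_scores):
    """Return LLM minus rule-based scores for participants present in both inputs."""
    shared = sorted(llm_scores.keys() & rbs_scores.keys())
    return {pid: llm_scores[pid] - rbs_scores[pid] for pid in shared}


# Example usage (file and column names as listed in this README):
# llm = load_scores("constructs_averaged_llm.csv", "OVERALL")
# rbs = load_scores("constructs_averaged_rbs.csv", "OVERALL")
# diffs = paired_differences(llm, rbs)
```

Participants missing from either file are silently skipped, which is a design choice of this sketch rather than something the dataset prescribes.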

### qual.csv
The qualitative feedback file contains the following columns:
- PROLIFIC_PID: Anonymised participant ID
- CONDITION_ORDER: Flag indicating the order in which the participant interacted with the two conditions (LLM-based and rule-based)
- Quote: Participants' qualitative feedback, either from using the LLM-based system or the rule-based system
- Coder1: the assigned theme by Coder1
- Coder2: the theme assigned by the second coder (Coder2)
- Route: either LLM or RBS
- FinalCoding: the final theme assigned to the code
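
To get a quick overview of the qualitative data, the final themes can be tallied per system. The Python sketch below assumes the column names listed above; `count_themes` and `count_themes_from_file` are hypothetical helper names and are not part of the published scripts.

```python
import csv
from collections import Counter


def count_themes(rows):
    """Count FinalCoding values separately for each Route (LLM or RBS)."""
    counts = {}
    for row in rows:
        counts.setdefault(row["Route"], Counter())[row["FinalCoding"]] += 1
    return counts


def count_themes_from_file(path):
    """Convenience wrapper: read a CSV such as qual.csv and tally its themes."""
    with open(path, newline="", encoding="utf-8") as f:
        return count_themes(csv.DictReader(f))
```

Calling `count_themes_from_file("qual.csv")` would return one `Counter` per route, from which the most frequent themes can be read with `most_common()`.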
