Data supporting the thesis “Exploring Hybrid Intelligence for Topic Interpretation in Colorectal Cancer Research: A Comparative Study of GPT-3.5 and Human Expertise”
doi: 10.4121/a7e63b3f-18f5-4ae4-8750-255528f82178
This thesis aims to bridge the gap between human and machine intelligence in interpreting colorectal cancer patient experiences extracted from patient web forums. The Computer Science thesis was carried out in collaboration with colorectal cancer experts from Erasmus MC. To let these experts and GPT-3.5 interpret patient experiences, nearly 300k patient forum entries were scraped from the American platform Cancer Survivors Network USA (Colorectal Cancer — Cancer Survivors Network). The Selenium webdriver was used to collect the page URLs of each discussion thread, and BeautifulSoup4 (bs4) was used to access those URLs and parse the HTML elements of each type of forum entry (main post, comment, and reply), which were then stored in a local dataset. The patient forum attributes stored in the dataset are:
- URL
- username (the author of the post)
- userposts (the number of posts written by the author)
- time (when the post was made)
- title
- post (text consisting of unstructured colorectal cancer patient experiences)
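The attributes listed above can be read into typed records for further analysis. The following is a minimal sketch, not the thesis code: it assumes the CSV header uses exactly the attribute names given in the description (URL, username, userposts, time, title, post) and that userposts holds an integer. The `ForumPost` class, the `read_posts` helper, and the sample row are hypothetical and only serve as illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ForumPost:
    """One forum entry; field names mirror the attributes in the description."""
    url: str
    username: str
    userposts: int
    time: str
    title: str
    post: str


def read_posts(handle):
    """Yield ForumPost records from an open CSV file handle.

    Assumption: the header row contains the columns
    URL, username, userposts, time, title, post.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        yield ForumPost(
            url=row["URL"],
            username=row["username"],
            userposts=int(row["userposts"] or 0),  # empty value treated as 0
            time=row["time"],
            title=row["title"],
            post=row["post"],
        )


# Invented sample row for illustration only; it is not taken from the dataset.
SAMPLE = (
    "URL,username,userposts,time,title,post\n"
    'https://example.org/thread/1,user_a,12,2020-01-01,Example title,'
    '"Example text, with a comma"\n'
)

posts = list(read_posts(io.StringIO(SAMPLE)))
```

For the real file, the same helper could be called with `open("topic_posts_csn.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is also an assumption and may need adjusting.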
- 2023-09-04 first online, published, posted
DATA
- topic_posts_csn.csv (175,133,719 bytes; MD5: f33b2c03be48bb2fed710a18aeeafdab)
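After downloading, the file can be checked against the MD5 value listed above. This is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are hypothetical, and the file is assumed to sit in the current working directory.

```python
import hashlib

# MD5 value as listed for topic_posts_csn.csv in this record.
EXPECTED_MD5 = "f33b2c03be48bb2fed710a18aeeafdab"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a large file does not have to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

A call such as `verify("topic_posts_csn.csv")` returning `False` would suggest an incomplete or altered download.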