Evaluation of COVID-19 Information Provided by Digital Voice Assistants

Background: Digital voice assistants are widely used for health information-seeking activities during the COVID-19 pandemic. Due to the rapidly changing nature of COVID-19 information, there is a need to evaluate the COVID-related information provided by voice assistants, to ensure consumers' needs are met and to prevent misinformation. The objective of this study is to evaluate COVID-related information provided by voice assistants in terms of relevance, accuracy, comprehensiveness, user-friendliness and reliability. Materials and Methods: The voice assistants evaluated were Amazon Alexa, Google Home, Google Assistant, Samsung Bixby, Apple Siri and Microsoft Cortana. Two evaluators posed COVID-19 questions to the voice assistants and evaluated the responses based on relevance, accuracy, comprehensiveness, user-friendliness and reliability. Questions were obtained from the World Health Organization, governmental websites, forums and search trends. Data were analyzed using Pearson's correlation, independent samples t-tests and Wilcoxon rank-sum tests. Results: Google Assistant and Siri performed the best across all evaluation parameters, with mean scores of 84.0% and 80.6% respectively. Bixby performed the worst among the smartphone-based voice assistants (65.8%). On the other hand, Google Home performed the best among the non-smartphone voice assistants (60.7%), followed by Alexa (43.1%) and Cortana (13.3%). Smartphone-based voice assistants had higher mean scores than voice assistants on other platforms (76.8% versus 39.1%, p = 0.064). Google Assistant consistently scored better than Google Home on all the evaluation parameters. A decreasing score trend from Google Assistant, Siri, Bixby, Google Home, Alexa to Cortana was observed for the majority of the evaluation criteria, except for accuracy, comprehensiveness and credibility.
Conclusion: Google Assistant and Apple Siri were able to provide users with relevant, accurate, comprehensive, user-friendly, and reliable information regarding COVID-19. With the rapidly evolving information on this pandemic, users need to be discerning when obtaining COVID-19 information from voice assistants.

CORRESPONDING AUTHOR: Kevin Yi-Lwern Yap, PhD, Senior Lecturer in Public Health (Digital Health), Department of Public Health, School of Psychology and Public Health, La Trobe University, Melbourne (Bundoora), VIC 3086, Australia. kevinyap.ehealth@gmail.com; k.yap@latrobe.edu.au


INTRODUCTION
Digital voice assistants are becoming widely used in today's world. In 2020, there were 4.2 billion voice assistants in use worldwide across various digital platforms [1], such as smartphones, laptops and smart speakers. Commonly used smartphone voice assistants included Apple Siri (44% of consumer usage), Google Assistant (30%) and Samsung Bixby (4%) [2]. Home-based smart speakers like Amazon Alexa (64.6% of consumer usage) and Google Home (19.6%), as well as the laptop-based Microsoft Cortana (11.4%), were also commonly used [2]. A recent survey showed that 51.9% of US consumers would consider using a voice assistant for healthcare-related issues [3].
There are many instances whereby voice assistants have been used for healthcare-related issues. For example, the Cedars-Sinai Medical Center used an Alexa-powered platform that allowed patients to verbally request their nurses, with the requests sent to the nurses' mobile phones [4]. Another study by Boyd and Wilson found that Google internet searches and Google Assistant fared better than Siri for smoking cessation information, but there was room for improvement in sourcing expert content across all three [5]. Alagha and Helbing found that Google Assistant and Siri understood consumer queries about vaccine safety and use better, and provided more reliable sources, than Alexa [6]. In contrast, Miner et al. reported that Siri, Cortana, Google Now and S Voice were inconsistent and incomplete in their responses to queries regarding mental health, interpersonal violence and physical health [7]. Similarly, Kocaballi et al. [8] suggested that Alexa, Siri, Google Assistant, Google Home, Cortana and Bixby were limited in their ability to deal with prompts about mental and physical health, violence and lifestyle. These studies have shown inconsistency in the responses of voice assistants. Furthermore, to our knowledge, no studies have evaluated voice assistants on sudden disease outbreaks and pandemics.
The usage of voice assistants by consumers to access news and information about the coronavirus disease (COVID-19) has been increasing [9]. The rapidly evolving information about this pandemic has led to an infodemic, and many sources of poor-quality information are being generated on the Internet that voice assistants may access and pass on to consumers [10]. Voice assistants can relieve the burden on healthcare professionals by informing consumers about COVID-19 symptoms and helping them recognize their own symptoms [11]. Voice assistants also offer anonymity, which can benefit consumers who fear disclosing their worries or symptoms to a healthcare professional [12]. Given the benefits that voice assistants offer in such situations, developers need to quickly update their voice assistants with the necessary abilities in order to prevent misinformation during the pandemic [13]. Since COVID-19 is an infectious disease that is transmissible via fomites [14], another advantage of voice assistants is their hands-free accessibility: consumers do not have to touch their devices to communicate, which reduces possible transmission of the virus.
Major companies, such as Apple and Amazon, have equipped their voice assistants, Siri and Alexa, with the functionality to screen users for COVID-19 based on their symptoms and to provide advice accordingly [15,16]. However, research has not been done on voice assistants' ability to provide consumers with relevant, accurate, comprehensive, user-friendly and reliable health information regarding a pandemic, such as COVID-19. Relevant, comprehensive and user-friendly information is important to ensure consumers' needs are fully met, while accurate and reliable information will ensure consumers are not misinformed. Hence, this study aims to evaluate the COVID-19-related information provided by voice assistants in terms of relevance, accuracy, comprehensiveness, user-friendliness and reliability.

VOICE ASSISTANTS EVALUATED
The voice assistants that were evaluated were: Amazon Alexa, Google Assistant, Google Home, Apple Siri, Microsoft Cortana and Samsung Bixby. Alexa was accessed via Echo Dot. Google Assistant and Siri were accessed on an iPhone 11. Cortana was accessed via a Windows laptop and Bixby via a Samsung Galaxy S8.

QUESTIONS ON COVID-19
A series of commonly asked COVID-19 questions was compiled along with their respective answers from the websites of the World Health Organization (WHO) [17], United States Centers for Disease Control and Prevention (US CDC) [18], United Kingdom National Health Service (UK NHS) [19], European Centre for Disease Prevention and Control [20], Public Health Agency of Canada [21], Australian Government's Department of Health [22], Government of India's Ministry of Health and Family Welfare [23], Ministry of Health Singapore (MOH) [24] and National Centre for Infectious Diseases Singapore (NCID) [25].
A total of 56 questions were collated and organized into 6 categories: general information, prevention, transmission, screening, diagnosis, and treatment (Appendix A). The questions were checked against frequently asked questions found on public forums such as AskDr [26], Patient.info [27] and MedHelp [28]. Questions that were not in the original list from the WHO and government websites, but appeared multiple times across these forums, were compared with search trend data from Google Search and AnswerThePublic [29] to confirm that they were frequently asked. Some questions were rephrased to add context, and questions that incorporated more than one topic were split into their respective categories.

EVALUATION RUBRIC
The rubric used was adapted from 3 studies on voice assistants in healthcare [5,6,8] and the DISCERN [30] and HONcode [31,32] quality evaluation tools (Figure 1). The point system was adapted from Alagha and Helbing [6]. The rubric evaluated 5 parameters: relevance, reliability, accuracy, comprehensiveness, and user-friendliness of the information provided.

Figure 1 Evaluation rubric for assessing the voice assistants (VAs) used in this study.

Relevance was evaluated based on how well the voice assistant's response understood (comprehension ability) and addressed (applicability of information) the question. Comprehension ability was evaluated through the voice assistants' ability to recognize the question posed and provide a response. If the voice assistant was unable to provide a response after 3 attempts, the evaluation would end with zero points awarded. A successful response was further evaluated through the number of wrongly transcribed or missing words. Applicability of information was evaluated based on how updated and relevant the response was to the question. Reliability was evaluated based on 3 criteria: transparency, bias and credibility. Transparency was assessed based on whether the authorship of the response was clearly stated, and whether there were any advertisements. Bias was defined as information provided from the author's subjective point of view, with limited evidence and an attempt to sway or convince the audience of the author's personal opinion. Credibility was assessed according to 4 grading categories applied to the voice assistants' responses and the reference citations provided. Grade A was defined as reputable sites/references backed by recognized authorities, such as the WHO, governmental websites and scientific journals. Grade B was defined as sites/references that provided information largely based on expert opinion, such as commercially oriented medical sites, clinician sites and online encyclopaedias. Grade C was defined as sites/references that might have their own agenda and were not primarily known for providing factual health information, such as social media and company websites. Grade D was used if the site/reference was not stated. In addition, for questions relating to consumer health advice, treatment and special populations, responses were evaluated for the presence of a disclaimer stating that the information provided should not substitute a healthcare professional's advice/professional judgement. If a list of responses/websites was provided, only the first response/website in the list was evaluated.
Accuracy was assessed by comparing the voice assistants' responses with our list of compiled answers (Appendix A). Answers that were totally incorrect or would lead to detrimental health consequences were awarded zero points, while partially and fully correct answers were awarded 1 and 2 points respectively. Comprehensiveness was determined based on the proportion of information in the voice assistant's response that matched our list of compiled answers. User-friendliness was assessed based on the understandability of the response by a layperson, with a clear organization of content and minimal scientific jargon and complex words.
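The point arithmetic behind the rubric can be sketched in a few lines of Python. This is an illustrative reconstruction only: the criterion names and maximum point values below are assumptions for the sketch, not the study's exact weights (which are given in Figure 1).

```python
# Illustrative rubric scoring sketch. Criterion names and maximum points are
# hypothetical stand-ins for the study's rubric, not its exact weights.
MAX_POINTS = {
    "comprehension": 2, "applicability": 2,          # relevance
    "transparency": 2, "bias": 2, "credibility": 3,  # reliability
    "accuracy": 2, "comprehensiveness": 2,
    "understandability": 2,                          # user-friendliness
}

def score_response(points_awarded: dict) -> float:
    """Convert the points awarded to one response into a percentage
    of the rubric's maximum attainable score."""
    total = sum(points_awarded.get(criterion, 0) for criterion in MAX_POINTS)
    return 100.0 * total / sum(MAX_POINTS.values())

# A fully correct, well-sourced response scores 100%; no response scores 0%.
perfect = score_response(dict(MAX_POINTS))  # 100.0
failed = score_response({})                 # 0.0
```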
The rubric was reviewed by 3 individuals (WLL, KY and QX). One of them (QX) pilot-tested the rubric using Google Assistant with 2 questions from each category in our compiled question list (Appendix A). The feedback obtained was used to refine the rubric for the final evaluation.

EVALUATION
Two independent evaluators (AG, female and JB, male) assessed the voice assistants using the same devices with the search history reset before and after each evaluator's use. All devices' languages were set as English (US) and the location function was switched off. For each question, the evaluator would score the voice assistant's response based on the evaluation rubric. If more than one weblink was provided by the voice assistant, the first weblink was evaluated. For each evaluator, after all responses were scored, each question's score was converted to a percentage and the mean percentage across all the questions was taken as that evaluator's score for the voice assistant. This was repeated for all voice assistants.
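The per-evaluator aggregation described above (each question's raw score converted to a percentage, then averaged across all questions) can be sketched as follows; the question scores and maximum of 17 points used in the example are hypothetical.

```python
# Sketch of the per-evaluator aggregation: convert each question's raw score
# to a percentage, then take the mean across all questions for one assistant.
def evaluator_score(question_scores: list, max_score: float) -> float:
    """Mean percentage score across all questions for one voice assistant."""
    percentages = [100.0 * s / max_score for s in question_scores]
    return sum(percentages) / len(percentages)

# e.g. three questions scored 17, 10 and 0 out of a hypothetical 17-point maximum
print(round(evaluator_score([17, 10, 0], 17.0), 1))  # prints 52.9
```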

ANALYSIS
Descriptive statistics were used to report the proportion of successful responses and the cited sources by the voice assistants. These proportions were reported separately for each evaluator. Evaluation scores for the voice assistants were reported as a mean of both evaluators' scores. Normality (Shapiro-Wilk) tests were performed, and the data was analyzed at a significance level of 0.05 on the Statistical Package for Social Sciences (SPSS) software (version 25). Independent samples t-tests were used for comparing smartphone-based voice assistants and voice assistants on other platforms, and voice assistants accessing Bing versus those accessing Google search engines. Wilcoxon rank-sum tests were used to compare the comprehension abilities across genders for each voice assistant. Pearson's correlation coefficient was used to determine correlation between the percentage of successful responses and the comprehension abilities of the voice assistants.
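As a transparency aid, the Pearson's correlation used in the analysis can be computed directly from paired percentages. The study ran its statistics in SPSS; the sketch below uses plain Python, and the per-assistant percentages in it are invented for illustration.

```python
import math

def pearson_r(x: list, y: list) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-assistant pairs: (% successful responses, % comprehension).
success = [95, 90, 70, 60, 45, 15]
comprehension = [92, 89, 68, 58, 48, 13]
print(round(pearson_r(success, comprehension), 3))
```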

RESULTS
The number of successful responses for the 56 COVID-19 questions differed across the voice assistants. The scores of Google Assistant were consistently higher than those of Google Home for each of the evaluation criteria (Figure 2).
Google Assistant consistently scored the best in all the evaluation criteria (Figure 3). In terms of relevance, Google Assistant scored the highest for its comprehension ability (92.0%) and applicability of information (87.3%), followed by Siri (comprehension ability 88.8%, applicability of information 86.6%) (Figure 3a). A statistically significant positive correlation was observed between the proportion of successful responses provided by the voice assistants and their comprehension ability (r = 0.981, p = 0.001). There was a decreasing score trend for both relevance and reliability (transparency and presence of bias) from Google Assistant, Siri, Bixby, Google Home, Alexa to Cortana (Figure 3a and 3b). Comprehension abilities of the voice assistants differed between the evaluators' genders, but not significantly (mean scores: females 67.5% versus males 61.8%, p = 0.738). The largest gender difference in comprehension ability occurred for Siri, whereby the median score for the female evaluator (100.0%, interquartile range 100.0-100.0%) was higher than for the male evaluator (100.0%, interquartile range 80.0-100.0%, p = 0.012).

DISCUSSION
Both Google Assistant and Siri performed well in the evaluation criteria, suggesting that COVID-19 information provided by these voice assistants was relevant, reliable, accurate, comprehensive, and user-friendly. The comparatively high scores that Google Assistant, Siri and Bixby achieved for transparency and credibility might have been due to their installation on smartphones, which allowed their responses to be displayed on screen. Unlike voice assistants on smart speakers, which could only provide verbal responses to the questions, the smartphone-based voice assistants enabled the authorship and reference citations to be more clearly identified. The majority of the COVID-19 questions posed to the voice assistants were found in the frequently asked questions sections of the government websites. However, two questions (Appendix A, under General Information, questions 9 and 10) were based on Google search trends. Although question 9 was not a frequently asked question on government websites, it was a common question asked by consumers in Google searches, and the WHO had referred to its answers in their publication [33]; hence it was included as a question in our study. Question 10, on the other hand, was rephrased from a CDC question, as the original wording was not reflective of consumers' actual search queries on Google. According to the company, Google Trends categorizes, aggregates and anonymizes actual search requests made to Google so that interest in particular topics can be displayed [34]. The question was thus adapted from Google Trends instead, since this would be more representative of how consumers would phrase their questions to the voice assistants.
Google Assistant had the best comprehension ability among all the voice assistants. It also provided longer verbal responses than Siri. Our findings were similar to another study comparing the abilities of Google Assistant, Siri and Alexa in comprehending medication names [35]. The authors reported that Google Assistant had the best comprehension accuracy, while Alexa was the worst. Our study showed that there was a correlation between the comprehension ability of the voice assistants and the proportion of successful responses, thus Google Assistant might be the best voice assistant to answer COVID-19 questions posed by the general public.
Bixby performed worse than Google Assistant on all the evaluation criteria. Our results were contrary to a study by Kocaballi and colleagues, who reported that Bixby was second to Siri in responding appropriately to health and lifestyle prompts, outperforming Google Assistant and the other voice assistants [8]. The difference is that Kocaballi and colleagues only evaluated the applicability of information, and did not assess the other criteria in our study, such as accuracy, comprehensiveness, user-friendliness and reliability. Our results showed that Bixby not only scored poorly on the accuracy and comprehensiveness of its responses, but also suffered in terms of providing relevant responses. During our evaluations, Bixby repeatedly produced the same generic responses when asked a variety of questions on COVID-19 (Appendix B). In this regard, Bixby's adaptability to the types of questions posed by the general public regarding the COVID-19 pandemic can be improved.
Google Assistant consistently scored higher than Google Home on all the evaluation parameters, even though they used the same search engine. Our findings were similar to the Kocaballi study, in which the smartphone-based voice assistants outperformed their counterparts on other platforms [8]. A possible explanation could be the different search algorithms and prioritization of search results arising from the different capabilities of the devices [36]. Unlike the smartphone-based Google Assistant, which could provide a list of resources on screen, Google Home could only vocalize its responses. Thus, instead of answering the question directly, Google Home would sometimes pose another related question back to the user, which might not have captured the essence of the user's initial question. For example, in response to "Am I protected against COVID-19 if I had the influenza vaccine this year?", Google Home posed back the question "Do you want to know what is the mortality rate of the coronavirus disease versus influenza?" When rejected, Google Home was unable to perform any further searches, hence it scored poorly on most of the evaluation parameters. Similar to Bixby, Google Home's adaptability to the types of questions posed by users can be improved.
Alexa provided long verbal responses (61.2 words on average per response) to the COVID-19 questions, similar to another study that reported that Alexa had the greatest number of spoken words in its responses compared to Siri and Google Assistant [6]. Furthermore, Alexa provided clear disclaimers in its verbal responses, drawing the user's attention to any precautions that needed to be taken when acting on the information provided. The long verbal responses by Alexa could be an advantage for special populations who cannot read small fonts on smartphones [37], such as the elderly and those with poor eyesight. More importantly, this could benefit users who choose not to touch easily avoidable surfaces in the current COVID-19 pandemic [38]. However, during our evaluations, Alexa performed poorly with regards to the applicability, credibility, accuracy and comprehensiveness of information on COVID-19, despite being used in various healthcare settings [4,39,40]. Alexa's poor scores could be due to the differences between the Bing and Google search engines, which have different search engine optimization factors that affect the search results [41,42]. For example, Google focuses on the quality rather than the quantity of backlinks, unlike Bing, which treats quality and quantity similarly. Furthermore, Bing favors backlinks with official domains, such as .edu, .org and .gov sites. In addition, the Google algorithm works on the context of search queries, whereas Bing uses targeted keywords and metadata as ranking parameters. Last, but not least, in contrast to Google searches, social media signals are used as a ranking factor in Bing searches. Since Alexa sourced its answers from Bing, the difference between its scores and those of the other voice assistants that utilized Google as a search engine (i.e. Google Assistant, Google Home, Siri and Bixby) was expected.
Nonetheless, Alexa had an algorithm embedded to identify the user's risk for COVID-19. When prompted with questions on concerns over exposure to or having COVID-19, Alexa would start the algorithm with a prompt of "If you're concerned about COVID-19, I can ask you a few questions based on CDC's guidelines to help you understand your risk and make a decision about seeking medical care. Do you have a few minutes for this?" Evaluation of this algorithm found it to be thorough in identifying related symptoms along with risk factors such as age, health conditions, and close contact with infected people. However, if the user answered "no" to the prompt, Alexa would just end the process. The usefulness of this algorithm, combined with efforts from healthcare organizations such as the Mayo Clinic to further enhance Alexa's skills in responding to COVID-19 questions [43], can potentially improve its credibility as a one-stop resource on the pandemic in time to come.
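The consent-gated triage flow described above can be illustrated with a deliberately simplified sketch. The questions, thresholds and advice strings below are invented for illustration and are not Alexa's actual CDC-based algorithm.

```python
# Hypothetical, much-simplified sketch of a consent-gated COVID-19 risk triage
# flow. All questions, thresholds and advice strings are invented; this is NOT
# Alexa's actual algorithm.
def covid_risk_triage(consented: bool, answers: dict) -> str:
    if not consented:  # user answered "no" to the opening prompt
        return "Session ended."
    emergency = ("severe_breathing_difficulty",)
    risk_factors = ("fever_or_cough", "close_contact", "age_over_65",
                    "chronic_condition")
    if any(answers.get(q) for q in emergency):
        return "Seek emergency medical care."
    hits = sum(bool(answers.get(q)) for q in risk_factors)
    if hits >= 2:  # invented threshold for illustration
        return "Contact a healthcare provider for testing advice."
    return "Monitor symptoms and follow local health guidance."
```

As in the dialogue flow described above, declining the opening prompt ends the process immediately, while a consenting user is walked through symptom and risk-factor questions before advice is given.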
Cortana performed the worst among all the voice assistants. Besides lacking comprehension ability, it also lacked reliable sources in its responses. Three-quarters of the sources provided by Cortana were Grade C sources, such as media and company sites like ABC News, which might contain health information lacking in completeness and accuracy [44]. Moreover, the media has been shown to present health issues in a perspective that disproportionately emphasizes risk, which might result in unnecessarily heightened fear among consumers [45]. Cortana's higher selection of media sites compared to the other voice assistants could also be linked to the search optimization factors of its Bing search engine rather than Google. As the search optimization factors for Bing continue to evolve [46], hopefully future pandemic-related information provided by voice assistants using Bing as a search engine will improve in terms of credibility and relevance.
Among all the parameters evaluated for the voice assistants in this study, our author consensus was that even though the accuracy, credibility and comprehensiveness of pandemic-related information would have the greatest public health impact, these parameters would require a substantial amount of effort to develop, maintain and keep up-to-date, especially in relation to the rapid spread of the infodemic (Figure 4). On the other hand, understandability, comprehension ability and applicability of information could be "quick wins" if these parameters were tailored towards a pandemic-related situation, so as to increase public awareness regarding the pandemic and enhance the user-friendliness of the voice assistants. In contrast, while little effort is needed to improve the transparency and bias of the voice assistants, these improvements would only be useful if the other evaluation parameters were enhanced. As such, developers are encouraged to prioritize the features of voice assistants according to their societal impact and the amount of effort needed to develop these features in pandemic-related situations, such as COVID-19.

LIMITATIONS AND FUTURE WORK
As this study was conceived due to the rapidly evolving nature of the COVID-19 infodemic, the evaluation framework has not been validated. The information on COVID-19 is continually changing with new and updated information, thus we were not able to evaluate the quality of information longitudinally as it would also change over time. Our author consensus was that it would be timely to create public awareness regarding the quality of voice assistants during this crucial time in order to combat the infodemic on COVID-19. As such, we intend to validate this framework for pandemic-related information as part of future research. Another limitation was that the evaluation process might not have accurately mimicked the questioning process of an average consumer's usage of a voice assistant. If the voice assistant did not understand the question on the first attempt, a total of three attempts would be made by the evaluator and any successful response provided out of the three attempts would be evaluated. In reality, consumers might have given up on their first attempt and the voice assistant would have failed to provide the appropriate information required. Although the location feature was switched off, the responses provided by the voice assistants could still have been adapted to suit Singapore's local context where the evaluation was conducted, as the Internet Protocol address of the devices might have been used to provide the results [47,48]. Hence, caution is advised when extrapolating the results of this study to other countries where the devices might provide different responses. Lastly, Chinese voice assistants were excluded. Given that the COVID-19 virus was first reported in China [49] and that Chinese voice assistants occupy a large part of the voice assistant market [50], future studies should also consider evaluating these voice assistants for pandemic-related information.

CONCLUSION
This study identified Google Assistant and Siri as the best voice assistants for providing consumers with pandemic-related information about COVID-19. Consumers need to be discerning when obtaining health-related information from voice assistants, including examining the sources of information cited by the voice assistants. On the other hand, developers should continue to enhance the skills of voice assistants in order to ensure that the information provided to consumers is reliable, accurate, comprehensive, user-friendly and relevant.

ADDITIONAL FILE
The additional file for this article can be found as follows: • Appendix A. COVID-19 questions posed to voice assistants. DOI: https://doi.org/10.29337/ijdh.25.s1