A comprehensive comparison of four “ChatGPT-style search” engines, hand-annotated by a Stanford Chinese Ph.D. student: New Bing has the lowest fluency, and nearly half of the sentences lack citations - IT Home
Generative search engines still cannot replace traditional search engines: too few sentences cite a source, and citation accuracy is low.
Shortly after the release of ChatGPT, Microsoft successfully jumped on the bandwagon with the “New Bing”. Its stock soared, and the product even looked poised to displace Google and open a new era of search engines.
But is the new Bing really the right way to use large language models? Are the generated answers actually useful to users? How reliable are the citations attached to its sentences?
Recently, researchers at Stanford collected a large number of user queries from different sources and manually evaluated four popular generative search engines: Bing Chat, NeevaAI, perplexity.ai and YouChat.
Experimental results find that responses from existing generative search engines are fluent and informative, but often contain unsubstantiated statements and inaccurate citations.
On average, only 51.5% of generated sentences are fully supported by citations, and only 74.5% of citations support their associated sentences.
The researchers argue that these figures are too low for systems that could become a primary tool for information-seeking users, especially since some generated sentences merely sound plausible; generative search engines still need further optimization.
The first author, Nelson Liu, is a fourth-year doctoral student in the Natural Language Processing Group at Stanford University, advised by Percy Liang; he received his bachelor’s degree from the University of Washington. His main research direction is building practical NLP systems, especially applications for information seeking.
Don’t Trust Generative Search Engines
In March 2023, Microsoft reported that “approximately one-third of Daily Preview users use [Bing] chat” and that Bing chat served 45 million chats in the first month of its public preview. In other words, there is a real market for integrating large language models into search engines, and doing so is very likely to change how people enter the Internet through search.
At present, however, generative search engines built on large language models still suffer from low accuracy. How low has not been thoroughly evaluated, so the limitations of these new search engines are not yet understood.
Verifiability is the key to improving the credibility of search engines: each sentence in a generated answer should carry citations with external links as supporting evidence, making it easier for users to verify the answer’s accuracy.
The researchers performed human evaluation on four commercial generative search engines (Bing Chat, NeevaAI, perplexity.ai, YouChat) by collecting questions of different types and sources.
The evaluation metrics are fluency, that is, whether the generated text is coherent; usefulness, that is, whether the search engine’s reply helps the user and whether the information in the answer resolves the question; citation recall, that is, the proportion of generated sentences about the external world that are fully supported by citations; and citation precision, that is, the proportion of generated citations that support their associated sentences.
To measure fluency, annotators were shown the user query, the generated response, and the statement “the response is fluent and semantically coherent”, and rated their agreement on a five-point Likert scale.
Usefulness is rated the same way as fluency: annotators rate how much they agree with the statement that the response is useful and informative for the user query.
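As a rough illustration of how five-point Likert ratings turn into scores such as 4.40, the Python sketch below averages a handful of made-up ratings. The label wording of the scale and the use of a plain mean are assumptions; the article only says that a five-point Likert scale was used.

```python
from statistics import mean

# Hypothetical labels for a five-point agreement scale (assumed wording).
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def likert_score(ratings):
    """Average a list of Likert labels on the 1-5 numeric scale."""
    return mean(LIKERT[r] for r in ratings)


# Made-up annotator ratings for one engine's responses.
fluency_ratings = ["agree", "strongly agree", "strongly agree", "neutral"]
print(likert_score(fluency_ratings))
```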
Citation recall refers to the proportion of verifiable sentences that are fully supported by their relevant citations, so the calculation of this indicator needs to determine the verifiable sentences in the response and evaluate whether each verifiable sentence can be supported by relevant citations.
In “identifying verifiable sentences,” the researchers treat every generated sentence about the external world as verifiable, even ones that seem obvious and trivial, because what some readers regard as obvious “common sense” may not actually be true.
The goal of a search engine system should be to provide a reference source for all generated sentences about the outside world, enabling readers to easily verify any narrative in the generated responses, without sacrificing verifiability for simplicity.
In practice, annotators therefore verify every generated sentence except replies in the system’s first person, such as “As a language model, I’m not capable of doing…”, and questions directed at the user, such as “Do you want to know more?”
Whether a verifiable sentence is adequately supported by its associated citations is assessed with the attributable to identified sources (AIS) framework. Annotators make a binary judgment: if a typical reader would agree that “based on the cited webpage, it can be concluded that…”, then the citation fully supports the sentence.
To measure citation precision, annotators judge whether each citation provides full, partial, or no support for its associated sentence.
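The citation recall computation described above can be sketched in Python as follows. This is a minimal illustration, not the researchers' code: the `Sentence` record and its field names are hypothetical, and the boolean `fully_supported` flag stands in for the annotator's binary AIS judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    """One generated sentence with its human annotations (hypothetical schema)."""
    text: str
    verifiable: bool        # False for first-person replies or questions to the user
    fully_supported: bool   # annotator's binary AIS judgment over its citations


def citation_recall(sentences: List[Sentence]) -> float:
    """Share of verifiable sentences that are fully supported by their citations."""
    verifiable = [s for s in sentences if s.verifiable]
    if not verifiable:
        raise ValueError("no verifiable sentences to score")
    supported = sum(1 for s in verifiable if s.fully_supported)
    return supported / len(verifiable)


# Made-up annotations for a single response.
response = [
    Sentence("The Eiffel Tower is in Paris.", True, True),
    Sentence("It was completed in 1889.", True, False),
    Sentence("Would you like to know more?", False, False),
]
print(citation_recall(response))
```

The question addressed to the user is excluded from the denominator, so only the two statements about the external world are scored.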
Full support: All the information in the sentence is supported by citations.
Partial support: Some information in the sentence is supported by citations, but other parts may be missing or contradictory.
No support: The cited web page is completely irrelevant to the sentence or contradicts it.
For sentences with multiple related citations, annotators are additionally asked to use the AIS evaluation framework to judge whether all related citation pages as a whole provide sufficient support for the sentence (binary judgment).
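The precision judgment can be sketched in the same spirit. The article does not spell out how partial support enters the final number, so the rule below is an assumption: a partially supporting citation is counted only when the sentence's citations jointly support it. The `Citation` and `CitedSentence` records are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """One citation attached to a sentence (hypothetical schema)."""
    support: str  # "full", "partial", or "none", as judged by an annotator


@dataclass
class CitedSentence:
    citations: List[Citation]
    jointly_supported: bool  # binary AIS judgment over all citations together


def citation_precision(sentences: List[CitedSentence]) -> float:
    """Share of citations that count as supporting their sentence."""
    total = 0
    counted = 0
    for sentence in sentences:
        for citation in sentence.citations:
            total += 1
            if citation.support == "full":
                counted += 1
            elif citation.support == "partial" and sentence.jointly_supported:
                # Assumed rule: partial support counts only when the
                # citations as a whole fully support the sentence.
                counted += 1
    if total == 0:
        raise ValueError("no citations to score")
    return counted / total


# Made-up annotations: five citations spread over three sentences.
example = [
    CitedSentence([Citation("full"), Citation("none")], True),
    CitedSentence([Citation("partial"), Citation("partial")], True),
    CitedSentence([Citation("partial")], False),
]
print(citation_precision(example))
```

Under this assumed rule, three of the five citations are counted.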
In the fluency and usefulness evaluation, every search engine generates fluent and useful responses.
Broken down by engine, Bing Chat has the lowest fluency/usefulness scores (4.40/4.34), followed by NeevaAI (4.43/4.48), perplexity.ai (4.51/4.56), and YouChat (4.59/4.62).
Across query categories, responses to shorter extractive questions, which usually only call for factual knowledge, tend to be more fluent than responses to long questions. Some difficult questions require summarizing several tables or web pages, and this synthesis process reduces overall fluency.
In the citation evaluation, existing generative search engines often fail to cite web pages fully or correctly: on average, only 51.5% of generated sentences are fully supported by citations (recall), and only 74.5% of citations support their associated sentences (precision).
This value is unacceptable for a search engine system that already has millions of users, especially when generating replies with a relatively large amount of information.
Citation recall and precision also differ widely across generative search engines: perplexity.ai achieves the highest recall (68.7), while NeevaAI (67.6), Bing Chat (58.7) and YouChat (11.1) score lower.
Bing Chat, on the other hand, achieves the highest precision (89.5), followed by perplexity.ai (72.7), NeevaAI (72.0) and YouChat (63.6).
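As a quick arithmetic check, the unweighted mean of the four per-engine figures quoted above lands close to the reported overall averages of 51.5 and 74.5. The article does not say how those averages were computed, so treating them as a simple mean over engines is an assumption.

```python
from statistics import mean

# Per-engine citation recall and precision as quoted in the article.
recall = {"perplexity.ai": 68.7, "NeevaAI": 67.6, "Bing Chat": 58.7, "YouChat": 11.1}
precision = {"Bing Chat": 89.5, "perplexity.ai": 72.7, "NeevaAI": 72.0, "YouChat": 63.6}

# Assumption: the overall figure is a plain average over the four engines.
mean_recall = mean(recall.values())
mean_precision = mean(precision.values())

print(f"mean recall:    {mean_recall:.2f}")
print(f"mean precision: {mean_precision:.2f}")
```

YouChat's recall of 11.1 pulls the average down sharply; the other three engines all sit above 58.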
Across user queries, citation recall for NaturalQuestions queries with long answers is nearly 11 points higher than for non-NaturalQuestions queries (58.5 versus 47.8).
Likewise, citation recall for NaturalQuestions queries with short answers is nearly 10 points higher than for NaturalQuestions queries without short answers (63.4 for queries with short answers, 53.6 for queries with only long answers, and 53.4 for queries with no long or short answers).
Citation recall is lower for questions that lack supporting web pages: on the open-ended AllSouls essay questions, for example, generative search engines achieved a citation recall of only 44.3.
This article comes from the WeChat public account: Xin Zhiyuan (ID: AI_era)