AI chatbots show large gaps in historical knowledge

AI chatbots can handle many tasks, but a performance comparison of their historical knowledge and understanding, conducted by a team of researchers with the participation of the Vienna-based Complexity Science Hub (CSH), produced mixed results. GPT-4 Turbo performed best of the seven models tested.

Peter Turchin, head of the Social Complexity and Collapse research group at CSH, was surprised by the poor performance on the academic-level questions. For years, he and his colleagues have collected knowledge about human history in the ‘Seshat Global History Databank’. This database also served as the basis for testing how well AI chatbots built on large language models (LLMs) understand history.

GPT-4 Turbo performed the best
The seven models had to choose the correct answer from four options. All of them beat the 25 percent expected from random guessing, though not by much. GPT-4 Turbo from ChatGPT developer OpenAI performed best with a hit rate of 46 percent, while Llama-3.1-8B from Facebook parent company Meta came last with 33.6 percent. It should be noted that the questions did not test general knowledge but were at expert level, in line with the database, which contains knowledge about 600 societies worldwide.

The test covered not only whether the models correctly identify facts, but also whether they can infer them from indirect evidence, explains first author Jakob Hauser of CSH in a press release. According to the study, which was recently presented at the NeurIPS conference in Vancouver, Canada, a prominent meeting place for the AI community, performance differed considerably across specific areas.

Differences by subject, region and era
For example, the models were weaker at assessing the characteristics of past societies outside North America and Western Europe. There were also significant gaps in historical understanding for more recent eras up to the present, while questions about early history, especially the period between 8000 BC and 3000 BC, were answered very accurately. In terms of subject category, the models were weaker on discrimination and social mobility.

The models are well suited to conveying basic facts, “but when it comes to more nuanced historical research, they are not yet up to the task,” co-author Maria del Rio-Chanona, CSH external faculty member and assistant professor at University College London, is quoted as saying. In the future, more data from underrepresented regions is to be included in the performance comparison, and more models are to be tested.

Source: Krone
