ArchéoBot: an Intelligent, Tested Companion for Exploring Archaeology and its Methods with LangChain

Following a call for digital educational projects, the team[1] led by Vincenzo Capozzoli, MCF (associate professor) in digital archaeology, proposed Archéo-bot, a chatbot built specifically for archaeology and aimed at students of the School of Art History and Archaeology (UFR03) at the University of Paris 1 Panthéon-Sorbonne. Built on the LangChain framework and on current large language models (LLMs), Archéo-bot aims to enhance the teaching of archaeology by offering UFR03 students an interactive learning experience, integrated into the digital course environment (EPI).

[1] The other members of the team are Guillaume Simiand (professeur agrégé assigned to the IRJS, EDS), Alain Duplouy (MCF HDR at the Sorbonne School of Art History and Archaeology) and François Giligny (professor of archaeology and prehistory, director of the Archaeology doctoral school).

Archéo-bot grew out of the 'Le répétiteur automatique' (automatic tutor) project, developed at the IED as an early test version by Guillaume Simiand. That first project includes a conversational agent that queries a vector database fed with Mr Simiand's law methodology course. Archéo-bot adapts this concept to archaeology and, first and foremost, to its methods. However, unlike the automatic tutor, which answers questions from a knowledge base, Archéo-bot is intended eventually to support fluid conversational interactions with students, while exploiting the latest advances in large language models (LLMs) through the LangChain framework.

The LangChain Framework

The choice of the LangChain framework for the Archéo-bot project was based on several key criteria.

Firstly, LangChain offers considerable software flexibility: the underlying language model can be swapped quickly and easily. Being able to change the model in just a few lines of code is essential for keeping the system up to date as users' needs and educational content change.
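As a rough illustration of this flexibility, the sketch below shows one way a chatbot can depend on a narrow model interface, so that the backend is chosen by configuration. It does not use LangChain itself: `hosted_model`, `open_source_model`, `MODELS` and `build_chatbot` are hypothetical names, and both backends are stubs that only tag the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is just a function from prompt text to answer text.
# Both backends below are hypothetical stand-ins, not real LLM clients.
ModelFn = Callable[[str], str]


def hosted_model(prompt: str) -> str:
    """Stand-in for a hosted commercial model."""
    return f"[hosted] {prompt}"


def open_source_model(prompt: str) -> str:
    """Stand-in for a locally run open source model."""
    return f"[local] {prompt}"


# Registry of available backends, keyed by a configuration name.
MODELS: Dict[str, ModelFn] = {
    "hosted": hosted_model,
    "local": open_source_model,
}


@dataclass
class Chatbot:
    """The chatbot depends only on the ModelFn interface."""
    model: ModelFn

    def ask(self, question: str) -> str:
        prompt = f"Answer as an archaeology tutor: {question}"
        return self.model(prompt)


def build_chatbot(model_name: str) -> Chatbot:
    # Switching models is a one-word configuration change.
    return Chatbot(model=MODELS[model_name])


if __name__ == "__main__":
    for name in ("hosted", "local"):
        print(build_chatbot(name).ask("What is stratigraphy?"))
```

Because the rest of the code only ever calls `Chatbot.ask`, replacing one backend with another leaves prompts, retrieval and interface code untouched.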

Secondly, with widespread adoption by the university's users in view, and given the costs associated with OpenAI services, the decision has been taken to favour an open source language model in the long term. This move will promote the accessibility and sustainability of the Archéo-bot project, while giving the team greater autonomy in managing and evolving the system.

Finally, LangChain is an open source framework concerned primarily with orchestrating data processing steps and flows. It is not a service hosted on remote cloud servers: it runs locally, in an environment controlled by the project team. The remaining question is what happens to the data sent to the language model. For now, the guarantee that the data collected will not be reused to retrain language models comes from the use of OpenAI's paid API. In the future, the same assurance could be obtained by using an open source language model. This is essential to protect the confidentiality of our teachers' specific learning resources, while maintaining academic integrity and data security within our institution.

Archéo-bot and Other Chatbots

Alongside this software choice, what sets the Archéo-bot project apart from other chatbot projects is that its knowledge base will combine three kinds of source:

  • Existing courses in rich text format (course materials, documents)
  • Subtitled educational videos
  • Structured bibliographic resources (Zotero export)

These various source documents will be vectorised and integrated into a searchable vector database for efficient extraction of relevant information by the chatbot. Mechanisms for updating this database are planned to guarantee consistency and the integration of the latest archaeological knowledge.
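To make the preparation step concrete, here is a minimal sketch of how the three kinds of source could be turned into uniform passages that each remember where they came from. It assumes, without confirmation from the project, that the Zotero library is exported as CSL JSON (one of Zotero's export formats) and that texts are split by word count. `Chunk`, `chunks_from_text` and `chunks_from_csl_json` are hypothetical names, and the sample data is placeholder content.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str    # passage that will later be vectorised
    source: str  # human-readable origin, shown to the student
    kind: str    # "course", "video" or "bibliography"


def split_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def chunks_from_text(text: str, source: str, kind: str,
                     max_words: int = 50) -> List[Chunk]:
    """Works for course text and for subtitle transcripts alike."""
    return [Chunk(piece, source, kind)
            for piece in split_words(text, max_words)]


def chunks_from_csl_json(raw: str) -> List[Chunk]:
    """Turn a CSL JSON bibliography into one chunk per reference."""
    chunks = []
    for item in json.loads(raw):
        authors = ", ".join(a.get("family", "") for a in item.get("author", []))
        title = item.get("title", "")
        source = f"Zotero:{item.get('id', '?')}"
        chunks.append(Chunk(f"{authors}. {title}", source, "bibliography"))
    return chunks


if __name__ == "__main__":
    # Placeholder data, not real course content or references.
    course = chunks_from_text("one two three four five", "Course notes", "course",
                              max_words=2)
    bib = chunks_from_csl_json(
        '[{"id": "example-1", "title": "An Example Title",'
        ' "author": [{"family": "Doe", "given": "Jane"}]}]'
    )
    for chunk in course + bib:
        print(chunk)
```

Keeping the `source` field on every passage is what later allows an answer to point back to the course, video or reference it relied on.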

Among its major innovations, Archéo-bot aims to stand out by:

  • Reducing AI hallucinations. The system is designed to minimise errors and inaccurate responses, thereby improving the reliability of interactions.
  • Displaying sources. Every response provided by Archéo-bot cites its sources, ensuring transparency and traceability of information.
  • Switching language models. The ability to switch easily between different language models allows for adaptability and continuous updating of the system.
  • Integrating a Zotero library. All references and sources are systematically stored in a dedicated Zotero library, making resource management more efficient.

Figure: illustration of how Archéo-bot works.

The Various Stages of Work on Archéo-bot

The project will involve several key stages:

Course Writing and Structuring

This stage consists of transforming existing and new courses into content that Archéo-bot can use, with a focus on clarity and on the organisation of key concepts. The data will include the videos already produced for the Ancient Cities MOOC, led by Alain Duplouy and already used in his L1 course on the art and archaeology of classical antiquity, as well as the videos that Vincenzo Capozzoli is progressively producing for his teaching of digital practices in archaeology.

Vectorisation and Data Management

The lessons are converted into vectors, a compact representation that captures meaning and makes it possible to search for relevant information. Vectorising the texts could also reveal similarities and correspondences between different courses, which would support personalised recommendations and the discovery of related material. The vector database will keep the pedagogical data consistent and manageable, and will make course updates easy to integrate. The result should be a richer, more in-depth learning experience for students, based on accurate, up-to-date information.
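The idea of searching by vector similarity can be sketched in a few lines. The example below is a toy, not the project's implementation: it replaces a learned embedding model with simple word counts and ranks passages by cosine similarity. `embed`, `cosine` and `VectorStore` are hypothetical names, and the three passages are illustrative.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts.
    A real system would use a learned embedding model instead."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory store: keeps each passage next to its vector."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._rows.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> List[Tuple[float, str]]:
        """Return the k passages most similar to the query, best first."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self._rows]
        return sorted(scored, reverse=True)[:k]


store = VectorStore()
store.add("Stratigraphy studies the order of soil layers on a site.")
store.add("Photogrammetry builds 3D models from overlapping photographs.")
store.add("Radiocarbon dating estimates the age of organic remains.")

if __name__ == "__main__":
    for score, text in store.search("How are 3D models built from photographs?", k=1):
        print(f"{score:.2f}  {text}")
```

The same `cosine` function applied to two course passages, rather than to a question and a passage, gives the kind of course-to-course similarity mentioned above.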

Chatbot Development

This stage uses advanced language models for natural and accurate interaction, with particular attention to error management and to preventing AI hallucinations. At launch, OpenAI's GPT-4 model will be used, but thanks to the LangChain framework, other large language models available through Hugging Face can also be integrated easily. Whichever model is chosen, Archéo-bot will be designed to understand a wide variety of student questions and to answer them in context. The chatbot will also seek to deepen discussions, much as a human teacher would. To prevent the spread of erroneous information, error detection and verification of the generated answers will be applied on an ongoing basis. The aim is to minimise the risk of 'hallucination' inherent in artificial intelligence models.
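One simple guard against unsupported answers, offered here only as an illustration and not as the project's actual verification mechanism, is to answer solely from retrieved passages, decline when none is similar enough to the question, and list the sources used. In the sketch below, `Retrieved`, `answer_with_sources`, `stub_llm` and the `min_score` threshold of 0.3 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Retrieved:
    text: str     # passage returned by the vector search
    source: str   # where the passage comes from
    score: float  # similarity to the question, between 0 and 1


FALLBACK = "I could not find this in the course materials."


def answer_with_sources(question: str,
                        hits: List[Retrieved],
                        llm: Callable[[str], str],
                        min_score: float = 0.3) -> str:
    """Answer from sufficiently similar passages only, and cite them."""
    supported = [h for h in hits if h.score >= min_score]
    if not supported:
        # Declining is preferred to letting the model improvise.
        return FALLBACK
    context = "\n".join(f"[{i}] {h.text}" for i, h in enumerate(supported, 1))
    prompt = (
        "Answer using only the numbered passages below.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    answer = llm(prompt)
    sources = "; ".join(f"[{i}] {h.source}" for i, h in enumerate(supported, 1))
    return f"{answer}\nSources: {sources}"


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real language model call."""
    return "Stub answer."


if __name__ == "__main__":
    hits = [Retrieved("Placeholder passage.", "Course notes, session 2", 0.8)]
    print(answer_with_sources("What is stratigraphy?", hits, stub_llm))
    print(answer_with_sources("What is stratigraphy?", [], stub_llm))
```

A threshold of this kind trades coverage for reliability: set too high, the chatbot declines questions it could answer; set too low, weakly related passages reach the model.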

Integration and Support

Archéo-bot will be directly integrated into the digital educational spaces (EPI) that students use as part of their courses. The aim is to make the chatbot available in a familiar environment so that interactions feel natural. Exchanges with the chatbot can also feed into classroom discussions. For example, some of the questions raised in the EPI could be revisited in class to explore or clarify certain concepts. Integrating Archéo-bot is therefore meant to strengthen the interaction between students and teachers, not to replace it: the chatbot should enrich teaching methods without dehumanising the learning relationship.

Figure: overview of the different stages of the Archéo-bot chatbot.

In conclusion, this project, the fruit of collaboration between teachers from UFR03, engineers from Paris 1 SUN and Guillaume Simiand from EDS-IRJS, aims to enrich the educational experience in archaeology. It will offer a personalised, immersive and interactive approach, while protecting educational resources against third-party use. By combining cutting-edge technology with innovative teaching methods, it promises to deepen understanding, stimulate curiosity and strengthen interaction between students and teachers, while keeping the risks inherent in artificial intelligence under control.