Large language models (LLMs) have advanced significantly in recent years, greatly enhancing the capabilities of retrieval-augmented generation (RAG) systems. However, challenges related to semantic consistency, sentiment bias, and hallucination persist, especially in domain-specific applications. This paper introduces MultiLLM-Chatbot, a scalable RAG-based benchmarking framework designed to evaluate five popular LLMs (GPT-4-Turbo, CLAUDE-3.7-Sonnet, LLAMA-3.3-70B, DeepSeek-R1-Zero, and Gemini-2.0-Flash) across five domains: Agriculture, Biology, Economics, Internet of Things (IoT), and Medical. Fifty peer-reviewed research papers (10 per domain) were used to generate 250 standardized queries, yielding 1,250 model responses. Text was extracted from the PDFs using PyPDF2, segmented to preserve factual coherence, embedded with sentence-transformer models, and indexed in Elasticsearch for efficient retrieval. Each response was analyzed along four dimensions: cosine similarity for semantic alignment, VADER sentiment analysis for sentiment detection, TF-IDF scoring for lexical relevance, and named entity recognition (NER) for hallucination identification and factual verification. A composite scoring scheme aggregates these metrics to rank model performance. Experimental results show LLAMA-3.3-70B as the best-performing model overall, leading in all five domains. The proposed framework is implemented as Colab notebooks, offering a reproducible, extensible pipeline for domain-specific LLM benchmarking. By combining cross-domain analysis with multi-metric evaluation, this study addresses gaps in current LLM benchmarking practice and offers a modular architecture that can be adapted to new domains and future LLM advancements. The findings inform model selection strategies for researchers and practitioners seeking trustworthy LLM deployment across diverse industrial and scientific sectors.
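As a rough illustration of the ingestion and retrieval steps named above, the sketch below extracts PDF text with PyPDF2, chunks it, embeds the chunks with a sentence-transformer model, and indexes them in Elasticsearch. The embedding model (all-MiniLM-L6-v2), chunk size, index name, and cluster URL are assumptions made for illustration, not details taken from the paper.

```python
# Hedged sketch of the ingestion pipeline, assuming PyPDF2, sentence-transformers,
# and the Elasticsearch Python client. Model name, chunk size, index name, and
# cluster URL are illustrative assumptions, not values reported in the paper.
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
es = Elasticsearch("http://localhost:9200")          # assumed local cluster


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into word-bounded chunks so factual statements stay intact."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def index_paper(pdf_path: str, domain: str, index: str = "rag_chunks") -> None:
    """Embed each chunk and store it with its vector for later retrieval."""
    for i, chunk in enumerate(chunk_text(extract_text(pdf_path))):
        es.index(index=index, document={
            "domain": domain,
            "chunk_id": i,
            "text": chunk,
            "embedding": embedder.encode(chunk).tolist(),
        })
```

In practice the index would also need a `dense_vector` mapping before vector retrieval could be run against it; that configuration is omitted here for brevity.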
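The four per-response metrics and the composite score could be assembled along the following lines. The NER toolkit (spaCy here), the handling of the VADER compound score, and the equal-weight aggregation are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the four per-response metrics and a simple composite score.
# The abstract does not name the NER toolkit or the aggregation weights; spaCy
# and the equal-weight average below are assumptions made for illustration.
import spacy
from sentence_transformers import SentenceTransformer, util
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
vader = SentimentIntensityAnalyzer()
ner = spacy.load("en_core_web_sm")                   # assumed NER model


def evaluate(response: str, reference: str) -> dict:
    # 1. Semantic similarity: cosine similarity of sentence embeddings.
    semantic = util.cos_sim(embedder.encode(response), embedder.encode(reference)).item()

    # 2. Sentiment: VADER compound score in [-1, 1]; closer to 0 means more neutral.
    sentiment = vader.polarity_scores(response)["compound"]

    # 3. Lexical relevance: cosine similarity of TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform([response, reference])
    lexical = float(cosine_similarity(tfidf[0:1], tfidf[1:2])[0, 0])

    # 4. Hallucination proxy: fraction of named entities in the response that
    #    also appear in the retrieved reference text.
    resp_ents = {ent.text.lower() for ent in ner(response).ents}
    ref_ents = {ent.text.lower() for ent in ner(reference).ents}
    grounding = len(resp_ents & ref_ents) / len(resp_ents) if resp_ents else 1.0

    # Composite: equal-weight average that rewards neutral tone
    # (an assumption; the paper's weighting may differ).
    composite = (semantic + (1 - abs(sentiment)) + lexical + grounding) / 4
    return {"semantic": semantic, "sentiment": sentiment, "tfidf": lexical,
            "entity_grounding": grounding, "composite": composite}
```

A per-model ranking would then follow from averaging such composite scores over the 250 responses each model produces per domain.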