Gemini 1.5 Flash is a powerful model designed for high-volume, low-latency AI tasks. It boasts a massive 1 million token context window, enabling it to process extensive data like lengthy documents, videos, or codebases. That also means you can attach documents directly and skip RAG, where we would normally use an embedding model, store the chunks in a vector store, and then have a retriever search by similarity using whichever parameters we set. But what is the catch? That's sending a lot of tokens all at once, so can't it get quite expensive? Let's take a look at the price of the free version.
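For contrast, here is roughly what the retrieval side of a bare-bones RAG pipeline looks like with the same reticulate setup we use below. This is only a sketch, not the exact pipeline from the earlier post: it assumes the sentence-transformers and faiss-cpu Python packages are installed, and the chunk sizes, k, and query are placeholders.

# A minimal RAG-style retrieval sketch (assumes sentence-transformers and faiss-cpu are installed)
library(reticulate)

langchain           <- import("langchain")
langchain_community <- import("langchain_community")

# 1. Load and split the PDF into chunks
loader   <- langchain_community$document_loaders$PyPDFLoader("amr-guidance-4.0.pdf")
docs     <- loader$load()
splitter <- langchain$text_splitter$RecursiveCharacterTextSplitter(chunk_size = 1000L, chunk_overlap = 100L)
chunks   <- splitter$split_documents(docs)

# 2. Embed the chunks and store them in a FAISS vector store
embeddings  <- langchain_community$embeddings$HuggingFaceEmbeddings()
vectorstore <- langchain_community$vectorstores$FAISS$from_documents(chunks, embeddings)

# 3. Retrieve the most similar chunks for a question, then pass only those to the LLM
retriever <- vectorstore$as_retriever(search_kwargs = list(k = 4L))
context   <- retriever$invoke("What is the preferred treatment of CRE?")

With RAG, only the retrieved chunks go into the prompt; with Gemini's long context, we just send the whole document.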
Here is the link in case the pricing has changed.
For our purpose, it seems like the FREE version would work just fine! There are limits: 15 requests per minute, 1 million tokens per minute, and 1,500 requests per day. That's a lot of tokens!
But is this really better than RAG? Let's find out. If you want to check out our previous play with Llama 3, take a look at this; we discussed how RAG works there, and you can look at the prior responses. Here, we will use LLM-as-a-judge to assess the relevance, factual accuracy, and succinctness of the responses Gemini 1.5 Flash generated. How well do you think it's going to perform?
Before you begin, remember to get your Gemini API key.
library(tidyverse)
library(reticulate)

# Step 1: Create a virtual environment. If you've already created one, please move on to step 2. This is a best practice.
virtualenv_create(envname = "gemini")

## Step 1.1: Install the appropriate modules
py_install(
  c("google-generativeai", "langchain", "langchain-community", "pypdf", "python-dotenv"),
  envname = "gemini",
  pip = TRUE
)

# Step 2: Use the virtual environment
use_virtualenv("gemini")

# Step 3: Import installed modules
dotenv <- import("dotenv")
genai <- import("google.generativeai")
langchain_community <- import("langchain_community")
PyPDFLoader <- langchain_community$document_loaders$PyPDFLoader
langchain <- import("langchain")
PromptTemplate <- langchain$prompts$PromptTemplate

# Step 4: Load your API keys from a .env file - see https://pypi.org/project/python-dotenv/
dotenv$load_dotenv(dotenv_path = ".env")

# Step 5: Load the PDF of interest
loader <- PyPDFLoader("amr-guidance-4.0.pdf")
documents <- loader$load()

# Step 6: Set up Gemini
genai$configure() # if you're skipping dotenv, insert your API key here
llm <- genai$GenerativeModel(
  'gemini-1.5-flash',
  generation_config = genai$GenerationConfig(
    max_output_tokens = 2000L,
    temperature = 0
  )
)

# Step 7 (optional): Test
llm$generate_content(contents = "hello") # you should see a return of text, token counts, etc.

# Step 8: Prompt
prompt_text <- "
You are a question and answer assistant. Given the context below, answer the question.

Context: {text}

Question: {question}
"

prompt <- PromptTemplate(template = prompt_text, input_variables = list("text", "question"))

questions <- c(
  "What is the preferred treatment of CRE?",
  "What is the preferred treatment of ESBL-E?",
  "Can we use fosfomycin in ESBL Klebsiella?",
  "Can we use fosfomycin in ESBL Ecoli?",
  "What is the preferred treatment of stenotrophomonas?",
  "What is the preferred treatment of DTR Pseudomonas?",
  "Which organisms require two active agents when susceptibility is known?",
  "Can we use gentamicin in pseudomonas infection?",
  "Can we use tobramycin to treat pseudomonas infection?",
  "Why are there carbapenemase non-producing organisms?",
  "Can we use oral antibiotics for any of these MDRO?",
  "What is the preferred treatment of MRSA?",
  "What is the preferred treatment of CRAB?",
  "Can fosfomycin be used for pyelonephritis?",
  "Are IV antibiotics better than oral antibiotics?"
)

content <- prompt$format(
  text = documents,
  question = questions
)

# Step 9: Generate content / aka LangChain lingo == invoke
response <- llm$generate_content(contents = content)

# Step 10: Let's simulate a streaming response 🤪
print_keystrokes <- function(text) {
  for (char in strsplit(text, "")[[1]]) {
    cat(char)        # Print the character
    Sys.sleep(0.005) # Optional delay for visual effect
  }
  cat("\n")          # Add a newline at the end
}

print_keystrokes(response$text)
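Before invoking the model on the full document, it's worth checking how many tokens we're actually about to send against the 1 million token context window and the free-tier tokens-per-minute limit. The google-generativeai SDK's count_tokens method does this without generating anything; a small check using the objects created above:

# Optional: check how many tokens the prompt + document will consume before sending
token_info <- llm$count_tokens(contents = content)
token_info$total_tokens # compare against the 1M context window / free-tier tokens-per-minute limit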
OK, we don't really need step 10; it's more for show. But what do you think? Is it better than our prior RAG?
Let’s take a closer look here:
First of all, I didn't even specify which conditions, and Gemini was able to return responses for the different conditions, something we definitely did not see with our previous WizardLM model. The response overall also appeared to be quite accurate.
Impressive! It separated the different conditions and accurately returned uncomplicated cystitis vs pyelonephritis and their treatments. It was also impressive that it cautioned against ertapenem use in the setting of hypoalbuminemia in critically ill patients and offered the appropriate treatment. Wow! I'm starting to like what I'm seeing so far.
Correct!
Yup!
This response is quite amazing! Our previous RAG couldn't get an accurate response without removing the references and using the proper term "S. maltophilia", but Gemini was able to return the correct response without requiring any additional cleaning!
Looks about right.
OK, this is a tricky one! No matter how I tweaked our previous RAG, I couldn't get it to return both stenotrophomonas and CRAB in the answer, but Gemini was able to! Truly impressive!
Wait a minute, why is it “yes” for this question? It is interesting that it was able to return the second sentence.
Cool beans!
I'm a bit confused here. Why ESBL in CRE?
Sure
Good return! It did not hallucinate or try to force an answer, even with very minimal prompt engineering!
Not bad!
Alright!
Not too shabby too!
Using an LLM as a judge to evaluate other LLMs’ responses involves leveraging advanced language models to assess output quality across various dimensions. Key aspects to evaluate include relevance, coherence, factual accuracy, completeness, language quality, reasoning, creativity, safety, and task-specific criteria. The process requires careful prompt engineering, model selection, and consistency checks. Evaluators should consider relevance to the query, logical structure, factual correctness, comprehensiveness, grammar, reasoning quality, originality, ethical considerations, and metacognitive awareness. Implementing this approach necessitates designing clear evaluation criteria, using few-shot examples, and developing a robust scoring system while remaining mindful of potential biases in the judge model itself.
In our use case, we will assess the relevance, factual accuracy, and succinctness of the responses. The whole point of using an LLM as a tool to chat with a document is to get the essence of the context; I do not want it to return the whole text, but rather a concise output that helps me either gain knowledge efficiently or ask more questions. Either way, that's great for life-long learning!
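If you wanted to script the judging step rather than use a chat interface, the judge prompt could look something like the sketch below. The criteria match what we just described, but the 1-5 scale and the wording are my own, not taken from the guidance document or either model:

# A sketch of an LLM-as-a-judge prompt (the 1-5 scale and wording are illustrative)
judge_prompt_text <- "
You are an impartial judge. Given the source document, a question, and a model's answer,
rate the answer on a 1-5 scale for each criterion and briefly justify each score:
- relevance: does the answer address the question?
- factual_accuracy: is the answer supported by the source document?
- succinctness: is the answer concise without losing key information?

Source document: {text}
Question: {question}
Answer: {answer}

Return your ratings as: relevance, factual_accuracy, succinctness, justification.
"

judge_prompt <- PromptTemplate(
  template = judge_prompt_text,
  input_variables = list("text", "question", "answer")
)

Any sufficiently capable model can play the judge; here we'll simply paste the answers into Claude.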
Let's use Anthropic's Claude 3.5 Sonnet to assess Gemini 1.5 Flash's responses.
Wow, not too shabby! Even Claude agreed for the most part. I basically attached the PDF in Claude 3.5 Sonnet and then wrote the prompt below:
Then I pasted Gemini's responses and had Claude output its scores in a schema I could read into a data.frame, and applied a heatmap with DT, hence the colored DT datatable!
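For anyone curious how the coloring works, here is a minimal DT sketch with made-up scores; the column names and color cut points are illustrative, not the exact ones used for the table above:

# A minimal DT heatmap sketch (scores, column names, and color cut points are illustrative)
library(DT)

scores <- data.frame(
  question         = c("CRE", "ESBL-E", "Stenotrophomonas"),
  relevance        = c(5, 5, 4),
  factual_accuracy = c(5, 4, 5),
  succinctness     = c(4, 5, 5)
)

datatable(scores) |>
  formatStyle(
    columns = c("relevance", "factual_accuracy", "succinctness"),
    backgroundColor = styleInterval(c(2, 4), c("#f8696b", "#ffeb84", "#63be7b"))
  )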
Limitations

Lessons learnt
Overall, I am quite impressed with the responses! With minimal prompt engineering and no document cleaning, Gemini was able to return accurate responses, separate the different conditions, and provide appropriate treatment options. It was also able to return the correct response for tricky questions that our RAG was not able to. It definitely has potential!

We haven't explored context caching here, but if you have a long context you can upload the file once and use context caching for a lower price. See this, and the sketch below.

Lastly, is it better than RAG? Well, it depends. If the documents + prompt + query do not exceed 1 million tokens per minute, maybe. Otherwise, RAG appears to be more effective, and with RAG there is also a clearer way to check which context was actually retrieved, unlike this approach.

So, there you have it: if you want plug and play without worrying about what is going on under the hood, this might be a good one for you! Otherwise, if you're like me, nosy and wanting to know what's going on, you might still want to stick with RAG for a bit.
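For reference, context caching with the google-generativeai SDK looks roughly like the sketch below. I did not run this here; the model version string, TTL, and minimum cached-size requirements come from Google's documentation and may change, so treat it as a starting point rather than a recipe:

# A context-caching sketch (not run here; model version string and TTL are illustrative)
datetime <- import("datetime")

# Caching expects plain content, so collapse the loaded pages into one string
doc_text <- paste(sapply(documents, function(d) d$page_content), collapse = "\n")

# Cache the document once; caching requires an explicit model version and a minimum cached size
cache <- genai$caching$CachedContent$create(
  model = "models/gemini-1.5-flash-001",
  display_name = "amr-guidance",
  system_instruction = "You are a question and answer assistant. Answer from the cached document.",
  contents = list(doc_text),
  ttl = datetime$timedelta(hours = 1L)
)

# Build a model from the cached content; later calls only pay full price for the new question tokens
cached_llm <- genai$GenerativeModel$from_cached_content(cached_content = cache)
cached_llm$generate_content("What is the preferred treatment of CRE?")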