
Using LLMs to Query PubMed Knowledge Bases for BioMedical Research | by Jillian Rowe | Jul, 2024


AI for fun and profit!


In this article, we’ll explore how to leverage large language models (LLMs) to search and retrieve scientific papers from the PubMed Open Access Subset, a free resource for accessing biomedical and life sciences literature. We’ll use Retrieval-Augmented Generation (RAG) to search our digital library.

AWS Bedrock will act as our AI backend, PostgreSQL will serve as the vector database for storing embeddings, and the LangChain library in Python will ingest papers and query the knowledge base.

If you only care about the results generated by querying the knowledge base, skip down to the end.

The specific use case we’ll be focusing on is querying papers related to Rheumatoid Arthritis, a chronic inflammatory disorder affecting joints. We’ll use the query ((rheumatoid arthritis) AND gene) AND cell to retrieve around 10,000 relevant papers from PubMed and then sample that down to approximately 5,000 papers for our knowledge base.
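If you want to reproduce the down-sampling step, a minimal sketch is simply taking a random subset of the downloaded article paths; here `all_articles` is a placeholder for whatever file list you end up building:

import random

# Hypothetical down-sampling step: keep roughly half of the ~10,000 articles.
# `all_articles` is a placeholder for the list of article file paths you collected.
random.seed(42)  # make the sample reproducible
sampled_articles = random.sample(all_articles, k=len(all_articles) // 2)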

Not all research articles or sources have licensing that allows for ingesting with AI!

I’m not including all the source code because the AI libraries change so frequently and because there are oodles of ways to configure a knowledge base backend, but I have included some helper functions so you can follow along.

To make it easier for the LLM to process and understand the textual data from the research papers, we’ll convert the text into numerical embeddings, which are dense vector representations of the text. These embeddings will be stored in a PostgreSQL database using the PGVector library. This step essentially simplifies the text data into a format that the LLM can more easily work with.
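As a quick illustration of what an embedding is, here’s a minimal sketch (assuming your AWS credentials are already configured) that embeds a single sentence with the same Titan model used later in this article and prints the vector’s dimensionality:

from langchain_community.embeddings import BedrockEmbeddings

# Minimal sketch: one string in, one dense vector of floats out.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
vector = embeddings.embed_query("Rheumatoid arthritis is a chronic inflammatory disorder.")
print(len(vector))  # dimensionality of the Titan text embedding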

I’m running a local PostgreSQL database, which is fine for my datasets. Hosting AWS Bedrock Knowledge Bases can get expensive, and I’m not trying to run up my AWS bill this month. It’s summer, and I have kids’ camp to pay for!

AWS Bedrock is a managed service provided by Amazon Web Services (AWS), allowing you to easily deploy and operate large language models. In our setup, Bedrock will host the LLM that we’ll use to query and retrieve relevant information from our knowledge base of research papers.

LangChain is a Python library that simplifies building applications with large language models. We’ll use LangChain to load our research papers and their associated embeddings into a knowledge base and then query this knowledge base using the LLM hosted on AWS Bedrock.

While this setup can work with research papers from any source, we’re using PubMed because it’s a convenient source for acquiring a large volume of papers based on specific search queries. We’ll use the PubGet tool to retrieve the initial set of 10,000 papers matching our query on Rheumatoid Arthritis, genes, and cells. Behind the scenes, pubget fetches the full-text articles from PubMed Central.

pubget run -q "((rheumatoid arthritis) AND gene) AND cell" \
pubget_data

This will get us articles in XML format.

Beyond the technical aspects, this article will focus on how to structure and organize your dataset of research papers effectively.

  1. Dataset Organization: Managing your datasets at a global level using collections.
  2. Metadata Management: Handling and incorporating metadata associated with the papers, such as author information, publication dates, and keywords.

You’ll want to think about this upfront. When using LangChain, you query datasets based on their collections. Each collection has a name and a unique identifier.

When you load your data, whether it’s PDF papers, XML downloads, Markdown files, codebases, PowerPoint slides, text documents, etc., you can attach additional metadata. You can later use this metadata to filter your results. The metadata is an open dictionary, and you can add tags, source, phenotype, or anything else you think may be relevant.
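As a sketch of what that looks like in practice, here’s a hypothetical document chunk with custom metadata and a metadata-filtered similarity search against the PGVector store we build later in the article (the keys phenotype and tags are arbitrary examples, not required fields):

from langchain_core.documents import Document

# Hypothetical chunk with custom metadata attached at load time.
doc = Document(
    page_content="T cell-derived cytokines drive synovial inflammation ...",
    metadata={"source": "article.xml", "phenotype": "rheumatoid arthritis", "tags": ["cytokine"]},
)

# Later, restrict a similarity search to documents matching that metadata.
results = vectorstore.similarity_search(
    "cytokines in rheumatoid arthritis",
    k=5,
    filter={"phenotype": {"$eq": "rheumatoid arthritis"}},
)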

The article will also cover best practices for loading your preprocessed and structured dataset into the knowledge base and provide examples of how to query the knowledge base effectively using the LLM hosted on AWS Bedrock.

By the end of this article, you should have a solid understanding of how to leverage LLMs to search and retrieve relevant information from a large corpus of research papers, as well as strategies for structuring and organizing your dataset to optimize the performance and accuracy of your knowledge base.

import boto3
import pprint
import os
import json
import hashlib
import logging
import funcy
import glob
from typing import Dict, Any, TypedDict, List
from langchain.llms.bedrock import Bedrock
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain_core.documents import Document
from langchain_aws import ChatBedrock
from langchain_community.embeddings import BedrockEmbeddings  # to create embeddings for the documents
from langchain_experimental.text_splitter import SemanticChunker  # to split documents into smaller chunks
from langchain_text_splitters import CharacterTextSplitter
from langchain_postgres import PGVector
from pydantic import BaseModel, Field
from langchain_community.document_loaders import (
    WebBaseLoader,
    TextLoader,
    PyPDFLoader,
    CSVLoader,
    Docx2txtLoader,
    UnstructuredEPubLoader,
    UnstructuredMarkdownLoader,
    UnstructuredXMLLoader,
    UnstructuredRSTLoader,
    UnstructuredExcelLoader,
    DataFrameLoader,
)
import psycopg
import uuid

I’m running a local Supabase PostgreSQL database using their docker-compose setup. In a production setup, I’d recommend using a real database, like AWS AuroraDB or Supabase running someplace besides your laptop. Also, change your password to something besides password.

I didn’t notice any performance difference for smaller datasets between an AWS-hosted knowledge base and my laptop, but your mileage may vary.

# Placeholder connection details for the local Supabase/PostgreSQL instance -- adjust for your setup
user, password, host, port, database = "postgres", "password", "localhost", 5432, "postgres"
connection = f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"
# Establish the connection to the database
conn = psycopg.connect(
    conninfo=f"postgresql://{user}:{password}@{host}:{port}/{database}"
)
# Create a cursor to run queries
cur = conn.cursor()

We’re using AWS Bedrock as our AI knowledge base backend. Most of the companies I work with have some kind of proprietary data, and Bedrock guarantees that your data will remain private. You could use any of the AI backends here.

os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
bedrock_client = boto3.client("bedrock-runtime")
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",client=bedrock_client)
bedrock_embeddings_image = BedrockEmbeddings(model_id="amazon.titan-embed-image-v1",client=bedrock_client)
llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_client)
# function to create vector store
# make sure to update this if you change collections!
def create_vectorstore(embeddings, collection_name, conn):
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name=collection_name,
        connection=conn,
        use_jsonb=True,
    )
    return vectorstore

def load_and_split_pdf_semantic(file_path, embeddings):
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    return pages

def load_xml(file_path, embeddings):
    loader = UnstructuredXMLLoader(
        file_path,
    )
    docs = loader.load_and_split()
    return docs

def insert_embeddings(files, bedrock_embeddings, vectorstore):
    logging.info(f"Inserting {len(files)}")
    x = 1
    y = len(files)
    for file_path in files:
        logging.info(f"Splitting {file_path} {x}/{y}")
        docs = []
        if '.pdf' in file_path:
            try:
                with funcy.print_durations('process pdf'):
                    docs = load_and_split_pdf_semantic(file_path, bedrock_embeddings)
            except Exception as e:
                logging.warning(f"Error loading docs")
        if '.xml' in file_path:
            try:
                with funcy.print_durations('process xml'):
                    docs = load_xml(file_path, bedrock_embeddings)
            except Exception as e:
                logging.warning(e)
                logging.warning(f"Error loading docs")
        filtered_docs = []
        for d in docs:
            if len(d.page_content):
                filtered_docs.append(d)
        # Add documents to the vectorstore
        ids = []
        for d in filtered_docs:
            ids.append(
                hashlib.sha256(d.page_content.encode()).hexdigest()
            )

        if len(filtered_docs):
            texts = [i.page_content for i in filtered_docs]
            # metadata is a dictionary. You can add to it!
            metadatas = [i.metadata for i in filtered_docs]
            # logging.info(f"Adding N: {len(filtered_docs)}")
            try:
                with funcy.print_durations('load psql'):
                    vectorstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
            except Exception as e:
                logging.warning(e)
                logging.warning(f"Error {x - 1}/{y}")
        # logging.info(f"Complete {x}/{y}")
        x = x + 1

collection_name_text = "MY_COLLECTION" #pubmed, smiles, etc
vectorstore = create_vectorstore(bedrock_embeddings,collection_name_text,connection)

Most of our data was fetched using the pubget tool, and the articles are in XML format. We’ll use the LangChain XML loader to process and split the articles, then load the embeddings into the vector store.

files = glob.glob("/home/jovyan/data/pubget_ra/pubget_data/*/articles/*/*/article.xml")
#I ran this previously
insert_embeddings(files[0:2], bedrock_embeddings, vectorstore)

PDFs are easier to read, and I grabbed some for doing QA against the knowledge base.

files = glob.glob("/home/jovyan/data/pubget_ra/papers/*pdf")
insert_embeddings(files[0:2], bedrock_embeddings, vectorstore)

Now that we have our knowledge base set up, we can use Retrieval-Augmented Generation (RAG) methods to run queries against it with the LLM.

Our queries are:

  • Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles
  • Tell me about single-cell research in rheumatoid arthritis.
  • Tell me about protein-protein associations in rheumatoid arthritis.
  • Tell me about the findings of GWAS studies in rheumatoid arthritis.
import hashlib
import logging
import os
from typing import Optional, List, Dict, Any
import glob
import boto3
from toolz.itertoolz import partition_all
import json
import funcy
import psycopg
from IPython.display import Markdown, display
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import PromptTemplate
from langchain.retrievers.bedrock import (
    AmazonKnowledgeBasesRetriever,
    RetrievalConfig,
    VectorSearchConfig,
)
from aws_bedrock_utilities.models.base import BedrockBase, RAGResults
from aws_bedrock_utilities.models.pgvector_knowledgebase import BedrockPGWrapper
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from pprint import pprint
import time
from rich.logging import RichHandler

I don’t list it here, but I’ll always do some QA against my knowledge base. Choose an article, parse out the summary or findings, and ask the LLM about it. You should get your article back.

You’ll need to first have the collection name you’re querying along with your queries.

I always recommend running a few QA queries. Ask the obvious questions in several different ways.
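One way to run that sanity check, sketched against the PGVector store built earlier: take a sentence you know appears in one of your ingested articles and confirm the retriever surfaces that article before you start trusting the LLM’s answers.

# Paste a sentence you know is in one of your ingested articles.
known_claim = "..."  # e.g. a line from an article's abstract
hits = vectorstore.similarity_search(known_claim, k=3)
for hit in hits:
    print(hit.metadata.get("source"), hit.page_content[:120])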

You’ll also want to adjust the MAX_DOCS_RETURNED based on your time constraints and how many articles are in your knowledge base. The retriever will keep pulling documents until it hits that maximum and then stop, so you’ll need to increase that number for an exhaustive search.

# Make sure to keep the collection name consistent!
COLLECTION_NAME = "MY_COLLECTION"
MAX_DOCS_RETURNED = 50
p = BedrockPGWrapper(collection_name=COLLECTION_NAME)
# model = "anthropic.claude-3-sonnet-20240229-v1:0"
model = "anthropic.claude-3-haiku-20240307-v1:0"
queries = [
    "Tell me about T cell–derived cytokines in relation to rheumatoid arthritis and provide citations and article titles",
    "Tell me about single-cell research in rheumatoid arthritis.",
    "Tell me about protein-protein associations in rheumatoid arthritis.",
    "Tell me about the findings of GWAS studies in rheumatoid arthritis.",
]

ai_responses = []
for query in queries:
    answer = p.run_kb_chat(
        query=query,
        collection_name=COLLECTION_NAME,
        model_id=model,
        search_kwargs={'k': MAX_DOCS_RETURNED, 'fetch_k': 50000},
    )
    ai_responses.append(answer)
    time.sleep(1)

for answer in ai_responses:
    t = Markdown(f"""
### Query
{answer['query']}

### Response
{answer['result']}
""")
    display(t)

We’ve built our knowledge base, run some queries, and now we’re ready to look at the results the LLM generated for us.

Each result is a dictionary with the original query, the response, and the relevant snippets of the source document.
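For example, you can pull the pieces out of a single result like this (a sketch; the exact key holding the source snippets depends on your aws_bedrock_utilities version, so check answer.keys() on your install):

answer = ai_responses[0]
print(answer["query"])   # the original question
print(answer["result"])  # the LLM's response
# 'source_documents' is an assumption -- inspect answer.keys() to confirm the key name.
for doc in answer.get("source_documents", []):
    print(doc.metadata.get("source"), doc.page_content[:200])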


