Topic Modeling Open-Source Research with the OpenAlex API | by Alex Davis | Jul, 2024


While we ingest the data from the API, we will apply some criteria. First, we will only ingest documents published between 2016 and 2022. We want fairly recent language, as the terms and taxonomy of certain subjects can change over long periods of time.

We will also add key terms and conduct multiple searches. While normally we would likely ingest random subject areas, we will use key terms to narrow our search. This way, we will have an idea of how many high-level topics we have, and can compare that to the output of the model. Below, we create a function where we can add key terms and conduct searches through the API.

import pandas as pd
import requests

def import_data(pages, start_year, end_year, search_terms):
    """
    This function uses the OpenAlex API to conduct a search on works and returns a dataframe with the associated works.

    Inputs:
        - pages: int, number of pages to loop through
        - search_terms: str, keywords to search for (must be formatted according to OpenAlex standards)
        - start_year and end_year: int, years to set as a range for filtering works
    """

    #collect one dataframe per page of results
    page_frames = []

    #pages are numbered from 1, so include the final page in the range
    for page in range(1, pages + 1):

        #use parameters to conduct request and format to a dataframe
        response = requests.get(
            'https://api.openalex.org/works',
            params={'page': page, 'per-page': 200,
                    'filter': f'publication_year:{start_year}-{end_year},type:article',
                    'search': search_terms})
        response.raise_for_status()
        page_frames.append(pd.DataFrame(response.json()['results']))

    #combine all pages into a single dataframe
    search_results = pd.concat(page_frames, ignore_index=True)

    #subset to relevant features
    search_results = search_results[["id", "title", "display_name", "publication_year", "publication_date",
                                     "type", "countries_distinct_count", "institutions_distinct_count",
                                     "has_fulltext", "cited_by_count", "keywords", "referenced_works_count",
                                     "abstract_inverted_index"]]

    return search_results
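For reference, the query string sent on each pass of the loop can be assembled and inspected offline. The sketch below uses only the standard library; `build_works_url` is a hypothetical helper, not part of the original code, and it simply mirrors the `page`, `per-page`, `filter`, and `search` parameters used above, with the search terms URL-encoded.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.openalex.org/works"

def build_works_url(page, start_year, end_year, search_terms, per_page=200):
    """Hypothetical helper: build the request URL for one page of results."""
    params = {
        "page": page,
        "per-page": per_page,
        "filter": f"publication_year:{start_year}-{end_year},type:article",
        "search": search_terms,
    }
    # urlencode percent-encodes spaces, quotes, and other reserved characters
    return f"{BASE_URL}?{urlencode(params)}"

url = build_works_url(1, 2016, 2022, "'neural net' OR drone")

# decode the query string again to confirm nothing was lost in encoding
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["filter"][0])
```

Printing the URL before running a long loop is a cheap way to check that the year range and search terms are what you intended.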

We conduct 5 different searches, each covering a different technology area. These technology areas are inspired by the DoD “Critical Technology Areas”. See more here:

Here is an example of a search using the required OpenAlex syntax:

#search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")
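The individual search results can then be stacked and de-duplicated. A minimal sketch, assuming each search returns a dataframe with an `id` column as in `import_data`; the toy frames and the `combine_searches` helper are illustrative and not from the original code.

```python
import pandas as pd

def combine_searches(frames):
    """Stack several search results and keep one row per OpenAlex work id."""
    combined = pd.concat(frames, ignore_index=True)
    # a work matched by more than one search appears once per search, so dedupe on id
    return combined.drop_duplicates(subset="id").reset_index(drop=True)

# toy stand-ins for two search results that share one work (W2)
ai_demo = pd.DataFrame({"id": ["W1", "W2"], "title": ["a", "b"]})
space_demo = pd.DataFrame({"id": ["W2", "W3"], "title": ["b", "c"]})

corpus = combine_searches([ai_demo, space_demo])
print(corpus["id"].tolist())
```

De-duplicating on `id` also avoids hashing columns that hold lists or dictionaries, such as `keywords` and `abstract_inverted_index`, which a plain `drop_duplicates()` over all columns cannot handle.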

After compiling our searches and dropping duplicate documents, we must clean the data to prepare it for our topic model. There are 2 main issues with our current output.

  1. The abstracts are returned as an inverted index (due to legal reasons). However, we can use it to reconstruct the original text.
  2. Once we obtain the original text, it will be raw and unprocessed, creating noise and hurting our model. We will conduct traditional NLP preprocessing to get it ready for the model.
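To make the format concrete: an inverted index maps each word to the list of positions where it occurs in the text. The toy sketch below builds one from a short sentence; `build_inverted_index` is a hypothetical helper written for illustration and is not part of the OpenAlex API.

```python
def build_inverted_index(text):
    """Map each whitespace-separated word to the positions where it occurs."""
    index = {}
    for position, word in enumerate(text.split()):
        index.setdefault(word, []).append(position)
    return index

example = build_inverted_index("the model learns the topics")
print(example)
# {'the': [0, 3], 'model': [1], 'learns': [2], 'topics': [4]}
```

Recovering the text means reversing this mapping: place every word back at each of its positions and read the positions in order.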

Below is a function to return original text from an inverted index.

def undo_inverted_index(inverted_index):
    """
    The purpose of the function is to 'undo' an inverted index. It takes an inverted index as input and
    returns the original string.
    """

    #create an empty list to store [word, position] pairs
    word_index = []

    #loop through the index and record each word at every position where it occurs
    for k, v in inverted_index.items():
        for index in v:
            word_index.append([k, index])

    #sort by the position
    word_index = sorted(word_index, key=lambda x: x[1])

    #keep only the words and join them into a single string
    words_unindexed = ' '.join(pair[0] for pair in word_index)

    return words_unindexed
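Some works may come back without an abstract, in which case calling `.items()` on the `abstract_inverted_index` value would fail. The sketch below is one way to guard against that when applying the function to a column. It assumes missing abstracts appear as `None`, and `rebuild_abstract` is a compact stand-in for `undo_inverted_index` so that the example runs on its own.

```python
import pandas as pd

def rebuild_abstract(inverted_index):
    """Stand-in for undo_inverted_index that returns None for missing abstracts."""
    if not isinstance(inverted_index, dict):
        return None
    # flatten to (position, word) pairs, then sort by position
    pairs = [(pos, word) for word, positions in inverted_index.items() for pos in positions]
    return ' '.join(word for _, word in sorted(pairs))

# toy data: one work with an abstract, one without
demo = pd.DataFrame({
    "id": ["W1", "W2"],
    "abstract_inverted_index": [
        {"the": [0, 3], "model": [1], "learns": [2], "topics": [4]},
        None,
    ],
})

demo["abstract"] = demo["abstract_inverted_index"].map(rebuild_abstract)
demo = demo.dropna(subset=["abstract"])  # drop works with no text to model
print(demo["abstract"].tolist())
```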

Now that we have the raw text, we can conduct our traditional preprocessing steps, such as standardization, removing stop words, lemmatization, etc. Below are functions that can be mapped to a list or series of documents.

import re
import nltk

#note: the NLTK calls in this section need the tokenizer, stopword, and WordNet data installed (via nltk.download)

def preprocess(text):
    """
    This function takes in a string, converts it to lowercase, cleans
    it (removes special characters and numbers), and tokenizes it.
    """

    #convert to lowercase
    text = text.lower()

    #remove digits, then special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)

    #tokenize
    tokens = nltk.word_tokenize(text)

    return tokens
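The effect of the two substitutions can be traced on a single string. This sketch uses `str.split` as a stand-in for `nltk.word_tokenize` so that it runs without NLTK data; that substitution is an assumption made for the example, and the two tokenizers can differ on other inputs.

```python
import re

def preprocess_demo(text):
    """Mirror the cleaning steps of preprocess, with a whitespace tokenizer."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)       # drop runs of digits
    text = re.sub(r'[^\w\s]', '', text)   # drop anything that is not a word character or whitespace
    return text.split()                   # stand-in for nltk.word_tokenize

tokens = preprocess_demo("GPT-3 reaches 95% accuracy!")
print(tokens)  # ['gpt', 'reaches', 'accuracy']
```

Note that hyphenated terms are fused rather than split ("state-of-the-art" becomes "stateoftheart"), and underscores survive because `\w` matches them.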

def remove_stopwords(tokens):
    """
    This function takes in a list of tokens (from the 'preprocess' function) and
    removes a list of stopwords. Custom stopwords can be added to the 'custom_stopwords' list.
    """

    #set default and custom stopwords
    stop_words = nltk.corpus.stopwords.words('english')
    custom_stopwords = []
    stop_words.extend(custom_stopwords)

    #filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens
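The `custom_stopwords` list is a natural place for terms that appear in nearly every abstract and carry little topical signal. A small sketch, with a hard-coded base list standing in for NLTK's English stopwords; both word lists here are illustrative assumptions, not the author's choices.

```python
def remove_stopwords_demo(tokens, custom_stopwords=()):
    """Filter tokens against a small base list plus optional custom stopwords."""
    base_stopwords = {"the", "a", "of", "in", "we", "this"}  # stand-in for NLTK's list
    stop_words = base_stopwords | set(custom_stopwords)      # a set gives fast membership checks
    return [word for word in tokens if word not in stop_words]

tokens = ["in", "this", "paper", "we", "study", "drone", "navigation"]

default_only = remove_stopwords_demo(tokens)
with_custom = remove_stopwords_demo(tokens, custom_stopwords=["paper", "study"])
print(default_only)  # ['paper', 'study', 'drone', 'navigation']
print(with_custom)   # ['drone', 'navigation']
```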

def lemmatize(tokens):
    """
    This function conducts lemmatization on a list of tokens (from the 'remove_stopwords' function).
    This reduces each word to its root form to improve modeling results.
    """

    #initialize lemmatizer and lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return lemmatized_tokens

def clean_text(text):
    """
    This function uses the previously defined functions to take a string and
    run it through the entire data preprocessing process.
    """

    #clean, tokenize, remove stopwords from, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)

    #rejoin the tokens (named to avoid shadowing the function itself)
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text

Now that we have a preprocessed series of documents, we can create our first topic model!


