A very interesting business application of text classification is sentiment analysis. It is a method to automatically understand the perception of customers towards a product or service based on their comments. The input text is classified into positive, negative, and in some situations, neutral. It is extensively used by companies to track user behavior on social media. Sentiment analysis can strongly influence the marketing strategy of a company, improving the customer experience and defining the advertising roadmap[1].
The evaluation of drug aspects (e.g. side effects, dosage, efficacy) relies heavily on randomized controlled trials with rigorous inclusion and exclusion criteria. However, such trials enrol only a small number of individuals who meet possibly restrictive eligibility criteria, limiting population representativeness and, in turn, the generalizability of the study.
The ramifications of these limitations include potential overestimation of the efficacy of the product and misidentification of adverse events/side effects in the wider, more diverse population, not to mention the heavy costs and time involved. To counter such issues, approaches such as post-marketing drug surveillance have been introduced to monitor the safety of a drug after its regulatory approval and mass production, e.g. by government regulators such as the FDA (US) or the MHRA (UK), or by public/private organisations that monitor drug side effects. Existing methods for identifying adverse events typically focus on analysing molecular drug composition, query logs, VAERS (Vaccine Adverse Event Reporting System) records, or clinical notes in medical records. However, sentiment and user reviews from consumers have rarely been taken into account.
Yet publicly available information on the Internet offers an easily attainable resource that can be leveraged to gain a deeper understanding of how users perceive the drugs they take. Full user reviews are available on drug review websites, where users comment on their personal experience with the drugs they have taken for a specific condition.
Unlike many other forms of medical data, this information is not filtered through medical professionals. Since these reviews are posted by anonymous users, there is no risk of breaching patient health record confidentiality. The reviews contain a plethora of information about individual experiences with the drugs, such as symptoms, adverse events, and interactions with other drugs. They also contain a great deal of user sentiment related to a particular condition, which can be leveraged to detect the side effects and efficacy of drugs[2].
The insights gained from analysing public drug user reviews can inform strategies for better care. Automatic analysis of patient posts on forums and social media has received attention in recent years as a direct source that can help in understanding patients, enhancing the quality of care and increasing patient satisfaction. Previously, as stated above, we had to rely on governing bodies and trials for feedback on drugs[3].
The application of the proposed sentiment analysis approach will be useful not only for patients, but also for drug makers and clinicians to obtain valuable summaries of public opinion. Since sentiment analysis is domain-specific, domain knowledge in drug reviews can be incorporated into the sentiment analysis algorithm to provide more accurate analysis. For example, MetaMap can be used to map various health and medical terms (such as disease and drug names) to semantic types in the Unified Medical Language System (UMLS) Semantic Network[4].
Sentiment analysis is the process of measuring automatically the type of opinion, i.e. positive, negative or neutral, expressed in text.
Thus, our objective is to:
Analyse sentiment in this drug user review dataset using a supervised binary text classifier, which will classify user reviews as positive or negative. Ultimately, this can help us predict the sentiment concerning overall satisfaction with these drugs, which in turn can provide valuable insights and help with decision making.
Here, in Part 2, we'll utilize two large language models, BERT and XLNet.
The results (metrics obtained from our classification approach) will determine which text classifier/model works best, or more specifically, proves the most accurate on our chosen dataset.
Even though LLMs don't require pre-processing (refer to the summary and conclusion section below), we're still going to clean up our text for the data exploration and analytics steps and for demonstration purposes.
The dataset was retrieved from Kaggle [5], a platform that allows users to share datasets and explore/build upon them (license or rules permitting). The Drug Review Dataset is taken from the UCI Machine Learning Repository. It provides over 200,000 patient drug reviews on specific drugs along with related conditions and a 10-star patient rating reflecting overall patient satisfaction. The data was scraped from online pharmaceutical review sites.
The license is listed as 'other'.
The data is split into a train (70%) and a test (30%) partition (the train set consists of 161,297 samples and the test set of 53,766 samples of drug reviews), stored in two .csv (comma-separated values) files, respectively.
The structure of the data is as follows: a patient with a unique ID purchases a drug that meets their condition and writes a review and rating for that drug on a given date. Afterwards, if other users read the review and find it helpful, they mark it as useful, which increments the usefulCount variable by one.
The size of the training file is 83MB and the associated test set is 27.6MB, hence a total size of 110.6MB. There are 14 columns, 6 of which are string type, 4 are integer type, 2 are DateTime format and 2 are 'other'.
The dataset was originally published on the UCI Machine Learning Repository: Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. "Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning." In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125.
The file format is .csv, which is a delimited text file, that uses commas to separate values.
To summarize, this dataset has a good usability score of 8.8: it is easy to interpret and includes all the relevant metadata. Hence, after some research, it was evident that this is a good-quality dataset for the topic of text classification in the healthcare arena, more specifically sentiment analysis of drug product reviews.
Techniques, insights, findings, rationale and caveats behind the code are presented with Python comments, docstrings and individual summaries below:
Given that the review rating is in the range of 1-10, 2 classes were utilised (positive and negative), i.e. this is a supervised binary classification problem. We consider the review to be positive if the rating is higher than 5 (this is a somewhat arbitrary choice, but seems reasonable).
In this case, the metrics will be 1) overall accuracy, 2) F1 per class and 3) macro-averaged F1. We use accuracy (compared against a most-frequent-class baseline) and the F-score for each class, which covers the most probable failure modes [6].
How do we evaluate the classifier's performance? One way of evaluating the performance of an algorithm is to measure accuracy: the percentage of correctly classified examples out of the total number of samples. A useful reference point is the accuracy obtained by always predicting the most frequent class, as sketched below.
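A minimal sketch of that baseline, assuming a binary label column such as the 'sentiment' column derived from the ratings later in this notebook:
# accuracy of a most-frequent-class baseline (the 'sentiment' column is only created further below)
import pandas as pd

def baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    majority_class = labels.value_counts().idxmax()
    return (labels == majority_class).mean()

# example usage, once the 'sentiment' column exists:
# print(f"Most-frequent-class baseline accuracy: {baseline_accuracy(data['sentiment']):.3f}")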
F1 per class (F1-score for a binary classifier): we would like to summarize the model's performance on each class in a single metric. That's where the F1-score is used. It is a way to combine precision and recall into a single number. The F1-score is computed using a mean ("average"), but not the usual arithmetic mean: it uses the harmonic mean, which is given by this simple formula:
F1-score = 2 × (precision × recall)/(precision + recall)
Like the arithmetic mean, the F1-score will always lie somewhere between precision and recall. But it behaves differently: the F1-score gives a larger weight to the lower number. For example, when precision is 100% and recall is 0%, the F1-score will be 0%, not 50%. Or say that Classifier A has precision = recall = 80%, and Classifier B has precision = 60% and recall = 100%. Arithmetically, the mean of precision and recall is the same for both models, but with the harmonic mean the F1-score for Classifier A is 80% while for Classifier B it is only 75%: Classifier B's low precision pulls its F1-score down.
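A quick sanity check of these numbers, using nothing beyond the formula above:
# harmonic-mean F1 from precision and recall
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.80, 0.80))  # Classifier A -> 0.80
print(f1(0.60, 1.00))  # Classifier B -> 0.75
print(f1(1.00, 0.00))  # precision 100%, recall 0% -> 0.0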
As the F1-score gives equal weighting to precision and recall, it is well suited to our binary classifier.
F1 macro average
The next step is combining the per-class F1-scores into a single number: the classifier's overall F1-score. There are a few ways of doing that; the simplest is an arithmetic mean of the per-class F1-scores, called the macro-averaged F1-score, or macro-F1 for short.
Macro F1-score will give the same importance to each label/class. It will be low for models that only perform well on the common classes while performing poorly on the rare classes.
The Macro F1-score is defined as the mean of class-wise/label-wise F1-scores:
$$\text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \text{F1}_i$$
where N is the number of classes. The metrics are computed with scikit-learn (sklearn), the most popular machine learning package, which provides the sklearn.metrics.f1_score function; passing average='macro' computes the macro-F1, as shown below.
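A small illustration with toy labels (for demonstration only):
# per-class F1 and macro-averaged F1 with scikit-learn
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]   # toy ground-truth labels, for illustration only
y_pred = [1, 0, 0, 1, 0, 1]   # toy predictions

print(f1_score(y_true, y_pred, average=None))     # F1-score per class
print(f1_score(y_true, y_pred, average='macro'))  # macro-averaged F1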
Finally, the F1-score is possibly the most common metric used on imbalanced classification problems, such as the one posed by our dataset [7].
#!pip install -r requirements.txt
# importing libraries
import pandas as pd, numpy as np
import statistics
# to ignore any warning messages.
import warnings
warnings.filterwarnings('ignore')
# for visualization
import matplotlib
import matplotlib.pyplot as plt, seaborn as sb
import plotly.express as px
import wordcloud
# for unit testing
#from unittest.mock import patch, Mock
#import unittest
#for machine learning
import scipy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#for text preprocessing
import re
import nltk
import nltk.stem as Stemmer
import string
from nltk.corpus import stopwords
nltk.download('stopwords')
#Stop words present in the nltk library
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
from nltk.tokenize import word_tokenize
#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer
from nltk.stem import porter
from nltk.stem import WordNetLemmatizer
Stemmer=porter.PorterStemmer()
import matplotlib.pyplot as plt
pd.options.display.max_colwidth = 200
%matplotlib inline
import os
#for language detection
#import langdetect
#for sentiment. only used when no training data
#from textblob import TextBlob
#for progress visualization
from tqdm import tqdm
tqdm.pandas()
#for vectorizer
from sklearn import feature_extraction, manifold
#for word embedding
import gensim.downloader as gensim_api
#for topic modeling
import gensim
#dealing with date feature
from datetime import datetime
# for error handling
from traceback import format_exc
#import datasets via dropbox storage
!wget https://www.dropbox.com/s/wd35fbl314kcv6f/drugsComTest_raw.csv?dl=1
!wget https://www.dropbox.com/s/a7n2c8wtm9hu72w/drugsComTrain_raw.csv?dl=1
#first, check we're using the right file type/encoder i.e. it should be UTF-8
#this ensures, it's relatively clean, and can be read and organised (as opposed to some other unusable formats)
import pandas as pd
import csv
train = open('drugsComTrain_raw.csv?dl=1', 'r')
test = open('drugsComTest_raw.csv?dl=1', 'r')
train
#we have the correct file encoder, as stated in the above output
#now, import the datasets and read the training set
#import csv
#train=pd.read_csv('drugsComTrain_raw.csv')#no sentiment lexicons required (manually curated wordlists),
#as we have plenty of training data here - scraped from drug user review websites
#test=pd.read_csv('drugsComTest_raw.csv')
import csv
train=pd.read_csv('drugsComTrain_raw.csv?dl=1')
test=pd.read_csv('drugsComTest_raw.csv?dl=1')
pd.concat([train.head(), train.tail()])  # view the first and last rows (DataFrame.append is deprecated in recent pandas)
#as both datasets contain the same columns, we can then combine them for efficient preprocessing and
#better analysis
data = pd.concat([train, test])
data.head()
#Let's check the number of rows and columns
data.shape
#Let's check for missing values (Nan)
#From above output, we have 7 columns
data.isnull().sum()
#let's create a function for displaying null values and data types [7]
def finding_null_value(data):
    total_null = data.isnull().sum()
    total_percent = (data.isnull().sum()/data.isnull().count()*100)
    new_var = pd.concat([total_null, total_percent], axis=1, keys=['Total_null', 'Total_percent(%)'])
    types_array = []
    for column in data.columns:
        dtype = str(data[column].dtype)
        types_array.append(dtype)
    new_var['Types'] = types_array
    return np.transpose(new_var)
#The data type function displays:
# 1. Total null values
# 2. Total percentage
# 3. Display types of every feature (don't need to use Dtypes command, however, we'll still demonstrate below)
finding_null_value(data)
# calculating the number of rows dropped due to null conditions
#'condition' is a critical feature and a string, hence we cannot replace with a mean or frequency
# we must then drop these rows where 'condition' is a null/NaN value
data.dropna(subset=["condition"], axis=0, inplace=True)
# reset index, because we dropped 1194 rows
# resetting the index to avoid errors, if accessing rows by their indexes
data.reset_index(drop=True, inplace=True)
data
#let's check data types
#This gives us an extra step to check that there are no mistakes or unexpected values in our data.
#This extra step is very useful for plotting as well, because if we plot an ordinal variable, Pandas and
#Matplotlib will obey this natural ordering of the values, whereas if we didn't do that, the visualization would
#tend to be sorted alphabetically, which can make things very confusing. This early stage of pre-processing
#definitely gives us some benefits when it comes to visualization[7].
data.dtypes
#let's first modify drugName, condition and review to category
#Set the nominal (non-ordered categorical) data types from object to category type
data['drugName']=data['drugName'].astype('category')
data['condition']=data['condition'].astype('category')
data['review']=data['review'].astype('category')
data.dtypes
#then modify date object to string
data['date']=data['date'].astype('string')
#We can see below, the specified objects that were once object, have now been transformed into the correct category
data.info()
#quick and dirty summary statistics
data.describe(include='all')
#perform deeper EDA
#preparing a separate DataFrame for analysis
data_explorer = data.copy()
# dropping some columns (assign the result so usefulCount is actually removed)
data_explorer = data_explorer.drop(columns=['usefulCount'])
# sorting rows in descending order of rating
data_explorer = data_explorer.sort_values('rating', ascending = False)
data_explorer.head()
#ensure all unique ID's reflect actual number of rows from above dataframe (not including missing values)
#this tells us we don't have duplicate patients (unique ID's)
#213,869 is the correct output
#check uniqueID
data_explorer['uniqueID'].nunique()
#plot a bargraph to check top 10 drugnames
#from bargraph below, we can see that:
#Levonorgestrel - Around 4,800 patients are taking this medication, the most popular drug here
#Each of the top 3 drugs has a count of over 3,000 (around 4,000 and above)
#Most of the drugName counts are around 1500 if we look at top 10
plt.figure(figsize=(12,6))
drug_top = data_explorer['drugName'].value_counts(ascending = False).head(10)
plt.bar(drug_top.index,drug_top.values,color='blue')
plt.title('Drug Names Top 10',fontsize = 20)
plt.xticks(rotation=90)
plt.ylabel('count')
plt.show()
#plot a bargraph to check top 10 conditions
#from bargraph below, we can see that:
#Birth control - between 35,000 and 40,000 patients have a birth control condition, the most common condition here
#approximately 3-7 times more common than any of the other conditions
#Most of the conditions for the top 10 are between 5000 - 10,000
plt.figure(figsize=(12,6))
cond_top = data_explorer['condition'].value_counts(ascending = False).head(10)
plt.bar(cond_top.index,cond_top.values,color='red')
plt.title('Conditions Top 10',fontsize = 20)
plt.xticks(rotation=90)
plt.ylabel('count')
plt.show()
#check counts for ratings
#from the output below, the top rating ('10') accounts for the largest share of the counts
ratings_ = data_explorer['rating'].value_counts().sort_values(ascending=False).reset_index().\
rename(columns = {'index' :'rating', 'rating' : 'counts'})
ratings_['percent'] = 100 * (ratings_['counts']/data_explorer.shape[0])
print(ratings_)
#as a percentage, the top rating ('10') accounts for just over 30%, or approximately a third, of the reviews
sb.set(font_scale = 1.2, style = 'darkgrid')
plt.rcParams['figure.figsize'] = [12, 6]
#let's plot and check
sb.barplot(x = ratings_['rating'], y = ratings_['percent'],order = ratings_['rating'], palette='winter')
plt.title('Ratings Percent',fontsize=20)
plt.show()
#lets check the number of drugs per condition
#we can see from the output below, there's 219 drugs linked to treating pain
#however, there are 253 drugs (highest amount) linked to 'not listed/ other' conditions.
#it may be that some users didn't mention their condition, for privacy reasons. We could look up the
#drug names and fill in the conditions for which those drugs are used.
#there is possibly also noise in our dataset, e.g. from web scraping, where values were fed in incorrectly.
data_explorer.groupby('condition')['drugName'].nunique().sort_values(ascending=False).head(10)
#the barplot below shows the top 10 drugs with the '10/10' rating which basically shows us which drugs have received
#majorly positive ratings and reviews
sb.set(font_scale = 1.2, style = 'whitegrid')
plt.rcParams['figure.figsize'] = [15, 8]
rating = dict(data.loc[data.rating == 10, "drugName"].value_counts())
drugname = list(rating.keys())
drug_rating = list(rating.values())
sns_rating = sb.barplot(x = drugname[0:10], y = drug_rating[0:10], palette = 'winter')
sns_rating.set(title = 'Top 10 Drugs with 10/10 Rating - Most Positive Ratings', ylabel = 'Number of Ratings', xlabel = "Drug Names")
plt.setp(sns_rating.get_xticklabels(), rotation=90);
#the barplot below shows the top 10 drugs with the '1/10' rating which basically shows us which drugs have received
#majorly negative ratings and reviews
#analysing the top 10 drugs with both 1/10 and 10/10 ratings:
#3 drugs (Levonorgestrel, Etonogestrel and Ethinyl estradiol/norethindrone) appear in both the bottom 10 and top 10
#ratings. Hence, there are mixed reviews on these 3 particular drugs, which implies that reviews of a drug are not all
#100% positive or negative
sb.set(font_scale = 1.2, style = 'whitegrid')
plt.rcParams['figure.figsize'] = [15, 8]
rating = dict(data.loc[data.rating == 1, "drugName"].value_counts())
drugname = list(rating.keys())
drug_rating = list(rating.values())
sns_rating = sb.barplot(x = drugname[0:10], y = drug_rating[0:10], palette = 'winter')
sns_rating.set(title = 'Top 10 Drugs with 1/10 Rating - Most Negative Ratings', ylabel = 'Number of Ratings', xlabel = "Drug Names")
plt.setp(sns_rating.get_xticklabels(), rotation=90);
Preprocessing Steps [8]:
This is Step 1 in our feature extraction process
The kind of data we receive from customer feedback or user reviews is usually unstructured. It contains noise: unusual text and symbols that need to be cleaned so that a machine learning model can process it. Data cleaning and pre-processing are as important as building any sophisticated machine learning model; the reliability of our model depends heavily on the quality of our data [9].
As stated above, the reviews are in textual format, which is not suitable for machine learning in its current form. In other words, we need to extract numeric features so that we can use them to train and test classifiers; machine learning algorithms learn from a pre-defined set of features in the training data to produce an output for the test data.
As part of this task, we transform the textual description (the review column) into a numeric feature matrix. Simple techniques for doing this include Term Frequency-Inverse Document Frequency (TF-IDF) and Count Vectors (CV), both available in Python through the scikit-learn package. Before TF-IDF can be applied, the textual data needs to be cleaned and prepared; the preprocessing steps applied to the review column follow below.
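For reference, a minimal sketch of the TF-IDF vectorization step itself (scikit-learn); the two example documents are hypothetical, and in this notebook the vectorizer would be fitted on the cleaned review text ('clean_review') produced by the code that follows.
# TF-IDF sketch: turn raw text documents into a sparse numeric feature matrix
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this drug worked well with no side effects",
    "terrible headaches and nausea, stopped taking it",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(docs)          # sparse matrix: one row per document
print(X.shape, len(vectorizer.get_feature_names_out()))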
#increases the quality of generated features
#includes synsets, a collection of synonymous words
import nltk
nltk.download('wordnet')
# function to preprocess a given text by applying several preprocessing steps
import re
wordnet_lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # lower-case the text. One of the most common preprocessing steps: the text is converted into the same case, preferably lower case
    text = text.lower()
    # remove apostrophes, which appear as a repeating pattern in the raw reviews
    text = text.replace("'", "")
    # remove all punctuation symbols from the text
    cleaned_text = "".join([i for i in text if i not in string.punctuation])
    # remove all URLs from the text
    cleaned_text = re.sub(r'http\S+', '', cleaned_text)
    # split the text into separate tokens; each token is a word of the text
    tokens = cleaned_text.split(" ")
    # filter out tokens that are stopwords, as they are not really necessary for model building
    filtered_tokens = [i for i in tokens if i not in stopwords]
    # apply lemmatization to the tokens: it stems each word while making sure it does not lose its meaning
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(word) for word in filtered_tokens]
    # replace all symbols that are not alphanumeric; we consider only alphanumeric symbols
    sc_removed = [re.sub("[^a-zA-Z0-9]", ' ', token) for token in lemmatized_tokens]
    # return the preprocessed text
    return " ".join(sc_removed)
#apply the preprocess_text function to produce the 'clean_review' feature below (progress_apply shows a progress bar)
data['clean_review'] = data['review'].progress_apply(lambda x: preprocess_text(x))
data.head()
#lets look at the feature 'rating', to see if the majority of the customer ratings are positive or negative
#as a quick overview in the output below, the majority of the ratings are a '10' (highest rating)
#color=sb.color_palette()
#%matplotlib inline
#import plotly.offline as py
#py.init_notebook_mode(connected=True)
#import plotly.graph_objs as go
#import plotly.tools as tls
#import plotly.express as px
# Product Scores
#fig=px.histogram(data,x="rating")
#fig.update_traces(marker_color="turquoise", marker_line_color = 'rgb(8,48,107)', marker_line_width=1.5)
#fig.update_layout(title_text='Product Score')
#fig.show()
#create sentiment feature from ratings [17]
#if rating > 5, sentiment = 1 (positive)
#if rating <= 5, sentiment = 0 (negative)
data['sentiment'] = data["rating"].apply(lambda x: 1 if x > 5 else 0)
data.head()
#we've now classified ratings into positive and negative, 1 and 0 respectively
positive=data[data['sentiment']==1]
negative=data[data['sentiment']==0]
#next we will use n-gram tokenization with n=2 to find the most frequently occurring n-grams in the
#review texts of people with both positive and negative reviews
##we explore n-grams, rather than single words, so that we can consider word collocations
def count_ngrams(dataframe, column, begin_ngram, end_ngram):
    word_vectorizer = CountVectorizer(ngram_range=(begin_ngram, end_ngram), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(dataframe[column].dropna())
    frequencies = sum(sparse_matrix).data
    most_common = pd.DataFrame(frequencies,
                               index=word_vectorizer.get_feature_names_out(),
                               columns=['frequency']).sort_values('frequency', ascending=False)
    most_common['ngram'] = most_common.index
    most_common.reset_index()
    return most_common
#to limit memory consumption, we'll first randomly sample 20,000 negative reviews and perform n-gram tokenization
#please note that this is a memory intensive task and might take a lot of time to run.
#we can fine-tune (increase the n size of) the n-gram tokenizer to optimize the accuracy of our models. However,
#this is computationally expensive. Hence, we shall use bigrams (as opposed to trigrams) in this instance.
#we do this because two words together can carry more meaning than single words on their own
#These bigrams could improve the prediction of positive or negative sentiment, over single word format
sample_df = negative.sample(20000)
two_grams = count_ngrams(sample_df,'clean_review', 2, 2)
fig = px.bar(two_grams.sort_values('frequency',ascending=False)[0:10].iloc[::-1],
x="frequency",
y="ngram",
title='Most Common 2-gram words in negative reviews of people',
orientation='h')
fig.show()
#we will do the same for positive reviews
#no real inferences were made with the bigrams
#we may need to fine-tune our n-gram tokenizer with larger n-grams and/or randomly sample more reviews
#to draw firmer conclusions. As this is computationally expensive, we'll stop here.
sample_df = positive.sample(20000)
two_grams = count_ngrams(sample_df,'clean_review', 2, 2)
fig = px.bar(two_grams.sort_values('frequency',ascending=False)[0:10].iloc[::-1],
x="frequency",
y="ngram",
title='Most Common 2-gram words in positive reviews of people',
orientation='h')
fig.show()
# a pie chart to represent the distribution of sentiments of the reviews posted
#our dataset is imbalanced: just under 30% of the reviews are negative.
#this information will be very useful for the modelling part
size = [len(positive), len(negative)]
colors = ['lightblue', 'lightgreen']
labels = "Positive Sentiment","Negative Sentiment"
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors = colors, labels = labels, explode = explode, autopct = '%.2f%%')
plt.axis('off')
plt.title('Pie Chart Representation of Sentiments', fontsize = 25)
plt.legend()
plt.show()
# add number of characters column, for more exploration
data["nb_chars"] = data["clean_review"].apply(lambda x: len(x))
# add number of words column, for more exploration
data["nb_words"] = data["clean_review"].apply(lambda x: len(x.split(" ")))
data
# creating the word cloud for negative reviews
from wordcloud import WordCloud
# Creating the Word Cloud
final_wordcloud = WordCloud(width = 800, height = 800,
background_color ='black',
stopwords = stopwords,
min_font_size = 10).generate(' '.join(data[data['sentiment']==0]['clean_review']))
# Displaying the WordCloud
plt.figure(figsize = (10, 13), facecolor = None)
plt.title('Word Cloud for Negative Reviews')
plt.imshow(final_wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
# Creating the Word Cloud for positive reviews
final_wordcloud = WordCloud(width = 800, height = 800,
background_color ='black',
stopwords = stopwords,
min_font_size = 10).generate(' '.join(data[data['sentiment']==1]['clean_review']))
# Displaying the WordCloud
plt.figure(figsize = (10, 13), facecolor = None)
plt.title('Word Cloud for Positive Reviews')
plt.imshow(final_wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
In Part 1, we utilized traditional machine learning models (mostly ensemble methods) to address our NLP (sentiment analysis) task. We also utilized basic techniques for our text classification approach, i.e. TF-IDF vectorization and bag-of-words.
Here, in Part 2, we introduce LLMs. Most teams and NLP practitioners will not be involved in the pre-training of LLMs, but rather in their fine-tuning and deployment. However, to successfully pick and use a model, it is important to understand what is going on “under the hood”. In this section, we will look at the basic ingredients of an LLM [10]: the training data, the input representation, the pre-training objective, the model architecture and fine-tuning.
Each of these will affect not only the choice, but also the fine-tuning and deployment of our LLM.
Training Data
The quality of the training data has a direct impact on model performance — and also on the required size of the model. If we are smart in preparing the training data, we can improve model quality while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3, but outperforms it on a range of benchmark tasks. Instead of just using any text as training data, it works directly with task formulations, thus making its learning signal much more focussed.
Note: We often hear that language models are trained in an unsupervised manner. While this sounds appealing, it is technically wrong: the training is better described as self-supervised. Well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence, so annotation happens automatically and at scale, enabling the relatively quick progress in the field.
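As a tiny illustration of this self-supervised setup (a sketch only, not part of this notebook's pipeline), the targets for an autoregressive language model are simply the input tokens shifted by one position:
# next-token prediction: the text itself supplies the labels
text_tokens = ["the", "drug", "relieved", "my", "pain"]
inputs  = text_tokens[:-1]   # ["the", "drug", "relieved", "my"]
targets = text_tokens[1:]    # ["drug", "relieved", "my", "pain"]
print(list(zip(inputs, targets)))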
Input Representation
Large Language Models (LLMs) work by taking an input text and repeatedly predicting the next token or word. The input representation in LLMs is typically based on embeddings, which are dense vector representations of words or tokens. These embeddings capture some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
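As a minimal sketch (assuming the bert-base-uncased checkpoint used later in this notebook, and hypothetical variable names), the tokenizer maps text to token IDs and the model's embedding layer maps those IDs to dense vectors; contextual embeddings then come from the full forward pass:
# token IDs -> dense embedding vectors (illustration only)
from transformers import BertTokenizer, BertModel

emb_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
emb_model = BertModel.from_pretrained('bert-base-uncased')

ids = emb_tokenizer("the drug relieved my pain", return_tensors="pt")["input_ids"]
token_embeddings = emb_model.get_input_embeddings()(ids)  # shape: (1, number_of_tokens, 768)
print(token_embeddings.shape)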
Pre-training Objective
As a rule of thumb, the pre-training objective provides an important hint: autoregressive (AR) models perform well on text generation tasks such as conversational AI, question answering and text summarisation, while auto-encoders (AE) (such as BERT) excel at “understanding” and structuring language, for example for sentiment analysis (such as our task) and various information extraction tasks. Models intended for zero-shot learning can theoretically perform all kinds of tasks as long as they receive appropriate prompts — however, their accuracy is generally lower than that of fine-tuned models.
Model Architecture
The basic building blocks of a language model are the encoder and the decoder. The encoder transforms the original input into a high-dimensional algebraic representation, also called a “hidden” vector. The decoder reproduces the hidden representation in an intelligible form such as another language, programming code, an image etc.
Since the introduction of the attention-based Transformer model, traditional recurrence has lost its popularity while the encoder-decoder idea lives on. Most Natural Language Understanding (NLU) tasks rely on the encoder, while Natural Language Generation (NLG) tasks need the decoder and sequence-to-sequence transduction requires both components.
Fine Tuning
NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering and information extraction. This is the time to apply transfer learning and reuse the existing linguistic knowledge for more specific challenges. During fine-tuning, a portion of the model is “frozen” and the rest is further trained with domain- or task-specific data.
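A common pattern, sketched here under the assumption of the BertForSequenceClassification model used later in this notebook (the experiments below actually fine-tune all parameters; freezing the encoder is a cheaper variant):
# freeze the pre-trained encoder and train only the classification head (sketch)
from transformers import BertForSequenceClassification

frozen_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
for param in frozen_model.bert.parameters():   # the pre-trained encoder
    param.requires_grad = False
# frozen_model.classifier stays trainable, so only the head's weights are updated during fine-tuning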
NeuThink is an experimental deep learning library built on top of PyTorch. Neuthink is a research project aimed at exploring how the concept of differential programming can be implemented in the context of the Python language. One definition of differential programming is that it "enables programmers to write program sketches with slots that can be filled with behaviour trained from program input-output data".
Neuthink aims to simplify the construction and usage of simple and moderately complex deep learning models, automating routine operations to allow the developer to focus on the task itself.
Neuthink is not a widely used library, but it has been used in a number of research and commercial projects with good results.
Installing libraries
In the context of NLP, NeuThink can be used to build models for tasks such as sentiment analysis, natural language generation, paraphrasing, and summarization. However, it’s important to note that the choice of library can depend on many factors including the specific task at hand, the available data, computational resources, and personal preference [11].
# This code fetches a copy of the "neuthink" repository from GitHub and stores it on the local system. This
# allows us to work with the code and files from that repository on our own machine
!git clone https://github.com/meanotekai/neuthink
# These libraries serve different purposes - pystemmer is used for text processing and analysis, while reprint
# is used for more controlled and organized output formatting.
!pip install pystemmer
!pip install reprint
# This line of code is providing access to the functionalities defined in the neuthink.metastruct module. This
# means that any classes, functions, or variables defined in that module can now be used in the current Python
# script by referring to them as Struct.
import neuthink.metastruct as Struct
We'll employ a neural bag-of-words baseline approach for our experiments. To facilitate this, we need word embedding data, so we acquire word embeddings of dimension 100 by downloading Word2Vec embeddings. Additional information about Word2Vec embeddings can be found in [12]. A small sketch of the idea follows.
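As an illustration of the neural bag-of-words representation (a sketch only: it averages the word vectors of a review's tokens; the 100-dimensional GloVe vectors from gensim's downloader are used here as a stand-in for the 100-dimensional vectors fetched from Dropbox below):
# average word vectors of a text -> a single fixed-size feature vector
import numpy as np
import gensim.downloader as gensim_api

word_vectors = gensim_api.load("glove-wiki-gigaword-100")   # 100-dimensional vectors (illustrative stand-in)

def nbow_vector(text):
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

print(nbow_vector("this drug worked well").shape)   # (100,)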
# Load train and test set into neuthink data structures
data_nbow = Struct.LoadCSV('/content/drugsComTrain_raw.csv?dl=1').Shuffle()
test_nbow = Struct.LoadCSV('/content/drugsComTest_raw.csv?dl=1')
# This code snippet changes the current working directory to /content/neuthink/wordvectors/ and then downloads two
# files (vectors_en_100.txt and words_en_100.txt) from Dropbox. Finally, it renames these files to remove the
# ?dl=1 query parameter from their names, making them accessible for further processing within the specified
# directory.
%cd /content/neuthink/wordvectors/
!wget https://www.dropbox.com/s/7xuvry8y9k85fne/vectors_en_100.txt?dl=1
!wget https://www.dropbox.com/s/ancaj3976it1rr6/words_en_100.txt?dl=1
!mv ./vectors_en_100.txt?dl=1 vectors_en_100.txt
!mv ./words_en_100.txt?dl=1 words_en_100.txt
# We already converted the 1-10 rating into a 2-class binary classification problem in the data pre-processing and
# data transformation sections above. However, for simplicity, we'll create a new field called "tag". Here we consider
# a review to be positive if its rating is higher than 4 (a somewhat arbitrary choice, but it seems reasonable)
def convert_to_binary_aux(value):
    if int(value['rating']) > 4:
        return 'positive'
    else:
        return 'negative'

def convert_to_binary(data):
    for x in data:
        x['tag'] = convert_to_binary_aux(x)
convert_to_binary(data_nbow)
convert_to_binary(test_nbow)
data_nbow[0]
data_nbow.Distinct('tag') #Check if tags are calculated properly
# The default model for text representation in Neuthink is the neural bag-of-words model, which streamlines the
#model formulation process. Internally, this model entails the creation of a Multilayer Perceptron (MLP) classifier
#comprising a single hidden layer with a size of 128, complemented by a softmax layer accommodating two classes
model = data_nbow.dMap(source='review',target='tag')
# This code is configuring and preparing the model for training by setting specific parameters for the compilation
#process. It's part of the process of training a DL/ LLM for text representation using the
#neural bag-of-words approach in Neuthink.
func = model.compile(noreprint=True, avg_steps=300,weight_decay=0, test_data= test_nbow)
# The initial classification approach employs a logistic regression classifier with n-grams features. Specifically,
# we'll utilize both unigram and bigram features to enhance the model's discriminative capacity
data = Struct.LoadCSV('/content/drugsComTrain_raw.csv?dl=1').Shuffle()
test = Struct.LoadCSV('/content/drugsComTest_raw.csv?dl=1')
import neuthink.metatext as Text
all_text = ' '.join(data[1:1000]['review'])
tokens = Text.Tokenize(all_text.lower())
bigrams = tokens.Text.Ngrams(size=2, clean=True) #clean=True means we skip punctuation when computing ngrams and size=2 specifies ngram size = 2 (bigrams)
bigrams = sorted([(bigrams[x], x) for x in bigrams],reverse=True)
unigrams = tokens.Text.Ngrams(size=1, clean=True)
unigrams = sorted([(unigrams[x], x) for x in unigrams],reverse=True)
bigrams[0:10] # a list of tuples of (bigram_count_in_the_corpus, bigram text)
unigrams[0:10]
# Combining these lists gives a unified dictionary. There are potential optimization strategies that could make this
# dictionary more efficient, but the current implementation performs adequately for its intended purpose
ngrams = [x[1] for x in bigrams[:1000]] + [x[1] for x in unigrams[:1000]]
# Let's check if everything is correct and sample ngrams from our dictionary
import random
[random.choice(ngrams) for i in range(0,10)]
convert_to_binary(data)
convert_to_binary(test)
# The function in question facilitates the transformation of text into a bag-of-n-grams feature vector. It
#generates a new NumPy vector, sized at 2000 (equivalent to the length of the n-grams dictionary), initialized
#with zero values. Subsequently, it assigns ones to positions within the vector corresponding to the presence of
#the specified n-gram in the given text
import numpy as np
import torch
def features_func_vec(nodes, i, source, mode=None):
    text = nodes[i][source]
    vec = np.zeros(len(ngrams))
    for (j, x) in enumerate(ngrams):
        if x in text:
            vec[j] = 1
    return torch.tensor(vec).float()
# We define logistic regression linear classifier over our bag of features
data = data.VectorFeatures(funcname=features_func_vec, source='review',target='ngramfeatures').Classify(source='ngramfeatures', target='Classify1',class_target='tag')
# Model is trained with starting learning rate 0.01 (default), weight decay 0, using stochastic gradient descent
# with Adam algorithm and batch size of 50 (default). Accuracy on test set is calculated every 300 batches
f= data.compile(avg_steps=300, weight_decay=0, noreprint=True, test_data=test)
data.test_set.F1('tag','Classify1','positive')
data.test_set.F1('tag','Classify1','negative')
In 2018, BERT marked a new era of NLP. It was introduced as the first LLM on the basis of the new Transformer architecture. Since then, Transformer-based LLMs have gained strong momentum. Language modelling is especially attractive due to its universal usefulness. While many real-world NLP tasks such as sentiment analysis, information retrieval and information extraction do not need to generate language, the assumption is that a model that produces language also has the skills to solve a variety of more specialised linguistic challenges.
BERT is only an encoder, while the original transformer is composed of an encoder and decoder. Given that BERT uses an encoder that is very similar to the original encoder of the transformer, we can say that BERT is a transformer-based model [14].
#the transformers library provides several pretrained models and tools for natural language processing tasks;
#sentencepiece provides the subword tokenizer used by some of those models (e.g. XLNet).
#Both are installed below before we load BERT and, later, XLNet.
!pip install transformers
!pip install sentencepiece
data_bert = Struct.LoadCSV('/content/drugsComTrain_raw.csv?dl=1')#.Shuffle()
test_bert = Struct.LoadCSV('/content/drugsComTest_raw.csv?dl=1')
convert_to_binary(data_bert)
convert_to_binary(test_bert)
We will make a function to compute BERT features. For a deeper explanation of what BERT is, please see: Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018)[13].
We will import BERT from the Hugging Face library and put it on a GPU (cuda:0) device, because computation is too slow otherwise. The function bert_vector computes a feature vector by averaging the outputs of layer 11 of the BERT model (we use the BERT-base, uncased version).
# Let's check the allocated GPU specs in Google Colab using the !nvidia-smi command. This command will
# display information about the GPU, including the memory usage, temperature, and clock speed
# We're using the premium Nvidia A100 GPU
!nvidia-smi
!pip install torch torchvision -U
#Ensure we're utilising the A100 Nvidia GPU (premium GPU) in Colab from here onwards - go to 'Change runtime type' in Google Colab
#i.e. for all training and fine-tuning.
#The LLMs are complex, hence the standard T4 GPU in Colab isn't powerful enough. Apple Silicon (the 'mps' device) will also suffice.
#verification given above
from typing import List
import sys
import numpy as np
import torch #it's an open-source machine learning library widely used for creating and training deep learning models.
from transformers import BertTokenizer, BertModel
from transformers import logging
logging.set_verbosity_error()
# this part of the code creates BERT model
bertmodel = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
#labels = torch.tensor([1]).unsqueeze(0).to('cuda:0')
bertmodel.to('cuda:0')
# We will make a function to compute BERT features. For an explanation of what BERT is, please see: Devlin, Jacob,
#et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint
#arXiv:1810.04805 (2018) (https://arxiv.org/abs/1810.04805). We import BERT from the Hugging Face library and put
#it on the GPU (cuda:0) device, because computation is too slow otherwise. The function bert_vector computes a feature
#vector by averaging the outputs of layer 11 of the BERT model (we use the BERT-base, uncased version)
def bert_vector(text: str):
    '''computes vector representation of text using BERT'''
    with torch.no_grad():
        text = text[0:1200]  # truncate the text to the first 1200 characters. This is a somewhat arbitrary choice; raising the limit would improve accuracy but slows training down a lot
        inputs = tokenizer(text.lower(), return_tensors="pt").to('cuda:0')
        outputs = bertmodel(**inputs, output_hidden_states=True)
        q = outputs[2][11][0]  # hidden states of layer 11, first (and only) sequence in the batch
        return q.mean(dim=0).cpu().detach().numpy()
# By importing torch, we gain access to a wide range of functions, classes, and tools provided by PyTorch, allowing
# us to build, train, and deploy machine learning and deep learning models
import torch
def features_func_bert(nodes, i: int, source: str, mode: str = None):
    if len(nodes[i][source]) == 0:
        nodes[i][source] = "None"
    tensor = torch.tensor(bert_vector(nodes[i][source])).float()
    #tensor.to('cuda:0')
    #tensor.requires_grad = False
    return tensor
b = bert_vector("Hello") #this is just to test if bert_vector is working. We can see vector data below:
# Actual model definition:
#1. Generates BERT features from text (review field) by calling callback function, features_func.
#2. The model itself is a multi-layer perceptron, with one hidden layer (128 neurons) followed by a softmax layer
model = data_bert.VectorFeatures(source='review',target='bert_features', funcname=features_func_bert).dMap(source='bert_features', target='hidden1',size=128).dMap(source='hidden1',target='tag')
#This code is setting up and configuring a model for training and evaluation.
#The resulting func function holds important information/ configurations for later use when fine tuning BERT
func = model.compile(weight_decay=0, avg_steps=400, test_data=test_bert, noreprint=True)
model.test_set.F1('tag','dMap1','negative')
model.test_set.F1('tag','dMap1','positive')
from transformers import XLNetModel, XLNetTokenizer
from transformers import AdamW
import torch
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, TensorDataset, random_split
xl_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
xl_model = XLNetModel.from_pretrained("xlnet-base-cased", num_labels=2)
xl_model.to('cuda:0')
def xl_vector(text: str):
    '''computes vector representation of text using XLNet'''
    with torch.no_grad():
        text = text[0:1200]  # truncate the text to the first 1200 characters. This is a somewhat arbitrary choice; raising the limit would improve accuracy but slows training down a lot
        inputs = xl_tokenizer(text.lower(), return_tensors="pt").to('cuda:0')
        outputs = xl_model(**inputs, output_hidden_states=True)
        q = outputs[2][11][0]  # hidden states of layer 11, first (and only) sequence in the batch
        return q.mean(dim=0).cpu().detach().numpy()

def features_func_xl(nodes, i: int, source: str, mode: str = None):
    if len(nodes[i][source]) == 0:
        nodes[i][source] = "None"
    tensor = torch.tensor(xl_vector(nodes[i][source])).float()
    #tensor.to('cuda:0')
    #tensor.requires_grad = False
    return tensor
x = xl_vector("Hello") #this is just to test if xl_vector is working. We can see vector data below:
data_xl = Struct.LoadCSV('/content/drugsComTrain_raw.csv?dl=1').Shuffle()
test_xl = Struct.LoadCSV('/content/drugsComTest_raw.csv?dl=1')
convert_to_binary(data_xl)
convert_to_binary(test_xl)
model_xlnet = data_xl.VectorFeatures(source='review',target='xl_features', funcname=features_func_xl).dMap(source='xl_features', target='hidden1',size=128).dMap(source='hidden1',target='tag')
#This code is setting up and configuring a model for training and evaluation.
#The resulting func function holds important information/ configurations for later use when fine tuning XLNet
func_xlnet = model_xlnet.compile(weight_decay=0, avg_steps=400, test_data=test_xl, noreprint=True)
model_xlnet.test_set.F1('tag','dMap1','negative')
model_xlnet.test_set.F1('tag','dMap1','positive')
!conda install -y -c anaconda wget
!wget https://www.dropbox.com/s/wd35fbl314kcv6f/drugsComTest_raw.csv?dl=1
!wget https://www.dropbox.com/s/a7n2c8wtm9hu72w/drugsComTrain_raw.csv?dl=1
!pip install sentencepiece
!pip install transformers
#We're downsizing our DataFrame (a two-dimensional tabular data structure commonly used in data analysis) by a
#factor of 10. This function is helpful in the context of executing our Large Language Models (LLMs),
#as it streamlines our workflow
#LLMs, especially large ones like XLNet or BERT, can be memory-intensive. Reducing the size of the DataFrame helps
#manage memory usage and speeds up computation
def reduce_dataframe_size(df, factor=10, random_seed=42):
    """
    Reduce the size of a DataFrame by the specified factor.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - factor (int): The factor by which to reduce the size. Default is 10.
    - random_seed (int): Seed for random sampling. Default is 42.

    Returns:
    - pd.DataFrame: A reduced-size DataFrame.
    """
    num_rows_to_keep = len(df) // factor
    return df.sample(n=num_rows_to_keep, random_state=random_seed)
# This code reads data from two CSV files ('drugsComTrain_raw.csv' and 'drugsComTest_raw.csv') using pandas, and
# stores the data in DataFrames named train_df and test_df respectively. These DataFrames are now available for
#further data analysis and processing
import pandas as pd
import csv
train_df=pd.read_csv('drugsComTrain_raw.csv?dl=1')
test_df=pd.read_csv('drugsComTest_raw.csv?dl=1')
#train_df=pd.read_csv('/content/drugsComTrain_raw.csv?dl=1')
#test_df=pd.read_csv('/content/drugsComTest_raw.csv?dl=1')
train_df = reduce_dataframe_size(train_df)
test_df = reduce_dataframe_size(test_df)
train_df.head()
#The code plays a crucial role in preparing the data for training our LLM for binary classification. It transforms
#the raw ratings into labels (positive or negative) that can be used to train and evaluate the model's performance
def convert_to_binary_aux(value):
    if int(value.rating) > 4:
        return 'positive'
    else:
        return 'negative'

def convert_to_binary(data):
    data['tag'] = data.apply(convert_to_binary_aux, axis=1)
convert_to_binary(train_df)
convert_to_binary(test_df)
# Prerequisite before importing libraries, IF running on Apple Silicon
#import torch
#import math
# this ensures that the current macOS version is at least 12.3+
#print(torch.backends.mps.is_available())
# this ensures that the current PyTorch installation was built with MPS activated.
#print(torch.backends.mps.is_built())
#If both commands return True, then PyTorch has access to the GPU
# These libraries provide the necessary tools for working with BERT, training neural networks, handling data, and
#evaluating model performance. They are commonly used in natural language processing tasks and deep learning
#projects such as ours
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, TensorDataset, random_split
#this code is setting up a pre-trained BERT model for the specific task of classifying sequences of text. The
#tokenizer is getting it ready to understand and process text data, and the model is being loaded and prepared for
#binary classification. It's also ensuring that the computations are done on a GPU if one is available, which can
#greatly speed up the process.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).to('cuda:0')
#use 'cuda:0' for Colab executions and 'mps' for Apple Silicon executions
# Tokenizing reviews
# Truncate reviews to the first 1200 characters
# Large Language Models (LLMs), such as BERT or XLNet, have a maximum token limit. For example, BERT can handle
# sequences up to 512 tokens. By truncating each review to 1200 characters (and letting the tokenizer truncate to
# max_length=512 tokens), we keep the input within the model's capacity
texts = train_df['review'].str[:1200].tolist()
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to('cuda:0')
#torch.cuda.empty_cache()
# Converting 'positive' and 'negative' tags to binary labels
labels = torch.tensor((train_df['tag'] == 'positive').astype(int).values).to('cuda:0')
# Creating a DataLoader
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_dataloader = DataLoader(train_dataset, batch_size=25, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=25)
# This code block is preparing the tools needed to train a BERT model for a classification task. It sets up the
#loss function, optimizer, and learning rate scheduler. The loss function measures how well the model is doing,
#the optimizer adjusts the model's weights to improve performance, and the scheduler helps control the learning
#rate during training.
loss_fn = torch.nn.CrossEntropyLoss().to('cuda:0')
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader))
#This code is training a BERT model over three epochs. It iterates through batches of training data, computes the
#loss, and updates the model's parameters to improve its performance. It also keeps track of the time taken for
#each epoch
# As it's computationally expensive, it took 824.82 seconds to execute (see output below)
num_epochs = 3
k=0
import time
for epoch in range(num_epochs):
    model.train()
    start_time = time.time()  # Record the start time
    for i, batch in enumerate(train_dataloader):
        # move the batch items (input IDs, attention masks, labels) to the GPU for faster computation
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        # clear the gradients of the model's parameters; gradients are used during backpropagation to update the weights
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        print(i, len(train_dataloader))
        print(k)
        k = k + 1
    elapsed_time = time.time() - start_time  # Calculate the elapsed time
    print(f"Time elapsed: {elapsed_time:.2f} seconds")
#Evaluates a model's performance on a validation set by making predictions, computing the F1 score, and printing
#out the result
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        logits = model(input_ids, attention_mask=attention_mask).logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
f1 = f1_score(all_labels, all_preds)
print(f"F1 Score on Validation Set: {f1}")
# Tokenizing reviews
# Truncate reviews to the first 1200 characters
# Prepares the test set by extracting and processing text snippets, converting them into a format suitable for the
#BERT model, and organizing them into batches for evaluation. It also handles the labels, converting them into a
#binary format (0 or 1) for evaluation.
test_texts = test_df['review'].str[:1200].tolist()
test_inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to('cuda:0')
# Converting 'positive' and 'negative' tags to binary labels
test_labels = torch.tensor((test_df['tag'] == 'positive').astype(int).values).to('cuda:0')
# Creating a DataLoader
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'], test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=25)
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        logits = model(input_ids, attention_mask=attention_mask).logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
f1 = f1_score(all_labels, all_preds)
print(f"F1 Score on Test Set: {f1}")
# These libraries and components provide the necessary tools for working with XLNet, training neural networks,
#handling data, and evaluating model performance. They are commonly used in natural language processing tasks and
#deep learning projects such as ours
from transformers import XLNetModel, XLNetTokenizer, XLNetForSequenceClassification
from transformers import AdamW
import torch
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, TensorDataset, random_split
#this code is setting up a powerful language model (XLNet) for the specific task of classifying sequences of text.
#The tokenizer is getting it ready to understand and process text data, and the model is being loaded and prepared
#for binary classification. It's also ensuring that the computations are done on a GPU if one is available, which
#can greatly speed up the process.
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2).to('cuda:0')
# Tokenizing reviews
# Truncate reviews to the first 1200 characters
# Large Language Models (LLMs), such as BERT or XLNet, have a maximum token limit. For example, BERT can handle
# sequences up to 512 tokens. By truncating each review to 1200 characters (and letting the tokenizer truncate to
# max_length=512 tokens), we keep the input within the model's capacity
texts = train_df['review'].str[:1200].tolist()
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to('cuda:0')
# Converting 'positive' and 'negative' tags to binary labels
labels = torch.tensor((train_df['tag'] == 'positive').astype(int).values).to('cuda:0')
# Creating a DataLoader
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_dataloader = DataLoader(train_dataset, batch_size=25, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=25)
# This code block prepares the training components: the loss function measures how well the model
# is doing, the optimizer (AdamW) adjusts the model's weights to improve performance, and the linear
# scheduler controls the learning rate during training.
# Note: loss_fn is defined here, but the model also returns its own cross-entropy loss when labels
# are passed to it, which is what the training loop below uses.
loss_fn = torch.nn.CrossEntropyLoss().to('cuda:0')
optimizer = AdamW(model.parameters(), lr=2e-5)
# num_training_steps is set to one epoch's worth of batches, so the learning rate decays to zero by
# the end of the first epoch; multiply by the number of epochs for a schedule spanning the whole run.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader))
# This code trains XLNet for three epochs. It iterates through batches of training data, computes
# the loss, back-propagates, and updates the model's parameters, printing progress and the time
# taken for each epoch. As it's computationally expensive, it took 2211.96 seconds to execute
# (see output below).
import time

num_epochs = 3
k = 0
for epoch in range(num_epochs):  # one pass through the entire training set per epoch
    model.train()
    start_time = time.time()  # Record the start time
    for i, batch in enumerate(train_dataloader):  # each batch holds input IDs, attention masks and labels
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        print(i, len(train_dataloader))  # progress within the current epoch
        print(k)                         # running batch count across all epochs
        k = k + 1
    elapsed_time = time.time() - start_time  # Calculate the elapsed time for this epoch
    print(f"Time elapsed: {elapsed_time:.2f} seconds")
# Evaluate the model on the validation set: make predictions, compute the F1 score, and print the result.
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        logits = model(input_ids, attention_mask=attention_mask).logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
f1 = f1_score(all_labels, all_preds)
print(f"F1 Score on Validation Set: {f1}")
# Tokenizing reviews
# Truncate reviews to the first 1200 characters, then evaluate the fine-tuned XLNet model on the test set.
test_texts = test_df['review'].str[:1200].tolist()
test_inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to('cuda:0')
# Converting 'positive' and 'negative' tags to binary labels
test_labels = torch.tensor((test_df['tag'] == 'positive').astype(int).values).to('cuda:0')
# Creating a DataLoader
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'], test_labels)
test_dataloader = DataLoader(test_dataset, batch_size=25)

model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = [item.to('cuda:0') for item in batch]
        logits = model(input_ids, attention_mask=attention_mask).logits
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.cpu().numpy())
f1 = f1_score(all_labels, all_preds)
print(f"F1 Score on Test Set: {f1}")
Our baseline performance helped us decide on a modelling strategy for an LLM. For the baseline we employed a neural bag-of-words approach: we downloaded pre-trained Word2Vec embeddings of dimension 100, and for classification we trained a logistic regression classifier on n-gram features. This baseline achieved 0.83 accuracy. As that performance was relatively weak, we moved on to a larger, more robust model; a minimal sketch of such a baseline is shown below.
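The following is an illustrative sketch only, not the exact Part 1 code: it builds TF-IDF n-gram features and fits a logistic regression classifier, assuming the train_df/test_df frames with 'review' and 'tag' columns used throughout this notebook.
# Minimal sketch of an n-gram + logistic regression baseline (illustrative; not the exact Part 1 code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Unigram + bigram TF-IDF features over the raw review text
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
X_train = vectorizer.fit_transform(train_df['review'])
X_test = vectorizer.transform(test_df['review'])

y_train = (train_df['tag'] == 'positive').astype(int)
y_test = (test_df['tag'] == 'positive').astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print("Baseline accuracy:", accuracy_score(y_test, preds))
print("Baseline F1:", f1_score(y_test, preds))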
Out of all the methods and models we trained and tested in Part 1, the Random Forest classifier provided the best results, with an F1 score of ~0.90.
Here, in Part 2, we trained and tested Large Language Models (as opposed to the Machine Learning models in Part 1). BERT and XLNet were chosen specifically, as they are highly appropriate for text classification and sentiment analysis tasks such as ours.
| Model | Core Differentiator | Pre-training Objective | Parameters | Access |
|---|---|---|---|---|
| BERT | First Transformer-based LLM | AE | 370M | Source code |
| XLNet | Joint AE and AR | AE and AR | 110M | Source code |
XLNet has a similar architecture to BERT. However, the major difference comes in its approach to pre-training: BERT's masked-language-model objective is autoencoding (AE), whereas XLNet's permutation language modelling combines autoencoding and autoregressive (AR) ideas.
To summarise the performances:
F1 scores for the two models presented in the notebook (trained on one-tenth of the dataset): BERT as both feature extractor and classifier, 0.92; XLNet as both feature extractor and classifier, 0.93.
Out of all the methods and models in Part 1 and Part 2, XLNet provided the best results as both feature extractor and classifier, with an F1 score of 0.93. This is a large improvement on our baseline score of 0.83, and it outperforms the best Machine Learning model (Random Forest) from Part 1.
This score is generally considered excellent for sentiment analysis tasks, especially when using a powerful model like XLNet. An F1 score of 1.0 represents perfect precision and recall, so a score of 0.93 indicates very high precision and recall in classifying sentiment for our drug user reviews.
An F1 score above 0.9 is often indicative of a highly effective model in sentiment analysis. However, it's important to note that the specific context and requirements of the task can influence what constitutes a "good" F1 score. Additionally, it's recommended to consider other evaluation metrics and conduct thorough testing to ensure the model's performance is consistent across different datasets and scenarios.
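As a quick reminder of how the metric behaves, the toy snippet below (with made-up labels, purely illustrative) computes precision, recall, and their harmonic mean side by side.
# Toy, purely illustrative example of how F1 relates to precision and recall (made-up labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]   # hypothetical true sentiment labels (1 = positive)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]   # hypothetical model predictions

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("precision:", p)
print("recall:", r)
print("F1 (harmonic mean):", 2 * p * r / (p + r))   # same value as f1_score(y_true, y_pred)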
Pre-Processing Text
We're using BERT's and XLNet's tokenizers directly instead of the 'clean_review' column from Part 1, so there are no separate pre-processing steps to clean the vocabulary.
Built-in Tokenization: BERT and XLNet come with their own specialized tokenizers, designed to handle many of the preprocessing tasks we would typically implement manually, such as lowercasing (for certain model variants) and handling special characters.
WordPiece Tokenization: BERT's tokenizer uses something called "WordPiece tokenization." What's great about this is that it breaks down words into smaller chunks. So, even if a word isn't in BERT's predefined vocabulary, its smaller pieces probably are. This makes BERT robust and capable of understanding a vast range of words without needing them explicitly in its vocabulary.
Contextual Understanding: BERT analyzes text bidirectionally. This means it understands the context by looking at words before and after a target word, making it super smart in grasping nuanced meanings.
So, when we use BERT and XLNet, we often rely on their tokenizers directly on raw text, bypassing the typical preprocessing steps like our 'clean_review' function from Part 1. This doesn't mean preprocessing is irrelevant, but BERT's and XLNet's design allows them to handle raw text quite efficiently. A small illustration of subword tokenization is shown below.
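For illustration, the snippet below shows how a drug name that is unlikely to be in either vocabulary gets broken into subword pieces; the checkpoint names are examples, and the exact pieces depend on each tokenizer's vocabulary.
# Illustrative only: show how BERT (WordPiece) and XLNet (SentencePiece) split an unseen drug name
# into subword pieces. The checkpoints below are example choices.
from transformers import BertTokenizer, XLNetTokenizer

bert_tok = BertTokenizer.from_pretrained('bert-base-uncased')
xlnet_tok = XLNetTokenizer.from_pretrained('xlnet-base-cased')

sample = "Lisinopril gave me a persistent dry cough."
print("BERT tokens: ", bert_tok.tokenize(sample))   # WordPiece pieces, continuations marked with '##'
print("XLNet tokens:", xlnet_tok.tokenize(sample))  # SentencePiece pieces, word starts marked with '▁'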
When Should We Use a Machine Learning Model (e.g. Random Forest) Versus a Large Language Model (e.g. BERT/XLNet)?
The Random Forest algorithm (the best performing model from Part 1) presents computational advantages in terms of training efficiency and does not necessitate GPU-accelerated processing. As an ensemble of decision trees, it also remains relatively interpretable while demonstrating strong performance.
In contrast, neural networks and LLMs such as BERT and XLNet demand a substantial volume of data, potentially more than is readily available to a typical practitioner, to achieve optimal efficacy. They also significantly reduce the interpretability of features, trading the substantive relevance of individual variables for performance. This consideration, of course, is contingent upon the specific characteristics and requirements of each individual project and healthcare business.
If the primary objective is the establishment of a predictive model with minimal concern for the intricate interplay of variables, the deployment of a neural network or LLM is advisable; however, it necessitates substantial computational resources, and is of course time and cost-intensive.
Conversely, in cases where a comprehensive comprehension of the contributing variables is imperative, a customary outcome is a modest compromise in model performance to ensure an elucidation of the individual variable contributions to the predictive model.
Ultimately, the decision taken by the healthcare business should be based on the specific nature of our data, the complexity of the task, available resources, and the business requirements for transparency and interpretability.
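To make the interpretability trade-off concrete, the sketch below assumes a TF-IDF plus Random Forest setup broadly similar in spirit to Part 1 (not the exact Part 1 code) and shows how per-feature importances can be read directly from a trained forest, something the fine-tuned transformers above do not expose.
# Sketch of inspecting feature importances from a TF-IDF + Random Forest pipeline (assumed setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=20000)
X_train = vectorizer.fit_transform(train_df['review'])
y_train = (train_df['tag'] == 'positive').astype(int)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# The most important features, i.e. the terms the forest relies on most when classifying sentiment
feature_names = np.array(vectorizer.get_feature_names_out())
top = np.argsort(rf.feature_importances_)[::-1][:20]
for name, importance in zip(feature_names[top], rf.feature_importances_[top]):
    print(f"{name}: {importance:.4f}")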
This same method can easily be transferred to other healthcare scenarios, as the techniques used here are quite general and are not specific to drug review data. We could also apply this model to measure patients' perceptions of the care and treatment they receive and the kinds of services they want, and to build more personalised healthcare plans based on patients' prior experiences.
This implementation was done in Python, but it could equally be reproduced in a different programming language such as R. However, since Python offers a wide range of libraries and modules, it is easier to work with and to reproduce the results on different systems.
Overall, we have met our objective, which was to analyse sentiment in these drug user review texts using a supervised binary text classifier that labels each review as positive or negative, with overall satisfaction leaning towards the positive side.
By analyzing the sentiment expressed in online drug reviews, healthcare providers and manufacturers can gain a more comprehensive understanding of the strengths and weaknesses of their products. This information can inform product development and improvement efforts, and help to ensure that products meet the needs and expectations of patients and consumers.
[1] Sentiment Analysis of User-Generated Content on Drug Review Websites researchgate.net/publication/277625450_Sentiment_Analysis_of_User-Generated_Content_on_Drug_Review_Websites
[2] Classifying Drug Ratings Using User Reviews with Transformer-Based Language Models medrxiv.org/content/10.1101/2021.04.15.21255573v2.full
[3] Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts sciencedirect.com/science/article/pii/S1532046416300508
[4] A Gentle Introduction To Text Classification And Sentiment Analysis, Miguel González-Fierro miguelgfierro.com/blog/2017/a-gentle-introduction-to-text-classification-and-sentiment-analysis/
[5] UCI ML Drug Review dataset kaggle.com/jessicali9530/kuc-hackathon-winter-2018?select=drugsComTest_raw.csv
[6] Multi-Class Metrics Made Simple, Part II: the F1-score towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
[7] How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
[8] Text Preprocessing in NLP with Python codes analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
[9] Importance of Text Pre-processing pluralsight.com/guides/importance-of-text-pre-processing
[10] Choosing the right language model for your NLP use case - A guide to understanding, selecting and deploying Large Language Models towardsdatascience.com/choosing-the-right-language-model-for-your-nlp-use-case-1288ef3c4929
[11] Neuthink Github Repository github.com/meanotekai/gapresolution
[12] word2vec code.google.com/archive/p/word2vec/
[13] Hugging Face/ BERT LLM huggingface.co/docs/transformers/model_doc/bert
[14] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding arxiv.org/abs/1810.04805
[15] Hugging Face/ XLNet LLM huggingface.co/docs/transformers/model_doc/xlnet