Plotting the distribution of ratings to check for skew. In our case, high scores formed the majority class, indicating an imbalanced dataset
Defining a function that applies several pre-processing steps to a given text (a drug user review)
Creating a binary sentiment label from the rating (1 for positive, 0 for negative)
n-gram tokenisation to consider word collocations
Word clouds for positive and negative reviews
Baseline model – for our use case, we used Bag of Words (BoW) features with a Naive Bayes classifier from the scikit-learn (sklearn) package as a baseline
Classification approach – we trained supervised binary classification models (Logistic Regression, Multinomial Naive Bayes, LightGBM and Random Forest) on TF-IDF features. TF-IDF vectorization quantifies the words in a set of documents by computing a score for each word that reflects its importance within a document relative to the whole corpus. These models were then compared against our baseline model
Evaluation using F1 score, as data was imbalanced
Exploring further feature extraction techniques for our best performing model, Random Forest (F1 Score of 0.90)
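The TF-IDF scoring and F1 evaluation steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's code: the toy reviews, the function names, and the plain tf * log(N / df) weighting are all assumptions made for the example. scikit-learn's TfidfVectorizer uses a smoothed, normalised variant by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each token in each tokenised document with tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    df = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return scores


def f1_score(y_true, y_pred, positive=1):
    """F1 (harmonic mean of precision and recall) for the chosen class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up reviews (already tokenised) purely for illustration.
docs = [
    ["this", "drug", "worked", "well"],
    ["this", "drug", "caused", "nausea"],
    ["worked", "well", "no", "side", "effects"],
]
scores = tf_idf(docs)

# A classifier that always predicts "positive" on an 80/20 imbalanced sample.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_positive = f1_score(y_true, y_pred, positive=1)
f1_negative = f1_score(y_true, y_pred, positive=0)
```

In the toy corpus, a word that appears in only one review ("caused") scores higher than one shared across reviews ("this"). The always-positive predictor reaches 80% accuracy while its F1 for the negative class is zero, which is the kind of failure accuracy alone hides on imbalanced data. The source does not say which class or averaging scheme its F1 score was computed over, so the per-class view here is only one possible reading.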
Having implemented several models on the same dataset, we conclude that the random forest algorithm performs well on high-dimensional feature sets such as ours. Preprocessing the text (with our preprocess_text function) before generating features is a major part of data preparation and should not be skipped if quality results are the goal: it removes noise and leaves a clean, meaningful vocabulary.
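To make the preprocessing idea concrete, here is a small hypothetical sketch of the kind of cleaning and n-gram tokenisation described in this post. It is not the project's preprocess_text function, whose exact steps are not listed here; the function names, the tiny stopword list, and the assumption that reviews may contain HTML entities are all illustrative.

```python
import html
import re

# Tiny illustrative stopword list (an assumption); a real project would
# likely use a fuller list. "no" is deliberately kept, since negation
# carries sentiment.
STOPWORDS = {"a", "an", "and", "the", "it", "is", "was", "this",
             "to", "of", "for", "my"}


def preprocess_text_sketch(text):
    """Lowercase, strip non-letters, and drop stopwords and 1-letter tokens."""
    text = html.unescape(text)             # assumes entities such as &#039; may occur
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [t for t in text.split() if len(t) > 1 and t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up review text for illustration.
review = "It&#039;s the BEST drug for my migraines, no side effects!"
tokens = preprocess_text_sketch(review)
bigrams = ngrams(tokens, 2)
```

Bigrams such as "side effects" keep collocations together that single-word tokens would split apart.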
Overall, we met our objective: to analyse the sentiment of drug user reviews with a supervised binary text classifier that labels each review as positive or negative. By analysing the sentiment expressed in online drug reviews, healthcare providers and manufacturers can gain a more comprehensive understanding of the strengths and weaknesses of their products. This information can inform product development and improvement efforts, and help to ensure that products meet the needs and expectations of patients and consumers.
Discover the code behind the insights – check out our GitHub repository for this Natural Language Processing project