Predicting Sentiment Based on Drug Product User Reviews (Using Real World Data) for Informed Decision Making – Part 1

Sentiment Analysis Using a Supervised Binary Text Classifier (Utilizing Machine Learning)

Key Points

Key Techniques to Deliver Our Project:

  1. Plotting the distribution of ratings to reveal any skewness. In our case, high scores formed the majority class, indicating an imbalance in the data
  2. Defining a function that applies several pre-processing steps to a given text (the drug user reviews)
  3. Creating a binary sentiment feature from the rating (1 for positive, 0 for negative)
  4. n-gram tokenisation to consider word collocations
  5. Word clouds for positive and negative reviews
  6. Baseline model – for our use case, we used Bag of Words (BoW) with Naive Bayes from the scikit-learn package as a baseline
  7. Classification approach – we trained supervised binary classification models (Logistic Regression, Multinomial Naive Bayes, LightGBM and Random Forest) on TF-IDF features and compared them against our baseline model. TF-IDF vectorization is a technique for quantifying the words in a set of documents: each word receives a score that signifies its importance in the document and in the corpus
  8. Evaluation using the F1 score, as the data was imbalanced
  9. Exploring further feature extraction techniques for our best-performing model, Random Forest (F1 score of 0.90)
  10. Having implemented several models on the same dataset with high-dimensional features, we conclude that the Random Forest algorithm performs well on data with high dimensionality, such as ours. Pre-processing the text (with our preprocess_text function) before generating features from it is a major part of data preparation and should not be skipped if better quality results are the goal. It removes noise and leaves us with a clean, good-quality vocabulary.
  11. Overall, we met our objective, which was to predict sentiment from the text of these drug user reviews using a supervised binary text classifier that labels each review as positive or negative. By analysing the sentiment expressed in online drug reviews, healthcare providers and manufacturers can gain a more comprehensive understanding of the strengths and weaknesses of their products. This information can inform product development and improvement efforts, and help to ensure that products meet the needs and expectations of patients and consumers.
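
To make a few of the steps above more concrete (pre-processing, creating the sentiment label, n-gram tokenisation and F1 evaluation), here is a minimal sketch in plain Python. It is not the project's actual code: the stop-word list, the cleaning rules inside this version of preprocess_text, the rating threshold of 7 and the helper names rating_to_sentiment, ngrams and f1_score are all assumptions made for illustration. In the project itself, vectorisation and modelling were done with library tools rather than hand-written helpers like these.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "i", "my",
              "this", "to", "of", "for", "was"}


def preprocess_text(text):
    """Lowercase, strip HTML entities and non-letters, then drop stop words.

    The exact cleaning steps are assumed; the original function may differ.
    """
    text = text.lower()
    text = re.sub(r"&#?\w+;", " ", text)   # HTML entities such as &#039;
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def rating_to_sentiment(rating, threshold=7):
    """Map a numeric rating to 1 (positive) or 0 (negative).

    The threshold of 7 is an assumption, not a value stated in the post.
    """
    return 1 if rating >= threshold else 0


def ngrams(tokens, n=2):
    """Return contiguous n-grams so that word collocations are kept together."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


tokens = preprocess_text("It worked GREAT for my headaches!")
bigrams = ngrams(tokens, n=2)
label = rating_to_sentiment(9)
score = f1_score([1, 1, 1, 0], [1, 1, 0, 1])
```

With imbalanced ratings, a model that always predicts the majority class can still score a high accuracy, which is why F1 (which balances precision and recall) is the more informative metric here.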

 

Discover the code behind the insights – check out our GitHub repository for this Natural Language Processing project


