suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Natural Language Processing using Python/Exploratory Analysis of Textual Data.ipynb
Kernel: Python 3 (ipykernel)

Exploratory Analysis of Textual Data

Data Preprocessing

pip install textstat
pip install TextBlob

#import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import string
import math
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
from textwrap import wrap
from textblob import TextBlob
pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org numpy --upgrade spacy
pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org numpy --upgrade textstat
import textstat
# load dataset
text=pd.read_csv('AWSReview.csv')
text.shape
C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\3023027549.py:3: DtypeWarning: Columns (1,10) have mixed types. Specify dtype option on import or set low_memory=False. text=pd.read_csv('AWSReview.csv')
(34660, 21)
text.head()
#Select required features for analysis from the 21 given columns.
text.isnull().sum()
id                          0
name                     6760
asins                       2
brand                       0
categories                  0
keys                        0
manufacturer                0
reviews.date               39
reviews.dateAdded       10621
reviews.dateSeen            0
reviews.didPurchase     34659
reviews.doRecommend       594
reviews.id              34659
reviews.numHelpful        529
reviews.rating             33
reviews.sourceURLs          0
reviews.text                1
reviews.title               6
reviews.userCity        34660
reviews.userProvince    34660
reviews.username            7
dtype: int64
#Select the 4 key columns: product name, review text, whether the user recommends the product, and the number of people who found the review helpful
textdata = text[['name','reviews.text','reviews.doRecommend','reviews.numHelpful']]
textdata.head()
#Drop null values
textdata.dropna(inplace=True)
textdata.dropna(inplace=True)
C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\2047947588.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  textdata.dropna(inplace=True)
C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\2047947588.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  textdata.dropna(inplace=True)
textdata.shape
(27409, 4)

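Options for handling missing values:

1. dropna(): drop the rows that contain missing values.

2. fillna(0): replace missing values with 0.

3. fillna() with the column's median, mode, or mean: replace missing values with a summary statistic.

Mean vs. median: use the mean when the numeric values have no outliers, and the median when outliers are present.

Example: 10, 500, 800, 700, 900, 1000, 1200, 1100, 450 (here 10 is an outlier, so the median is the safer choice).

Mode: for categorical values, e.g. "A", "A", "c", "d",

the mode is A (it occurs 2 times).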
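The imputation options in the notes above can be sketched on the same example values. This is only an illustration: the missing entries appended to each series and the variable names `values` and `labels` are assumptions, not part of the review dataset.

```python
import numpy as np
import pandas as pd

# Numeric example from the notes, with one missing value appended (illustrative)
values = pd.Series([10, 500, 800, 700, 900, 1000, 1200, 1100, 450, np.nan])

mean_value = values.mean()      # NaN is skipped; the outlier 10 pulls the mean down
median_value = values.median()  # the median is less affected by the outlier

# Because the column contains an outlier, fill the gap with the median
values_filled = values.fillna(median_value)

# Categorical example from the notes, with one missing value appended (illustrative)
labels = pd.Series(["A", "A", "c", "d", None])
mode_value = labels.mode()[0]   # most frequent value
labels_filled = labels.fillna(mode_value)

print(mean_value, median_value, mode_value)
```

With these numbers the mean is 740 and the median is 800, which shows how a single small outlier shifts the mean but not the median.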

textcopy=textdata.copy()
add = lambda x,y,z,a: x+y+z+a
add(1,2,0,6)
9
n = [1,2,3,6,8,9,12]
e_n = list(filter(lambda x: x%2 == 0, n))
e_n
[2, 6, 8, 12]
#Filter products based on number of reviews
textdata=textdata.groupby(['name']).filter(lambda x: len(x)>300).reset_index(drop=True)
print('Number of products matching the criteria is ',len(textdata['name'].unique()))
Number of products matching the criteria is 10
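To make the behaviour of `groupby(...).filter(...)` concrete, here is a small sketch on made-up data. The toy DataFrame and the `min_reviews` threshold of 2 are assumptions for illustration (the notebook uses 300 on the real dataset).

```python
import pandas as pd

# Hypothetical toy data: three products with 3, 2, and 1 reviews
reviews = pd.DataFrame({
    "name": ["Tablet", "Tablet", "Tablet", "Speaker", "Speaker", "Cable"],
    "reviews.text": ["good", "fine", "great", "loud", "ok", "short"],
})

min_reviews = 2  # illustrative threshold

# filter() keeps every row of a group when the function returns True for that group
popular = (
    reviews.groupby("name")
    .filter(lambda group: len(group) > min_reviews)
    .reset_index(drop=True)
)

print(popular["name"].unique())
```

Only the product with more than `min_reviews` rows survives, and all of its rows are kept rather than being aggregated.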
#convert datatype boolean to int
textdata['reviews.doRecommend']=textdata['reviews.doRecommend'].astype(int)
textdata['reviews.numHelpful']=textdata['reviews.numHelpful'].astype(int)
#Clean the text data. There are 10 unique product names; keep only the part of each name before the ',,,' separator.
textdata['name'].unique()
textdata['name']=textdata['name'].apply(lambda x: x.split(',,,')[0])
textdata['name'].nunique()
10
textdata['name']
0        All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
1        All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
2        All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
3        All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
4        All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
                               ...
26715                                       Amazon Fire Tv
26716                                       Amazon Fire Tv
26717                                       Amazon Fire Tv
26718                                       Amazon Fire Tv
26719                                       Amazon Fire Tv
Name: name, Length: 26720, dtype: object
a = "i am very happy with the product quality"
a[5:15]
textdata['reviews.text']
0        This product so far has not disappointed. My c...
1        great for beginner or experienced person. Boug...
2        Inexpensive tablet for him to use and learn on...
3        I've had my Fire HD 8 two weeks now and I love...
4        I bought this for my grand daughter when she c...
                               ...
26715    It has many uses. You can listen to music, che...
26716    Cost is not outrageous. Easy setup, fun to use...
26717    I knew about this from its crowd funding start...
26718    This is a neat product but did not fit my need...
26719    Responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object
#Explore and clean the review text
#(the loop variable is named review so it does not overwrite the DataFrame called text)
for review in enumerate(textdata['reviews.text'][10:15]):
    print('Review:\n',review)
Review:
 (0, 'Not easy for elderly users cease of ads that pop up.')
Review:
 (1, 'Excellent product. Easy to use, large screen makes watching movies and reading easier.')
Review:
 (2, 'Wanted my father to have his first tablet and this is a very good value. He can watch movies and play a few games. Easy enough for him to use.')
Review:
 (3, 'Simply does everything I need. Thank youAnd silk works wonders')
Review:
 (4, 'Got it as a present and love the size of the screen')
# NLP models treat uppercase and lowercase letters as different characters, and some words in the
# review text are capitalised, so convert all review text to lowercase.
textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: x.lower())
textdata['reviews.text']
0        this product so far has not disappointed. my c...
1        great for beginner or experienced person. boug...
2        inexpensive tablet for him to use and learn on...
3        i've had my fire hd 8 two weeks now and i love...
4        i bought this for my grand daughter when she c...
                               ...
26715    it has many uses. you can listen to music, che...
26716    cost is not outrageous. easy setup, fun to use...
26717    i knew about this from its crowd funding start...
26718    this is a neat product but did not fit my need...
26719    responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object
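# Eliminate digits (and any word that contains a digit) using a regular expression.
# A raw string is used so that \w and \d are not treated as string escapes.
textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: re.sub(r'\w*\d\w*','', x))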
#Eliminate punctuation
textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
textdata['reviews.text']
0        this product so far has not disappointed my ch...
1        great for beginner or experienced person bough...
2        inexpensive tablet for him to use and learn on...
3        ive had my fire hd  two weeks now and i love i...
4        i bought this for my grand daughter when she c...
                               ...
26715    it has many uses you can listen to music check...
26716    cost is not outrageous easy setup fun to use a...
26717    i knew about this from its crowd funding start...
26718    this is a neat product but did not fit my need...
26719    responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object
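The three cleaning steps above (lowercasing, removing tokens that contain digits, removing punctuation) can be combined into one helper. This is a minimal sketch; the function name `clean_text` and the sample sentence are hypothetical, and the regular expressions are the ones used in the cells above.

```python
import re
import string

def clean_text(review: str) -> str:
    """Apply the notebook's three cleaning steps to a single review."""
    text = review.lower()                  # 1. lowercase
    text = re.sub(r"\w*\d\w*", "", text)   # 2. drop digits and words containing digits
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # 3. drop punctuation
    return text

# Hypothetical sample review, not taken from the dataset
sample = "Bought 2 tablets in 2017, LOVE them!!!"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that removing a token leaves the surrounding spaces in place, so the cleaned string can contain runs of two spaces. With pandas the helper would be applied as `textdata['reviews.text'].apply(clean_text)`.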

Polarity scale: -1 = very negative, 0 = neutral, 1 = very positive

#Polarity in sentiment analysis refers to identifying sentiment orientation
#(positive, neutral, and negative) in written or spoken language.
textdata['emotion']=textdata['reviews.text'].apply(lambda x:TextBlob(x).sentiment.polarity)
textdata['emotion']
0        0.325000
1        0.800000
2        0.600000
3        0.374583
4        0.368056
           ...
26715    0.500000
26716    0.411111
26717    0.512500
26718    0.250000
26719    0.000000
Name: emotion, Length: 26720, dtype: float64
from textblob import TextBlob as tx
tx("Hello i am happy ").sentiment
Sentiment(polarity=0.8, subjectivity=1.0)
from textblob import TextBlob as tx
tx("Hello i am Sad and unhappy worst condition").sentiment
Sentiment(polarity=-0.7000000000000001, subjectivity=0.9666666666666667)
from textblob import TextBlob as tx
tx("Hello").sentiment
Sentiment(polarity=0.0, subjectivity=0.0)
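Polarity scores like the three above are often easier to read once bucketed into labels. The sketch below is one possible way to do that; the function `label_polarity` and the cut-off of 0.05 are assumptions chosen for illustration, not something TextBlob provides.

```python
def label_polarity(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    The threshold is an arbitrary, illustrative cut-off around zero.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Polarity values taken from the three TextBlob outputs above
for score in (0.8, -0.7000000000000001, 0.0):
    print(score, label_polarity(score))
```

On the review data this could be applied as `textdata['emotion'].apply(label_polarity)` to count positive, neutral, and negative reviews per product.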
# Mean polarity per product, sorted ascending so the highest-scoring product is drawn at the top
product_polarity=pd.DataFrame(textdata.groupby('name')['emotion'].mean().sort_values(ascending=True))
plt.figure(figsize=(16,8))
plt.xlabel('Emotion')
plt.ylabel('Products')
plt.title('Polarity of Product Reviews')
polarity_graph=plt.barh(np.arange(len(product_polarity.index)),product_polarity['emotion'],color='pink')
# Write each product name inside its bar, vertically centred (y position + half the bar height)
for bar,product in zip(polarity_graph,product_polarity.index):
    plt.text(0.005,bar.get_y()+bar.get_height()/2,'{}'.format(product),va='center',fontsize=11,color='white')
# Write the mean polarity just past the end of each bar
for bar,polarity in zip(polarity_graph,product_polarity['emotion']):
    plt.text(bar.get_width()+0.001,bar.get_y()+bar.get_height()/2,'%.3f'%polarity,va='center',fontsize=11,color='black')
plt.yticks([])
plt.show()
[Figure: horizontal bar chart "Polarity of Product Reviews"; x-axis: Emotion, y-axis: Products]

The products at the top of the bar chart have the highest mean review polarity, while those at the bottom have the lowest. Note that this is the sentiment of the review text, not the star rating. It gives a rough picture of how favourably each product is received by reviewers.

The Python package textstat calculates statistics from text to assess its readability. We can use it to check whether reading time differs between reviews that were upvoted as helpful and reviews that were not.

textcopy['reviews.doRecommend']=textcopy['reviews.doRecommend'].astype(int)
textcopy['reviews.numHelpful']=textcopy['reviews.numHelpful'].astype(int)
#using textstat package
textcopy['reading_time']=textcopy['reviews.text'].apply(lambda x: textstat.reading_time(x))
print('Reading Time of upvoted reviews is',textcopy[textcopy['reviews.numHelpful']>1]['reading_time'].mean())
print('Reading Time of not upvoted reviews is',textcopy[textcopy['reviews.numHelpful']<=1]['reading_time'].mean())
Reading Time of upvoted reviews is 3.6968225190839696
Reading Time of not upvoted reviews is 1.8005997496301354
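The same comparison can be tried without textstat on a few made-up reviews. This is only a sketch: `estimate_reading_time` is a simplified character-count estimate standing in for `textstat.reading_time`, the constant of 14.69 ms per character is an assumption, and the toy DataFrame is hypothetical.

```python
import pandas as pd

MS_PER_CHAR = 14.69  # assumed reading speed in milliseconds per character

def estimate_reading_time(text: str, ms_per_char: float = MS_PER_CHAR) -> float:
    """Rough reading time in seconds: non-space characters times a fixed cost."""
    n_chars = sum(len(word) for word in text.split())
    return n_chars * ms_per_char / 1000

# Hypothetical toy reviews with helpful-vote counts
toy = pd.DataFrame({
    "reviews.text": [
        "ok",
        "works well and the battery lasts all day",
        "nice",
        "setup took a while but support helped",
    ],
    "reviews.numHelpful": [0, 5, 1, 3],
})
toy["reading_time"] = toy["reviews.text"].apply(estimate_reading_time)

# Same split as the notebook: more than one helpful vote counts as upvoted
upvoted_mean = toy.loc[toy["reviews.numHelpful"] > 1, "reading_time"].mean()
other_mean = toy.loc[toy["reviews.numHelpful"] <= 1, "reading_time"].mean()
print(upvoted_mean, other_mean)
```

Because the estimate is proportional to the number of characters, a higher mean reading time simply means the reviews in that group are longer.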

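Reviews with more than one helpful vote have roughly twice the mean reading time of the remaining reviews (about 3.70 versus 1.80), which suggests that longer, more detailed reviews are the ones readers find helpful when deciding about a product.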

  • Create a word cloud of the reviews. Steps:

  • Tokenize the text and remove stopwords
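The tokenizing and stopword-removal steps listed above can be sketched with the standard library alone. The stopword set, the function name `word_frequencies`, and the sample reviews are all hypothetical; a real analysis would use a fuller stopword list.

```python
from collections import Counter

# Hypothetical, deliberately small stopword list
STOPWORDS = {"the", "a", "and", "to", "is", "it", "for", "of", "this"}

def word_frequencies(reviews, stopwords=STOPWORDS):
    """Tokenize each review on whitespace, drop stopwords, and count the rest."""
    counts = Counter()
    for review in reviews:
        tokens = review.lower().split()
        counts.update(token for token in tokens if token not in stopwords)
    return counts

# Hypothetical cleaned reviews (already lowercased, no punctuation)
sample_reviews = [
    "easy to use and great screen",
    "great tablet for the price",
    "screen is great",
]
freqs = word_frequencies(sample_reviews)
print(freqs.most_common(3))
```

A word-to-count mapping like `freqs` is the kind of input that `WordCloud.generate_from_frequencies` accepts, so the most frequent remaining words end up drawn largest.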