GitHub Repository: suyashi29/python-su
Path: blob/master/Natural Language Processing using Python/Exploratory Analysis of Textual Data.ipynb

Kernel: Python 3 (ipykernel)

Exploratory Analysis of Textual Data

Data Preprocessing

pip install textstat textblob

In [1]:

#import required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import string
import math
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
from textwrap import wrap
from textblob import TextBlob

pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org numpy --upgrade spacy

pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org numpy --upgrade textstat

In [2]:

import textstat

In [3]:

# load dataset

text=pd.read_csv('AWSReview.csv')
text.shape

Out[3]:

C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\3023027549.py:3: DtypeWarning: Columns (1,10) have mixed types. Specify dtype option on import or set low_memory=False.
  text=pd.read_csv('AWSReview.csv')

(34660, 21)
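
The DtypeWarning above means pandas inferred more than one type for columns 1 and 10 while reading the file in chunks. Below is a minimal sketch of two common ways to avoid it. The in-memory CSV and its column names are hypothetical stand-ins for AWSReview.csv, and the choice to read the mixed column as text is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for AWSReview.csv: the "flag" column mixes booleans and text
csv_data = io.StringIO("id,flag\n1,True\n2,yes\n3,\n")

# Option 1: read the file in a single pass so each column gets one inferred type
df_single_pass = pd.read_csv(csv_data, low_memory=False)

# Option 2: state the type explicitly for the problem column
csv_data.seek(0)
df_typed = pd.read_csv(csv_data, dtype={"flag": str})

print(df_single_pass.dtypes)
print(df_typed.dtypes)
```

For the real file, the same `dtype=` mapping would name the two columns flagged in the warning.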

In [4]:

text.head()

Out[4]:

In [5]:

#Select required features for analysis from the 21 given columns.

text.isnull().sum()

Out[5]:

id                          0
name                     6760
asins                       2
brand                       0
categories                  0
keys                        0
manufacturer                0
reviews.date               39
reviews.dateAdded       10621
reviews.dateSeen            0
reviews.didPurchase     34659
reviews.doRecommend       594
reviews.id              34659
reviews.numHelpful        529
reviews.rating             33
reviews.sourceURLs          0
reviews.text                1
reviews.title               6
reviews.userCity        34660
reviews.userProvince    34660
reviews.username            7
dtype: int64

In [6]:

#Select the 4 key columns: product name, review text, whether the user recommends the product, and the number of people who found the review helpful
textdata = text[['name','reviews.text','reviews.doRecommend','reviews.numHelpful']]
textdata.head()

Out[6]:

In [9]:

#Drop null values
textdata.dropna(inplace=True)
textdata.dropna(inplace=True)

Out[9]:

C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\2047947588.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  textdata.dropna(inplace=True)
C:\Users\Suyashi144893\AppData\Local\Temp\1\ipykernel_3856\2047947588.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  textdata.dropna(inplace=True)
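
The SettingWithCopyWarning above appears because textdata was created by selecting columns from another DataFrame and is then modified in place. A minimal sketch of one way to avoid it, using a small hypothetical DataFrame: take an explicit copy of the subset and assign the result of dropna() rather than passing inplace=True.

```python
import pandas as pd

# Hypothetical miniature of the review table
reviews = pd.DataFrame({
    "name": ["Fire HD 8", None, "Fire Tv"],
    "reviews.text": ["great tablet", "ok", None],
    "other": [1, 2, 3],
})

# .copy() makes the column subset an independent DataFrame
subset = reviews[["name", "reviews.text"]].copy()

# Assigning the result avoids modifying a possible view in place
subset = subset.dropna()
print(subset.shape)  # (1, 2)
```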

In [8]:

textdata.shape

Out[8]:

(27409, 4)

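Options for handling missing values:

1. dropna(): drop the rows that contain missing values.
2. fillna(0): replace missing values with a constant such as 0.
3. fillna() with the median, mode, or mean of the column.

Mean vs. median: use the mean when the column has no outliers, and the median when it does.

Example values: 10, 500, 800, 700, 900, 1000, 1200, 1100, 450

Mode: for "A", "A", "c", "d" the mode is "A", which occurs 2 times.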
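
The notes above can be checked with a short sketch that uses the same example values. The series with a gap at the end is a hypothetical illustration of filling a missing value with the median.

```python
import pandas as pd

# Numeric example from the notes: the small value 10 pulls the mean down
values = pd.Series([10, 500, 800, 700, 900, 1000, 1200, 1100, 450])
mean_value = values.mean()      # 740.0
median_value = values.median()  # 800.0
print(mean_value, median_value)

# Categorical example from the notes: the most frequent label is the mode
labels = pd.Series(["A", "A", "c", "d"])
mode_value = labels.mode()[0]   # 'A'
print(mode_value)

# Hypothetical column with one missing value, filled with its median (500.0)
with_gap = pd.Series([10, 500, None, 700])
filled = with_gap.fillna(with_gap.median())
print(filled.tolist())
```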

In [10]:

textcopy=textdata.copy()

In [11]:

add =lambda x,y,z,a:x+y+z+a
add(1,2,0,6)

Out[11]:

9

In [14]:

n= [1,2,3,6,8,9,12]
e_n=list(filter(lambda x:x%2 == 0,n))
e_n

Out[14]:

[2, 6, 8, 12]

In [15]:

#Filter products based on number of reviews
textdata=textdata.groupby(['name']).filter(lambda x: len(x)>300).reset_index(drop=True)
print('Number of products matching the criteria is ',len(textdata['name'].unique()))

Out[15]:

Number of products matching the criteria is  10
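
groupby(...).filter(...) keeps every row of each group for which the function returns True and drops the other groups entirely. A minimal sketch on a hypothetical toy table, with a threshold of 2 in place of the 300 used above:

```python
import pandas as pd

# Hypothetical toy table: product A has 3 reviews, B has 1, C has 2
toy = pd.DataFrame({
    "name": ["A", "A", "A", "B", "C", "C"],
    "reviews.text": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

# Keep only products with more than 2 reviews
kept = toy.groupby("name").filter(lambda g: len(g) > 2).reset_index(drop=True)
print(kept["name"].unique())
print(len(kept))
```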

In [20]:

#convert datatype boolean to int
textdata['reviews.doRecommend']=textdata['reviews.doRecommend'].astype(int)
textdata['reviews.numHelpful']=textdata['reviews.numHelpful'].astype(int)

In [19]:

#Cleaning Text data. There are 10 unique product names. Remove unwanted characters from the names.
textdata['name'].unique()
textdata['name']=textdata['name'].apply(lambda x: x.split(',,,')[0])
textdata['name'].nunique()

Out[19]:

10

In [21]:

textdata['name']

Out[21]:

      All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
      All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
      All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
      All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
      All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
                               ...                        
                                     Amazon Fire Tv
                                     Amazon Fire Tv
                                     Amazon Fire Tv
                                     Amazon Fire Tv
                                     Amazon Fire Tv
Name: name, Length: 26720, dtype: object

a= "i am very happy with the product quality"
a[5:15]

In [22]:

textdata['reviews.text']

Out[22]:

      This product so far has not disappointed. My c...
      great for beginner or experienced person. Boug...
      Inexpensive tablet for him to use and learn on...
      I've had my Fire HD 8 two weeks now and I love...
      I bought this for my grand daughter when she c...
                               ...                        
  It has many uses. You can listen to music, che...
  Cost is not outrageous. Easy setup, fun to use...
  I knew about this from its crowd funding start...
  This is a neat product but did not fit my need...
  Responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object

In [23]:

#Explore and clean the review text

for text in enumerate(textdata['reviews.text'][10:15]):
  print('Review:\n',text)

Out[23]:

Review:
 (0, 'Not easy for elderly users cease of ads that pop up.')
Review:
 (1, 'Excellent product. Easy to use, large screen makes watching movies and reading easier.')
Review:
 (2, 'Wanted my father to have his first tablet and this is a very good value. He can watch movies and play a few games. Easy enough for him to use.')
Review:
 (3, 'Simply does everything I need. Thank youAnd silk works wonders')
Review:
 (4, 'Got it as a present and love the size of the screen')
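
The output above shows tuples because enumerate yields (index, value) pairs. The loop variable named text also rebinds the name of the DataFrame loaded earlier. A minimal sketch of the same loop with the pair unpacked and a non-clashing variable name; the two reviews are copied from the output above.

```python
# In the notebook these would come from textdata['reviews.text'][10:15]
sample_reviews = [
    "Not easy for elderly users cease of ads that pop up.",
    "Excellent product. Easy to use, large screen makes watching movies and reading easier.",
]

# Unpack each (index, review) pair instead of printing the tuple
lines = [f"Review {i}: {review}" for i, review in enumerate(sample_reviews)]
for line in lines:
    print(line)
```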

In [24]:

# NLP models treat uppercase and lowercase letters as different tokens, so convert the review text to lowercase (some words in the reviews are capitalized).

textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: x.lower())

In [25]:

textdata['reviews.text']

Out[25]:

      this product so far has not disappointed. my c...
      great for beginner or experienced person. boug...
      inexpensive tablet for him to use and learn on...
      i've had my fire hd 8 two weeks now and i love...
      i bought this for my grand daughter when she c...
                               ...                        
  it has many uses. you can listen to music, che...
  cost is not outrageous. easy setup, fun to use...
  i knew about this from its crowd funding start...
  this is a neat product but did not fit my need...
  responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object

In [26]:

# Remove every word that contains a digit (for example "8" or "8gb") using a regular expression
textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: re.sub(r'\w*\d\w*','', x))

In [27]:

#Eliminate punctuation

textdata['reviews.text']=textdata['reviews.text'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))

In [28]:

textdata['reviews.text']

Out[28]:

      this product so far has not disappointed my ch...
      great for beginner or experienced person bough...
      inexpensive tablet for him to use and learn on...
      ive had my fire hd  two weeks now and i love i...
      i bought this for my grand daughter when she c...
                               ...                        
  it has many uses you can listen to music check...
  cost is not outrageous easy setup fun to use a...
  i knew about this from its crowd funding start...
  this is a neat product but did not fit my need...
  responses well and there are lots of skills to...
Name: reviews.text, Length: 26720, dtype: object
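
The three cleaning steps above (lowercasing, removing words that contain digits, removing punctuation) can be combined into one helper. This is a minimal sketch: the function name clean_text and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re
import string

def clean_text(review: str) -> str:
    """Lowercase, drop words containing digits, then drop punctuation."""
    review = review.lower()
    review = re.sub(r"\w*\d\w*", "", review)
    review = re.sub("[%s]" % re.escape(string.punctuation), "", review)
    return review

# Illustrative sentence; note the double space left where "8" was removed
cleaned = clean_text("I've had my Fire HD 8 two weeks now and I love it!")
print(cleaned)
```

In the notebook this could be applied with textdata['reviews.text'].apply(clean_text). Collapsing the leftover double spaces would need one more substitution such as re.sub(r"\s+", " ", review).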

Polarity scale: -1 = very negative, 0 = neutral, 1 = very positive

In [29]:

#Polarity in sentiment analysis refers to identifying sentiment orientation (positive, neutral, and negative) in written or spoken language.

textdata['emotion']=textdata['reviews.text'].apply(lambda x:TextBlob(x).sentiment.polarity)

In [31]:

textdata['emotion']

Out[31]:

      0.325000
      0.800000
      0.600000
      0.374583
      0.368056
           ...   
  0.500000
  0.411111
  0.512500
  0.250000
  0.000000
Name: emotion, Length: 26720, dtype: float64

In [32]:

from textblob import TextBlob as tx
tx("Hello i am happy ").sentiment

Out[32]:

Sentiment(polarity=0.8, subjectivity=1.0)

In [33]:

from textblob import TextBlob as tx
tx("Hello i am Sad and unhappy worst condition").sentiment

Out[33]:

Sentiment(polarity=-0.7000000000000001, subjectivity=0.9666666666666667)

In [34]:

from textblob import TextBlob as tx
tx("Hello").sentiment

Out[34]:

Sentiment(polarity=0.0, subjectivity=0.0)
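
The polarity scores above can be turned into coarse labels. A minimal sketch: the function name polarity_label and the 0.05 cut-off are assumptions made for illustration and are not part of TextBlob.

```python
def polarity_label(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity in [-1, 1] to a label; the threshold is an assumed cut-off."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Scores taken from the three TextBlob examples above
for score in (0.8, -0.7000000000000001, 0.0):
    print(score, polarity_label(score))
```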

In [ ]:

textdata['emotion']

In [35]:

product_polarity=pd.DataFrame(textdata.groupby('name')['emotion'].mean().sort_values(ascending=True))

plt.figure(figsize=(16,8))
plt.xlabel('Emotion')
plt.ylabel('Products')
plt.title('Polarity of Product Reviews')
polarity_graph=plt.barh(np.arange(len(product_polarity.index)),product_polarity['emotion'],color='pink')


# Label each bar with its product name and mean polarity, vertically centered on the bar
for bar,product in zip(polarity_graph,product_polarity.index):
  plt.text(0.005,bar.get_y()+bar.get_height()/2,'{}'.format(product),va='center',fontsize=11,color='white')

for bar,polarity in zip(polarity_graph,product_polarity['emotion']):
  plt.text(bar.get_width()+0.001,bar.get_y()+bar.get_height()/2,'%.3f'%polarity,va='center',fontsize=11,color='black')
  
plt.yticks([])
plt.show()

Out[35]:

The bars are sorted by mean polarity. Products with higher values have more positive review text, and products with lower values have less positive review text. This gives a rough picture of how favorably each product is reviewed.

The Python package textstat computes statistics from text, such as readability and reading time. We can use it to check whether reviews that were upvoted as helpful differ in reading time from reviews that were not.

In [36]:

textcopy['reviews.doRecommend']=textcopy['reviews.doRecommend'].astype(int)
textcopy['reviews.numHelpful']=textcopy['reviews.numHelpful'].astype(int)

In [37]:

#using textstat package

textcopy['reading_time']=textcopy['reviews.text'].apply(lambda x: textstat.reading_time(x))

print('Reading Time of upvoted reviews is',textcopy[textcopy['reviews.numHelpful']>1]['reading_time'].mean())
print('Reading Time of not upvoted reviews is',textcopy[textcopy['reviews.numHelpful']<=1]['reading_time'].mean())

Out[37]:

Reading Time of upvoted reviews is 3.6968225190839696
Reading Time of not upvoted reviews is 1.8005997496301354
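
The same comparison can be made with one groupby, which also reports the group sizes and the medians alongside the means. A minimal sketch on a hypothetical toy table; the reading_time values are made up for illustration, and the "more than one helpful vote" rule is the one used above.

```python
import pandas as pd

# Hypothetical toy data standing in for textcopy
toy = pd.DataFrame({
    "reviews.numHelpful": [0, 1, 2, 5, 0, 3],
    "reading_time": [1.0, 2.0, 3.0, 5.0, 1.5, 4.0],
})

# Same rule as above: a review counts as upvoted when numHelpful > 1
toy["upvoted"] = toy["reviews.numHelpful"] > 1

# Count, mean, and median reading time per group
summary = toy.groupby("upvoted")["reading_time"].agg(["count", "mean", "median"])
print(summary)
```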

Previous review was helpful to decide about product

Create a word cloud of the reviews. Steps:
tokenize the text, remove stopwords
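
A minimal sketch of these steps, assuming a simple regular-expression tokenizer and a tiny hand-written stopword list (both are illustrative choices, not the notebook's method). The helper names word_frequencies and render_cloud are hypothetical. render_cloud relies on the wordcloud package imported at the top of the notebook.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller list
STOPWORDS = {"a", "and", "the", "to", "it", "is", "for", "of", "i", "my", "this"}

def word_frequencies(reviews):
    """Tokenize on runs of letters, drop stopwords, and count what remains."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

def render_cloud(frequencies):
    """Build a WordCloud from word counts (requires the wordcloud package)."""
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    return cloud.generate_from_frequencies(frequencies)

sample_reviews = [
    "excellent product easy to use",
    "easy enough for him to use",
]
freqs = word_frequencies(sample_reviews)
print(freqs.most_common(3))
```

Calling render_cloud(freqs) and passing the result to plt.imshow(...) would draw the cloud. On the full dataset, the input would be the cleaned textdata['reviews.text'] column.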
