suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Text Mining.ipynb
Kernel: Python 3

Text Mining

  • Text mining is the process of exploring and analyzing large amounts of unstructured text data aided by software that can identify concepts, patterns, topics, keywords and other attributes in the data. It's also known as text analytics, although some people draw a distinction between the two terms; in that view, text analytics is an application enabled by the use of text mining techniques to sort through data sets.

  • Text mining has become more practical for data scientists and other users due to the development of big data platforms and deep learning algorithms that can analyze massive sets of unstructured data.

  • Computing word-frequency counts, sentence lengths, and the presence or absence of specific words are typical text-mining tasks. Natural language processing is one of the components of text mining.

  • conda install -c conda-forge textblob
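The word-frequency idea mentioned above can be sketched with nothing but the standard library (the sample sentence here is made up for illustration):

```python
from collections import Counter

# A toy "message" (hypothetical sample text, not from the SMS dataset)
text = "free entry free prize call now call"

# Lowercase, split on whitespace, and tally each token
counts = Counter(text.lower().split())

print(counts.most_common(2))  # the two most frequent words
```

Real text-mining pipelines add tokenization, stop-word removal, and normalization on top of this basic counting step.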


%matplotlib inline
import matplotlib.pyplot as plt
from textblob import Word, TextBlob
import pandas as pd
import csv
from os.path import join
import os
## We store the data in our current working directory and retrieve the path:
path = os.getcwd()
msg = pd.read_csv(join(path, "SMSSpamCollection.csv"), sep='\t',
                  quoting=csv.QUOTE_NONE, names=["label", "message"])
len(msg)
5574
msg.head()

For more clarity, let's describe the data:

msg.groupby('label').describe()

Let us also compute the length of each message and plot the distribution:

msg['MessageLength'] = [len(x) for x in msg['message'].tolist()]
plt.hist(msg['MessageLength'], bins=50, density=True)
plt.xlabel('Message length')
plt.show()
Image in a Jupyter notebook

More statistics about the message length:

msg['MessageLength'].describe()
count    5574.000000
mean       80.478292
std        59.848302
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: MessageLength, dtype: float64
msg.hist(column='MessageLength', by='label', bins=50, density=True)
plt.show()
Image in a Jupyter notebook

Lemmatization

  • Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words.

  • Lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

  • We will now use the textblob package in Python to perform some text mining. We convert each message into a set of lemmas, i.e. words in their canonical, dictionary, or citation form.
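To make the stemming-vs-lemmatization contrast above concrete, here is a deliberately naive suffix-stripping "stemmer" (a toy illustration, not the Porter algorithm), alongside the lemmas a dictionary-based lemmatizer would ideally return (a small hand-written mapping for comparison):

```python
def naive_stem(word):
    # Crude stemming: chop a common suffix with no dictionary lookup.
    # This often leaves non-words behind, illustrating the weakness noted above.
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# What a context-aware lemmatizer would ideally produce (hypothetical mapping)
expected_lemmas = {"studies": "study", "caring": "care", "wolves": "wolf"}

for word in ["studies", "caring", "wolves"]:
    print(f"{word}: stem={naive_stem(word)!r}, lemma={expected_lemmas[word]!r}")
```

Notice how the suffix-stripper produces `stud`, `car`, and `wolv`, which are not valid words, while the lemmatizer maps each form to a meaningful base form.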

import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
# Lemmatize a word (a word not in the dictionary is returned unchanged)
word = 'Aimy'
w = Word(word)
w.lemmatize()
'Aimy'
sentence = "Hi I am Data Science Trainer"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])
'Hi I am Data Science Trainer'
def split_into_lemmas(s):
    words = TextBlob(s).words
    return [word.lemma for word in words]

print(split_into_lemmas)
<function split_into_lemmas at 0x00000190E9654378>
msg['message'].apply(split_into_lemmas)[0]
['Go', 'until', 'jurong', 'point', 'crazy', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'Cine', 'there', 'got', 'amore', 'wat']

Plotting a nice word cloud

from wordcloud import WordCloud  # install with: conda install wordcloud
from wordcloud import STOPWORDS
words = ' '.join(msg['message'])
## Remove URLs, @mentions and retweet markers
no_urls_no_tags = " ".join([word for word in words.split()
                            if 'http' not in word
                            and not word.startswith('@')
                            and word != 'RT'])
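The same filtering logic can be checked on a toy string to see exactly what gets dropped (the sample text below is made up for illustration):

```python
# Hypothetical sample containing a retweet marker, a mention, and a URL
sample = "RT @user check http://spam.example free prize now"

# Identical filter to the one applied to the corpus above
filtered = " ".join(w for w in sample.split()
                    if 'http' not in w
                    and not w.startswith('@')
                    and w != 'RT')

print(filtered)  # -> "check free prize now"
```

Only the ordinary words survive; the URL, the @mention, and the `RT` token are all removed before the word cloud is built.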
wordcloud = WordCloud(
    stopwords=STOPWORDS,
    background_color='black',
    max_words=2000,
    width=1800,
    height=1400
).generate(no_urls_no_tags)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()