TEXT MINING
Text mining is the process of exploring and analyzing large amounts of unstructured text data aided by software that can identify concepts, patterns, topics, keywords and other attributes in the data. It's also known as text analytics, although some people draw a distinction between the two terms; in that view, text analytics is an application enabled by the use of text mining techniques to sort through data sets.
Text mining has become more practical for data scientists and other users due to the development of big data platforms and deep learning algorithms that can analyze massive sets of unstructured data.
Typical basic text mining tasks include finding frequency counts of words, measuring the length of sentences, and checking for the presence or absence of specific words. Natural language processing (NLP) is one of the components of text mining.
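To make these three measurements concrete, here is a minimal sketch that uses only the Python standard library. The function name `basic_text_features`, the sample sentence and the keyword list are assumptions made for illustration; they are not part of the data set used in this notebook, and the regular-expression tokenizer is deliberately simplistic.

```python
import re
from collections import Counter


def basic_text_features(text, keywords):
    """Compute word frequencies, sentence lengths and keyword presence."""
    # Lower-case the text and pull out runs of letters (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    vocabulary = set(words)
    # Split on sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_counts": Counter(words),
        # Sentence length measured in whitespace-separated words.
        "sentence_lengths": [len(s.split()) for s in sentences],
        "keyword_present": {k: k.lower() in vocabulary for k in keywords},
    }


# Hypothetical sample message, not taken from the notebook's data.
sample = "Free entry today. Call now to claim your free prize!"
features = basic_text_features(sample, keywords=["free", "urgent"])

print(features["word_counts"].most_common(2))
print(features["sentence_lengths"])
print(features["keyword_present"])
```

Real text usually needs a more careful tokenizer than a single regular expression, which is one reason to use a dedicated library for anything beyond a quick look at the data.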
conda install -c conda-forge textblob
For more clarity, let's describe our data.
Let us also compute the length of each message and plot the distribution.
More statistics about the message length:
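The length computation and summary statistics described above could look roughly like the sketch below. It assumes the messages are available as a plain list of strings; the `messages` list here is a hypothetical stand-in, since the actual data set is not shown in this section. Only the standard library is used, and a small text histogram stands in for the plot. If the data lives in a pandas DataFrame, `describe()` and a histogram plot would likely be the more natural choice.

```python
import statistics
from collections import Counter

# Hypothetical stand-in for the message column of the data set.
messages = [
    "Ok",
    "See you at 5",
    "Free entry! Call now to claim your prize",
    "Are we still meeting for lunch today?",
]

# Length of each message in characters.
lengths = [len(m) for m in messages]

# Summary statistics about the message length.
summary = {
    "count": len(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "min": min(lengths),
    "max": max(lengths),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# Crude text histogram: group lengths into bins that are 10 characters wide.
bins = Counter((n // 10) * 10 for n in lengths)
for start in sorted(bins):
    print(f"{start:>3}-{start + 9:<3} {'#' * bins[start]}")
```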
Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words.
Lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.
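The contrast between the two approaches can be illustrated with a deliberately simplified sketch. Both functions below are toy assumptions written for this example: `crude_stem` strips a fixed list of suffixes, and `lookup_lemma` consults a tiny hand-written dictionary. Real stemmers and lemmatizers use much richer rules and lexical resources.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix, ignoring meaning."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-written lemma dictionary, for illustration only.
LEMMAS = {
    "studies": "study",
    "caring": "care",
    "mice": "mouse",
    "better": "good",
}


def lookup_lemma(word):
    """Toy lemmatizer: return the dictionary form if known, else the word."""
    return LEMMAS.get(word, word)


for w in ["studies", "caring", "mice", "better"]:
    print(f"{w:>8}  stem={crude_stem(w):<8} lemma={lookup_lemma(w)}")
```

The toy stemmer turns "studies" into "stud" and "caring" into "car", neither of which is the intended word, and it leaves the irregular plural "mice" untouched. The dictionary lookup maps each of them to a valid base form.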
Now we will use the TextBlob package for Python to perform some text mining. We convert each message into a set of lemmas, i.e. words in their canonical form, also called the dictionary form or citation form.
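A minimal sketch of this conversion is shown below. The function name `split_into_lemmas` is an assumption made for this example. It relies on TextBlob's `words` property and the `lemma` attribute of each word, and it assumes the NLTK corpora that TextBlob needs have already been downloaded (for example with `python -m textblob.download_corpora`).

```python
def split_into_lemmas(message):
    """Return the lemmas of a message as a list of lower-case strings."""
    # Imported inside the function so that a missing package only raises
    # an error when the function is actually called.
    from textblob import TextBlob

    # Lower-case the message, then let TextBlob tokenize it into words.
    words = TextBlob(message.lower()).words
    # Word.lemma uses the WordNet lemmatizer. These words carry no
    # part-of-speech tag, so each one is lemmatized as a noun.
    return [str(word.lemma) for word in words]
```

Calling `split_into_lemmas` on a message returns one lemma per token. Because every word is treated as a noun here, verb forms such as "chasing" are generally left unchanged; passing a part-of-speech tag to `Word.lemmatize` is one way to handle them.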