Path: blob/master/Key Python Libraries/Key Python Lib-Day 4.ipynb
AI and NLP Libraries
1. TensorFlow
TensorFlow is a free, open-source software library mainly used for machine learning and differentiable programming
It is a math library used by machine learning applications and neural networks
It helps in performing high-end numerical computations
TensorFlow can handle deep neural networks for image recognition, handwritten digit classification, recurrent neural networks, NLP (Natural Language Processing), word embedding, and PDEs (partial differential equations)
Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.
It is part of the TensorFlow library and allows you to define and train neural network models in just a few lines of code.
The model expects rows of data with 8 variables (the input_shape=(8,) argument).
The first hidden layer has 12 nodes and uses the relu activation function.
The second hidden layer has 8 nodes and uses the relu activation function.
The output layer has one node and uses the sigmoid activation function.
The line of code that adds the first Dense layer does two things: it defines the input (visible) layer and the first hidden layer.
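The layer stack described above can be sketched with the Keras Sequential API (a minimal sketch assuming TensorFlow 2.x; `keras.Input(shape=(8,))` plays the role of the input_shape=(8,) argument and works across Keras versions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequential model for rows of data with 8 input variables
model = keras.Sequential([
    keras.Input(shape=(8,)),                 # input (visible) layer: 8 variables
    layers.Dense(12, activation="relu"),     # first hidden layer: 12 nodes, relu
    layers.Dense(8, activation="relu"),      # second hidden layer: 8 nodes, relu
    layers.Dense(1, activation="sigmoid"),   # output layer: 1 node, sigmoid
])
model.summary()
```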
Compiling
Compiling the model uses the efficient numerical libraries under the covers (the so-called backend) such as Theano or TensorFlow. The backend automatically chooses the best way to represent the network for training and making predictions to run on your hardware, such as CPU, GPU, or even distributed.
When compiling, you must specify some additional properties required when training the network. Remember training a network means finding the best set of weights to map inputs to outputs in your dataset.
We will define the optimizer as the efficient stochastic gradient descent algorithm "adam". This is a popular version of gradient descent because it automatically tunes itself and gives good results on a wide range of problems.
The training process will run for a fixed number of epochs (iterations) through the dataset that you must specify using the epochs argument. You must also set the number of dataset rows that are considered before the model weights are updated within each epoch, called the batch size, and set using the batch_size argument.
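Putting these settings together, a compile-and-fit sketch might look like the following (binary cross-entropy loss is assumed here, the usual choice for a single sigmoid output; X and y are made-up toy data standing in for your dataset):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in data: 100 rows, 8 input variables, binary labels
X = np.random.rand(100, 8)
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(12, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Compile: loss function, the "adam" optimizer, and the metric to report
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Fit: 10 passes (epochs) over the data, weights updated every 10 rows (batch_size)
history = model.fit(X, y, epochs=10, batch_size=10, verbose=0)
```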
Evaluate Keras Model
The evaluate() function will return a list with two values. The first will be the loss of the model on the dataset, and the second will be the accuracy of the model on the dataset. You are only interested in reporting the accuracy so ignore the loss value.
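A self-contained sketch of the evaluate step, reusing the same made-up toy data (evaluating on the training data here is for illustration only; in practice you would evaluate on held-out data):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(100, 8)
y = np.random.randint(0, 2, size=(100,))

model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(12, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=10, verbose=0)

# evaluate() returns [loss, accuracy]; report only the accuracy
loss, accuracy = model.evaluate(X, y, verbose=0)
print("Accuracy: %.2f%%" % (accuracy * 100))
```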
Natural language processing (NLP)
NLP is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to bridge the gap between human communication and computer understanding.
NLTK
NLTK is a popular Python framework for dealing with data of human language. It includes a set of text processing libraries for classification and semantic reasoning, as well as wrappers for industrial-strength NLP libraries and an active discussion forum.
The NLTK module is a massive tool kit, aimed at helping you with the entire Natural Language Processing (NLP) methodology.
NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects, and then even with helping your machine to understand what the text is all about.
Installation
conda install -c anaconda nltk
STEMMING
A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one: it reduces inflected forms of a word to a common stem.
For example, the stem of the word magically is magic.
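The example above, sketched with NLTK's PorterStemmer (stemming needs no extra data downloads):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Reduce inflected words to their stems
for word in ["magically", "running", "cats"]:
    print(word, "->", stemmer.stem(word))
```

Note that a stem is not always a dictionary word; the Porter algorithm applies suffix-stripping rules, so some outputs look truncated.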
Python program to generate WordCloud
Working with categorical data
A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value
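For example, pandas can create one dummy variable per category value with get_dummies (a minimal sketch; the "color" column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary (dummy) column per category value
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies)
```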
ML | Dummy classifiers using sklearn
DummyClassifier makes predictions that ignore the input features.
This classifier serves as a simple baseline to compare against other more complex classifiers.
The specific behavior of the baseline is selected with the strategy parameter.
All strategies make predictions that ignore the input feature values passed as the X argument to fit and predict. The predictions, however, typically depend on values observed in the y parameter passed to fit.
Note that the "stratified" and "uniform" strategies lead to non-deterministic predictions that can be rendered deterministic by setting the random_state parameter if needed. The other strategies are naturally deterministic and, once fit, always return the same constant prediction for any value of X.
strategy : {"most_frequent", "prior", "stratified", "uniform", "constant"}, default="prior"
Strategy to use to generate predictions.
“most_frequent”: the predict method always returns the most frequent class label in the observed y argument passed to fit. The predict_proba method returns the matching one-hot encoded vector.
“prior”: the predict method always returns the most frequent class label in the observed y argument passed to fit (like “most_frequent”). predict_proba always returns the empirical class distribution of y also known as the empirical class prior distribution.
“stratified”: the predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The predict method returns the class label which got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.
“uniform”: generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.
“constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
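A minimal sketch of the "most_frequent" baseline on made-up toy data (the features X are ignored, as described above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[-1], [1], [1], [1]])   # feature values are ignored by the dummy
y = np.array([0, 1, 1, 1])            # class 1 is the most frequent label

clf = DummyClassifier(strategy="most_frequent")
clf.fit(X, y)

print(clf.predict(X))   # always the most frequent class
print(clf.score(X, y))  # baseline accuracy to beat
```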