
AI and NLP Libraries

1. TensorFlow

  • TensorFlow is a free, open-source software library mainly used for dataflow and differentiable programming (a short sketch of this follows the list below)

  • It is a math library used by machine learning applications and neural networks

  • It helps in performing large-scale, high-performance numerical computations

  • TensorFlow can handle deep neural networks for image recognition and handwritten digit classification, recurrent neural networks, NLP (Natural Language Processing), word embeddings, and PDEs (Partial Differential Equations)

  • Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.

  • It is part of the TensorFlow library and allows you to define and train neural network models in just a few lines of code.

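Before the Keras example, here is a minimal sketch (not part of the original notebook) of TensorFlow used directly as a numerical library, including automatic differentiation:

import tensorflow as tf

# TensorFlow as a plain math library: tensor creation and matrix multiplication
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(tf.matmul(a, b))

# differentiable programming: compute dy/dx for y = x**2 at x = 3 with GradientTape
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
print(tape.gradient(y, x))  # tf.Tensor(6.0, ...)
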
pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org --ignore-installed --upgrade tensorflow

Rough sketch of the network built below: the input features (x1, x2, ...) flow through Dense layers with activation functions to the output layer, and the sigmoid activation squashes the output to a value between 0 and 1.
# first neural network with keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd

dia = pd.read_csv("diabetes.csv")
dia.head()
X = dia.iloc[:, :-1]  # all columns except the last one = features
y = dia.Outcome       # target variable
# define the keras model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
  • The model expects rows of data with 8 variables (the input_shape=(8,) argument).

  • The first hidden layer has 12 nodes and uses the relu activation function.

  • The second hidden layer has 8 nodes and uses the relu activation function.

  • The output layer has one node and uses the sigmoid activation function.

The line of code that adds the first Dense layer does two things: it defines the input (visible) layer and the first hidden layer. You can confirm the resulting architecture with the summary sketch below.

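As a quick check that the architecture matches the description above, a short sketch (not in the original notebook) that prints the layer summary:

# print layer shapes and parameter counts for the model defined above
model.summary()
# expected parameter counts: Dense(12) -> 8*12+12 = 108, Dense(8) -> 12*8+8 = 104, Dense(1) -> 8*1+1 = 9
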
Compiling

  • Compiling the model uses the efficient numerical libraries under the covers (the so-called backend) such as Theano or TensorFlow. The backend automatically chooses the best way to represent the network for training and making predictions to run on your hardware, such as CPU, GPU, or even distributed.

  • When compiling, you must specify some additional properties required when training the network. Remember that training a network means finding the best set of weights to map inputs to outputs in your dataset.

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  • We will define the optimizer as the efficient stochastic gradient descent algorithm “adam”. This is a popular version of gradient descent because it automatically tunes itself and gives good results on a wide range of problems.

  • The training process will run for a fixed number of epochs (iterations) through the dataset that you must specify using the epochs argument. You must also set the number of dataset rows that are considered before the model weights are updated within each epoch, called the batch size, and set using the batch_size argument.

# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)

Evaluate Keras Model

The evaluate() function returns a list with two values: the first is the loss of the model on the dataset, and the second is its accuracy on the dataset. You are only interested in reporting the accuracy, so ignore the loss value.

# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
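
As a follow-up sketch (not in the original notebook), the trained model's sigmoid outputs can be thresholded at 0.5 to obtain class predictions:

# make class predictions with the trained model (0.5 threshold on the sigmoid output)
predictions = (model.predict(X) > 0.5).astype(int).flatten()
# compare the first five predictions with the true labels
for i in range(5):
    print('Predicted %d, expected %d' % (predictions[i], y.iloc[i]))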

Natural language processing (NLP)

NLP is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

NLTK

  • The NLTK module is a massive toolkit aimed at helping you with the entire Natural Language Processing (NLP) workflow.

  • NLTK helps with everything from splitting paragraphs into sentences and sentences into words, to tagging the part of speech of those words, identifying the main subjects, and even helping your machine understand what a text is about (a short part-of-speech sketch follows the tokenization examples below).

Installation

  • conda install -c anaconda nltk

import nltk
nltk.download()
If you are behind a corporate proxy (for example http://u:p@noidaproxy.corp.exlservice.com:8000, where u and p stand for the username and password), the proxy must be configured before nltk.download() can fetch any data.
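
NLTK's downloader provides a set_proxy helper for this; a minimal sketch, using the proxy URL above and placeholder credentials:

import nltk

# configure the corporate proxy before downloading NLTK data
# ('u' and 'p' are placeholders for the real username and password)
nltk.set_proxy("http://noidaproxy.corp.exlservice.com:8000", user="u", password="p")
nltk.download('punkt')       # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('stopwords')   # English stop word list used below
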
import nltk
from nltk.tokenize import word_tokenize

E = "Hello i am Suyashi Raiwani"
print(word_tokenize(E))
['Hello', 'i', 'am', 'Suyashi', 'Raiwani']
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

S = "Positive thinking is all! , really a matter of habits. If you are; not quite a positive thinker. Change Yourself?"
print(sent_tokenize(S))  # splits into sentences that end with !, ?, or .
print(word_tokenize(S))  # splits into individual words and punctuation
['Positive thinking is all!', ', really a matter of habits.', 'If you are; not quite a positive thinker.', 'Change Yourself?']
['Positive', 'thinking', 'is', 'all', '!', ',', 'really', 'a', 'matter', 'of', 'habits', '.', 'If', 'you', 'are', ';', 'not', 'quite', 'a', 'positive', 'thinker', '.', 'Change', 'Yourself', '?']
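
The NLTK bullets above also mention recognizing the part of speech of words; here is a minimal sketch (not in the original notebook), assuming the averaged_perceptron_tagger data has been downloaded:

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('averaged_perceptron_tagger')  # required once for pos_tag
tokens = word_tokenize("Positive thinking is really a matter of habits")
print(nltk.pos_tag(tokens))  # prints a list of (word, POS tag) pairs, e.g. ('is', 'VBZ')
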
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning Data Science will bring a big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
stop_words1 = set(stopwords.words('english'))  # the list of English stop words shipped with NLTK
word_tokens = word_tokenize(a)
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(word_tokens)
print("**************************************")
print(filtered_sentence)
print("**************************************")
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
['I', 'think', 'i', 'that', 'Learning', 'Data', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']
**************************************
['I', 'think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
**************************************
The number of words stopped : 18
stop_words1
b = ["I", ".", ","]  # creating your own stop word list
stop_words1 = list(stop_words1)
stop_words2 = b
stop_words = stop_words1 + stop_words2  # combined stop word list
word_tokens = word_tokenize(a)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
['think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', 'structured', 'unstructured', 'data', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 24

STEMMING

A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one.

For example, the stem of the word magically is magic.

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()  # defining the stemmer
s_words = ["Aimed", "Aims", "Aimed", "Aim", "Aiming", "Amy", "Aimm", "ran", "run", "running"]
for i in s_words:
    print(ps.stem(i))
aim aim aim aim aim ami aimm ran run run

Python program to generate WordCloud

# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd

# reads the 'Youtube04-Eminem.csv' file
df = pd.read_csv(r"Youtube04-Eminem.csv", encoding="latin-1")

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the csv file
for val in df.CONTENT:
    # typecast each val to string
    val = str(val)
    # split the value
    tokens = val.split()
    # convert each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Image in a Jupyter notebook

Working with categorical data

  • A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value

# import required modules
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'Job Roles': ['BA', 'BA', 'Consultant', 'Manager', 'BA']})

# display dataset
print(df)

# create dummy variables
pd.get_dummies(df)
# import required modules
import pandas as pd
import numpy as np

# create dataset
s = pd.Series(list('Data'))

# display dataset
print(s)

# create dummy variables
pd.get_dummies(s)
# import required modules
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'A': ['Red', 'Blue', 'Yellow', 'Red', 'Red'],
                   'B': ['bg', 'header', 'Text', 'Text', 'bg'],
                   'C': [1, 2, 3, 4, 5]})

# display dataset
print(df)

# create dummy variables
pd.get_dummies(df)

ML | Dummy classifiers using sklearn

DummyClassifier makes predictions that ignore the input features.

  • This classifier serves as a simple baseline to compare against other more complex classifiers.

  • The specific behavior of the baseline is selected with the strategy parameter.

  • All strategies make predictions that ignore the input feature values passed as the X argument to fit and predict. The predictions, however, typically depend on values observed in the y parameter passed to fit.

  • Note that the “stratified” and “uniform” strategies lead to non-deterministic predictions that can be rendered deterministic by setting the random_state parameter if needed. The other strategies are naturally deterministic and, once fit, always return the same constant prediction for any value of X.

strategy : {“most_frequent”, “prior”, “stratified”, “uniform”, “constant”}, default=”prior”

Strategy to use to generate predictions.

  • “most_frequent”: the predict method always returns the most frequent class label in the observed y argument passed to fit. The predict_proba method returns the matching one-hot encoded vector.

  • “prior”: the predict method always returns the most frequent class label in the observed y argument passed to fit (like “most_frequent”). predict_proba always returns the empirical class distribution of y also known as the empirical class prior distribution.

  • “stratified”: the predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The predict method returns the class label which got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.

  • “uniform”: generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.

  • “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([-1, 1, 1, 1])
y = np.array([0, 1, 1, 1])
dummy_clf = DummyClassifier(strategy="prior")
dummy_clf.fit(X, y)
dummy_clf.predict(X)
dummy_clf.score(X, y)
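
To make the differences between the strategies concrete, here is a small sketch (not in the original notebook) comparing predict and predict_proba for three strategies on a toy label vector:

import numpy as np
from sklearn.dummy import DummyClassifier

X_toy = np.zeros((6, 1))              # features are ignored by DummyClassifier
y_toy = np.array([0, 1, 1, 1, 1, 0])  # class 1 is the majority class

for s in ["most_frequent", "prior", "uniform"]:
    clf = DummyClassifier(strategy=s, random_state=0)
    clf.fit(X_toy, y_toy)
    print(s, clf.predict(X_toy[:3]), clf.predict_proba(X_toy[:1]))

# "most_frequent" predicts class 1 with a one-hot probability vector,
# "prior" also predicts class 1 but predict_proba returns the empirical class
# distribution (about [0.33, 0.67]), and "uniform" predicts classes at random.
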
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns
d=pd.read_csv("breastcancer.csv")
d.head(2)
d.shape
d['diagnosis'].value_counts().to_frame().reset_index()  # count of each diagnosis class
# separating the dependent and independent variables
y = d['diagnosis']
X = d.drop('diagnosis', axis=1)
X = X.drop('id', axis=1)

# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.dummy import DummyClassifier

strategies = ['most_frequent', 'stratified', 'uniform', 'constant']
test_scores = []

for s in strategies:
    if s == 'constant':
        dclf = DummyClassifier(strategy=s, random_state=0, constant='M')
    else:
        dclf = DummyClassifier(strategy=s, random_state=0)
    dclf.fit(X_train, y_train)
    score = dclf.score(X_test, y_test)
    test_scores.append(score)
ax = sns.stripplot(x=strategies, y=test_scores)
ax.set(xlabel='Strategy', ylabel='Test Score')
plt.show()
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))