
AI and NLP Libraries

1. TensorFlow

  • TensorFlow is a free, open-source software library mainly used for dataflow and differentiable programming (a short sketch of this follows the list below)

  • It is a math library used by machine learning applications and neural networks

  • It helps in performing large-scale, high-performance numerical computations

  • TensorFlow can handle deep neural networks for image recognition and handwritten digit classification, recurrent neural networks, NLP (Natural Language Processing), word embeddings, and PDEs (Partial Differential Equations)

  • Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.

  • It is part of the TensorFlow library and allows you to define and train neural network models in just a few lines of code.

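Before the Keras example, here is a minimal sketch (not part of the original notebook) of TensorFlow used directly as a numerical library, including automatic differentiation:

import tensorflow as tf

# TensorFlow as a plain math library: tensor creation and matrix multiplication
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(tf.matmul(a, b))

# differentiable programming: compute dy/dx for y = x**2 at x = 3 with GradientTape
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
print(tape.gradient(y, x))  # tf.Tensor(6.0, ...)
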
pip install wordcloud --trusted-host pypi.org --trusted-host files.pythonhosted.org --ignore-installed --upgrade tensorflow

Rough sketch of the network built below: the input features (x1, x2, ...) flow through Dense layers with activation functions to the output layer, and the sigmoid activation squashes the output to a value between 0 and 1.
# first neural network with keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import pandas as pd

dia = pd.read_csv("diabetes.csv")
dia.head()
X = dia.iloc[:, :-1]  # all columns except the last one = features
y = dia.Outcome       # target variable
# define the keras model
model = Sequential()
model.add(Dense(12, input_shape=(8,), activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
  • The model expects rows of data with 8 variables (the input_shape=(8,) argument).

  • The first hidden layer has 12 nodes and uses the relu activation function.

  • The second hidden layer has 8 nodes and uses the relu activation function.

  • The output layer has one node and uses the sigmoid activation function.

The line of code that adds the first Dense layer does two things: it defines the input (visible) layer and the first hidden layer. You can confirm the resulting architecture with the summary sketch below.

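As a quick check that the architecture matches the description above, a short sketch (not in the original notebook) that prints the layer summary:

# print layer shapes and parameter counts for the model defined above
model.summary()
# expected parameter counts: Dense(12) -> 8*12+12 = 108, Dense(8) -> 12*8+8 = 104, Dense(1) -> 8*1+1 = 9
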
Compiling

  • Compiling the model uses the efficient numerical libraries under the covers (the so-called backend) such as Theano or TensorFlow. The backend automatically chooses the best way to represent the network for training and making predictions to run on your hardware, such as CPU, GPU, or even distributed.

  • When compiling, you must specify some additional properties required when training the network. Remember that training a network means finding the best set of weights to map inputs to outputs in your dataset.

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  • We will define the optimizer as the efficient stochastic gradient descent algorithm “adam”. This is a popular version of gradient descent because it automatically tunes itself and gives good results on a wide range of problems.

  • The training process will run for a fixed number of epochs (iterations) through the dataset that you must specify using the epochs argument. You must also set the number of dataset rows that are considered before the model weights are updated within each epoch, called the batch size, and set using the batch_size argument.

# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=10)

Evaluate Keras Model

The evaluate() function returns a list with two values: the first is the loss of the model on the dataset, and the second is its accuracy on the dataset. You are only interested in reporting the accuracy, so ignore the loss value.

# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
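
As a follow-up sketch (not in the original notebook), the trained model's sigmoid outputs can be thresholded at 0.5 to obtain class predictions:

# make class predictions with the trained model (0.5 threshold on the sigmoid output)
predictions = (model.predict(X) > 0.5).astype(int).flatten()
# compare the first five predictions with the true labels
for i in range(5):
    print('Predicted %d, expected %d' % (predictions[i], y.iloc[i]))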

Natural language processing (NLP)

NLP is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

NLTK

  • The NLTK module is a massive toolkit aimed at helping you with the entire Natural Language Processing (NLP) workflow.

  • NLTK helps with everything from splitting paragraphs into sentences and sentences into words, to tagging the part of speech of those words, identifying the main subjects, and even helping your machine understand what a text is about (a short part-of-speech sketch follows the tokenization examples below).

Installation

  • conda install -c anaconda nltk

import nltk
nltk.download()
If you are behind a corporate proxy (for example http://u:p@noidaproxy.corp.exlservice.com:8000, where u and p stand for the username and password), the proxy must be configured before nltk.download() can fetch any data.
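
NLTK's downloader provides a set_proxy helper for this; a minimal sketch, using the proxy URL above and placeholder credentials:

import nltk

# configure the corporate proxy before downloading NLTK data
# ('u' and 'p' are placeholders for the real username and password)
nltk.set_proxy("http://noidaproxy.corp.exlservice.com:8000", user="u", password="p")
nltk.download('punkt')       # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('stopwords')   # English stop word list used below
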
import nltk
from nltk.tokenize import word_tokenize

E = "Hello i am Suyashi Raiwani"
print(word_tokenize(E))
['Hello', 'i', 'am', 'Suyashi', 'Raiwani']
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

S = "Positive thinking is all! , really a matter of habits. If you are; not quite a positive thinker. Change Yourself?"
print(sent_tokenize(S))  # splits into sentences that end with !, ?, or .
print(word_tokenize(S))  # splits into individual words and punctuation
['Positive thinking is all!', ', really a matter of habits.', 'If you are; not quite a positive thinker.', 'Change Yourself?']
['Positive', 'thinking', 'is', 'all', '!', ',', 'really', 'a', 'matter', 'of', 'habits', '.', 'If', 'you', 'are', ';', 'not', 'quite', 'a', 'positive', 'thinker', '.', 'Change', 'Yourself', '?']
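
The NLTK bullets above also mention recognizing the part of speech of words; here is a minimal sketch (not in the original notebook), assuming the averaged_perceptron_tagger data has been downloaded:

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('averaged_perceptron_tagger')  # required once for pos_tag
tokens = word_tokenize("Positive thinking is really a matter of habits")
print(nltk.pos_tag(tokens))  # prints a list of (word, POS tag) pairs, e.g. ('is', 'VBZ')
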
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

a = "I think i that Learning Data Science will bring a big leap in your Carrier Profile. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a broad range of application domains"
stop_words1 = set(stopwords.words('english'))  # the list of English stop words shipped with NLTK
word_tokens = word_tokenize(a)
filtered_sentence = [w for w in word_tokens if w not in stop_words1]
print(word_tokens)
print("**************************************")
print(filtered_sentence)
print("**************************************")
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
['I', 'think', 'i', 'that', 'Learning', 'Data', 'Science', 'will', 'bring', 'a', 'big', 'leap', 'in', 'your', 'Carrier', 'Profile', '.', 'Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'noisy', ',', 'structured', 'and', 'unstructured', 'data', ',', 'and', 'apply', 'knowledge', 'from', 'data', 'across', 'a', 'broad', 'range', 'of', 'application', 'domains']
**************************************
['I', 'think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', '.', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', ',', 'structured', 'unstructured', 'data', ',', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
**************************************
The number of words stopped : 18
stop_words1
b = ["I", ".", ","]  # creating your own stop word list
stop_words1 = list(stop_words1)
stop_words2 = b
stop_words = stop_words1 + stop_words2  # combined stop word list
word_tokens = word_tokenize(a)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)
print("The number of words stopped :", (len(word_tokens) - len(filtered_sentence)))
['think', 'Learning', 'Data', 'Science', 'bring', 'big', 'leap', 'Carrier', 'Profile', 'Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'noisy', 'structured', 'unstructured', 'data', 'apply', 'knowledge', 'data', 'across', 'broad', 'range', 'application', 'domains']
The number of words stopped : 24

STEMMING

A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one.

For example, the stem of the word magically is magic.

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()  # defining the stemmer
s_words = ["Aimed", "Aims", "Aimed", "Aim", "Aiming", "Amy", "Aimm", "ran", "run", "running"]
for i in s_words:
    print(ps.stem(i))
aim aim aim aim aim ami aimm ran run run

Python program to generate WordCloud

# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd

# reads the 'Youtube04-Eminem.csv' file
df = pd.read_csv(r"Youtube04-Eminem.csv", encoding="latin-1")

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the csv file
for val in df.CONTENT:
    # typecast each val to string
    val = str(val)
    # split the value
    tokens = val.split()
    # convert each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Image in a Jupyter notebook

Working with categorical data

  • A dummy variable is a binary variable that indicates whether a separate categorical variable takes on a specific value

# import required modules
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'Job Roles': ['BA', 'BA', 'Consultant', 'Manager', 'BA']})

# display dataset
print(df)

# create dummy variables
pd.get_dummies(df)
# import required modules
import pandas as pd
import numpy as np

# create dataset
s = pd.Series(list('Data'))

# display dataset
print(s)

# create dummy variables
pd.get_dummies(s)
# import required modules
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'A': ['Red', 'Blue', 'Yellow', 'Red', 'Red'],
                   'B': ['bg', 'header', 'Text', 'Text', 'bg'],
                   'C': [1, 2, 3, 4, 5]})

# display dataset
print(df)

# create dummy variables
pd.get_dummies(df)

ML | Dummy classifiers using sklearn

DummyClassifier makes predictions that ignore the input features.

  • This classifier serves as a simple baseline to compare against other more complex classifiers.

  • The specific behavior of the baseline is selected with the strategy parameter.

  • All strategies make predictions that ignore the input feature values passed as the X argument to fit and predict. The predictions, however, typically depend on values observed in the y parameter passed to fit.

  • Note that the “stratified” and “uniform” strategies lead to non-deterministic predictions that can be rendered deterministic by setting the random_state parameter if needed. The other strategies are naturally deterministic and, once fit, always return the same constant prediction for any value of X.

strategy : {“most_frequent”, “prior”, “stratified”, “uniform”, “constant”}, default=”prior”

Strategy to use to generate predictions.

  • “most_frequent”: the predict method always returns the most frequent class label in the observed y argument passed to fit. The predict_proba method returns the matching one-hot encoded vector.

  • “prior”: the predict method always returns the most frequent class label in the observed y argument passed to fit (like “most_frequent”). predict_proba always returns the empirical class distribution of y also known as the empirical class prior distribution.

  • “stratified”: the predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The predict method returns the class label which got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.

  • “uniform”: generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.

  • “constant”: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([-1, 1, 1, 1])
y = np.array([0, 1, 1, 1])
dummy_clf = DummyClassifier(strategy="prior")
dummy_clf.fit(X, y)
dummy_clf.predict(X)
dummy_clf.score(X, y)
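
To make the differences between the strategies concrete, here is a small sketch (not in the original notebook) comparing predict and predict_proba for three strategies on a toy label vector:

import numpy as np
from sklearn.dummy import DummyClassifier

X_toy = np.zeros((6, 1))              # features are ignored by DummyClassifier
y_toy = np.array([0, 1, 1, 1, 1, 0])  # class 1 is the majority class

for s in ["most_frequent", "prior", "uniform"]:
    clf = DummyClassifier(strategy=s, random_state=0)
    clf.fit(X_toy, y_toy)
    print(s, clf.predict(X_toy[:3]), clf.predict_proba(X_toy[:1]))

# "most_frequent" predicts class 1 with a one-hot probability vector,
# "prior" also predicts class 1 but predict_proba returns the empirical class
# distribution (about [0.33, 0.67]), and "uniform" predicts classes at random.
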
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import seaborn as sns
d=pd.read_csv("breastcancer.csv")
d.head(2)
d.shape
d['diagnosis'].value_counts().to_frame().reset_index()  # count of each diagnosis class
# separating the dependent and independent variables
y = d['diagnosis']
X = d.drop('diagnosis', axis=1)
X = X.drop('id', axis=1)

# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from sklearn.dummy import DummyClassifier

strategies = ['most_frequent', 'stratified', 'uniform', 'constant']
test_scores = []

for s in strategies:
    if s == 'constant':
        dclf = DummyClassifier(strategy=s, random_state=0, constant='M')
    else:
        dclf = DummyClassifier(strategy=s, random_state=0)
    dclf.fit(X_train, y_train)
    score = dclf.score(X_test, y_test)
    test_scores.append(score)
ax = sns.stripplot(x=strategies, y=test_scores)
ax.set(xlabel='Strategy', ylabel='Test Score')
plt.show()
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))