GitHub Repository: veeralakrishna/DataCamp-Project-Solutions-Python
Path: blob/master/Book Recommendations from Charles Darwin/notebook.ipynb
Kernel: Python 3

1. Darwin's bibliography

Charles Darwin is one of the few universal figures of science. His most renowned work is without a doubt "On the Origin of Species", published in 1859, which introduced the concept of natural selection. But Darwin wrote many other books on a wide range of topics, including geology, plants, and his personal life. In this notebook, we will automatically detect how closely related his books are to each other.

To this end, we will lay the foundations of a content-based book recommendation system, which determines which books are close to each other based on how similar the topics they discuss are. The methods we will use are common in text-heavy fields such as legal, tech, or customer support, where they serve tasks such as text classification or handling search engine queries.

Let's take a look at the books we'll use in our recommendation system.

# Import library
import glob

# The books files are contained in this folder
folder = "datasets/"

# List all the .txt files and sort them alphabetically
files = glob.glob(folder + "*.txt")
files.sort()

2. Load the contents of each book into Python

As a first step, we need to load the content of these books into Python and do some basic pre-processing to facilitate the downstream analyses. We call such a collection of texts a corpus. We will also store the titles of these books for future reference and print their respective lengths to get a sense of their size.

# Import libraries
import re, os

# Initialize the objects that will contain the texts and titles
txts = []
titles = []

for n in files:
    # Open each file; the context manager closes it once it has been read
    with open(n, encoding='utf-8-sig') as f:
        # Replace every run of non-alphanumeric characters with a single space
        data = re.sub(r'[\W_]+', ' ', f.read())
    # Store the texts and titles of the books in two separate lists
    titles.append(os.path.basename(n).replace('.txt', ''))
    txts.append(data)

# Print the length, in characters, of each book
[len(t) for t in txts]
[123231, 496068, 1776539, 617088, 913713, 624232, 335920, 523021, 797401, 901406, 1047518, 1010643, 767492, 1660866, 298319, 916267, 1093567, 1043499, 341447, 1149574]
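
To see what the substitution in the loading step does, here is a minimal sketch on a made-up sentence (the sample string is an assumption, not text from the books). The pattern `[\W_]+` matches every run of characters that are neither letters nor digits, and each run is replaced by a single space.

```python
import re

# Hypothetical sample text, not taken from the corpus
sample = "Natural selection -- the 'struggle' for life; by C._Darwin (1859)."

# Replace each run of non-alphanumeric characters (underscore included) with one space
cleaned = re.sub(r'[\W_]+', ' ', sample)

print(repr(cleaned))
```

Note that trailing punctuation becomes a trailing space, which is harmless here because the later tokenization step splits on whitespace.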

3. Find "On the Origin of Species"

For the next parts of this analysis, we will often check the results returned by our method for a given book. For consistency, we will refer to Darwin's most famous book: "On the Origin of Species." Let's find the index associated with this book.

# Browse the list containing all the titles
for i in range(len(titles)):
    # Store the index if the title is "OriginofSpecies"
    if titles[i] == 'OriginofSpecies':
        ori = i
        break

# Print the stored index
ori
15
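
The same lookup can also be written with the built-in `list.index` method. The sketch below uses a shortened, hypothetical list of titles (the names other than `OriginofSpecies` are placeholders), so the index it prints differs from the one above.

```python
# Hypothetical, shortened list of titles standing in for the real one
titles = ['BookA', 'BookB', 'OriginofSpecies', 'BookC']

# list.index returns the position of the first match
# and raises ValueError if the title is absent
ori = titles.index('OriginofSpecies')

print(ori)
```

Unlike the explicit loop, this version fails loudly when the title is missing instead of leaving `ori` undefined.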

4. Tokenize the corpus

As a next step, we need to transform the corpus into a format that is easier to deal with for the downstream analyses. We will tokenize our corpus, i.e., transform each text into a list of the individual words (called tokens) it is made of. To check the output of our process, we will print the first 20 tokens of "On the Origin of Species".

# Define a list of stop words
stoplist = set('for a of the and to in to be which some is at that we i who whom show via may my our might as well'.split())

# Convert the text to lower case
txts_lower_case = [txt.lower() for txt in txts]

# Transform the text into tokens
txts_split = [txt.split() for txt in txts_lower_case]

# Remove tokens which are part of the list of stop words
texts = [[word for word in txt if word not in stoplist] for txt in txts_split]

# Print the first 20 tokens for the "On the Origin of Species" book
texts[ori][:20]
['on', 'origin', 'species', 'but', 'with', 'regard', 'material', 'world', 'can', 'least', 'go', 'so', 'far', 'this', 'can', 'perceive', 'events', 'are', 'brought', 'about']
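
As a small illustration of these three steps (lower-casing, splitting, and stop word removal), here is a sketch applied to a single sentence. The sentence is a hypothetical stand-in for an already cleaned text; the stop word list is the one defined in the cell above.

```python
# Same stop word list as in the notebook cell
stoplist = set('for a of the and to in to be which some is at that we i who whom show via may my our might as well'.split())

# Hypothetical, already cleaned sentence
sample = "On the Origin of Species by Means of Natural Selection"

# Lower-case, split on whitespace, and drop the stop words
tokens = [word for word in sample.lower().split() if word not in stoplist]

print(tokens)
```

Words such as "on" and "by" survive because they are not in this fairly short stop word list, which is also why "on", "but", and "with" appear in the output above.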

5. Stemming of the tokenized corpus

If you have read On the Origin of Species, you will have noticed that Charles Darwin can use different words to refer to a similar concept. For example, the concept of selection can be described by words such as selection, selective, select or selects. This will dilute the weight given to this concept in the book and potentially bias the results of the analysis.

To solve this issue, it is a common practice to use a stemming process, which will group together the inflected forms of a word so they can be analysed as a single item: the stem. In our On the Origin of Species example, the words related to the concept of selection would be gathered under the select stem.

As we are analysing 20 full books, the stemming algorithm can take several minutes to run. To make the process faster, we will directly load the final results from a pickle file and review the method used to generate it.

import pickle

# Load the stemmed tokens list from the pregenerated pickle file
with open('datasets/texts_stem.p', 'rb') as f:
    texts_stem = pickle.load(f)

# Print the first 20 stemmed tokens from the "On the Origin of Species" book
texts_stem[ori][:20]
['on', 'origin', 'speci', 'but', 'with', 'regard', 'materi', 'world', 'can', 'least', 'go', 'so', 'far', 'thi', 'can', 'perceiv', 'event', 'are', 'brought', 'about']
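
To make the idea of grouping inflected forms concrete, here is a deliberately simplified sketch. It is a toy suffix stripper written for this illustration only, not the algorithm that produced the pickle file; stems such as "speci" and "thi" in the output above suggest a Porter-style stemmer (for instance the one shipped with NLTK), but the notebook does not confirm which one was used. The function name `toy_stem` and the suffix list are assumptions.

```python
from collections import Counter

# Toy suffix list, for illustration only; this is NOT the Porter algorithm
SUFFIXES = ('ion', 'ive', 's')

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['selection', 'selective', 'select', 'selects']
stems = [toy_stem(w) for w in words]

# Before stemming, each form counts once; after stemming, one stem counts four times
print(Counter(words))
print(Counter(stems))
```

The four surface forms collapse into a single stem, so the concept of selection is no longer diluted across several tokens.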

6. Building a bag-of-words model

Now that we have transformed the texts into stemmed tokens, we need to build models that will be usable by downstream algorithms.

First, we will create a universe of all words contained in our corpus of Charles Darwin's books, which we call a dictionary. Then, using the stemmed tokens and the dictionary, we will create a bag-of-words (BoW) model of each of our texts. The BoW models will represent our books as a list of all the unique tokens they contain, each associated with its number of occurrences.

To better understand the structure of such a model, we will print the first five elements of the "On the Origin of Species" BoW model.

# Load the functions allowing to create and use dictionaries
from gensim import corpora

# Create a dictionary from the stemmed tokens
dictionary = corpora.Dictionary(texts_stem)

# Create a bag-of-words model for each book, using the previously generated dictionary
bows = [dictionary.doc2bow(txt) for txt in texts_stem]

# Print the first five elements of the "On the Origin of Species" BoW model
bows[ori][:5]
[(0, 11), (5, 51), (6, 1), (8, 2), (21, 1)]
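
Each pair in this output is a (token id, number of occurrences) tuple. The following pure-Python sketch mimics that structure on two tiny, hypothetical documents. It only reproduces the shape of the result; the way ids are assigned here (order of first appearance) is an assumption and may differ from what gensim does internally.

```python
from collections import Counter

# Hypothetical stemmed tokens for two tiny "books"
docs = [
    ['speci', 'select', 'speci', 'natur'],
    ['plant', 'speci', 'plant'],
]

# Dictionary: give each unique token an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows_toy = [to_bow(doc) for doc in docs]

print(token2id)
print(bows_toy)
```

Tokens that a document does not contain are simply absent from its list, which is why the ids in the real output are not consecutive.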

7. The most common words of a given book

The results returned by the bag-of-words model are certainly easy for a computer to use but hard for a human to interpret. It is not straightforward to understand which stemmed tokens are present in a given book by Charles Darwin, or how many times each occurs.

In order to better understand how the model has been generated and visualize its content, we will transform it into a DataFrame and display the 10 most common stems for the book "On the Origin of Species".

# Import pandas to create and manipulate DataFrames
import pandas as pd

# Convert the BoW model for "On the Origin of Species" into a DataFrame
df_bow_origin = pd.DataFrame(bows[ori])

# Add the column names to the DataFrame
df_bow_origin.columns = ['index', 'occurrences']

# Add a column containing the token corresponding to the dictionary index
df_bow_origin['token'] = df_bow_origin['index'].apply(lambda x: dictionary[x])

# Sort the DataFrame by descending number of occurrences and print the first 10 values
df_bow_origin = df_bow_origin.sort_values('occurrences', ascending=False)
df_bow_origin.head(10)
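
The ranking itself does not depend on pandas. As a quick cross-check, the same idea can be sketched in plain Python; the BoW pairs and the `id2token` mapping below are hypothetical values chosen for illustration, not the real model.

```python
# Hypothetical BoW pairs and id-to-token mapping
bow = [(0, 11), (1, 51), (2, 1), (3, 2)]
id2token = {0: 'select', 1: 'speci', 2: 'natur', 3: 'plant'}

# Sort by number of occurrences, descending, and keep the top three
top = sorted(bow, key=lambda pair: pair[1], reverse=True)[:3]

# Translate the ids back into tokens
top_tokens = [(id2token[token_id], count) for token_id, count in top]

print(top_tokens)
```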

8. Build a tf-idf model

If it weren't for the presence of the stem "speci", we would have a hard time guessing that this BoW model comes from the On the Origin of Species book. The most recurring words are, apart from a few exceptions, very common and unlikely to carry any information peculiar to the given book. We need an additional step to determine which tokens are the most specific to a book.

To do so, we will use a tf-idf model (term frequency–inverse document frequency). This model defines the importance of each word depending on how frequent it is in this text and how infrequent it is in all the other documents. As a result, a high tf-idf score for a word will indicate that this word is specific to this text.
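
To make the scoring concrete, here is a minimal pure-Python sketch of one common tf-idf variant. It assumes the raw count as term frequency, a base-2 logarithm of (number of documents / document frequency) as inverse document frequency, and L2 normalisation of each document. We believe these match gensim's defaults, but the notebook does not state them, so treat them as assumptions. The toy BoW data and the function name `tfidf` are made up for the example.

```python
import math

# Hypothetical BoW models: (token_id, count) pairs for three tiny documents
bows_toy = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 4), (1, 2), (2, 1)],
]

n_docs = len(bows_toy)

# Document frequency: number of documents in which each token id appears
df = {}
for bow in bows_toy:
    for token_id, _ in bow:
        df[token_id] = df.get(token_id, 0) + 1

def tfidf(bow):
    """Weight counts by log2(n_docs / df), drop zero weights, then L2-normalise."""
    weights = [(i, tf * math.log2(n_docs / df[i])) for i, tf in bow]
    weights = [(i, w) for i, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(i, w / norm) for i, w in weights] if norm else []

for bow in bows_toy:
    print(tfidf(bow))
```

Token 0 occurs in every toy document, so its inverse document frequency is log2(1) = 0 and it drops out of every result. Under these assumptions, a stem shared by all 20 books would receive a score of zero no matter how often it occurs, which is exactly the behaviour we want for very common words.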

After computing those scores, we will print the 10 words most specific to the "On the Origin of Species" book (i.e., the 10 words with the highest tf-idf score).

# Load the gensim functions that will allow us to generate tf-idf models
from gensim.models import TfidfModel

# Generate the tf-idf model
model = TfidfModel(bows)

# Print the model for "On the Origin of Species"
model[bows[ori]]
[(8, 0.00020383224047642202), (21, 0.0005716037746542094), (23, 0.0017118699041370883), (27, 0.0006458270601429994), (28, 0.0025678048562056324), (31, 0.0008559349520685442), (35, 0.00101497410751472), (36, 0.00101497410751472), (51, 0.000886740665721021), (54, 0.00202994821502944), (56, 0.0023757190244598344), (57, 0.00010191612023821101), (63, 0.0027544680933525786), (64, 0.000509580601191055), (66, 0.00020383224047642202), (67, 0.0023757190244598344), (68, 0.00202994821502944), (75, 0.0013772340466762893), (76, 0.0004433703328605105), (78, 0.004171843479607349), (80, 0.0020859217398036746), (83, 0.00857405661981314), (84, 0.000509580601191055), (88, 0.002445986885717064), (89, 0.0033632319678609636), (90, 0.000886740665721021), (91, 0.0016747506839411234), (94, 0.000886740665721021), (95, 0.0004433703328605105), (96, 0.003546962662884084), (97, 0.0016306579238113761), (102, 0.037686478293143394), (104, 0.000917245082143899), (106, 0.001417375386254771), (108, 0.0035434384656369273), (109, 0.005299638252386972), (111, 0.0022421546452406423), (114, 0.0015287418035731652), (123, 0.0509769226270304), (125, 0.009580115302391834), (126, 0.004171843479607349), (127, 0.001417375386254771), (137, 0.020026648174904797), (139, 0.007749924721715993), (141, 0.00101916120238211), (143, 0.004751438048919669), (144, 0.0047597336465067894), (154, 0.0010467191774632021), (156, 0.013962508472634907), (165, 0.006781184131501494), (167, 0.000611496721429266), (172, 0.021771758891234606), (176, 0.0033632319678609636), (178, 0.0031035923300235736), (186, 0.009274366941677202), (188, 0.0016747506839411234), (192, 0.006257765219411025), (196, 0.0038749623608579963), (197, 0.0013772340466762893), (198, 0.000886740665721021), (204, 0.0042521261587643135), (207, 0.0022603947105004976), (212, 0.001324909563096743), (214, 0.020395035311583484), (215, 0.021720943436859957), (219, 0.0019374811804289981), (220, 0.004396220545345449), (221, 0.0054618131386103995), (222, 0.0018206043795367997), 
(223, 0.0007086876931273855), (224, 0.007703414568616897), (226, 0.0030574836071463303), (230, 0.005135609712411265), (231, 0.0027544680933525786), (235, 0.002445986885717064), (236, 0.0012916541202859988), (237, 0.0020934383549264042), (241, 0.0029308136968969663), (242, 0.0015865778821689297), (243, 0.008263404280057736), (245, 0.004520789421000995), (246, 0.003465148088099174), (247, 0.001732574044049587), (249, 0.0018840945194337638), (251, 0.008343686959214698), (252, 0.0030449223225441605), (253, 0.0006458270601429994), (261, 0.001324909563096743), (269, 0.0006458270601429994), (271, 0.0008373753419705617), (276, 0.009921627703783398), (278, 0.0034296226479252566), (280, 0.0021260630793821567), (283, 0.02138147122013851), (285, 0.001417375386254771), (287, 0.0013301109985815313), (288, 0.006886170233381447), (290, 0.0035434384656369273), (291, 0.0018206043795367997), (296, 0.009839160268154101), (298, 0.042694255446964965), (300, 0.0016747506839411234), (301, 0.000611496721429266), (302, 0.0042521261587643135), (303, 0.004171843479607349), (304, 0.0020859217398036746), (311, 0.0016306579238113761), (313, 0.007086876931273855), (323, 0.0047597336465067894), (325, 0.021606217490500734), (327, 0.001732574044049587), (329, 0.01790404260679176), (335, 0.0004433703328605105), (336, 0.0023922081541910096), (338, 0.0020859217398036746), (339, 0.001417375386254771), (344, 0.001417375386254771), (345, 0.011527628654373272), (346, 0.00020383224047642202), (348, 0.0006280315064779214), (349, 0.0006458270601429994), (351, 0.0036412087590735995), (354, 0.007980665991489189), (356, 0.0018840945194337638), (358, 0.013772340466762893), (359, 0.0022603947105004976), (362, 0.00101497410751472), (367, 0.0023922081541910096), (369, 0.19772097392783367), (370, 0.043694505108883196), (371, 0.000509580601191055), (372, 0.0023922081541910096), (373, 0.0016306579238113761), (374, 0.001732574044049587), (375, 0.001936406284526009), (376, 0.006489658900271853), (377, 
0.0030449223225441605), (380, 0.0016145676503574987), (387, 0.0007086876931273855), (388, 0.0042521261587643135), (389, 0.0005716037746542094), (391, 0.0023922081541910096), (400, 0.00202994821502944), (406, 0.013772340466762893), (407, 0.0005716037746542094), (409, 0.0035588452033748874), (411, 0.0007086876931273855), (412, 0.004751438048919669), (418, 0.0027544680933525786), (421, 0.0027544680933525786), (424, 0.007538884401734597), (425, 0.00811979286011776), (426, 0.0032291353007149973), (429, 0.000886740665721021), (431, 0.000886740665721021), (432, 0.010149741075147201), (433, 0.0016306579238113761), (434, 0.00405989643005888), (436, 0.0021260630793821567), (442, 0.009212940010656012), (446, 0.004171843479607349), (448, 0.021562415055741965), (449, 0.03291890683694216), (450, 0.0035434384656369273), (453, 0.0011210773226203211), (454, 0.00405989643005888), (456, 0.011301973552502492), (457, 0.0020859217398036746), (458, 0.0003229135300714997), (463, 0.01790404260679176), (464, 0.0027544680933525786), (465, 0.0008559349520685442), (468, 0.00202994821502944), (470, 0.0018206043795367997), (478, 0.010923626277220799), (482, 0.0344479258546676), (484, 0.013772340466762893), (486, 0.0020859217398036746), (489, 0.0054618131386103995), (490, 0.022603947105004983), (491, 0.00405989643005888), (493, 0.0007086876931273855), (497, 0.012839024281028162), (498, 0.0035434384656369273), (502, 0.00637818923814647), (505, 0.011970998987233783), (507, 0.0009687405902144991), (514, 0.0025121260259116855), (520, 0.004280477050004863), (524, 0.0020859217398036746), (526, 0.005716037746542095), (527, 0.0027544680933525786), (529, 0.006318799454769082), (531, 0.003546962662884084), (532, 0.012791353704852355), (534, 0.005320443994326125), (536, 0.0014268256833349542), (541, 0.0011432075493084189), (543, 0.01110604517518251), (544, 0.0031401575323896065), (546, 0.0017148113239626283), (551, 0.004784416308382019), (552, 0.0031731557643378595), (558, 0.0023922081541910096), (559, 
0.00020383224047642202), (561, 0.0022421546452406423), (562, 0.005197722132148762), (563, 0.006908346571257135), (564, 0.0011432075493084189), (565, 0.00405989643005888), (566, 0.0013772340466762893), (569, 0.04034670029030646), (573, 0.010630315396910783), (576, 0.0025121260259116855), (578, 0.0028580188732710474), (579, 0.003159399727384541), (582, 0.0035434384656369273), (586, 0.003563578536689751), (590, 0.0017118699041370883), (592, 0.0010467191774632021), (593, 0.0018206043795367997), (594, 0.015898914757160917), (596, 0.0015865778821689297), (598, 0.0023922081541910096), (600, 0.01251553043882205), (601, 0.00202994821502944), (604, 0.05582357591330961), (605, 0.00637818923814647), (606, 0.0005716037746542094), (616, 0.0010467191774632021), (619, 0.00710481875260304), (620, 0.024945049756828264), (626, 0.0013772340466762893), (628, 0.00041868767098528084), (629, 0.004197875890929496), (633, 0.0013301109985815313), (635, 0.004960813851891699), (636, 0.005991544664479809), (641, 0.006346311528675719), (646, 0.0027544680933525786), (649, 0.005763814327186636), (652, 0.000917245082143899), (653, 0.0011432075493084189), (654, 0.0023757190244598344), (655, 0.0016747506839411234), (657, 0.0030449223225441605), (658, 0.007104097661572994), (660, 0.005144433971887885), (662, 0.003563578536689751), (663, 0.0011878595122299172), (665, 0.00101497410751472), (666, 0.006395676852426178), (667, 0.01623958572023552), (668, 0.005669501545019084), (670, 0.011723254787587865), (674, 0.002445986885717064), (675, 0.0450089246309177), (676, 0.061655829302082535), (678, 0.00040766448095284403), (679, 0.005144433971887885), (681, 0.0007134128416674771), (682, 0.0015865778821689297), (685, 0.0020859217398036746), (686, 0.0006458270601429994), (693, 0.0035670642083373855), (698, 0.0022864150986168378), (705, 0.0034237398082741766), (712, 0.0017118699041370883), (713, 0.0013772340466762893), (720, 0.0029308136968969663), (722, 0.005074870537573601), (726, 0.01771719232818464), (727, 
0.004396220545345449), (728, 0.0031731557643378595), (729, 0.006089844645088321), (730, 0.01028886794377577), (731, 0.0017148113239626283), (740, 0.008357121859533303), (741, 0.0066990027357644935), (743, 0.006930296176198348), (744, 0.01033323296228799), (745, 0.006257765219411025), (748, 0.017254559827750243), (750, 0.07127157073379503), (752, 0.1459896647842414), (753, 0.06980942681543291), (758, 0.004197875890929496), (769, 0.0017118699041370883), (770, 0.0010467191774632021), (771, 0.0007134128416674771), (776, 0.00020934383549264042), (783, 0.006346311528675719), (784, 0.004171843479607349), (785, 0.002834750772509542), (786, 0.004171843479607349), (787, 0.006930296176198348), (788, 0.002955567486908119), (789, 0.003546962662884084), (793, 0.014208195323145987), (797, 0.004197875890929496), (798, 0.024359378580353284), (803, 0.010149741075147201), (805, 0.0034237398082741766), (806, 0.02571547930590961), (810, 0.0011878595122299172), (811, 0.006886170233381447), (815, 0.0018840945194337638), (816, 0.0012560630129558428), (817, 0.0015865778821689297), (819, 0.0026602219971630626), (820, 0.006458270601429995), (821, 0.006207184660047147), (822, 0.0688958517093352), (825, 0.0003229135300714997), (830, 0.001222993442858532), (831, 0.003552048830786497), (832, 0.02256933073236843), (833, 0.01097906002243099), (834, 0.0017118699041370883), (838, 0.001936406284526009), (839, 0.0016306579238113761), (842, 0.0013772340466762893), (848, 0.006886170233381447), (849, 0.0011432075493084189), (851, 0.0020859217398036746), (857, 0.0005716037746542094), (859, 0.008343686959214698), (861, 0.0018840945194337638), (862, 0.001732574044049587), (866, 0.004433703328605105), (867, 0.004784416308382019), (870, 0.0003229135300714997), (873, 0.01886292456358891), (875, 0.0009687405902144991), (878, 0.00202994821502944), (879, 0.005489530011215495), (881, 0.0007086876931273855), (883, 0.0008153289619056881), (885, 0.0023922081541910096), (887, 0.001732574044049587), (891, 
0.0006458270601429994), (892, 0.0013772340466762893), (894, 0.002445986885717064), (897, 0.005991544664479809), (898, 0.0011432075493084189), (903, 0.0011878595122299172), (905, 0.0025678048562056324), (909, 0.002649819126193486), (910, 0.005508936186705157), (917, 0.05877026247301295), (919, 0.0017118699041370883), (921, 0.0011210773226203211), (924, 0.0054015543726251836), (925, 0.004784416308382019), (929, 0.013718490591701027), (931, 0.00101497410751472), (935, 0.0026602219971630626), (936, 0.0025121260259116855), (937, 0.0009687405902144991), (939, 0.0016306579238113761), (944, 0.000886740665721021), (945, 0.00710481875260304), (948, 0.0023757190244598344), (951, 0.012692623057351438), (952, 0.006089844645088321), (953, 0.005074870537573601), (956, 0.010149741075147201), (957, 0.0012560630129558428), (958, 0.0008559349520685442), (962, 0.0011432075493084189), (966, 0.12753430785821307), (967, 0.047703783053191846), (971, 0.0019374811804289981), (973, 0.000509580601191055), (975, 0.00040766448095284403), (976, 0.00203832240476422), (980, 0.009502876097839338), (981, 0.016526808560115472), (982, 0.00101497410751472), (985, 0.03886905667648624), (988, 0.004382393170243073), (992, 0.0014268256833349542), (994, 0.007932889410844648), (995, 0.021725146310165012), (996, 0.011527628654373272), (997, 0.0022603947105004976), (998, 0.03014551231094022), (999, 0.009519467293013579), (1000, 0.00010191612023821101), (1004, 0.0011878595122299172), (1007, 0.011339003090038168), (1009, 0.004520789421000995), (1010, 0.0034296226479252566), (1012, 0.013194663397691361), (1013, 0.006089844645088321), (1016, 0.018752566123830826), (1018, 0.005911134973816238), (1019, 0.00020383224047642202), (1020, 0.0007134128416674771), (1022, 0.0022864150986168378), (1023, 0.0012916541202859988), (1024, 0.055986327757063456), (1026, 0.0017148113239626283), (1029, 0.00710481875260304), (1030, 0.0030449223225441605), (1031, 0.000611496721429266), (1037, 0.0007086876931273855), (1039, 
0.0005716037746542094), (1042, 0.01110604517518251), (1045, 0.02000613211289733), (1048, 0.007703414568616897), (1050, 0.0015865778821689297), (1053, 0.006859245295850513), (1060, 0.002445986885717064), (1061, 0.019686503897576514), (1062, 0.004279674760342721), (1065, 0.009640638326734025), (1067, 0.0023757190244598344), (1068, 0.00931077699007072), (1077, 0.0011878595122299172), (1082, 0.004131702140028868), (1083, 0.06973566050781355), (1085, 0.01623958572023552), (1086, 0.0250310608776441), (1087, 0.001417375386254771), (1088, 0.0005716037746542094), (1089, 0.00968740590214499), (1092, 0.11889753916880946), (1093, 0.01110604517518251), (1094, 0.0013772340466762893), (1095, 0.0018206043795367997), (1096, 0.0045728301972336755), (1098, 0.0006458270601429994), (1103, 0.0031731557643378595), (1106, 0.0047597336465067894), (1107, 0.0022603947105004976), (1108, 0.009593515278639267), (1109, 0.0018206043795367997), (1110, 0.005489530011215495), (1112, 0.001773481331442042), (1115, 0.02203574474682063), (1117, 0.0022603947105004976), (1118, 0.004877073661465615), (1119, 0.0036412087590735995), (1120, 0.010149741075147201), (1122, 0.010690735610069255), (1123, 0.0031731557643378595), (1124, 0.0003229135300714997), (1125, 0.0020934383549264042), (1132, 0.0006280315064779214), (1135, 0.0031731557643378595), (1136, 0.0020859217398036746), (1142, 0.001222993442858532), (1145, 0.000611496721429266), (1146, 0.00101497410751472), (1148, 0.001324909563096743), (1154, 0.0015865778821689297), (1158, 0.0020859217398036746), (1161, 0.0015865778821689297), (1167, 0.005508936186705157), (1169, 0.001417375386254771), (1173, 0.001773481331442042), (1174, 0.0031401575323896065), (1175, 0.0009687405902144991), (1176, 0.0005716037746542094), (1177, 0.0023922081541910096), (1179, 0.02203574474682063), (1180, 0.0008373753419705617), (1182, 0.0015865778821689297), (1185, 0.00020934383549264042), (1186, 0.0027544680933525786), (1187, 0.001417375386254771), (1192, 0.006908346571257135), (1193, 
0.00710481875260304), (1196, 0.003197838426213089), (1198, 0.0273207687812881), (1200, 0.0009687405902144991), (1208, 0.0034296226479252566), (1209, 0.0016306579238113761), (1210, 0.0022603947105004976), (1212, 0.00405989643005888), (1214, 0.0036412087590735995), (1218, 0.00020934383549264042), (1223, 0.018291320788934702), (1224, 0.006257765219411025), (1225, 0.006395676852426178), (1227, 0.0013772340466762893), (1228, 0.004520789421000995), (1229, 0.005233595887316011), (1230, 0.0021402385250024313), (1232, 0.00831501658560942), (1234, 0.004171843479607349), (1236, 0.030149182634514715), (1239, 0.0006280315064779214), (1240, 0.0016306579238113761), (1241, 0.000509580601191055), (1245, 0.004279674760342721), (1246, 0.0013772340466762893), (1247, 0.0028580188732710474), (1248, 0.02138147122013851), (1249, 0.0007086876931273855), (1251, 0.008263404280057736), (1254, 0.009103021897684), (1255, 0.12827247245605677), (1257, 0.0027544680933525786), (1258, 0.004001226422579465), (1259, 0.016687373918429397), (1260, 0.0016306579238113761), (1261, 0.0023922081541910096), (1264, 0.006489658900271853), (1267, 0.0018840945194337638), (1270, 0.0054618131386103995), (1272, 0.0008559349520685442), (1273, 0.0016306579238113761), (1275, 0.004382393170243073), (1278, 0.0016306579238113761), (1281, 0.10809521561292246), (1283, 0.005716037746542095), (1284, 0.0015865778821689297), (1285, 0.0022864150986168378), (1286, 0.0012560630129558428), (1290, 0.0042521261587643135), (1292, 0.00020383224047642202), (1293, 0.01116471518266192), (1294, 0.0018206043795367997), (1296, 0.000886740665721021), (1297, 0.0033495013678822468), (1300, 0.0006458270601429994), (1301, 0.005144433971887885), (1302, 0.005508936186705157), (1305, 0.00101497410751472), (1307, 0.006781184131501494), (1310, 0.0015287418035731652), (1311, 0.005442939722808651), (1314, 0.023338791534550322), (1315, 0.11339003090038167), (1317, 0.006280315064779213), (1319, 0.01116471518266192), (1320, 0.002751735246431697), (1322, 
0.0011432075493084189), (1323, 0.041978758909294964), (1324, 0.010923626277220799), (1325, 0.001732574044049587), (1326, 0.00020934383549264042), (1327, 0.00020383224047642202), (1328, 0.0035588452033748874), (1331, 0.0022168516643025524), (1333, 0.09545783036725297), (1334, 0.0005716037746542094), (1335, 0.0031731557643378595), (1336, 0.0015865778821689297), (1337, 0.001222993442858532), (1338, 0.0021260630793821567), (1339, 0.0005716037746542094), (1340, 0.0017118699041370883), (1342, 0.07023893638049075), (1346, 0.013772340466762893), (1347, 0.0022603947105004976), (1349, 0.005991544664479809), (1351, 0.0012560630129558428), (1355, 0.0025121260259116855), (1356, 0.000611496721429266), (1359, 0.0030449223225441605), (1360, 0.0014654068484484832), (1363, 0.04312483011148393), (1364, 0.01323945473293149), (1379, 0.0022864150986168378), (1380, 0.0006280315064779214), (1381, 0.000886740665721021), (1390, 0.008504252317528627), (1393, 0.0006458270601429994), (1396, 0.00800245284515893), (1402, 0.0007086876931273855), (1406, 0.0020859217398036746), (1407, 0.0084240363243497), (1410, 0.0011432075493084189), (1411, 0.0032291353007149973), (1414, 0.001417375386254771), (1416, 0.0023757190244598344), (1418, 0.001222993442858532), (1423, 0.005320443994326125), (1425, 0.0022864150986168378), (1426, 0.0025678048562056324), (1428, 0.026796010943057974), (1430, 0.02341297879349692), (1433, 0.006287641521196303), (1439, 0.0027544680933525786), (1441, 0.0036412087590735995), (1442, 0.0023922081541910096), (1443, 0.0013772340466762893), (1445, 0.038426710078508466), (1449, 0.0031731557643378595), (1450, 0.0008373753419705617), (1451, 0.0008559349520685442), (1457, 0.0054015543726251836), (1464, 0.004877073661465615), (1465, 0.0017148113239626283), (1471, 0.031288248913130784), (1476, 0.003197838426213089), (1478, 0.0025678048562056324), (1480, 0.000917245082143899), (1482, 0.006257765219411025), (1488, 0.0011432075493084189), (1489, 0.005166616481143995), (1492, 
0.007176624462573028), (1493, 0.0010467191774632021), (1497, 0.006886170233381447), (1500, 0.000509580601191055), (1502, 0.005074870537573601), (1503, 0.002751735246431697), (1506, 0.02482211360998778), (1520, 0.0027214698614043257), (1523, 0.016526808560115472), (1524, 0.0729948323921207), (1525, 0.004171843479607349), (1530, 0.0023757190244598344), (1532, 0.0036412087590735995), (1533, 0.0023757190244598344), (1534, 0.0021260630793821567), (1535, 0.0008373753419705617), (1536, 0.024095381566331106), (1540, 0.0013772340466762893), (1541, 0.00857405661981314), (1542, 0.0014268256833349542), (1543, 0.00931077699007072), (1544, 0.0032291353007149973), (1546, 0.009568832616764038), (1548, 0.006395676852426178), (1554, 0.017817892683448758), (1557, 0.0013772340466762893), (1559, 0.001732574044049587), (1561, 0.005911134973816238), (1566, 0.004960813851891699), (1568, 0.013772340466762893), (1572, 0.0015865778821689297), (1576, 0.0016306579238113761), (1577, 0.00020383224047642202), (1578, 0.000611496721429266), (1581, 0.012414369320094295), (1583, 0.0047597336465067894), (1587, 0.0017148113239626283), (1588, 0.0013772340466762893), (1589, 0.004171843479607349), (1590, 0.0020859217398036746), (1598, 0.00040766448095284403), (1601, 0.0022168516643025524), (1605, 0.0023757190244598344), (1607, 0.0034296226479252566), (1609, 0.004784416308382019), (1613, 0.005716037746542095), (1616, 0.008263404280057736), (1619, 0.0008559349520685442), (1624, 0.005135609712411265), (1625, 0.017254559827750243), (1627, 0.0011432075493084189), (1628, 0.040583868000448865), (1629, 0.002445986885717064), (1635, 0.007703414568616897), (1636, 0.003563578536689751), (1637, 0.013772340466762893), (1640, 0.009103021897684), (1642, 0.0035434384656369273), (1643, 0.0034296226479252566), (1644, 0.0020859217398036746), (1646, 0.006257765219411025), (1647, 0.0014268256833349542), (1648, 0.005991544664479809), (1649, 0.000509580601191055), (1650, 0.0012560630129558428), (1655, 0.0032613158476227522), 
(1657, 0.00998777978334468), (1661, 0.0016747506839411234), (1665, 0.002751735246431697), (1666, 0.0015287418035731652), (1667, 0.0012916541202859988), (1668, 0.004751438048919669), (1670, 0.03715424535252361), (1677, 0.0015287418035731652), (1680, 0.00101916120238211), (1684, 0.008263404280057736), (1686, 0.0014654068484484832), (1692, 0.0008153289619056881), (1695, 0.0012916541202859988), (1696, 0.012179689290176642), (1701, 0.005135609712411265), (1705, 0.0031035923300235736), (1708, 0.0035588452033748874), (1710, 0.001222993442858532), (1712, 0.000305748360714633), (1715, 0.0011878595122299172), (1716, 0.001732574044049587), (1719, 0.0003229135300714997), (1722, 0.0027544680933525786), (1727, 0.0011432075493084189), (1728, 0.10571949658846262), (1735, 0.00202994821502944), (1743, 0.0020934383549264042), (1744, 0.0009687405902144991), (1752, 0.026602219971630627), (1754, 0.007282417518147199), (1759, 0.008263404280057736), (1761, 0.0031401575323896065), (1762, 0.0008373753419705617), (1763, 0.0059392975611495865), (1766, 0.0014654068484484832), (1768, 0.007176624462573028), (1770, 0.0023757190244598344), (1772, 0.01626276408930234), (1774, 0.0031731557643378595), (1775, 0.0054618131386103995), (1778, 0.006859245295850513), (1779, 0.007282417518147199), (1780, 0.0010467191774632021), (1781, 0.002834750772509542), (1782, 0.004960813851891699), (1784, 0.0027544680933525786), (1785, 0.006395676852426178), (1793, 0.0020859217398036746), (1794, 0.0008559349520685442), (1797, 0.01883056894550797), (1798, 0.02109663688930968), (1799, 0.0047597336465067894), (1806, 0.002955567486908119), (1808, 0.006346311528675719), (1813, 0.0004433703328605105), (1815, 0.0013772340466762893), (1816, 0.0013772340466762893), (1820, 0.0023757190244598344), (1823, 0.008559349520685442), (1825, 0.0054618131386103995), (1832, 0.0011210773226203211), (1833, 0.0031035923300235736), (1834, 0.0008373753419705617), (1835, 0.0007086876931273855), (1838, 0.0005716037746542094), (1840, 
0.0006280315064779214), (1841, 0.0015865778821689297), (1846, 0.006886170233381447), (1847, 0.005144433971887885), (1848, 0.009640638326734025), (1849, 0.00101916120238211), (1850, 0.005508936186705157), (1851, 0.009415284472753985), (1853, 0.017118699041370884), (1854, 0.0017118699041370883), (1857, 0.004382393170243073), (1858, 0.0029062217706434974), (1859, 0.04178560929766652), (1860, 0.003552048830786497), (1862, 0.00710481875260304), (1864, 0.0021260630793821567), (1866, 0.004131702140028868), (1869, 0.006346311528675719), (1876, 0.00020934383549264042), (1878, 0.07361040587789479), (1881, 0.020026648174904797), (1884, 0.006089844645088321), (1885, 0.00857405661981314), (1898, 0.005812443541286995), (1899, 0.0027544680933525786), (1904, 0.005508936186705157), (1905, 0.005135609712411265), (1906, 0.0008559349520685442), (1907, 0.0006458270601429994), (1908, 0.00040766448095284403), (1909, 0.0034237398082741766), (1910, 0.0027544680933525786), (1915, 0.0023922081541910096), (1916, 0.0013301109985815313), (1922, 0.01251553043882205), (1923, 0.0031731557643378595), (1926, 0.0025678048562056324), (1933, 0.001834490164287798), (1934, 0.000305748360714633), (1935, 0.007703414568616897), (1938, 0.006207184660047147), (1940, 0.006847479616548353), (1941, 0.0033495013678822468), (1942, 0.00010191612023821101), (1944, 0.0011878595122299172), (1945, 0.0012560630129558428), (1948, 0.0005716037746542094), (1949, 0.00499388989167234), (1951, 0.010690735610069255), (1952, 0.0023757190244598344), (1953, 0.0027544680933525786), (1958, 0.004784416308382019), (1959, 0.00020934383549264042), (1964, 0.007282417518147199), (1965, 0.00101497410751472), (1966, 0.01097906002243099), (1967, 0.00499388989167234), (1968, 0.005135609712411265), (1974, 0.01417375386254771), (1979, 0.0021260630793821567), (1980, 0.004279674760342721), (1981, 0.00203832240476422), (1982, 0.0013772340466762893), (1985, 0.015865778821689297), (1986, 0.061627316548935177), (1990, 0.011961040770955049), (1991, 
0.0003229135300714997), (1993, 0.02065851070014434), (1994, 0.009134766967632482), (1997, 0.0025833082405719975), (2000, 0.0025121260259116855), (2001, 0.0007134128416674771), (2002, 0.001732574044049587), (2003, 0.00101497410751472), (2007, 0.0016306579238113761), (2010, 0.003977532874360168), (2012, 0.013607349307021628), (2013, 0.0013301109985815313), (2014, 0.0011210773226203211), (2018, 0.001417375386254771), (2020, 0.0016306579238113761), (2021, 0.0017118699041370883), (2022, 0.008255205739295092), (2023, 0.0005716037746542094), (2026, 0.00481490821633073), (2030, 0.0005716037746542094), (2031, 0.0018206043795367997), (2032, 0.0020859217398036746), (2037, 0.0005716037746542094), (2039, 0.008583097255198258), (2044, 0.0059392975611495865), (2045, 0.0010467191774632021), (2049, 0.0035434384656369273), (2051, 0.0008153289619056881), (2053, 0.003563578536689751), (2054, 0.000886740665721021), (2055, 0.001732574044049587), (2065, 0.001834490164287798), (2066, 0.0005716037746542094), (2067, 0.009921627703783398), (2068, 0.0018206043795367997), (2069, 0.0005716037746542094), (2073, 0.004751438048919669), (2074, 0.00101497410751472), (2076, 0.0031035923300235736), (2078, 0.0015865778821689297), (2082, 0.0011878595122299172), (2083, 0.0014654068484484832), (2084, 0.0011432075493084189), (2086, 0.0054618131386103995), (2087, 0.011304567116602583), (2088, 0.0003229135300714997), (2089, 0.000917245082143899), (2090, 0.003197838426213089), (2095, 0.0013772340466762893), (2096, 0.000886740665721021), (2102, 0.0027544680933525786), (2108, 0.004171843479607349), (2110, 0.0008559349520685442), (2111, 0.000917245082143899), (2114, 0.007086876931273855), (2116, 0.0023757190244598344), (2117, 0.004131702140028868), (2118, 0.02065851070014434), (2119, 0.004843702951072495), (2125, 0.0011878595122299172), (2127, 0.0023922081541910096), (2128, 0.0059392975611495865), (2133, 0.001417375386254771), (2134, 0.0038749623608579963), (2135, 0.0020859217398036746), (2136, 
0.007932889410844648), (2138, 0.0021260630793821567), (2144, 0.039395450668722964), (2145, 0.001417375386254771), (2148, 0.0022421546452406423), (2152, 0.002302782190419045), (2154, 0.0014268256833349542), (2155, 0.0054618131386103995), (2156, 0.04131702140028868), (2158, 0.07758980825058934), (2159, 0.024359378580353284), (2162, 0.014290094366355236), (2164, 0.3274137142248521), (2165, 0.007537295658628679), (2169, 0.002547903005955275), (2170, 0.0012916541202859988), (2172, 0.00202994821502944), (2176, 0.006089844645088321), (2180, 0.0029308136968969663), (2183, 0.0026602219971630626), (2186, 0.02858018873271047), (2187, 0.04536455245963284), (2195, 0.013718490591701027), (2197, 0.02837570130307267), (2200, 0.0007086876931273855), (2202, 0.0013772340466762893), (2206, 0.01657650946497207), (2208, 0.00101497410751472), (2210, 0.026167446886849497), (2222, 0.0018206043795367997), (2223, 0.0018206043795367997), (2226, 0.0021402385250024313), (2227, 0.00202994821502944), (2229, 0.0054618131386103995), (2232, 0.00010191612023821101), (2233, 0.002445986885717064), (2234, 0.006346311528675719), (2235, 0.0047597336465067894), (2237, 0.00041868767098528084), (2238, 0.00020383224047642202), (2240, 0.0012916541202859988), (2241, 0.009640638326734025), (2242, 0.002834750772509542), (2244, 0.07437063852051962), (2249, 0.000886740665721021), (2250, 0.0022168516643025524), (2255, 0.005716037746542095), (2258, 0.0023922081541910096), (2264, 0.000305748360714633), (2266, 0.0020859217398036746), (2267, 0.00460556438083809), (2272, 0.004131702140028868), (2273, 0.004171843479607349), (2274, 0.002751735246431697), (2277, 0.0021260630793821567), (2279, 0.0023922081541910096), (2280, 0.005166616481143995), (2281, 0.0008373753419705617), (2282, 0.001732574044049587), (2284, 0.01306645463452909), (2285, 0.0023757190244598344), (2289, 0.006420715575007294), (2290, 0.005299638252386972), (2292, 0.004279674760342721), (2294, 0.0015865778821689297), (2296, 0.0015865778821689297), (2297, 
0.00041868767098528084), (2300, 0.004001226422579465), (2302, 0.009103021897684), (2303, 0.00203832240476422), (2305, 0.005508936186705157), (2309, 0.01420963750520608), (2311, 0.008263404280057736), (2313, 0.0074308490705047225), (2315, 0.000886740665721021), (2317, 0.0022421546452406423), (2319, 0.0022864150986168378), (2320, 0.0031035923300235736), (2322, 0.0039903329957445945), (2325, 0.14818621969714912), (2328, 0.00101497410751472), (2330, 0.04287028309906571), (2332, 0.0020934383549264042), (2335, 0.004001226422579465), (2336, 0.01116471518266192), (2337, 0.0031731557643378595), (2339, 0.0030574836071463303), (2340, 0.03718531926025981), (2342, 0.0014654068484484832), (2343, 0.0025678048562056324), (2344, 0.0014654068484484832), (2346, 0.004960813851891699), (2349, 0.0034237398082741766), (2352, 0.0013301109985815313), (2353, 0.0009687405902144991), (2357, 0.06874270623335639), (2359, 0.0015865778821689297), (2361, 0.002649819126193486), (2363, 0.0006458270601429994), (2364, 0.007439876777389404), (2368, 0.027404300902897444), (2369, 0.0042521261587643135), (2370, 0.0013772340466762893), (2375, 0.007980665991489189), (2376, 0.003159399727384541), (2377, 0.0030574836071463303), (2378, 0.00857405661981314), (2381, 0.00202994821502944), (2382, 0.0016747506839411234), (2383, 0.000886740665721021), (2385, 0.0006458270601429994), (2387, 0.0003229135300714997), (2389, 0.0042521261587643135), (2392, 0.0006280315064779214), (2393, 0.00203832240476422), (2396, 0.006257765219411025), (2399, 0.011084258321512764), (2401, 0.00101497410751472), (2408, 0.016538163003918593), (2409, 0.019005752195678675), (2417, 0.0007086876931273855), (2418, 0.004131702140028868), (2419, 0.0013772340466762893), (2420, 0.0013301109985815313), (2421, 0.01064088798865225), (2422, 0.014187850651536335), (2423, 0.0019374811804289981), (2431, 0.0005716037746542094), (2432, 0.003159399727384541), (2433, 0.00010191612023821101), (2434, 0.0020934383549264042), (2444, 0.0005716037746542094), (2446, 
0.02909879313347702), (2449, 0.0047597336465067894), ...]

9. The results of the tf-idf model

Once again, the format of these results is hard for a human to interpret. We will therefore convert them into a more readable form and display the 10 most specific words for "On the Origin of Species."

# Convert the tf-idf model for "On the Origin of Species" into a DataFrame
df_tfidf = pd.DataFrame(model[bows[ori]])
# Name the columns of the DataFrame id and score
df_tfidf.columns = ['id', 'score']
# Add the tokens corresponding to the numerical indices for better readability
df_tfidf['token'] = df_tfidf['id'].apply(lambda x: dictionary[x])
# Sort the DataFrame by descending tf-idf score and print the first 10 rows.
df_tfidf = df_tfidf.sort_values('score', ascending=False)
df_tfidf.head(10)

10. Compute distance between texts

The tf-idf model now returns, for each book, the stemmed tokens that are most specific to it. We can see, for example, that topics such as selection, breeding, or domestication define "On the Origin of Species" (and yes, in this book, Charles Darwin talks quite a lot about pigeons too). Now that we have a model associating tokens with how specific they are to each book, we can measure how related the books are to each other.

To this end, we will use a measure called cosine similarity and display the results as a similarity matrix, i.e., a matrix showing the pairwise similarity between all of Darwin's books (a higher score means two books are more closely related).

# Load the library allowing similarity computations
from gensim import similarities
# Compute the similarity matrix (pairwise distance between all texts)
sims = similarities.MatrixSimilarity(model[bows])
# Transform the resulting list into a dataframe
sim_df = pd.DataFrame(list(sims))
# Add the titles of the books as columns and index of the dataframe
sim_df.columns = titles
sim_df.index = titles
# Print the resulting matrix
sim_df
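To make the cosine similarity used above more concrete, here is a minimal sketch that computes it by hand for sparse vectors stored as (id, weight) pairs, the same shape as the tf-idf output shown earlier. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration; they are not part of gensim or of the notebook's data.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only token ids present in both vectors contribute to the dot product
    dot = sum(weight * b[token_id] for token_id, weight in a.items() if token_id in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # An empty vector has no direction; treat its similarity as 0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (not taken from Darwin's books)
book_a = [(0, 0.6), (1, 0.8)]
book_b = [(1, 0.8), (2, 0.6)]
book_c = [(3, 1.0)]

print(cosine_similarity(book_a, book_a))  # a vector compared with itself
print(cosine_similarity(book_a, book_b))  # one shared token
print(cosine_similarity(book_a, book_c))  # no shared tokens
```

Because tf-idf weights are never negative, the score stays between 0 (no shared tokens) and 1 (identical direction), which is why the diagonal of the matrix above is 1.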

11. The book most similar to "On the Origin of Species"

We now have a matrix containing the similarity measures between every pair of books by Charles Darwin! We can use this matrix to quickly extract the information we need, i.e., the similarity between one book and one or several others.

As a first step, we will display which books are the most similar to "On the Origin of Species." More specifically, we will produce a bar chart showing all books ranked by how similar they are to Darwin's landmark work.

# This is needed to display plots in a notebook
%matplotlib inline
# Import libraries
import matplotlib.pyplot as plt
# Select the column corresponding to "On the Origin of Species"
v = sim_df['OriginofSpecies']
# Sort by ascending scores
v_sorted = v.sort_values()
# Plot this data as a horizontal bar plot
v_sorted.plot.barh()
# Modify the axes labels and plot title for better readability
plt.xlabel("Score")
plt.ylabel("Book")
plt.title("Similarity")
Image in a Jupyter notebook
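The same column lookup can be turned into a small recommendation helper. The sketch below assumes a square similarity DataFrame indexed by title on both axes, like `sim_df`; the function `most_similar` and the toy matrix are hypothetical and use made-up values.

```python
import pandas as pd

def most_similar(sim_df, title, n=3):
    """Return the n books most similar to `title`, excluding the book itself."""
    # Drop the book's own entry, which always has the highest score
    scores = sim_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Hypothetical toy similarity matrix; the values are made up for illustration
toy_titles = ["BookA", "BookB", "BookC", "BookD"]
toy_sim_df = pd.DataFrame(
    [
        [1.0, 0.7, 0.2, 0.4],
        [0.7, 1.0, 0.1, 0.3],
        [0.2, 0.1, 1.0, 0.5],
        [0.4, 0.3, 0.5, 1.0],
    ],
    index=toy_titles,
    columns=toy_titles,
)

top = most_similar(toy_sim_df, "BookA", n=2)
print(top)
```

With the notebook's data, a call such as `most_similar(sim_df, 'OriginofSpecies')` would list the top of the bar chart without the book's own entry.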

12. Which books have similar content?

This turns out to be extremely useful if we want to find the work most similar to a given book. For example, we have just seen that if you enjoyed "On the Origin of Species," you can read books discussing similar concepts such as "The Variation of Animals and Plants under Domestication" or "The Descent of Man, and Selection in Relation to Sex." If you are familiar with Darwin's work, these suggestions will likely seem natural to you. Indeed, "On the Origin of Species" has a whole chapter about domestication, and "The Descent of Man, and Selection in Relation to Sex" applies the theory of natural selection to human evolution. Hence, the results make sense.

However, we now want a better understanding of the big picture: how Darwin's books are related to each other overall, in terms of the topics discussed. To this end, we will represent the whole similarity matrix as a dendrogram, which is a standard tool to display such data. This last approach displays all the information about book similarities at once. For example, we can find a book's closest relative, but we can also see which groups of books have similar topics (e.g., the cluster about Charles Darwin's personal life, with his autobiography and letters). If you are familiar with Darwin's bibliography, the results should not surprise you too much, which indicates that the method gives good results. Otherwise, the next time you read one of the author's books, you will know which other books to read next in order to learn more about the topics it addresses.

# Import libraries
from scipy.cluster import hierarchy
# Compute the clusters from the similarity matrix,
# using the Ward variance minimization algorithm
Z = hierarchy.linkage(sim_df, 'ward')
# Display this result as a horizontal dendrogram
hierarchy.dendrogram(Z, leaf_font_size=8, labels=sim_df.index, orientation='left')
{'color_list': ['g', 'g', 'r', 'r', 'r', 'c', 'm', 'm', 'm', 'y', 'k', 'k', 'k', 'k', 'b', 'b', 'b', 'b', 'b'], 'dcoord': [[0.0, 0.26140245804628653, 0.26140245804628653, 0.0], [0.0, 1.3236891503157402, 1.3236891503157402, 0.26140245804628653], [0.0, 0.8674414401120458, 0.8674414401120458, 0.0], [0.0, 1.1416964159543324, 1.1416964159543324, 0.8674414401120458], [0.0, 1.1971849472385157, 1.1971849472385157, 1.1416964159543324], [0.0, 0.676323296013227, 0.676323296013227, 0.0], [0.0, 0.895141801056005, 0.895141801056005, 0.0], [0.0, 1.1046849489956518, 1.1046849489956518, 0.0], [0.895141801056005, 1.5287533196378202, 1.5287533196378202, 1.1046849489956518], [0.0, 0.8619943484228049, 0.8619943484228049, 0.0], [0.0, 1.064324404547908, 1.064324404547908, 0.0], [0.0, 1.365011903546023, 1.365011903546023, 0.0], [0.0, 1.4204909279171327, 1.4204909279171327, 1.365011903546023], [1.064324404547908, 1.5488868742803852, 1.5488868742803852, 1.4204909279171327], [0.8619943484228049, 1.8572464779479094, 1.8572464779479094, 1.5488868742803852], [1.5287533196378202, 1.9985744355383679, 1.9985744355383679, 1.8572464779479094], [0.676323296013227, 2.085712731266052, 2.085712731266052, 1.9985744355383679], [1.1971849472385157, 2.1421801586882854, 2.1421801586882854, 2.085712731266052], [1.3236891503157402, 2.5482366072620737, 2.5482366072620737, 2.1421801586882854]], 'icoord': [[15.0, 15.0, 25.0, 25.0], [5.0, 5.0, 20.0, 20.0], [55.0, 55.0, 65.0, 65.0], [45.0, 45.0, 60.0, 60.0], [35.0, 35.0, 52.5, 52.5], [75.0, 75.0, 85.0, 85.0], [95.0, 95.0, 105.0, 105.0], [115.0, 115.0, 125.0, 125.0], [100.0, 100.0, 120.0, 120.0], [135.0, 135.0, 145.0, 145.0], [155.0, 155.0, 165.0, 165.0], [185.0, 185.0, 195.0, 195.0], [175.0, 175.0, 190.0, 190.0], [160.0, 160.0, 182.5, 182.5], [140.0, 140.0, 171.25, 171.25], [110.0, 110.0, 155.625, 155.625], [80.0, 80.0, 132.8125, 132.8125], [43.75, 43.75, 106.40625, 106.40625], [12.5, 12.5, 75.078125, 75.078125]], 'ivl': ['Autobiography', 'LifeandLettersVol1', 
'LifeandLettersVol2', 'DescentofMan', 'FoundationsOriginofSpecies', 'OriginofSpecies', 'VariationPlantsAnimalsDomestication', 'MonographCirripedia', 'MonographCirripediaVol2', 'GeologicalObservationsSouthAmerica', 'VolcanicIslands', 'CoralReefs', 'VoyageBeagle', 'DifferentFormsofFlowers', 'EffectsCrossSelfFertilization', 'InsectivorousPlants', 'MovementClimbingPlants', 'ExpressionofEmotionManAnimals', 'FormationVegetableMould', 'PowerMovementPlants'], 'leaves': [0, 10, 11, 2, 7, 15, 17, 12, 13, 8, 18, 1, 19, 3, 4, 9, 14, 5, 6, 16]}
Image in a Jupyter notebook
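When `linkage` receives a two-dimensional array, SciPy treats each row as an observation vector, so the call above groups books whose rows of similarity scores look alike. A possible alternative, sketched below under stated assumptions, is to convert similarities into distances (1 - similarity) and pass them to `linkage` in condensed form. The toy matrix, the labels, and the choice of average linkage are illustrative assumptions, not the notebook's method.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Hypothetical toy similarity matrix; the values are made up for illustration
toy_labels = ["BookA", "BookB", "BookC", "BookD"]
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

# Turn similarities into distances: identical books are at distance 0
toy_dist = 1.0 - toy_sim
np.fill_diagonal(toy_dist, 0.0)

# linkage expects the condensed (upper-triangle) form of a distance matrix
condensed = squareform(toy_dist)
Z_toy = hierarchy.linkage(condensed, method="average")

# Cut the tree into two flat clusters and show which book lands where
cluster_ids = hierarchy.fcluster(Z_toy, t=2, criterion="maxclust")
print(dict(zip(toy_labels, cluster_ids)))
```

Average linkage is used here because Ward's method is defined for Euclidean distances between observation vectors, which a precomputed cosine distance does not provide.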