GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/document_clustering.ipynb
Kernel: Python 3
%matplotlib inline

Clustering text documents using k-means

This example shows how scikit-learn can be used to cluster documents by topic using a bag-of-words approach. It stores the features in a scipy.sparse matrix instead of a standard numpy array.
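To make the sparse-matrix point concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is purely illustrative and not part of the 20 newsgroups data). It shows that the vectorizer returns a scipy.sparse object and that only a fraction of its entries are non-zero.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the shuttle reached orbit",
    "the orbit of the moon",
    "render the image with graphics software",
]

X_toy = TfidfVectorizer().fit_transform(docs)

print(sparse.issparse(X_toy))   # the result is a scipy.sparse object
print(X_toy.shape)              # (n_documents, n_vocabulary_terms)

# Fraction of entries that are actually stored (non-zero).
density = X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1])
print("fraction of non-zero entries: %.2f" % density)
```

On a real corpus with hundreds of thousands of features the density is far lower, which is why a dense array would be wasteful.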

Two feature extraction methods can be used in this example:

  • TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence compute a (sparse) word occurrence frequency matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.

  • HashingVectorizer hashes word occurrences to a fixed-dimensional space, possibly with collisions. The word count vectors are then normalized so that each has an l2-norm equal to one (projected onto the Euclidean unit sphere), which seems to be important for k-means to work in high-dimensional space.

    HashingVectorizer does not provide IDF weighting as this is a stateless model (the fit method does nothing). When IDF weighting is needed it can be added by pipelining its output to a TfidfTransformer instance.
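The hashing-plus-IDF combination described above can be sketched as follows. This is a minimal illustration on a made-up corpus, not a cell from this notebook; the `n_features` value and the `docs` list are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the shuttle reached orbit",
    "the orbit of the moon",
    "render the image with graphics software",
]

# Stateless hashing to a fixed number of columns. norm=None leaves raw counts
# so the TfidfTransformer can apply IDF weighting and l2 normalization itself.
hasher = HashingVectorizer(
    n_features=2 ** 10, stop_words="english", alternate_sign=False, norm=None
)
hashed_tfidf = make_pipeline(hasher, TfidfTransformer())
X_hashed = hashed_tfidf.fit_transform(docs)

# Each row should now have unit l2 norm.
row_norms = np.sqrt(np.asarray(X_hashed.multiply(X_hashed).sum(axis=1)).ravel())
print(X_hashed.shape)
print(row_norms)
```

Because the hasher keeps no vocabulary, the column count is fixed up front by `n_features`, regardless of how many distinct words the corpus contains.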

Two algorithms are demoed: ordinary k-means and its more scalable cousin minibatch k-means.
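As a rough illustration of the two estimators side by side, the sketch below fits both on synthetic, well-separated blobs rather than on text. The blob centers, `batch_size`, and `n_init` values are arbitrary choices made for this example.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three clearly separated groups (illustrative only).
X_blobs, y_blobs = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# Ordinary k-means uses every sample in each iteration.
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)

# Minibatch k-means updates the centers from small random batches.
mini = MiniBatchKMeans(
    n_clusters=3, batch_size=100, n_init=3, random_state=0
).fit(X_blobs)

print("k-means ARI:           %.3f" % adjusted_rand_score(y_blobs, full.labels_))
print("minibatch k-means ARI: %.3f" % adjusted_rand_score(y_blobs, mini.labels_))
print("k-means inertia:           %.1f" % full.inertia_)
print("minibatch k-means inertia: %.1f" % mini.inertia_)
```

On easy data like this the two should agree closely; the minibatch variant trades a little accuracy for speed, which matters more as the number of samples grows.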

Additionally, latent semantic analysis can also be used to reduce dimensionality and discover latent patterns in the data.
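One common way to apply latent semantic analysis in scikit-learn is a TruncatedSVD on the TF-IDF matrix followed by re-normalization. The sketch below assumes a made-up six-document corpus and two components; neither comes from this notebook, where far more components would be appropriate.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus with two rough topics (space and graphics).
docs = [
    "the rocket launch put the satellite in orbit",
    "the satellite orbit decays after launch",
    "rocket launch delayed so the satellite stays on the ground",
    "the graphics card renders the image",
    "image rendering needs a fast graphics card",
    "graphics software can render an image",
]

X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# SVD output is not normalized, so re-normalize rows before clustering.
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0), Normalizer(copy=False)
)
X_lsa = lsa.fit_transform(X_tfidf)

explained = lsa.named_steps["truncatedsvd"].explained_variance_ratio_.sum()
print(X_lsa.shape)
print("explained variance of the SVD step: %.0f%%" % (explained * 100))
```

The reduced matrix is dense and much narrower than the TF-IDF matrix, so k-means runs faster on it.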

It can be noted that k-means (and minibatch k-means) are very sensitive to feature scaling and that in this case the IDF weighting helps improve the quality of the clustering by quite a lot as measured against the "ground truth" provided by the class label assignments of the 20 newsgroups dataset.

This improvement is not visible in the Silhouette Coefficient, which is small for both, as this measure seems to suffer from the phenomenon called "Concentration of Measure" or "Curse of Dimensionality" for high-dimensional datasets such as text data. Other measures such as V-measure (information-theoretic) and the Adjusted Rand Index (pair-counting) depend only on cluster assignments rather than on distances, and hence are not affected by the curse of dimensionality.
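The point that these scores look only at cluster assignments can be checked on a tiny hand-made example. The label vectors below are invented for illustration: the first prediction groups the samples exactly like the truth but uses different cluster IDs, and the second misplaces one sample.

```python
from sklearn import metrics

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as the truth, but with permuted cluster IDs.
permuted = [1, 1, 1, 2, 2, 2, 0, 0, 0]

# One sample from the first group is assigned to the wrong cluster.
noisy = [1, 1, 2, 2, 2, 2, 0, 0, 0]

v_perm = metrics.v_measure_score(true_labels, permuted)
ari_perm = metrics.adjusted_rand_score(true_labels, permuted)
v_noisy = metrics.v_measure_score(true_labels, noisy)
ari_noisy = metrics.adjusted_rand_score(true_labels, noisy)

print("permuted IDs: V-measure=%.3f, ARI=%.3f" % (v_perm, ari_perm))
print("one mistake:  V-measure=%.3f, ARI=%.3f" % (v_noisy, ari_noisy))
```

No feature matrix is passed to either function, so the dimensionality of the data cannot influence the result.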

Note: as k-means optimizes a non-convex objective function, it will likely end up in a local optimum. Several runs with independent random initializations may be needed to reach a good solution.
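A small experiment can show how much the result depends on the starting point. The sketch below uses synthetic blobs (an assumption made for brevity) and purely random initialization; it compares several single-start runs with one fit that lets scikit-learn keep the best of ten starts via `n_init`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five groups (illustrative only).
X_blobs, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=1)

# Five independent runs, each with a single random initialization.
single_run_inertias = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    .fit(X_blobs)
    .inertia_
    for seed in range(5)
]
best_single = min(single_run_inertias)

# One fit that tries ten initializations and keeps the lowest-inertia result.
best_of_ten = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X_blobs)

print("single-start inertias:", ["%.1f" % v for v in single_run_inertias])
print("best single start:     %.1f" % best_single)
print("n_init=10 inertia:     %.1f" % best_of_ten.inertia_)
```

If the single-start inertias differ noticeably from each other, some runs ended in a poorer local optimum, which is the behavior the note above warns about.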

Load in Data

# Author: Peter Prettenhofer <[email protected]>
#         Lars Buitinck
# License: BSD 3 clause

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans
from time import time

import numpy as np
import pandas as pd

# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()
Downloading 20news dataset. This may take a few minutes. Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 documents
4 categories

Examine Data and Labels

labels = dataset.target
true_k = np.unique(labels).shape[0]
print("True number of Clusters: ", true_k)
True number of Clusters: 4

Sample Article

dataset.data[0].split('\n')
['From: [email protected] (Tammy R Healy)', 'Subject: Re: who are we to judge, Bobby?', 'Lines: 38', 'Organization: Walla Walla College', 'Lines: 38', '', 'In article <[email protected]> [email protected] (S.N. Mozumder ) writes:', '>From: [email protected] (S.N. Mozumder )', '>Subject: Re: who are we to judge, Bobby?', '>Date: Wed, 14 Apr 1993 21:33:56 GMT', '>In article <[email protected]> [email protected] (TAMMY R HEALY) writes:', '>>Bobby,', '>>', '>>I would like to take the liberty to quote from a Christian writer named ', '>>Ellen G. White. I hope that what she said will help you to edit your ', '>>remarks in this group in the future.', '>>', '>>"Do not set yourself as a standard. Do not make your opinions, your views ', '>>of duty, your interpretations of scripture, a criterion for others and in ', '>>your heart condemn them if they do not come up to your ideal."', '>> Thoughts Fromthe Mount of Blessing p. 124', '>>', ">>I hope quoting this doesn't make the atheists gag, but I think Ellen White ", '>>put it better than I could.', '>> ', '>>Tammy', '>', '>Point?', '>', '>Peace,', '>', '>Bobby Mozumder', '>', 'My point is that you set up your views as the only way to believe. Saying ', 'that all eveil in this world is caused by atheism is ridiculous and ', 'counterproductive to dialogue in this newsgroups. I see in your posts a ', "spirit of condemnation of the atheists in this newsgroup bacause they don'", "t believe exactly as you do. If you're here to try to convert the atheists ", "here, you're failing miserably. Who wants to be in position of constantly ", 'defending themselves agaist insulting attacks, like you seem to like to do?!', "I'm sorry you're so blind that you didn't get the messgae in the quote, ", 'everyone else has seemed to.', '', 'Tammy', '']
dataset.data[500].split('\n')
['From: [email protected] (Tim Ciceran)', 'Subject: Re: Best FTP Viewer please.', 'Organization: Brock University, St. Catharines Ontario', 'X-Newsreader: TIN [version 1.1 PL9]', 'Lines: 19', '', '[email protected] wrote:', ': ==============================================================================', ": Could someone please tell me the Best FTP'able viewer available for MSDOS", ': I am running a 486 33mhz with SVGA monitor.', ': I need to look at gifs mainly and it would be advantageous if it ran', ': under windows...........thanks', '', 'FTP to wuarchive.wustl.edu,', 'change into mirrors/msdos/graphics', 'get "grfwk61t.zip"', 'This is the DOS version of Graphic Workshop. There is a Windows version which', "you could probably find in the mirrors/msdos/windows3 directory but I don't ", 'know what the file name is. ', '', '-- ', '', 'TMC', '([email protected])', '', '']

Converting Text Data to Numeric using TfidfVectorizer

print("Extracting features from the training dataset using a TFidF vectorizer")
t0 = time()
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, ngram_range=(1,2))
X = vectorizer.fit_transform(dataset.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()
Extracting features from the training dataset using a TFidF vectorizer
done in 3.653120s
n_samples: 3387, n_features: 361840
X.shape
(3387, 361840)
# #############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, max_iter = 1000)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()
Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
done in 161.488s

Look at the Silhouette Score and provide some other metrics for you to investigate on your own

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()
Homogeneity: 0.478
Completeness: 0.549
V-measure: 0.511
Adjusted Rand-Index: 0.423
Silhouette Coefficient: 0.004

Let's look at some of the predictions!

pd.DataFrame(list(zip(range(10),km.predict(X)[:10])), columns = ['Observation num.', 'Cluster'])
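Since cluster numbers are arbitrary, it can help to cross-tabulate them against the known newsgroup names before reading individual posts. The sketch below uses small invented arrays (`true_toy`, `cluster_toy`) in place of the notebook's `labels` and `km.labels_`, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for dataset.target_names, labels and km.labels_.
target_names = ["alt.atheism", "comp.graphics", "sci.space"]
true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cluster_toy = np.array([2, 2, 0, 0, 1, 1, 1, 1])

# Rows: true newsgroup; columns: assigned cluster; cells: document counts.
table = pd.crosstab(
    pd.Series([target_names[i] for i in true_toy], name="newsgroup"),
    pd.Series(cluster_toy, name="cluster"),
)
print(table)
```

A row whose counts concentrate in one column indicates a newsgroup that the clustering recovered cleanly; a row spread over several columns indicates a topic that was split or merged with others.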
dataset.data[9].split('\n')
['From: [email protected]', 'Subject: Re: A Little Too Satanic', 'Organization: Texas A&M University', 'Lines: 21', 'NNTP-Posting-Host: tamvm1.tamu.edu', '', 'In article <[email protected]>', '[email protected] (Charley Wingate) writes:', ' ', '>', '>Nanci Ann Miller writes:', '>', ']The "corrupted over and over" theory is pretty weak. Comparison of the', ']current hebrew text with old versions and translations shows that the text', ']has in fact changed very little over a space of some two millennia. This', "]shouldn't be all that suprising; people who believe in a text in this manner", ']are likely to makes some pains to make good copies.', ' ', 'Tell it to King James, mate.', ' ', ']C. Wingate + "The peace of God, it is no peace,', '] + but strife closed in the sod.', '][email protected] + Yet, brothers, pray for but one thing:', ']tove!mangoe + the marv\'lous peace of God."', ' ', ' ', 'John Burke, [email protected]', '']
dataset.data[0].split('\n')
['From: [email protected] (Tammy R Healy)', 'Subject: Re: who are we to judge, Bobby?', 'Lines: 38', 'Organization: Walla Walla College', 'Lines: 38', '', 'In article <[email protected]> [email protected] (S.N. Mozumder ) writes:', '>From: [email protected] (S.N. Mozumder )', '>Subject: Re: who are we to judge, Bobby?', '>Date: Wed, 14 Apr 1993 21:33:56 GMT', '>In article <[email protected]> [email protected] (TAMMY R HEALY) writes:', '>>Bobby,', '>>', '>>I would like to take the liberty to quote from a Christian writer named ', '>>Ellen G. White. I hope that what she said will help you to edit your ', '>>remarks in this group in the future.', '>>', '>>"Do not set yourself as a standard. Do not make your opinions, your views ', '>>of duty, your interpretations of scripture, a criterion for others and in ', '>>your heart condemn them if they do not come up to your ideal."', '>> Thoughts Fromthe Mount of Blessing p. 124', '>>', ">>I hope quoting this doesn't make the atheists gag, but I think Ellen White ", '>>put it better than I could.', '>> ', '>>Tammy', '>', '>Point?', '>', '>Peace,', '>', '>Bobby Mozumder', '>', 'My point is that you set up your views as the only way to believe. Saying ', 'that all eveil in this world is caused by atheism is ridiculous and ', 'counterproductive to dialogue in this newsgroups. I see in your posts a ', "spirit of condemnation of the atheists in this newsgroup bacause they don'", "t believe exactly as you do. If you're here to try to convert the atheists ", "here, you're failing miserably. Who wants to be in position of constantly ", 'defending themselves agaist insulting attacks, like you seem to like to do?!', "I'm sorry you're so blind that you didn't get the messgae in the quote, ", 'everyone else has seemed to.', '', 'Tammy', '']
dataset.data[7].split('\n')
['From: [email protected] (Tom)', 'Subject: Moonbase race', 'X-Added: Forwarded by Space Digest', 'Organization: [via International Space University]', 'Original-Sender: [email protected]', 'Distribution: sci', 'Lines: 22', '', 'George William Herbert sez:', '', '>Hmm. $1 billion, lesse... I can probably launch 100 tons to LEO at', '>$200 million, in five years, which gives about 20 tons to the lunar', '>surface one-way. Say five tons of that is a return vehicle and its', '>fuel, a bigger Mercury or something (might get that as low as two', ">tons), leaving fifteen tons for a one-man habitat and a year's supplies?", '>Gee, with that sort of mass margins I can build the systems off', '>the shelf for about another hundred million tops. That leaves', ">about $700 million profit. I like this idea 8-) Let's see", '>if you guys can push someone to make it happen 8-) 8-)', '', "I like your optimism, George. I don't know doots about raising that kind", 'of dough, but if you need people to split the work and the $700M, you just', 'give me a ring :-) Living alone for a year on the moon sounds horrid, but', "I'd even try that, if I got a bigger cut. :-)", '', '-Tommy Mac', '-------------------------------------------------------------------------', 'Tom McWilliams 517-355-2178 wk \\\\ As the radius of vision increases,', '[email protected] 336-9591 hm \\\\ the circumference of mystery grows.', '-------------------------------------------------------------------------', '']
dataset.data[8].split('\n')
['From: [email protected] (Ron Baalke)', "Subject: JPL's VLBI Project Meets with International Space Agencies", 'Organization: Jet Propulsion Laboratory', 'Lines: 112', 'Distribution: world', 'NNTP-Posting-Host: kelvin.jpl.nasa.gov', 'Keywords: VLBI, JPL', 'News-Software: VAX/VMS VNEWS 1.41 ', '', 'From the "JPL Universe"', 'April 23, 1993', '', 'VLBI project meets with international space agencies', '', 'By Ed McNevin', " Members of JPL's Space Very Long Baseline Interferometry", '(VLBI) project team recently concluded a week-long series of', 'meetings with officials from Russia and Japan.', ' The meetings were part of "Space VLBI Week" held at JPL in', 'early March and were intended to maintain cooperation between', 'international space agencies participating in the development of', 'the U.S. Space VLBI Project, a recently approved JPL flight', 'project set for launch in 1995.', ' U.S. Space VLBI will utilize two Earth-orbiting spacecraft', '-- the Japanese VSOP (VLBI Space Observing Program) satellite', 'with its 8-meter radio telescope, and a Russian RADIOASTRON', '10-meter satellite. Both spacecraft will team up with', 'ground-based radio telescopes located around the world to create', 'a radio telescope network that astronomers hope will expand radio', 'telescope observing power by a factor of 10.', " Japan's VSOP satellite will use a limited six-hour orbit to", 'conduct imaging science, while the Russian RADIOASTRON spacecraft', 'will exploit a larger, 28-hour Earth orbit to conduct exploratory', 'radio astronomy. Each satellite will point at a source target for', 'roughly 24 hours, while approximately 20 ground-based radio', 'telescopes will simultaneously point at the same source object', 'while within view on Earth.', " According to Dr. Joel Smith, JPL's project manager for the", 'U.S. 
Space VLBI, meetings like those held at JPL will permit', 'Japan and Russia, who have little previous experience in radio', 'interferometry, to establish working relationships with the radio', 'astronomy communities that will be vital during the complex', 'observations required by the Space VLBI project.', ' "One of our main activities is developing the methodology', 'for international coordination, because the two spacecraft', 'simultaneously rely on the corresponding tracking stations while', 'using the ground-based radio telescopes to observe the same', 'celestial objects," said Smith.', ' Three new tracking antennas are being built at DSN', 'facilities and other three other tracking facilities located in', 'Japan, Russia and Green Bank, W.Va. This global network of', 'ground-based radio telescopes will use precision clocks and', 'high-speed recorders to collect observation data and forward the', 'information to a correlator located at the National Radio', 'Astronomy Observatory in Socorro, N.M. The correlator will', 'combine and process data, then make it available to mission', 'investigators in Moscow, Tokyo, and JPL via electronic mail.', ' Smith is optimistic that the massive radio telescope created', 'by the Space VLBI network will provide radio astronomers with', 'better resolution than has ever been achieved before by', 'ground-based radio telescopes, allowing astronomers to take a', 'closer look at distant objects in space.', ' "There is a long history of radio astronomy using', 'ground-based telescopes," said Smith. "What we intend to do is to', 'extend radio astronomy into Earth orbit. 
Our goal is to look', 'deeper into the cores of galactic nuclei, quasars and other', 'active radio sources to understand what drives those things we', 'have seen so far with radio astronomy."', ' Smith noted that if one examines "the active galactic', "nuclei, you'll find jets appearing to spew at speeds greater than", 'light, and at energy levels that are millions of times greater', 'than you would expect."', ' He said some astronomers believe that black holes may be', 'located in the cores of these galaxies, and that they may fuel', 'the jets. Smith hopes that "by using Space VLBI to look further', 'into the cores, this theory may be supported or disproved."', ' Russian space-flight hardware, including transponders and', 'transmitters, are now being tested in the United States, and', 'Japanese hardware is scheduled to arrive for testing later this', 'year. Analysis of this hardware will permit U.S. scientists and', 'engineers to understand how to modify the high-speed VLBA', 'Correlator operating at the NRAO in order to accommodate the odd', 'data patterns that will originate from the more than 20', 'ground-based radio telescopes involved in Space VLBI.', ' Smith is particularly pleased that meetings with the', 'Japanese and Russian space agency officials -- like those held at', 'JPL in March -- have proceeded smoothly. Yet he knows that the', "political uncertainty in Russia could jeopardize that country's", 'participation in the project.', ' "Nothing is ever smooth," he said, "but the Russians have', 'been incredibly open with us. 
We always anticipated some', 'likelihood that we will not succeed because of political factors', 'beyond our control, yet there tends to be a way of keeping these', 'things going, because scientists on both sides are trying hard,', 'and people recognize the value of cooperation at this level."', ' Smith points out that the Japanese space agency has more at', 'stake than just fulfilling an international commitment to a', 'science mission.', ' "The Japanese have been extremely cooperative, since', 'international cooperation is essential to their science mission,"', 'he said.', ' But Smith also noted that Japanese space agency officials', 'look at the U.S. Space VLBI mission as an opportunity to showcase', 'the technology involved with VSOP spacecraft, and their highly', 'regarded Mach V launch vehicle.', ' Yet regardless of the risks involved in undertaking such an', "ambitious project, JPL's Smith is satisfied that planning for the", 'Space VLBI Project is beyond the significant financial and', 'political hurdles that otherwise might threaten the project.', ' "Fortunately, we have the virtue of having two partners, and', 'if either falls out, we would still have something with the', 'other. By themselves, both spacecraft are independent,', 'scientifically exciting missions."', ' ###', ' ___ _____ ___', ' /_ /| /____/ \\ /_ /| Ron Baalke | [email protected]', ' | | | | __ \\ /| | | | Jet Propulsion Lab |', ' ___| | | | |__) |/ | | |__ M/S 525-3684 Telos | The aweto from New Zealand', '/___| | | | ___/ | |/__ /| Pasadena, CA 91109 | is part caterpillar and', '|_____|/ |_|/ |_____|/ | part vegetable.', '', '']