%% Machine Learning Online Class
%  Exercise 8 | Anomaly Detection and Collaborative Filtering
%
%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  exercise. You will need to complete the following functions:
%
%     estimateGaussian.m
%     selectThreshold.m
%     cofiCostFunc.m
%
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%

%% Initialization
clear; close all; clc

%% ================== Part 1: Load Example Dataset ===================
%  We start this exercise by using a small dataset that is easy to
%  visualize.
%
%  Our example case consists of 2 network server statistics across
%  several machines: the latency and throughput of each machine.
%  This exercise will help us find possibly faulty (or very fast) machines.
%

fprintf('Visualizing example dataset for outlier detection.\n\n');

%  The following command loads the dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data1.mat');

%  Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bx');
axis([0 30 0 30]);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause;


%% ================== Part 2: Estimate the dataset statistics ===================
%  For this exercise, we assume a Gaussian distribution for the dataset.
%
%  We first estimate the parameters of our assumed Gaussian distribution,
%  then compute the probabilities for each of the points, and then visualize
%  both the overall distribution and where each of the points falls in
%  terms of that distribution.
%
fprintf('Visualizing Gaussian fit.\n\n');

%  Estimate mu and sigma2
[mu, sigma2] = estimateGaussian(X);

%  Returns the density of the multivariate normal at each data point (row)
%  of X
p = multivariateGaussian(X, mu, sigma2);

%  Visualize the fit
visualizeFit(X, mu, sigma2);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 3: Find Outliers ===================
%  Now you will find a good epsilon threshold using the probabilities of
%  the cross-validation set under the estimated Gaussian distribution.
%

pval = multivariateGaussian(Xval, mu, sigma2);

[epsilon, F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 8.99e-05)\n');
fprintf('   (you should see a Best F1 value of  0.875000)\n\n');

%  Find the outliers in the training set (points whose density falls
%  below epsilon)
outliers = find(p < epsilon);

%  Draw a red circle around those outliers
hold on
plot(X(outliers, 1), X(outliers, 2), 'ro', 'LineWidth', 2, 'MarkerSize', 10);
hold off

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================== Part 4: Multidimensional Outliers ===================
%  We will now use the code from the previous part and apply it to a
%  harder problem in which more features describe each datapoint and only
%  some features indicate whether a point is an outlier.
%

%  Loads the second dataset. You should now have the
%  variables X, Xval, yval in your environment
load('ex8data2.mat');

%  Apply the same steps to the larger dataset
[mu, sigma2] = estimateGaussian(X);

%  Training set
p = multivariateGaussian(X, mu, sigma2);

%  Cross-validation set
pval = multivariateGaussian(Xval, mu, sigma2);

%  Find the best threshold
[epsilon, F1] = selectThreshold(yval, pval);

fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
fprintf('Best F1 on Cross Validation Set:  %f\n', F1);
fprintf('   (you should see a value epsilon of about 1.38e-18)\n');
fprintf('   (you should see a Best F1 value of 0.615385)\n');
fprintf('# Outliers found: %d\n\n', sum(p < epsilon));
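
%% ================== Sketch: parameter estimation and threshold selection ===================
%  The script above calls estimateGaussian and selectThreshold but does not
%  show what they compute. The two local functions below are a minimal,
%  illustrative sketch of one way such helpers could work. They are NOT the
%  course's reference solution and are never called by this script.
%
%  Assumptions (none of these come from the original file):
%    * the names sketchEstimateGaussian and sketchSelectThreshold are
%      hypothetical, chosen so they do not shadow the real exercise files;
%    * variances are normalized by m (not m - 1);
%    * the threshold search sweeps 1000 evenly spaced candidate values
%      between min(pval) and max(pval);
%    * yval is a column of 0/1 labels, where 1 marks an anomaly.

function [mu, sigma2] = sketchEstimateGaussian(X)
  % Per-feature mean and variance of the rows of X, as n-by-1 columns.
  mu = mean(X, 1)';
  sigma2 = var(X, 1, 1)';   % second argument 1 selects 1/m normalization
end

function [bestEpsilon, bestF1] = sketchSelectThreshold(yval, pval)
  % Pick the epsilon that maximizes F1 when flagging pval < epsilon.
  bestEpsilon = 0;
  bestF1 = 0;
  stepsize = (max(pval) - min(pval)) / 1000;
  for epsilon = min(pval):stepsize:max(pval)
    predictions = (pval < epsilon);
    tp = sum(predictions & (yval == 1));    % true positives
    fp = sum(predictions & (yval == 0));    % false positives
    fn = sum(~predictions & (yval == 1));   % false negatives
    if tp == 0
      continue;   % F1 is zero or undefined; skip to avoid dividing by zero
    end
    prec = tp / (tp + fp);
    rec = tp / (tp + fn);
    F1 = 2 * prec * rec / (prec + rec);
    if F1 > bestF1
      bestF1 = F1;
      bestEpsilon = epsilon;
    end
  end
end

%  Usage, under the same assumptions:
%    [mu, sigma2] = sketchEstimateGaussian(X);
%    [epsilon, F1] = sketchSelectThreshold(yval, pval);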