Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
SSQ
GitHub Repository: SSQ/Coursera-Stanford-Greedy-Algorithms-Minimum-Spanning-Trees-and-Dynamic-Programming
Path: blob/master/Programming Assignment 2/Cluster 1.ipynb
423 views
Kernel: Python [conda env:DAND]

Clustering 1

1 point 1.In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering.

Download the text file below.

clustering1.txt

This file describes a distance function (equivalently, a complete graph with edge costs). It has the following format:

[number_of_nodes]

[edge 1 node 1] [edge 1 node 2] [edge 1 cost]

[edge 2 node 1] [edge 2 node 2] [edge 2 cost]

...

There is one edge (i,j) for each choice of 1≤i<j≤n, where n is the number of nodes.

For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250. You can assume that distances are positive, but you should NOT assume that they are distinct.

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?

ADVICE: If you're not getting the correct answer, try debugging your algorithm using some small test cases. And then post them to the discussion forum!

import pandas as pd import numpy as np import time
# get the file path file_path = 'clustering1.txt' # file_path = 'test 1.txt' # convert text file to np.array type clustering_data = np.loadtxt(file_path) int_clustering_data = clustering_data.astype(int) print('int_clustering_data: ') print(int_clustering_data)
int_clustering_data: [[ 1 2 6808] [ 1 3 5250] [ 1 4 74] ..., [ 498 499 5758] [ 498 500 4881] [ 499 500 8273]]

convert directed graph into undirected graph

data = np.copy(int_clustering_data) data[:,[0,1]] = int_clustering_data[:,[1,0]] print('data: ') print(data) print('int_clustering_data: ') print(int_clustering_data) undirected_graph = np.concatenate((int_clustering_data, data)) print('undirected_graph: ') print(undirected_graph)

# test1 initialize a cluster test_clusters = {} for x in range(1,10): test_clusters[x] = [x] print('test_clusters: ') print(test_clusters)
test_clusters: {1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]}
# test2 extend and del key test_dic = {1:[1],2:[2],3:[3,4]} print('test_dic') print(test_dic) test_dic[1].extend(test_dic[3]) print('test_dic') print(test_dic) del test_dic[3] print('test_dic') print(test_dic)
test_dic {1: [1], 2: [2], 3: [3, 4]} test_dic {1: [1, 3, 4], 2: [2], 3: [3, 4]} test_dic {1: [1, 3, 4], 2: [2]}
# clustering algorithm # test1 can be found before # Initially, each point in a separate cluster,we have 500 nodes clusters = {} for x in range(1,501): clusters[x] = [x] print('clusters: ') print(clusters) # print('clusters: ') # print(clusters) # sort the distance of nodes sorted_index = np.argsort(int_clustering_data[:,2], axis=0) # set the number of clusters k = 4 # def a function to get the key by value in dictionary # d: dictionary # k: key def get_key_by_value(d, k): for leader, nodes in d.items(): if k in nodes: return leader # test2 can be found before # c: dictionary. cluster # a: key. added value in dict # d: key. deleted value in dict def update_cluster(c, a, d): c[a].extend(c[d]) del c[d] # initialize number of iteration iteration = 0 # print('whole_edges: ') print(len(sorted_index)) # core algorithm start_time = time.time() for i in sorted_index: # print('i: ') # print(i) #iteration = iteration + 1 # print('iteration: ') # print(iteration) min_edge = int_clustering_data[i,:] key1 = get_key_by_value(clusters, min_edge[0]) key2 = get_key_by_value(clusters, min_edge[1]) if key1 == key2: continue elif len(clusters[key1]) >= len(clusters[key2]): update_cluster(clusters, key1, key2) else: update_cluster(clusters, key2, key1) # print('k: ') # print(len(clusters)) if len(clusters) == k: break end_time = time.time() print('time: ' + str(end_time - start_time) + 's ') print('time: ' + str((end_time - start_time) / 60) + 'min') print('time: ' + str((end_time - start_time) / 3600) + 'h') print('clusters: ') print(clusters)
clusters: {1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9], 10: [10], 11: [11], 12: [12], 13: [13], 14: [14], 15: [15], 16: [16], 17: [17], 18: [18], 19: [19], 20: [20], 21: [21], 22: [22], 23: [23], 24: [24], 25: [25], 26: [26], 27: [27], 28: [28], 29: [29], 30: [30], 31: [31], 32: [32], 33: [33], 34: [34], 35: [35], 36: [36], 37: [37], 38: [38], 39: [39], 40: [40], 41: [41], 42: [42], 43: [43], 44: [44], 45: [45], 46: [46], 47: [47], 48: [48], 49: [49], 50: [50], 51: [51], 52: [52], 53: [53], 54: [54], 55: [55], 56: [56], 57: [57], 58: [58], 59: [59], 60: [60], 61: [61], 62: [62], 63: [63], 64: [64], 65: [65], 66: [66], 67: [67], 68: [68], 69: [69], 70: [70], 71: [71], 72: [72], 73: [73], 74: [74], 75: [75], 76: [76], 77: [77], 78: [78], 79: [79], 80: [80], 81: [81], 82: [82], 83: [83], 84: [84], 85: [85], 86: [86], 87: [87], 88: [88], 89: [89], 90: [90], 91: [91], 92: [92], 93: [93], 94: [94], 95: [95], 96: [96], 97: [97], 98: [98], 99: [99], 100: [100], 101: [101], 102: [102], 103: [103], 104: [104], 105: [105], 106: [106], 107: [107], 108: [108], 109: [109], 110: [110], 111: [111], 112: [112], 113: [113], 114: [114], 115: [115], 116: [116], 117: [117], 118: [118], 119: [119], 120: [120], 121: [121], 122: [122], 123: [123], 124: [124], 125: [125], 126: [126], 127: [127], 128: [128], 129: [129], 130: [130], 131: [131], 132: [132], 133: [133], 134: [134], 135: [135], 136: [136], 137: [137], 138: [138], 139: [139], 140: [140], 141: [141], 142: [142], 143: [143], 144: [144], 145: [145], 146: [146], 147: [147], 148: [148], 149: [149], 150: [150], 151: [151], 152: [152], 153: [153], 154: [154], 155: [155], 156: [156], 157: [157], 158: [158], 159: [159], 160: [160], 161: [161], 162: [162], 163: [163], 164: [164], 165: [165], 166: [166], 167: [167], 168: [168], 169: [169], 170: [170], 171: [171], 172: [172], 173: [173], 174: [174], 175: [175], 176: [176], 177: [177], 178: [178], 179: [179], 180: [180], 181: [181], 182: [182], 183: [183], 184: [184], 185: [185], 186: [186], 187: [187], 188: [188], 189: [189], 190: [190], 191: [191], 192: [192], 193: [193], 194: [194], 195: [195], 196: [196], 197: [197], 198: [198], 199: [199], 200: [200], 201: [201], 202: [202], 203: [203], 204: [204], 205: [205], 206: [206], 207: [207], 208: [208], 209: [209], 210: [210], 211: [211], 212: [212], 213: [213], 214: [214], 215: [215], 216: [216], 217: [217], 218: [218], 219: [219], 220: [220], 221: [221], 222: [222], 223: [223], 224: [224], 225: [225], 226: [226], 227: [227], 228: [228], 229: [229], 230: [230], 231: [231], 232: [232], 233: [233], 234: [234], 235: [235], 236: [236], 237: [237], 238: [238], 239: [239], 240: [240], 241: [241], 242: [242], 243: [243], 244: [244], 245: [245], 246: [246], 247: [247], 248: [248], 249: [249], 250: [250], 251: [251], 252: [252], 253: [253], 254: [254], 255: [255], 256: [256], 257: [257], 258: [258], 259: [259], 260: [260], 261: [261], 262: [262], 263: [263], 264: [264], 265: [265], 266: [266], 267: [267], 268: [268], 269: [269], 270: [270], 271: [271], 272: [272], 273: [273], 274: [274], 275: [275], 276: [276], 277: [277], 278: [278], 279: [279], 280: [280], 281: [281], 282: [282], 283: [283], 284: [284], 285: [285], 286: [286], 287: [287], 288: [288], 289: [289], 290: [290], 291: [291], 292: [292], 293: [293], 294: [294], 295: [295], 296: [296], 297: [297], 298: [298], 299: [299], 300: [300], 301: [301], 302: [302], 303: [303], 304: [304], 305: [305], 306: [306], 307: [307], 308: [308], 309: [309], 310: [310], 311: [311], 312: [312], 313: [313], 314: [314], 315: [315], 316: [316], 317: [317], 318: [318], 319: [319], 320: [320], 321: [321], 322: [322], 323: [323], 324: [324], 325: [325], 326: [326], 327: [327], 328: [328], 329: [329], 330: [330], 331: [331], 332: [332], 333: [333], 334: [334], 335: [335], 336: [336], 337: [337], 338: [338], 339: [339], 340: [340], 341: [341], 342: [342], 343: [343], 344: [344], 345: [345], 346: [346], 347: [347], 348: [348], 349: [349], 350: [350], 351: [351], 352: [352], 353: [353], 354: [354], 355: [355], 356: [356], 357: [357], 358: [358], 359: [359], 360: [360], 361: [361], 362: [362], 363: [363], 364: [364], 365: [365], 366: [366], 367: [367], 368: [368], 369: [369], 370: [370], 371: [371], 372: [372], 373: [373], 374: [374], 375: [375], 376: [376], 377: [377], 378: [378], 379: [379], 380: [380], 381: [381], 382: [382], 383: [383], 384: [384], 385: [385], 386: [386], 387: [387], 388: [388], 389: [389], 390: [390], 391: [391], 392: [392], 393: [393], 394: [394], 395: [395], 396: [396], 397: [397], 398: [398], 399: [399], 400: [400], 401: [401], 402: [402], 403: [403], 404: [404], 405: [405], 406: [406], 407: [407], 408: [408], 409: [409], 410: [410], 411: [411], 412: [412], 413: [413], 414: [414], 415: [415], 416: [416], 417: [417], 418: [418], 419: [419], 420: [420], 421: [421], 422: [422], 423: [423], 424: [424], 425: [425], 426: [426], 427: [427], 428: [428], 429: [429], 430: [430], 431: [431], 432: [432], 433: [433], 434: [434], 435: [435], 436: [436], 437: [437], 438: [438], 439: [439], 440: [440], 441: [441], 442: [442], 443: [443], 444: [444], 445: [445], 446: [446], 447: [447], 448: [448], 449: [449], 450: [450], 451: [451], 452: [452], 453: [453], 454: [454], 455: [455], 456: [456], 457: [457], 458: [458], 459: [459], 460: [460], 461: [461], 462: [462], 463: [463], 464: [464], 465: [465], 466: [466], 467: [467], 468: [468], 469: [469], 470: [470], 471: [471], 472: [472], 473: [473], 474: [474], 475: [475], 476: [476], 477: [477], 478: [478], 479: [479], 480: [480], 481: [481], 482: [482], 483: [483], 484: [484], 485: [485], 486: [486], 487: [487], 488: [488], 489: [489], 490: [490], 491: [491], 492: [492], 493: [493], 494: [494], 495: [495], 496: [496], 497: [497], 498: [498], 499: [499], 500: [500]} whole_edges: 124750 time: 0.134000062943s time: 0.00223333438238min time: 3.72222397063e-05h clusters: {102: [102, 438, 154, 137, 149, 157, 287, 158, 101, 11, 424, 98, 112, 51, 113, 122, 231, 449, 372, 423, 415, 180, 341, 125, 46, 9, 358, 486, 47, 393, 33, 96, 429, 61, 230, 18, 198, 404, 477, 166, 268, 347, 427, 123, 294, 352, 192, 442, 239, 187, 60, 175, 22, 296, 146, 147, 256, 79, 138, 255, 463, 52, 30, 48, 368, 72, 291, 270, 272, 374, 38, 126, 480, 288, 208, 25, 92, 387, 257, 273, 17, 400, 476, 370, 31, 6, 203, 289, 371, 223, 94, 124, 375, 454, 382, 56, 350, 153, 355, 499, 210, 337, 19, 293, 362, 80, 189, 214, 361, 82, 378, 397, 128, 311, 326, 377, 395, 238, 300, 34, 353, 473, 474, 90, 298, 420, 130, 440, 493, 172, 161, 271, 55, 322, 162, 332, 356, 7, 222, 277, 269, 263, 264, 179, 338, 14, 71, 27, 487, 316, 354, 360, 202, 205, 275, 234, 484, 199, 117, 439, 365, 262, 167, 417, 32, 472, 200, 168, 452, 243, 489, 160, 176, 308, 475, 107, 185, 328, 469, 419, 245, 83, 88, 485, 68, 116, 3, 426, 460, 220, 389, 139, 430, 183, 260, 190, 267, 305, 145, 212, 37, 278, 321, 21, 8, 301, 416, 295, 491, 425, 471, 110, 453, 274, 109, 436, 120, 495, 226, 197, 290, 327, 244, 385, 143, 457, 114, 488, 29, 312, 279, 329, 335, 132, 303, 67, 62, 240, 115, 40, 150, 412, 433, 20, 283, 182, 225, 235, 106, 266, 155, 258, 36, 215, 500, 84, 249, 391, 217, 134, 448, 119, 318, 144, 259, 221, 410, 281, 186, 306, 431, 173, 345, 184, 392, 103, 339, 4, 366, 178, 169, 330, 418, 388, 479, 75, 434, 10, 26, 390, 309, 407, 292, 331, 219, 236, 181, 39, 369, 41, 481, 53, 131, 276, 58, 171, 351, 285, 177, 237, 461, 152, 401, 478, 78, 437, 447, 406, 446, 376, 284, 336, 156, 16, 396, 85, 333, 76, 320, 458, 250, 451, 74, 280, 43, 363, 435, 121, 286, 57, 45, 66, 224, 422, 97, 12, 373, 344, 432, 140, 357, 89, 142, 13, 343, 403, 402, 108, 253, 324, 193, 492, 170, 456, 482, 496, 386, 141, 44, 325, 159, 86, 24, 455, 334, 229, 297, 444, 459, 118, 64, 413, 69, 73, 105, 59, 241, 408, 5, 314, 95, 164, 383, 104, 304, 265, 466, 111, 261, 428, 91, 247, 494, 100, 467, 201, 195, 54, 148, 381, 450, 349, 319, 151, 398, 411, 299, 211, 252, 421, 445, 204, 405, 188, 380, 498, 129, 196, 23, 317, 15, 163, 379, 399, 342, 213, 2, 483, 28, 133, 470, 248, 443, 49, 359, 233, 81, 409, 135, 497, 136, 323, 1, 348, 227, 209, 70, 468, 302, 441, 464, 246, 207, 346, 42, 307, 315, 313, 242, 99, 194, 77, 251, 65, 127, 367, 364, 87, 206, 310, 232, 174, 218, 465, 490, 216, 340, 35, 228, 191, 254, 394, 50, 93, 63, 282, 165], 384: [384], 414: [414], 462: [462]}
clusters_keys = clusters.keys() # [102, 384, 414, 462] # initialize spacing = [] distance=0 # set all spacing of k-clusters for i in range(len(clusters_keys)-1): j = i+1 for k in range(j, len(clusters_keys)): spacing.append([clusters_keys[i], clusters_keys[k], distance]) spacing = np.array(spacing) print('spacing: ') print(spacing)
spacing: [[102 384 0] [102 414 0] [102 462 0] [384 414 0] [384 462 0] [414 462 0]]
# test print(spacing[:,0]==102) print(spacing[:,0]) print(spacing[:,1]==462) print(np.all([spacing[:,0]==102, spacing[:,1]==462], axis=0)) test = np.all([spacing[:,0]==102, spacing[:,1]==462], axis=0) print(np.argwhere(test==True)) print(np.argwhere(test==True)[0][0]) print(type(np.argwhere(test==True))) print(spacing[np.argwhere(test==False),:]) print(len(spacing[np.argwhere(test==False),:])) print(spacing[np.argwhere(test==False)[0][0],:])
[ True True True False False False] [102 102 102 384 384 414] [False False True False True True] [False False True False False False] [[2]] 2 <type 'numpy.ndarray'> [[[102 384 0]] [[102 414 0]] [[384 414 0]] [[384 462 0]] [[414 462 0]]] 5 [102 384 0]
def min_distance(int_clustering_data, clusters, cluster1, cluster2): minimum = int_clustering_data[:,2].max() # print('minimum: ') # print(minimum) for i in clusters[cluster1]: # print('i: ') # print(i) for j in clusters[cluster2]: # print('j: ') # print(j) if i < j: find_position = np.all([int_clustering_data[:,0]==i, int_clustering_data[:,1]==j], axis=0) find_edge = np.argwhere(find_position==True) actural_position = find_edge[0][0] distance = int_clustering_data[actural_position, 2] if minimum > distance: minimum = distance else: find_position = np.all([int_clustering_data[:,0]==j, int_clustering_data[:,1]==i], axis=0) find_edge = np.argwhere(find_position==True) actural_position = find_edge[0][0] distance = int_clustering_data[actural_position, 2] if minimum > distance: minimum = distance # print('minimum: ') # print(minimum) return minimum
for i in range(len(spacing)): spacing[i,2] = min_distance(int_clustering_data, clusters, spacing[i,0], spacing[i,1])
print('spacing: ') print(spacing) print('max distance clusters: ') print(spacing[np.argmax(spacing[:,2]), :]) print('max distance: ') print(np.amax(spacing[:,2]))
spacing: [[ 102 384 123] [ 102 414 106] [ 102 462 107] [ 384 414 1162] [ 384 462 1641] [ 414 462 746]] max distance clusters: [ 384 462 1641] max distance: 1641

Question1 Answer

1641