Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
ethen8181
GitHub Repository: ethen8181/machine-learning
Path: blob/master/python/algorithms/search_sort.ipynb
1480 views
Kernel: Python 3 (ipykernel)
from jupyterthemes import get_themes from jupyterthemes.stylefx import set_nb_theme themes = get_themes() set_nb_theme(themes[1])
%load_ext watermark %watermark -a 'Ethen' -d -t -v -p jupyterthemes
Ethen 2016-12-26 22:23:12 CPython 3.5.2 IPython 4.2.0 jupyterthemes 0.13.9

Searching, Hashing, Sorting

Following the online book, Problem Solving with Algorithms and Data Structures. Chapter 6 introduces methods for searching and sorting numbers.

Searching

We can take advantage of a ordered list by doing a binary search. We start by searching in the middle, if it is not the item that we're searching for, we can use the ordered nature of the list to eliminate half of the remaining items.

def binary_search(testlist, query): if not testlist: return False else: mid_idx = len(testlist) // 2 mid = testlist[mid_idx] if mid == query: return True elif query < mid: return binary_search(testlist[:mid_idx], query) else: return binary_search(testlist[mid_idx + 1:], query)
testlist = [0, 1, 2, 8, 13, 17, 19, 32, 42] query = 19 print(binary_search(testlist, query)) query = 3 print(binary_search(testlist, query))
True False

Keep in mind that this approach requires sorting the list, which may not be ideal if we're simply going to search for 1 number on the very large list (since we have to first sort the list, which is not a cheap operation).

Hashing

Quick notes:

Dictionary or map that let us stor key value pairs is typically implemented using hash table.

  • An element that we wish to add is converted into an integer by using a hash function. hash = hash_function(key)

  • Th resulting hash is independent of the underlying array size, and it is then reduced to an index by using the modulo operator. index = hash % array_size

Python has a built in hash function that can calculate has hash values for arbitrarily large objects in a fast manner. Hash function that is used for dictionary or maps should be deterministic so we can look up the values afterwards, fast to compute, else the overhead of hashing would offset the benefit it brings for fast lookup, and uniform distributed, this one is related to avoiding hash collision.

Hash collision happens when two different inputs produce the same hash value. To avoid this a well implemented hash table should be able to resolve hash collision, common techniques include linear probing and separate chaining. Also wehn our hash table is almost saturated (the number of elements is close to the array size we've defined), a.k.a the load factor is larger than some portion, then it should be able to dynamically resize the hash table to maintain our dictionary and map's performance.

class HashTable: """ a.k.a python's dictionary the initial size of the table has been chosen to be 11, although this number is arbitrary, it's important for it to be a prime number so that collision resolution will be efficient; this implementation does not handle resizing the hashtable when it runs out of the original size """ def __init__(self): # slot will hold the key and data will hold the value self.size = 11 self.slot = [None] * self.size self.data = [None] * self.size def _put(self, key, value): hash_value = self._hash(key) if self.slot[hash_value] == None: self.slot[hash_value] = key self.data[hash_value] = value elif self.slot[hash_value] == key: # replace the original key value self.data[hash_value] = value else: # rehash to get the next location possible # if a collision is to occurr next_slot = self._rehash(hash_value) slot_value = self.slot[next_slot] while slot_value != None and slot_value != key: next_slot = self._rehash(next_slot) slot_value = self.slot[next_slot] if self.slot[next_slot] == None: self.slot[next_slot] = key self.data[next_slot] = value else: self.data[next_slot] = value def _get(self, key): data = None stop = False found = False start_slot = self._hash(key) next_slot = start_slot while self.slot[next_slot] != None and not found and not stop: if self.slot[next_slot] == key: data = self.data[next_slot] found = True else: # if we rehash to the starting value # then it means the data is not here next_slot = self._rehash(next_slot) if next_slot == start_slot: stop = True return data def _hash(self, key): return key % self.size def _rehash(self, oldhash): """ a simple plus 1 rehash, where we add 1 to the original value and hash it again to see if the slot it empty (None) """ return (oldhash + 1) % self.size def __getitem__(self, key): # allow access using``[]`` syntax return self._get(key) def __setitem__(self, key, value): self._put(key, value)
H = HashTable() H[54] = 'cat' H[26] = 'dog' H[93] = 'lion' H[17] = 'tiger' H[77] = 'bird' H[44] = 'goat' H[55] = 'pig' print(H.slot) print(H.data)
[77, 44, 55, None, 26, 93, 17, None, None, None, 54] ['bird', 'goat', 'pig', None, 'dog', 'lion', 'tiger', None, None, None, 'cat']
print(H[55]) print(H[20])
pig None

Sorting

Merge Sort

def merge_sort(alist): if len(alist) > 1: mid = len(alist) // 2 left_half = alist[:mid] right_half = alist[mid:] merge_sort(left_half) merge_sort(right_half) # loop through the left and right half, # compare the value and fill them back # to the original list i, j, k = 0, 0, 0 while i < len(left_half) and j < len(right_half): if left_half[i] < right_half[j]: alist[k] = left_half[i] i += 1 else: alist[k] = right_half[j] j += 1 k += 1 # after filling in the sorted value, # there will be left-over values on # either the left or right half, simply # append all the left-over values back while i < len(left_half): alist[k] = left_half[i] i += 1 k += 1 while j < len(right_half): alist[k] = right_half[j] j += 1 k += 1 return alist
alist = [54, 26, 93, 17, 77] merge_sort(alist)
[17, 26, 54, 77, 93]

Quick Sort

def quick_sort(alist): _sort(alist, 0, len(alist) - 1) def _sort(alist, first, last): if first < last: split_point = _partition(alist, first, last) _sort(alist, first, split_point - 1) _sort(alist, split_point + 1, last) def _partition(alist, first, last): right = last left = first + 1 pivot_value = alist[first] # find the split point of the pivot and move all other # items to the appropriate side of the list (i.e. if # the item is greater than pivot, then it should be # on the right hand side and vice versa) done = False while not done: while left <= right and alist[left] <= pivot_value: left += 1 while alist[right] >= pivot_value and right >= left: right -= 1 if right <= left: done = True else: alist[right], alist[left] = alist[left], alist[right] # swap pivot value to split point alist[first], alist[right] = alist[right], alist[first] return right
# list sorted in place alist = [54, 26, 93, 17, 77, 50] quick_sort(alist) alist
[17, 26, 50, 54, 77, 93]