Data generators
In Python, a generator is a function that behaves like an iterator: each time it is asked for a value, it returns the next item and then pauses until the next request. If generators are unfamiliar, review Python generators before continuing. In many AI applications, it is advantageous to have a data generator that handles loading and transforming the data.
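As a quick refresher, here is a minimal sketch of a generator function. The name count_up_to and its limit parameter are hypothetical, chosen only for illustration.

```python
def count_up_to(limit):
    """Hypothetical generator that yields 0, 1, ..., limit - 1."""
    current = 0
    while current < limit:
        # Execution pauses here and resumes on the next call to next().
        yield current
        current += 1


gen = count_up_to(3)
first = next(gen)      # runs the body up to the first yield
second = next(gen)     # resumes after the yield; `current` kept its value
remaining = list(gen)  # consumes whatever the generator has left
print(first, second, remaining)
```

Local variables such as current keep their values between calls, which is the property the data generator in this lab relies on.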
You will now implement a custom data generator, using a common pattern that you will use in all the assignments of this course. In the following example, we use a set of samples a to derive a new set of samples with more elements than the original set.
Note: Pay attention to the use of list lines_index and variable index to traverse the original list.
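The traversal pattern can be sketched as follows. The sample values in a and the output length of 10 are assumptions made for illustration; lines_index and index play the roles described in the note above.

```python
a = [1, 2, 3, 4]                   # assumed sample data
a_size = len(a)
lines_index = list(range(a_size))  # positions into a: [0, 1, 2, 3]
index = 0                          # current position inside lines_index

b = []                             # derived list, longer than a
for _ in range(10):                # assumed output length
    if index >= a_size:
        # We ran past the end of the data, so wrap around to the start.
        index = 0
    # Access a indirectly, through the list of indexes.
    b.append(a[lines_index[index]])
    index += 1

print(b)  # [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
```

Because a is only ever read through lines_index, changing the traversal order later requires changing lines_index alone.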
Shuffling the data order
In the next example, we will do the same as before, but shuffle the order of the elements in the output list. Note that here our strategy of traversing with lines_index and index becomes very important, because it lets us simulate a shuffle of the input data without actually reordering it.
Note: An epoch is one full pass of an algorithm over all the training examples. Shuffling the examples at each epoch is known to reduce variance, which helps models generalize better and overfit less.
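A minimal sketch of the shuffled traversal, under the same assumed sample data as before. The seeded random.Random instance is an assumption added only to make the run reproducible; only lines_index is shuffled, never a itself.

```python
import random

rng = random.Random(0)             # seeded only for reproducibility (assumption)

a = [1, 2, 3, 4]                   # assumed sample data, never reordered
a_size = len(a)
lines_index = list(range(a_size))
rng.shuffle(lines_index)           # shuffle the indexes, not the data
index = 0

b = []
for _ in range(10):                # assumed output length
    if index >= a_size:
        # End of an epoch: wrap around and pick a new traversal order.
        index = 0
        rng.shuffle(lines_index)
    b.append(a[lines_index[index]])
    index += 1

print(a)  # unchanged: [1, 2, 3, 4]
print(b)  # each run of four elements is a permutation of a
```

Every element of a still appears exactly once per epoch; only the order of visits changes.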
Exercise
Instructions: Implement a data generator function that takes in batch_size, x, y, and shuffle, where x could be a large list of samples and y is a list of the tags associated with those samples. Return a subset of those inputs in a tuple of two arrays (X, Y). Each is an array of dimension (batch_size). If shuffle=True, the data will be traversed in random order.
Details:
This code has an outer loop, which runs continuously in the fashion of generators, pausing each time it yields the next values. Each pass of this loop generates one batch of batch_size samples.
It has an inner loop that stores in temporary lists (X, Y) the data samples to be included in the next batch.
There are three slightly out of the ordinary features.
The first is the use of a list of a predefined size to store the data for each batch. Using a list of predefined size reduces the computation time if the elements are of a fixed size, like numbers. If the elements are of different sizes, it is better to start with an empty list and append one element at a time during the loop.
The second is tracking the current location in the incoming lists of samples. Generator variables hold their values between invocations, so we create an index variable, initialize it to zero, and increment it by one for each sample included in a batch. However, we do not use the index to access the positions of the list of sentences directly. Instead, we use it to select one index from a list of indexes. In this way, we can change the order in which we traverse our original list while leaving the original list untouched.
The third relates to wrapping. Because batch_size and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input lists. In our approach, it is enough to reset the index to 0. We can also re-shuffle the list of indexes at that point to produce different batches each time.
If your function is correct, all the tests should pass.
All tests passed!
If you could not solve the exercise, just run the next code to see the answer.
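The following is one possible sketch that follows the details described above; it is not necessarily the notebook's reference answer. The parameter names data_x and data_y and the internal names data_lng and index_list are assumptions, and the batches are returned as plain Python lists rather than NumPy arrays to keep the example dependency-free.

```python
import random


def data_generator(batch_size, data_x, data_y, shuffle=True):
    """Yield (X, Y) batches of length batch_size forever (illustrative sketch)."""
    data_lng = len(data_x)              # number of available samples
    index_list = list(range(data_lng))  # traversal order over the samples
    if shuffle:
        random.shuffle(index_list)      # reorder the indexes, not the data
    index = 0                           # position inside index_list

    while True:                         # outer loop: one batch per pass
        X = [0] * batch_size            # predefined-size lists for the batch
        Y = [0] * batch_size
        for i in range(batch_size):     # inner loop: fill the batch
            if index >= data_lng:
                # Wrap around to the beginning and optionally re-shuffle.
                index = 0
                if shuffle:
                    random.shuffle(index_list)
            X[i] = data_x[index_list[index]]
            Y[i] = data_y[index_list[index]]
            index += 1
        yield (X, Y)


# Usage with assumed toy data: y holds the square of each x.
x = [1, 2, 3, 4]
y = [xi ** 2 for xi in x]
gen = data_generator(3, x, y, shuffle=False)
batch_1 = next(gen)  # ([1, 2, 3], [1, 4, 9])
batch_2 = next(gen)  # ([4, 1, 2], [16, 1, 4]), wrapping past the end of x
print(batch_1, batch_2)
```

With shuffle=False the second batch shows the wrap-around: it takes the last sample and then starts again from the beginning of the input lists.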