Jupyter notebook M725F17/06 - Random Numbers and Random Graphs/Random Numbers and Random Graphs.ipynb
Random Numbers and Random Graphs
created 2/16/2017, updated 2/16/2017
Why random graphs?
Often when studying the structure of a particular network, for example the Internet or the connections between a group of friends on Facebook, we're not as much interested in the exact structure of this one particular network as we are in structural trends in "similar" networks. But where does one find a bunch of networks similar to the Internet to study? That's a huge and interesting question! What experts in network science frequently try to do is to create a network model: a way of generating random networks that (hopefully) share many of the important characteristics of the given network. Thus, rather than attempting to study the structure of the Internet (as it exists right now), we might try to study structural patterns of a hypothetical class of "alternate-Universe Internets", of which our current Internet is a particular instance.
I won't try to describe the incredible effort required to define a random network model that "acts like" the Internet or any other network. That's still an open question, with plenty of ongoing debates. Instead, the purpose of this notebook is to discuss, from a mathematical and algorithmic point of view, what randomness even means on a deterministic computer, the tools we have available for generating random networks, and how we can use random sampling to collect statistics from random network models. The first big question is, how do you make a computer (which is entirely deterministic) create anything random at all?
Random vs. pseudo-random
The short answer is, you don't. A deterministic machine can't behave randomly, by definition. So, there are really only two options.
Find a way to introduce true randomness. For cryptographic applications, this is essential. If you need randomness to encrypt a message, then anything deterministic is bad. There are a lot of interesting ways of trying to capture entropy from outside the computer in order to remove determinism: counting milliseconds between user keystrokes, timing the duration of hard-drive reads, sampling the signal from a microphone input port with no microphone attached. It is a nontrivial task to try to bring true randomness into the realm of the computer. Thankfully, that's not really what we need in this case.
Find a way to fake randomness. This is the idea behind the pseudo-random number generator. (Pseudo, borrowed from the Greek, meaning "false".) In most applications other than computer security, true randomness isn't really what we need. We just need something that sort of looks random to the casual observer.
A simple pseudo-random number generator (PRNG)
PRNGs are essentially deterministic sequences of numbers that give the impression of randomness. There are a lot of subtleties in defining a good PRNG, but, simply speaking, a human being who does not know the algorithm should have a hard time figuring out what number will come up next. Every PRNG begins from a seed state. The sequence of numbers produced by the PRNG is completely determined by its seed state. In order to change the sequence of numbers produced by the PRNG, you need to change the seed state. Choosing the seed state yourself can be beneficial for some applications. A good PRNG will be able to produce a relatively unpredictable sequence of numbers, but the sequence is easily reproduced as long as you know the seed. That will turn out to be convenient for testing your code. On the other hand, if you want to make your code run differently each time it's executed, you can base the seed on something with a degree of unpredictability, such as the computer's system time. (This is a reasonable thing to do for many applications, but is not a good idea if you're working with ciphers.)
One example of a simple PRNG (too simple for modern applications) is the Lehmer random number generator. Here is a very simple implementation of this PRNG. Notice how the PRNG is definitely deterministic, but it's pretty hard for the human brain to predict what happens next.
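Here is one possible minimal implementation, a sketch using the classic "MINSTD" parameters $m = 2^{31} - 1$ and $a = 48271$ (these particular constants are an assumption on my part; any full-period multiplier would illustrate the idea):

```python
# A minimal sketch of a Lehmer (multiplicative congruential) PRNG.
# The class name and parameter choices here are illustrative.

class LehmerPRNG:
    def __init__(self, seed=1):
        self.m = 2**31 - 1   # modulus: a Mersenne prime
        self.a = 48271       # the MINSTD multiplier
        self.state = seed % self.m
        if self.state == 0:  # the state must never be 0
            self.state = 1

    def next_int(self):
        """Advance the state: x_{k+1} = a * x_k mod m."""
        self.state = (self.a * self.state) % self.m
        return self.state

    def next_uniform(self):
        """Return a pseudo-random float in (0, 1)."""
        return self.next_int() / self.m

prng = LehmerPRNG(seed=42)
print([prng.next_int() for _ in range(5)])
```

Re-run the last two lines with the same seed and you get the same sequence; change the seed and the sequence changes completely, even though every step is deterministic arithmetic.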
Two standard random number libraries
Modern PRNGs are significantly more sophisticated than the one above, but they have the same basic structure. Probably the two most widely-used random number generation libraries in Python are `random` (part of the Python Standard Library) and `numpy.random` (a submodule of NumPy). It's worth spending some time browsing the documentation for both in order to get a sense of the functions each makes available.
For our current purposes, we'll want to at least know two things.
How to seed each PRNG. Seeding a PRNG simply means initializing its seed state to a particular value. Often, if you do not choose a seed explicitly, the PRNG will be seeded with something semi-random, like the system time.
How to generate random real numbers in the interval $[0,1)$ with the uniform distribution. A random variable $X$ is said to be uniformly distributed on $[0,1)$ (sometimes written $X \sim U(0,1)$) if, for all $0 \le a \le b < 1$, the probability that $a \le X \le b$ is $b - a$:

$$P(a \le X \le b) = b - a.$$
The following code shows how to generate a bunch of random samples of such an $X$ and plot a histogram of the sample values. If the PRNG is good and if we have enough samples, the histogram should be pretty much flat across the interval $[0,1)$.
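A sketch of such an experiment, using NumPy's `default_rng` interface (the modern `Generator` API; older code would use `np.random.seed` and `np.random.rand`) and a quick text histogram in place of a matplotlib plot:

```python
import random
import numpy as np

# Seed both PRNGs so the results are reproducible.
random.seed(12345)                    # Python standard library PRNG
rng = np.random.default_rng(12345)    # NumPy PRNG

# Draw uniform samples on [0, 1) from each library.
samples_std = [random.random() for _ in range(100_000)]
samples_np = rng.random(100_000)

# A quick text "histogram": count samples in 10 equal-width bins.
counts, edges = np.histogram(samples_np, bins=10, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * (c // 500)}")
```

With 100,000 samples, each of the 10 bins should hold close to 10,000 samples, so the rows of `#` characters come out nearly equal in length. (A graphical version would simply be `plt.hist(samples_np, bins=10)` with matplotlib.)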
Flipping coins
With a working PRNG that can generate uniform values on $[0,1)$, it's now possible to simulate a coin flip. The probability of "heads" when flipping a fair coin is 1/2, and that's exactly the same as the probability that our uniform random $X$ is less than 1/2. Here's an example of how to use this fact.
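One way this might look (a sketch; the seed value is arbitrary and only there for reproducibility):

```python
import random

random.seed(0)  # seed so the example is reproducible

# One coin flip: "heads" exactly when the uniform sample is < 1/2.
flips = [random.random() < 0.5 for _ in range(10)]
print(flips)  # a list of True/False values
```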
In the output, we can think of `True` as "heads" and `False` as "tails". Suppose we want to count the number of heads. One option is to first convert the Boolean True/False into an integer. The reason this works is that `True`, when converted to an integer in Python, becomes the number 1, while `False` becomes 0. Then we can simply sum the entries in the resulting list to count the number of heads. Try running the code below several times. Since we are not re-seeding the PRNG, the results should appear random. You should expect to see "heads" roughly half the time.
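A sketch of the counting idea (unseeded, so each run differs):

```python
import random

# 1000 fair coin flips as a list of Booleans.
flips = [random.random() < 0.5 for _ in range(1000)]

# int(True) == 1 and int(False) == 0, so summing counts the heads.
num_heads = sum(int(f) for f in flips)
print(num_heads, "heads out of", len(flips), "flips")
```

In fact, Python's `sum` accepts the Booleans directly (`sum(flips)` gives the same count), since `bool` is a subclass of `int`.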
The code above provides an example of sampling the Binomial distribution. In order to sample a binomially distributed random variable $X$, we consider a sequence of $n$ independent experiments (coin flips), each with a probability $p$ of success (heads). The value of $X$ is the number of successful experiments out of the $n$ total. For example, suppose $n = 2$ and the probability of success is $p$. There are 3 possible values for $X$: 0, 1 or 2. The probability that $X = 0$ is the probability that both experiments fail. Since the experiments are independent, it follows that

$$P(X = 0) = (1-p)^2.$$

For $X = 1$, one of two things must happen: either the first experiment is successful and the second unsuccessful, or the first is unsuccessful and the second successful. Thus

$$P(X = 1) = p(1-p) + (1-p)p = 2p(1-p).$$

Finally, for the case $X = 2$, both experiments must be successful, so

$$P(X = 2) = p^2.$$

As a sanity check, notice that

$$(1-p)^2 + 2p(1-p) + p^2 = \big((1-p) + p\big)^2 = 1.$$
The following code cell shows how you could approximate this analytical result by simulation.
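One possible version of such a simulation, assuming the fair-coin case $n = 2$, $p = 1/2$ (where the exact probabilities are $1/4$, $1/2$, $1/4$):

```python
import random
from collections import Counter

def two_flips(p=0.5):
    """Perform n = 2 independent experiments; return the number of successes."""
    return sum(random.random() < p for _ in range(2))

trials = 100_000
counts = Counter(two_flips() for _ in range(trials))

# The empirical frequencies should be close to 0.25, 0.5, 0.25.
for k in (0, 1, 2):
    print(f"P(X = {k}) is approximately {counts[k] / trials:.3f}")
```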
The following function for sampling a binomial random variable is incorrect. Your task is to fix it. Do not use any library function other than uniform random sampling! NumPy provides a routine for sampling binomial variables, but the point here is to practice writing some Python code. You'll use NumPy's routine to check your own below. Since there is a (pseudo) random component to the algorithm, your experimental distribution won't match NumPy's exactly, but it should be very close.
Our first random graph: G(n,p)
Now we're ready to consider our first random graph model, the Erdős–Rényi model. (Actually, there are two subtly different Erdős–Rényi models; we'll be considering the one referred to as G(n,p).) Take a little time to read about the G(n,p) model on Wikipedia. This is one of the most famous random graph models. It has been widely studied, and it has a lot of interesting properties. Soon, we'll see how to create random networks using `networkx`, but first, here's a short piece of code that shows how to create the adjacency matrix for a G(n,p) network.
Here it is in action. (Try running this several times with different parameters.)
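A sketch of such a generator together with a small demonstration (the helper name `gnp_adjacency` is illustrative, not the notebook's original):

```python
import numpy as np

def gnp_adjacency(n, p, rng=None):
    """Build the adjacency matrix of a G(n, p) graph.

    Each of the n*(n-1)/2 possible edges is included independently
    with probability p, by flipping one biased coin per vertex pair.
    """
    if rng is None:
        rng = np.random.default_rng()
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:   # biased coin flip for edge {i, j}
                A[i, j] = 1
                A[j, i] = 1        # keep the matrix symmetric
    return A

A = gnp_adjacency(6, 0.4, rng=np.random.default_rng(3))
print(A)
```

Two sanity checks worth doing by eye: the matrix is symmetric (the graph is undirected), and the diagonal is all zeros (no self-loops).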
`networkx` can create networks from adjacency matrices.
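For example, with `nx.from_numpy_array` (in networkx releases before 3.0 you may instead see the now-removed `from_numpy_matrix`):

```python
import numpy as np
import networkx as nx

# A small symmetric adjacency matrix: a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

# from_numpy_array builds a Graph whose nodes are 0, ..., n-1.
G = nx.from_numpy_array(A)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```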
But `networkx` also has its own (possibly more efficient) method for creating G(n,p) graphs.
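A sketch of the built-in generator (the parameter values are arbitrary):

```python
import networkx as nx

# networkx's built-in G(n, p) generator.
G = nx.gnp_random_graph(20, 0.15)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# For sparse graphs (small p), fast_gnp_random_graph is usually quicker,
# since it avoids flipping a coin for every one of the n*(n-1)/2 pairs.
H = nx.fast_gnp_random_graph(20, 0.15)
```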
Sometimes, when you're testing your code, it's helpful to pass `networkx` a seed to ensure that you always create the same "random" graph. (Note that the default layout algorithm is also randomized. It's possible, but not quite as easy, to seed the PRNG in charge of layout.)
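A sketch of both kinds of seeding (in recent networkx versions, `spring_layout` accepts a `seed` argument directly):

```python
import networkx as nx

# Passing a seed makes the "random" graph reproducible:
# the same seed always yields the same edge set.
G1 = nx.gnp_random_graph(10, 0.3, seed=42)
G2 = nx.gnp_random_graph(10, 0.3, seed=42)
print(sorted(G1.edges()) == sorted(G2.edges()))  # True

# Seeding the layout PRNG makes the drawing reproducible too.
pos = nx.spring_layout(G1, seed=42)
```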
Collecting statistics on random graphs
If you look back at the Wikipedia article, you'll see that, for large G(n,p) graphs, the degree distribution on the vertices is approximately Poisson:

$$P(\deg(v) = k) \approx \frac{(np)^k e^{-np}}{k!}.$$

It's important to keep in mind that this is only for large graphs. For small graphs, the vertex degrees are strongly correlated. For example, in a G(2,p) graph, either both vertices have degree 0 or both have degree 1. For large graphs, there is still some correlation, but as $n \to \infty$, the vertex degrees can be treated as essentially uncorrelated.
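A sketch of an experiment checking this approximation, comparing empirical degree frequencies against the Poisson formula with mean $np$ (the particular values $n = 2000$, $p = 0.005$ are an arbitrary choice giving mean degree 10):

```python
import math
import networkx as nx

n, p = 2000, 0.005          # mean degree np = 10
G = nx.gnp_random_graph(n, p, seed=1)

degrees = [d for _, d in G.degree()]
lam = n * p

# Compare empirical frequencies with the Poisson approximation.
for k in range(5, 16):
    empirical = degrees.count(k) / n
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    print(f"k = {k:2d}: empirical {empirical:.3f}, Poisson {poisson:.3f}")
```

The two columns should agree to within sampling noise; the agreement improves as $n$ grows.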
In search of the giant component
Here's another fun fact about G(n,p) graphs. Obviously, if $p = 0$, the graph will definitely be disconnected, and if $p = 1$ the graph will definitely be connected. For $p$ somewhere in between 0 and 1 (depending on $n$), there is a transition between the graph being likely to be disconnected and being likely to be connected. (You can find the precise statement of this fact in various places.)
Here, we'll experimentally look at a related question. Suppose that we choose a positive constant $c$ and look at G(n,p) graphs with $p(n) = c/n$ and with $n$ large. For this choice of $p$, these graphs are very likely to be disconnected and, thus, will probably have several connected components. The question is, will the components all be fairly small, or will there be at least one really big component?
The following (incomplete) code attempts to address this question. For a range of and values, a G(n,p(n)) graph is created and (when the code is completed) the size of the largest component is measured. These data are then plotted on a series of scatter plots.
- Fix the code.
- You don't have to solve it exactly like I would, but you should be efficient.
- I was able to find the size of the largest component in a single line using the Python standard functions `len` and `max`, the NetworkX function `connected_components`, and Python list comprehension.
- It would be a very good idea for you to work out your code in a separate cell on some small test graphs to make sure you get it working correctly. This is a standard best-practice in programming: design your code in small chunks. You'll be much better able to verify that your code is correct if you keep it separate from the rest of the code below until you have it working the way you want.
- Explain the result.
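In the spirit of testing on small graphs first, here is one way the component-size one-liner can be checked before wiring it into the plotting code (the test graph is an arbitrary choice):

```python
import networkx as nx

# A small test graph with two components: a triangle and a single edge.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4)])

# connected_components yields the components as sets of nodes;
# taking the max of their sizes gives the largest component's size.
largest = max(len(c) for c in nx.connected_components(G))
print(largest)  # 3
```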
After your code is working, take a look at the plots. Do some digging in textbooks or on the internet and find the mathematical result that is demonstrated in this example. Beneath the box marked "Answer" below, write a paragraph or so describing what you find.
Remove this line and type your answer here