Uniformly distributed random numbers

I read somewhere that for C++11, the standard way to generate uniformly distributed random numbers is like this:

#include <random>

int main() {
    // Seed a Mersenne Twister engine with several values from std::random_device.
    std::random_device rd;
    std::seed_seq seed2{rd(), rd(), rd(), rd(), rd(), rd(), rd(), rd()};
    std::mt19937 gen(seed2);

    // uniform_int_distribution requires an integer type; <double> is not allowed here.
    std::uniform_int_distribution<int> uniform_distribution(1, 90);

    for (int i = 0; i < 1000000; ++i) {
        int number = uniform_distribution(gen);
    }
}

I tested this with a large batch of random numbers. I lumped the occurrences into 3 bins:
bin0(1-30) = 65490857
bin1(31-60) = 65509162
bin2(61-90) = 65499981
Is there any way to generate better, more uniform results?

I did a chi-squared test on those numbers, and the test failed to reject the hypothesis that those counts come from a uniform distribution. In other words, there is no statistical reason to think that the distribution of numbers is not uniform. There is no reason to expect all of the counts to be exactly the same, even with a very large sample: they are random numbers, and there is always some noise in random numbers.

If you need the counts in each bin to be exactly the same, what you can do is create a list of numbers that contains an equal number of each of the values you want, for example 1,1,2,2,3,3, then shuffle that list and take numbers from it. This only works if the total number of samples you want is a multiple of the number of possible values. In the example, with three values, you must sample a multiple of 3 in order to get an equal number of each value.
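As a rough illustration, a minimal C++ sketch of that shuffle approach could look like this (the value range, the number of copies, and the use of std::shuffle are just illustrative choices):

#include <algorithm>
#include <random>
#include <vector>

int main() {
    const int min_value = 1, max_value = 90;   // value range from the original example
    const int copies_per_value = 100;          // how many times each value should appear

    // Build a list that contains every value an equal number of times...
    std::vector<int> values;
    for (int v = min_value; v <= max_value; ++v)
        for (int c = 0; c < copies_per_value; ++c)
            values.push_back(v);

    // ...then shuffle it and draw numbers from the front.
    std::random_device rd;
    std::mt19937 gen(rd());
    std::shuffle(values.begin(), values.end(), gen);

    // values[0], values[1], ... now yield exactly copies_per_value of each value overall.
}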

Thanks very much for your reply. First, I have to say that I am not really a mathematician. The main reason I raised this question is that the differences in the occurrences of the numbers (a-b) in uniform_distribution(a, b) do not seem to converge or get smaller and smaller as the number of samples gets larger and larger. I did not formally tally the stats, but I ran the code a number of times with widely different numbers of samples, and the variation in the outcome does not get smaller percentage-wise.
I first ran the code without separating the numbers into bins, and it was the same. I separated them into bins just in case there was some issue with lower occurrences for the initial and ending values in the [a,b] range, and to even out any irregularities I was not aware of and get closer occurrence counts for each bin.

Your point about the proportional difference not decreasing is an important one, because it brings up one of the things that people have a hard time understanding about random numbers. Statistical intuition would suggest that as the sample size gets larger, the sample should more closely reflect the population of numbers being sampled from (the uniform distribution in this case). The problem is that this is only correct on average, across a whole bunch of samples. I’ll try to illustrate this with the following example:

I took 1 million samples from a uniform distribution on the interval (0,1). Then, for the first 10,000 samples, I found out what proportion of them were less than 1/3 (I would expect it to be about 1/3). I then did this at different numbers of samples: the first 10,100, 10,200, etc. until I had done this for all of the samples. The result is plotted below.

As you can see by looking at the scale of the y axis, the proportion less than 1/3 is about 1/3, as expected (I specifically didn’t do less than 10,000 because it’s too variable there). Note that I plotted the x-axis in log scale and that 1/3 is indicated by the horizontal dashed line. You can see that as the number of samples increases, the proportion less than 1/3 stabilizes. However, one interesting thing is that it is not always moving toward 1/3. For example, at about 100,000 samples (1e+05), the proportion is fairly accurate. However, at around 316,000 samples, it has become relatively inaccurate, before becoming more accurate again toward 1 million samples.
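If you want to reproduce that kind of running-proportion check yourself, a minimal C++ sketch of the same idea would be something like the following (my plots were made in R, as I mention below; the checkpoint spacing and plain-text output here are just illustrative):

#include <iostream>
#include <random>

int main() {
    std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    const int total_samples = 1000000;
    long below_third = 0;

    for (int i = 1; i <= total_samples; ++i) {
        if (dist(gen) < 1.0 / 3.0)
            ++below_third;

        // Print the running proportion every 100 samples from 10,000 onward,
        // mirroring the checkpoints described above.
        if (i >= 10000 && i % 100 == 0)
            std::cout << i << '\t' << static_cast<double>(below_third) / i << '\n';
    }
}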

One thing to take away from this is that 1 million is not really large enough to get extremely accurate proportions, even though this is what we intuitively expect.

On average, however, if you have 1 million samples, the proportion will be more accurate than if you have 10,000 samples. But for any one, two, ten, or however many particular series of numbers, those series may not be more accurate at 1 million than at 10,000. You might need to average across a large number of series before you can see that 1 million samples is more accurate than 10,000. My point is that looking at just one series of numbers, even if that series has 10 million samples in it, can be misleading.

One of the reasons that statistical intuition doesn’t work that well in this case is the (correct) idea that the proportion will get more accurate as the sample becomes arbitrarily large. In particular, if the sample is infinitely large, exactly 1/3 of the numbers will be less than 1/3. This is where it breaks down: for a human, 1 million is indistinguishable from infinity, but there really is a big difference between the two.

Another thing I noticed and forgot to mention before is that, from the results I ran earlier, the occurrences seem somewhat skewed. As can also be seen from your plot, the proportion less than 1/3 is consistently lower than 1/3 but never gets above it. That is quite strange; one would expect it to swing to either side.

Yeah, the staying on one side seems funny. I’ve plotted a bunch of different series of numbers, where you can see that some series tend to stay on particular sides, but not necessarily the same side for each series. Some bounce back and forth a little.

Part of the reason that a single series might stay on one side is its history: whatever imbalance it happens to pick up at the start persists, because later numbers can only dilute it, not cancel it. For example, if the first 10,000 samples happen to have a proportion of 0.30 below 1/3, then even if exactly 1/3 of every later stretch of samples falls below 1/3, the running proportion only creeps back toward 1/3 and never crosses to the other side. In order to counteract its history, the series would need to get a non-uniform stretch of numbers, biased away from that history.

I should mention that the numbers I sampled were from R, not C++, but the same principles should apply.

I wonder whether, for a single plot (say, the first plot you show in black), you are using one fixed random seed throughout, or whether the seed is different for each point on the x-axis. I mean, is the random seed for 31623 the same as for 316228, and likewise for all the other points on the x-axis? If the seeds are all different, then it is quite strange.

What concerns me about the properties of these random numbers is this: the way we usually use this type of random number is, for example, to draw 1000 random points in a 1000x1000 pixel window, so we would generate 1000 random numbers. 1000 is not a large number, and I am OK with those numbers possibly clustering in some parts of the window. But one would normally expect that every time I run the program, it would give another set of 1000 random numbers, skewed in another way, not skewed consistently in one particular direction.

Another way I tested it was by using the chaos game to generate the Sierpinski triangle (the code is very simple). The result it generates is still quite OK, but it seems somewhat not as good as the ones shown in textbooks (where the authors say they used some special method to create uniform random numbers). That’s why I am not sure whether there is some proper way of using the C++ random generator that I need to be aware of.
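The chaos game itself is simple enough to sketch; something along these lines (an illustrative version, not my exact code):

#include <iostream>
#include <random>

int main() {
    // Vertices of the triangle.
    const double vx[3] = {0.0, 1.0, 0.5};
    const double vy[3] = {0.0, 0.0, 0.866};

    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> pick(0, 2);

    double x = 0.25, y = 0.25;        // arbitrary starting point
    for (int i = 0; i < 100000; ++i) {
        int k = pick(gen);            // choose a random vertex
        x = (x + vx[k]) / 2.0;        // move halfway toward it
        y = (y + vy[k]) / 2.0;
        if (i > 20)                   // skip the first few transient points
            std::cout << x << ' ' << y << '\n';
    }
}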

For the plots I made, the random seed was fixed within each series, but it differs between series. So at a single x point, like 31623, the seeds are different for each series. I don’t understand the point you bring up about the series skewing the same way: different series skew in different ways. I plotted 10 (or 11?) series, and while most of them happen to skew high, 10 is a small sample, so it’s hard to come to clear conclusions. Do you mean that, because some of the series are skewed over 1 million samples, each block of 1000 samples might be skewed in the same direction? In that case, see the plot below.

In this plot, I’ve taken 1 million samples from a uniform distribution on [0,1], then divided them into 1000 smaller samples, each containing 1000 numbers. For each of those I’ve taken the proportion of the 1000 numbers that are less than 1/3, which is the x-axis of the histogram. The solid vertical line is what the mean should be (1/3), and the dashed vertical line is what the mean actually is (just under 1/3). This means that over the 1 million samples, it was biased a little low. What I want to draw your attention to is the huge amount of variability: any given block of 1000 samples taken from the larger series has almost the same probability of being above the mean as below it.
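If you want to run the same block-wise check in C++, a minimal sketch (the block size and plain-text output are just illustrative choices) would be:

#include <iostream>
#include <random>

int main() {
    std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    const int blocks = 1000, block_size = 1000;
    for (int b = 0; b < blocks; ++b) {
        int below_third = 0;
        for (int i = 0; i < block_size; ++i)
            if (dist(gen) < 1.0 / 3.0)
                ++below_third;
        // One proportion per block; a histogram of these shows the spread.
        std::cout << static_cast<double>(below_third) / block_size << '\n';
    }
}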

I don’t know of anything specific about how to use the C++ random number facilities that you need to be aware of. I have run into bugs in the Microsoft standard library implementations of some of the random number machinery, but I think those are fixed now. Other standard library implementations are probably pretty bug-free at this point as well. The random number generation should work as expected without needing to be used in any special way; in my experience, it mostly works well.

Thanks for your plots, they are very informative. What I mean is, say for the first plot in black, I was expecting that the random sequence would sometimes swing above 0.333, and that the swing magnitudes, both above and below, would get smaller and smaller as the sample size gets larger and larger. That is what I, at least, would expect when using a random sequence. But from your plot, obviously that is not the case. Anyway, thanks for your plots.