Generating Normally Distributed Random Numbers – Sample Data that Makes Sense

Like many of you, I’m often asked to develop mockups (working demos of dashboards and reports).

In my zeal to make my mockup as realistic as possible, I spend a good bit of time generating sample data.

.

For example, if I’m mocking up an HR dashboard, I’ll dedicate some brain power to generating salary data. Now in order to make salary data realistic, there has to be a believable mix of high salary and low salary records. I can’t just use the RAND and RANDBETWEEN functions Excel provides. These functions generate random data in a “Uniform Distribution”. That is to say, every number has an equal chance of occurring with an equal frequency.

.

Let’s say I use this formula and copy it down to 500 rows.

=RANDBETWEEN(1000,200000)

I instantly create 500 salary numbers, but there’s a problem.

Plotting my newly created data in a histogram reveals the issue.

In this chart, the columns represent the count of records that fall within each salary range. Notice that there are just as many records in the 191k-200k range as in the 51k-60k range. This is definitely not something that holds true in the real world.

.

In reality, salary data has a distribution closer to the one shown in the histogram below.

There are lots people with salaries between 40k and 80k; and far fewer with salaries above 150k.

So if the RAND and RANDBETWEEN functions fail, how does an aspiring Excel nerd easily create sample data that makes sense for its intended purpose?

.

Well, after years going through the pain of creating sample data myself, I finally had enough and pulled together this function called RandNormalDist. This function, based on code by Donald Knuth, allows you to automatically generate random data that falls within a normal distribution. You can copy the code shown here and paste it into a Standard Module.

.

Function RandNormalDist(Optional sMean As Double = 1, Optional sDeviation As Double = 1, Optional sPosSkew As Single = 0)
‘Mike Alexander: www.datapigtechnologies.com
‘Based on code originally published by Donald Knuth
‘***************************************************************************
‘This function will generate a random number that falls within a normal distribution based on a starting Mean and Standard Deviation.

‘You can call this function from an Excel Spreadhseet by entering
‘=RandNormalDist(sMean, sDeviation, sPosSkew)

‘Example Usage 1:
‘This generates a random number that falls within a distribution where the Mean is 20
‘=RandNormalDist(20)

‘Example Usage 2:
‘This generates a random number that falls within a distribution where the Mean is 20 and the Standard Deviation is 3
‘ =RandNormalDist(20, 3)

‘Example Usage 3:
‘This generates only a POSITIVE random number that falls in a distribution where the Mean is 20 and the Standard Deviation is 3
‘ =RandNormalDist(20, 3, 1)
‘***************************************************************************
Dim x As Double
Dim y As Double
Dim z As Double
Dim m As Double

‘Calculate initial coordinate pairs
Start:
z = 2
Do Until z < = 1 x = 2 * Rnd - 1 y = 2 * Rnd - 1 z = x ^ 2 + y ^ 2 Loop 'Do the math to get base multiplier z = Sqr((-2 * Log(z)) / z) 'Account for sPosSkew Positive request and factor in the other user inputs If sPosSkew = 1 And (sDeviation * x * z + 1) < 0 Then GoTo Start End If RandNormalDist = Int(((sDeviation * x * z + 1)) * sMean) End Function [/vb]

.

Once you have the code in your workbook, you can call the function by entering it into your spreadsheet.

The idea is to enter the desired formula and copy it down to as many samples as you need.

.

This function accepts three optional arguments:

Mean: The average value you would like your sample data to have. If omitted, the function uses 1 as the Mean.

Standard Deviation: The number of standard deviations you would like your sample data values to be from the Mean. If omitted, the function uses 1 as the standard deviation.

Skew Positive: A tag that tells the function to only return positive numbers. If omitted, the function includes negative numbers in the resulting random numbers.

.

Here are a few examples of using the RandNormalDist function.

.

Entering RandNormalDist(20) will give you a tight group of numbers that fall around the number 20. As you can see, both negative and positive numbers are returned in a normal distribution.

.

Entering RandNormalDist(20,3) will give you a more dispersed (spread out) set of numbers. Again, you will see both negative and positive numbers returned in a relatively normal distribution.

.

Entering RandNormalDist(20,2,1) will give you a distributed group of ONLY POSITIVE numbers that taper away from the Mean. This kind of distribution is ideal for creating sample data where the numbers need to start around a Mean and move into a handful of higher positive values. This is ideal for generating random samples such as Salary data, Commission data, Help Desk Ticket data, etc.

.

This function has saved me tons of time and made my mock data so much better.

Hopefully it will do the same for you!

Bonus: “Yo momma” joke for Statisticians:

12 thoughts on “Generating Normally Distributed Random Numbers – Sample Data that Makes Sense

  1. David Henderson

    Mathematician, thanks for the neat formulae!. Always been wondering how to generate random no’s in Excel according to Normal, Poisson etc distributions.

  2. David Henderson

    Aha! Just found another way to generate random no’s in Excel.

    If you go to Data…then Data Analysis… then click on Random Number Generation. It then brings up the inbuilt Excel Add-in to generate random no’s according to whatever distribution you want.

    Only trouble is, that you can’t recalculate (F9) the generated no’s.

  3. datapig Post author

    Mathematician: I know about the NORM.INV option, but it doesn’t suit my needs for mock data. For example, if I wanted to generated random data for Salaries, I could enter NORM.INV(RAND(),50000,1) and copy that down 100 times. But that generates a set of numbers with a max of 500002 and a min of 49997. Yes, it’s staticially what I asked for, but it’s not the “real world” data I need. Even if I kick up the standard deviations to 5, NORM.INV(RAND(),50000,5), I get a max of 50011 and 499987.

    Mathematically that is correct, but again, the numbers generated by NORM.INV don’t do anything for me.

    In addition to getting me the random numbers I’m looking for, my function also has the ability to exclude negatives.

  4. Kevin Lehrbass

    This is really useful. Thanks Mike. I often need to generate random datasets for training.
    Excel’s Data Analysis Random Number Generator is also great when you’re in a hurry. If you use the ‘Discrete’ distribution you can assign probabilities to text values which will give you some variety.
    I make a video about this a few months ago: http://youtu.be/POroJY4jkhA

  5. Charlie

    @datapig,

    What if you increased the std dev – as in NORM.INV(RAND(),50000,50000)

    You probably want a different distribution, but it looks like it is close to what you want

  6. Peter Bedson

    Neat code and produces what you are looking for for test data purposes – but I think the distribution is actually a normal distribution that is truncated and not skewed: you have simply discarded negative values and replaced these with another sample from the same distribution. So if for example you wanted to simulate data for investment bankers salaries with a mean of $1,000,000 you would still get some very low (but non-negative) salaries or if you wanted to simulate processing times you woud get some with near zero time – in these circumstances, effectively you would get a normal distribution.

    You could extend the function by specifying a minimum value (say minimum wage or minimum possible processing time) and effectively moving the whole distribution up and down to respect the minimum while keeping the mean which would then probably “look” OK but would still be a truncated distribution.

  7. Carlos Barboza

    Hi Mike, I tried using the copy and paste the code but had no luck:

    Function RandNormalDist(Optional sMean As Double = 1, Optional sDeviation As Double = 1, Optional sPosSkew As Single = 0)

    ‘Mike Alexander: http://www.datapigtechnologies.com
    ‘Based on code originally published by Donald Knuth
    ‘***************************************************************************
    ‘This function will generate a random number that falls within a normal distribution based on a starting Mean and Standard Deviation.

    ‘You can call this function from an Excel Spreadhseet by entering
    ‘ = RandNormalDist(sMean, sDeviation, sPosSkew)

    ‘Example Usage 1:
    ‘This generates a random number that falls within a distribution where the Mean is 20
    ‘ = RandNormalDist(20)

    ‘Example Usage 2:
    ‘This generates a random number that falls within a distribution where the Mean is 20 and the Standard Deviation is 3
    ‘ = RandNormalDist(20, 3)

    ‘Example Usage 3:
    ‘This generates only a POSITIVE random number that falls in a distribution where the Mean is 20 and the Standard Deviation is 3
    ‘ = RandNormalDist(20, 3, 1)

    ‘***************************************************************************

    Dim x As Double
    Dim y As Double
    Dim z As Double
    Dim m As Double

    ‘Calculate initial coordinate pairs

    Start: z = 2

    Do Until z < = 1 x = 2 * Rnd – 1 y = 2 * Rnd – 1 z = x ^ 2 + y ^ 2 Loop

    'Do the math to get base multiplier z = Sqr((-2 * Log(z)) / z)
    'Account for sPosSkew Positive request and factor in the other user inputs

    If sPosSkew = 1 And (sDeviation * x * z + 1) < 0 Then GoTo Start

    End If RandNormalDist = Int(((sDeviation * x * z + 1)) * sMean)

    End Function

    Please help!

  8. amir hossein mojarrad

    Function RandNormalDist(Optional sMean As Double = 1, Optional sDeviation As Double = 1, Optional sPosSkew As Single = 0)

    Dim x As Double
    Dim y As Double
    Dim z As Double
    Dim m As Double

    Start:
    z = 2
    Do Until z <= 1
    x = 2 * Rnd – 1
    y = 2 * Rnd – 1
    z = x ^ 2 + y ^ 2
    Loop

    z = Sqr((-2 * Log(z)) / z)

    If sPosSkew = 1 And (sDeviation * x * z + 1) < 0 Then
    GoTo Start
    End If

    RandNormalDist = Int(((sDeviation * x * z + 1)) * sMean)

    End Function

    Sub randnormal()

    End Sub

  9. amir hossein mojarrad

    IN EXCEL YOU SHOULD PRESS ALT+F11

    new
    module
    paste it
    click on run

    if asked enter a name for it

Leave a Reply

Your email address will not be published. Required fields are marked *