Thursday, September 19, 2024

Probability that a value selected will be the maximum value in a uniform distribution

Copyright Carl Janssen 2024

Warning I am writing this to think about different possibilities and do not claim that the information presented is true

Title : Probability that a value selected will be the maximum value in a uniform distribution

When a sample size of n is collected with replacement from a population with a discrete uniform distribution including m possible values the probability that maximum value in the population is not sampled is

[ ( m - 1 ) / m ] ^ n

This is not the same as the probability of a value being selected that is higher than the value in the original sample with a sample size of n if the population is sampled from a second time with a sample size of 1

Example to explain why

Let us say the population has a uniform distribution of all whole numbers starting at 1 and ending at 10

Highest value in sample, missing higher numbers, Probability that a second sample with a sample size of 1 would achieve a higher result than the first value

10, no higher numbers, probability 0
9, higher numbers 10, probability 10%
8, higher numbers 8, 9 probability 20%
7, higher numbers 7, 8, 9 probability 30%
2, higher numbers 3, 4, 5, 6, 7, 8, 9, 10 probability 80%
1, higher numbers 2, 3, 4, 5, 6, 7, 8, 9, 10 probability 90%
10 - x, higher numbers 10 -x, 10 - x + 1, 10 - x + 2... 10-2 , 10-1 , 10 , probability x*10% 

Although this enables someone to estimate the probability that upon a second sample with a sample size of 1 there will be a value selected that is higher than the first sample if the highest value in the first sample is known it does not give the probability that there will be a higher value in the second sample that is greater than highest value in the first sample if the test in the first sample has not yet been done and the highest value in the first sample is not yet known.  

Such a thing would be desirable to correct problems with the empirical distribution function that mistakenly assigns a 0% probability of numbers being lower than the lowest number sampled and a 0% probability of numbers being higher than the highest number sampled which is not necessarily true.  

However such a correction would be more difficult because we do not necessarily know if the population that will be sampled by the empirical distribution function has a uniform distribution, if it is discrete or continuous or what it's range is and this method only works to estimate it when the population is discrete and has a known range, unless further things are done to modify things to take into account what is unknown beyond this starting point.

https://en.wikipedia.org/wiki/Empirical_distribution_function

Example

population only has values of whole numbers 1 through 4 and m = 4

sum of
( 4 - 4 ) / 4 * probability that highest number is 4 for sample size of n
( 4 - 3 ) / 4 * probability that highest number is 3 for sample size of n
( 4 - 2 ) / 4 * probability that highest number is 2 for sample size of n
( 4 - 1 ) / 4 * probability that highest number is 1 for sample size of n
equals
probability of a value being selected that is higher than the value in the original sample with a sample size of n if the population is sampled from a second time with a sample size of 1

Probability that the maximum value sampled from the population is not the same as the maximum value in the population

[ ( 4 - 1 ) / 4 ] ^ n = ( 3 / 4 ) ^ n

Probability that highest number is 4 for sample size of n
1 - [ ( m - 1 ) / m ] ^ n
1 - [ ( 4 - 1 ) / 4 ] ^ n
1 - ( 3 / 4 ) ^ n
1 - ( probability that a single number is 3 or lower ) ^ n

Probability that highest number is 3
1 - probability that highest number is 4 - ( probability that a single number is 2 or lower ) ^ n
1 - ( 1- ( 3 / 4 ) ^ n ) - ( 2 / 4 ) ^ n
( 3 / 4 ) ^ n - ( 2 / 4 ) ^ n

Probability that highest number is 2
1 - probability that highest number is 4 - probability that highest number is 3 - ( probability that a single number is 1 ) ^ n
( 2 / 4 ) ^ n - ( 1 / 4 ) ^ n

Probability that a single number is 1
( 1 / 4 ) ^ n

Sum of probabilities to check if they add up to 1

1 - ( 3 / 4 ) ^ n + ( 3 / 4 ) ^ n - ( 2 / 4 ) ^ n + ( 2 / 4 ) ^ n - ( 1 / 4 ) ^ n + ( 1 / 4 ) ^ n = 1


Probability that a second sample with a sample size of 1 will give a higher result than the highest value of a first sample with a sample size of n


Sum of

( 4 - 4 ) / 4 * [ 1 - ( 3 / 4 ) ^ n ]
( 4 - 3 ) / 4 * [ ( 3 / 4 ) ^ n - ( 2 / 4 ) ^ n ]
( 4 - 2 ) / 4 * [ ( 2 / 4 ) ^ n - ( 1 / 4 ) ^ n ]
( 4 - 1 ) / 4 * ( 1 / 4 ) ^ n

equals
Probability that a second sample with a sample size of 1 will give a higher result than the highest value of a first sample with a sample size of n
equals
[ 1 / 4 ] * [ ( 3 / 4 ) ^ n + ( 2 ^ 4 ) ^ n + ( 1 / 4 ) ^ n ]

I removed the exponents so this is not the same but shows the idea of how parts cancel out or technically it is the same when the exponents are 1 but they cancel out in the same way with other values for the exponents.

0+1*( 3 / 4 ) - 1*(2/4)+2*(2/4)-2*(1/4)+3*(1/4)=3/4+2/4+1/4

when n = 1 this is at it's highest for all permitted n values as n => 1

3+2+1 = 6
6 / 4 = 1.5
0 < 1.5 / 4 = 3 / 8 < 0.5 < 1 

if it was greater than 0.5 it would be a problem because we have to consider symmetry of testing for how often a sample size of 1 in the second sample is lower than the lowest value of the first sample

Checking results
Sum of products
Value, Frequency, probability of next sample being higher, product
1, 1/4, 3/4, 3/16
2, 1/4, 2/4, 2/16
3, 1/4, 1/4, 1/16
4, 1/4, 0, 0
Equals
6 / 16 = 3 / 8

Final Answer

From i = 1 to i = m - 1 where i changes by 1
take the sum of
[ 1 / m ] * ( i / m ) ^ n

Check to make sure this never exceeds 1 / 2 for m  => 2 it is at max when n = 1 for n => 1
when n = 1
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )
equals
[ ( m - 1 ) * m / 2 ] / [ m * m ] = ( m - 1 ) / ( 2 m ) = 0.5 - ( 1 / 2 m ) < 0.5
approaches 0.5 from below as m approaches positive infinity

This value of approaching one half from below makes sense because it should be split into two parts one above the sample value and the other below the sample value except for a very narrow third part of the single sample value and on average there is symmetry and the portion excluded gets smaller and smaller as sample size gets larger and larger


1 + 2= 3 = [ 2 * 2 + 2 ] / 2 = ( 2 * 3 ) / 2
1 + 2 + 3 = 6 = [ 3 *3 + 3 ] / 2 = ( 3 * 4 ) / 2
1 + 2 + 3 + 4 = 10 = [ 4 * 4 + 4 ] / 2 = ( 4 * 5 ) / 2

For other values of n and m

when n = 1
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n














when n = 2
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n


https://calculator-online.net/riemann-sum-calculator/









when n = 3
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n


when n = 3
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n


when n = 4
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n


when n = 9
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n


when n = 19
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n




when n = 24
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n



when n = 49
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n



when n = 99
From i = 1 to i = m - 1 where i changes by 1
take the sum of 
[ 1 / m ] * ( i / m )^n






Computer estimated results very close to 1 / ( n + 1 ) for m = 250

Right and Midpoint Riemann sum overestimated for n = 1 so used left Riemann sums primarily as defined by however the computer system did it

when n = 2 as m increased the calculated result increased and got closer to 1 / ( n +1 ) from below

This was close to previous assumption by dividing sample values into sections that form the gaps between nearest neighbors and assuming equal probability of falling into each gap range and having either n-1 or n+1 gaps depending on if you consider the area above the highest selected value and below the lowest selected value to be a gap or not.  Although I guessed that I did not have a formal mathematical way to show my guess was reasonable before doing this.

Only tested for m up to 250 as it seemed to slow down or not work well for m of 1000 the one time I tried it
250 is nearest factor of 1000 that is below 256 or 255 so it was chosen

No comments:

Post a Comment

Special Relativity Experiments short

 Copyright Carl Janssen 2024 I do not want to delete this content or edit it to remove things but I am not going to finish it.  I will copy ...