Copyright Carl Janssen 2024
Warning: I am kind of making stuff up here to think through possibilities of how things might work. Do not use this on statistics tests or homework at an accredited university.
This is out of order because I wrote a list of things to do at the front, then started doing them, and I also wrote about observations from Wikipedia and other sources before the original introduction and the material I was doing earlier, and there is a second list of things to do at the end. I might later simply redo this in a shorter version.
In order to solve the problem of repeat values, replace the repeat values with two distinct values right next to the original value
For example 2, 2 could be replaced with 1.99 and 2.01, as long as all other values are farther from 2 than those replacements by a large margin, like maybe 100, but it is better to replace them with the limits of 2 - x and 2 + x as x approaches 0. Another possibility, if uniform distribution functions are used to fill the space in between values, is to multiply the uniform distribution functions to the left and right of the repeated value by the number of times that value occurs: if the value 2 occurs twice instead of once then each is multiplied by 2, and each time this is done it can be counted as another gap between values. Each gap between values is presumed to have an equal probability, equal to one divided by the number of gaps between values, of the value falling anywhere within the range of that gap, if the regions beyond the maximum and minimum numbers observed (which I call tails in this context) are ignored or presumed to have zero probability. There are n - 1 gaps between n values if there are no repeat values and the spots I call tails are ignored, but if the tails are counted as gaps then there are n + 1 gaps. The probability of falling on a specific gap that is not a tail might be ( 1 - probability of falling on a tail ) / ( n - 1 )
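The gap-counting rule above can be sketched in a few lines; gap_probability is a made-up name for illustration, and this sketch just collapses repeats to distinct values and counts gaps rather than substituting nearby values.

```python
def gap_probability(values, count_tails=False):
    """Probability assigned to any one gap between sorted distinct values."""
    distinct = sorted(set(values))     # repeat values collapse to one point
    gaps = len(distinct) - 1           # n - 1 gaps between n distinct values
    if count_tails:
        gaps += 2                      # count the two tails as extra gaps
    return 1 / gaps

print(gap_probability([1, 2, 3, 4]))        # 1/3 with tails ignored
print(gap_probability([1, 2, 3, 4], True))  # 1/5 with tails counted
```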
To do
Explain possible models to make it more continuous
Fill in the remaining space with uniform distribution probability density functions, except at the two tail ends, such that the probability of achieving a value between any two values that were found in the sample is modeled as being the same, but the cumulative probability density function looks continuous instead of like steps. This should result in the same cumulative probability values at values that match the sample values, but different values in between than the step version would predict.
Emphasize that the remaining space that was filled with uniform distribution probability density functions could also be filled in with other functions, as long as this results in a cumulative probability density function that is the same at the values that were in the sample as it would be when the function was treated as having steps instead of being continuous
Look into setting the cumulative probability for the variation with two symmetric tails by comparing two options, with no replicated values and a sample size of n
Option 1
There are n + 1 sections, each with a probability of 1 / ( n + 1 ) of landing in that section. Smooth out with uniform distribution probability density functions to make the data continuous, so that there are no longer discontinuities in the cumulative probability density function at the values that were in the sample
Such that the cumulative probability ( as calculated from left to right ) of the value corresponding to index i equals i / ( n + 1 )
This still does not resolve the probability density function at the tails, but it lets you know the cumulative probability of landing anywhere on the unlimited range in one direction of the left tail as 1 / ( n + 1 ). The left tail cannot have a uniform probability density function if it goes on forever in one direction
Option 2
Set the value of the cumulative probability for the value corresponding to index i as ( the number of points to the left of it plus 1 ) divided by ( the number of points to the left of it plus 1, plus the number of points to the right of it, plus 1 ). The plus 1 terms sometimes represent including the probability of landing on a tail.
i -1 points are to the left of it
n - i points are to the right of it
( i - 1 + 1 ) / ( i - 1 + 1 + n - i + 1 ) = i / ( n + 1 )
Done: Option 2 gives the same result as Option 1 when the tails are counted and no sample values are replicated
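The algebra on the line above can also be checked mechanically with exact fractions; option2_cdf is a made-up name, and the sketch assumes no replicated values.

```python
from fractions import Fraction

# Option 2 with no replicates: (points left + 1) / (points left + 1 + points right + 1).
# This should reduce to Option 1's i / (n + 1) for every index.

def option2_cdf(i, n):
    left = i - 1                # i - 1 points are to the left of index i
    right = n - i               # n - i points are to the right of index i
    return Fraction(left + 1, (left + 1) + (right + 1))

for n in range(1, 20):
    for i in range(1, n + 1):
        assert option2_cdf(i, n) == Fraction(i, n + 1)
print("Option 2 matches i / (n + 1) for every i and n tried")
```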
Compare possibilities for dealing with replication of values within the sample, as to how the different options would affect the model
Definitions of variables for option 1 and option 2 when replicated values occur
there are n different values in the sample with M total values replicated
How you count replication is important and there are different conventions; I choose the following convention because it allows me to get the sample size easily by adding M + n
if a value occurs twice it is only replicated once
if a value is replicated it counts toward the count of n different values but is only counted one time
q ( i ) equals the number of times the value at index i occurs
if a value occurs q ( i ) times then m ( i ) = q ( i ) - 1 and q ( i ) = m ( i ) + 1
the total number of values taken in the sample is M + n
Example: 2, 2, 2, 3, 4, 5 gives M = 2, n = 4, M + n = 6, and q = 3 for the value 2
values are ordered by index from lowest to highest
when identical values occur there are not multiple indexes; instead the number of times the value occurs at that index is q ( i ) and the number of times it is replicated is m ( i )
the index i = 1 represents the lowest value
the index i = n represents the highest value
M = sum of all m(i)
Q = sum of all q(i)
m(i) = number of times the value at that index is replicated in the sample = q(i) - 1
q(i) = number of times the value at that index occurs in the sample
c(i) = cumulative probability of i
c ( 1 ) = c ( i = 1 )
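The bookkeeping definitions above can be checked directly on the earlier example sample; the variable names here just mirror the definitions (q, m, M, n).

```python
from collections import Counter

# Checking the replication bookkeeping on the example sample 2, 2, 2, 3, 4, 5.
sample = [2, 2, 2, 3, 4, 5]
q = Counter(sample)                           # q(i): occurrences per distinct value
n = len(q)                                    # number of different values
m = {v: count - 1 for v, count in q.items()}  # m(i) = q(i) - 1
M = sum(m.values())                           # M = sum of all m(i)

print(n, M, M + n)   # 4 2 6: adding M + n recovers the sample size
assert M + n == len(sample)
assert q[2] == 3 and m[2] == 2   # the value 2 occurs 3 times, replicated twice
```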
Option 1
for the value corresponding to index 1 the cumulative probability is
probability of landing on one tail is 1 / ( 1 + M + n )
c (1) = [ 1+m(1) ] / ( 1 + M + n )
c ( n ) = 1 - [ 1 + m ( n ) ] / ( 1 + M + n ). If this is not true then calculating from right to left is not the same as from left to right and I should reject using option 1
for i > 1
c ( i ) = c ( i - 1) + [ 1+m( i ) ] / ( 1 + M + n )
for i < n
c ( i + 1 ) = c ( i ) + [ 1+m( i + 1) ] / ( 1 + M + n )
Compare option 1 calculating the cumulative probability from left to right as opposed to from right to left. The left-to-right value plus the right-to-left value for each index should add up to 1, otherwise this method should be abandoned. The above calculations are only from left to right and not from right to left
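Here is a sketch of that left-to-right calculation together with the consistency check above, using exact fractions; option1_cdf is a made-up name. The second example shows the check failing when the maximum value is replicated.

```python
from fractions import Fraction
from collections import Counter

# Option 1 left to right: c(i) = c(i - 1) + (1 + m(i)) / (1 + M + n),
# then the check: c(last) should equal 1 - (1 + m(last)) / (1 + M + n).

def option1_cdf(values):
    q = Counter(values)
    distinct = sorted(q)
    n, M = len(q), sum(count - 1 for count in q.values())
    denom = 1 + M + n
    c, running = [], Fraction(0)
    for v in distinct:
        running += Fraction(1 + (q[v] - 1), denom)  # add (1 + m(i)) / (1 + M + n)
        c.append(running)
    m_last = q[distinct[-1]] - 1
    consistent = c[-1] == 1 - Fraction(1 + m_last, denom)
    return c, consistent

print(option1_cdf([2, 2, 2, 3, 4, 5])[1])  # True: the maximum is not replicated
print(option1_cdf([2, 3, 4, 5, 5, 5])[1])  # False: the maximum is replicated
```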
Option 2
c(i) = ( 1 + the number of things to the left of it ) divided by ( 1 + the number of things to the left of it + 1 + the number of things to the right of it )
Probability of landing anywhere in entire range of all of left tail value = 1 / ( 1 + M + n )
c ( 1 ) = 1 / ( 1 + M + n - 1 - m (1) )
c ( 1 ) = 1 / ( M + n - m (1) )
c ( n ) = 1 / ( 1 + M + n - 1 - m ( n ) )
c ( n ) = 1 / ( M + n - m ( n ) )
for i < n
c ( i + 1 ) = c ( i ) + 1 / ( 1 + M + n - 1 - m ( i + 1 ) )
c ( i + 1 ) = c ( i ) + 1 / ( M + n - m ( i + 1 ) )
for i > 1
c ( i ) = c ( i - 1 ) + 1 / ( 1 + M + n - 1 - m ( i ) )
c ( i ) = c ( i - 1 ) + 1 / ( M + n - m ( i ) )
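The Option 2 recursion as written above can be sketched with exact fractions; option2_cdf_replicates is a made-up name.

```python
from fractions import Fraction
from collections import Counter

# Option 2 with replicates: c(1) = 1 / (M + n - m(1)) and
# c(i + 1) = c(i) + 1 / (M + n - m(i + 1)), exactly as written above.

def option2_cdf_replicates(values):
    q = Counter(values)
    distinct = sorted(q)
    n, M = len(q), sum(count - 1 for count in q.values())
    m = [q[v] - 1 for v in distinct]
    c = [Fraction(1, M + n - m[0])]
    for i in range(1, n):
        c.append(c[-1] + Fraction(1, M + n - m[i]))
    return c

c = option2_cdf_replicates([2, 2, 2, 3, 4, 5])
print(c)             # the values 1/4, 5/12, 7/12, 3/4 as Fractions
assert c[-1] <= 1    # the check below: the total must not exceed 1
```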
Check to make sure the probability does not add up to more than 1. If it does, see if it adds up to less than or equal to 1 when the tail probabilities are set to 0. If it then adds up to less than 1, change the tail probabilities to make up what is left over so that the total is 1. Changing a tail probability changes the cumulative probability of the first indexed point when calculating from left to right and of the last indexed point when calculating from right to left. The cumulative probability from left to right plus the cumulative probability from right to left should equal 1.
Make sure there are no contradictions when calculating from left to right as opposed to from right to left
A right tail number and a left tail number can be set and added to the denominator of every cumulative probability for the values found in the sample, but only the left tail number is added to the numerator if the cumulative probability is calculated from left to right
If the cumulative probability for everything adds up to more than 1 then the entire cumulative probability distribution function can be multiplied by a constant to fix this problem so that the probability is 1
Estimating tail frequency as a function of sample size for sampling from a known random uniform distribution that is pretended to be unknown and modeled with the piecewise method
The assumption has been made for one of the models that the frequency of values greater than the highest value measured in a sample plus the frequency of values less than the lowest value measured in the sample equals two divided by ( the sample size plus one )
If four data points are taken as a sample and a model is constructed, the theoretical estimate is that 2 out of every 5 points in a second sample will be higher than the maximum or lower than the minimum of the first sample, if the second sample has a size that is a multiple of 5
If nine data points are taken as a sample and a model is constructed, the theoretical estimate is that 2 out of every 10 points in a second sample will be higher than the maximum or lower than the minimum of the first sample, if the second sample has a size that is a multiple of 10
If 24 data points are taken as a sample and a model is constructed, the theoretical estimate is that 2 out of every 25 points in a second sample will be higher than the maximum or lower than the minimum of the first sample, if the second sample has a size that is a multiple of 25
This can be repeatedly tested with a uniform distribution and an average can be taken of the frequency of how many data points exceed the highest or lowest value in the previous sample with a given sample size
We can see if this prediction is accurate or if a constant times this prediction is accurate
We can then use that constant to decide what the probability of landing on the left tail + the probability of landing on the right tail is for the piecewise distribution
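Here is a rough Monte Carlo sketch of that repeated test, assuming a uniform distribution on (0, 1) and one fresh point per trial instead of a whole second sample (the per-point frequency is what is being estimated either way); tail_frequency is a made-up name.

```python
import random

# Sample n points from a known uniform distribution, then check how often
# a fresh point falls outside the first sample's min-max range.
# The prediction being tested above is a frequency of 2 / (n + 1).

def tail_frequency(n, trials=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        first = [rng.random() for _ in range(n)]
        lo, hi = min(first), max(first)
        new_point = rng.random()
        if new_point < lo or new_point > hi:
            hits += 1
    return hits / trials

for n in (4, 9, 24):
    print(n, tail_frequency(n), 2 / (n + 1))  # observed vs predicted
```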
Tails Definition in this context
Where left numbers are treated as lower than right numbers on a numberline
I am defining the left tail as the region of values to the left of the lowest value from the sample used to decide the piecewise distribution
I am defining the right tail as the region of values to the right of the highest value from the sample used to decide the piecewise distribution
Statistical significance testing theories
Calculate the probability that the mean or median of A would be different than the mean or median of B by at least as much as the values were different in the sample
By calculate I mean give a demonstration on how to calculate that. It would probably be good to make sure no numbers overlap between A and B to avoid issues with the discontinuities when doing the demonstration unless the space is filled in with uniform distribution functions
Explain how to get a lower and an upper estimate for the probability involving the tails, the lower estimate being 0 and the upper estimate arbitrarily being 1 / ( sample size + 1 ), and demonstrate another test where results are beyond the tails
1 Calculate probability that a value with a sample size of 1 taken from distribution B would be as far from the median of B as the median of A is from the median of B
2 Calculate probability that a value with a sample size of 1 taken from distribution A would be as far from the median of A as the median of B is from the median of A
3 Calculate probability that a value with a sample size of 1 taken from distribution B would be as far from the mean of B as the mean of A is from the mean of B
4 Calculate probability that a value with a sample size of 1 taken from distribution A would be as far from the mean of A as the mean of B is from the mean of A
Would the probability be raised to the power of the sample size for a sample size greater than 1? More detail on that question later.
If a statistical significance test is testing the chance that the samples would or would not really come from the same distribution, consider the consequences for statistical significance testing outcomes if you compared the probability of results against a merged distribution. Consider merging the two data sets A and B into one set C and then seeing the probability of the following four things.
1 that the median of A is different than the median of C by at least as much as the measured difference between the two
2 that the mean of A is different than the mean of C by at least as much as the measured difference between the two
3 that the median of B is different than the median of C by at least as much as the measured difference between the two
4 that the mean of B is different than the mean of C by at least as much as the measured difference between the two
Is the probability of difference in that direction put to the power of the sample size?
Is the probability of items 1, 2, 3 and 4 above for the sample size of A or B equal to the probability that it would be so if the sample size were 1, raised to the power of the actual sample size? For example, if the probability of such a difference in that direction is 0.2 when the sample size is 1, and the sample size is 5, would the probability actually be 0.2^5 because it is the probability of that happening 5 consecutive times instead of 1 time, or is the probability with a sample size of 5 the same 0.2 as it would be with a sample size of 1, or something else?
In order for A and C to have a difference in means of 2 or more with a sample size of 2 for A, the sample values of A would not each need to be apart from the mean of C by 2 a total of 2 times; they would only need to be apart by an average of 2 among all the values. This could be achieved by ( 2, 2 ) but also by ( 2 + x, 2 - x ). However, if the probability of being off by 2 + x divided by the probability of being off by 2 is less than the probability of being off by 2 divided by the probability of being off by 2 - x, and the greater the value of x the less likely you are to be off by ( 2 + x, 2 - x ), then treating it as the probability of being off by at least 2 a total of 2 times in a row would overestimate how likely you are to reject the null hypothesis compared to being off by ( 2 + x, 2 - x ). You might use that to argue that the probability of being off by at least 2 a total of 2 times in a row is close enough to achieving the same result, but it is not, because having the option of ( 2, 2 ) or ( 2 + x, 2 - x ) gives more possibilities and therefore a greater chance of failure to reject the null hypothesis
Article Title : Getting Piecewise Probability Distribution functions from sample data
Possible name for distribution I thought of before hearing about the empirical distribution: Percentile based piecewise probability distribution functions (PBpdfs)
Finding out about the empirical distribution after starting to write this article
After writing an earlier draft I found the Wikipedia article on the "empirical distribution function", and it seems similar to what I already wrote about the model variation involving "assuming a distribution with no tails and zero probability outside max and min collected values", but there are some differences, at the very least in that I wrote another model or a model variation to consider the possibility of data beyond the min and max of the sample data
Empirical distribution function wikipedia
https://en.wikipedia.org/wiki/Empirical_distribution_function
Naming
Upon this new information about naming I will call the types of distribution models I am looking at "piecewise distributions" and the type mentioned on wikipedia an "empirical distribution"
All empirical distributions are piecewise distributions but not all piecewise distributions are empirical distributions
If I find that the term piecewise distribution is used to mean something else then I will have to change the name. Or if I find the definition of empirical distribution is broader than I assumed such that it overlaps completely with or includes things that fall under other categories than what I call a piecewise distribution then I will have to change that claim
What do I mean by a piecewise distribution? A piecewise distribution is a distribution whose probability density function is made with piecewise functions, as opposed to a distribution made with only one function that is not a piecewise function.
Here is a Wikipedia article on Piecewise functions
https://en.wikipedia.org/wiki/Piecewise_function
Desire to find out if there is a statistical significance test for the empirical distribution
I would like to find out whether a method of statistical significance testing that presupposes an empirical distribution function has been designed already, now that I know there is a name for a distribution similar to the type I am trying to create a model for, although as I already mentioned there are some differences. I would suggest that there are many things you can do with statistics other than statistical significance testing, which are generally more useful to society, and even if there is no premade statistical significance test for this distribution it is still useful to think about it and similar distributions for other purposes. If there is no premade statistical significance test for such a distribution then I would like to figure out how to make one, if such a thing is possible. I do not think it is good to pre-assume a normal distribution when you can get a better matched distribution through one that is designed to match the probability of each point of the actual raw data.
Possibility of using a confidence interval for the empirical distribution or piecewise distribution to do a statistical significance test
"Confidence Intervals and Statistical Significance
If you want to determine whether your hypothesis test results are statistically significant, you can use either P-values with significance levels or confidence intervals. These two approaches always agree."
https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-confidence-intervals-levels/
What a confidence interval allegedly does not mean
"A 95% confidence level does not mean that 95% of the sample data lie within the confidence interval.
A 95% confidence level does not mean that there is a 95% probability of the parameter estimate from a repeat of the experiment falling within the confidence interval computed from a given experiment."
https://en.wikipedia.org/wiki/Confidence_interval
And another website with an example contradicting the Wikipedia claim about what the confidence interval allegedly does not mean
The "±" means "plus or minus", so 175cm ± 6.2cm means
175cm − 6.2cm = 168.8cm to
175cm + 6.2cm = 181.2cm
And our result says the true mean of ALL men (if we could measure all their heights) is likely to be between 168.8cm and 181.2cm
But it might not be!
The "95%" says that 95% of experiments like we just did will include the true mean, but 5% won't.
So there is a 1-in-20 chance (5%) that our Confidence Interval does NOT include the true mean.
https://www.mathsisfun.com/data/confidence-interval.html
Interpretation of a Confidence Interval
In most general terms, for a 95% CI, we say “we are 95% confident that the true population parameter is between the lower and upper calculated values”.
A 95% CI for a population parameter DOES NOT mean that the interval has a probability of 0.95 that the true value of the parameter falls in the interval.
The CI either contains the parameter or it does not contain it.
The probability is associated with the process that generated the interval. And if we repeat this process many times, 95% of all intervals should in fact contain the true value of the parameter.
https://online.stat.psu.edu/stat504/lesson/confidence-intervals
Once data is collected the probability that the mean lies within the confidence interval is not 1 - alpha or 100% - alpha but either 100% or 0%. The confidence interval is determined by the confidence coefficient, the mean and the standard deviation when using z-scores.
Interpreting the Confidence Coefficient
Interpreting the confidence coefficient requires a nuanced understanding of its implications. A common misconception is that a 95% confidence coefficient means there is a 95% probability that the true parameter lies within the interval. Instead, it reflects the long-term performance of the estimation process. In repeated sampling, 95% of the intervals constructed would contain the true parameter, but for any single interval, the true parameter either lies within it or it does not. This distinction is crucial for accurate statistical interpretation.
https://statisticseasily.com/glossario/what-is-confidence-coefficient-explained-in-detail/
Suspicious claims about confidence intervals for the empirical distribution on Wikipedia that are either wrong or unconventional in their units, and that do not scale correctly if they are wrong about their units and conventional units are used
Calculations are presented involving confidence intervals for the empirical distribution function in these two Wikipedia articles, but the calculations make no sense to me because they are unitless and dimensionless and only depend on the sample size and the alpha value, as far as I understand.
This would mean that if all the values were multiplied by a constant the confidence interval would be the same, and that the confidence interval would not change if the values changed. This seems to me to mean that if all the sample values were multiplied by a unitless real constant with an absolute value less than 1, with the alpha value and the sample size kept the same, then the chance of rejecting the null hypothesis would decrease, and if they were multiplied by a unitless real constant with an absolute value greater than 1 or less than -1, with the alpha value and sample size kept the same, then the chance of rejecting the null hypothesis would increase.
The confidence interval might have been done in a different way than I am normally used to, involving a probability on the graph instead of a number with the same unit as the values measured. Normally, if probability is graphed vertically and the value that is assigned a probability is graphed horizontally, then the confidence interval would be graphed horizontally, but I thought they described it in such a way that it would be graphed vertically. Say you measured the length of logs: you might plot the length of the logs horizontally on the x axis and the frequency of logs of that length vertically on the y axis, and if you wanted to assume a normal distribution and plot that distribution, you might again plot the probability on the y axis and the theoretical length of the logs corresponding to that probability on the x axis. In such a case the confidence interval would normally have units of length, not unitless or dimensionless units of probability, and would be drawn horizontally on the graph, not vertically.
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
https://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality
Introduction
I am not sure if piecewise means something special in statistics. I made up this term for what I want to do based on the concept of constructing piecewise linear functions to fit data in algebra, although the way such a piecewise function is obtained here might be somewhat different in some ways, and similar in others, from working with coordinates in algebra.
Let's say you have gathered a sample of N data points from a population. It is a common practice to assume without any proof that the data is normally distributed and calculate a mean and standard deviation and then use that for statistical significance tests. But I would suggest that it might not be normally distributed and instead you could create a custom or piecewise distribution that might better represent the actual distribution of the frequencies of values or frequencies of ranges of values within a population.
You could simply take each of the values, count how many times that value shows up in the sample, and divide by N, the number of data points collected to get that sample from the population. I would suggest that might not be the best practice, because it assumes that only those values exist within the population: no value greater than the highest value collected, no value lower than the lowest value collected, and no value in between the collected values, since a probability of 0 is assigned to every value that was not collected in the sample. Values assigned a probability of 0 this way might actually exist within the population the sample was collected from.
People might debate several other ways to create such a piecewise distribution, or whether a piecewise distribution would be more accurate than a normal distribution or some other distribution, but here are some ways a probability distribution could be assigned on a piecewise basis. I will assume that the percentile of a value that shows up within sample data is related to the cumulative probability at which it occurs, but the assumed value will not necessarily be the same as the cumulative probability, as I will suggest caveats that modify the guessed cumulative probability to be slightly different from the percentile. First, there are inconsistencies in how a percentile is calculated when a value is exactly equal to a value on the list. Second, a problem with using percentiles is how to rank the percentile of a number that exceeds or is less than all the numbers on the list; would it be realistic to assume such values can never occur? That assumption is problematic unless all samples used to calculate percentiles always contained world record holders. Third, two different numbers between the same two closest neighbor values should have different cumulative probabilities of occurring; I plan to address ways to deal with this by filling in functions, or to somewhat ignore it by estimating the probability of being within a specific range of values without specifying the probability of one value within that range compared with another.
I also plan to show an example of one way you could test which model fits a random sample of data better, a normal distribution or a piecewise distribution, using an example with a presumed uniform random sample of data from random.org, which might of course give different results than other ways you could test this, which I will not specify.
Percentiles and Discontinuities
Let's say you have a list of values and you want to take a single value that might or might not be the same as a value on the list, and you want to say what fraction of the numbers it is equal to, greater than, or less than
Let us take the example of the list 1, 2, 3, 4
All numbers less than 1 are less than 100% of the numbers on that list
1 is equal to 1 out of 4 numbers on that list and less than 3 out of 4 numbers on that list and greater than 0 out of 4 numbers on that list, it is less than 3 out of 3 numbers that are not equal to it and greater than 0 out of 3 numbers that are not equal to it
numbers greater than 1 and less than 2 are equal to 0 out of 4 numbers on that list and less than 3 out of 4 numbers on that list and greater than 1 out of 4 numbers on that list
2 is equal to 1 out of 4 numbers on that list and less than 2 out of 4 numbers on that list and greater than 1 out of 4 numbers on that list, it is less than 2 out of 3 numbers that are not equal to it and greater than 1 out of 3 numbers that are not equal to it
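The counts in the 1, 2, 3, 4 example above can be computed directly; compare_counts is a made-up name.

```python
# Count how many list numbers a value is equal to, less than, and greater than.
def compare_counts(value, data):
    equal = sum(1 for d in data if d == value)
    less = sum(1 for d in data if value < d)      # list numbers the value is below
    greater = sum(1 for d in data if value > d)   # list numbers the value is above
    return equal, less, greater

data = [1, 2, 3, 4]
print(compare_counts(1, data))    # (1, 3, 0)
print(compare_counts(1.5, data))  # (0, 3, 1)
print(compare_counts(2, data))    # (1, 2, 1)
```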
Data values will be ordered from lowest to highest with the assumption of no repeat values for now
i = 1 the index for the lowest value collected in the sample
i = 2 the index for the second lowest value collected in the sample
i = N -1 the index for the second highest value collected in the sample
i = N the highest value collected in the sample
x ( i ) = the value collected in the sample that corresponds to the index of i
X = a random value from a different sample collected from the same population that x values were collected from
f ( X ) = the assumed probability of getting a value of X might not use this text line
F ( X ) = the assumed probability of getting a value of less than or equal to X might not use this text line
There might or might not be a point of discontinuity for the probability at exactly each of the values collected for some of these methods and < and > will be used instead of <= and >= for now and continuous data will be assumed for now
Explanations will be made for how to do so if no duplicate values were collected in the sample for now
Probability refers to the probability value that will be assigned for the piecewise Probability distribution function
Only scalar real numbers will be used for this for now
Assuming a distribution with no tails and zero probability outside max and min collected values
P ( X >= x ( 1 ) ) = 1
P ( X < x ( 1 ) ) = 0
P ( X > x ( N ) ) = 0
P ( X <= x ( N ) ) = 1
P ( x ( i - 1 ) < X < x ( i ) ) = 1 / ( N - 1 ) for i >= 2 and i <= N
P ( x ( i ) < X < x ( i + 1 ) ) = 1 / ( N - 1 ) for i >= 1 and i <= N - 1
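Putting this model together with the earlier idea of filling each gap with a uniform density gives a cumulative function that can be sketched as follows; no_tails_cdf is a made-up name, and the sketch assumes no repeat values, as in the text.

```python
import bisect

# "No tails" model: zero probability outside the sample's min and max, and
# the CDF rises linearly by 1 / (N - 1) across each of the N - 1 gaps.

def no_tails_cdf(x, sample):
    xs = sorted(sample)
    N = len(xs)
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    i = bisect.bisect_right(xs, x)               # x lies in gap (xs[i-1], xs[i])
    frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])  # position within that gap
    return ((i - 1) + frac) / (N - 1)

sample = [1, 2, 3, 4]
print(no_tails_cdf(0.5, sample))   # 0.0: below the minimum
print(no_tails_cdf(2, sample))     # 1/3: one of the three gaps lies below 2
print(no_tails_cdf(2.5, sample))   # 0.5: halfway through the middle gap
```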
Assuming a distribution with two Symmetric Tails of unlimited length
P ( X >= x ( 1 ) ) = 1 - ( 1 / ( N + 1 ) )
P ( X < x ( 1 ) ) = 1 / ( N + 1 )
P ( X > x ( N ) ) = 1 / ( N + 1 )
P ( X <= x ( N ) ) = 1 - ( 1 / ( N + 1 ) )
P ( x ( i - 1 ) < X < x ( i ) ) = 1 / ( N + 1 ) for i >= 2 and i <= N
P ( x ( i ) < X < x ( i + 1 ) ) = 1 / ( N + 1 ) for i >= 1 and i <= N - 1
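A sketch of this model's cumulative function, with uniform fill between sample values so the CDF at x(i) is i / (N + 1). The text leaves the tail shape open (a tail of unlimited length cannot be uniform), so the exponential tails and the scale parameter here are purely my own assumption, chosen only so each tail carries a total probability of 1 / (N + 1); tails_cdf is a made-up name.

```python
import bisect
import math

def tails_cdf(x, sample, scale=1.0):
    xs = sorted(sample)
    N = len(xs)
    t = 1.0 / (N + 1)                          # total probability of each tail
    if x <= xs[0]:                             # left tail, decaying toward 0
        return t * math.exp((x - xs[0]) / scale)
    if x >= xs[-1]:                            # right tail, rising toward 1
        return 1.0 - t * math.exp(-(x - xs[-1]) / scale)
    i = bisect.bisect_right(xs, x)             # x lies in gap (xs[i-1], xs[i])
    frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ((i - 1) + frac + 1) * t            # CDF at x(i) is i / (N + 1)

sample = [1, 2, 3, 4]
print(tails_cdf(1, sample))    # 0.2 = 1 / (N + 1) at the minimum
print(tails_cdf(4, sample))    # 0.8 = N / (N + 1) at the maximum
```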
https://en.wikipedia.org/wiki/Cumulative_distribution_function
Testing piecewise compared with normal distribution
Generate random uniform distribution data between 0 and 100 with N data points, excluding repeat values
Replace excluded data points with new random data points
Create probability piecewise probability distribution function based on that data
Calculate normal distribution curve with mean and standard deviation based on that data
Create N + 1 data point sections that fit into boxes that each have a 1 / ( N + 1 ) probability of X landing somewhere inside, with the two end boxes going on toward positive and negative infinity, having no limit on one end but a limit on the other end
Create such data point section boxes for normal distribution curve and also for the piecewise distribution
The boxes for the normal distribution will look symmetric but not necessarily the ones for the piecewise distribution
Generate random uniform distribution data between 0 and 100 with N + 1 data points, excluding repeat values and excluding values identical to the previously generated data points or values that fall exactly on box boundaries, so that there is no ambiguity about which of the two neighboring boxes they fit in
Replace excluded data points with new random data points
Each box should have exactly 1 data point inside it
A box with 0 data points counts as 1 error
A box with more than 1 data point in it counts as 1 error for every extra data point it has inside it beyond 1
See if the normal distribution or the piecewise distribution results in more errors or if they both result in the same number of errors
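The whole test procedure above can be sketched as follows, assuming Python's statistics.NormalDist for the normal boxes; all names are made up, and a real run would repeat this over many samples and average the error counts rather than doing it once.

```python
import random
import statistics
from statistics import NormalDist

def box_errors(boundaries, points):
    """Errors: 1 per empty box plus 1 per extra point beyond one in a box."""
    counts = [0] * (len(boundaries) + 1)
    for p in points:
        counts[sum(1 for b in boundaries if p > b)] += 1  # which box p falls in
    return sum(abs(c - 1) for c in counts)

rng = random.Random(1)
N = 9
first = sorted(rng.uniform(0, 100) for _ in range(N))    # first sample, sorted

# Piecewise boxes: the N + 1 regions cut by the first sample's values.
# Normal boxes: the N + 1 equal-probability regions of a fitted normal.
norm = NormalDist(statistics.mean(first), statistics.stdev(first))
normal_bounds = [norm.inv_cdf(i / (N + 1)) for i in range(1, N + 1)]

second = [rng.uniform(0, 100) for _ in range(N + 1)]     # second sample
print("piecewise errors:", box_errors(first, second))
print("normal errors:", box_errors(normal_bounds, second))
```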
