Copyright Carl Janssen 2024 September 16
Why I am against statistical significance tests
Presenting an informed public with raw data, versus using statistical significance without raw data to dupe the public; or, reasons why I am against statistical significance tests as an overglorified and often more harmful than beneficial standard in academia.
This article is an incomplete description of my reasons, as there are more. For example, some reasons involving questioning the very idea that there are random events with a probability are not addressed.
Although some of the ideas in probability-based statistics might have advanced the cause of science more than they harmed it, I suspect that the idea of statistical significance tests has done more to harm the cause of science than to advance it. At the very least, using it as a common standard in scientific journal articles, and as an assumed default research method whenever anything involving measured values is done in the biological and social sciences, has done more harm than good.
A common practice in peer-reviewed scientific journal articles is to assume data is normally distributed, often without testing whether it actually is, and then to arbitrarily assign an alpha value for a statistical significance test. That test is often a t-test, even though many other statistical significance tests are available that might have made more sense if the data was not normally distributed, or was not quantitative or interval data but was treated as if it were. The public is then told in media publications whether or not the results are statistically significant based on that t-test, if this is politically expedient; if it is not politically expedient, the results might be less likely to be mentioned. When the data is published, the raw data is often excluded, and only the mean, sample size, and standard deviation are typically presented as a summary. This prevents alternative ways of analyzing the data mathematically or statistically, unnecessarily hindering the advancement of science by suppressing access to data that funding and research already went into. I would suggest that, in return for this data suppression, the most appropriate response of the public would be to call into question the legitimacy of all conclusions involving statistical significance tests for which the raw data, except that which is necessary to protect subject or patient confidentiality, is not published.
I believe that if probability-based statistics is to be used for the public good, it would be more useful to simply collect sample data from a population and then try to use it to make predictions about the frequency of different outputs, or ranges of outputs, in different conditions to meet one's goals. These goals might differ in different circumstances, so I believe that taking raw data, assigning it a presumed distribution such as a normal distribution with a given mean and standard deviation, publishing that in a peer-reviewed journal, but then hiding the raw data from the public is a disservice to science compared with simply giving the public the raw data and letting them decide what to do with it based on their goals and on what type of distribution they think it has. Of course, one might object that publishing the raw data violates patient or subject confidentiality in the biological and social sciences. Well, one should then ask whether publishing the mean and standard deviation violates confidentiality. To some degree it could, but not much if you remove the subjects' names. But I would suggest you can simply publish the raw data from which the mean and standard deviation were calculated and likewise remove the subjects' names and any other personal information through which they might be identified.
Although I do not like the idea of doing statistical significance tests at all, if they must be done then I would suggest publishing only the one- and two-tailed P-values, which would let the public know what alpha values would have achieved one- or two-tailed significance, instead of arbitrarily assigning an alpha value and then telling the public whether or not the study was statistically significant based on that arbitrarily preassigned choice. The problem with assigning an alpha value and then telling the public that something is or is not statistically significant is that it is extremely misleading when it comes to applications: something might be said to have made no difference that would have made a difference if a different alpha value had been assigned, or said to have made a difference when it would have made none under a different alpha value. It also leads to a problem in that journals are more likely to publish research if statistical significance has been achieved, so sometimes researchers will repeatedly do the same experiment and only publish it when statistical significance has been achieved for an arbitrarily assigned alpha value that fits the journal's standards for what is acceptable. This results in extremely biased research that makes it look like statistical significance for a certain alpha value is achieved for a certain type of experiment more often than it would be if the times that statistical significance was not achieved were also published. The journal's goals for what alpha values should be used do not necessarily match the public's goals, which vary depending on what an individual within the public wants to achieve under what circumstances.
The alpha values in statistical significance tests are arbitrary. If the research is done "blind", then whatever value the researcher assigns should not affect what data is collected; someone could just as well have assigned another value, and the data would have changed from significant to not significant according to that alpha value. I would suggest that if the authors of a scientific journal article want to treat the data like a normal distribution for some part of their analysis, that is fine, but they should still publish the raw data and let the public decide whether or not it is normally distributed. As for the alpha value, I believe the public would be better off if no alpha values were used at all to decide whether the data is statistically significant. Instead, I would suggest giving a two-tailed P-value and an additional one-tailed P-value in whichever direction the one-tailed t-test would succeed, or perhaps two one-tailed P-values, one in each direction. The public would then know what alpha values would have resulted in statistical significance for a two-tailed t-test, or for a one-tailed t-test in either direction. Depending on the one- or two-tailed alpha values they desire as the criteria for their specific goals, they could then decide what to do based on those one- and two-tailed P-values.
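If P-values are to be reported at all, they can be computed and published directly, with no alpha value attached. The sketch below, with made-up data, estimates one- and two-tailed P-values for a difference in group means; it uses a permutation test rather than a t-test, which has the side benefit of not assuming a normal distribution.

```python
import random
from statistics import mean

def permutation_p_values(a, b, n_perm=10_000, seed=0):
    """Estimate one- and two-tailed P-values for the observed difference
    in means between groups a and b by repeatedly relabelling the pooled
    raw data at random. No normal distribution is assumed."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    ge = le = two = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        ge += diff >= observed             # tail: A greater than B
        le += diff <= observed             # tail: A less than B
        two += abs(diff) >= abs(observed)  # both tails combined
    return {"one_tail_greater": ge / n_perm,
            "one_tail_less": le / n_perm,
            "two_tail": two / n_perm}

# Made-up raw data for two experimental groups.
group_a = [5.1, 4.8, 5.5, 5.0, 4.9]
group_b = [5.9, 6.1, 5.7, 6.3, 5.8]
print(permutation_p_values(group_a, group_b))
```

A reader can compare the reported P-values against whatever alpha value suits their own goals, rather than being handed a significant/not-significant verdict.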
How would the public decide what alpha value to use based on their goals? The closer the alpha value is to zero, the less likely they are to accidentally reject the null hypothesis when they "should have" failed to reject it. The farther the alpha value is from zero, the less likely they are to accidentally fail to reject the null hypothesis when they "should have" rejected it. There is no perfect alpha value that is identical for all individuals with all goals in all situations, since changing the alpha value does not reduce the overall chance of making an error; it only reduces the chance of making one type of error in exchange for increasing the chance of making the other type. In one situation, for one individual with a certain goal, avoiding one type of error might be more important than avoiding the other, and they should choose the alpha value according to that specific goal in that specific situation, if alpha values are actually ever applied to anything in real life.
But are alpha values ever actually applied to any honest goal in real life? I would suggest no. I would suggest that none of the uses of alpha values encourage an application to a real-life situation, to meeting a goal, other than persuading people. And the goals with alpha values in real-life situations that do involve persuading people are never the type of persuasion done in an ethical manner free of undue influence.
What are some of these reasons?
1 To persuade a journal to publish something, not to advance the cause of science but to accumulate more publications for a career goal. I am not saying that career goals are bad, as career goals can be good or bad based on the motive and the results, but I would suggest that both the motives and the results of this career goal are bad because they mislead the public in exchange for money.
2 To trick someone into doing something based on something being, or not being, significant according to a journal.
3 To simply make it through an assignment that you have been unduly influenced into doing, so that you can prove you know how to do statistical significance tests without thinking about the actual science of things.
4 To boost your ego in a bad way and feel like you have objectively proven something you predicted in advance, which is not so clearly and unambiguously proven at all, because if it were clear and unambiguous you would not need a statistical significance test to prove it in the first place; a model would exist in which a specific output can be predicted for a specific input using some combination of algebra, trigonometry, and calculus equations, with no statistical probability theory invoked at all. Someone might, for instance, insist they need to assign a one- or two-tailed test and an alpha value before doing the experiment to "eliminate bias" so they can say they called it correctly in advance and boost their ego in a bad way. I would suggest that if they were not invested in their ego in a bad way, they would be comfortable with not needing to say they "called it" or "predicted it" correctly, but would simply publish the data with the P-values, but no alpha value, as I already described, and let the public come to their own conclusions. However, I would suggest that even publishing the P-values for a certain type of statistical significance test is not necessary, because statistical significance tests are not really used for real, honest applications; if scientists really let go of their egos in a good way, they would simply publish the raw data of the experiment and let the public do whatever they want with that raw data.
Ok, but if you are not going to do statistical significance tests, yet you claim that using probability-based statistics with raw data is good for something, then what would the public do? Let's say there is an experiment with data for group B, which has experimental variation B done to it, and data for group A, which has experimental variation A done to it. The public should simply choose the real-life application closer to the process in either A or B by counting the number of data values closer to the results they want, dividing by the total number of data values, and then choosing A or B for the application based on that. This could mean, for example, choosing which one generates a higher percentage within a data range, or has a higher or lower mean, median, mode, or some other function result. Normally the public wants to get a certain type of value as the result when doing a certain type of action, and they should simply choose whether A or B would get those results more frequently. This type of application can often be done with the raw data without knowing whether or not the data has a normal distribution.
For example, a percentile chart can be made without ever figuring out what type of distribution something is. Someone can simply list the raw data and then look at how many data points are above and how many are below a given data point to estimate its percentile, without knowing what type of distribution the data follows. But instead, some people have made this needlessly complicated, and I would propose that they have not increased the accuracy of the estimate in doing so but have decreased it in most cases. First they calculate the mean and the standard deviation, then they hide the raw data. Next they tell someone to estimate their percentile based on how many standard deviations they are away from the mean, which may give a different result than estimating the percentile by counting how many points are above and how many are below that point.
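The counting approach just described takes only a few lines. This is a minimal sketch with made-up height data; the percentile is simply the share of observations below the value in question, with no distribution assumed.

```python
def empirical_percentile(data, value):
    """Estimate the percentile of `value` by counting how many
    observations fall below it -- no distribution assumed."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Made-up heights in centimetres.
heights = [160, 162, 165, 168, 170, 172, 175, 178, 180, 183]
print(empirical_percentile(heights, 172))  # 50.0
```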
For example, let's say you are a shoe salesman and you want to stock shoes of each size based on the frequency with which the public has those shoe sizes. You could simply count what percent of people in a sample have each shoe size based on the raw data. Why would you waste your time calculating the mean and then the standard deviation, hiding the raw data from yourself, and using the mean and standard deviation to estimate frequencies for each shoe size, when that might be less accurate than just using the raw data to count frequencies?
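Counting shoe-size frequencies from raw data is a one-liner with a counter. The sizes below are made up for illustration.

```python
from collections import Counter

# Made-up sample of customers' shoe sizes.
sizes = [9, 10, 9, 8, 11, 10, 9, 10, 12, 9, 8, 10]

counts = Counter(sizes)
total = len(sizes)
for size in sorted(counts):
    print(size, f"{100 * counts[size] / total:.1f}%")
```

No mean, standard deviation, or presumed distribution is needed; the raw counts are the estimate.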
Let's say you are a cardiologist and you have to decide which medicine at what dose to prescribe for a patient, where each medicine and dose increases or decreases blood pressure by a certain amount, whether as a percentage or as an absolute quantity. Let's say you want to increase or decrease the blood pressure by no more than one amount but no less than another. You could simply look at a list of values for Group A with medicine A at dose A and Group B with medicine B at dose B. You could then count which group has a higher percent of listed values that meet the criterion of not modifying the blood pressure too much or too little in whichever direction you want, and pick that medicine. This actual application does not require figuring out whether the data is normally distributed or getting a mean and a standard deviation, and it certainly does not require removing the raw data so that you cannot see it.
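A minimal sketch of the comparison just described, with made-up blood-pressure changes in mmHg (negative means a decrease). The acceptance window and the data are assumptions for illustration only.

```python
def fraction_in_range(changes, low, high):
    """Fraction of observed blood-pressure changes that fall inside
    the acceptable window [low, high]."""
    hits = sum(1 for c in changes if low <= c <= high)
    return hits / len(changes)

# Made-up raw changes in blood pressure (mmHg) for two groups.
group_a = [-12, -8, -15, -22, -9, -11, -4, -14]
group_b = [-18, -25, -6, -13, -30, -10, -2, -16]

# Goal: a drop of at least 5 but no more than 20 mmHg.
low, high = -20, -5
frac_a = fraction_in_range(group_a, low, high)
frac_b = fraction_in_range(group_b, low, high)
print(frac_a, frac_b)                           # 0.75 0.625
print("choose", "A" if frac_a >= frac_b else "B")  # choose A
```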
Now, someone might object that just using the data is problematic because maybe you do not have enough data points, and you need to find out whether you have enough to be certain enough of whatever. I would simply say I am not objecting to getting more data points. But you have the data points that have been collected, and you still have to make a decision. Sometimes you have to make a decision with the limited data you have and do not have the time or other resources to collect more. And no matter how much or how little data you have, using this method is going to be better than burying your head in the sand because you do not have enough data to get a statistically significant result that you feel is powerful enough for an artificially assigned alpha.
Moreover, I would suggest that with this method you can get more data. If all the raw data were published for past experiments, then someone could simply merge the data from replicated past experiments into one list of data points for this method, instead of using statistical significance test methods to discard replications of studies that were not statistically significant, as has historically been the common practice of many scientific journal articles. Merging the raw data of replicated experiments would result in having enough data points that a shortage of data would not be as much of a problem. On the other hand, hiding the raw data while also hiding experiment replications that were not statistically significant increases the problem of not having enough data points to be certain enough.
How could data be merged? You would not change the data from old experiments. But let's say there is experiment 1, experiment 2, and so on, and each experiment has data for group A with treatment A and data for group B with treatment B. So group 1A would be data from experiment 1 with treatment A, and group 3B would be data from experiment replication number 3 with treatment B. You could combine all the listed data from group A across all the replications into a single list, and combine all the data from group B across all the replications into another single list, collecting replications of the same experimental treatment from multiple scientific journal articles. But this can only be done if raw data is published; it cannot be properly and correctly done if the raw data is removed and only means, standard deviations, and statistical significance based on certain alpha values are given.
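Merging published raw data by treatment letter might look like the following sketch; the replication labels follow the text (replication number plus treatment letter), and the values are made up.

```python
# Made-up raw data from three published replications of the same
# experiment; "1A" means replication 1, treatment A, and so on.
replications = {
    "1A": [4.2, 4.5, 3.9], "1B": [5.1, 5.4, 5.0],
    "2A": [4.0, 4.3],      "2B": [5.2, 4.9],
    "3A": [4.4, 4.1, 4.6], "3B": [5.3, 5.5],
}

# Combine every replication's data into one list per treatment.
merged = {"A": [], "B": []}
for label, values in replications.items():
    treatment = label[-1]  # last character is the treatment letter
    merged[treatment].extend(values)

print(len(merged["A"]), len(merged["B"]))  # 8 7
```

The original per-replication data is left untouched; the merged lists simply pool it so that later analyses have more points to work with.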
The pressure to fabricate data: if students are given a homework assignment, sometimes they are told to do a statistical significance test. They might know the prediction the teacher expects and change the data on their homework so that the statistical significance test produces the outcome they think the teacher wants, if they erroneously believe the teacher will give them a better grade when the results match the teacher's prediction. At least, I hope such a belief would be erroneous and the teacher would not subtract points if the results did not match their prediction.
The problem of the ability to fabricate data: statistical significance tests are often used where a degree of randomness is assumed. If randomness is assumed, then the results are expected often not to replicate the same way. If results are often expected not to replicate the same way, then someone could simply not do an experiment at all and make up data, and since replication is not expected because the data is "random", no one would be able to argue that the person did not just make up the data, based on this theory of randomness. I would suggest that we seriously contemplate the possibility, when we read scientific journal articles, that people have simply fabricated data without running an experiment at all, and that this might explain part of the reason why people who try to rerun an experiment cannot get data similar enough to the data in the journal article to consider the results replicable.
So you can use raw data to figure out the frequency with which results fall in value ranges that meet or fail to meet your goals, and in my opinion that is a better application of research time and money or material resources to help the public than doing statistical significance tests. But although I think that is an improvement, I still do not think it is the best use of resources in science.
I would suggest that this statistical way of looking at frequencies that achieve goals based on lists would be better replaced by putting more emphasis on equations involving algebra, trigonometry, and calculus that predict an output for a given input. These equations could assign a margin of error to each input and a range of potential outputs corresponding to that margin of error. But do we need statistical probability models for margin of error? No! If we have a ruler and we have to round to the centimeter, then we can assign the maximum and minimum values that the actual distance could reasonably be after rounding, based on the location of the physical markings on the ruler, and no statistical probability distribution models are needed for that. We would take an equation that makes predictions based on the inputs, set the input values to the extremes that could potentially be there considering the margin of error, and get a range of predicted output values. If the measured result is inside the range of predicted output values, then the equation is considered a correct prediction, and if it is outside the range, it is considered a wrong prediction. If we find the equation predicts incorrectly, then we make a new equation that would have predicted the results correctly, rerun the experiment, and see if that equation now predicts correct results.
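A minimal sketch of this interval-based checking, using a made-up example: the area of a square whose side is read off a ruler to the nearest centimetre, so the side carries a plus-or-minus 0.5 cm margin of error.

```python
def predicted_range(f, x_measured, x_error):
    """Propagate an input's margin of error through a model f by
    evaluating it at the extremes of the input interval. (For a
    non-monotonic f a finer sweep of the interval would be needed;
    this sketch just checks the endpoints.)"""
    lo, hi = f(x_measured - x_error), f(x_measured + x_error)
    return min(lo, hi), max(lo, hi)

def area(side):
    """Model: area of a square as a function of its side length."""
    return side * side

# Side read as 12 cm on a ruler marked in centimetres: +/- 0.5 cm.
low, high = predicted_range(area, 12.0, 0.5)
print(low, high)  # 132.25 156.25

measured_area = 140.0  # made-up measured value to check
print(low <= measured_area <= high)  # True: the model predicted correctly
```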
Statistical significance testing often, although not always, lacks the ability to predict an output for a given input. I say often but not always because there is an exception, linear regression, which does allow one to predict an output for a given input. Statistics is often only used to predict whether two outputs will be different from each other or the same, or whether one will be greater than the other, but it is not usually used to predict by how much the two outputs will differ. You might get a mean and a standard deviation, but you usually cannot get an equation to guess what the mean and standard deviation will be, as a function of the input, the next time you run the same experiment.
I would suggest the world would be a better place if people focused their research on finding algebra-, trigonometry-, and calculus-based equations that accurately predict outputs from inputs within the expected margin of error of the inputs, rather than on experiments so poorly designed that the inability to make accurate predictions of output values is blamed on some random variability that limits you to only guessing which group is greater or less than the other, but not by how much. Although I really do not like statistical significance testing, and would consider the so-called necessity of statistical significance testing to be a sign that your experiment was poorly designed, I would suggest that there is a place for statistics in science. Statistics can be a starting place where you have to admit that you really do not have a clue what you are doing and your body of knowledge has not yet achieved the level of a competent scientific model; one might call this pre-science, proto-science, or primitive science. Maybe you can use it for a little while if you admit that you do not yet know what you are doing. But eventually you should move on and make progress with your models to the point where statistical significance testing is no longer needed and you have a grown-up level of competence in that scientific area of study, where you can make predictions using algebra, trigonometry, and calculus. This means that as science advances in a field of study, more and more algebra, trigonometry, and calculus should appear in peer-reviewed journal articles, and less and less statistical significance testing.
Unfortunately, the trend seems to me to be in the opposite direction in the biological and social sciences, which would suggest that we are not making progress forward but going backward. The fact that statistical significance tests were not immediately mocked and abandoned by the community of people who call themselves scientists, but were instead embraced and pushed on graduate students in most fields of biological and social science, suggests that in many ways the biological and social sciences are going backwards rather than forwards, in spite of increased material resources as more electronic tools to store and measure data points are manufactured, tools which would have given further resources to move these so-called sciences forward if another type of methodology were used.
Lastly, I would suggest that the so-called social sciences might be better off if people went back to roots that were less quantitative, and if the so-called social sciences were not called sciences at all but were thought of more like philosophies and religious worldviews about human behavior that might or might not be true. A person could present an idea about human behavior and the mind, and the audience could simply think and contemplate whether it might be true, instead of the presenter trying to prove the claim by projecting an illusion of scientific objectivity through the so-called scientific process of statistical significance testing. The types of claims in the so-called social sciences are grand claims that cannot be supported by science but are necessary to think about before conducting science in the first place, much like different religious or philosophical worldviews about morality, free will, the nature of the human mind, and so on. Before I think about whether choosing to do A or B results in a given output, I must presuppose my ability to choose how I run my experiment; this is a philosophical prerequisite for science, not science itself. The social "sciences" have cut themselves short by pretending to be science through the false objectivity of statistical significance testing, instead of embracing their grand place as part of philosophy and religion.
