Benjamin Disraeli, Britain’s prime minister in the second half of the
19th century, is credited with saying, “There are three kinds of lies:
lies, damned lies, and statistics.” The discipline was already deeply
mistrusted at the time. Yet its role keeps growing in every field of
activity.
Today, statistics are used for everything. They obsess our politicians
every day (opinion polls) and guide and justify their actions (road
safety, employment…); they allow sociologists to analyze human groups
and decode their behavior; they play a fundamental role in the sciences,
where they allow practitioners to validate hypotheses and draw sound
conclusions (medicine, physics…); they also allow companies to refine
their offers, steer their strategies and target their markets better
through market research.
Despite this central role, statistics are criticized daily and provoke
debate. Every official statistic published triggers endless arguments
and controversy (cf. recent statistics on working hours in France or on
crime). Political polls are contested and mocked every day, especially
by the people they concern and especially when the results are not in
their favor. The use of statistics in the human sciences is also very
controversial. Even statistics in the hard sciences are now a subject
of debate.
How can this mistrust be explained? Can we still trust survey
statistics to guide choices and decisions in companies?
I describe, you deduce…
Statistics (from the Latin “status”, meaning condition) is divided into two distinct but complementary branches, descriptive statistics and inferential statistics:
- Descriptive statistics methods aim to give the most accurate picture
of the population being described, from a large number of observed
elements. Common indicators such as the mean, the standard deviation
and the variance belong to descriptive statistics. But this branch also
includes more sophisticated methods such as factor analysis.
- Inferential statistics is more concerned with assessing the validity
of hypotheses, detecting possible links between variables and drawing
general extrapolations from the observations analyzed. This branch
includes hypothesis tests, analysis of variance, regression…
Descriptive statistics summarize the characteristics of the populations
studied, whereas inferential statistics aim to uncover hidden
characteristics of these populations and the rules that can be derived
from them.
All these methods rest on solid mathematical foundations. However…
Bikini statistics
George Gallup, the famous American statistician regarded as the father
of opinion polling, claimed, “I could prove God statistically.” Aaron
Levenstein is credited with another well-known line: “Statistics are
like a bikini. What they reveal is interesting. But what they hide is
vital.” Statistics have indeed always had the reputation of being
malleable and able to say whatever we want them to say. Manipulation is
clearly easy in this field, if only by omission. For instance, we can
state that the average salary in a 200-person company is €3,200 even
though 80% of the staff earn only €1,500 (the other 20% earning about
€10,000). We can highlight a strong increase in the value of a product’s
sales while the company’s market share is actually falling in a
fast-growing market.
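The salary example can be checked with a few lines of Python. This is a minimal sketch that takes the figures from the text at face value and assumes every high earner is paid exactly €10,000; the variable names are illustrative only.

```python
from statistics import mean, median

# Salaries taken from the example in the text: 200 employees,
# 80% earning EUR 1,500 and 20% earning about EUR 10,000 (assumed exact).
salaries = [1500] * 160 + [10000] * 40

average = mean(salaries)    # pulled upward by the 40 high earners
typical = median(salaries)  # what the employee in the middle earns

print(f"mean:   {average:.0f}")
print(f"median: {typical:.0f}")
```

The mean is arithmetically correct, yet it describes nobody in the company. Reporting the median alongside it is one simple way to avoid misleading by omission.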
Official statistics are the most widely suspected. This is the case in
pre-election periods, but it is also true more generally: employment
figures, crime and reoffending figures, and poverty or price statistics
are regularly disputed.
But suspicion extends well beyond official figures and reaches the
social sciences. Some sociologists deny the legitimacy of using
statistics when dealing with human groups. They consider that the
classifications and categorizations imposed by a statistical approach
to phenomena introduce subjectivity and hinder the understanding of
reality. They follow in the wake of the American ethnomethodologist
Aaron Cicourel, who rejected delinquency statistics in the United
States as early as the 1960s, arguing that they reflected the activity
of police services rather than actual criminal activity.
According to Alain Desrosières (a French specialist in the history of
statistics and a member of the corps of INSEE administrators), the
statistical apparatus develops along the lines of the institutional
system. “This investment, similar to the investment in a road or rail
network, produces categories that become unavoidable.” Consequently,
researchers’ room for maneuver and their ability to convey social
realities tend to be limited. Just as a literary work is not merely a
string of words and an image is not merely a succession of colored dots,
social phenomena cannot be divided up indefinitely in order to be better
grasped. Critics of statistics blame its simplifying tendencies, which
in their view harm our ability to grasp and understand our environment
as a whole.
Paradox - Brainwashing
Beyond partial or partisan approaches, the statistical method holds a
number of traps into which even experienced users can fall.
The British statistician Edward Simpson described one of them in 1951.
According to his famous “Simpson’s paradox”, a result observed in
several separate groups can be reversed when these groups are combined.
Here is an example: a company hires 60 men and only 16 women in one
year. Is it a sexist company engaging in discrimination, given that 79%
of the hires were men and only 21% were women?
Let us dig deeper. The company received 244 applications from men and 84
from women. 25% of the men were hired, against 19% of the women. We can
state that women were statistically about 20% less likely to be hired,
which may seem abnormal.
Let us dig deeper still. The company actually recruited in two sessions.
- In the first session, 190 men applied and 56 were hired (29%); 40
women applied and 12 were hired (30%).
- In the second session, 54 men and 44 women applied. The company hired
4 men (7%) and 4 women (9%).
In each session the company hired a higher percentage of women than of
men. Yet the overall statistic shows the opposite result.
Surprised, or not sure you followed? Just do the calculation and you
will see the trap into which many scientists, sociologists and survey
managers can easily fall.
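For readers who would rather let a machine do the calculation, here is a minimal Python sketch. It recomputes every rate directly from the applicant and hiring counts given in the text; the names `sessions`, `rate` and `pooled` are illustrative only.

```python
# Figures from the hiring example in the text: (applicants, hired).
sessions = {
    "session 1": {"men": (190, 56), "women": (40, 12)},
    "session 2": {"men": (54, 4), "women": (44, 4)},
}

def rate(applicants, hired):
    """Share of applicants who were hired."""
    return hired / applicants

# Per-session rates: women do slightly better in each session.
for name, groups in sessions.items():
    men = rate(*groups["men"])
    women = rate(*groups["women"])
    print(f"{name}: men {men:.1%}, women {women:.1%}")

def pooled(sex):
    """Hiring rate for one sex once both sessions are combined."""
    applicants = sum(groups[sex][0] for groups in sessions.values())
    hired = sum(groups[sex][1] for groups in sessions.values())
    return rate(applicants, hired)

# Pooled rates: the ordering flips when the sessions are merged.
print(f"overall: men {pooled('men'):.1%}, women {pooled('women'):.1%}")
```

The reversal comes from the mix of applicants: most men applied in the first session, where the hiring rate was high for everyone, while more than half of the women applied in the second session, where almost nobody was hired.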
Premonition with sobriety
Logical reasoning can sometimes be misleading and lead to wrong deductions. This can be illustrated with the famous taxi example used by Daniel Kahneman (Israeli-American psychologist and winner of the 2002 Nobel Memorial Prize in Economic Sciences) and his colleague Amos Tversky (an expert in mathematical psychology). Kahneman and Tversky imagine a city where 85% of cabs are red and 15% are blue. A taxi knocks down a pedestrian and drives off. According to a witness who saw the accident, the cab was blue.
Before searching every blue cab in the city, we run an experiment under
similar conditions. It shows that 20% of witnesses (observing the same
situation) get the color wrong. We might hastily conclude that the
witness has an 80% chance of being right. But a closer look at the
situation, using the famous Bayes’ theorem, shows that this figure must
be almost halved: the cab involved has only a 41% probability of being
blue, and therefore a 59% probability of being red.
Here is the calculation: a priori, the probability that the cab is blue
is 15%. Taking into account the reliability measured in the experiment,
the probability that the witness correctly identifies a blue cab as
blue is 80%. Conversely, the probability that the witness reports a red
cab as blue is 20%.
A posteriori, the probability that the cab really is blue, as the witness
stated, is 41%, according to the following formula:
P(blue | witness says blue) = (0.15 × 0.80) / (0.15 × 0.80 + 0.85 × 0.20) = 0.12 / 0.29 ≈ 0.41
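The Bayes calculation can also be checked numerically. The sketch below is only an illustration: the function name `posterior_blue` is hypothetical, and it assumes the witness is equally reliable for both colors (80% right, 20% wrong), as the experiment in the text suggests.

```python
def posterior_blue(prior_blue, witness_accuracy):
    """P(cab is blue | witness says blue), by Bayes' theorem."""
    prior_red = 1 - prior_blue
    # Witness says "blue" and is right (the cab really is blue).
    true_blue = prior_blue * witness_accuracy
    # Witness says "blue" but is wrong (the cab is actually red).
    false_blue = prior_red * (1 - witness_accuracy)
    return true_blue / (true_blue + false_blue)

p_blue = posterior_blue(prior_blue=0.15, witness_accuracy=0.80)
print(f"P(blue | witness says blue) = {p_blue:.2f}")
print(f"P(red  | witness says blue) = {1 - p_blue:.2f}")
```

The intuition is that red cabs are so much more common that the 20% of red cabs mistaken for blue outnumber the 80% of blue cabs correctly identified.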
Correlation is not causation
Errors in statistical reasoning or conclusions are harmful in every sector. But it is in science and medicine that they can have the most serious consequences. Yet according to a survey published in the United States, more than 50% of scientific publications involving statistics contain errors of reasoning or interpretation. One of the most common mistakes is to draw unwarranted cause-and-effect conclusions about elements between which a correlation has been found. Some seem to believe that two correlated elements are necessarily bound by a strong relationship and mutual influence. This is not true: for instance, real estate prices in Paris have risen steadily over the last few years. So has the age of any sample of people (Benjamin Button excepted). Yet it would be risky to conclude that one of these two phenomena influences the other (we can hope for a small drop in prices, but unfortunately not for rejuvenation!).
The hidden factor
Correlation calculations hold other major traps. Two factors can be genuinely correlated because they stem from a common cause, without either depending on the other. The American psychotherapist and sociologist Paul Watzlawick gives a striking example in his book “Pragmatics of Human Communication”: in the early 1950s, a correlation was found between beer consumption on the west coast of the United States and infant mortality in Japan. In fact, both were due to a common cause: a major heat wave over the Pacific, which caused serious health problems in Japan and increased consumption of cold drinks in the United States.
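The effect of a hidden common cause can be reproduced with a small simulation. The sketch below uses purely synthetic data, not real measurements: the variable names (`heat`, `beer`, `mortality`), the noise levels and the sample size are all assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Synthetic data, not real measurements: "heat" is the hidden common cause.
heat = [rng.gauss(0, 1) for _ in range(n)]
# Each variable depends on heat plus its own independent noise;
# neither one appears in the other's formula.
beer = [h + rng.gauss(0, 0.5) for h in heat]
mortality = [h + rng.gauss(0, 0.5) for h in heat]

raw = pearson(beer, mortality)

# Remove the contribution of heat from both variables, then correlate again.
beer_rest = [b - h for b, h in zip(beer, heat)]
mortality_rest = [m - h for m, h in zip(mortality, heat)]
controlled = pearson(beer_rest, mortality_rest)

print(f"correlation(beer, mortality):           {raw:.2f}")
print(f"correlation once heat is accounted for: {controlled:.2f}")
```

The two series are strongly correlated even though neither influences the other, and the correlation all but vanishes once the common cause is removed. In real studies the hidden factor is rarely known in advance, which is what makes this trap so easy to fall into.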
Many scientific studies fall into this trap. Correlated factors can be found in many sectors, yet these correlations exist only through their common cause. Some manufacturers and communicators use such correlations to promote conclusions that benefit their products. This is the case in the food industry, which regularly presents us with new truths about the supposed health benefits of its products: longevity, protection against cancer or cardiovascular disease. Such errors fuel our doubts about the contradictory information we receive from the scientific community.
Statistics or not?
What should we conclude?
The famous precautionary principle would lead us to reject statistics
because of the risk of error. That would call into question the founding
principles of our sciences.
In fact, statistics, like any other technique, must be handled with
care and rigor. Those who produce statistics (scientists, researchers or
market research managers) must control the risks of error described in
this text in order to produce rigorous reasoning and conclusions that
respect the rules of the discipline and common sense.
End users of the published results (politicians, journalists, marketing
professionals and other economic decision-makers) must use the data with
caution, keeping in mind that zero risk does not exist.