Statistics

Simpson’s Paradox And Gender Discrimination

One sunny day we arrive at work in the university administration to find a lot of aggressive emails in our inbox. Just the day before, a popular local newspaper published a news story about gender discrimination in academia that included data from our university. The emails are a result of that. Female readers are outraged that men were accepted at the university at a higher rate, while male readers are angry that women were favored in each course the university offers. Somewhat puzzled, we take a look at the data to see what’s going on and who’s wrong.

The university only offers two courses: physics and sociology. In total, 1000 men and 1000 women applied. Here’s the breakdown:

Physics:

800 men applied ‒ 480 accepted (60 %)
100 women applied ‒ 80 accepted (80 %)

Sociology:

200 men applied ‒ 40 accepted (20 %)
900 women applied ‒ 360 accepted (40 %)

Seems like the male readers are right. In each course women were favored. But why the outrage by female readers? Maybe they focused more on the following piece of data. Let’s count how many men and women were accepted overall.

Overall:

1000 men applied ‒ 520 accepted (52 %)
1000 women applied ‒ 440 accepted (44 %)

Wait, what? How did that happen? Suddenly the situation seems reversed. What looked like a clear case of discrimination against male students turned into a case of discrimination against female students by simple addition. How can that be explained?

The paradoxical situation is caused by the different capacities of the two departments as well as the students’ overall preferences. While the physics department, the top choice of male students, could accept 560 students, the smaller sociology department, the top choice of female students, could only take on 400 students. So a higher acceptance rate of male students is to be expected even if women are slightly favored in each course.
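
For anyone who wants to check the arithmetic, here is a minimal Python sketch (my addition, not part of the original snack) that reproduces both the per-course and the aggregated acceptance rates:

```python
# Acceptance numbers from the example above.
applied = {
    "physics":   {"men": 800, "women": 100},
    "sociology": {"men": 200, "women": 900},
}
accepted = {
    "physics":   {"men": 480, "women": 80},
    "sociology": {"men": 40,  "women": 360},
}

# Per-course rates: women come out ahead in both courses.
for course in applied:
    for group in ("men", "women"):
        rate = accepted[course][group] / applied[course][group]
        print(f"{course:9} {group:5}: {rate:.0%}")

# Aggregated rates: the ordering flips (Simpson's paradox).
for group in ("men", "women"):
    total_applied = sum(applied[c][group] for c in applied)
    total_accepted = sum(accepted[c][group] for c in accepted)
    print(f"overall   {group:5}: {total_accepted / total_applied:.0%}")
```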

While this might seem to you like an overly artificial example to demonstrate an obscure statistical phenomenon, I’m sure the University of California (Berkeley) would beg to differ. It was sued in 1973 for bias against women on the basis of these admission rates:

8442 men applied ‒ 3715 accepted (44 %)
4321 women applied ‒ 1512 accepted (35 %)

A further analysis of the data however showed that women were favored in almost all departments ‒ Simpson’s paradox at work. The paradox also appeared (and keeps on appearing) in clinical trials. A certain treatment might be favored in individual groups, but still prove to be inferior in the aggregate data.

Analysis of Viewers for TV Series

I analyzed the number of viewers of all the completed seasons for the following TV shows: Fringe, Lost, Heroes, Gossip Girl, Vampire Diaries, True Blood, The Sopranos, How I Met Your Mother, Glee and Family Guy. The data was taken from the respective Wikipedia pages.

My aim was to find simple “rule-of-thumb” formulas to estimate key values from the number of premiere viewers and to see if there’s a pattern for the decline of a show. Below you can see the main results from the analysis.

Result 1: Finale vs. Premiere

The number of finale viewers is about 85 % of the number of premiere viewers.

Result 2: Average vs. Premiere

The average number of viewers during a season is about 83 % of the number of premiere viewers.

Image

Result 3: Decline Pattern

The average number of viewers during a season is about 93 % of the average number of viewers during the previous season.

Image

This last result implies that the decline in popularity is exponential. If the average number of viewers for the first season is N(1), then the expected number of viewers for season n is: N(n) = N(1) * 0.93^(n-1). We can also express this using a table:

Average season two = 93 % of average season one

Average season three = 86 % of average season one

Average season four = 80 % of average season one

Average season five = 75 % of average season one

Average season six = 70 % of average season one

etc …
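
For completeness, here is a tiny Python sketch (my addition) that generates the table above from the 93 % rule of thumb:

```python
# The 93 % rule of thumb applied season by season: N(n) = N(1) * 0.93**(n - 1).
decline_per_season = 0.93

for season in range(2, 7):
    share = decline_per_season ** (season - 1)
    print(f"Average season {season} = {share:.0%} of average season one")
```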

Of course, this is all just the aggregate behaviour of the analyzed shows. Individual shows can behave very differently from that.

World Population – Is Mankind’s Explosive Growth Ending?

According to the World Population Clock there are currently about 7.191 billion people alive. This year there have been 118 million births (or 264 per minute) and 49 million deaths (or 110 per minute), resulting in a net growth of 69 million people. Where will this end? Nobody can say for sure. But what we can be certain about is that the explosive growth has been slowing down for the past 40 years. I’ll let the graphs tell the story.

Here is how the world population has developed since the year 1700. The numbers come from the United Nations Department of Economic and Social Affairs. From looking at the graph, no slowdown is visible:

Image

However, another graph reveals that there’s more to the story. I had the computer calculate the percentage changes from one decade to the next. From 1960 to 1970 the world population grew by 22 %. This was the peak so far. After that, the growth rate continuously declined. The percentage change from 2000 to 2010 was “only” 12 %.
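
As a rough sketch of that calculation, the decade-over-decade growth can be computed like this; the population values below are rounded estimates in billions (approximate placeholders, not the exact UN series), so the percentages come out a point or two off the figures quoted above:

```python
# Decade-over-decade growth, computed from rounded estimates of the world
# population in billions (approximate placeholder values, not the exact series).
population = {
    1950: 2.5, 1960: 3.0, 1970: 3.7, 1980: 4.4,
    1990: 5.3, 2000: 6.1, 2010: 6.9,
}

decades = sorted(population)
for prev, curr in zip(decades, decades[1:]):
    growth = population[curr] / population[prev] - 1
    print(f"{prev}-{curr}: {growth:+.0%}")
```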

Image

Of course it’s too early to conclude that this is the end of mankind’s explosive growth. There have been longer periods of slowing growth before (see around 1750 and 1850). But the data does raise this question.

Talk to me again when it’s 2020 or 2030.

Just by the way: according to estimates, about 108 billion people have been born since the beginning of mankind (see here). This implies that about 101 billion people have died so far and that of all those born, about 6.5 % are alive today.

Did somebody say dust in the wind?

The Standard Error – What it is and how it’s used

I smoke electronic cigarettes and recently I wanted to find out how much nicotine liquid I consume per day. I noted the used amount on five consecutive days:

3 ml, 3.4 ml, 7.2 ml, 3.7 ml, 4.3 ml

So how much do I use per day? Well, our best guess is to do the average, that is, sum all the amounts and divide by the number of measurements:

(3 ml + 3.4 ml + 7.2 ml + 3.7 ml + 4.3 ml) / 5 = 4.3 ml

Most people would stop here. However, there’s one very important piece of information missing: how accurate is that result? Surely an average value of 4.3 ml computed from 100 measurements is much more reliable than the same average computed from 5 measurements. Here’s where the standard error comes in and thanks to the internet, calculating it couldn’t be easier. You can type in the measurements here to get the standard error:

http://www.miniwebtool.com/standard-error-calculator/

It tells us that the standard error (of the mean, to be pedantically precise) of my five measurements is SEM = 0.75. This number is extremely useful because there’s a rule in statistics that states that with a 95 % probability, the true average lies within two standard errors of the computed average. For us this means that there’s a 95 % chance, which you could call beyond reasonable doubt, that the true average of my daily liquid consumption lies in this interval:

4.3 ml ± 1.5 ml

or between 2.8 and 5.8 ml. So the computed average is not very accurate. Note that as long as the standard deviation remains more or less constant as further measurements come in, the standard error is inversely proportional to the square root of the number of measurements. In simpler terms: if you quadruple the number of measurements, the size of the error interval halves. With 20 instead of only 5 measurements, we should be able to achieve plus/minus 0.75 accuracy.
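
If you’d rather compute it offline, here’s a short Python sketch of the whole procedure (using the sample standard deviation, which reproduces the SEM of 0.75):

```python
# Average, standard error of the mean and the two-standard-error interval.
from statistics import mean, stdev

measurements = [3.0, 3.4, 7.2, 3.7, 4.3]  # ml per day

avg = mean(measurements)
sem = stdev(measurements) / len(measurements) ** 0.5  # sample SD divided by sqrt(n)

print(f"average:        {avg:.2f} ml")
print(f"standard error: {sem:.2f} ml")
print(f"95 % interval:  {avg:.1f} ml +/- {2 * sem:.1f} ml")
```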

So when you have an average value to report, be sure to include the error interval. Your result is much more informative this way, and with the help of the online calculator as well as the above rule, computing it is quick and painless. It took me less than a minute.

A more detailed explanation of the average value, standard deviation and standard error (yes, the latter two are not the same thing) can be found in chapter 7 of my Kindle ebook Statistical Snacks (this was not an excerpt).

Increase Views per Visit by Linking Within your Blog

One of the most basic and useful performance indicators for blogs is the average number of views per visit. If it is high, that means visitors stick around to explore the blog after reading a post. They value the blog for being well-written and informative. But in the fast-paced, content-saturated online world, achieving a lot of views per visit is not easy.

You can help out a little by making exploring your blog easier for readers. A good way to do this is to link within your blog, that is, to provide internal links. Keep in mind though that random links won’t help much. If you link one of your blog posts to another, they should be connected in a meaningful way, for example by covering the same topic or giving relevant additional information to what a visitor just read.

Being mathematically curious, I wanted to find a way to judge what impact such internal links have on the overall views per visit. Assume you start with no internal links and observe a current number of views per visit of x. Now you add n internal links to your blog, which has a total of m entries. Given that the probability for a visitor to make use of an internal link is p, what will the overall number of views per visit change to? Last night I derived a formula for that:

x’ = x + (n / m) · (1 / (1-p) – 1)

For example, my blog (which has as of now very few internal links) has an average of x = 2.3 views per visit and m = 42 entries. If I were to add n = 30 internal links and assuming a reader makes use of an internal link with the probability p = 20 % = 0.2, this should theoretically change into:

x’ = 2.3 + (30 / 42) · (1 / 0.8 – 1) = 2.5 views per visit

A solid 9 % increase in views per visit and this just by providing visitors a simple way to explore. So make sure to go over your blog and connect articles that are relevant to each other. The higher the relevancy of the links, the higher the probability that readers will end up using them. For example, if I only added n = 10 internal links instead of thirty, but had them at such a level of relevancy that the probability of them being used increases to p = 40 % = 0.4, I would end up with the same overall views per visit:

x’ = 2.3 + (10 / 42) · (1 / 0.6 – 1) = 2.5 views per visit
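
For convenience, the formula can be wrapped in a small Python function (a sketch of my own, checked against the two examples above):

```python
def views_per_visit(x, n, m, p):
    """Expected views per visit after adding n internal links to a blog
    with m entries, each link being used with probability p."""
    return x + (n / m) * (1 / (1 - p) - 1)

print(views_per_visit(x=2.3, n=30, m=42, p=0.2))  # about 2.5
print(views_per_visit(x=2.3, n=10, m=42, p=0.4))  # about 2.5
```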

So it’s about relevancy as much as it is about amount. And in the spirit of not spamming, I’d prefer adding a few high-relevancy internal links rather than a lot of low-relevancy ones.

If you’d like to know more on how to optimize your blog, check out: Setting the Order for your WordPress Blog Posts and Keywords: How To Use Them Properly On a Website or Blog.

Quantitative Analysis of Top 60 Kindle Romance Novels

I did a quantitative analysis of the current Top 60 Kindle Romance ebooks. Here are the results. First I’ll take a look at all price related data and conclusions.

—————————————————————————–

  • Price over rank:

Image

There seems to be no relation between price and rank. A linear fit confirmed this. The average price was 3.70 $ with a standard deviation of 2.70 $.
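
As an aside, such a check might look roughly like this in Python; the arrays below are made-up placeholders standing in for the scraped top 60, not the actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
rank = np.arange(1, 61)                        # placeholder ranks 1 to 60
price = 3.70 + rng.normal(0.0, 2.70, size=60)  # placeholder prices, not the scraped data

slope, intercept = np.polyfit(rank, price, 1)  # least-squares line: price over rank
print(f"fitted slope: {slope:.3f} $ per rank")  # a slope near zero means no price-rank relation
```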

—————————————————————————–

  • Price frequency count:

Image

(Note that prices have been rounded up.) About one third of all romance novels in the top 60 are offered for 1 $. Roughly another third go for 3 $ or 4 $.

—————————————————————————–

  • Price per 100 pages over rank:

Image

Again, no relation here. The average price per 100 pages was 1.24 $ with a standard deviation of 0.86 $.

—————————————————————————–

  • Price per 100 pages frequency count:

Image

About half of all novels in the top 60 have a price per 100 pages lower than 1.20 $. Another third lies between 1.20 $ and 1.60 $.

—————————————————————————–

  • Price per 100 pages over number of pages:

Image

As I expected, the bigger the novel, the less you pay per page. Romance novels of about 200 pages cost 1.50 $ per 100 pages, while at 400 pages the price drops to about 1 $ per 100 pages. The decline is statistically significant; however, there’s a lot of variation.

—————————————————————————–

  • Review count:

Image

A little less than one half of the top novels have fewer than 50 reviews. About 40 % have between 50 and 150 reviews. Note that some of the remaining 10 % have more than 600 reviews (not included in the graph).

—————————————————————————–

  • Rating over rank:

Image

There’s practically no dependence of rank on rating among the top 60 novels. However, all have a rating of 3.5 stars or higher, most of them (95 %) 4 stars or higher.

—————————————————————————–

  • Pages over ranking:

Image

There’s no relation between number of pages and rank. A linear fit confirmed this. The average number of pages was 316 with a standard deviation of 107.

—————————————————————————–

  • Pages count:

Image

About 70 % of the analyzed novels have between 200 and 400 pages. 12 % are below and 18 % above this range.

Probability and Multiple Choice Tests

Imagine taking a multiple choice test that has three possible answers to each question. This means that even if you don’t know any answer, your chance of getting a question right is still 1/3. How likely is it to get all questions right by guessing if the test contains ten questions?

Here we are looking at the event “correct answer” which occurs with a probability of p(correct answer) = 1/3. We want to know the odds of this event happening ten times in a row. For that we simply apply the multiplication rule:

  • p(all correct) = (1/3)^10 = 0.000017

Doing the inverse, we can see that this corresponds to about 1 in 60000. So if we gave this test to 60000 students who only guessed the answers, we could expect only one to be that lucky. What about the other extreme? How likely is it to get none of the ten questions right when guessing?

Now we must focus on the event “incorrect answer” which has the probability p(incorrect answer) = 2/3. The odds for this to occur ten times in a row is:

  • p(all incorrect) = (2/3)^10 = 0.017

In other words: 1 in 60. Among the 60000 guessing students, this outcome can be expected to appear 1000 times. How would these numbers change if we only had eight instead of ten questions? Or if we had four options per question instead of three? I leave this calculation up to you.
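
If you want to play with the numbers, here’s a small Python sketch (my addition) covering both calculations as well as the two variations:

```python
def p_all_correct(questions, options):
    """Probability of guessing every question right."""
    return (1 / options) ** questions

def p_all_wrong(questions, options):
    """Probability of guessing every question wrong."""
    return (1 - 1 / options) ** questions

print(p_all_correct(10, 3))  # about 0.000017, roughly 1 in 60000
print(p_all_wrong(10, 3))    # about 0.017, roughly 1 in 60
print(p_all_correct(8, 3))   # eight questions instead of ten
print(p_all_correct(10, 4))  # four options per question instead of three
```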

More Pirates, Less Global Warming … wait, what?

An interesting correlation was found by the parody religion FSM (Flying Spaghetti Monster). Deducing causation here would be madness. Over the 18th and 19th centuries, piracy, the one with the boats, not the one with the files and the sharing, slowly died out. At the same time, possibly within a natural trend and / or for reasons of increased industrial activity, the global temperature started increasing. If you plot the number of pirates and the global temperature in a coordinate system, you find a relatively strong correlation between the two. The more pirates there are, the colder the planet is. Here’s the corresponding formula and graph:

T = 16 – 0.05 · P^0.33

Image

with T being the average global temperature and P the number of pirates. Given enough pirates (about 3.3 million to be specific), we could even freeze Earth. But of course nobody in their right mind would see causality at work here; rather, we have two processes, the disappearance of piracy and global warming, that happened to occur at the same time. So you shouldn’t be too surprised that the recent rise of piracy in Somalia didn’t do anything to stop global warming.

Statistics and Monkeys on Typewriters

Here are the first two sentences of the prologue to Shakespeare’s Romeo and Juliet:

Two households, both alike in dignity,
In fair Verona, where we lay our scene

This excerpt has 77 characters. Now we let a monkey start typing random letters on a typewriter. Once he typed 77 characters, we change the sheet and let him start over. How many tries does he need to randomly reproduce the above paragraph?

There are 26 letters in the English alphabet and since he’ll be needing the comma and space, we’ll include those as well. So there’s a 1/28 chance of getting the first character right. Same goes for the second character, third character, etc … Because he’s typing randomly, the chance of getting a character right is independent of what preceded it. So we can just start multiplying:

p(reproduce) = 1/28 · 1/28 · … · 1/28 = (1/28)^77

The result is about 4 times ten to the power of -112. This is a ridiculously small chance! Even if he was able to complete one quadrillion tries per millisecond, it would most likely take him considerably longer than the estimated age of the universe to reproduce these two sentences.

Now what about the first word? It has only three letters, so he should be able to get at least this part in a short time. The chance of randomly reproducing the word “two” is:

p(reproduce) = 1/26 · 1/26 · 1/26 = (1/26)^3

Note that I dropped the comma and space as a choice, so now there’s a 1 in 26 chance to get a character right. The result is 5.7 times ten to the power of -5, which is about a 1 in 17500 chance. Even a slower monkey could easily get that done within a year, but I guess it’s still best to stick to human writers.
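
If you don’t trust the arithmetic, a quick simulation settles it. The sketch below (my own, not from the book) lets a virtual monkey type three-letter attempts until “two” appears:

```python
import random
import string

def tries_until(word, alphabet=string.ascii_lowercase):
    """Count random attempts of len(word) characters until the word appears."""
    tries = 0
    while True:
        tries += 1
        attempt = "".join(random.choice(alphabet) for _ in word)
        if attempt == word:
            return tries

print(tries_until("two"))  # varies per run; on average about 26**3 = 17576 tries
```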

This was an excerpt from the ebook “Statistical Snacks”. Liked the excerpt? Get the book here: http://www.amazon.com/Statistical-Snacks-ebook/dp/B00DWJZ9Z2. Want more excerpts? Check out The Probability of Becoming a Homicide Victim and Missile Accuracy (CEP).

Missile Accuracy (CEP) – Excerpt from “Statistical Snacks”

An important quantity when comparing missiles is the CEP (Circular Error Probable). It is defined as the radius of the circle in which 50 % of the fired missiles land. The smaller it is, the better the accuracy of the missile. The German V2 rockets for example had a CEP of about 17 km. So there was a 50/50 chance of a V2 landing within 17 km of its target. Targeting smaller cities or even complexes was next to impossible with this accuracy; one could only aim for a general area in which the rocket would land rather randomly.

Today’s missiles are significantly more accurate. The latest version of China’s DF-21 has a CEP of about 40 m, allowing the accurate targeting of small complexes or large buildings, while the CEP of the American-made Hellfire is as low as 4 m, enabling precision strikes on small buildings or even tanks.

Assuming the impacts are normally distributed, one can derive a formula for the probability of striking a circular target of radius R using a missile with a given CEP:

p = 1 – exp( -0.41 · R² / CEP² )

This quantity is also called the “single shot kill probability” (SSKP). Let’s include some numerical values. Assume a small complex with the dimensions 100 m by 100 m is targeted with a missile having a CEP of 150 m. Converting the rectangular area into a circle of equal area gives us a radius of about 56 m. Thus the SSKP is:

p = 1 – exp( -0.41 · 56² / 150² ) = 0.056 = 5.6 %

So the chances of hitting the target are relatively low. But the lack in accuracy can be compensated by firing several missiles in succession. What is the chance of at least one missile hitting the target if ten missiles are fired? First we look at the odds of all missiles missing the target and answer the question from that. One missile misses with 0.944 probability, the chance of having this event occur ten times in a row is:

p(all miss) = 0.944^10 = 0.562

Thus the chance of at least one hit is:

p(at least one hit) = 1 – 0.562 = 0.438 = 43.8 %

Still not great considering that a single missile easily costs upwards of 10000 $. How many missiles of this kind must be fired at the complex to have a 90 % chance at a hit? A 90 % chance at a hit means that the chance of all missiles missing is 10 %. So we can turn the above formula for p(all miss) into an equation by inserting p(all miss) = 0.1 and leaving the number of missiles n undetermined:

0.1 = 0.944^n

All that’s left is doing the algebra. Applying the natural logarithm to both sides and solving for n results in:

n = ln(0.1) / ln(0.944) = 40

So forty missiles with a CEP of 150 m are required to have a 90 % chance at hitting the complex. As you can verify by doing the appropriate calculations, three DF-21 missiles would have achieved the same result.
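
The two formulas are easy to wrap in code. Here is a short Python sketch (my addition, not from the book) that reproduces the numbers in this snack:

```python
import math

def sskp(radius, cep):
    """Single shot kill probability for a circular target of the given radius."""
    return 1 - math.exp(-0.41 * radius**2 / cep**2)

def missiles_needed(radius, cep, confidence=0.9):
    """Number of shots required for at least one hit with the given confidence."""
    p_miss = 1 - sskp(radius, cep)
    return math.log(1 - confidence) / math.log(p_miss)

print(sskp(56, 150))             # about 0.056
print(missiles_needed(56, 150))  # about 40
print(missiles_needed(56, 40))   # about 3 for a DF-21-class CEP
```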

Liked the excerpt? Get the book “Statistical Snacks” by Metin Bektas here: http://www.amazon.com/Statistical-Snacks-ebook/dp/B00DWJZ9Z2. For more excerpts see The Probability of Becoming a Homicide Victim and How To Use the Expected Value.

Smoking – Your (My) Chances of Dying Early from it

I admit that I smoke. And my first attempt to quit after 13 years of a pack a day only lasted one month. Here’s what convinced me to try:

  • 50 % of smokers will die early due to their habit (Source: WHO)
  • On average smokers die 10 years earlier (Source: CDC)
  • Every year about 6 million people die from smoking related diseases, that is more than one Jumbo Jet full of people every hour (Source: WHO)

Most sensible people wouldn’t play Russian Roulette, but some take even higher chances at early death with smoking.

http://www.who.int/mediacentre/factsheets/fs339/en/

http://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/tobacco_related_mortality/

 

The good news: If you start smoking in your teens, and quit at …

  • … the age of 30, you get all of the 10 years back; the damage done is almost completely reversible
  • … the age of 40, you get 9 of the 10 years back; the damage done is reversible for the most part
  • … the age of 50, you get 6 of the 10 years back; some of the damages are still reversible
  • … the age of 60, you get 3 of the 10 years back; most damages will remain, but life quality will improve

Since I just hit 30, I’ll be sure to give it another try once my vacation is over. Having too much time is a very bad idea if you want to quit, better do it when you’re busy.

http://www.netdoctor.co.uk/healthy-living/lung-cancer.htm

http://www.rauchfrei.de/raucherstatistik.htm

Dota 2 Statistics

I analyzed 24 Dota 2 games with a total of 209 players and lengths ranging from 6 to 39 minutes to deduce the distribution of Gold per Minute (GPM) and Experience per Minute (XPM) players manage to achieve in the game. The data is taken from dotabuff.com and processed using Origin Pro.

 

Gold per Minute:

  • Average: 270 GPM
  • Ranged between 75 and 712 GPM
  • 5 % (1 in 20) cracked the 500 GPM mark

 

Image

 

Experience per Minute:

  • Average: 311 XPM
  • Ranged between 9 and 712 XPM
  • 4 % (1 in 25) cracked the 600 XPM mark

Image

Guns per Capita and Homicides – Is There a Correlation?

Here’s a statistics quickie. A while ago, just after the tragic shooting at Sandy Hook Elementary School, I wanted to produce a clear proof that gun ownership and homicide rates are correlated. It seemed logical to me that, plus / minus statistical fluctuations, the phrase “more guns, more violence” holds true. So I extracted the relevant data for all first world countries from Wikipedia and did the plot. Here’s the picture I got:

Image

Maybe you are as surprised as I was. Obviously, there’s no relationship between the two variables: more guns does not mean more violence, and fewer guns does not mean less violence. So whatever the main cause for the violence problem in the US (see the isolated dot in the top right? That’s the US), it can’t be guns. And that’s a liberal European speaking …

Just in case anyone cares, I blame the gang and hip-hop culture. It can’t be guns (see above), but it also can’t be media or mental health or drugs (people in all other first world countries also play shooter games, watch violent movies, have mental problems, buy and sell drugs).

Sources:

http://en.wikipedia.org/wiki/Number_of_guns_per_capita_by_country

http://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate

My Fair Game – How To Use the Expected Value

You meet a nice man on the street offering you a game of dice. For a wager of just 2 $, you can win 8 $ when the die shows a six. Sounds good? Let’s say you join in and play 30 rounds. What will be your expected balance after that?

You roll a six with the probability p = 1/6. So of the 30 rounds, you can expect to win 1/6 · 30 = 5, resulting in a pay-out of 40 $. But winning 5 rounds of course also means that you lost the remaining 25 rounds, resulting in a loss of 50 $. Your expected balance after 30 rounds is thus -10 $. Or in other words: for the player this game results in a loss of 1/3 $ per round.

 Let’s make a general formula for just this case. We are offered a game which we win with a probability of p. The pay-out in case of victory is P, the wager is W. We play this game for a number of n rounds.

The expected number of wins is p·n, so the total pay-out will be: p·n·P. The expected number of losses is (1-p)·n, so we will most likely lose this amount of money: (1-p)·n·W.

 Now we can set up the formula for the balance. We simply subtract the losses from the pay-out. But while we’re at it, let’s divide both sides by n to get the balance per round. It already includes all the information we need and requires one less variable.

B = p · P – (1-p) · W

This is what we can expect to win (or lose) per round. Let’s check it by using the above example. We had the winning chance p = 1/6, the pay-out P = 8 $ and the wager W = 2 $. So from the formula we get this balance per round:

B = 1/6 · 8 $ – 5/6 · 2 $ = – 1/3 $ per round

Just as we expected. Let’s try another example. I’ll offer you a dice game. If you roll two sixes in a row, you get P = 175 $. The wager is W = 5 $. Quite the deal, isn’t it? Let’s see. Rolling two sixes in a row occurs with a probability of p = 1/36. So the expected balance per round is:

B = 1/36 · 175 $ – 35/36 · 5 $ = 0 $ per round

I offered you a truly fair game. No one can be expected to lose in the long run. Of course if we only play a few rounds, somebody will win and somebody will lose.

It’s helpful to understand this balance as being sound for a large number of rounds but rather fragile in case of playing only a few rounds. Casinos are host to thousands of rounds per day and thus can predict their gains quite accurately from the balance per round. After a lot of rounds, all the random streaks and significant one-time events hardly impact the total balance anymore. The real balance will converge to the theoretical balance more and more as the number of rounds grows. This is mathematically proven by the Law of Large Numbers. Assuming finite variance, the proof can be done elegantly using Chebyshev’s Inequality.

The convergence can be easily demonstrated using a computer simulation. We will let the computer, equipped with random numbers, run our dice game for 2000 rounds. After each round the computer calculates the balance per round so far. The below picture shows the difference between the simulated balance per round and our theoretical result of – 1/3 $ per round.

Image
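
If you want to reproduce the experiment, a minimal sketch of such a simulation looks like this (the same 2 $ / 8 $ game played with Python’s random number generator):

```python
import random

P, W = 8, 2      # pay-out and wager in $
rounds = 2000
balance = 0

for _ in range(rounds):
    if random.randint(1, 6) == 6:  # rolled a six: collect the pay-out
        balance += P
    else:                          # any other roll: lose the wager
        balance -= W

print(balance / rounds)  # should land near the theoretical -1/3 $ per round
```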

(Liked the excerpt? Get the book “Statistical Snacks” by Metin Bektas here: http://www.amazon.com/Statistical-Snacks-ebook/dp/B00DWJZ9Z2)

The Probability of Becoming a Homicide Victim

 Each year in the US there are about 5 homicides per 100000 people, so the probability of falling victim to a homicide in a given year is 0.00005 or 1 in 20000. What are the chances of falling victim to a homicide over a lifespan of 70 years?

 Let’s approach this the other way around. The chance of not becoming a homicide victim during one year is p = 0.99995. Using the multiplication rule we can calculate the probability of this event occurring 70 times in a row:

p = 0.99995 · … · 0.99995 = 0.99995^70

 Thus the odds of not becoming a homicide victim over the course of 70 years are 0.9965. This of course also means that there’s a 1 – 0.9965 = 0.0035, or 1 in 285, chance of falling victim to a homicide during a life span. In other words: two victims in every jumbo jet full of people. How does this compare to other countries?

 In Germany, the homicide rate is about 0.8 per 100000 people. Doing the same calculation gives us a 1 in 1800 chance of becoming a murder victim, so statistically speaking there’s one victim per small city. At the other end of the scale is Honduras with 92 homicides per 100000 people, which translates into a saddening 1 in 16 chance of becoming a homicide victim over the course of a life and is basically one victim in every family.

 It can get even worse if you live in a particularly crime ridden part of a country. The homicide rate for the city San Pedro Sula in Honduras is about 160 per 100000 people. If this remained constant over time and you never left the city, you’d have a 1 in 9 chance of having your life cut short in a homicide.
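
The calculation generalizes easily. Here is a short Python sketch (my addition) that reproduces the figures above, up to rounding, from the annual rates:

```python
def lifetime_risk(rate_per_100k, years=70):
    """Probability of falling victim to a homicide over the given number of years."""
    p_safe_year = 1 - rate_per_100k / 100000
    return 1 - p_safe_year ** years

for place, rate in [("US", 5), ("Germany", 0.8), ("Honduras", 92), ("San Pedro Sula", 160)]:
    risk = lifetime_risk(rate)
    print(f"{place}: about 1 in {round(1 / risk)}")
```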

Liked the excerpt? Get the book “Statistical Snacks” by Metin Bektas here: http://www.amazon.com/Statistical-Snacks-ebook/dp/B00DWJZ9Z2. For more excerpts check out Missile Accuracy (CEP), Immigrants and Crime and Monkeys on Typewriters.

Immigrants and Crime – A Statistical Analysis

Assume we are given a country with a population that is 90 % native and 10 % immigrant. As is often the case in the first world, the native population is on average older than the immigrant population.

Let’s look at a certain type of crime, say robberies. Now a statistic shows that of all the robberies in the country, 80 % have been committed by natives and 20 % by immigrants. Can we conclude from these numbers that the immigrants are more inclined to steal than the natives? Many people would do so.

The police keep basic records of all crimes that have been reported. This enables us to get a closer look at the situation. Consider the graph below; it shows the age distribution of people accused of robbery in Canada in 2008. It immediately becomes clear that robbery is for the most part a “young person’s crime”. The rates are significantly elevated for ages 14 – 20 and then decrease with age. Even without crunching the numbers it is clear that the younger a population is, the more robberies will occur.

Image

Let’s go back to our fictional country of 90 % natives and 10 % immigrants, with the immigrant population being younger. Assuming the same inclination to committing robberies for both groups, the immigrant population would contribute more than 10 % to the total amount of robberies for the simple reason that robbery is a crime mainly committed by young people.

Using a simplistic example, we can put this logic to the test. Let’s stick to our numbers of 90 % natives and 10 % immigrants. This time however, we’ll crudely specify an age distribution for both. For the native population the breakdown is:

– 15 % below age 15

– 15 % between age 15 and 25

– 70 % above age 25

For the immigrants we take a slightly different distribution that results in a lower average age:

– 20 % below age 15

– 20 % between age 15 and 25

– 60 % above age 25

We’ll set the total population count to 100 million. Now assume that there’s a crime that is committed solely by people in the age group 15 to 25. Within this age group, 1 in 100000 will commit this crime over the course of one year, independently of which population group he or she belongs to. Note that this means that there’s no inclination towards this crime in either of the two groups.

It’s time to crunch the numbers. There are 0.9 · 100 million = 90 million natives. Of these, 0.15 · 90 million = 13.5 million are in the age group 15 to 25. This means we can expect 135 natives to commit this crime during a year.

As for the immigrants, there are 0.1 · 100 million = 10 million in the country, with 0.2 · 10 million = 2 million being in the age group of interest. They will give rise to an expected number of 20 crimes of this kind per year.

In total, we can expect this crime to be committed 155 times, with the immigrants having a share of 20 / 155 = 12.9 %. This is higher than their proportional share of 10 % despite there being no inclination for committing said crime. All that led to this result was the population being younger on average.
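
Here is the same toy model as a few lines of Python (my addition), in case you want to vary the age shares or the group sizes:

```python
total_population = 100_000_000
crime_rate_15_25 = 1 / 100000  # annual rate; only the 15-25 age group commits this crime

natives = 0.9 * total_population
immigrants = 0.1 * total_population

native_crimes = 0.15 * natives * crime_rate_15_25        # 135 expected crimes
immigrant_crimes = 0.20 * immigrants * crime_rate_15_25  # 20 expected crimes

share = immigrant_crimes / (native_crimes + immigrant_crimes)
print(f"immigrant share of crimes: {share:.1%}")  # about 12.9 %
```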

So concluding from a larger than proportional share of crime that there’s an inclination towards crime in this part of the population is not mathematically sound. To be able to draw any conclusions, we would need to know the expected value, which can be calculated from the age distribution of the crime and that of the population and can differ quite strongly from the proportional value.

(Liked the excerpt? Get the book “Statistical Snacks” by Metin Bektas here: http://www.amazon.com/Statistical-Snacks-ebook/dp/B00DWJZ9Z2)