Statistics

New E-Book Release: Math Concepts Everyone Should Know (And Can Learn)

Well, a few weeks ago I broke my toe, which meant that I was forced to sit in my room hour after hour and think about what to do. Luckily, I found something to keep me busy: writing a new math book. The book is called Math Concepts Everyone Should Know (And Can Learn) and was completed yesterday. It is now live on Amazon (click the cover to get to the product page) for the low price of $ 2.99 and can be read without any prior knowledge of mathematics.

[Image: book cover, linking to the Amazon product page]

I must say that I’m very pleased with the result and I hope you will enjoy the learning experience. The topics are:

– Combinatorics and Probability
– Binomial Distribution
– Banzhaf Power Index
– Trigonometry
– Hypothesis Testing
– Bijective Functions And Infinity

Survey on Sleeping – Longest Time Awake

Here I presented some basic results of my survey on sleeping. The analysis showed that male respondents reported having stayed awake for a maximum of 40.6 h on average, while for women this figure was 36.6 h. Not an enormous difference, but statistically significant at p < 0.05. I also wanted to give you the breakdown for all of the respondents, male or female:

[Chart: breakdown of the longest time awake for all respondents]

You can see that around 25 % (1 in 4) never made it past the 24 h mark. Roughly the same percentage (adding up the final four bars) made it past the 48 h mark and a brave 4.4 % (1 in 23) even past the 72 h mark. Feel free to use the comments section to tell me what the longest time you were awake was and what the circumstances were that made you stay up so long.

Survey on Sleeping

I conducted a paid survey via AYTM on the topic of sleeping. I was interested in finding out which variables (psychological, lifestyle, circumstance) have a noticeable effect on sleep-related issues such as nightmares, sleep duration, time needed to fall asleep, etc … Now that I’ve got the raw data, there’s a ton of relationships to analyze and that will take time. But I’ve already found a few neat statistically significant results, some to be expected, some rather surprising. I’ll publish them, as well as the results yet to be found, here on my blog in the coming weeks.

Geographic Region: US

Number of Respondents: 250

Males: 96 (38.4 %)
Females: 154 (61.6 %)

Minimum Age: 18
Median Age: 36
Maximum Age: 81

White-American: 163 (65.2 %)
African-American: 23 (9.2 %)
Asian-American: 17 (6.8 %)
Hispanic-American: 28 (11.2 %)
Other: 19 (7.6 %)

Here’s what I extracted from the data so far. Statistical significance was determined via a two-population Z-test. Notice the p-value. You can interpret it as the chance of seeing a difference this large purely through random fluctuation, even though there is no real effect. Hence, the lower the p-value, the more significant and reliable the result. A p-value of 0.05 roughly means that there’s a 1 in 20 chance that the result is just a random fluctuation, a value of 0.01 that there’s a 1 in 100 chance for the same. All of the results below are significant at p < 0.05, some even at p < 0.01.

By the way: if you’ve got your own data, you can let this great website do a two-population Z-test for you. It only works, though, if you’ve got the result in the form of a percentage. To learn more about hypothesis testing, including how to perform a Z-test when the data is not given in the form of a percentage, check out this great book by Leonard Gaston.
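For readers who want to reproduce these numbers themselves, here’s a minimal sketch of how such a two-proportion Z-test can be computed in Python. The function name, the pooled standard error and the one-sided p-value are my own assumptions about the procedure, not necessarily the exact tool used here; they do happen to reproduce the Z-score of the nightmare result below.

```python
from math import sqrt, erf

def two_proportion_z_test(p1, n1, p2, n2):
    """Z-test for two percentages, using a pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)              # combined proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error of the difference
    z = (p1 - p2) / se
    p_value = 0.5 * (1 - erf(abs(z) / sqrt(2)))           # one-sided p-value
    return z, p_value

# Nightmare result below: 49.6 % of 127 younger vs. 26.8 % of 123 older respondents
z, p = two_proportion_z_test(0.496, 127, 0.268, 123)
print(round(z, 3), round(p, 4))   # approximately 3.706 and 0.0001
```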

——————————————————————————
AGE:
——————————————————————————

————————–
Hypothesis: Young people have more nightmares
————————–

Percentage of people with nightmares for people at median age or younger: 49.6 %
Number of respondents: 127

Percentage of people with nightmares for people older than median age: 26.8 %
Number of respondents: 123

The Z-Score is 3.706. The p-value is 0.0001. The result is significant at p < 0.01.

Correlation between age and probability for nightmares: Probability = 0.757 – 0.00958·Age
(Every year the chance for nightmares goes down by roughly 1 %)

————————–
Hypothesis: Young people are more light sensitive
————————–

Percentage of light sensitive people for people at median age or younger: 62.2 %
Number of respondents: 127

Percentage of light sensitive people for people older than median age: 49.6 %
Number of respondents: 123

The Z-Score is 2.0065. The p-value is 0.02222. The result is significant at p < 0.05.

Correlation between age and probability for light sensitivity: Probability = 0.756 – 0.00504·Age
(Every two years the chance for light sensitivity goes down by 1 %)

————————–
Hypothesis: Older people more frequently take naps
————————–

Percentage of people taking naps for people at median age or younger: 23.6 %
Number of respondents: 127

Percentage of people taking naps for people older than median age: 33.3 %
Number of respondents: 123

The Z-Score is 1.7009. The p-value is 0.04457. The result is significant at p < 0.05.

Correlation between age and probability for taking naps: Probability = 0.194 + 0.00232·Age
(Every four years the chance for taking naps goes up by 1 %)

——————————————————————————
GENDER:
——————————————————————————

————————–
Hypothesis: Females wake up more frequently at night
————————–

Percentage of males who frequently wake up at night: 51.1 %
N = 96

Percentage of females who frequently wake up at night: 64.3 %
N = 154

The Z-Score is 2.081. The p-value is 0.03752. The result is significant at p < 0.05.

————————–
Hypothesis: Males have a higher peak waking duration
————————–

Peak waking duration for males: 40.6 h
SEM: 1.76 h

Peak waking duration for females: 36.6 h
SEM: 1.33 h

The Z-Score is 1.8132. The p-value is 0.0349. The result is significant at p < 0.05.

——————————————————————————
INCOME:
——————————————————————————

————————–
Hypothesis: People with low income daydream more
————————–

Percentage of people with low income who frequently daydream: 40.1 %
N = 142

Percentage of people with high income who frequently daydream: 26.9 %
N = 108

The Z-Score is 2.1764. The p-value is 0.01463. The result is significant at p < 0.05.

————————–
Hypothesis: People with low income take longer to fall asleep
————————–

Time to fall asleep for people with low income: 31.4 min
SEM = 1.44 min

Time to fall asleep for people with high income: 22.9 min
SEM = 1.08 min

The Z-Score is 4.7222. The p-value is < 0.00001. The result is significant at p < 0.01.

——————————————————————————
DEPRESSION:
——————————————————————————

————————–
Hypothesis: Depressed people have more nightmares
————————–

Percentage of people with nightmares for depressed people: 57.1 %
Number of respondents: 84

Percentage of people with nightmares for non-depressed people: 28.9 %
Number of respondents: 166

The Z-Score is 4.3308. The p-value is < 0.00001. The result is significant at p < 0.01.

————————–
Hypothesis: Depressed people daydream more
————————–

Percentage of depressed people who frequently daydream: 50.3 %
N = 84

Percentage of non-depressed people who frequently daydream: 26.5 %
N = 166

The Z-Score is 3.7392. The p-value is 0.00009. The result is significant at p < 0.01.

————————–
Hypothesis: Depressed people need more time to feel fully awake after a good night’s sleep
————————–

Time to feel fully awake for depressed people: 42.5 min
SEM: 2.70 min

Time to feel fully awake for non-depressed people: 28.9 min
SEM: 1.27 min

The Z-Score is 4.5580. The p-value is < 0.00001. The result is significant at p < 0.01.

——————————————————————————
OTHERS:
——————————————————————————

————————–
Hypothesis: People who need more time to fall asleep also need more time to feel fully awake after a good night’s sleep
————————–

Time to feel fully awake for people who need less than 30 minutes to fall asleep: 29.9 min
SEM: 1.77 min

Time to feel fully awake for people who need 30 minutes or more to fall asleep: 37.1 min
SEM: 1.92 min

The Z-Score is 2.7572. The p-value is 0.002915. The result is significant at p < 0.01.

Correlation between time to fall asleep (FAS) and time to feel fully awake (FAW):

FAW = 24.7 + 0.317*FAS

(Every ten minutes of additional time needed to fall asleep translates into roughly three minutes of additional time required to feel fully awake after a good night’s sleep)

CTR (Click Through Rate) – Explanation, Results and Tips

A very important metric for banner advertising is the CTR (click-through rate). It is simply the number of clicks the ad generated divided by the total number of impressions. You can also think of it as the product of the probability of a user noticing the ad and the probability of the user being interested in the ad.

CTR = clicks / impressions = p(notice) · p(interested)

The current average CTR is around 0.09 % or 9 clicks per 10,000 impressions and has been declining for the past several years. What are the reasons for this? For one, the common banner locations are familiar to web users and are thus easy to ignore. There’s also the increased popularity of ad-blocking software.

The attitude of internet users is generally negative towards banner ads. This is caused by advertisers using more and more intrusive formats. These include annoying pop-ups and their even more irritating sisters, the floating ads. Adopting them is not favorable for advertisers. They harm a brand and produce very low CTRs. So hopefully, we will see an end to such nonsense soon.

As for animated ads, their success depends on the type of website and target group. For high-involvement websites that users visit to find specific information (news, weather, education), animated banners perform worse than static banners. In case of low-involvement websites that are put in place for random surfing (entertainment, lists, mini games) the situation is reversed. The target group also plays an important role. For B2C (business-to-consumer) ads animation generally works well, while for B2B (business-to-business) animation was shown to lower the CTR.

The language used in ads has also been extensively studied. One interesting result is that often it is preferable to use English language even if the ad is displayed in a country in which English is not the first language. A more obvious result is that catchy words and calls to action (“read more”) increase the CTR.

As for the banner size, the data is inconclusive. Some analyses report that the CTR grows with banner size, while others conclude that banner sizes around 250×250 or 300×250 generate the highest CTRs. There is a clearer picture regarding shape: in terms of CTR, square shapes work better than thin rectangles of the same size. No significant difference was found between vertical and horizontal rectangles.

Here’s another hint: my own theoretical calculations show that higher CTRs can be achieved by advertising on pages that have a low visitor loyalty. The explanation for this counter-intuitive outcome as well as a more sophisticated formula for the CTR can be found here. It is, in a nutshell, a result of the multiplication rule of statistics. The calculation also shows that on sites with a low visitor loyalty the CTR will stay constant, while on websites with a high visitor loyalty it will decrease over time.

Sources and further reading:

  • Study on banner advertisement type and shape effect on click-through-rate and conversion

http://www.aabri.com/manuscripts/131481.pdf

  • The impact of banner ad styles on interaction and click-through-rates

http://iacis.org/iis/2008/S2008_989.pdf

  • Impact of animation and language on banner click-through-rates

http://www.academia.edu/1608289/Impact_of_Animation_and_Language_on_Banner_Click-Through_Rates

Mathematics of Banner Ads: Visitor Loyalty and CTR

First of all: why should a website’s visitor loyalty have any effect at all on the CTR we can expect to achieve with a banner ad? What does the one have to do with the other? To understand the connection, let’s take a look at an overly simplistic example. Suppose we place a banner ad on a website and get in total 3 impressions (granted, not a realistic number, but I’m only trying to make a point here). From previous campaigns we know that a visitor clicks on our ad with a probability of 0.1 = 10 % (which is also quite unrealistic).

The expected number of clicks from these 3 impressions is …

… 0.1 + 0.1 + 0.1 = 0.3 when all impressions come from different visitors.

… 1 – 0.9^3 = 0.27 when all impressions come from only one visitor.

(the symbol ^ stands for “to the power of”)
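As a quick check, here’s a tiny Python sketch of these two calculations, just the arithmetic from the example above with the assumed click probability of 0.1:

```python
q = 0.1  # assumed probability that a visitor clicks on the ad

# all three impressions from different visitors: expected clicks simply add up
different_visitors = 3 * q                # 0.3

# all three impressions from the same visitor: at most one click in total,
# so the expected number of clicks equals the chance of at least one click
same_visitor = 1 - (1 - q) ** 3           # 0.271

print(different_visitors, round(same_visitor, 3))
```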

This demonstrates that we can expect more clicks if the website’s visitor loyalty is low, which might seem counter-intuitive at first. But the great thing about mathematics is that it cuts through bullshit better than the sharpest knife ever could. Math doesn’t lie. Let’s develop a model to show that a higher visitor loyalty translates into a lower CTR.

Suppose we got a total of I impressions on the banner ad. We’ll denote the share of these impressions contributed by visitors who saw the ad …

… exactly once by f(1)
… exactly twice by f(2)
… exactly three times by f(3)

And so on. Note that this distribution f(n) must satisfy the condition ∑[n] f(n) = 1 for it all to check out. The symbol ∑[n] stands for the sum over all n.

We’ll assume that the probability of a visitor clicking on the ad during a single visit is q. The probability that this visitor clicks on the ad at least once during n visits is then: p(n) = 1 – (1 – q)^n (to understand why, you have to know about the multiplication rule of statistics – if you’re not familiar with it, my ebook “Statistical Snacks” is a good place to start).

Let’s count the expected number of clicks for the I impressions. Visitors …

… contributing only one impression give rise to c(1) = p(1) + p(1) + … [f(1)·I addends in total] = p(1)·f(1)·I clicks

… contributing two impressions give rise to c(2) = p(2) + p(2) + … [f(2)·I/2 addends in total] = p(2)·f(2)·I/2 clicks

… contributing three impressions give rise to c(3) = p(3) + p(3) + … [f(3)·I/3 addends in total] = p(3)·f(3)·I/3 clicks

And so on. So the total number of clicks we can expect is: c = I · ∑[n] p(n)·f(n)/n. Since the CTR is just clicks divided by impressions, we finally get this beautiful formula:

CTR = ∑[n] p(n)·f(n)/n

The expression p(n)/n decreases as n increases. So a higher visitor loyalty (which mathematically means that f(n) has a relatively high value for n greater than one) translates into a lower CTR. One final conclusion: the formula can also tell us a bit about how the CTR develops during a campaign. If a website has no loyal visitors, the CTR will remain at a constant level, while for websites with a lot of loyal visitors, the CTR will decrease over time.
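Here’s a small Python sketch of this formula. The loyalty distributions and the click probability q are made-up illustrative values (q is unrealistically high, as in the example above), but they show the effect: shifting impression share toward repeat visitors lowers the CTR.

```python
def ctr(q, f):
    """CTR = sum over n of p(n) * f(n) / n, where f[n] is the share of
    impressions coming from visitors with n impressions each and
    p(n) = 1 - (1 - q)**n is the chance such a visitor clicks at least once."""
    return sum((1 - (1 - q) ** n) * share / n for n, share in f.items())

q = 0.1                                       # assumed per-visit click probability
low_loyalty  = {1: 1.0}                       # every impression from a new visitor
high_loyalty = {1: 0.2, 3: 0.3, 10: 0.5}      # most impressions from repeat visitors

print(round(ctr(q, low_loyalty), 3))    # 0.1
print(round(ctr(q, high_loyalty), 3))   # about 0.08, lower despite the same q
```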

Braingate – You Thought It Was Science Fiction, But It’s Not

On April 12, 2011, something extraordinary happened. A 58-year-old woman who was paralyzed from the neck down reached for a bottle of coffee, drank from a straw and put the bottle back on the table. But she didn’t reach with her own hand – she controlled a robotic arm with her mind. Unbelievable? It is. But decades of research made the unbelievable possible. Watch this exceptional and moving moment in history here (click on the picture for the YouTube video).


The 58-year-old woman (patient S3) was part of the BrainGate2 project, a collaboration of researchers at the Department of Veterans Affairs, Brown University, the German Aerospace Center (DLR) and others. The scientists implanted a small chip containing 96 electrodes into her motor cortex. This part of the brain is responsible for voluntary movement. The chip measures the electrical activity of the brain, and an external computer translates this pattern into the movement of a robotic arm. A brain-computer interface. And it’s not science fiction, it’s science.

During the study, the woman was able to grasp items within the allotted time with a 70 % success rate. Another participant (patient T2) even managed to achieve a 96 % success rate. Besides moving robotic arms, the participants were also given the task of spelling out words and sentences by indicating letters via eye movement. Participant T2 spelt out this sentence: “I just imagined moving my own arm and the [robotic] arm moved where I wanted it to go”.

The future is exciting.

Simpson’s Paradox And Gender Discrimination

One sunny day we arrive at work in the university administration to find a lot of aggressive emails in our inbox. Just the day before, a news story about gender discrimination in academia, which included data from our university, was published in a popular local newspaper. The emails are a result of that. Female readers are outraged that men were accepted at the university at a higher rate, while male readers are angry that women were favored in each course the university offers. Somewhat puzzled, we take a look at the data to see what’s going on and who’s wrong.

The university only offers two courses: physics and sociology. In total, 1000 men and 1000 women applied. Here’s the breakdown:

Physics:

800 men applied ‒ 480 accepted (60 %)
100 women applied ‒ 80 accepted (80 %)

Sociology:

200 men applied ‒ 40 accepted (20 %)
900 women applied ‒ 360 accepted (40 %)

Seems like the male readers are right. In each course women were favored. But why the outrage by female readers? Maybe they focused more on the following piece of data. Let’s count how many men and women were accepted overall.

Overall:

1000 men applied ‒ 520 accepted (52 %)
1000 women applied ‒ 440 accepted (44 %)

Wait, what? How did that happen? Suddenly the situation seems reversed. What looked like a clear case of discrimination of male students turned into a case of discrimination of female students by simple addition. How can that be explained?

The paradoxical situation is caused by the different capacities of the two departments as well as the students’ overall preferences. While the physics department, the top choice of male students, could accept 560 students, the smaller sociology department, the top choice of female students, could only take on 400 students. So a higher overall acceptance rate for male students is to be expected even if women are slightly favored in each course.
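If you want to play with the numbers yourself, here’s a short Python sketch that recomputes the per-course and overall acceptance rates from the toy data above:

```python
# (applied, accepted) per course and gender, numbers from the example above
data = {
    "physics":   {"men": (800, 480), "women": (100, 80)},
    "sociology": {"men": (200, 40),  "women": (900, 360)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for course, by_gender in data.items():
    for gender, (applied, accepted) in by_gender.items():
        print(f"{course:9s} {gender:5s} {accepted / applied:.0%}")
        totals[gender][0] += applied
        totals[gender][1] += accepted

for gender, (applied, accepted) in totals.items():
    print(f"overall   {gender:5s} {accepted / applied:.0%}")
# women have the higher rate in both courses, yet men have the higher overall rate
```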

While this might seem to you like an overly artificial example to demonstrate an obscure statistical phenomenon, I’m sure the University of California (Berkeley) would beg to differ. It was sued in 1973 for bias against women on the basis of these admission rates:

8442 men applied ‒ 3715 accepted (44 %)
4321 women applied ‒ 1512 accepted (35 %)

A further analysis of the data however showed that women were favored in almost all departments ‒ Simpson’s paradox at work. The paradox also appeared (and keeps on appearing) in clinical trials. A certain treatment might be favored in individual groups, but still prove to be inferior in the aggregate data.

How Statistics Turned a Harmless Nurse Into a Vicious Killer

Let’s do a thought experiment. Suppose you have 2 million coins at hand and a machine that will flip them all at the same time. After twenty flips, you evaluate the results and come across one particular coin that showed heads twenty times in a row. Suspicious? Alarming? Is there something wrong with this coin? Let’s dig deeper. How likely is it that a coin shows heads twenty times in a row? Luckily, that’s not so hard to compute. For each flip there’s a 0.5 probability that the coin shows heads, and the chance of seeing this twenty times in a row is just 0.5^20 = 0.000001 (rounded). So the odds of this happening are incredibly low. Indeed we stumbled across a very suspicious coin. Deep down I always knew there was something up with this coin. It just had this “crazy flip”, you know what I mean? Guilty as charged and end of story.

Not quite, you say? You are right. After all, we flipped 2 million coins. If the odds of twenty heads in a row are 0.000001, we should expect 0.000001 * 2,000,000 = 2 coins to show this unlikely string. It would be much more surprising not to find this string among the large number of trials. Suddenly, the coin with the supposedly “crazy flip” doesn’t seem so guilty anymore.
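The arithmetic of the thought experiment in a few lines of Python (numbers as above):

```python
p_run = 0.5 ** 20                  # chance of 20 heads in a row, about 0.00000095
coins = 2_000_000

print(p_run * coins)               # expected number of such coins: about 1.9
print(1 - (1 - p_run) ** coins)    # chance of seeing at least one: about 0.85
```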

What’s the point of all this? Recently, I came across the case of Lucia De Berk, a Dutch nurse who was accused in 2003 of murdering patients. Over the course of one year, seven of her patients had died, and a “sharp” medical expert concluded that there was only a 1 in 342 million chance of this happening by coincidence. This number and some other pieces of “evidence” (among them her “odd” diary entries and her “obsession” with Tarot cards) led the court in The Hague to conclude that she must be guilty as charged, end of story.

Not quite, you say? You are right. In 2010 came the not-guilty verdict. Turns out (funny story) she never committed any murder; she was just a harmless nurse who was transformed into a vicious killer by faulty statistics. Let’s go back to the thought experiment for a moment, imperfect for this case though it may be. Imagine that each coin represents a nurse and each flip a month of duty. It is estimated that there are around 300,000 hospitals worldwide, so we are talking about a lot of nurses/coins doing a lot of work/flips. Should we become suspicious when seeing a string of several deaths for a particular nurse? No, of course not. By pure chance, this will occur. It would be much more surprising not to find a nurse with a “suspicious” string of deaths among this large number of nurses. Focusing in on one nurse only blurs the big picture.

And, leaving statistics behind, the case also goes to show that you can always find something “odd” about a person if you want to. Faced with new information, even if not reliable, you interpret the present and past behavior in a “new light”. The “odd” diary entries, the “obsession” with Tarot cards … weren’t the signs always there?

Be careful when judging. Benjamin Franklin once said he should consider himself lucky if he’s right 50 % of the time. And that’s a genius talking, so I don’t even want to know my stats …

Two And A Half Fallacies (Statistics, Probability)

The field of statistics gives rise to a great number of fallacies (and intentional misuse for that matter). One of the most common is the Gambler’s Fallacy. It is the idea that an event can be “due” if it hasn’t appeared against all odds for quite some time.

In August 1913 an almost impossible string of events occurred in a casino in Monte Carlo. The roulette table showed black a record number of twenty-six times in a row. Since the chance for black on a single spin is about 0.474, the odds for this string are: 0.474^26 = 1 in about 270 million. For the casino, this was a lucky day. It profited greatly from players believing that once the table showed black several times in a row, the probability for another black to show up was impossibly slim. Red was due.

Unfortunately for the players, this logic failed. The chances for black remained at 0.474, no matter what colors appeared so far. Each spin is a complete reset of the game. The same goes for coins. No matter how many times a coin shows heads, the chance for this event will always stay 0.5. An unlikely string will not alter any probabilities if the events are truly independent.

Another common statistical fallacy is “correlation implies causation”. In countries with sound vaccination programmes, cancer rates are significantly elevated, whereas in countries where vaccination hardly takes place, only a few people suffer from cancer. This seems to be a clear case against vaccination: it correlates with (and thus surely somehow must cause) cancer.

However, taking a third variable and additional knowledge about cancer into account produces a very different picture. Cancer is a disease of old age. Because it requires a string of undesired mutations to take place, it is usually not found in young people. It is thus clear that in countries with a higher life expectancy, you will find higher cancer rates. This increased life expectancy is reached via the many different tools of health care, vaccination being an important one of them. So vaccination leads to a higher life expectancy, which in turn leads to elevated rates in diseases of old age (among which is cancer). The real story behind the correlation turned out to be quite different from what could be expected at first.

Another interesting correlation was found by the parody religion FSM (Flying Spaghetti Monster). Deducing causation here would be madness. Over the 18th and 19th centuries, piracy (the one with the boats, not the one with the files and the sharing) slowly died out. At the same time, possibly within a natural trend and/or for reasons of increased industrial activity, the global temperature started increasing. If you plot the number of pirates and the global temperature in a coordinate system, you find a relatively strong correlation between the two. The more pirates there are, the colder the planet is. Here’s the corresponding formula:

T = 16 – 0.05 · P^0.33

with T being the average global temperature and P the number of pirates. Given enough pirates (on the order of 30 million according to this formula), we could even freeze Earth.

[Chart: number of pirates versus average global temperature]

But of course nobody in their right mind would see causality at work here; rather, we have two processes, the disappearance of piracy and global warming, that happened to occur at the same time. So you shouldn’t be too surprised that the recent rise of piracy in Somalia didn’t do anything to stop global warming.

As we saw, a correlation between quantities can arise in many ways and does not always imply causation. Sometimes there is a third, unseen variable in the line of causation; other times it’s two completely independent processes happening at the same time. So be careful when drawing your conclusions.

Though not a fallacy in the strict sense, combinations of low probability and a high number of trials are also a common cause of incorrect conclusions. We computed that in roulette the odds of black showing twenty-six times in a row are only 1 in 270 million. We might conclude that it is basically impossible for this to happen anywhere.

But considering there are something in the order of 3500 casinos worldwide, each playing roughly 100 rounds of roulette per day, we get about 130 million rounds per year. With this large number of trials, it would be foolish not to expect a 1 in 270 million event to occur every now and then. So when faced with a low probability for an event, always take a look at the number of trials. Maybe it’s not as unlikely to happen as suggested by the odds.

Code Transmission and Probability

Not long ago, mankind first sent rovers to Mars to analyze the planet and find out whether it ever supported life. The nagging question “Are we alone?” drives us to penetrate deeper into space. A special challenge associated with such journeys is communication. There needs to be a constant flow of digital data, strings of ones and zeros, back and forth to ensure the success of the space mission.

During the process of transmission over the endless distances, errors can occur. There’s always a chance that zeros randomly turn into ones and vice versa. What can we do to make communication more reliable? One way is to send duplicates.

Instead of simply sending a 0, we send the string 00000. If not too many errors occur during the transmission, we can still decode it on arrival. For example, if it arrives as 00010, we can deduce that the originating string was with a high probability a 0 rather than a 1. The single transmission error that occurred did not cause us to incorrectly decode the string.

Assume that the probability of a transmission error is p and that we add to each 0 (or 1) four copies, as in the above paragraph. What is the chance of us being able to decode it correctly? To be able to decode 00000 on arrival correctly, we can’t have more than two transmission errors occurring. So during the n = 5 transmissions, k = 0, k = 1 and k = 2 errors are allowed. Using the binomial distribution we can compute the probability for each of these events:

p(0 errors) = C(5,0) · p^0 · (1-p)^5

p(1 error) = C(5,1) · p^1 · (1-p)^4

p(2 errors) = C(5,2) · p^2 · (1-p)^3

We can simplify these expressions somewhat. A binomial calculator provides us with these values: C(5,0) = 1, C(5,1) = 5 and C(5,2) = 10. This leads to:

p(0 errors) = (1-p)^5

p(1 error) = 5 · p · (1-p)^4

p(2 errors) = 10 · p^2 · (1-p)^3

Adding the probabilities for all these desired events tells us how likely it is that we can correctly decode the string.

p(success) = (1-p)^3 · ((1-p)^2 + 5·p·(1-p) + 10·p^2)

In the graph below you can see the plot of this function. The x-axis represents the transmission error probability p and the y-axis the chance of successfully decoding the string. For p = 10 % (1 in 10 bits arrive incorrectly) the odds of identifying the originating string are still a little more than 99 %. For p = 20 % (1 in 5 bits arrive incorrectly) this drops to about 94 %.

[Plot: probability of correct decoding versus transmission error probability p]

The downside to this gain in accuracy is that the amount of data to be transmitted, and thus the time it takes for the transmission to complete, increases fivefold.
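Here’s a minimal Python sketch of this calculation, written as a generic majority-vote decoder for n repetitions; n = 5 reproduces the two figures quoted above.

```python
from math import comb

def decode_success(p, n=5):
    """Probability of decoding a repeated bit correctly when each of the n
    copies flips independently with probability p and a majority vote decides."""
    max_errors = (n - 1) // 2   # for n = 5, up to 2 flipped bits are tolerated
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(max_errors + 1))

for p in (0.1, 0.2):
    print(p, round(decode_success(p), 4))   # 0.1 -> 0.9914, 0.2 -> 0.9421
```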

Distribution of E-Book Sales on Amazon

For e-books on Amazon the relationship between the daily sales rate s and the rank r is approximately given by:

s = 100,000 / r

Such an inversely proportional relationship between a ranked quantity and its rank is called a Zipf distribution. So a book at rank r = 10,000 can be expected to sell s = 100,000 / 10,000 = 10 copies per day. As of November 2013, there are about 2.4 million e-books available in Amazon’s US store (talk about tough competition). In this post we’ll answer two questions. The first one is: how many e-books are sold on Amazon each day? To answer that, we need to add up the daily sales rates from r = 1 to r = 2,400,000.

s = 100,000 · ( 1/1 + 1/2 + … + 1/2,400,000 )

We can evaluate that using the approximation formula for harmonic sums:

1/1 + 1/2 + 1/3 + … + 1/r ≈ ln(r) + 0.58

Thus we get:

s ≈ 100,000 · ( ln(2,400,000) + 0.58 ) ≈ 1.5 million

That’s a lot of e-books! And a lot of saved trees for that matter. The second question: What percentage of the e-book sales come from the top 100 books? Have a guess before reading on. Let’s calculate the total daily sales for the top 100 e-books:

s ≈ 100,000 · ( ln(100) + 0.58 ) ≈ 0.5 million

So the top 100 e-books already make up one-third of all sales while the other 2,399,900 e-books have to share the remaining two-thirds. The cake is very unevenly distributed.
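The same estimate in a few lines of Python, using the harmonic-sum approximation from above (the constant 100,000 and the catalogue size are the values quoted in this post):

```python
from math import log

def total_daily_sales(top_rank, constant=100_000):
    """Sum of s = constant / r for r = 1 .. top_rank,
    approximated via the harmonic sum ln(top_rank) + 0.58."""
    return constant * (log(top_rank) + 0.58)

all_books = total_daily_sales(2_400_000)   # about 1.53 million sales per day
top_100 = total_daily_sales(100)           # about 0.52 million sales per day

print(round(all_books), round(top_100), round(top_100 / all_books, 2))   # top 100 share: about 1/3
```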

This was a slightly altered excerpt from More Great Formulas Explained, available on Amazon for Kindle. For more posts on the ebook market go to my E-Book Market and Sales Analysis Pool.

Estimating Temperature Using Cricket Chirps

I stumbled upon a truly great formula on the GLOBE scientists’ blog. It allows you to compute the ambient air temperature from the number of cricket chirps in a fixed time interval, and with surprising accuracy. The idea is actually quite old: it dates back to 1897, when the physicist Amos Dolbear first analyzed this relationship, and it has been revived from time to time ever since.

Here’s how it works: count the number of chirps N over 13 seconds. Add 40 to that and you’ve got the outside temperature T in Fahrenheit.

T = N + 40

From the picture below you can see that the fit is really good. The error seems to be plus / minus 6 % at most in the range from 50 to 80 °F.
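As a small illustration, here’s the rule as a Python one-liner plus a conversion to Celsius (the conversion is standard; the example chirp count is made up):

```python
def temperature_from_chirps(chirps_in_13_seconds):
    """Dolbear-style estimate: chirps in 13 seconds plus 40 gives degrees Fahrenheit."""
    return chirps_in_13_seconds + 40

fahrenheit = temperature_from_chirps(30)   # 70 degrees Fahrenheit
celsius = (fahrenheit - 32) * 5 / 9        # about 21.1 degrees Celsius
print(fahrenheit, round(celsius, 1))
```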

E-Book Market & Sales – Analysis Pool

On this page you can find a collection of all my statistical analyses and research regarding the Kindle ebook market and sales. I’ll keep the page updated.

How E-Book Sales Vary at the End / Beginning of a Month

The E-Book Market in Numbers

Computing and Tracking the Amazon Sales Rank

Typical Per-Page-Prices for E-Books

Quantitative Analysis of Top 60 Kindle Romance Novels

Mathematical Model For E-Book Sales

If you have any suggestions on what to analyze next, just let me know. Share if you like the information.

How E-Book Sales Vary at the End / Beginning of a Month

After getting satisfying data and results on ebook sales over the course of a week, I was also interested in finding out what impact the end or beginning of a month has on sales. For that I looked up the sales of 20 ebooks, all taken from the current top 100 Kindle ebooks list, for November and the beginning of December on novelrank. Here’s how they performed at the end of November:

  • Strong Increase: 0 %
  • Slight Increase: 0 %
  • Unchanged: 20 %
  • Slight Decrease: 35 %
  • Strong Decrease: 45 %

80 % showed either a slight or strong decrease, and none showed any increase. So there’s a very pronounced downward trend in ebook sales at the end of the month. It usually begins around the 20th. On to the performance at the beginning of December:

  • Strong Increase: 50 %
  • Slight Increase: 35 %
  • Unchanged: 10 %
  • Slight Decrease: 5 %
  • Strong Decrease: 0 %

Here 85 % showed either a slight or strong increase, while only 5 % showed any decrease. This doesn’t leave much room for interpretation: there’s a clear upward trend at the beginning of the month. It usually lasts only a few days (shorter than the decline period) and after that the elevated level is more or less maintained.

Mathematical Model For (E-) Book Sales

It seems to be a no-brainer that with more books on the market, an author will see higher revenues. I wanted to know more about how the sales rate varies with the number of books. So I did what I always do when faced with an economic problem: construct a mathematical model. Even though it took me several tries to find the right approach, I’m fairly confident that the following model is able to explain why revenues grow overproportionally with the number of books an author has published. I also stumbled across a way to correct the marketing R/C for the number of books.

The basic quantities used are:

  • n = number of books
  • i = impressions per day
  • q = conversion probability (which is the probability that an impression results in a sale)
  • s = sales per buyer
  • r = daily sales rate

Obviously the basic relationship is:

r = i(n) * q(n) * s(n)

with the brackets indicating a dependence of the quantities on the number of books.

1) Let’s start with s(n), the sales per buyer. Suppose there’s a probability p that a buyer who has purchased one of an author’s books will go on to buy yet another book by said author. To visualize this, think of the books as some kind of mirrors: each ray (sale) will either go through the book (no further sales from this buyer) or be reflected onto another book by the author. In the latter case, the process repeats. Using this “reflective model”, the number of sales per buyer is:

s(n) = 1 + p + p^2 + … + p^(n-1) = (1 – p^n) / (1 – p)

For example, if the probability of a reader buying another book from the same author is p = 15 % = 0.15 and the author has n = 3 books available, we get:

s(3) = (1 – 0.15^3) / (1 – 0.15) ≈ 1.17 sales per buyer

So the number of sales per buyer increases with the number of books. However, it quickly reaches a limiting value. Letting n go to infinity results in:

s(∞) = 1 / (1 – p)

Hence, this effect is a source for overproportional growth only for the first few books. After that it turns into a constant factor.
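Here’s the geometric-series formula as a small Python sketch, reproducing the numerical example above:

```python
def sales_per_buyer(p, n):
    """Expected sales per buyer for n available books, if a buyer moves on to
    another book by the same author with probability p after each purchase."""
    return (1 - p ** n) / (1 - p)

print(round(sales_per_buyer(0.15, 3), 2))   # 1.17 sales per buyer for three books
print(round(1 / (1 - 0.15), 2))             # limiting value for many books: 1.18
```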

2) Let’s turn to q(n), the conversion probability. Why should this quantity depend on the number of books at all? Studies show that the probability of making a sale grows with the choice offered. That’s why ridiculously large malls work. When an author offers a large number of books, he or she is able to provide list impressions (featuring all of the author’s books) in addition to the common single impressions (featuring only one book). With more choice, the conversion probability on list impressions will be higher than that on single impressions. We define:

  • q_s = single impression conversion probability
  • p_s = percentage of impressions that are single impressions
  • q_l = list impression conversion probability
  • p_l = percentage of impressions that are list impressions

with p_s + p_l = 1. The overall conversion probability will be:

q(n) = q_s(n) * p_s(n) + q_l(n) * p_l(n)

With q_l(n) and p_l(n) obviously growing with the number of books and p_s(n) decreasing accordingly, we get an increase in the overall conversion probability.

3) Finally, let’s look at i(n), the impressions per day. Denoting by i_1, i_2, … the number of daily impressions from book number 1, book number 2, …, the average number of impressions per day and book is:

i_b = 1/n * ∑[k] i_k

with ∑[k] meaning the sum over all k. The overall impressions per day are:

i(n) = i_b(n) * n

Assuming all books generate the same number of daily impressions, this is linear growth. However, there might be an overproportional factor at work here. As an author keeps publishing, his or her experience in writing, editing and marketing will grow. Especially for initially inexperienced authors, the quality of the books and the marketing approach will improve with each book. Translated into numbers, this means that later books will generate more impressions per day:

i_(k+1) > i_k

which leads to an overproportional (instead of just linear) growth in overall impressions per day with the number of books. Note that more experience should also translate into a higher single impression conversion probability:

q_s(n+1) > q_s(n)

4) As a final treat, let’s look at how these effects impact the marketing R/C. The marketing R/C is the ratio of revenues that result from an ad divided by the costs of the ad:

R/C = Revenues / Costs

For an ad to be of worth to an author, this value should be greater than 1. Assume an ad generates i_ad single impressions in total. For one book we get the revenues:

R = i_ad * q_s(1)

If more than one book is available, this number changes to:

R = i_ad * q_s(n) * (1 – p^n) / (1 – p)

So if the R/C in the case of one book is (R/C)_1, the corrected R/C for a larger number of books is:

R/C = (R/C)_1 * q_s(n) / q_s(1) * (1 – p^n) / (1 – p)

In short: ads that aren’t profitable with a single book on the market can become profitable as the author offers more books.
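Here’s the corrected R/C as a short Python sketch. All input values are illustrative assumptions of mine, chosen only to show how an ad that merely breaks even with one book can become profitable with several:

```python
def corrected_rc(rc_one_book, qs_n, qs_1, p, n):
    """Marketing R/C for n books, scaled up from the single-book value
    according to the model in this post."""
    return rc_one_book * (qs_n / qs_1) * (1 - p ** n) / (1 - p)

# assumed: break-even ad (R/C = 1), slightly better conversion with more books,
# 15 % chance of a follow-up purchase, four books on the market
print(round(corrected_rc(1.0, qs_n=0.012, qs_1=0.010, p=0.15, n=4), 2))   # about 1.41
```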

For more mathematical modeling check out: Mathematics of Blog Traffic: Model and Tips for High Traffic.

Another Home Experiment – Wind Speed and Sound Level

Recently I told you about my home experiment regarding impact speed and sound level. I did another experiment with my sound level meter; this time I was interested in finding out how the sound level varies with the wind speed. So I took my anemometer (yep, that’s a thing) to measure the wind speed and at the same time noted the sound level. I collected some data points and plotted them. Here’s the result:

[Plot: sound level versus wind speed with fitted power law]

As you can see the fit is not that bad (the adjusted r-square is 0.91).

So the sound level grows with the wind velocity to the power of 0.22, meaning that if the wind speed increases by a factor of twenty-five, the sound level doubles. According to the empirical formula, the noise from the wind inside a category 1 and 2 hurricane is comparable to the sound level at a rock concert. This is of course assuming that the formula holds true past the 12 m/s range over which it was determined (which is not necessarily the case, but for now the best guess).

The Ebook Market in Numbers

Over the years the ebook market has grown from a relatively obscure niche to a thrilling billion-dollar mass market. The total ebook revenues went from 64 million $ in 2008 to about 3 billion $ in 2012. That’s an increase by a factor of close to 50 in just a few years.

[Chart: total ebook market revenues, 2008 to 2012]

The number of units sold also increased by the same factor (from 10 million units in 2008 to 457 million in 2012).

[Chart: ebook units sold, 2008 to 2012]

(Source)

However, many experts believe that the ebook market has reached a plateau and the numbers for the first half of 2013 seem to confirm that.

From the revenues and units sold we can also extract the development of the average price for sold ebooks. It strongly increased from 6.4 $ in 2008 to about 8 $ in 2009. After that, it quickly went back down to 7 $ in 2010 and 6.7 $ in 2012. So ebooks have gotten cheaper in the last few years, but are still more expensive than in 2008.

[Chart: average price of sold ebooks, 2008 to 2012]

As of 2012, ebooks make up 20 % of the general book market.

21 % of American adults have read an ebook / magazine / newspaper on an e-reader in 2012. This is up from 17 % in the previous year.

A survey, again from 2012, shows that most e-book consumers prefer Amazon’s Kindle Fire (17 %, up from no use the previous year), followed by Apple’s iPad (10 %, same as the previous year) and Barnes & Noble’s Nook (7 %, up from 2 %).

The Internet since 1998 in Numbers

Here’s how the number of websites developed since 1998:

[Chart: number of websites, 1998 to 2013]

In 1998 there were about 2.4 million websites. This grew to 17.1 million at the turn of the millennium. In 2007 the Internet cracked the 100 million mark and soon after, in 2009, the 200 million mark. 2012 saw a sudden jump to about 700 million websites. 2010 and 2013 were the only years in which the number of sites declined.

[Chart: number of internet users, 1998 to 2013]

The number of users has been steadily increasing at a rate of about 170 million per year. It went from 188 million (3 % of the world population) in 1998 to 2760 million (40 % of the world population) in 2013. A mathematical trend analysis shows that we can expect the 4000 million mark to be cracked in 2017 and the 5000 million mark in 2020.

Very interesting in terms of competition among websites is the ratio of users to websites:

[Chart: internet users per website, 1998 to 2013]

Before 2000 it was relatively easy to draw a large number of visitors to a website. But then the situation drastically changed. The number of users per website dropped from about 88 to 24 and kept on decreasing. Today there are only 4 internet users per website, a tough market for website owners.

Some more numbers: in 1998 Google handled about 10,000 search queries per day; by 2012 this had grown to 1,200,000,000,000 (or 1.2 trillion) queries per year. Since Google controls roughly 65 % of the search engine market, the total number of queries per year should be around 1.8 trillion.

All the data is taken from this neat website: Internet Live Stats.

For more Internet analysis check out my post Average Size of Web Pages plus Prediction.

How To Calculate the Elo-Rating (including Examples)

In sports, most notably in chess, baseball and basketball, the Elo-rating system is used to rank players. The rating is also helpful in deducing win probabilities (see my blog post Elo-Rating and Win Probability for more details on that). Suppose two players or teams with the current ratings r(1) and r(2) compete in a match. What will be their updated ratings r'(1) and r'(2) after said match? Let’s do this step by step, first in general terms and then in a numerical example.

The first step is to compute the transformed rating for each player or team:

R(1) = 10^(r(1)/400)

R(2) = 10^(r(2)/400)

This is just to simplify the further computations. In the second step we calculate the expected score for each player:

E(1) = R(1) / (R(1) + R(2))

E(2) = R(2) / (R(1) + R(2))

Now we wait for the match to finish and set the actual score in the third step:

S(1) = 1 if player 1 wins / 0.5 if draw / 0 if player 2 wins

S(2) = 0 if player 1 wins / 0.5 if draw / 1 if player 2 wins

Now we can put it all together and in a fourth step find out the updated Elo-rating for each player:

r'(1) = r(1) + K * (S(1) – E(1))

r'(2) = r(2) + K * (S(2) – E(2))

What about the K that suddenly popped up? This is called the K-factor and is basically a measure of how strongly a match impacts the players’ ratings. If you set K too low, the ratings will hardly be affected by the matches and will be very stable (too stable). On the other hand, if you set it too high, the ratings will fluctuate wildly with the current performance. Different organizations use different K-factors; there’s no universally accepted value. In chess the ICC uses a value of K = 32. Other approaches can be found here.
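Putting the four steps together, here’s a compact Python sketch of the update (my own wrapping of the formulas above, with K defaulting to 32). It reproduces the worked example below up to rounding:

```python
def update_elo(r1, r2, score1, k=32):
    """Return the updated ratings after one match.
    score1 is 1 if player 1 wins, 0.5 for a draw, 0 if player 2 wins."""
    R1, R2 = 10 ** (r1 / 400), 10 ** (r2 / 400)   # step 1: transformed ratings
    E1, E2 = R1 / (R1 + R2), R2 / (R1 + R2)       # step 2: expected scores
    score2 = 1 - score1                           # step 3: actual scores sum to 1
    return r1 + k * (score1 - E1), r2 + k * (score2 - E2)   # step 4: update

print(update_elo(2400, 2000, 1))   # favourite wins: about (2402.9, 1997.1)
print(update_elo(2400, 2000, 0))   # underdog wins:  about (2370.9, 2029.1)
```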

—————————————–

Now let’s do an example. We’ll adopt the value K = 32. Two chess players rated r(1) = 2400 and r(2) = 2000 (so player 2 is the underdog) compete in a single match. What will be the resulting rating if player 1 wins as expected? Let’s see. Here are the transformed ratings:

R(1) = 10^(2400/400) = 1,000,000

R(2) = 10^(2000/400) = 100,000

Onto the expected score for each player:

E(1) = 1,000,000 / (1,000,000 + 100,000) ≈ 0.91

E(2) = 100,000 / (1,000,000 + 100,000) ≈ 0.09

This is the actual score if player 1 wins:

S(1) = 1

S(2) = 0

Now we find out the updated Elo-rating:

r'(1) = 2400 + 32 * (1 – 0.91) = 2403

r'(2) = 2000 + 32 * (0 – 0.09) = 1997

Wow, that’s boring, the rating hardly changed. But this makes sense. By player 1 winning, both players performed according to their ratings. So no need for any significant changes.

—————————————–

What if player 2 won instead? Well, we don’t need to recalculate the transformed ratings and expected scores, these remain the same. However, this is now the actual score for the match:

S(1) = 0

S(2) = 1

Now onto the updated Elo-rating:

r'(1) = 2400 + 32 * (0 – 0.91) = 2371

r'(2) = 2000 + 32 * (1 – 0.09) = 2029

This time the rating changed much more strongly.

—————————————–

Mathematics of Blog Traffic: Model and Tips for High Traffic

Over the last few days I finally did what I had long planned and worked out a mathematical model for blog traffic. Here are the results. First we’ll take a look at the most general form and then use it to derive a practical, easily applicable formula.

We need some quantities as inputs. The time (in days), starting from the first blog entry, is denoted by t. We number the blog posts with the variable k. So k = 1 refers to the first post published, k = 2 to the second, etc … We’ll refer to the day on which entry k is published by t(k).

The initial number of views entry k draws from the feed is symbolized by i(k), and the average number of views per day entry k draws from search engines by s(k). Assuming that the number of feed views declines exponentially for each article with a factor b (my observations put this value at around 0.4 – 0.6), this is the number of views V the blog receives on day t:

V(t) = Σ[k] ( s(k) + i(k) · b^(t – t(k)) )

Σ[k] means that we sum over all k. This is the most general form. For it to be of any practical use, we need to make simplifying assumptions. We assume that the entries are published at a constant frequency f (entries per day) and that each article has the same popularity, that is:

i(k) = i = const.
s(k) = s = const.

After a long calculation you can arrive at this formula. It provides the expected number of daily views given that the above assumptions hold true and that the blog consists of n entries in total:

V = s · n + i / ( 1 – b^(1/f) )

Note that according to this formula, blog traffic increases linearly with the number of entries published. Let’s apply the formula. Assume we publish articles at a frequency f = 1 per day and they draw i = 5 views on the first day from the feed and s = 0.1 views per day from search engines. With b = 0.5, this leads to:

V = 0.1 · n + 10

So once we have gathered n = 20 entries with this setup, we can expect V = 12 views per day, at n = 40 entries this grows to V = 14 views per day, etc … The theoretical growth of this blog with the number of entries is shown below:

[Chart: expected daily views versus number of entries]

How does the frequency at which entries are being published affect the number of views? You can see this dependency in the graph below (I set n = 40):

[Chart: expected daily views versus publishing frequency, for n = 40]

The formula is very clear about what to do for higher traffic: get more attention in the feed (good titles, good tagging and a large number of followers all lead to high i and possibly reduced b), optimize the entries for search engines (high s), publish at high frequency (obviously high f) and do this for a long time (high n).

We’ll draw two more conclusions. As you can see, the formula neatly separates the search engine traffic (left term) from the feed traffic (right term). And while the feed traffic reaches a constant level after a while of constant publishing, the search engine traffic keeps on growing. At a critical number of entries N, the search engine traffic will overtake the feed traffic:

N = i / ( s · ( 1 – b^(1/f) ) )

In the above blog setup, this happens at N = 100 entries. At this point both the search engines as well as the feed will provide 10 views per day.

Here’s one more conclusion: the daily increase in the average number of views is just the product of the daily search engine views per entry s and the publishing frequency f:

ΔV / Δt = s · f

Thus, our example blog will experience an increase of 0.1 · 1 = 0.1 views per day, or 1 additional daily view every 10 days. If we published entries at twice the frequency, the blog would grow by 0.1 · 2 = 0.2 views per day, or 1 additional daily view every 5 days.
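To close, here’s the simplified formula as a small Python sketch with the example values used throughout this post (s = 0.1, i = 5, b = 0.5, f = 1):

```python
def expected_daily_views(n, s=0.1, i=5, b=0.5, f=1):
    """Simplified blog traffic model: V = s*n + i / (1 - b**(1/f))."""
    return s * n + i / (1 - b ** (1 / f))

for n in (20, 40, 100):
    print(n, expected_daily_views(n))   # 12.0, 14.0 and 20.0 views per day
```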