01 Data, Statistics, and Statistical Questions

Statistical questions

In mathematics, most questions have definite answers. But in real life, even a simple question such as “How much time does it take you to go to school by bus?” may not have a definite answer. Sometimes it takes 7 minutes and sometimes it takes 9 minutes. The answers to such a question have variability, i.e., they vary from day to day.

If there were no variability and the bus ride took exactly 9 minutes every day, we would not need any further investigation of this topic. But because data varies, we need a separate branch of mathematics called statistics to answer questions about the data. We’ll learn more about this as we go further.

In statistics, we start with a question, since answering it is what produces ‘data’. But the question has to be phrased so that it gives us a desired answer, that is, data we can work with. Such questions are called statistical questions. Confusing? Don’t worry, we’ll go through examples and details to make clear what “desired answer” means.

Let’s start with two questions and decide which of them is ‘statistical’.

A teacher asks the same question to two different groups:

Asking every student in the school: What grade are you in?
Asking only sixth-grade students: What grade are you in?

What do you think are possible answers to the two questions?

In the first case, we could get different answers, ranging from 1 to 12. In the second case, we obviously only get 6 as the answer since we are only asking 6th graders.

Keep in mind who is being asked the question. That greatly affects the answers you get.

How do the answers to the two questions differ? In the second case, we get only one answer, with no change from student to student. In the first case, however, we get a wide range of answers, meaning there is variability in the answers. This type of question is a statistical question, and it is the kind we use. A question that always gets the same answer is not very helpful to us, since we cannot do anything more with that single answer.

So, a statistical question is one where you expect variability in the answers, as above. Answers to such questions give us the data we need so that we can investigate further.
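
The idea of variability can also be sketched in a few lines of Python. This is only an illustration: the function name `has_variability` and the sample answers are made up, and the check simply asks whether a list of answers contains more than one distinct value.

```python
def has_variability(answers):
    """Return True if the answers contain more than one distinct value."""
    return len(set(answers)) > 1


# Hypothetical answers to "What grade are you in?"
all_students = [1, 4, 6, 6, 9, 12]   # asked of students from every grade
sixth_graders = [6, 6, 6, 6]         # asked of sixth graders only

print(has_variability(all_students))   # True
print(has_variability(sixth_graders))  # False
```

The first list varies, so the question behind it behaves like a statistical question. The second list never changes, so there is nothing further to investigate.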

Here are some other examples of statistical questions:
How many hours do you sleep every day?
How many minutes do students in your class spend on homework?
What is the favorite food of your class?
In a presidential election, do potential voters support Joe Biden?
How do the annual salaries for men and women in similar occupations compare?

And here are some examples of questions that are not statistical:
Where in town does our math teacher live?
How many minutes of recess do sixth-grade students have each day?
How much water can a 1 L bottle hold at the most?

These questions are not statistical because their answers do not vary. The math teacher lives in one particular location, recess is the same length each day (let’s say 20 minutes), and a 1 L bottle always holds at most 1 L of water.

Variables - numerical and categorical

The data we collect from statistical questions consist of observations or measurements on a variable. As in algebra, a variable in statistics is something that can change: it is a characteristic that may differ from one individual to another or from one instance to another.

For example, in the statistical question “How many minutes do students in your class spend on homework?”, the number of minutes students spend on homework is called a variable because its measurement varies from one individual to another. In the case of the statistical question “How many hours do you sleep every day?”, the number of hours is a variable because its measurement will change from one day to another.

What if the question was “What is the favorite food of your class?”

The answers would be any type of food, like Pizza, Burger, etc.

Can you see a difference between the two different cases we just mentioned?

The answers to both “How many minutes do students in your class spend on homework?” and “How many hours do you sleep every day?” are numbers. The number of minutes spent on homework is a quantity such as 20 minutes, 60 minutes, and so on. Similarly, the number of hours you sleep is also a quantity. How do you know they are numerical quantities? Well, if you add any two of these quantities, you get a third quantity that still means something. If you sleep 8 hours today and 7 hours tomorrow, you sleep a total of 15 hours over the two days. So when our variable is a numerical quantity, adding its values makes sense (which means you can also apply other arithmetic operations to them). Such variables are called numerical variables.
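
As a small illustration in Python, here is the sleep example from above. The variable names are made up for this sketch. It shows that adding, and then averaging, the values of a numerical variable gives a meaningful result.

```python
# Hours of sleep on two days (the numbers used in the text above)
sleep_hours = [8, 7]

total = sum(sleep_hours)             # 8 + 7 = 15 hours over two days
average = total / len(sleep_hours)   # 15 / 2 = 7.5 hours per day

print(total)    # 15
print(average)  # 7.5
```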

However, the answer to the other question, “What is the favorite food of your class?”, is not numerical (it could be Pizza, Burger, Sandwich, or any other food). We cannot add Pizza and Burger to get a meaningful answer. Similarly, the answers to the question “In a presidential election, do potential voters support Joe Biden?” are Yes, No, or Maybe. We cannot add them either. Would Yes and No together mean Maybe? Probably not. Such variables are called categorical variables because, rather than quantities, their answers are specific categories such as ‘Pizza’, ‘Burger’, ‘Yes’, ‘No’, and so on.
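
Although categories cannot be added, they can be counted. The sketch below uses Python’s standard `collections.Counter` on a made-up list of survey answers. The list `favorite_foods` is hypothetical and only serves to illustrate the idea.

```python
from collections import Counter

# Hypothetical answers to "What is your favorite food?"
favorite_foods = ["Pizza", "Burger", "Pizza", "Sandwich", "Pizza", "Burger"]

# Count how many times each category appears
counts = Counter(favorite_foods)

print(counts["Pizza"])        # 3
print(counts["Burger"])       # 2
print(counts.most_common(1))  # [('Pizza', 3)]
```

Counting how often each category appears is the usual first step in summarizing a categorical variable.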

Sometimes categorical variables have a certain order we need to follow. Let’s look at the question: “How would you rate a movie on the scale ‘High’, ‘Medium’, ‘Poor’?” The answer is one of the three categories provided, and we know that one ranks higher than another (a “High” is better than a “Poor”). Such categorical variables are called ordinal variables, since they have a specific ‘order’. When categorical variables do not have any order, like the one with the favorite food, they are called nominal variables.
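
To illustrate what ‘order’ means for an ordinal variable, here is a minimal Python sketch. The list `RATING_ORDER` and the sample ratings are assumptions made for this example. Because the categories have a known order, we can sort the answers from worst to best, which would not be possible for favorite foods.

```python
# Assumed order of the rating categories, from worst to best
RATING_ORDER = ["Poor", "Medium", "High"]

# Hypothetical movie ratings collected from five people
ratings = ["High", "Poor", "Medium", "High", "Poor"]

# Sort each rating by its position in RATING_ORDER
sorted_ratings = sorted(ratings, key=RATING_ORDER.index)

print(sorted_ratings)  # ['Poor', 'Poor', 'Medium', 'High', 'High']
```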

Now that you have started to understand the difference between numerical and categorical variables, let’s look at a tricky question.

You are running a survey and you ask each person for their home zip code. You get answers like 6547, 2356, 9871, 8714, etc.

Is zip code a numerical or categorical variable?

[Hint: Check whether it makes sense to add two measurements.]