Statistics 101

Photo by Brent De Ranter on Unsplash

Intro

Now that we have the data at hand, the next question is: what’s inside it? In this post, I will explain some basic terms that help us get a basic understanding of numerical data. Real-life examples are exam results and anything with a price tag on it, like a house, a car, or a stock. Let’s dig deeper.

Average or Mean

This is the most common metric we hear about. The formula is actually very simple, and I’m sure all of us first learned it back in junior high school. Imagine we have 10 results (data points) from a math exam. We get the average, or mean, by summing up the scores and then dividing by the number of records (in this case, 10).
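Here is a minimal sketch of that calculation in Python. The ten scores are made up purely for illustration; they are not real exam data.

```python
# Hypothetical math exam scores for 10 students (invented for this example).
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Mean = sum of all values divided by the number of values.
average = sum(scores) / len(scores)

print(average)  # 77.5
```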

The advantage of this metric is that we can see at a glance where the center of our data is. This is very useful when we are familiar with the data and know its range. For example, school exam scores usually range from 0 to 100. A range of 0 to 114, for example, would be unusual.

On the other hand, if we don’t know the range of our data, the average often can’t tell us much about the central point. Take car prices as an example. A car may cost anywhere from USD 10,000 to USD 2,000,000. With such a wide range, a few very expensive cars pull the average up, because every value counts equally in the sum. Imagine we have 9 cars priced at USD 10,000 and 1 car priced at USD 2,000,000. The average price will be USD 209,000. We know that doesn’t represent our cars at all. To overcome this issue, we use the median.
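The car example can be checked with a few lines of Python. This is just a sketch using the standard library's statistics module; the variable names are my own.

```python
from statistics import mean

# 9 cars at USD 10,000 and 1 car at USD 2,000,000, as in the example above.
car_prices = [10_000] * 9 + [2_000_000]

average_price = mean(car_prices)

print(average_price)  # 209000
```

One expensive car is enough to move the average far away from what 9 out of 10 cars actually cost.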

Median

This is another way to see where the center of our data is. Unlike the average, a single big number won’t affect the median much, so we can say this metric is more robust when the data has a wide or unknown range. The procedure is a bit more involved, but nowadays we only need to call a function and the computer will do it for us very quickly.

To get the median, we first need to sort our data from smallest to biggest. If the number of data points is even (e.g. 10), the median is the average of the two middle values, data number 5 and data number 6. If it is odd (e.g. 11), the median is the value at position (number of data points + 1) divided by 2, which is data number 6. Using our car data, the median price is USD 10,000, which represents our car prices much better.
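To make the steps concrete, here is a small sketch in Python. The helper median_by_hand is a hypothetical name I made up to mirror the steps above; in practice the built-in statistics.median does the same job.

```python
from statistics import median


def median_by_hand(values):
    """Follow the steps from the text: sort, then pick the middle."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2, which is index n // 2 (0-based).
        return ordered[middle]
    # Even count: average of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Same car data as before: 9 cars at 10,000 and 1 car at 2,000,000.
car_prices = [10_000] * 9 + [2_000_000]

print(median_by_hand(car_prices))  # 10000.0
print(median(car_prices))          # 10000.0
```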

Mode

This metric shows the value that appears most frequently in our data. It’s pretty useful if we have data in whole numbers, like the number of children or cars in a household. Using the car price example, the mode is USD 10,000 since we have 9 of them.
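As a quick sketch, the mode of the car prices can be found with Python's standard library. Counter is used here only to show how often that value appears.

```python
from collections import Counter
from statistics import mode

# Same car data as before: 9 cars at 10,000 and 1 car at 2,000,000.
car_prices = [10_000] * 9 + [2_000_000]

most_common_price = mode(car_prices)
times_seen = Counter(car_prices)[most_common_price]

print(most_common_price)  # 10000
print(times_seen)         # 9
```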

Quantile or Percentile

This is another metric to use if we want to know how our data is distributed. A percentile is usually expressed as a value between 0% and 100%. The median is actually the 50th percentile, since it splits the data exactly in two. If we know the median price is USD 10,000, it means half of our cars have a price equal to or less than USD 10,000. Usually we look at the 25th, 50th, and 75th percentiles to understand our data better. It also helps to visualize this using a histogram (distribution plot) or a box plot.
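Below is a sketch of how the 25th, 50th, and 75th percentiles could be computed for the car data, assuming Python 3.8 or newer for statistics.quantiles. There are several ways to interpolate percentiles; I picked method="inclusive" here, which treats our 10 cars as the whole population, so other tools may give slightly different numbers on other data.

```python
from statistics import quantiles

# Same car data as before: 9 cars at 10,000 and 1 car at 2,000,000.
car_prices = [10_000] * 9 + [2_000_000]

# n=4 splits the data into 4 equal parts and returns the 3 cut points:
# the 25th, 50th (median), and 75th percentiles.
p25, p50, p75 = quantiles(car_prices, n=4, method="inclusive")

print(p25, p50, p75)  # 10000.0 10000.0 10000.0
```

All three values are USD 10,000, which tells us at least 75% of the cars cost USD 10,000 or less. The average of USD 209,000 hides that completely.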

Closing

These are 4 common metrics that we may need to know when our data is numeric. In real-life data, where the range is mostly unknown, it’s better to look at percentiles to get the full picture. Hope you find this post useful and see you in the next post.
