Exploratory Data Analysis: Estimates of Location

Estimates of location are summary statistics that provide information about the central tendency or typical value of a dataset. These statistics help you understand where the data tends to cluster or concentrate.

3 October 2023

statistics

by Muhammad Reyhan Arighy, Data Scientist


Introduction

Summarizing data and estimating central tendency is a fundamental step in data analysis, and while the mean (the sum of all values divided by the number of values; synonym: average) is a commonly used measure of central tendency, it may not always be the most appropriate choice.

Statisticians have developed alternative estimates to the mean to address situations where the mean might not provide a representative measure of central tendency.

Metrics and Estimates

  1. Estimate: In statistics, the term is commonly used to refer to a value calculated from a sample of data to infer a population parameter. This acknowledges the inherent uncertainty involved in making conclusions about the entire population based on a subset of data.
  2. Metric: In data science and related fields, the term is more frequently used to refer to a specific quantitative measure used to assess performance, track progress, or evaluate the success of a particular aspect of a system. Metrics are often employed in a business or organizational context to make data-driven decisions and monitor key performance indicators (KPIs). Data scientists aim to provide actionable insights and measurements that align with organizational goals and objectives.

The difference reflects the approach of statistics versus that of data science: accounting for uncertainty lies at the heart of the discipline of statistics, whereas concrete business or organizational objectives are the focus of data science. Hence, statisticians estimate, and data scientists measure.

Mean

The most basic estimate of location is the mean, or average value. Consider the following set of numbers: [3 5 1 2]. The mean is (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75. You will encounter the symbol $\bar{x}$ (pronounced "x-bar") being used to represent the mean of a sample from a population. The formula to compute the mean for a set of $n$ values $x_1, x_2, \ldots, x_n$ is:

$$\text{Mean} = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
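As a quick sanity check of the arithmetic above, here is a minimal sketch in Python, using pandas (which the worked example later in this article also relies on):

```python
# Verify the worked example: the mean of [3, 5, 1, 2] should be 2.75.
import pandas as pd

x = pd.Series([3, 5, 1, 2])
print(x.mean())  # (3 + 5 + 1 + 2) / 4 = 2.75
```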

The convention of capitalizing N when referring to a population and using lowercase n when referring to a sample is common practice in statistics. This distinction helps clarify whether you are discussing a parameter (population) or a statistic (sample) and is an important aspect of statistical notation.

In data science and related fields, the conventions around N and n may be less strict. While some data scientists may follow the traditional capitalization rules to maintain consistency, others may use them interchangeably because the distinction between population and sample may not always be as critical in data science as it is in formal statistical analysis.

A variation of the mean is a trimmed mean (the average of all values after dropping a fixed number of extreme values; synonym: truncated mean), which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. Representing the sorted values by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where $x_{(1)}$ is the smallest value and $x_{(n)}$ the largest, the formula to compute the trimmed mean with the $p$ smallest and largest values omitted is:

$$\text{Trimmed mean} = \bar{x} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}$$
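As a sketch of how this formula maps onto code: scipy.stats.trim_mean expects the proportion to cut from each end rather than the count $p$, so omitting $p$ values out of $n$ corresponds to proportiontocut = p / n. The sample data here is made up for illustration.

```python
# Trimmed mean with scipy: cut 10% (here, one value) from each end.
from scipy.stats import trim_mean

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]      # 100 is an obvious outlier
print(trim_mean(x, proportiontocut=0.1))  # drops 1 and 100; (2 + ... + 9) / 8 = 5.5
print(sum(x) / len(x))                    # ordinary mean = 14.5, pulled up by 100
```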

Trimmed means are particularly useful in situations where outliers or extreme values can distort the interpretation of the data. For example, in educational assessment, let's say there is a standardized test with a large number of questions, and each question is graded on a scale from 0 to 100. The goal is to calculate a student's overall score on the test. However, some students may provide extreme responses, such as leaving many questions unanswered or guessing randomly, leading to exceptionally low or high scores.

To mitigate the influence of such extreme responses and ensure a more reliable measure of a student's performance, a trimmed mean can be employed. Trimmed means are widely used, and in many cases are preferable to using the ordinary mean (see "Median and Robust Estimates").

Another type of mean is a weighted mean (the sum of all values times a weight divided by the sum of the weights; synonym: weighted average), which you calculate by multiplying each data value $x_i$ by a user-specified weight $w_i$ and dividing their sum by the sum of the weights. The formula for a weighted mean is:

$$\text{Weighted mean} = \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

There are two main motivations for using a weighted mean:

  1. Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor, as sketched after this list.
  2. The data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented.
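To make the first motivation concrete, here is a minimal sketch using NumPy's np.average; the sensor readings and weights are hypothetical:

```python
# Downweight a sensor known to be less accurate (motivation 1 above).
import numpy as np

readings = np.array([20.1, 19.8, 25.0])  # the third sensor is the noisy one
weights = np.array([1.0, 1.0, 0.25])     # so it gets a smaller weight

print(np.average(readings, weights=weights))  # weighted mean, ~20.5
print(readings.mean())                        # ordinary mean, ~21.6, pulled up by the noisy sensor
```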

Median and Robust Estimates

The median (the value such that one-half of the data lies above and one-half below; synonym: 50th percentile) is the middle value in a sorted list of data. If there is an even number of data values, it is the average of the two middle values. Compared to the mean, which uses all observations, the median depends only on the values in the center of the sorted data. While this might seem to be a disadvantage, since the mean is more sensitive to the data, there are many instances in which the median is a better metric for location, especially when dealing with datasets that contain extreme values or outliers.
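A quick illustration of the even-count rule, using the same toy numbers as the mean example:

```python
# The median of an even-length list averages the two middle values.
import pandas as pd

print(pd.Series([3, 5, 1, 2]).median())     # sorted: 1 2 3 5 -> (2 + 3) / 2 = 2.5
print(pd.Series([3, 5, 1, 2, 9]).median())  # sorted: 1 2 3 5 9 -> middle value 3.0
```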

The median is commonly used in various fields, including economics, income distribution analysis, housing prices, and healthcare, where extreme values or outliers are common and can distort the interpretation of data. Let's say we want to look at typical household incomes in neighborhoods around Lake Washington in Seattle. In comparing the Medina neighborhood to the Windermere neighborhood, using the mean would produce very different results because Bill Gates lives in Medina. If we use the median, it won't matter how rich Bill Gates is: the position of the middle observation will remain the same.

For some of the same reasons that one uses a weighted mean, it is also possible to compute a weighted median (the value such that one-half of the sum of the weights lies above and below the sorted data). As with the median, we first sort the data, although each data value has an associated weight. Instead of the middle number, the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list. Like the median, the weighted median is robust to outliers.
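A minimal sketch of a weighted median, assuming the wquantiles package (installable with pip install wquantiles); the values and weights are made up for illustration:

```python
# Weighted median: the value where the cumulative weight splits in half.
import numpy as np
import wquantiles

values = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 1, 1, 1, 10])  # the value 5 carries most of the total weight

print(wquantiles.median(values, weights=weights))  # pulled toward 5
```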

Outliers

The median is referred to as a robust (not sensitive to extreme values; synonym: resistant) estimate of location since it is not influenced by outliers (data values that are very different from most of the data; synonym: extreme values) that could skew the results. An outlier is any value that is very distant from the other values in a data set. The exact definition of an outlier is somewhat subjective.

Being an outlier in itself does not make a data value invalid or erroneous, as in the previous example with Bill Gates. Still, outliers can be caused by data errors, including recording errors, sensor malfunctions, unit mismatches, or data entry mistakes. In such cases, the presence of outliers may indicate data quality issues that need to be addressed. It's important to identify and investigate outliers in your dataset. Outliers may carry valuable information or insights, especially in scientific research or anomaly detection. They may indicate unusual but real events or phenomena that warrant further examination.

The median is not the only robust estimate of location; the trimmed mean is another widely used and effective method to mitigate the influence of outliers while providing a compromise between the median and the mean. For example, trimming the bottom and top 10% of the data (a common choice) will provide protection against outliers in all but the smallest data sets. The trimmed mean is robust to extreme values in the data, but uses more data to calculate the estimate of location.

Anomaly Detection

In contrast to typical data analysis, where outliers are sometimes informative and sometimes a nuisance, in anomaly detection the points of interest are the outliers, and the greater mass of data serves primarily to define the "normal" against which anomalies are measured.

Example: Location Estimates of YouTube Views and Engagement Rates

The table below shows a data set containing the number of views and the associated engagement rate (the ratio of total likes and comments to total views) for each category of US YouTube trending videos.

[Table: number of views and engagement rate by category for US YouTube trending videos]

Compute the mean, trimmed mean, and median for the views using Python. To compute the mean and median, we can use the Pandas methods of the DataFrame; the trimmed mean requires the trim_mean function from scipy.stats.
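A sketch of that computation, assuming the table above has been loaded into a DataFrame df with views and engagement_rate columns (the column and file names here are hypothetical):

```python
import pandas as pd
from scipy.stats import trim_mean

# df = pd.read_csv('us_youtube_trending_categories.csv')  # hypothetical file

print(df['views'].mean())           # mean
print(trim_mean(df['views'], 0.1))  # trimmed mean, dropping 10% from each end
print(df['views'].median())         # median
```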

The mean is bigger than the trimmed mean, which is bigger than the median. This is because the trimmed mean excludes the top and bottom 10% of categories at each end. If we want to compute the average engagement rate across categories, we need to use a weighted mean or weighted median to account for the different view counts of the categories. The weighted mean is available with NumPy; for the weighted median, we can use the specialized package wquantiles.
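Continuing the sketch above, with each category's engagement rate weighted by its views (same assumed column names):

```python
import numpy as np
import wquantiles

# Weighted mean and weighted median of engagement rate, weighted by views.
print(np.average(df['engagement_rate'], weights=df['views']))
print(wquantiles.median(df['engagement_rate'], weights=df['views']))
```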

In this case, the weighted mean and the weighted median are about the same.

Key Ideas

  1. The basic metric for location is the mean, but it can be sensitive to extreme values (outliers).
  2. Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.
