Data Analytics

Identify Outliers using Python

Pachipulusu S Mahesh

August 30

An outlier is a data value that lies at an abnormal distance from the other values in a random sample from a population. Outliers are unusually high or unusually low compared with the rest of the data, and they increase the variance of the data set.

EXAMPLE:-
Let us understand this with an example. The data values below are the ages of people living in Vijayanagar.
x = [25,27,32,12,35,68,14,72]
Looking at this data set, we can guess that 12, 14, 68 and 72 might be Outliers because they lie at the extreme ends. Let us check this with code in Python.
Mean and Standard Deviation are two metrics (ways of measuring) in statistics. Python's built-in statistics module provides functions to calculate both.

Next, we will define the Mean and Standard Deviation.

Mean:- It is simply the arithmetic average: the sum of all the data values divided by the total number of data values in the data set. The Mean is denoted by “μ” (Mu).
Eg:- x=[25,27,32,12,35,68,14,72]
Sum of Values = 25+27+32+12+35+68+14+72 = 285
Number of Values = 8
So, Mean = Sum of values / No. of values
Mean = 285 / 8 = 35.625
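The same arithmetic can be checked in plain Python before bringing in any module. This is only a small sketch; the variable names total and count are our own choice for illustration.

```python
# Ages from the example data set
x = [25, 27, 32, 12, 35, 68, 14, 72]

total = sum(x)        # sum of all the values
count = len(x)        # number of values
mean = total / count  # arithmetic mean

print(total, count, mean)  # prints: 285 8 35.625
```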

Standard Deviation:- It is a measure of the amount of variation or dispersion of a set of values.
Standard Deviation is denoted by “σ” (Sigma)
• If the Standard Deviation is low, the data values are close to the Mean
• If the Standard Deviation is high, the data values are spread out from the Mean
To find the Standard Deviation, we need to calculate the Mean and the Variance.
The Mean has already been calculated.
Mean = 35.625

Variance is the average of the squared differences from the Mean.
Differences from the Mean = [-10.625,-8.625,-3.625,-23.625,-0.625,32.375,-21.625,36.375]
Squared Differences (rounded to two decimals) = [112.89,74.39,13.14,558.14,0.39,1048.14,467.64,1323.14]

Average of Squared Differences = Sum of Squared Differences / Total No. of values
Variance = 3597.875 / 8
Variance = 449.73
Because we divide by the total number of values, this is the population Variance.

Standard Deviation is the square root of the Variance.
Standard Deviation (SD) = Sqrt(Variance) = Sqrt(449.73) = 21.21
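The hand calculation can be reproduced step by step in Python. The sketch below follows the same order (differences, squared differences, average, square root) and divides by the number of values, as the calculation above does. The variable names are our own.

```python
import math

x = [25, 27, 32, 12, 35, 68, 14, 72]

mean = sum(x) / len(x)                       # 35.625
differences = [value - mean for value in x]  # difference of each value from the mean
squared = [d ** 2 for d in differences]      # squared differences
variance = sum(squared) / len(x)             # average of the squared differences
sd = math.sqrt(variance)                     # standard deviation

print(sum(squared))                      # prints: 3597.875
print(round(variance, 2), round(sd, 2))  # prints: 449.73 21.21
```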

Now, we will calculate the Mean and Standard Deviation using the Python programming language.

  1. Using Statistics Module
    With the statistics module, we can calculate the Mean and Standard Deviation directly.
    First, import the module and declare the data
    import statistics # importing statistics library
    x = [25,27,32,12,35,68,14,72]
    Now, we will calculate the mean using the module
    mean = statistics.mean(x) # calculating mean
    print("The Mean of the given data set is %.2f" % mean)
    Output: The Mean of the given data set is 35.62

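Now, we will calculate the Standard Deviation using the module
    stdev = statistics.stdev(x) # calculating the sample standard deviation
    print("The Standard Deviation of the given data set is %.2f" % stdev)
    Output: The Standard Deviation of the given data set is 22.67

Note that statistics.stdev() divides by (n - 1) and returns the sample Standard Deviation, so its result is larger than a hand calculation that divides by n. The population Standard Deviation, which divides by n, is returned by statistics.pstdev(x) and is about 21.21 for this data.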
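To see where the two Standard Deviation values come from, the sketch below computes both by hand and compares them with the module's results. The names sum_sq, by_hand_population and by_hand_sample are our own and are used only for illustration.

```python
import math
import statistics

x = [25, 27, 32, 12, 35, 68, 14, 72]
n = len(x)
mean = statistics.mean(x)
sum_sq = sum((value - mean) ** 2 for value in x)  # sum of squared differences

by_hand_population = math.sqrt(sum_sq / n)    # divide by n
by_hand_sample = math.sqrt(sum_sq / (n - 1))  # divide by n - 1

print(round(by_hand_population, 2), round(statistics.pstdev(x), 2))  # prints: 21.21 21.21
print(round(by_hand_sample, 2), round(statistics.stdev(x), 2))       # prints: 22.67 22.67
```

For a small data set like this one the gap between the two is noticeable. It shrinks as the number of values grows.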

Now, we will write code in Python to find the Outliers in a given data set.
Here, we need to check whether each value in the data set lies above or below a certain threshold. This threshold is defined in terms of the Standard Deviation.

First, we will import the statistics module and calculate the Mean and Standard Deviation. Next, we will calculate the lower and upper limits and check whether each value in the data set lies within those limits.
Finally, we will return a list (output) that does not contain the Outliers.
These steps could be carried out one at a time by hand. But if the data values change, the whole process has to be repeated, and recalculating everything becomes tedious. Hence, we will define a function that performs the task whenever it is called.

Let’s define a function called no_outliers
    def no_outliers(x, threshold): # defining a function (no_outliers) with x (list) and threshold as arguments

Where
• def – keyword used for defining a function
• no_outliers – function name
• (x, threshold) – the names in parentheses are the arguments of the function
• x – list (data values)
• threshold – user-defined number of Standard Deviations that a value may lie away from the Mean

Next, we will calculate the Lower and Upper limits
    z1 = mu + threshold * sigma # [Upper Limit]
    z2 = mu - threshold * sigma # [Lower Limit]
Where z1 and z2 are two variables that store the limits, mu is the Mean and sigma is the Standard Deviation.
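As a quick check of these two formulas, the sketch below evaluates the limits for the example ages. It assumes a threshold of 0.9, the value passed to the function in the final call of this article, and uses statistics.stdev() for sigma. The names lower_limit and upper_limit are our own stand-ins for z2 and z1.

```python
import statistics

x = [25, 27, 32, 12, 35, 68, 14, 72]
threshold = 0.9  # assumed here; the same value is used in the final function call

mu = statistics.mean(x)      # mean
sigma = statistics.stdev(x)  # sample standard deviation

upper_limit = mu + threshold * sigma  # z1
lower_limit = mu - threshold * sigma  # z2

print(round(lower_limit, 2), round(upper_limit, 2))  # prints: 15.22 56.03
```

Any age below about 15.22 or above about 56.03 would therefore be treated as an Outlier at this threshold.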

Next, we will write the code for iteration.
The list might contain several values. To check every value, we construct a loop using a “for loop”.
    for items in x: # iterating over the values in the list (x), to check each value

Here the loop visits every item (data value) in the list x in turn, so that each value can be checked.

    if items >= z2 and items <= z1: # items in x should lie between the lower & upper limit
Here we check that each value in ‘x’ lies between the Lower and Upper limits. A value must satisfy both conditions, which is why the ‘and’ keyword joins them.

    no_outlier.append(items) # adding the value to the output list

Here no_outlier is an empty list created to store the output (the data without Outliers). If a value satisfies the above condition, it is added to this list.
    return no_outlier # returning the list of data with no outliers

The return keyword ends the function and passes the output, along with control, back to the code that called the function.
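The loop, the condition and the append step can also be written more compactly. The sketch below is an alternative idiom, not the approach used in this article: it uses a list comprehension with a chained comparison. The helper name within_limits and the round limits 15 and 56 are assumptions made purely for illustration.

```python
def within_limits(values, lower_limit, upper_limit):
    # keep only the values that lie between the two limits (inclusive)
    return [item for item in values if lower_limit <= item <= upper_limit]

x = [25, 27, 32, 12, 35, 68, 14, 72]
print(within_limits(x, 15, 56))  # prints: [25, 27, 32, 35]
```

The chained comparison lower_limit <= item <= upper_limit means the same as item >= lower_limit and item <= upper_limit.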

Now, we will put together the full code to find the Outliers in the given data set.

CODE FOR IDENTIFYING OUTLIERS

In[1]
    import statistics # importing statistics library

    x = [25,27,32,12,35,68,14,72] # declaring list

    def no_outliers(x, threshold): # x is a list, threshold is the number of standard deviations
        no_outlier = [] # empty list to store the final output, created fresh on every call
        mu = statistics.mean(x) # calculating mean
        sigma = statistics.stdev(x) # calculating (sample) standard deviation
        z1 = mu + threshold * sigma # upper limit
        z2 = mu - threshold * sigma # lower limit
        for items in x: # checking each value in the list (x)
            if items >= z2 and items <= z1: # value lies between the lower & upper limit
                no_outlier.append(items) # adding the value to the output
        return no_outlier # returning the list of data with no outliers

    no_outliers(x, 0.9)

Out[1]
    [25, 27, 32, 35]
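The function above returns the values that are not Outliers. If we want the Outliers themselves, the condition can simply be reversed. The sketch below shows one possible way to do this; the function name find_outliers is our own, and the threshold of 2 in the second call is only an illustrative value.

```python
import statistics

def find_outliers(x, threshold):
    """Return the values of x that lie outside mean +/- threshold * stdev."""
    mu = statistics.mean(x)      # mean
    sigma = statistics.stdev(x)  # sample standard deviation
    upper_limit = mu + threshold * sigma
    lower_limit = mu - threshold * sigma
    # a value is an outlier if it falls below the lower or above the upper limit
    return [item for item in x if item < lower_limit or item > upper_limit]

x = [25, 27, 32, 12, 35, 68, 14, 72]
print(find_outliers(x, 0.9))  # prints: [12, 68, 14, 72]
print(find_outliers(x, 2))    # prints: []
```

With a threshold of 0.9 the result matches our initial guess of 12, 14, 68 and 72. With a larger threshold such as 2 the limits become wide enough that no value is flagged, so the choice of threshold decides what counts as an Outlier.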

About the author

Mahesh is a passionate, ambitious and highly organized guy. He holds an Engineering degree in Telecommunications. He works for a Nationalized Bank and has vast experience and knowledge in the Banking industry.