What

When

Where

Who

Why

How

How many

**What is an Outlier?**

In statistics, an outlier is a data point that differs significantly from the other data points in a dataset. Outliers are typically unusually high or unusually low values, and they can have a significant impact on statistical analyses, especially those that rely on the mean or the variance.
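To make that impact concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration. It compares how far the mean and the median move when a single extreme value is added.

```python
from statistics import mean, median

# Hypothetical sample: five typical readings, then the same readings
# plus one extreme value.
typical = [10, 12, 11, 13, 12]
with_outlier = typical + [100]

# How far each summary statistic moves when the extreme value is added.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean:   {mean(typical):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(typical):.2f} -> {median(with_outlier):.2f}")
```

In this sample the mean more than doubles while the median does not move, which is why the median is often described as a robust measure.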

When

**When do Outliers occur?**

Outliers can occur in any dataset, regardless of the type of data or the method used to collect it. Outliers can arise due to various reasons, such as measurement errors, data entry errors, sampling biases, or genuine observations that are far outside the expected range.

Where

**Where do outliers occur?**

Outliers can occur in any type of data, whether it comes from natural or social phenomena or is generated artificially by simulations or experiments.

Who

**Who needs data cleaning?**

Data cleaning is crucial for anyone who collects and uses data, including individuals and organizations across industries. It ensures accuracy and reliability by identifying and correcting errors, inconsistencies, and incomplete data, improving the quality and usefulness of data for informed decision-making, insights, and impact measurement.

Why

**Why do Outliers occur?**

Outliers are data points that deviate significantly from the other values in a dataset. They can occur due to measurement errors caused by faulty equipment or human error, data entry errors like typos or misplaced decimal points, sampling biases from unrepresentative samples, or genuine observations that fall outside the expected range. It's crucial to understand the causes of outliers to analyse data accurately and make informed decisions.

How

**How can outliers be removed?**

There are several methods for removing outliers from a dataset. Common ones include the Z-score method, which removes data points that lie more than a chosen number of standard deviations from the mean (3 is a common threshold); the box plot method, which removes data points beyond the whiskers of the box plot (conventionally 1.5 times the interquartile range past the first and third quartiles); and density-based methods, which flag data points in low-density regions. Domain-specific methods may also be used, where expertise in a particular field is needed to identify and remove outliers.
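The Z-score and box plot methods can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a definitive implementation: the function names `zscore_filter` and `iqr_filter`, the default thresholds, and the sample data are all assumptions made for the example.

```python
from statistics import mean, stdev, quantiles

def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute Z-score is at most `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mu) / sigma <= threshold]

def iqr_filter(values, k=1.5):
    """Keep values inside the box plot whiskers: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical data: ten typical readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100]

# Threshold of 2 is an assumption for this small sample (see note below).
by_zscore = zscore_filter(data, threshold=2.0)
by_iqr = iqr_filter(data)

print(by_zscore)
print(by_iqr)
```

One design point worth noting: in small samples the outlier itself inflates the standard deviation, so its Z-score can never get very large. That is why the example passes a lower threshold than the customary 3. The box plot rule is less affected because quartiles are barely moved by a single extreme value.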

How many

**How many ways are there to handle outliers?**

Outliers in a dataset can be handled in several ways. Winsorizing replaces extreme values with the value at a specified percentile; trimming removes a fixed percentage of the highest and lowest data points; transformation applies a mathematical function, such as a logarithm, to reduce the influence of extreme values; removal deletes the outliers outright; robust methods, such as using the median instead of the mean, are less sensitive to outliers; and imputation replaces outlying values with estimated ones. However, exercise caution when removing or manipulating outliers, as they may contain valuable information.
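Three of these options (trimming, winsorizing, and imputation) can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `trim`, `winsorize`, and `impute_median` are hypothetical, `fraction` is assumed to be below 0.5, and the rule for flagging an outlier is passed in by the caller rather than prescribed.

```python
from statistics import median

def trim(values, fraction=0.1):
    """Drop the lowest and highest `fraction` of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points cut from each end
    return ordered[k:len(ordered) - k]

def winsorize(values, fraction=0.1):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    return [min(max(v, low), high) for v in values]  # original order preserved

def impute_median(values, is_outlier):
    """Replace flagged values with the median of the unflagged values."""
    kept = [v for v in values if not is_outlier(v)]
    fill = median(kept)
    return [fill if is_outlier(v) else v for v in values]

# Hypothetical data: nine typical readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]

trimmed = trim(data)
winsorized = winsorize(data)
# The cutoff of 50 is an arbitrary assumption for this example.
imputed = impute_median(data, lambda v: v > 50)

print(trimmed)
print(winsorized)
print(imputed)
```

Trimming shortens the dataset, while winsorizing and imputation keep its length, which matters when each value must stay aligned with other columns or timestamps.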