In order to guarantee the accuracy, consistency, and dependability of the data used for analysis, reporting, or machine learning, data cleansing is an essential stage in the data preparation process. Inaccurate judgments, faulty models, and ineffective decision-making can result from low-quality data.

1. Removing Duplicates
Duplicates in datasets can skew results, inflate counts, or cause redundancy. SQL Server provides ways to remove duplicates using DISTINCT or GROUP BY.
Example. Identify and remove duplicate rows in the Person.EmailAddress table.

If you want to remove duplicates permanently, you can use CTE (Common Table Expressions) with ROW_NUMBER().
USE Hostforlife;
GO

SELECT DISTINCT
    EmailAddressID,
    EmailAddress
FROM
    Person.EmailAddress;

WITH CTE AS (
    SELECT
        EmailAddressID,
        EmailAddress,
        ROW_NUMBER() OVER (PARTITION BY EmailAddress ORDER BY EmailAddressID) AS RowNum
    FROM
        Person.EmailAddress
)
DELETE FROM CTE
WHERE RowNum > 1;

2. Handling Missing Values
Missing values can impact analysis and decision-making. Use SQL functions like IS NULL, COALESCE, or CASE to identify and handle missing data. Example. Replace missing PhoneNumber values in the Person.Use the personPhone table with a default value or remove rows with missing values.
USE 
Hostforlife;
GO

SELECT
    BusinessEntityID,
    PhoneNumber,
    COALESCE(PhoneNumber, 'Unknown') AS CleanedPhoneNumber
FROM
    Person.PersonPhone;

DELETE FROM
    Person.PersonPhone
WHERE
    PhoneNumber IS NULL;

3. Correcting Data Inconsistencies
Inconsistencies like variations in case or formatting can lead to errors in joins or grouping. SQL Server string functions like UPPER, LOWER, or REPLACE can help standardize data.

Example. Standardize FirstName values in the Person.Person table to uppercase. Replace incorrect substrings in email addresses.
USE 
Hostforlife;
GO

UPDATE Person.Person
SET FirstName = UPPER(FirstName);

UPDATE Person.EmailAddress
SET EmailAddress = REPLACE(EmailAddress, '@
Hostforlife.eu', '@Hostforlife.eu
');

4. Standardizing Data Formats
Standardized data formats ensure consistency and compatibility across systems. Functions like CAST and CONVERT are often used for this purpose. Example. Convert ModifiedDate in the Sales.SalesOrderHeader table to a specific format. Here, the date format 101 converts the date to MM/DD/YYYY.

USE
Hostforlife
GO

SELECT SalesOrderID,
       CONVERT(VARCHAR(10), ModifiedDate, 101) AS FormattedDate
FROM Sales.SalesOrderHeader;

5. Removing Outliers

Outliers can distort statistical analyses and trends. Use statistical functions and filtering to identify and exclude them.

Example. Remove outliers based on TotalDue in the Sales.SalesOrderHeader table.
USE
Hostforlife
GO

SELECT *
FROM Sales.SalesOrderHeader
WHERE TotalDue BETWEEN
      (SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY TotalDue) OVER ()) AND
      (SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY TotalDue) OVER ());

6. Validating Data
Data validation ensures data integrity by applying constraints or rules. SQL Server constraints like NOT NULL, UNIQUE, and CHECK are essential for enforcing data quality.
Example. Enforce data integrity when creating a new table.
USE 
Hostforlife;
GO

CREATE TABLE Sales.Promotions (
    PromotionID INT PRIMARY KEY,
    PromotionName NVARCHAR(100) NOT NULL,
    DiscountPercentage DECIMAL(5, 2) CHECK (DiscountPercentage BETWEEN 0 AND 100),
    StartDate DATE NOT NULL,
    EndDate DATE NOT NULL,
    CONSTRAINT CK_Promotions_EndDate CHECK (EndDate > StartDate)
);

Alternatively, validate existing data using conditional queries.
SELECT *
FROM Sales.Promotions
WHERE DiscountPercentage < 0
   OR DiscountPercentage > 100;


Conclusion
Data cleansing is an ongoing process and a crucial component of the data lifecycle. By removing duplicates, handling missing values, correcting inconsistencies, standardizing formats, removing outliers, and validating data, you can significantly improve the quality of your data. These techniques, demonstrated using the AdventureWorks2022 database, can be applied to real-world datasets to ensure accurate and actionable insights. By incorporating these practices into your data workflows, you can ensure that your analysis, reports, and machine learning models are built on a solid foundation of clean data.

HostForLIFEASP.NET SQL Server 2022 Hosting