How to delete duplicate rows from a table using the ROW_NUMBER function in SQL?
ROW_NUMBER()
is one of the SQL’s most useful functions which can be leveraged for many advanced scenarios and use cases. Some of those examples include finding a precise subset of rows, creating pagination in applications, finding top-N and bottom-N rows, ranking items, and many other types of sorting related queries.
One of the examples where I find this function very useful is identifying duplicate rows, which is a critical stage of data cleansing. During this process, we want to detect duplicate data which may be irrelevant to our data set, becuase they can cause invalid and messy reports and lead us to wrong calculations or bussines decisions. By using the ROW_NUMBER()
function, duplicates can be easily identifed and removed from the database.
Let me show you how that can be done with the ROW_NUMBER()
function.
We will start with a simple table ( called USER_INFO) that has 5 different columns which contain user info like name and address. The data in the table is not “clean” because there are some duplicate rows containing the same info in each column.
In this example, the 2nd and 3rd row are the same as well as the 5th and 6th row, and our goal is to identify and remove them from the table by using the ROW_NUMBER()
function. Our SQL statement will look like this:
The PARTITION_BY
clause divides the result set into partitions (a group of rows). In the query above we have created one group for every column in the table (name, addr_1, addr_2, addr_3, addr_4). If PARTITION_BY
is not specified, the ROW_NUMBER
function will treat all rows from the query result as one group.
The ORDER_BY
clause is mandatory because the ROW_NUMBER()
is order sensitive, and it will define the logical order within each partition. The ORDER BY
keyword sorts the records in ascending order by default, but we can also sort the records in descending order, by using theDESC
keyword in the end [ORDER_BY userid DESC
]
ROW_NUMBER()
is a window function that returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition. It will assign the value 1 for the first row and increase the number for the next one. The function will assign a row number for each row within the group as shown below:
ROW_NUMBER is a temporary value calculated when the query is run. For easier data manipulation we can add a new column ( let’s call it DUPLICATE) into our table and insert results from the ROW_NUMBER query.
Standard UPDATE doesn’t work for ROW_NUMBER or any other window function. Window functions can only appear in the SELECT or ORDER BY clauses, so the DUPLICATE column cannot be updated and populated with ROW_NUMBER values by using the “simple” UPDATE function in the following way:
Instead, we can use the DUPLICATE column which can be updated as in the example row_number_3.sql below, or by using CTE, which you can see in the example row_number_4.sql.
Once the table/column is updated, we will see the results in the DUPLICATE column. At this point, we can easily identify all duplicate rows because they will have indicator >1, so they can be removed from the table by using a delete statement combined with a where clause: DELETE FROM USER_INFO WHERE DUPLICATE>1
Examples above are using a column ‘DUPLICATE’ where we are assigning a sequential integer number to each row in the result set of a query which allows us to easier manipulate data, but using an extra column ( in our case column ‘DUPLICATE’) is optional and it is used here to show us what is actually happening with the ROW_NUMBER function, how the resultset looks like and to show us a wide range of possibilities to handle typical data challenges. ROW_NUMBER can be also used without adding any additional columns to our query, and can be used in Select, Delete or any other statement as shown in the example below:
Thank you for reading.
References: