How do I remove duplicate rows?

yasumiya · July 24, 2016, 8:51pm

I’m working on a data that has many duplicate rows. How can I remove those duplicate rows?

Hideaki_Hayashi · July 24, 2016, 10:27pm

Hi,
One way to remove duplicate rows would be “distinct” function.
distinct(COLUMN_NAME) gives unique set of the column value, removing the duplicate.
But by default, distinct function also drops other columns than the one you specified, but I would guess you just want to remove duplicate rows and keep columns.
In that case, distinct(COLUMN_NAME, .keep_all=TRUE) would keep the column values too. (The column values kept will be the the ones from the row that appears the first in the original data frame.)
Here, I created an example with “iris” data frame. (The rows with same “Species” there are not really duplicate rows, but I think this shows the idea.)
https://exploratory.io/data/hideaki/de097202a844

yasumiya · July 24, 2016, 11:23pm

Thank you for the help. That worked great!

Rob_Sherlock · September 11, 2023, 9:34pm

Thanks, Hideaki, that is very helpful! But what about if the rows aren’t exact duplicates? I have merged data from two dataframes and have duplicated some of the data. I want to get rid of rows with duplicate timecode, even though in some cases other data in the rows are not exact duplicates. I tried grouping by timecode then running the ‘distinct’ command but that did not work. Ideas?

Hide_Kojima1 · September 21, 2023, 2:18pm

Hi Rob,

You can try the “Keep Only Unique Rows (Remove Duplicated Rows)”

Click the “Select Columns” radio button and select the “timecode” column.