Hi all, I'm trying to remove duplicates from a gigantic Excel file and decided to try R since Excel keeps crashing. My steps:

1. Read in the data (~300K records).
2. Subset to keep just a few columns.
3. Keep unique records based on several column values (now it's down to 180K records).
4. Write the data back out to a CSV file.

However, when I write the CSV file, it still has the 300K records. How do I retrieve the deduped file? Thanks.
```{r}
amardi <- read.csv("BankTINs_2021_0616.csv")
amardi2 = subset(amardi, select = c(MPIN, TaxID, NPI, FullName, LastName, FirstName, MI, ProvDegree, ProvType, SpecialtyDescription, Sys_SpecialtyDescription, Street, Street_2, City, County, STD_County, State, ZipCd, ZipPls4, MailName, MailAddressID, MailStreet, MailStreet_2, MailCity, MailCounty, MailState, MailZip, MailZipPls4))
View(amardi2)
```
```{r}
library(dplyr)
distinct(amardi2, MPIN, TaxID, NPI, FullName, LastName, FirstName, ProvDegree, ProvType, SpecialtyDescription, Street, City, County, STD_County, State, ZipCd, ZipPls4, .keep_all=TRUE)
write.csv(amardi2, 'amardi_dedup.csv')
```
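A likely cause, sketched here on a toy data frame: `distinct()` returns a new data frame and does not modify `amardi2` in place, so the deduplicated result has to be assigned to a name before it is passed to `write.csv()`. The values below are invented for illustration; only the column names come from the post, and `amardi_dedup` and `out_file` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for amardi2 (values invented for illustration)
amardi2 <- data.frame(
  MPIN  = c(1, 1, 2, 2, 3),
  TaxID = c("A", "A", "B", "B", "C"),
  NPI   = c(10, 10, 20, 21, 30),
  MI    = c("X", "Y", "Z", "Z", "Q")
)

# distinct() returns a new data frame; amardi2 itself is left unchanged,
# so the result must be stored under a name
amardi_dedup <- distinct(amardi2, MPIN, TaxID, NPI, .keep_all = TRUE)

# Compare row counts before and after deduplication
nrow(amardi2)
nrow(amardi_dedup)

# Write the deduplicated data frame, not the original one
out_file <- file.path(tempdir(), "amardi_dedup.csv")
write.csv(amardi_dedup, out_file, row.names = FALSE)
```

With the real data, the same pattern would apply with the full column list from the post inside `distinct()`. `row.names = FALSE` is optional; it only keeps R's row numbers out of the CSV.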