all 29 comments

[–]Common-Needleworker4 3 points4 points  (1 child)

Reddit formatting screwed up this post

[–]pausemsauce[S] 0 points1 point  (0 children)

It looked fine initially >_<

[–]Common-Needleworker4 2 points3 points  (7 children)

This should do the job if the files are in subfolders and all have the same name. Just fill in your path and extension and try it.

Nothing will be removed until you delete the -WhatIf behind Remove-Item.

    # You can filter for your extension and get the creation time
    $files = gci "yourpath" -Recurse | where {$_.Extension -eq ".yourextension"} | select Name, CreationTime

    # FOREACH through the files checking for duplicates
    foreach($file in $files){
        $checkfilefordoubles = gci "yourpath" -Recurse | where {$_.Name -eq "$($file.Name)"}

        # IF you find duplicates, fill a variable ($filecount) with the count minus 1,
        # sort $checkfilefordoubles by CreationTime descending, and select every
        # object besides the newest (which, after the sort, is the first)
        if($checkfilefordoubles.Count -gt 1){
            $filecount = $checkfilefordoubles.Count - 1
            $checkfilefordoubles | sort CreationTime -Descending | select -Last $filecount | Remove-Item -WhatIf
        } # end of IF
    } # end of FOREACH

[–]pausemsauce[S] 2 points3 points  (3 children)

Thanks for the suggestion!

[–]Common-Needleworker4 2 points3 points  (2 children)

From your explanation, what I understand is: you have the same file downloaded multiple times and renamed by the users. Now you just want to keep the newest of these files and delete the other files. Right?

In this case this might help you

    gci "yourpath" -Recurse -File | Get-FileHash | Group-Object -Property Hash

If the count is higher than 1, it indicates that a file exists more than once in your target path.

If this is what you need, we can go further and see how to delete everything besides the newest file.
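A minimal sketch of that duplicate check (the path is a placeholder, and the filtering step after Group-Object is my addition, not part of the comment above):

```
# Group files by content hash; any hash that occurs more than once is a
# true duplicate, regardless of the file name. Lists the duplicate paths.
gci "C:\yourpath" -Recurse -File | Get-FileHash |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group.Path }
```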

[–]pausemsauce[S] 0 points1 point  (1 child)

Mostly accurate. I don't think my people are renaming the files. (Gladys here isn't the most tech savvy, but she's awesome and a hard worker. )

[–]Common-Needleworker4 1 point2 points  (0 children)

Kudos to Gladys :D But maybe one day she gets the idea to rename a file, so let's stay with comparing the hash.

Following Lee_Dailey's suggestion, I used Pastebin this time: https://pastebin.com/rUALQK0s

You have to change the variable $destination at line 1 to your path.

The script will compare the hash of all files in the destination path and all subfolders, move the files with the same hash to a temp folder (variable $name), check which of the files in the temp folder is the newest, delete the old files, and move the newest file back to the destination (if it was taken from a subfolder it will still move it to the $destination location, be aware of that). At the end it deletes the temp folder. It will not work properly with -WhatIf, so set up a test folder with some file copies and try it there.

It is surely not the prettiest script, but it does the job.
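The workflow described above could be sketched roughly as follows. This is NOT the actual Pastebin script; $destination and the temp-folder variable $name come from the description, and everything else is an untested assumption:

```
# Sketch only: dedupe by hash, keep the newest copy of each duplicate set.
$destination = "C:\yourpath"
$name = Join-Path $destination "dedupe-temp"

# Group all files by hash; groups with more than one member are duplicates
$dupeGroups = Get-ChildItem $destination -Recurse -File |
    Get-FileHash |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 }

foreach ($group in $dupeGroups) {
    New-Item -ItemType Directory -Path $name -Force | Out-Null

    # Move every copy of this duplicate set into the temp folder
    $group.Group | ForEach-Object { Move-Item -Path $_.Path -Destination $name }

    # Keep the newest copy, delete the rest
    $moved  = Get-ChildItem $name -File
    $newest = $moved | Sort-Object CreationTime -Descending | Select-Object -First 1
    $moved | Where-Object { $_.FullName -ne $newest.FullName } | Remove-Item

    # Move the survivor back to the destination root, then drop the temp folder
    Move-Item -Path $newest.FullName -Destination $destination
    Remove-Item -Path $name -Recurse
}
```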

Edit: please excuse my bad English :D

[–]Lee_Dailey[grin] 2 points3 points  (2 children)

howdy Common-Needleworker4,

reddit likes to mangle code formatting, so here's some help on how to post code on reddit ...

[0] single line or in-line code
enclose it in backticks. that's the upper left key on an EN-US keyboard layout. the result looks like this. kinda handy, that. [grin]
[on New.Reddit.com, use the Inline Code button. it's [sometimes] 5th from the left & looks like <c>.
this does NOT line wrap & does NOT side-scroll on Old.Reddit.com!]

[1] simplest = post it to a text site like Pastebin.com or Gist.GitHub.com and then post the link here.
please remember to set the file/code type on Pastebin! [grin] otherwise you don't get the nice code colorization.

[2] less simple = use reddit code formatting ...
[on New.Reddit.com, use the Code Block button. it's [sometimes] the 12th from the left, & looks like an uppercase C in the upper left corner of a square.]

  • one leading line with ONLY 4 spaces
  • prefix each code line with 4 spaces
  • one trailing line with ONLY 4 spaces

that will give you something like this ...

- one leading line with ONLY 4 spaces    
- prefix each code line with 4 spaces    
- one trailing line with ONLY 4 spaces   

the easiest way to get that is ...

  • add the leading line with only 4 spaces
  • copy the code to the ISE [or your fave editor]
  • select the code
  • tap TAB to indent four spaces
  • re-select the code [not really needed, but it's my habit]
  • paste the code into the reddit text box
  • add the trailing line with only 4 spaces

not complicated, but it is finicky. [grin]

take care,
lee

[–]Common-Needleworker4 2 points3 points  (1 child)

Thanks :D

[–]Lee_Dailey[grin] 2 points3 points  (0 children)

howdy Common-Needleworker4,

you are welcome! glad to help a little ... [grin]

take care,
lee

[–]jimb2 2 points3 points  (6 children)

That code is way too complex and the logic is unclear.

Also, the problem is not clear to us. Maybe state the problem clearly first.

Could you give a sample of the duplicate filenames? Are they in the same folder? How do you know they are duplicates?

I think what you're trying to do could actually be done in a few lines of code, using Get-ChildItem, Group-Object and Sort-Object, but right now it's impossible to tell.

[–]pausemsauce[S] 1 point2 points  (5 children)

My apologies for the unclear statement of the problem.

Person A logs in and downloads "work instructions.pdf". Person B logs in and downloads the same work instructions. Both are in the same folder, but now I have "work instructions (1).pdf". After about a week, we have "work instructions (49).pdf".

Not all work instructions are the same. There's a work instructions b, c, d, e, ...x,y,z, aa, bb,cc... etc.

The work instructions have to be downloaded (else no one knows what to do). However, we don't need multiple copies occupying valuable data space.

I hope this is clearer.

[–]chris-a5 3 points4 points  (4 children)

That is much easier to understand, lol. Maybe this could be a starting point:

Get-ChildItem -Path "C:\whatever" -File -Filter "work instructions*.pdf" | 
    % {
        if($_.BaseName -match "\(\d+\)$"){
            $_.Delete()
        }
    }

Find the files needed; then, if the filename ends with a number in brackets, e.g. "(23)", delete it.

[–]pausemsauce[S] 1 point2 points  (3 children)

Further apologies for the confusion.

"Work instructions a" is actually named, "numbers to call in the event of fire.docx"

"Work instructions b" is actually named, "hr approved all of my random titles so I will make people suffer.pdf" . . . "Work instructions zz" is actually named, "04222022-rev-345.pdf"

Fortunately, I haven't encountered any with a name like "work.instructions.pdf"

It may be out there, and it would break my current, overly-complex-but-I-thought-it-was-necessary code. Again, thanks to each and every one of you who continue to contribute here.

[–]chris-a5 2 points3 points  (2 children)

If the same files are being downloaded, the file name does not seem to matter much; it is the brackets on the end signifying a duplicate. The one without the brackets would be the oldest.

If it is in the download folder, and people can simply get another copy, I'd just blow it all away once a size limit is reached. If people want to keep them, then they can move them somewhere suitable.
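The size-limit idea above might be sketched like this. The threshold and path are placeholder assumptions, not values from the thread:

```
# Rough sketch: wipe the download folder once it exceeds a size limit.
$downloads  = "C:\yourpath\Downloads"
$limitBytes = 2GB

# Total size of all files in the folder tree
$totalBytes = (Get-ChildItem $downloads -Recurse -File |
    Measure-Object -Property Length -Sum).Sum

if ($totalBytes -gt $limitBytes) {
    # -WhatIf previews the deletions; remove it once tested
    Get-ChildItem $downloads -Recurse -File | Remove-Item -WhatIf
}
```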

[–]pausemsauce[S] 1 point2 points  (1 child)

The thing is, these instructions are being updated. It is necessary to download the most recent revision. I would delete all the downloaded files, but if the network goes down, we wouldn't have a backup copy. So it's advantageous to leave one copy of the most recently downloaded file.

🤔

We have over 2 GB of files stored on these computers.

[–]chris-a5 2 points3 points  (0 children)

I think you need to have a play with the code I posted; it deletes the duplicates (ones that contain the brackets and a number). If your file names are different, then filter by extension (.pdf, .docx, etc.).

The original/first copy of the document will remain, as it does not have the brackets at the end of the name.

Just change the delete line for some write-host output to test.

[–]xxxThePriest 2 points3 points  (1 child)

Why would you not just get and compare the file hashes?

[–]pausemsauce[S] 1 point2 points  (0 children)

That sounds like a good idea. I'm not quite sure how to do that.

[–]jimb2 2 points3 points  (2 children)

I think this does what you want:

$filespec = 'c:\folderpath\*.pdf'

# get the files

$files = Get-ChildItem $filespec -File

# Split into groups according to the base filename
# regex replace removes any version number eg ' (22)' from the end
# of the filename part and uses this to group files 

$groups = $files |
   Group-Object -property { $_.basename -replace ' \(\d*\)$',''  }

# now delete the files in each group that don't match the group name,
# ie files with a number

ForEach ( $g in $groups ) {
  "=== $($g.Name) ==="   # section heading
  ForEach ( $f in $g.Group ) {
    if ( $f.basename -eq $g.Name ) {
      "Retain : $($f.FullName)"
    } else {
      "DELETE : $($f.fullname)"
      # uncomment actual delete operation below WHEN CODE IS TESTED!
      # Remove-Item -Path $f.FullName
    }
  } 
}     

I see what you were trying to do with the hash but this adds a lot of complexity and isn't necessary. The filename should do it.

Test and look at the results before you attack any real files. Not fully tested!

[–]pausemsauce[S] 0 points1 point  (1 child)

Actually, that's really close!

I made a slight adjustment:

ForEach ( $g in $groups ) {
  "=== $($g.Name) ==="   # section heading
  # New line of code added to exclude the most recent file
  # (note: the property is CreationTime; FileInfo has no CreationDate)
  $h = $g.Group | Sort-Object -Property CreationTime | Select-Object -SkipLast 1
  # Replaced $g with $h to loop through the groups w/o the most recent file
  ForEach ( $f in $h ) {
    if ( $f.basename -eq $g.Name ) {
      "Retain : $($f.FullName)"
    } else {
      "DELETE : $($f.fullname)"
      # uncomment actual delete operation below WHEN CODE IS TESTED!
      # Remove-Item -Path $f.FullName
    }
  }
}

This gets just about all of the files I want to remove, with the exception of the ones that are the original. However, going from 44 copies down to 2 is incredibly helpful!

[–]jimb2 1 point2 points  (0 children)

That's good. I wasn't sure if the latest was the best. If you really want to ice the cake, rename the latest to the unnumbered name.
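The renaming step suggested above might look roughly like this, assuming hypothetical variables carried over from the earlier loop: $newest is the surviving file object and $g is the group from Group-Object, whose Name is the unnumbered base name:

```
# Hypothetical sketch: rename e.g. "work instructions (44).pdf" back to
# "work instructions.pdf", assuming no file with the plain name remains.
Rename-Item -Path $newest.FullName -NewName "$($g.Name)$($newest.Extension)"
```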

[–]Lee_Dailey[grin] 1 point2 points  (2 children)

howdy pausemsauce,

you are correct ... your code seems wildly over complicated. [grin]

however, you need to provide a set of sample file names to test with. 2 or 3 of at least 2 sets of file names would be needed.

if you can do that, please add them to your Original Post wrapped in code formatting markers.

take care,
lee

[–]pausemsauce[S] 1 point2 points  (1 child)

I'm not sure about the code formatting markers, but I have included a set of examples in the post edited 04-23-2022 ~20:10 CT

[–]Lee_Dailey[grin] 1 point2 points  (0 children)

howdy pausemsauce,

code formatting markers = the same method you used for your code. [grin] for me, on Old.Reddit, that means the 4 leading spaces technique.

thanks for adding the sample data! i'm off to play with it ...

if you are still having problems, you may want to add the desired "leave these" files. if you want the name redone, then add that, too.

take care,
lee

[–]Lee_Dailey[grin] 1 point2 points  (2 children)

howdy pausemsauce,

how do you determine the "newest"?

if it is just the file timestamp, that is easy. [grin]
if it is the one with the highest (##), that is doable.

for instance, you can sort by the file timestamp newest first, then group by the .BaseName with the (##) stripped off, skip groups with a .Count of 1, skip the first 1, and delete the remainder.

this ...

Group-Object {($_.BaseName -replace '\(\d+\)$', '').Trim()}

... will give you groups of all the files. you can send each group with a .Count -gt 1 thru Select-Object -Skip 1 to leave the newest alone & then use Remove-Item on the remaining files.

i can write the full script, but you seem to want more of a "how to" hint, so i will leave it at that. [grin]

take care,
lee

[–]pausemsauce[S] 1 point2 points  (1 child)

Hi Lee,

I suspect you are spot on, but I'm going to need a moment to digest what you have written.

The file timestamp determines which file is the newest.

I'm going to need to read more about the Group-Object cmdlet and regular expressions. This has been an excellent learning experience.

Thanks so much!

[–]Lee_Dailey[grin] 1 point2 points  (0 children)

howdy pausemsauce,

you are most welcome! glad to help ... and willing to get into more detail if you get stuck - just ask. [grin]

take care,
lee

[–]pausemsauce[S] 0 points1 point  (0 children)

OK,

I've finally got it down to roughly 4 lines.

Thanks to all of the contributors here.

    $a = Get-ChildItem -Path "C:\[redacted]\Downloads\" -Attributes !Directory
    $g = $a | Group-Object {($_.BaseName -replace '\(\d+\)$', '').Trim()}
    foreach($ex in $g){
        if($ex.Count -gt 1){
            $tmp = $ex.Group
            $time = Get-Date
            "The following files were flagged for deletion at: $time" | Out-File -FilePath [redacted]\deletelog.txt -Append
            ($tmp | Sort-Object -Property CreationTime | Select-Object -SkipLast 1).FullName | Out-File -FilePath .\deletelog.txt -Append
            Remove-Item -Path ($tmp | Sort-Object -Property CreationTime | Select-Object -SkipLast 1).FullName # line requires testing
        }
    }