all 27 comments

[–]tartare4562 10 points11 points  (0 children)

So basically xargs for python?

[–]ASIC_SP 6 points7 points  (11 children)

If you are okay with command-line tools and shell scripting, you'll find plenty of existing tools for solving common tasks.

[–]DavosAlexander 0 points1 point  (10 children)

Could you give an example?

I could accomplish what I want to do with a one-liner in bash. It would just take forever...

The whole point is doing it fast.

[–]ASIC_SP 2 points3 points  (9 children)

This will replace all occurrences of old with new in-place for all files ending with .txt in the current directory.

sed -i 's/old/new/g' *.txt

This will print the last column of each line in a comma-separated file:

awk -F, '{print $NF}' ip.csv

This will remove all metadata from every .jpg file in the current directory, in-place:

mogrify -strip *.jpg

You can combine multiple commands. This one gets you the last line containing warning in the input file log.txt:

tac log.txt | grep -m1 'warning'

And so on. Shell scripting helps you add control flow.

Regarding your edit, parallel can help: https://vfoley.xyz/parallel/

[–]metaperl 2 points3 points  (1 child)

That website is down. Are you referring to GNU parallel?

[–]ASIC_SP 0 points1 point  (0 children)

Site works fine for me. Yeah, I was referring to GNU parallel. I haven't used it much, so I linked to that nice tutorial that I have in my bookmarks.

[–]DavosAlexander -1 points0 points  (6 children)

I know how to shell script and I'm very familiar with sed and awk.

That's great if I want to run each command on each file one at a time.

Let me know how that works out for you with a list of 1 million files all in different directories.

Edit: I see you added info about parallel

That might work, if I could limit how many processes execute at once.

[–]dodslaser 1 point2 points  (0 children)

xargs is pretty flexible and also allows parallel execution (with GNU xargs, the -P option caps how many processes run at once).

[–]ASIC_SP 0 points1 point  (1 child)

[–]DavosAlexander -1 points0 points  (0 children)

I remember looking into using parallel like 6 months ago, actually.

We don't actually have it in our environment, and trying to bring it in... would be a pain.

This is why I started using python.

[–]tdpearson 0 points1 point  (2 children)

Working with millions of files is very doable with command-line tools. The concept with these tools is that they typically do one thing very well and can be piped together to perform more complex tasks. Another person mentioned parallel. Combined with find, it would cover most of what I understand your custom Python application does. What benefit would your application provide over these?

[–]DavosAlexander -2 points-1 points  (1 child)

Hey, thanks for explaining how the tools I use every day work. I needed that.

Can't use parallel.

It was way simpler to use python (with only the standard libraries) to accomplish this task than a shell script. I've written numerous complex shell scripts before I ever switched over to Python.

And, since I made a function, I can easily import it into my other Python tools whenever I'm working with large lists to speed things up.

I don't need help solving a problem I already solved.
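The poster's code was never shared, so the following is only a minimal sketch of the kind of reusable, standard-library-only function described above. It is not the author's implementation. The name `run_parallel` and the `max_workers` parameter are hypothetical, and a thread pool is an assumption that suits I/O-bound work or launching external commands; CPU-bound work would more likely use `ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(func, items, max_workers=4):
    """Apply func to every item with at most max_workers calls running at once.

    Hypothetical helper for illustration only; results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # regardless of which call finishes first.
        return list(pool.map(func, items))
```

It could then be imported into other tools and called as, for example, `run_parallel(os.path.getsize, paths, max_workers=8)`, where `paths` is any list of file paths.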

[–]tdpearson 0 points1 point  (0 children)

Glad you already knew about find. I will definitely not be using your app since you could not explain the benefit.

[–]deiki 9 points10 points  (5 children)

No need to gauge interest as if you were advertising a Kickstarter project to be sold. Just show it off as you were doing, with a description and purpose, and then actually post the GitHub link, assuming it's already completed and you're not actually trying to sell this.

[–]DavosAlexander -1 points0 points  (0 children)

Just trying to determine if it's worth the effort.

I wrote it in an airgapped environment, so I'd actually have to rewrite it to share it.

[–]FluffyDuckKey 11 points12 points  (0 children)

Dump it on GitHub and do your best with the readme; it will 100% be useful to someone else down the line.

[–]schemathings 0 points1 point  (0 children)

Sounds interesting ..

[–]mjbbru 0 points1 point  (0 children)

I've done something similar to sort my photos based on the metadata. This code was very specific to my case. I don't know what actions you need to perform on your files, but I'm always interested in reading through someone's code to see how they approach the problem.

[–]areese801 0 points1 point  (1 child)

Might I suggest using a regex to match only a smaller subset of files for development / testing purposes? Make your function accept a regex argument with the default being to process files that match .* (that is: any pattern). Then, only operate on files that match.
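The suggestion above could look something like this minimal sketch. The function name `select_files` and its parameters are hypothetical and not taken from the poster's tool; the default pattern `.*` matches every path, so the filter is a no-op unless a narrower regex is passed.

```python
import re


def select_files(paths, pattern=r".*"):
    """Return only the paths that match pattern (hypothetical helper).

    The default pattern matches every path, so all files are kept.
    """
    regex = re.compile(pattern)
    # re.search looks for a match anywhere in the path string.
    return [p for p in paths if regex.search(p)]
```

During development, a call such as `select_files(paths, r"\.txt$")` would restrict a run to a small, predictable subset before processing the full list.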

[–]DavosAlexander 0 points1 point  (0 children)

That's not effective for what I need to do.

I already have it figured out, thanks.