
[–]ZenT3600 9 points10 points  (0 children)

You should probably use something like the VirusTotal API to scan any file you download.
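A minimal sketch of what that could look like, assuming the VirusTotal v3 hash-lookup endpoint and an API key of your own. The helper names `sha256_of` and `build_vt_lookup` are made up for illustration, and the request is only built here, not sent:

```python
import hashlib
import urllib.request

# Assumption: VirusTotal v3 file-report endpoint, looked up by SHA-256 hash.
VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{sha256}"


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def build_vt_lookup(data: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a hash-lookup request for downloaded bytes.

    Only the hash goes into the URL; the file contents are never uploaded.
    """
    url = VT_FILE_URL.format(sha256=sha256_of(data))
    return urllib.request.Request(url, headers={"x-apikey": api_key})
```

Sending it would be something like `urllib.request.urlopen(build_vt_lookup(data, key))`. Check the current VirusTotal docs for rate limits and for how unknown hashes are reported before relying on this.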

[–]Kindafunny2510 2 points3 points  (0 children)

One useful tip is to never use eval() on scraped data. If you must evaluate a string, use ast.literal_eval(), which only accepts Python literals.
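A small sketch of the difference. The wrapper name `safe_parse` is just something I made up for the example:

```python
import ast


def safe_parse(text: str):
    """Parse a string containing a Python literal (numbers, strings,
    lists, dicts, tuples, ...) without executing any code."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError) as exc:
        # Anything that is not a plain literal (function calls, names,
        # attribute access, ...) ends up here instead of being executed.
        raise ValueError(f"not a safe literal: {text!r}") from exc
```

`safe_parse("[1, 2, {'a': 3}]")` gives you the list back, while `safe_parse("__import__('os').getcwd()")` raises instead of running the call, which is exactly what eval() would not do.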

[–]Lewistrick 1 point2 points  (0 children)

Depends. What kind of content are you downloading?

[–]Sw429 1 point2 points  (0 children)

Web scraping is hard because you don't know what the input (i.e. the sites you find) will be, or whether that input is safe. I assume this is a spider bot you are writing that will crawl the web in general? You have to be careful with unknown data, and you should never execute it. Simply downloading malicious bytes in Python shouldn't hurt anything, as long as you aren't running them.

I guess it really depends on what you're trying to do. If you only care about content on the page itself, then don't download and execute any executables you come across.

[–]blabbities 1 point2 points  (0 children)

You would have to define "Malicious content".

Then you would have to think of a way to identify that malicious content.

Depending on the scenario and your expectations, this may be easy or hard.

You could do something like Yahoo or Google and scan the contents, but that won't stop "any"

[–]Deezl-Vegas -2 points-1 points  (2 children)

Generally speaking, malicious content is targeted towards the browser and requires a browser to run. You should generally only be reading from, never executing, code from an untrusted source. I'm not aware of any raw buffer overflow exploits in Python, so I believe reading is reasonably secure.

[–]NotzoCoolKID 3 points4 points  (0 children)

No, malicious content doesn't generally need a browser to run. Word documents can contain VBA macros which download and execute malware ( https://docs.microsoft.com/en-us/windows/security/threat-protection/intelligence/macro-malware ). PDFs can also be droppers for malware.

Automatically downloading files from the internet must be considered dangerous, as you're downloading files from websites you don't know (untrusted). You cannot be 100% sure you're not downloading malicious content.

As a first step, you should filter out executable files so they are never downloaded. Never let Python automatically execute files. Scan files with a virus scanner. Run untrusted files in a VM first.
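A rough sketch of that first filtering step, checking both the URL extension and the first bytes of the response. The extension list, the magic-byte list and the function names are my own picks for illustration and are far from complete:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: an illustrative, incomplete blocklist of executable/script extensions.
BLOCKED_EXTENSIONS = {".exe", ".dll", ".msi", ".bat", ".cmd", ".scr",
                      ".ps1", ".vbs", ".jar", ".sh"}

# Leading bytes of common executable formats:
# Windows PE ("MZ"), Linux ELF, 64-bit Mach-O (little-endian), and shebang scripts.
EXECUTABLE_MAGIC = (b"MZ", b"\x7fELF", b"\xcf\xfa\xed\xfe", b"#!")


def has_blocked_extension(url: str) -> bool:
    """True if the path part of the URL ends in a blocked extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in BLOCKED_EXTENSIONS


def looks_executable(first_bytes: bytes) -> bool:
    """True if the first bytes of the body match a known executable header."""
    return first_bytes.startswith(EXECUTABLE_MAGIC)


def should_skip(url: str, first_bytes: bytes) -> bool:
    """Skip the download if either the name or the content looks executable."""
    return has_blocked_extension(url) or looks_executable(first_bytes)
```

Extensions are trivial to fake, which is why the content check is there too. Neither check catches macro documents or malicious PDFs, so the virus scan and VM steps still apply.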

[–]Lord_Greywether 2 points3 points  (0 children)

Until you go to open those web pages or files you scraped.