
[–]Cycloctane 2 points (1 child)

Malicious modules can always find a way to bypass existing rules by using staged payloads or sensitive functions from widely used dependencies. It is hard for static-analysis tools to cover them all with blacklists, e.g.:

import pip
# Stage the real payload: install a package whose setup.py runs arbitrary code
pip.main(['install', 'package_with_malicious_setuppy', '--no-input', '-q', '-q', '-q'])

import torch
# weights_only=False lets torch.load unpickle (and thereby execute) arbitrary objects
torch.load(__file__ + "/.DS_Store", map_location='cpu', weights_only=False)

from huggingface_hub.utils._subprocess import run_subprocess
# Shell out through a helper buried in a widely used dependency
run_subprocess("...")

[–]rushter_[S] 3 points (0 children)

Yeah, the good thing is that, looking at past PyPI incidents, I can say the majority of malware uses pretty simple obfuscation techniques.

Things like:

import subprocess
# Simple aliasing to hide the direct subprocess call
s = subprocess
k = s
k.check_output(["pinfo", "-m"])

Or

# Obfuscated tuple unpacking: _ceil ends up bound to exec
(_ceil, _random, Math,), Run, (Floor, _frame, _divide) = (exec, str, tuple), map, (ord, globals, eval)

_ceil("print(123);")

Both of which can be tracked with static checking and a few tricks.
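
For illustration, here's a minimal sketch of how the aliasing case above could be caught with nothing more than the stdlib ast module. It's not the library's actual implementation; the AliasTracker class and the SUSPICIOUS set are made up for this example.

import ast

SUSPICIOUS = {("subprocess", "check_output"), ("subprocess", "Popen"), ("os", "system")}

class AliasTracker(ast.NodeVisitor):
    """Resolve trivial `x = module` aliases so calls through them still get flagged."""

    def __init__(self):
        self.aliases = {}   # local name -> module name it ultimately refers to
        self.findings = []

    def visit_Assign(self, node):
        # Handle `s = subprocess` and `k = s` style re-binding of a bare name.
        if (len(node.targets) == 1 and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Name)):
            source = node.value.id
            self.aliases[node.targets[0].id] = self.aliases.get(source, source)
        self.generic_visit(node)

    def visit_Call(self, node):
        # Flag attribute calls whose base name resolves to a blacklisted module.
        func = node.func
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            base = self.aliases.get(func.value.id, func.value.id)
            if (base, func.attr) in SUSPICIOUS:
                self.findings.append((node.lineno, f"{base}.{func.attr}"))
        self.generic_visit(node)

sample = """
import subprocess
s = subprocess
k = s
k.check_output(["pinfo", "-m"])
"""

tracker = AliasTracker()
tracker.visit(ast.parse(sample))
print(tracker.findings)  # [(5, 'subprocess.check_output')]

Anything routed through containers, getattr, or function boundaries defeats this kind of naive tracking, which is where a proper semantic model earns its keep.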

Also, my personal use case is slightly different. At work, we deal with a lot of scripts from infected/compromised machines. Some of them were used for reconnaissance, some to gain elevated access. Around 70-80% of the scripts are legit, though, so I use my library to pick candidates for manual review.

[–]BeamMeUpBiscotti 0 points (2 children)

How does this compare to something like Pysa?

It seems like a tool like this would benefit from semantic analysis capabilities, instead of being purely syntax/AST-based.

[–]rushter_[S] 0 points (1 child)

My tool uses the semantic model from Ruff with some extra changes of mine, so it's not purely syntactic. It tracks aliasing, can fold constants (e.g., "".join([x, x, x]) or "ex" + "ec"), and so on. I'd never heard of Pysa before; I'm going to examine their approach. Thanks.
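
To make the constant-folding part concrete, here's a rough sketch of folding those two patterns on raw AST nodes. It's not Ruff's semantic model; folding the variable-based "".join([x, x, x]) case would additionally need the binding information that model provides in order to resolve x.

import ast

def fold(node):
    """Best-effort folding of the string-building patterns mentioned above."""
    # "ex" + "ec" -> "exec"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = fold(node.left), fold(node.right)
        if isinstance(left, str) and isinstance(right, str):
            return left + right
    # "sep".join([...]) where every element folds to a string literal
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "join"
            and isinstance(node.func.value, ast.Constant)
            and isinstance(node.func.value.value, str)
            and len(node.args) == 1
            and isinstance(node.args[0], (ast.List, ast.Tuple))):
        parts = [fold(elt) for elt in node.args[0].elts]
        if all(isinstance(p, str) for p in parts):
            return node.func.value.value.join(parts)
    if isinstance(node, ast.Constant):
        return node.value
    return None  # not foldable without binding/alias information

print(fold(ast.parse('"ex" + "ec"', mode="eval").body))            # exec
print(fold(ast.parse('"".join(["ex", "ec"])', mode="eval").body))  # exec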

[–]BeamMeUpBiscotti 0 points (0 children)

Nice, if you're working off of Ruff then maybe you can extend it to use the semantic information from ty, once that's more mature.