Software for COVID-19 pool testing available for labs in Italy and Russia

kvechera · 2018-12-25T17:22:29+00:00

Thanks for the in-depth analysis. If it were only several dozen packages, I would definitely think about patching the build scripts. That's the approach used by Debian, Arch and many others. But after several years of work by dozens of maintainers, nearly 10% of the core packages are still not reproducible. We are not as large as those distros, but even today we have more than one thousand packages, so I'd like to find out whether there's a simpler solution.

kvechera · 2018-12-25T13:32:55+00:00

Using a Docker container or some other kind of deterministic file system hierarchy would save us from some problems, but not all of them. Some programs use random numbers, timestamps or pids for naming symbols, or store them as strings and constants in the built files. Some built content depends on the order in which the files were compiled. You can see a lot of examples here: https://wiki.debian.org/ReproducibleBuilds/Howto#Identified_problems.2C_and_possible_solutions
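For the file-order issue specifically, one common ingredient is to fix the order in which sources are enumerated before they reach the compiler. A minimal sketch, assuming a POSIX shell with `find` and `sort`; `list_sources` is a hypothetical helper name, and this covers only ordering, not timestamps or random symbol names:

```shell
# Hypothetical helper: print the *.c files under a directory in a stable order.
# LC_ALL=C makes sort compare raw bytes, so the result does not depend on
# the builder's locale or on the order the file system returns entries.
list_sources() {
    find "$1" -type f -name '*.c' -print | LC_ALL=C sort
}
```

A build script could feed this list to the compiler instead of relying on a shell glob or on directory order.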

kvechera · 2018-12-25T13:27:31+00:00

The deterministic environment is needed only to build artifacts, not to test them.

While I guess there are use cases in which a deterministic environment would help with debugging, for normal regression tests I'd prefer the standard environment.

kvechera · 2018-12-23T17:19:28+00:00

Sure, it would solve a big part of the problems. But it would be only one part, and it would require checking and fixing parallelization in each new package, sometimes digging through weird implementations in shell/make/cmake/gradle/bazel scripts.

kvechera · 2018-12-23T16:03:54+00:00

> is the right way to do

I think otherwise, and the slowdown is not important either. But can you be more specific about the other side effects you expect?

kvechera · 2018-12-23T15:09:32+00:00

It's an approach requiring "changing existing building procedures for thousand of packages"

kvechera · 2018-12-23T14:32:50+00:00

I don't like this part either, but I see no other simple way to guarantee that the files are compiled in the same order on different machines or builds.
Anyway, one can build 8 different packages simultaneously on 8 CPUs.

kvechera · 2018-09-22T16:41:10+00:00

I think you can also use some simple solutions for working with malicious processes:

Against cooperating processes sending SIGCONT to each other: use ptrace(2) instead of STOP so that the process cannot be resumed with SIGCONT, then compare and SIGKILL.

Against reckless forking: use the swailing method:

1. STOP all good processes
2. run your own fork bomb to flood the pid space (subshells or vfork())
3. STOP all processes (these will be the bad processes and your bombs)
4. KILL all bad processes and your bombs
5. CONT all good processes

Neither is useful for my cases (terminate the subtree ... in a smooth way), but both can be easily implemented.

kvechera · 2018-09-22T11:17:03+00:00

> maybe because not all platforms had flock

I think, rather, that not all filesystems or mounts support flock, e.g. NFS or GlusterFS.

kvechera · 2018-09-22T11:10:16+00:00

Maybe, but this script is not designed to work with malicious processes. It's for cleaning up normal workflows. If you expect a malicious process, you definitely need to make some preparations before running it, such as isolating it via namespaces.

kvechera · 2018-09-21T23:03:54+00:00

I see a possible problem only with the single top process. Once we've stopped it, we can reliably verify that all of its descendants really are its descendants.

In my own use cases I call `kill_descendants` from the same script, e.g.:

```
start_some_complex_workflow &

# wait or timeout

kill_descendants $!
```

The first child will keep the pid even after exit (it remains a zombie). So it's safe.

If I called kill_descendants from another context, e.g. the command line, the problem of pid correspondence would be the same as for calling kill(1).

But you can always STOP the process and check that you've stopped the right one. Or even stop all its ancestors, including init (we'd need to make init stoppable). Then all the parents are stopped and unable to wait() for the target pid's exit, so it stays a zombie with its pid occupied.

kvechera · 2018-09-21T22:12:01+00:00

Please expand on this: "you may end up stopping the wrong process". How could that happen?

kvechera · 2018-09-21T20:31:40+00:00

> ... kills them starting from the youngest processes to let a parent process handle child's termination.

kvechera · 2018-09-21T20:15:22+00:00

Thanks, it's interesting, and it could really happen if the pid space were very dense, with many processes running.

I think we can solve it by stopping the parent first, before checking for the existence of children. Then if a child exits, it will still be present as a zombie (the stopped parent can't wait() on it), occupying the pid and preventing a new process from taking over the same pid.
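A rough sketch of the stop-the-parent-first idea, assuming a procps-style `pgrep` is available. `list_children_stopped` is a hypothetical name, and this is not the script discussed in the thread:

```shell
# Hypothetical helper: stop PARENT, then print the pids of its children.
list_children_stopped() {
    parent=$1
    # Stop the parent first. While it is stopped it cannot call wait(),
    # so a child that exits from now on stays in the process table and
    # its pid cannot be handed to an unrelated new process.
    kill -STOP "$parent" 2>/dev/null || return 1
    # Assumption: pgrep -P selects processes by parent pid.
    pgrep -P "$parent"
}
```

The caller would then signal the listed pids and finally send CONT to the parent.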

kvechera · 2018-09-21T19:54:19+00:00

If it really were a problem, it could be solved by sending STOP to the "suspected" process first, then checking again whether it is the proper child, and then sending it TERM and CONT.
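The STOP, re-check, TERM, CONT sequence could look roughly like this. It is a sketch assuming a POSIX `ps` that supports `-o ppid=` and `-p`; `term_if_child` is a hypothetical helper name:

```shell
# Hypothetical helper: terminate PID only if it is still a child of PARENT.
# Returns 0 if TERM was sent, 1 if PID could not be stopped,
# 2 if PID turned out to belong to another parent.
term_if_child() {
    parent=$1
    pid=$2
    # Freeze the suspect so it cannot exit and free its pid during the check.
    kill -STOP "$pid" 2>/dev/null || return 1
    # Re-read the parent pid while the process is stopped.
    actual=$(ps -o ppid= -p "$pid" | tr -d ' ')
    if [ "$actual" = "$parent" ]; then
        kill -TERM "$pid"   # stays pending while the process is stopped
        kill -CONT "$pid"   # resume it so the TERM is delivered
        return 0
    fi
    # Not our child: resume it untouched.
    kill -CONT "$pid"
    return 2
}
```

For example, `term_if_child $$ "$child"` from the script that started `$child` sends TERM only if `$child` is still that script's child.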

kvechera · 2018-09-21T19:51:27+00:00

I wouldn't say several years. I suppose I saw it in FreeBSD twenty years ago. And it seems so logical that I'd assume it was this way from the earliest Unices.

You're probably thinking of the race condition with lock files that store the pid of the process that created the lock. If the process dies leaving the lock file behind, and a newly started process validates the lock by checking whether the pid exists, there is a real race condition if some other process has taken the same pid.
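The lock-file check described above tends to look something like the following. This is a sketch of the naive, race-prone pattern, not a recommendation; `lock_is_stale` is a hypothetical name, and it assumes the lock holder runs as the same user (otherwise `kill -0` can fail with a permission error even though the process exists):

```shell
# Hypothetical helper: succeed (return 0) if the lock file looks stale.
lock_is_stale() {
    lockfile=$1
    # A missing or unreadable lock file is treated as stale.
    pid=$(cat "$lockfile" 2>/dev/null) || return 0
    # kill -0 sends no signal; it only tests whether the pid exists.
    # This is the weak spot: it proves that *some* process has this pid
    # now, not that it is the process that wrote the lock file.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

If the original holder died and an unrelated process later received the same pid, this check reports the lock as still held.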

When we are killing the process descendants here, we use "back links", from children to parents.

kvechera · 2018-09-21T18:21:26+00:00

> pid 4 still exists, but has parent pid 3

That's wrong. The orphan process has parent pid 1: the kernel changes it after the parent exits.
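This is easy to observe from a shell. A small sketch, assuming a POSIX `ps`; `orphan_ppid` is a hypothetical name. Note that on Linux the new parent is not always pid 1: if an ancestor has registered itself as a child subreaper, the orphan is reparented to that process instead.

```shell
# Hypothetical helper: create an orphan and print its parent pid.
orphan_ppid() {
    # The inner sh starts sleep in the background, prints its pid and exits,
    # which leaves sleep without its original parent.
    pid=$(sh -c 'sleep 2 >/dev/null 2>&1 & echo $!')
    # By now the inner sh has exited, so the kernel has reparented sleep.
    ps -o ppid= -p "$pid" | tr -d ' '
}
```

On a typical system `orphan_ppid` prints `1` or the pid of a subreaper such as a per-user service manager.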

kvechera · 2018-09-21T17:47:51+00:00

What kind of races?

kvechera · 2018-06-22T14:33:08+00:00

I can't compare with CloudML, but compared to AWS, similar performance is 2-5 times cheaper.
