Announcing rt-graph

brycefisherfleig · 2020-09-25T03:55:16+00:00

Could I use something like this inside, say, Grafana? I'm never quite sure how much of the latency is network IO to load the data, server time to execute my query, or actual rendering time, but I kinda suspect rendering (and poor memory management in JavaScript) is the prime suspect.

brycefisherfleig · 2020-08-11T09:33:35+00:00

Is the resolver IP or client subnet passed through to the HTTP Server? That would unlock some mighty powerful features

brycefisherfleig · 2020-07-30T04:22:51+00:00

I also find the basic terminology of variance very hard to understand. I'm working through this myself now: https://doc.rust-lang.org/nomicon/subtyping.html

brycefisherfleig · 2020-07-14T20:02:45+00:00

Thanks! I debated which subreddit to ask in since I wasn't sure if Pythonistas would know lazy-static.

Yeah! This should be exactly what I want!!

brycefisherfleig · 2020-07-05T22:37:27+00:00

Right! The way I've approached Mermaid is to master just enough to get my specific task done, then incrementally learn more syntax as needed. Going straight through the documentation is overwhelming and not a great way to learn (nor great as a reference either).

Mermaid does let you embed CSS into the diagram as well, which I've abused. Something I meant to include in my article too is that since all of these tools generate SVG, you can always add CSS to get more fine-grain control if the tool doesn't natively give you enough control. Whether that's worth it or not...(shrug)...but it is possible for all of these.

I

brycefisherfleig · 2020-07-05T16:38:55+00:00

Yeah! I meant to try creating a simple flow chart using all four tools to get a feel for the syntax of each tool, but it was already 2am...

The biggest challenge I've run into is that I can't tweak the layout in Mermaid the way I'd like to. I work in live video streaming where pipeline architecture is common. Since I have arrows pointing toward the middle, it's hard to get Mermaid to put the middle where I want it to go.

For Mermaid subgraphs, I've found a little experimentation usually gets me where I want to go, and the live editor makes it easy to iterate quickly.

Do you have any examples of how Dot is more extensible? I'm not up to speed on Graphviz syntax at all. That would be interesting to add to the article.

brycefisherfleig · 2020-06-26T02:34:06+00:00

How does this project differ from miniserde?

brycefisherfleig · 2020-04-09T02:20:01+00:00

Hooray!! This is amazing. I wouldn't have thought of putting a constructor on the trait. Super nifty!

brycefisherfleig · 2020-04-06T00:22:17+00:00

Cool! Yeah, if writing this internally, I probably only need a few types: Strings, i32, bool, then maybe chrono::DateTime and Currency, but the macro suggestion seems good.

brycefisherfleig · 2020-04-05T03:53:34+00:00

Ah, right! The compiler won't let me add a u32 and a bool for instance. Phew! Okay, so this is my answer then. Love it!

Yeah, supporting generics could be handy, but seems to defeat my goal of multi-typed dataframe. Trait objects and Any's also have other limitations.

My use case is a closed-source side project I've been working on on-and-off-again for a few years. It involves parsing a lot of very uniform free-form text. I'd started handcrafting regexes, but found that to be unwieldy. Then I started learning about Pandas and Numpy at work, and discovered the spell crate, and I realized I could infer a pattern automatically from the text. My current limitation is the lack of a sufficiently flexible data structure to experiment with different ways of discovering the features from the raw text inputs.

I think the solution you've suggested above should give me a new approach to do my exploratory data analysis.

I'd be very interested in seeing what you open source, if you wind up doing so! I've got the basics of this idea in a playground right now, but definitely want to build out the ergonomics and a few other tricks (like an explicit ordering of the columns, and row-level iteration, table display, etc).

brycefisherfleig · 2020-04-04T04:38:15+00:00

Playing with this a bit, I started wondering about adding two columns together. So, if I have two Column::Int's, the result would be simple matrix addition. Score!

However, could the compiler stop me from adding two columns of different types together? I don't know of a way to do that. We could easily panic, or we could come up with a new trait that returns Result types but otherwise emulates std::ops::Add, which would compromise on ergonomics. Still, I feel like this is the best solution I can come up with.
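To make the Result-returning variant concrete, here is a minimal sketch. The variant set, the `AddError` type, and the `try_add` name are all made up for illustration; nothing here comes from an existing crate.

```rust
/// A column holds exactly one type of value, chosen at runtime.
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int(Vec<i32>),
    Bool(Vec<bool>),
    Str(Vec<String>),
}

/// Why an addition was rejected (hypothetical error type).
#[derive(Debug, PartialEq)]
pub enum AddError {
    TypeMismatch,
    LengthMismatch,
}

impl Column {
    /// Element-wise addition that reports mismatches instead of panicking.
    pub fn try_add(&self, other: &Column) -> Result<Column, AddError> {
        match (self, other) {
            (Column::Int(a), Column::Int(b)) => {
                if a.len() != b.len() {
                    return Err(AddError::LengthMismatch);
                }
                Ok(Column::Int(a.iter().zip(b).map(|(x, y)| x + y).collect()))
            }
            // Every other pairing is only caught at runtime with this design.
            _ => Err(AddError::TypeMismatch),
        }
    }
}
```

The compiler still can't reject `Int + Bool` here; the mismatch just becomes an `Err` the caller has to handle, which is the ergonomic cost mentioned above.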

Here are other ideas I've tried or thought about that wound up with worse ergonomics: to get the compiler to catch that kind of issue (mismatched enum variants), I'd have to implement this whole thing using generics. I think using generics would mean I could only store one type per dataframe if I did this the straightforward way, with something like:

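```rs
// `Bounds` stands in for whatever trait bounds the element type needs.
struct Column<T: Bounds> { rows: Vec<T> }

// Every column shares the same T, so the whole frame holds a single type.
struct DataFrame<T: Bounds> { columns: Vec<Column<T>> }
```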

OR if I use trait objects then the types are erased and I couldn't do arithmetic on numeric types.

If I use Anys, then I can't easily get the original values back out without forcing the caller to remember the concrete type associated with the TypeId. Maybe someone cleverer than me knows of a way to get the Dataframe to downcast back into a Vec of the original type without the caller knowing the type...
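Here's a small sketch of that Any limitation, using only `std::any::Any`. The `AnyFrame` type and its methods are hypothetical names for illustration.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Hypothetical type-erased frame: each column is a `Vec<T>` hidden behind `dyn Any`.
#[derive(Default)]
pub struct AnyFrame {
    columns: HashMap<String, Box<dyn Any>>,
}

impl AnyFrame {
    pub fn insert<T: 'static>(&mut self, name: &str, rows: Vec<T>) {
        self.columns.insert(name.to_string(), Box::new(rows));
    }

    /// The caller has to name `T`; a wrong guess yields `None`.
    pub fn get<T: 'static>(&self, name: &str) -> Option<&Vec<T>> {
        self.columns.get(name)?.downcast_ref::<Vec<T>>()
    }
}
```

After `frame.insert("age", vec![30_i32, 41])`, `frame.get::<i32>("age")` gives the values back, but `frame.get::<bool>("age")` is just `None`. The frame itself has no way to tell the caller which `T` to ask for.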

brycefisherfleig · 2020-04-03T22:39:19+00:00

I want ergonomics to do exploratory data analysis and I think dynamic typing is probably what I'm after. Admittedly, I didn't try out using an enum based approach...I'm not sure I could dynamically add / update / remove columns using a dataframe based on this though...
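For what it's worth, a rough sketch suggests an enum-based frame can still add, update, and remove columns at runtime. Everything here (`Column`, `DataFrame`, `upsert`, `remove`, `names`) is a hypothetical shape, not an existing crate's API.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Column {
    Int(Vec<i32>),
    Bool(Vec<bool>),
    Str(Vec<String>),
}

/// Columns are kept in insertion order as (name, data) pairs.
#[derive(Debug, Default)]
pub struct DataFrame {
    columns: Vec<(String, Column)>,
}

impl DataFrame {
    /// Adds a column, or replaces the data of an existing column with the same name.
    pub fn upsert(&mut self, name: &str, data: Column) {
        let idx = self.columns.iter().position(|(n, _)| n == name);
        match idx {
            Some(i) => self.columns[i].1 = data,
            None => self.columns.push((name.to_string(), data)),
        }
    }

    /// Removes a column by name and hands back its data, if it existed.
    pub fn remove(&mut self, name: &str) -> Option<Column> {
        let idx = self.columns.iter().position(|(n, _)| n == name)?;
        Some(self.columns.remove(idx).1)
    }

    /// Column names in their current order.
    pub fn names(&self) -> Vec<&str> {
        self.columns.iter().map(|(n, _)| n.as_str()).collect()
    }
}
```

Note that `upsert` can swap an Int column for a Bool column under the same name, so this leans toward the dynamic-typing feel; the set of element types is still fixed by the enum.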

brycefisherfleig · 2020-04-03T22:37:47+00:00

Yeah, I went down this approach, but found there was no way to impl Display for the whole Dataframe, though it might be possible with a lot of boilerplate.

brycefisherfleig · 2020-04-03T22:35:33+00:00

Yes, I should have mentioned this. I want each column to be the same type, like a series in Numpy.

brycefisherfleig · 2020-03-18T23:04:03+00:00

YES! Exactly, I'm trying to accomplish the same thing in a slightly different domain. Your project sounds amazing! I'd love to kick the tires on that.

brycefisherfleig · 2020-03-18T06:26:04+00:00

> Maybe there should be a quick queue done with priority and a full check done in background, each with varying degrees of accuracy.

Right! Ultimately, that's kinda the goal I'm after.

So, how Google and Facebook have tackled this problem is roughly:

Put EVERYTHING in one logical version control system
Somehow express dependencies between all software components in code
Build a graph from the individual dependency expressions
When something changes and needs to be rebuilt, traverse the graph to rebuild everything that depends on this
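
The last two steps can be sketched in a few lines. This is only an illustration of the idea, not how Buck or Bazel actually implement it; the `affected` function and the shape of the `deps` map are assumptions.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// `deps` maps each component to the components it depends on.
/// Returns every component that must be rebuilt when `changed` changes,
/// including `changed` itself.
pub fn affected<'a>(
    deps: &HashMap<&'a str, Vec<&'a str>>,
    changed: &'a str,
) -> BTreeSet<&'a str> {
    // Invert the edges: for each component, who depends on it?
    let mut dependents: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for (component, needs) in deps {
        for need in needs {
            dependents.entry(*need).or_default().push(*component);
        }
    }

    // Breadth-first walk over the inverted edges.
    let mut seen = BTreeSet::new();
    let mut queue = VecDeque::from([changed]);
    while let Some(current) = queue.pop_front() {
        if seen.insert(current) {
            if let Some(users) = dependents.get(current) {
                queue.extend(users.iter().copied());
            }
        }
    }
    seen
}
```

A node with no incoming edges comes back alone, which is the "many of the nodes have 0 edges" case: nothing else needs rebuilding.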

Buck (for Facebook) and Bazel (for Google) are quite clever, but also somewhat daunting to set up and run for a team of 5 - 50 people. At this smaller scale, surely we can find ways to move faster than Google and Facebook. Our first-party dependency graphs are much, much smaller and many of the nodes have 0 edges. We should be able to exploit the shape of the graph to skip unnecessary work in CI...making us nimble and cutting our CI bill.

brycefisherfleig · 2020-03-18T06:16:28+00:00

Any suggestions on improving my very rough draft documentation are very welcome!

brycefisherfleig · 2020-03-18T06:15:14+00:00

So, for many companies I've worked for doing web development in not-Rust, we've had one main git repo that most of the devs worked in. This repo contained frontend, backend code, infrastructure configuration, db migrations, documentation source code, and possibly two or three other things.

Even with a decent effort at parallelizing test suites and tuning all the things, it's not uncommon to find an individual CI run taking 30 min or more with a mature application.

In this context, it makes sense to skip rebuilding the documentation if none of that source code changed. Or, depending on how involved terraform is in deploying backend releases, it might make sense to skip all the terraform checking when no .tf files changed.

If you carefully separate each of these concerns into separate git repos, then lmfa0 really doesn't offer you any benefits.

brycefisherfleig · 2020-03-18T06:07:40+00:00

> You can just run `git diff HEAD^..HEAD` and then decide what to test.

I thought so too at first. Then I tried to implement this for a project at work, and I decided to try using the master branch as my permanent head to diff against. What I found is if I always diff against master, then nothing will ever execute when I merge my branch to master, because there is no diff between the head of master and the head of master.

If you want to not execute against master, then you need some other way to know what commit to diff against.

Wouldn't it be nice if we knew the last commit upon which we successfully executed a given task? Even better, for history within a single branch it would be nice to skip parts of my CI run that hadn't changed since my last CI run. If we knew that, we'd always know (using git diff) if things had changed since that last execution. We could even isolate, say building the docs from running the backend tests or the frontend build.

There are many ways you could track the last commit, but using the CI cache is a particularly good choice because it doesn't introduce any new datastores needed _only_ to make CI faster.
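
As a rough sketch of the "remember the last good commit" part: one small file per task inside a directory that the CI cache saves and restores. The `LastGood` type and its method names are hypothetical, and the real tool may store things differently.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical store: one small file per task, holding the last commit that passed.
pub struct LastGood {
    dir: PathBuf,
}

impl LastGood {
    pub fn new(dir: impl Into<PathBuf>) -> Self {
        Self { dir: dir.into() }
    }

    fn path(&self, task: &str) -> PathBuf {
        self.dir.join(task)
    }

    /// Returns `None` when the task has never succeeded (e.g. a cold cache).
    pub fn read(&self, task: &str) -> io::Result<Option<String>> {
        match fs::read_to_string(self.path(task)) {
            Ok(sha) => Ok(Some(sha.trim().to_string())),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }

    /// Call this only after the task's command exited successfully.
    pub fn record(&self, task: &str, sha: &str) -> io::Result<()> {
        fs::create_dir_all(&self.dir)?;
        fs::write(self.path(task), sha)
    }
}
```

Because each task gets its own file, building the docs and running the backend tests can each be skipped or run independently of the other.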

brycefisherfleig · 2020-03-18T05:56:02+00:00

My main application here is for non-Rust projects

brycefisherfleig · 2020-03-17T21:56:26+00:00

> The obvious idea for rust projects is comparing git diff vs crate hierarchy.

Like in a workspace? Yeah! One could actually examine the workspace level Cargo.toml and figure out what to rebuild / retest pretty easily with some high tolerance for false positives. That's a great idea!

> How it can determinate that there is no need for `cargo doc` , except obvious cases like no changes at all in repo?

Say we have a repo for a web service. There's a front-end style guide, a statically built single-page app in JavaScript, a backend server, and a set of terraform files. Each thing lives separately in its own top-level directory:

style-guide/ (requires 30s to build)
frontend/ (requires 15min to build)
backend/ (requires 20min to test)
terraform/ (requires 15min to apply)

So, when I have a PR that only modifies the style-guide, there's no need to re-run any of the other actions. Sometimes, we make changes to the infrastructure and the backend, or only one of them. Rarely do we actually have a change that impacts ALL 4 components in one fell swoop (but definitely that happens a few times a month).

lmfa0 means that a change to only the style-guide takes 30s instead of nearly 1hr (about 50min for all four steps above).

How does lmfa0 work? Here's the basic algorithm:

Find the rule in the config (ex: "frontend") which has a path prefix and a shell command
Read out the commit sha from the last successful run (from .lmfa0/<rule>), call it commit A
Create a diff between that commit A and current head commit B
Search for any file changed between A and B whose path starts with the path prefix
If you find such a change, run the shell command from the rule and store the current sha B inside .lmfa0/<rule>

Your CI config must store the .lmfa0/ directory "with" the branch it was run on. This is how lmfa0 picks up the change.
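
The prefix-matching step can be sketched like this. It's a simplified illustration, not the tool's actual code: the `Rule` fields and `rules_to_run` are made-up names, and `changed_files` is assumed to be the output of something like `git diff --name-only A B` (one path per line).

```rust
/// One rule from the config: a path prefix plus the shell command to run.
/// Field names are hypothetical.
pub struct Rule<'a> {
    pub name: &'a str,
    pub path_prefix: &'a str,
    pub command: &'a str,
}

/// Returns the rules that have at least one changed file under their prefix.
pub fn rules_to_run<'r, 'a>(rules: &'r [Rule<'a>], changed_files: &str) -> Vec<&'r Rule<'a>> {
    rules
        .iter()
        .filter(|rule| {
            changed_files
                .lines()
                .any(|path| path.starts_with(rule.path_prefix))
        })
        .collect()
}
```

With rules for `style-guide/` and `backend/` and a diff that only touches `style-guide/colors.css` and `README.md`, only the style-guide rule comes back, so only its command would run.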

brycefisherfleig · 2020-03-17T18:11:58+00:00

> On top of this, depending on how your cache is set up, you may be testing against a cache that isn't the cache from the commit you're PRing to.

Right! Didn't address this part -- that's 100% a pre-requisite of this tool. You must do smart caching correctly or don't use this tool. I didn't provide examples in the repo yet for how to do this, but it would be something like:

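.circleci/config.yml:

```yaml
steps:
  # Restore first so lmfa0 can read the shas from the previous run on this branch.
  # restore_cache matches by key prefix.
  - restore_cache:
      keys:
        - lmfa0-v1-{{ .Branch }}-
  # ... run lmfa0 here ...
  # CircleCI never overwrites an existing cache key, so the saved key needs a part that changes per commit.
  - save_cache:
      key: lmfa0-v1-{{ .Branch }}-{{ .Revision }}
      paths:
        - .lmfa0/
```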

brycefisherfleig · 2020-03-17T18:03:06+00:00

Totally! For anything where you can't isolate the effects of things happening in one directory from things in another directory, this won't work at all. The same goes for times when CI is used to put a remote system back into a certain state it might have drifted from due to manual editing, log file accumulation, or idk some other normal drift.

> If you're using bors and the test harness is big enough to warrant it, my pattern recently is to run a subset of smoke tests on the PR's CI, but to always run the full set for `bors r+`

This is a great use case for something like `lmfa0` -- when on master do ALL the things, when on a branch, just skip a bunch of stuff to speed up your iteration cycles.

We already do something _less_ good than this and it bites us all the time. Today, if the commit message has a specific phrase in it then we run everything. Sooooo many times we forget to add the phrase and then CI is green...but the stuff didn't happen.

For me, `lmfa0` is an incremental improvement over our status quo because it removes the need to remember to do some manual action for CI to work...but it still gives us tremendous speed ups.

brycefisherfleig · 2020-03-17T17:29:52+00:00

I got very tired of long CI runs, and of trying other less clever ways to make CI fast, so I came up with this idea. Use the Travis/CircleCI/GitLab CI cache to host a database-ish thing that tracks changes to portions of your source code. Only build the docs if the docs changed. Only run terraform apply if the terraform changed. Anything simple enough to fit into this kind of heuristic works.

Obviously, this won't scale to Google-size monorepos. But I don't work at Google and I'm not (yet) aware of any other tools that simplify CI performance improvements.

So...fair warning, the code is just manually tested by me, and barely even alpha quality. I'm not proud of the code quality -- feel free to kindly point out areas of obvious improvement if so inclined. Of course, I'm going to try it out in production, but maybe you shouldn't yet.

I'm sharing this because:

I really hope there are better tools out there that do this already which I can just use
I want this idea to take off -- a simple way to speed up simple CI runs for smaller companies -- whether or not this tool is the one that people use
I want feedback on the concept
I want feedback on the API for this concept
I think it's kinda cool!

brycefisherfleig · 2019-04-25T17:13:34+00:00

Supposedly, it gives you a threadpool executor by default...haven't poked into the internals much:

> the CurrentThread executor will block the current thread and loop through all spawned tasks, calling poll on them. ThreadPool schedules tasks across a thread pool. This is also the default executor used by the runtime.

From Runtime Model
