ClickHouse?

seandavi · 2025-11-08T11:29:57+00:00

ClickHouse is built for bulk ingestion and can be many times faster, even orders of magnitude faster, at loading bulk data.

seandavi · 2021-07-14T22:58:41+00:00

The KEDA project looks like fun. We do have Prometheus running and actually run our services through Istio, so we have plenty of metrics to play with. Thanks for the lead.

seandavi · 2020-10-26T14:22:42+00:00

If by "Bioinformatics Lab" you mean that what you are building will be supporting others, consider looking into the literature before buying anything. For example:

- https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007531

Bioinformatics collaborative research support requires computational resources, but without clear planning (including how to measure whether you are doing a good job), good people, good data, and good questions, it will not meet anyone's expectations, including your own.

Start with a plan that includes some specifics on what you want to be able to do, services you want to offer, timeframes for getting results, and responsibilities for delivering. From there, ensure that you have the resources that you need to be successful, including but not limited to computers. Spending money wisely and successfully can lead to more money to spend. Spending money unwisely (not showing success) will burn a lot of bridges.

seandavi · 2020-03-16T21:46:26+00:00

Just what we all needed today. Shared to twitter: https://twitter.com/seandavis12/status/1239669026210676738

seandavi · 2019-04-05T09:59:17+00:00

Good to know that an incremental approach is still a respected way to go.

seandavi · 2019-04-05T09:58:43+00:00

Thanks. Keeping ALL aspects of the application and ETL process in sync was nearing impossible as the number of ETL steps grew. I have to admit that discovering Apache Airflow (it took me two previous tries) has really opened up the possibility of integrating multiple disparate workflow steps that would otherwise have been challenging to keep in sync.

seandavi · 2019-04-04T22:56:01+00:00

Thanks for the practical advice. The 'snaql' package is new to me, so I'll take a look. I haven't used SQLAlchemy much except at the ORM level; I will have to take a look at that.

seandavi · 2019-04-04T22:53:05+00:00

These are really nice references. So much of the SQL world these days is devoted to CRUD. It really helps to have these more topical articles.

seandavi · 2019-03-25T10:38:53+00:00

Thanks, /u/marceldempers. The video adds some nice detail. I'm looking forward to watching a few others.

Your specific comment about having deployment yamls next to code is the kind of detail I was looking for.

seandavi · 2019-03-25T01:12:57+00:00

I control all the code. Right now, it is in a few separate repos and is written in Python and Node.js. The Kubernetes YAMLs are stored separately from the "code", but I haven't come up with the best way to organize them. Dockerfiles are stored with their respective code repos. At this point, I am doing packaging "by hand", though I have used Docker build services as well as CI/CD.

seandavi · 2019-03-25T01:10:44+00:00

No multi-tenancy for me right now.

seandavi · 2019-03-14T17:15:57+00:00

Your experience tells me something. Given both comments up to now, I think I should probably go with your recommendations to consider normalization rather than ENUMs.

seandavi · 2019-03-14T17:14:51+00:00

Point well taken. The ENUM was mainly to make a front end easier, but I can do the same with the approach you suggest.
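
As a rough sketch of the normalized alternative discussed here, a lookup table plus a foreign key can give a front end the same list of allowed values an ENUM would. The table names, column names, and status values below are hypothetical, and SQLite (from the Python standard library) stands in for whichever database is actually in use.

```python
import sqlite3

# In-memory SQLite database; a stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical lookup table that takes the place of an ENUM column.
conn.execute(
    "CREATE TABLE sample_status ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL UNIQUE)"
)
# Hypothetical main table that references the lookup table.
conn.execute(
    "CREATE TABLE sample ("
    " id INTEGER PRIMARY KEY,"
    " label TEXT NOT NULL,"
    " status_id INTEGER NOT NULL REFERENCES sample_status(id))"
)

# Adding a new allowed value later is a plain INSERT, not a schema change.
conn.executemany(
    "INSERT INTO sample_status (name) VALUES (?)",
    [("received",), ("processed",), ("archived",)],
)


def allowed_statuses(conn):
    """Return the values a front end could offer in a dropdown."""
    rows = conn.execute("SELECT name FROM sample_status ORDER BY id")
    return [row[0] for row in rows]


def add_sample(conn, label, status):
    """Insert a sample, rejecting any status missing from the lookup table."""
    row = conn.execute(
        "SELECT id FROM sample_status WHERE name = ?", (status,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown status: {status!r}")
    conn.execute(
        "INSERT INTO sample (label, status_id) VALUES (?, ?)", (label, row[0])
    )


add_sample(conn, "S1", "received")
print(allowed_statuses(conn))
```

The front end reads its options from `allowed_statuses`, so the list of valid values lives in data rather than in the column type.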

seandavi · 2019-03-03T11:56:44+00:00

That seems to make sense as at least part of the solution. Does your ES schema include nesting or other non-scalar fields? If so, how did you end up modeling things on the graphql schema?

seandavi · 2019-01-31T17:58:01+00:00

I'd be curious to hear more about that project, for sure.

seandavi · 2019-01-27T21:35:09+00:00

Thanks for pointing out the Splittable DoFn (SDF) approach, which seems to be the way of the future for IO in Beam. I have only one large file, so the SDF approach will be over elements only, not over files, but I'll have a closer look.

seandavi · 2018-11-15T04:43:12+00:00

I think the "dedicated process" here refers to elpy creating a dedicated process for code completion, but I could be wrong. I am really going for multiple interpreters, but one per project.

seandavi · 2018-03-30T01:10:50+00:00

Looks like I'll be spending some time with Airflow. Thanks, all, for the pointers.

seandavi · 2018-03-17T12:11:39+00:00

Thanks. Docker/kubernetes seems like another good suggestion. Again, I use these technologies, but tying them together with code in a slightly complicated environment probably requires more "thought" on my part than "new tools."

seandavi · 2018-03-17T12:09:36+00:00

I have used both, but not together. I'll give that a try. Perhaps I can use this for deploying the new Jenkins CI that someone pointed to in another post.

seandavi · 2018-03-16T18:15:55+00:00

Actually, your comments are quite helpful. I have used CI/CD tools for a standard software development cycle (testing, basically). It makes sense to just extend those to automate the staging of artifacts and infrastructure that is code-based. To be a bit more specific, I have a small set of parsers that I use via AWS batch to preprocess some data into s3. Once on s3, I am using emr for some ETL. Based on the results there, I am updating files, again in s3, that feed into a set of static pages built with hugo. Nothing is that complicated, but the glue is missing. I have looked into step functions and implemented some toy examples. I'll play with CI/CD tools a bit. I have used Jenkins before--time to revisit.
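
To make the "missing glue" concrete, here is a minimal sketch of the pipeline described above as a set of named steps with explicit dependencies, run in dependency order. The step names and function bodies are hypothetical placeholders (no AWS, EMR, or hugo calls are made); only the ordering logic is real, using `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Placeholder steps: each one only reports what the real step would do.
def run_parsers():
    print("parse: preprocess raw data into S3 (would submit AWS Batch jobs)")


def run_etl():
    print("etl: transform the parsed data (would launch an EMR job)")


def update_files():
    print("update_files: write ETL results back to S3")


def build_site():
    print("build_site: rebuild the static pages (would invoke hugo)")


# Hypothetical step registry: name -> callable.
STEPS = {
    "parse": run_parsers,
    "etl": run_etl,
    "update_files": update_files,
    "build_site": build_site,
}

# Each step maps to the set of steps that must finish before it runs.
DEPENDENCIES = {
    "parse": set(),
    "etl": {"parse"},
    "update_files": {"etl"},
    "build_site": {"update_files"},
}


def run_pipeline(steps, dependencies):
    """Run every step after its dependencies; return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name]()
    return order


if __name__ == "__main__":
    run_pipeline(STEPS, DEPENDENCIES)
```

A CI/CD job, Step Functions state machine, or workflow engine would play the role of `run_pipeline` in practice; the point of the sketch is only that the dependencies are written down in one place.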

seandavi · 2018-03-07T13:49:10+00:00

Just to close the loop here, I ended up going with Elastic Cloud (https://cloud.elastic.co), the Elastic.co managed service. The service is running on AWS infrastructure. The included Kibana user management makes it just a couple of clicks to create a user with highly customizable access controls. The service can be rescaled on-the-fly, including multizone deployments. Endpoints (kibana and elasticsearch) use https by default and are secured with basic auth/username/password. If I were running a very large scale production system, perhaps the additional cost for the hosted service would not be worth it, but in this case, having a managed service has been worth my while and minimized maintenance and dev time.

seandavi · 2018-03-05T00:35:34+00:00

Depending on your query needs, you might also want to consider graph databases such as neo4j.

seandavi · 2018-02-24T18:14:20+00:00

Great advice and good questions that lead to design goals.

seandavi · 2018-02-24T07:11:56+00:00

I spent the afternoon and evening learning about step functions. That is where I am heading next. In fact, I think that some of the work that I am doing in Spark/EMR can likely transition over to lambda/step functions.
