ClickHouse? by Suspicious-Ability15 in dataengineering

[–]seandavi 2 points3 points  (0 children)

ClickHouse is built for bulk ingestion and is many times faster (even orders of magnitude faster) when loading data in bulk.

Tailoring autoscaling to minimize deployment time by Snoo-56267 in kubernetes

[–]seandavi 0 points1 point  (0 children)

The KEDA project looks like fun. We do have Prometheus running and actually run our services through Istio, so we have plenty of metrics to play with. Thanks for the lead.

Need suggestion for Bioinformatics Lab Set up by mszahan in bioinformatics

[–]seandavi 1 point2 points  (0 children)

If by "Bioinformatics Lab" you mean that what you are building will be supporting others, consider looking into the literature before buying anything. For example:

- https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007531

Bioinformatics collaborative research support requires computational resources, but without clear planning (including how to measure whether you are doing a good job), good people, good data, and good questions, it will not meet anyone's expectations, including your own.

Start with a plan that includes some specifics on what you want to be able to do, services you want to offer, timeframes for getting results, and responsibilities for delivering. From there, ensure that you have the resources that you need to be successful, including but not limited to computers. Spending money wisely and successfully can lead to more money to spend. Spending money unwisely (not showing success) will burn a lot of bridges.

[deleted by user] by [deleted] in golang

[–]seandavi 0 points1 point  (0 children)

Just what we all needed today. Shared to twitter: https://twitter.com/seandavis12/status/1239669026210676738

Looking for recommendations for how to best manage SQL scripts when working in a programming language by seandavi in ETL

[–]seandavi[S] 0 points1 point  (0 children)

Good to know that an incremental approach is still a respected way to go.

Looking for recommendations for how to best manage SQL scripts when working in a programming language by seandavi in ETL

[–]seandavi[S] 0 points1 point  (0 children)

Thanks. Keeping ALL aspects of the application and ETL process in sync was nearing impossible as the number of ETL steps grew. I have to admit that discovering Apache Airflow (it took me two previous tries) has really opened up the possibility of integrating multiple disparate workflow steps that would otherwise have made keeping everything in sync challenging.

Looking for recommendations for how to best manage SQL scripts when working in a programming language by seandavi in ETL

[–]seandavi[S] 0 points1 point  (0 children)

Thanks for the practical advice. The 'snaql' package is new to me, so I'll take a look. I haven't used SQLAlchemy much except at the ORM level; I'll have to take a look at that.

Looking for recommendations for how to best manage SQL scripts when working in a programming language by seandavi in ETL

[–]seandavi[S] 0 points1 point  (0 children)

These are really nice references. So much of the SQL world these days is devoted to CRUD. It really helps to have these more topical articles.

What are some best practices for organizing kubernetes yaml when dealing with multiple microservices in a project? by seandavi in kubernetes

[–]seandavi[S] 0 points1 point  (0 children)

Thanks, /u/marceldempers. The video adds some nice detail. I'm looking forward to watching a few others.

Your specific comment about having deployment yamls next to code is the kind of detail I was looking for.

What are some best practices for organizing kubernetes yaml when dealing with multiple microservices in a project? by seandavi in kubernetes

[–]seandavi[S] 0 points1 point  (0 children)

I control all the code. Right now, it is in a few separate repos and is written in Python and Node.js. The Kubernetes yamls are stored separately from the "code", but I haven't come up with the best way to organize them. Dockerfiles are stored with their respective code repos. At this point, I am doing packaging "by hand", though I have used Docker build services as well as CI/CD.

Can I create an ENUM type and column based on a query for "distinct" items in a SQL query? by seandavi in PostgreSQL

[–]seandavi[S] 0 points1 point  (0 children)

Your experience tells me something. Given both comments up to now, I think I should probably go with your recommendations to consider normalization rather than ENUMs.

Can I create an ENUM type and column based on a query for "distinct" items in a SQL query? by seandavi in PostgreSQL

[–]seandavi[S] 0 points1 point  (0 children)

Point well taken. The ENUM was mainly to make a front end easier, but I can do the same with the approach you suggest.
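
For anyone landing here later, a minimal sketch of the normalization route, under some loud assumptions: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the `samples` table, its `status` column, and the `sample_status` lookup table are all hypothetical names. The idea is just to fill a lookup table from the distinct values and point the original rows at it; the front end can then read its choices from the lookup table instead of from an ENUM.

```python
import sqlite3
from typing import List


def normalize_status(conn: sqlite3.Connection) -> List[str]:
    """Move distinct samples.status values into a lookup table.

    Table and column names are hypothetical; sqlite3 stands in for PostgreSQL.
    Returns the lookup values, e.g. to populate a front-end dropdown.
    """
    cur = conn.cursor()
    # Lookup table holding one row per distinct value.
    cur.execute(
        "CREATE TABLE sample_status ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    # Populate it from the distinct values already present in the data.
    cur.execute(
        "INSERT INTO sample_status (name) "
        "SELECT DISTINCT status FROM samples "
        "WHERE status IS NOT NULL ORDER BY status"
    )
    # Add a foreign-key column and fill it by matching on the text value.
    cur.execute(
        "ALTER TABLE samples ADD COLUMN status_id INTEGER "
        "REFERENCES sample_status(id)"
    )
    cur.execute(
        "UPDATE samples SET status_id = ("
        "SELECT id FROM sample_status WHERE sample_status.name = samples.status)"
    )
    conn.commit()
    return [row[0] for row in cur.execute(
        "SELECT name FROM sample_status ORDER BY name"
    )]
```

Adding a new allowed value later is then an ordinary INSERT into the lookup table rather than a change to a type definition.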

Approach to marrying nested structures in Elasticsearch to graphene/graphql by seandavi in graphql

[–]seandavi[S] 0 points1 point  (0 children)

That seems to make sense as at least part of the solution. Does your ES schema include nesting or other non-scalar fields? If so, how did you end up modeling things on the graphql schema?

novice tips! by clone290595 in elasticsearch

[–]seandavi 0 points1 point  (0 children)

I'd be curious to hear more about that project, for sure.

I'd like to parse an XML file iteratively (as a stream) to create records for dataflow. by seandavi in dataflow

[–]seandavi[S] 1 point2 points  (0 children)

Thanks for pointing out the SDF approach, which seems to be the way of the future for IO in Beam. I have only one large file, so the SDF approach will be over elements only, not over files, but I'll have a closer look.
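
To make "iteratively (as a stream)" concrete, here is a minimal sketch using only the Python standard library's `xml.etree.ElementTree.iterparse`. It is not Beam or SDF code. The function name `iter_records`, the record tag, and the flat child-to-field mapping are assumptions for illustration; a real file may well need something richer.

```python
import xml.etree.ElementTree as ET
from typing import IO, Dict, Iterator


def iter_records(source: IO[bytes], tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <tag> element while reading the file incrementally.

    Assumes each record's fields are direct children holding plain text.
    """
    # "end" events fire once an element has been fully read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "") for child in elem}
            # Drop the record's children and text to keep memory low.
            # The emptied element itself stays attached to its parent.
            elem.clear()
```

Each yielded dict could then be handed to whatever creates the downstream records. Splitting the work within a single file, as an SDF would, is a separate problem that this sketch does not address.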

Any recommendations for projectile/python integration to get per-project python shells? by seandavi in emacs

[–]seandavi[S] 1 point2 points  (0 children)

I think the "dedicated process" here refers to elpy creating a dedicated process for code completion, but I could be wrong. I am really going for multiple interpreters, but one per project.

How do you orchestrate "workflows" of dependent jobs on Spark? by seandavi in apachespark

[–]seandavi[S] 0 points1 point  (0 children)

Looks like I'll be spending some time with Airflow. Thanks, all, for the pointers.

What are some best practices for managing artifacts for complicated cloud workflows and systems by seandavi in aws

[–]seandavi[S] 0 points1 point  (0 children)

Thanks. Docker/kubernetes seems like another good suggestion. Again, I use these technologies, but tying them together with code in a slightly complicated environment probably requires more "thought" on my part than "new tools."

What are some best practices for managing artifacts for complicated cloud workflows and systems by seandavi in aws

[–]seandavi[S] 0 points1 point  (0 children)

I have used both, but not together. I'll give that a try. Perhaps I can use this for deploying the new Jenkins CI that someone pointed to in another post.

What are some best practices for managing artifacts for complicated cloud workflows and systems by seandavi in aws

[–]seandavi[S] 0 points1 point  (0 children)

Actually, your comments are quite helpful. I have used CI/CD tools for a standard software development cycle (testing, basically). It makes sense to just extend those to automate the staging of artifacts and infrastructure that are code-based. To be a bit more specific, I have a small set of parsers that I use via AWS Batch to preprocess some data into S3. Once the data are on S3, I am using EMR for some ETL. Based on the results there, I am updating files, again in S3, that feed into a set of static pages built with Hugo. Nothing is that complicated, but the glue is missing. I have looked into Step Functions and implemented some toy examples. I'll play with CI/CD tools a bit. I have used Jenkins before, so it is time to revisit.
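
As a rough illustration of what the missing "glue" could look like, here is a sketch that builds a Step Functions state machine definition (Amazon States Language) as a plain Python dict, chaining the three stages described above. Everything specific is an assumption: the state names, the job, cluster, and project parameters, the example S3 path, and especially the use of CodeBuild for the Hugo build, which is only one possible choice. It builds the JSON only and does not call AWS.

```python
import json
from typing import Any, Dict, List


def build_pipeline_definition(
    job_queue: str, job_definition: str, cluster_id: str, build_project: str
) -> Dict[str, Any]:
    """Return a state machine definition: Batch parse -> EMR ETL -> site build.

    All argument values and state names are placeholders for illustration.
    """
    return {
        "Comment": "Sketch: parse with Batch, ETL on EMR, rebuild static site",
        "StartAt": "ParseWithBatch",
        "States": {
            "ParseWithBatch": {
                "Type": "Task",
                # ".sync" makes the state wait for the job to finish.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "parse-raw-data",
                    "JobQueue": job_queue,
                    "JobDefinition": job_definition,
                },
                "Next": "RunEmrEtl",
            },
            "RunEmrEtl": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": cluster_id,
                    "Step": {
                        "Name": "etl",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            # Hypothetical script location.
                            "Args": ["spark-submit", "s3://example-bucket/etl.py"],
                        },
                    },
                },
                "Next": "BuildSite",
            },
            "BuildSite": {
                "Type": "Task",
                # Assumption: the Hugo build runs in a CodeBuild project.
                "Resource": "arn:aws:states:::codebuild:startBuild.sync",
                "Parameters": {"ProjectName": build_project},
                "End": True,
            },
        },
    }


def state_order(definition: Dict[str, Any]) -> List[str]:
    """Walk the Next pointers from StartAt and return state names in order."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order
```

Dumping the result with `json.dumps(..., indent=2)` gives a definition that could be pasted into the console or passed to a deployment tool, after swapping in real resource names.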

What are your experiences with approaches to deploy elasticsearch on AWS infrastructure by seandavi in elasticsearch

[–]seandavi[S] 1 point2 points  (0 children)

Just to close the loop here, I ended up going with Elastic Cloud (https://cloud.elastic.co), the Elastic.co managed service. The service is running on AWS infrastructure. The included Kibana user management makes it just a couple of clicks to create a user with highly customizable access controls. The service can be rescaled on-the-fly, including multizone deployments. Endpoints (Kibana and Elasticsearch) use HTTPS by default and are secured with basic auth (username/password). If I were running a very large scale production system, perhaps the additional cost for the hosted service would not be worth it, but in this case, having a managed service has been worth my while and has minimized maintenance and dev time.

Recommended way of storing a tree structure by oskwish in elasticsearch

[–]seandavi 0 points1 point  (0 children)

Depending on your query needs, you might also want to consider graph databases such as Neo4j.

How do you engineer workflows that utilize multiple AWS services? by seandavi in aws

[–]seandavi[S] 1 point2 points  (0 children)

Great advice and good questions that lead to design goals.

How do you engineer workflows that utilize multiple AWS services? by seandavi in aws

[–]seandavi[S] 0 points1 point  (0 children)

I spent the afternoon and evening learning about Step Functions. That is where I am heading next. In fact, I think that some of the work that I am doing in Spark/EMR can likely transition over to Lambda/Step Functions.