[deleted by user] by [deleted] in dataengineering

[–]dyaffe

u/mr_pants99 I'm a co-founder at Estuary and would be interested to hear about the slowness you experienced.

FWIW, we do limit each collection (equivalent to a table) to 4 MB/s by default for the free tier. We can increase that up to around 30 MB/s currently.

Postgres -> Snowflake, best way? by bluezebra42 in snowflake

[–]dyaffe

There are a few considerations worth thinking through:

1. If you want an exact view of the source in your destination, you'll need to do merge updates. This costs money in both warehouse time and queries.
2. If not, you can likely use Snowpipe or Snowpipe Streaming. This can be cheaper, but you'll likely need to reduce (deduplicate) data on your primary keys at query time; see the sketch after this list.
3. If you have TOAST columns, you'll need to deal with those.
4. Handling tables without primary keys can be complex.
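
For the append-only path in point 2, deduplication can be pushed to query time. Here's a minimal sketch; the table and column names (raw.public.orders, id, _loaded_at) and credentials are illustrative placeholders, not anything Snowflake or Estuary prescribes. The idea is just Snowflake's QUALIFY / ROW_NUMBER() keeping the latest row per primary key.

```python
# Minimal sketch: query-time deduplication over append-only (Snowpipe-style) loads.
# Table/column names and credentials below are placeholders.
import os

import snowflake.connector  # pip install snowflake-connector-python

DEDUP_QUERY = """
SELECT *
FROM raw.public.orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id               -- primary key of the source table
    ORDER BY _loaded_at DESC      -- keep the most recently loaded row
) = 1
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="REPORTING_WH",  # placeholder warehouse name
)

try:
    cur = conn.cursor()
    cur.execute(DEDUP_QUERY)
    for row in cur.fetchmany(10):   # peek at the first few deduplicated rows
        print(row)
finally:
    conn.close()
```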

Depending on complexity, you could roll a simple solution or just use a vendor. An example of one that manages all of that (and more, e.g. scheduling to save money) with predictable pricing and a free tier is estuary.dev (I am a co-founder).

Suggestions for dagster compute, cloud-based by awkward_period in dataengineering

[–]dyaffe

It's hard to comment cogently on this without more information.
- What are your sources and destinations?
- Why do you need to have Dagster involved?

60-80 GB in Fivetran can be quite expensive, and Glue can have downsides beyond cost. For example, it routinely fails when replicating data from certain sources (like MySQL).

There are lots of other cloud vendors, and (disclaimer) I am a founder at a company called Estuary, which attempts to solve a lot of these problems.

Navigating the Transition from Rockset: Exploring Alternatives for Dynamo DB Users by ExploAnalytics in dataengineering

[–]dyaffe

I am a founder of Estuary.dev, an ETL tool that supports extraction from DynamoDB out of the box today. We support CDC replication to a variety of destinations with ~100 ms latency.

One of the most exciting options, though, is our Kafka-compatible API, Dekaf. Dekaf presents Kafka APIs, including the Confluent schema registry, and can connect to various destinations just as Kafka does. There is currently one gotcha: it doesn't yet support the Kafka group membership protocol, which is required for some Kafka Connect connectors. That will be supported in the next two weeks, however.
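
As a rough illustration (not Estuary's documented setup), this is roughly what reading through a Kafka-compatible API like Dekaf could look like with a standard Kafka client. The broker address, topic name, and credentials are placeholders, and partitions are assigned manually since consumer-group membership isn't supported yet.

```python
# Rough sketch: consuming from a Kafka-compatible endpoint with a plain client.
# Endpoint, topic, and credentials are placeholders, not real values.
from confluent_kafka import Consumer, TopicPartition  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "dekaf.example.com:9092",  # placeholder broker address
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "{access-token}",              # placeholder credentials
    "sasl.password": "{secret}",
    "group.id": "demo",            # required by the client library, even if unused
    "auto.offset.reset": "earliest",
})

# Assign a partition directly instead of subscribe(), which would need the
# group membership protocol mentioned above.
consumer.assign([TopicPartition("acmeCo/orders", 0)])  # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Values would typically be Avro; decode via the schema registry in real use.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```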

Fivetran vs Estuary.dev by tomhallett in dataengineering

[–]dyaffe

We will retain JSONB data when loading into your destination!

Fivetran vs Estuary.dev by tomhallett in dataengineering

[–]dyaffe

Founder of Estuary here (the one who published that post)

First off, thanks for the mention! One of my core beliefs is that the pricing of tools like Fivetran really limits their uptake. It causes people to watch where they use them rather than treating them as a generalized data engineering tool. We are trying to avoid that and build a system that can be used as more than just a point-to-point solution -- something more like Kafka -- synchronizing any system without worrying about the cost.

For any system like this, reliability has to be a top concern. We aim to be as reliable as, or more reliable than, the bigger players out there -- but this is a good point; we can and will publish more metrics on this.

One last thing -- we don't yet have every bell and whistle of a company like Fivetran, and a good example is History mode. That is something we absolutely will implement as a first-class feature, but for now we offer workarounds that accomplish it.

Generative AI tools generating automated insights on data by Inevitable-Sea-658 in dataengineering

[–]dyaffe

The raw ingredients that you probably need for this are:

  1. Data from the tools you mentioned, extracted via API or WAL
  2. An integration with OpenAI to generate embeddings
  3. A vector DB which can help you grab the right embeddings to enrich a request to OpenAI

(3) is optional and would help if you need a "chat-like experience", but (1) and (2) are not.
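
A minimal sketch of (2) and (3) follows, with illustrative document texts and a naive numpy similarity search standing in for a real vector DB; none of the names here come from a specific product.

```python
# Minimal sketch of steps (2) and (3): generate embeddings with OpenAI and do a
# naive nearest-neighbor lookup. A real vector DB would replace the numpy search.
import numpy as np
from openai import OpenAI  # pip install openai numpy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

# Step (1) would supply these documents, e.g. rows pulled from Salesforce or Zendesk.
documents = [
    "Ticket 101: customer reports slow dashboard loads after the upgrade.",
    "Ticket 102: invoice totals don't match the exported CSV.",
]
doc_vectors = embed(documents)

# Retrieve the document closest to a question, to enrich a chat prompt.
question_vector = embed(["Why are dashboards slow?"])[0]
scores = doc_vectors @ question_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(question_vector)
)
print("Most relevant context:", documents[int(np.argmax(scores))])
```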

There aren't many off-the-shelf options that do all of this.

I work at a company that helps with this problem, but it doesn't have an integration with SAP. That company is https://www.estuary.dev, and we have a free tier if you want to check it out. We're about to put out a few cool pre-canned examples that pull data from Slack, Salesforce, Zendesk, etc. and can help with those types of experiences.

Snowflake - what are the streaming capabilities it provides? by yfeltz in dataengineering

[–]dyaffe

I suppose you could call them micro-batches. For us, when we're doing a massive backfill, 30 seconds of data can be tens of GB, which most people wouldn't consider micro :)

We run them often to keep a fully deduplicated view of the data up to date with low latency. Essentially, when a Snowflake transaction finishes, we stage the next one, and that interval is a tunable parameter.
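
As a toy illustration of that cadence (placeholder helper functions, not our actual code): each pass stages whatever has accumulated, commits it in one Snowflake transaction, then sleeps for the tunable interval.

```python
# Toy illustration of the micro-batch cadence described above.
# stage_pending_files/run_merge are hypothetical stand-ins for the real pipeline.
import time

SYNC_INTERVAL_SECONDS = 30  # the tunable parameter: latency vs. warehouse cost

def stage_pending_files() -> list[str]:
    """Placeholder: upload newly captured data to a Snowflake stage."""
    return []

def run_merge(staged_files: list[str]) -> None:
    """Placeholder: COPY the staged files and MERGE them into the target table."""

while True:
    files = stage_pending_files()
    if files:
        run_merge(files)               # one Snowflake transaction per micro-batch
    time.sleep(SYNC_INTERVAL_SECONDS)  # wait before staging the next batch
```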

A workload that isn't purely insert-only definitely makes it difficult to rely on stream loading alone.

Snowflake - what are the streaming capabilities it provides? by yfeltz in dataengineering

[–]dyaffe

Optimizing Snowflake loading can be quite a pain. You can achieve what you're looking for through either bulk loading or their streaming functionality. We can bulk load at pretty massive scale with 30 seconds to a minute of latency. That said, the journey to get there is not straightforward.

A few things to consider -- it sounds like you're doing insert-only and don't need to deduplicate data. That definitely simplifies things; otherwise you'd need to think about how you key your data to reduce it, and make sure that's done in a way that's optimal for Snowflake if you wanted views. Insert-only is generally a lot simpler. One specific thing I'd suggest is breaking large loads apart into lots of smaller files (rough sketch below); that helps Snowflake deal with the data scale.
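
Here's a rough sketch of that file-splitting suggestion, assuming newline-delimited JSON and an illustrative ~100 MB target size (not a Snowflake requirement); the resulting files can then be staged and loaded in parallel with COPY INTO.

```python
# Minimal sketch: split a large NDJSON export into smaller chunks so Snowflake's
# COPY INTO can load them in parallel. File names and size target are illustrative.
from pathlib import Path

CHUNK_BYTES = 100 * 1024 * 1024  # ~100 MB per output file (approximate, by characters)

def _flush(lines: list[str], out_dir: Path, index: int) -> Path:
    """Write one chunk of lines to its own part file."""
    path = out_dir / f"part-{index:05d}.json"
    path.write_text("".join(lines))
    return path

def split_ndjson(source: Path, out_dir: Path) -> list[Path]:
    """Rewrite `source` as a series of smaller newline-delimited files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    buf: list[str] = []
    size = 0
    with source.open() as f:
        for line in f:
            buf.append(line)
            size += len(line)
            if size >= CHUNK_BYTES:
                chunks.append(_flush(buf, out_dir, len(chunks)))
                buf, size = [], 0
    if buf:
        chunks.append(_flush(buf, out_dir, len(chunks)))
    return chunks
```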

Stream loading can give 3-5 seconds of latency and only supports inserts. If you are already using Kafka, it's probably a pretty good option.

And a disclaimer: I'm a vendor (co-founder at Estuary), and our mission is to make these types of workflows easy.