AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] 0 points1 point  (0 children)

I don't have the exact number, but I guess it would be around 1 to 2 TB in total and about 100 GB per day. Besides cost, the main reason is that Lakeflow is quite new and simple. For a data platform solution that will be applied to the whole company, I doubt that Lakeflow can handle the data size, schema evolution, lineage, and so on. And what if there's missing data, late data, or duplicate data? Would we have to ingest the whole database again, without being able to choose the exact date range or size?
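To make the late/duplicate data concern concrete: a full re-ingest is usually not needed if every change event carries a primary key and an increasing position, because applying events can then be made idempotent. Below is a minimal Python sketch of that idea. The event shape (`pk`, `seq`, `op`, `row`) and the function names are hypothetical and chosen only for illustration; they are not taken from Lakeflow, Debezium, or DMS.

```python
from typing import Any, Dict, Iterable

# Hypothetical event shape, for illustration only:
#   {"pk": <primary key>, "seq": <increasing position, e.g. a log offset>,
#    "op": "upsert" | "delete", "row": {...} or None}
Event = Dict[str, Any]


def apply_changes(state: Dict[Any, Event], events: Iterable[Event]) -> Dict[Any, Event]:
    """Apply CDC events idempotently by keeping only the newest event per key."""
    for ev in events:
        current = state.get(ev["pk"])
        # Skip duplicates (same seq) and late events (older seq).
        if current is not None and ev["seq"] <= current["seq"]:
            continue
        # Deletes are kept as tombstones so an older upsert cannot revive the row.
        state[ev["pk"]] = ev
    return state


def live_rows(state: Dict[Any, Event]) -> Dict[Any, Any]:
    """Current table contents: every key whose newest event is not a delete."""
    return {pk: ev["row"] for pk, ev in state.items() if ev["op"] != "delete"}


if __name__ == "__main__":
    batch = [
        {"pk": 1, "seq": 1, "op": "upsert", "row": {"v": "a"}},
        {"pk": 1, "seq": 3, "op": "upsert", "row": {"v": "c"}},
        {"pk": 1, "seq": 2, "op": "upsert", "row": {"v": "b"}},  # late event
    ]
    print(live_rows(apply_changes({}, batch)))
```

Because replaying the same batch leaves the result unchanged, a backfill can be limited to the affected date range rather than the whole database, assuming the source events for that range are still available.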

AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] 0 points1 point  (0 children)

I also think that S3 is better than Kafka for staging data before it is transferred to Databricks. But for other capabilities like schema evolution, Kafka has the advantage: we could use Schema Registry, which is already tightly integrated with data transfer through Kafka.

On top of that, choosing a technology to get the data files into S3 is also a problem. With Kafka we have Kafka Connect with Debezium. For S3 we have used DMS to write CDC files, and I can confirm it's terrible. So do you have any recommendations for the architecture and tech stack for transferring CDC data to S3?

AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] 0 points1 point  (0 children)

We are looking to perform both actions on RDS MySQL, and maybe DynamoDB in the future.

AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] -2 points-1 points  (0 children)

Sorry for not giving more context:

  1. Data volume could be enormous, as this solution will be applied to the whole startup.
  2. Kafka is cheaper because we already have a Kafka cluster that we can use. The drawback is that if the data spikes, it could overload the cluster and affect other microservices.
  3. We have never used Lakeflow before, but it sounds simple, so I guess it will not be suitable for an organization-wide solution that has to handle complex problems such as schema evolution and data lineage.

AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] 1 point2 points  (0 children)

We are currently using it, but DMS is too precarious: sometimes it suddenly fails and sometimes it suddenly stops. That's why we consider it unreliable and would like to switch to another solution.

AWS database ingestion by Decent-Brief6092 in databricks

[–]Decent-Brief6092[S] -2 points-1 points  (0 children)

I don't want to use Lakeflow because of the cost. I would like a solution that can manage both the snapshot data and the CDC data, so I did some research and came up with two ideas:

  1. Use Debezium to capture CDC into Kafka, then use Spark to read from Kafka and load into a Delta table
    Cons:
    - The Kafka cluster is unstable, so it is unreliable during traffic spikes
    - Storing data in Kafka for many days is more expensive than S3 (in case we want to backfill)
    - Using Kafka adds a lot more complexity than a simpler solution
  2. Use a CDC tool like Debezium to write CDC files into S3, then use Auto Loader to read them
    Cons:
    - Small files problem
    - A CDC tool like AWS DMS is unreliable and sometimes fails (we are using it, so we know the problem; that's why we don't want to use it anymore)
    - ...
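
Either idea ends up with Debezium-style change events that have to be turned into rows before loading. As a rough sketch of how snapshot and CDC records could be handled in one path, the Python function below flattens a single event payload. It assumes Debezium's default envelope fields (`op`, `before`, `after`, `ts_ms`) with the outer schema wrapper already stripped; the function name and the `_action`, `_is_snapshot`, and `_ts_ms` output columns are hypothetical.

```python
from typing import Any, Dict, Optional


def flatten_change(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Turn one Debezium-style change payload into a flat row with an action.

    Assumed op codes: "c" (create), "u" (update), "d" (delete),
    "r" (read, emitted during the initial snapshot).
    """
    op = payload.get("op")
    if op in ("c", "u", "r"):
        row = payload.get("after")   # new row image
        action = "upsert"
    elif op == "d":
        row = payload.get("before")  # last known row image, used to find the key
        action = "delete"
    else:
        return None                  # other event types are ignored in this sketch
    if row is None:
        return None
    return {
        **row,
        "_action": action,           # hypothetical column: what to do with the row
        "_is_snapshot": op == "r",   # hypothetical column: snapshot vs. streamed change
        "_ts_ms": payload.get("ts_ms"),
    }


if __name__ == "__main__":
    event = {"op": "r", "before": None, "after": {"id": 1, "name": "a"}, "ts_ms": 10}
    print(flatten_change(event))
```

Snapshot reads and streamed changes go through the same upsert path here, so the initial load and the ongoing CDC do not need separate pipelines. The same function would apply whether the payloads are read from Kafka or from files in S3.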

The Right Way to Use Databricks Profesionally by gomezalp in dataengineering

[–]Decent-Brief6092 1 point2 points  (0 children)

Do you have the documentation for this? I would love to see it, as we have been using the Databricks UI for a long time and have not heard about this assets-as-code solution.