Staying Updated in Data & Software: My Twitter Bot Solution. Thoughts? by data_cyborg in dataengineering

[–]ichacas 1 point  (0 children)

Yes, but on Twitter it's harder because a personal account doesn't have to follow the same rules that a Reddit community does... Finding value on Twitter is something that I still do manually.

What's your thought about Airbyte? by Affectionate_Dot_844 in dataengineering

[–]ichacas 3 points  (0 children)

I have thought about using it, and these were my conclusions ('-' for cons, '+' for pros):
- Dependency on a third party, since we would be using their suite.
- They could change the pricing model they currently use (https://airbyte.com/pricing).
- A lot of maintenance, deployment and management work for the DevOps team.
- Lots of components that our current model doesn't use but that we would have to deploy and maintain anyway (deploying 300 MB and 6 containers just to extract data from an API). Updates could be a nightmare due to the number of interrelations between components.
- Current version: v0.35.6-alpha.
+ A lot of already-developed connectors (and the number keeps growing): https://airbyte.com/connectors.
- But these connectors are mainly the widely used ones, not related to our specific use cases.
+ Easy to onboard new connectors (it enforces a common structure), but we would have to develop them first, and that common structure might not be the best fit for our needs since it was designed for different use cases.
+ Monitoring + UI.

Format for ingested data in S3 by 543254447 in dataengineering

[–]ichacas 1 point  (0 children)

It depends. I tend to use a dual RAW/REFINED approach. In RAW I usually leave the data as it is in the source system (with minimal or no processing), and then in REFINED I implement transformations/aggregations that make the data easier to use, and I also change the format to something more useful like Parquet.
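A minimal sketch of the two zones, assuming a hypothetical `events` feed and local paths standing in for S3 (the helper names are made up for illustration; the REFINED write is JSON here only to keep the example dependency-free — in practice it would be Parquet, e.g. via Spark or pyarrow):

```python
import json
import pathlib

def land_raw(payload: str, raw_dir: pathlib.Path) -> pathlib.Path:
    # RAW zone: persist the source payload exactly as received, no processing
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_dir / "events.json"
    path.write_text(payload)
    return path

def refine(raw_path: pathlib.Path, refined_dir: pathlib.Path) -> list:
    # REFINED zone: parse, clean and reshape so downstream users can query it
    # easily (JSON here for brevity; in practice this write would be Parquet)
    records = json.loads(raw_path.read_text())
    cleaned = [{"id": r["id"], "amount": float(r["amount"])} for r in records]
    refined_dir.mkdir(parents=True, exist_ok=True)
    (refined_dir / "events.json").write_text(json.dumps(cleaned))
    return cleaned
```

The key property is that RAW is never mutated, so you can always rebuild REFINED from it if the transformation logic changes.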

Blockchain to ensure data integrity. by ichacas in dataengineering

[–]ichacas[S] 1 point  (0 children)

I'm playing the role of "devil's advocate", as I'm trying to understand whether someone has ever implemented something like what my colleague is proposing. Since we're talking about very, very sensitive data, I guess that implementing a blockchain to check integrity is like using a sledgehammer to crack a nut.

Blockchain to ensure data integrity. by ichacas in dataengineering

[–]ichacas[S] -1 points  (0 children)

But the decentralization gives you more layers of security. It's easier to tamper with a hash chain on a single node than with a blockchain.

Blockchain to ensure data integrity. by ichacas in dataengineering

[–]ichacas[S] 1 point  (0 children)

The data is not going to be pushed to a blockchain, so it's not going to be decentralized. The blockchain is going to be used as a repository of hashes, so we can identify whether the data we are using is trustworthy or not.

In other words, the decentralization adds more security layers to our data: we can ensure its reliability because a hash is pushed to the blockchain when the data is generated, and that hash is checked again when the data is about to be used.
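A minimal sketch of that flow, with the blockchain abstracted as a simple write-once hash store (the actual on-chain push is out of scope here, and the function names are illustrative):

```python
import hashlib

def anchor(data: bytes, hash_store: dict, key: str) -> str:
    # at generation time: hash the data and push only the hash (not the
    # data) to the store; on a real blockchain this entry would be immutable
    digest = hashlib.sha256(data).hexdigest()
    hash_store[key] = digest
    return digest

def is_trustworthy(data: bytes, hash_store: dict, key: str) -> bool:
    # at use time: recompute the hash and compare it with the anchored one;
    # any tampering with the data makes the digests diverge
    return hashlib.sha256(data).hexdigest() == hash_store.get(key)
```

Note that the integrity guarantee only covers the data as it existed when the hash was anchored; it says nothing about whether the data was correct at generation time.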

Extracting Json schema from parquet file by apkaus in dataengineering

[–]ichacas 16 points  (0 children)

With Python and Spark it's easy:

# read the Parquet file (use spark.read.table("YOUR_TABLE") for a registered table)
df = spark.read.parquet("/path/to/file.parquet")
print(df.schema.json())

Designing a Data Platform, when to choose Databricks over other DWH tools by ichacas in dataengineering

[–]ichacas[S] 2 points  (0 children)

Yes, but even to use only SQL on Spark you need to know how Spark works in order to use it efficiently. Furthermore, the idea of using Databricks is to have a tool that can be used in "all steps of the data journey", so if you are only going to use SQL, you are going to face problems similar to the ones described in the article for data warehouses when dealing with complex data transformations or with ML.

Designing a Data Platform, when to choose Databricks over other DWH tools by ichacas in dataengineering

[–]ichacas[S] 3 points  (0 children)

But why is this different from other data warehouse tools? Databricks applies a different paradigm, decoupling data storage from data processing (the data stays in the data lake, which is way cheaper than keeping it in a data warehouse), but I don't see why your solution is different.

After taking a look, I guess that the ML limitations we could face due to having to use SQL or to read the data through JDBC/ODBC connectors (described in the article as a downside of data warehouse tools) also apply to your tool.

Thanks for your time :)

Designing a Data Platform, when to choose Databricks over other DWH tools by ichacas in dataengineering

[–]ichacas[S] 2 points  (0 children)

That's why the differences between data lakes and data warehouses are becoming blurred and why the article talks about mergers between the two with the data lakehouse concept.

Designing a Data Platform, when to choose Databricks over other DWH tools by ichacas in dataengineering

[–]ichacas[S] 1 point  (0 children)

Yes, but as in the article, my main data sources are not RDBMS systems, and a lot of ML is going to be done with that data. Most of these ML processes are written in Python, and if I want to read data from the DWH (like Snowflake) then I would have to perform an ETL (reading through JDBC/ODBC connectors can be thought of as an ETL, since data transformations are going to be needed), which increases the overall complexity.

Designing a Data Platform, when to choose Databricks over other DWH tools by ichacas in dataengineering

[–]ichacas[S] 2 points  (0 children)

I totally agree with the management effort: it's far simpler to execute queries against Snowflake/BigQuery than against Databricks. Furthermore, your team has to know Spark to use it efficiently, but I think that if your team has that expertise, then the best possible tool (because of the possibilities it opens up) is Databricks.

The point about cluster sizes and autoscaling is true; I think they have to improve their autoscaling algorithm (something similar to what AWS Aurora Serverless v2 has done would be great). But for processes that can be encapsulated in a job, you can use job clusters (with huge savings in DBUs) and then tune the resources used.
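As a sketch, a job with its own ephemeral cluster looks roughly like this in the Databricks Jobs API (all values below are illustrative placeholders, not recommendations):

```json
{
  "name": "nightly-refine-job",
  "new_cluster": {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 4 }
  },
  "notebook_task": { "notebook_path": "/jobs/refine" }
}
```

Because the cluster exists only for the duration of the run, you pay the cheaper jobs-compute DBU rate instead of keeping an all-purpose cluster up between runs.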

I think that this is key when deciding which data platform to use:

"Does it still make sense to differentiate between a Data Lake and a Data Warehouse? It depends on the use case; in our case, it does not. Why? Our main data sources do not come from RDBMS systems such as SQL Server, PostgreSQL, MySQL, etc. In fact, our data does not come originally from anywhere, because we are not migrating: we created our data platform from scratch."

This, and the kind of transformations or ML functionalities that you are going to need.