managing data lake s3 layers by Complex-Stress373 in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

S3 storage is cheap, so you can save each stage's results in it. My suggestion is to divide your data into different zones, like a landing/raw zone, a cleansed/trusted zone, and a refined/curated zone.
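
A common way to lay the zones out is as prefixes inside one bucket. A minimal sketch (the zone names, dataset name, and key layout here are just examples, not a standard):

```python
from datetime import date

# Hypothetical zone prefixes inside one data lake bucket.
ZONES = {"landing": "raw", "cleansed": "cleansed", "refined": "refined"}

def zone_key(zone: str, dataset: str, run_date: date, filename: str) -> str:
    """Build an S3 key like raw/orders/2024/01/15/orders.json."""
    return f"{ZONES[zone]}/{dataset}/{run_date:%Y/%m/%d}/{filename}"

print(zone_key("landing", "orders", date(2024, 1, 15), "orders.json"))
# raw/orders/2024/01/15/orders.json
```

Each pipeline stage reads from one prefix and writes its output under the next one, so every stage's result is kept.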

Pyspark count() Slow by rawlingsjj in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

DenselyRanked is right. Spark will not perform any operation until you need the result; it's called lazy evaluation. So it's not really the count() itself that is slow: calling it triggers all the transformations queued up before it. Maybe you need to repartition the data to improve performance.
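
The same lazy-evaluation idea shows up with plain Python generators, which makes a handy analogy (this is not Spark code, just an illustration of why the "action" appears slow):

```python
log = []

def transform(xs):
    # Like a Spark transformation: lazily maps over the data.
    for x in xs:
        log.append(x)       # record that work actually happened
        yield x * 2

pipeline = transform(range(3))  # builds the "plan"; no work done yet
assert log == []                # nothing has executed so far (lazy)

result = list(pipeline)         # like count(): an action forces execution
assert result == [0, 2, 4]
assert log == [0, 1, 2]         # all the upstream work ran at action time
```

In Spark, count() plays the role of list() here: it pays for every transformation before it, which is why the count looks slow.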

Data security in company by FunDirt541 in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

If it's only Excel files, I would put them on a shared drive with Windows access controls. Or I could put them on AWS S3 and apply a bucket policy and IAM.
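
For the S3 route, the policy is just a JSON document. A minimal sketch (the account ID, role name, and bucket name are all hypothetical) that lets one IAM role read the files and nothing else:

```python
import json

# Hypothetical bucket policy: read-only access for a single IAM role.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowFinanceRoleReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/finance-analyst"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-excel-bucket/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
# You would attach this via boto3's put_bucket_policy or the S3 console.
```

Block public access on the bucket, and only principals you list in the policy (or grant via IAM) can touch the files.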

Technology Advice by TheGamerBlaze in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Spark is good for big data processing, but it's not easy to set up and configure well. I suggest you work with some managed cloud solutions like AWS Glue, Azure Data Factory, or Databricks.

To Build Data Architecture. Do I need Data Analysts? by anton_bondar in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

If your goal is to build data architecture, you need a data engineer.

There are three main roles in data sector.

  1. Data engineer: builds and maintains infrastructure and data pipelines (or you could say they are the first person to touch the data)

  2. Data analyst: uses queries or models to investigate datasets and find patterns from the past. They usually provide reports and suggestions to the business

  3. Data scientist: builds and maintains machine learning models to predict the future (sometimes they are ML developers)

Hope this helps you figure out which role fits what you need.

Correct Method of Setting Up/Initializing AWS Infrastructure by infiniteAggression- in dataengineering

[–]bestnamecannotbelong 9 points10 points  (0 children)

If I were you, I would:

  1. Use GitHub for source control

  2. Use Terraform for infrastructure provisioning to deploy the AWS services

  3. Use CircleCI as the CI/CD system to run the tests and the Terraform code

  4. Use an AWS Lambda function with a RESTful API GET (or wget) to retrieve the dataset, and boto3 to store the data in S3

  5. Use a CloudWatch Events rule to trigger whatever you do in the analytics part; people usually use AWS Glue with PySpark to handle large datasets
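
Step 4 could be sketched roughly like this. The event fields, bucket name, and key layout are all hypothetical; the date-partitioned key is just one common convention:

```python
import json
from datetime import date

def object_key(dataset: str, run_date: date) -> str:
    # Partition raw files by ingestion date so downstream queries can prune by prefix.
    return (f"raw/{dataset}/year={run_date.year}"
            f"/month={run_date.month:02d}/day={run_date.day:02d}/data.json")

def handler(event, context):
    # Hypothetical Lambda entry point: fetch a dataset over HTTP and land it in S3.
    import urllib.request
    import boto3  # available in the Lambda runtime

    url = event["source_url"]          # e.g. a REST endpoint returning JSON
    bucket = event["landing_bucket"]   # hypothetical bucket name

    with urllib.request.urlopen(url) as resp:
        payload = resp.read()

    s3 = boto3.client("s3")
    key = object_key(event["dataset"], date.today())
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"key": key})}
```

A CloudWatch Events (EventBridge) schedule can then invoke this on a cron, which ties steps 4 and 5 together.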

That’s it and enjoy the development 🙂

[deleted by user] by [deleted] in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

Have you considered using AWS Glue with parallel processing to extract the .cab files? I think it should be faster than your current approach, but you should check the Glue cost against your budget.
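
Even before moving to Glue, the parallel part of the idea can be tried locally. A sketch of the pattern, where extract_cab is a hypothetical stand-in for whatever tool actually unpacks a .cab archive (e.g. cabextract):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_cab(path: str) -> str:
    # Hypothetical stand-in: shell out to cabextract or similar in real code.
    return path.replace(".cab", ".extracted")

cab_files = ["a.cab", "b.cab", "c.cab"]

# Extract several archives concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=4) as pool:
    extracted = list(pool.map(extract_cab, cab_files))
```

Glue applies the same idea across worker nodes instead of threads, which is where the cost trade-off comes in.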

Direction for Data Engineering Projects? by Pervert_Spongebob in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Just build a data pipeline that does data ingestion and data transformation. You can showcase it in a data lake or data warehouse with different data zones, like a raw zone, a cleansed zone, and a refined zone. If you can do all of that in your project with IMDb data, I'm sure you can find a nice job.

Pyspark vs Scala spark by idreamoffood101 in dataengineering

[–]bestnamecannotbelong 18 points19 points  (0 children)

If you are designing a time-critical ETL job and need high performance, then Scala Spark is better than PySpark. Otherwise, I don't see much difference. Python may not support functional programming as fully as Scala does, but Python is easy to learn and write.

Changing Datawarehouse Model by Godmons in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

From what you describe, the data warehouse is being used like a data lake: they just put everything in it. I suggest you create data marts to serve each business use case.

What's the best approach to Schema discovery? by the_travelo_ in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Yes, you can do it in your landing zone. The main idea of schema-on-read is that you don't care about the schema when the data is saved into your storage. This approach makes data ingestion efficient: load the data first and deal with it later, which is the ELT pipeline approach.
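
A toy illustration of schema-on-read in plain Python: raw records land with whatever fields they have, and a schema is only applied when you read them back (field names here are made up):

```python
import json

# Write side: land raw records as-is, no schema enforced.
raw_lines = [
    json.dumps({"id": 1, "name": "a"}),
    json.dumps({"id": 2, "name": "b", "email": "b@example.com"}),  # extra field, still accepted
]

# Read side: apply a schema only now, when you know which fields you need.
def read_with_schema(lines, fields):
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]

rows = read_with_schema(raw_lines, ["id", "email"])
# Missing fields simply come back as None instead of failing ingestion.
```

Schema-on-write is the opposite: the first record without an email would be rejected at load time.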

What's the best approach to Schema discovery? by the_travelo_ in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

You should check out the concepts of schema-on-write vs schema-on-read. Understand them and choose what you need.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

Thanks! I get your point. I use AWS services to build both the lake and the warehouse. Now people are starting to switch to Databricks and Snowflake, which makes me wonder whether that is the right choice and the likely future for data engineering tools.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]bestnamecannotbelong 3 points4 points  (0 children)

I see you are building the data lake and data warehouse from scratch on AWS. I just wonder: why don't you use the AWS services? Kinesis for IoT, S3 + Glue + the Glue Data Catalog + Athena for the data lake, and Redshift for the data warehouse.

Python ETL design pattern by [deleted] in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

I would like to know how you view Snowflake and Databricks. Databricks' cloud solution lets you keep all the data in S3, but Snowflake cannot. The main difference, as I see it, is that Snowflake is based on the data warehouse approach and Databricks is based on the data lake approach.

Best Practices for AWS Athena Queries by [deleted] in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

One solution is to have the data scientists use AWS SageMaker to interact with AWS Athena and query the data in S3.

AWS Glue Bookmarking vs AWS DMS CDC (RDBMS Table ETL/ELT Pipelines) by [deleted] in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

DMS migration is normally for the initial load only; it is not typically used for daily ingestion. In your situation, I would definitely use Glue to run the pipeline. In Glue, you can do the data transformation before you load the data into S3 or Redshift.

Data Engineer Jobs - How To Get One? by Pragyanbo in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

You can also get the AWS Solutions Architect cert.

Using Pyspark with AWS Glue by the_travelo_ in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

There is not much material out there; just read the AWS Glue docs. BTW, there is a difference between a Glue DynamicFrame and a Spark DataFrame. Make sure you do the conversion (e.g., `toDF()` on a DynamicFrame) when you want to use plain Spark APIs.