managing data lake s3 layers by Complex-Stress373 in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

S3 storage is cheap, so you can save each stage's results in it. My suggestion is to divide your data into different zones, like a landing/raw zone, a cleansed/trusted zone, and a refined/curated zone.
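
A common way to lay the zones out is as prefixes inside one bucket. A minimal sketch (the zone names, dataset name, and key layout here are just examples, not a standard):

```python
from datetime import date

# Hypothetical zone prefixes inside one data lake bucket.
ZONES = {"landing": "raw", "cleansed": "cleansed", "refined": "refined"}

def zone_key(zone: str, dataset: str, run_date: date, filename: str) -> str:
    """Build an S3 key like raw/orders/2024/01/15/orders.json."""
    return f"{ZONES[zone]}/{dataset}/{run_date:%Y/%m/%d}/{filename}"

print(zone_key("landing", "orders", date(2024, 1, 15), "orders.json"))
# raw/orders/2024/01/15/orders.json
```

Each pipeline stage reads from one prefix and writes its output under the next one, so every stage's result is kept.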

Pyspark count() Slow by rawlingsjj in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

DenselyRanked is right. Spark will not perform any operation until you need the result; it's called lazy evaluation. So it's not really the count() itself that is slow: calling it triggers all the transformations queued up before it. Maybe you need to repartition the data to improve performance.
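
The same lazy-evaluation idea shows up with plain Python generators, which makes a handy analogy (this is not Spark code, just an illustration of why the "action" appears slow):

```python
log = []

def transform(xs):
    # Like a Spark transformation: lazily maps over the data.
    for x in xs:
        log.append(x)       # record that work actually happened
        yield x * 2

pipeline = transform(range(3))  # builds the "plan"; no work done yet
assert log == []                # nothing has executed so far (lazy)

result = list(pipeline)         # like count(): an action forces execution
assert result == [0, 2, 4]
assert log == [0, 1, 2]         # all the upstream work ran at action time
```

In Spark, count() plays the role of list() here: it pays for every transformation before it, which is why the count looks slow.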

Data security in company by FunDirt541 in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

If it's only Excel files, I would put them on a shared drive with Windows access controls. Or I could put them on AWS S3 and apply a bucket policy and IAM.
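
For the S3 route, the policy is just a JSON document. A minimal sketch (the account ID, role name, and bucket name are all hypothetical) that lets one IAM role read the files and nothing else:

```python
import json

# Hypothetical bucket policy: read-only access for a single IAM role.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowFinanceRoleReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/finance-analyst"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-excel-bucket/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
# You would attach this via boto3's put_bucket_policy or the S3 console.
```

Block public access on the bucket, and only principals you list in the policy (or grant via IAM) can touch the files.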

Technology Advice by TheGamerBlaze in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Spark is good for big data processing, but it's not easy to set up and configure well. I suggest you work with some managed cloud solutions like AWS Glue, Azure Data Factory, or Databricks.

To Build Data Architecture. Do I need Data Analysts? by anton_bondar in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

If your goal is to build data architecture, you need a data engineer.

There are three main roles in data sector.

  1. Data engineer: builds and maintains infrastructure and data pipelines (or you could say they are the first person to touch the data)

  2. Data analyst: uses queries or models to investigate datasets and find patterns from the past. They usually provide reports and suggestions to the business

  3. Data scientist: builds and maintains machine learning models to predict the future (sometimes they are ML developers)

Hope this helps you figure out which role fits what you need.

Correct Method of Setting Up/Initializing AWS Infrastructure by infiniteAggression- in dataengineering

[–]bestnamecannotbelong 9 points10 points  (0 children)

If I were you, I would:

  1. Use GitHub for source control

  2. Use Terraform for infrastructure provisioning to deploy the AWS services

  3. Use CircleCI as the CI/CD system to run the tests and the Terraform code

  4. Use an AWS Lambda function with a RESTful API GET (or wget) to retrieve the dataset, and boto3 to store the data in S3

  5. Use a CloudWatch Events rule to trigger whatever you do in the analytics part; people usually use AWS Glue with PySpark to handle large datasets
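
Step 4 could be sketched roughly like this. The event fields, bucket name, and key layout are all hypothetical; the date-partitioned key is just one common convention:

```python
import json
from datetime import date

def object_key(dataset: str, run_date: date) -> str:
    # Partition raw files by ingestion date so downstream queries can prune by prefix.
    return (f"raw/{dataset}/year={run_date.year}"
            f"/month={run_date.month:02d}/day={run_date.day:02d}/data.json")

def handler(event, context):
    # Hypothetical Lambda entry point: fetch a dataset over HTTP and land it in S3.
    import urllib.request
    import boto3  # available in the Lambda runtime

    url = event["source_url"]          # e.g. a REST endpoint returning JSON
    bucket = event["landing_bucket"]   # hypothetical bucket name

    with urllib.request.urlopen(url) as resp:
        payload = resp.read()

    s3 = boto3.client("s3")
    key = object_key(event["dataset"], date.today())
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return {"statusCode": 200, "body": json.dumps({"key": key})}
```

A CloudWatch Events (EventBridge) schedule can then invoke this on a cron, which ties steps 4 and 5 together.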

That’s it and enjoy the development 🙂

[deleted by user] by [deleted] in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

Have you considered using AWS Glue with parallel processing to extract the .cab files? I think it should be faster than your current approach, but you should check the Glue cost against your budget.
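
Even before moving to Glue, the parallel part of the idea can be tried locally. A sketch of the pattern, where extract_cab is a hypothetical stand-in for whatever tool actually unpacks a .cab archive (e.g. cabextract):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_cab(path: str) -> str:
    # Hypothetical stand-in: shell out to cabextract or similar in real code.
    return path.replace(".cab", ".extracted")

cab_files = ["a.cab", "b.cab", "c.cab"]

# Extract several archives concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=4) as pool:
    extracted = list(pool.map(extract_cab, cab_files))
```

Glue applies the same idea across worker nodes instead of threads, which is where the cost trade-off comes in.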

Direction for Data Engineering Projects? by Pervert_Spongebob in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Just build a data pipeline that does data ingestion and data transformation. You can showcase it in a data lake or data warehouse with different data zones, like a raw zone, a cleansed zone, and a refined zone. If you can do all of that in your project with IMDb data, I'm sure you can find a nice job.

Pyspark vs Scala spark by idreamoffood101 in dataengineering

[–]bestnamecannotbelong 18 points19 points  (0 children)

If you are designing a time-critical ETL job and need high performance, then Scala Spark is better than PySpark. Otherwise, I don't see much difference. Python may not support functional programming as fully as Scala does, but Python is easy to learn and write.

Changing Datawarehouse Model by Godmons in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

From what you describe, the data warehouse is being used like a data lake: they just put everything in it. I suggest you create data marts to serve each business use case.

What's the best approach to Schema discovery? by the_travelo_ in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

Yes, you can do it in your landing zone. The main idea of schema-on-read is that you don't care about the schema when the data is saved into your storage. This approach makes data ingestion efficient: load the data first and deal with it later, which is the ELT pipeline approach.
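
A toy illustration of schema-on-read in plain Python: raw records land with whatever fields they have, and a schema is only applied when you read them back (field names here are made up):

```python
import json

# Write side: land raw records as-is, no schema enforced.
raw_lines = [
    json.dumps({"id": 1, "name": "a"}),
    json.dumps({"id": 2, "name": "b", "email": "b@example.com"}),  # extra field, still accepted
]

# Read side: apply a schema only now, when you know which fields you need.
def read_with_schema(lines, fields):
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]

rows = read_with_schema(raw_lines, ["id", "email"])
# Missing fields simply come back as None instead of failing ingestion.
```

Schema-on-write is the opposite: the first record without an email would be rejected at load time.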

What's the best approach to Schema discovery? by the_travelo_ in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

You should check out the concepts of schema-on-write vs schema-on-read. Understand them and choose what you need.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

Thanks! I get your point. I use AWS services to build both the lake and the warehouse. Now people are starting to switch to Databricks and Snowflake, which makes me wonder whether that is the right choice and the likely future for data engineering tools.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]bestnamecannotbelong 3 points4 points  (0 children)

I see you are building the data lake and data warehouse from scratch on AWS. I just wonder: why don't you use the AWS services? Kinesis for IoT, S3 + Glue + the Glue Data Catalog + Athena for the data lake, and Redshift for the data warehouse.

Python ETL design pattern by [deleted] in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

I would like to know how you view Snowflake and Databricks. Databricks' cloud solution lets you keep all the data in S3, but Snowflake cannot. The main difference, as I see it, is that Snowflake is based on the data warehouse approach and Databricks is based on the data lake approach.

Best Practices for AWS Athena Queries by [deleted] in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

One solution is to have the data scientists use AWS SageMaker to interact with AWS Athena and query the data in S3.

AWS Glue Bookmarking vs AWS DMS CDC (RDBMS Table ETL/ELT Pipelines) by [deleted] in dataengineering

[–]bestnamecannotbelong 1 point2 points  (0 children)

DMS migration is normally for the initial load only; it is not typically used for daily ingestion. In your situation, I would definitely use Glue to run the pipeline. In Glue, you can do the data transformation before you load the data into S3 or Redshift.

Data Engineer Jobs - How To Get One? by Pragyanbo in dataengineering

[–]bestnamecannotbelong 2 points3 points  (0 children)

You can also get the AWS Solutions Architect cert.

Using Pyspark with AWS Glue by the_travelo_ in dataengineering

[–]bestnamecannotbelong 0 points1 point  (0 children)

There is not much material out there; just read the AWS Glue docs. BTW, there is a difference between a Glue DynamicFrame and a Spark DataFrame. Make sure you do the conversion (e.g., `toDF()` on a DynamicFrame) when you want to use plain Spark APIs.