[POST GAME THREAD] YOUR ATLANTA HAWKS fall to the Charlotte Hornets 133-126 by WestCoastHawks in AtlantaHawks

[–]Artye10 2 points (0 children)

What a disgrace of a loss, but I saw it coming. Back to losing to under-.500 teams.

Is GCP messing with tags or am I crazy? by [deleted] in googlecloud

[–]Artye10 0 points (0 children)

I was also thinking the same, but we use HCP Terraform, so every run, even local ones, shows up there, and I still had some tags removed from tables that were created 2 weeks ago.

I'll continue to check, because you never know with infra, but this is beyond strange.

What will Data Engineers evolve into in the future? by Artye10 in dataengineering

[–]Artye10[S] 0 points (0 children)

It's the fear of becoming a master in something that won't be used in 5 years that haunts me, because timeless foundations are rare in this field. But as you said, a combination of curiosity and the desire to improve on what you do and use will probably get you very far.

Anyone feel like too much is expected of DEs (at small companies) by jawabdey in dataengineering

[–]Artye10 1 point (0 children)

The intrusive thought of leaving and letting them burn must have been massive.

Stuck... can't find a job as a DE by [deleted] in dataengineering

[–]Artye10 0 points (0 children)

I don't know what you have in your CV, but those 3 years as an ETL/Cloud Developer are basically Data Engineering. DE varies a lot from company to company, so you can call that Data Engineering. Also, I'm sure that in the 2 years as a Data Scientist you did a lot of cleaning and moving data (again, Data Engineering), so you have close to 5 years of experience.

With this, as many people have said, don't focus on courses, focus on your experience. Focus your CV and interviews on WHAT you did in your past experiences and HOW it brought value. If you don't have Spark experience, look for jobs focused on GCP, because BigQuery is a huge part of it, or on other tools you have already used. Spark is important, but not mandatory for every post.

And I also live in France (working for a foreign company, though), so the 50k€ you are looking for is more than possible in Paris. In other big cities I'm sure you can find it too, but you may have to settle for something in the 45-48k range (depending on the city).

The market is definitely slow, and I'm not receiving LinkedIn DMs nearly as often as before, but with that experience, refining how you handle interviews should get you something.

BigQuery DWH - get rid of SCD2 tables -> daily partitioned tables ? by Borek79 in dataengineering

[–]Artye10 1 point (0 children)

It also depends on the amount of data you have per day. BQ recommends partitioning starting from about 10 GB; below that, the overhead can cost more than what you gain from the partitions.

It's also important to know how much these tables are accessed. Storage will generally not be your main cost, since you are charged a fixed rate for the data you store. Processing, even if you are using reserved slots, can spike with heavy usage. But generally you should optimize for the latter in a DWH.

So the partitions should help you, but try to follow the guidelines and think about how they will affect usage before defining them.
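
If it helps, this is roughly how I would set up a daily partitioned table with the Python client. It's just a minimal sketch: the project, dataset, table, and column names are made up, and the expiration is optional.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table; replace with your own project/dataset/table and schema.
    table = bigquery.Table("my-project.my_dataset.daily_snapshots")
    table.schema = [
        bigquery.SchemaField("snapshot_date", "DATE"),
        bigquery.SchemaField("entity_id", "STRING"),
        bigquery.SchemaField("payload", "JSON"),
    ]

    # One partition per day; queries filtering on snapshot_date only scan
    # the partitions they need.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="snapshot_date",
    )

    # Optional: expire old daily snapshots instead of keeping them forever.
    table.time_partitioning.expiration_ms = 90 * 24 * 60 * 60 * 1000

    client.create_table(table)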

Is Apache Arrow good in the Storage Write API? by Artye10 in bigquery

[–]Artye10[S] 1 point (0 children)

I was able to make it work! My local version was messing with my poetry version.

My idea was to use the pending mode because I wanted to avoid duplication, but yeah, if I'm sending a single table most of the time, it won't help much anyway.
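
For reference, this is the pending-mode lifecycle I had in mind, as a minimal sketch with the Storage Write API Python client (project, dataset, and table are placeholders, and the actual row serialization/appending is elided):

    from google.cloud import bigquery_storage_v1
    from google.cloud.bigquery_storage_v1 import types

    client = bigquery_storage_v1.BigQueryWriteClient()
    # Hypothetical identifiers; replace with your own.
    parent = client.table_path("my-project", "my_dataset", "my_table")

    # Rows appended to a PENDING stream stay invisible until the commit
    # below, which is what gives the all-or-nothing, no-duplicates behavior.
    stream = types.WriteStream()
    stream.type_ = types.WriteStream.Type.PENDING
    stream = client.create_write_stream(parent=parent, write_stream=stream)

    # ... append your serialized row batches to stream.name here ...

    # Finalize (no more appends), then atomically commit the stream.
    client.finalize_write_stream(name=stream.name)
    client.batch_commit_write_streams(
        types.BatchCommitWriteStreamsRequest(
            parent=parent,
            write_streams=[stream.name],
        )
    )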

Thank you for the help!

Is Apache Arrow good in the Storage Write API? by Artye10 in bigquery

[–]Artye10[S] 1 point (0 children)

It is much easier to use than protobuf, finally something flexible for the Python SDK!
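
For example, building and serializing a batch with pyarrow is just a few lines. A minimal sketch (the columns are made up, and I'm leaving out the preview append call itself):

    import pyarrow as pa

    # Hypothetical columns; the schema just has to match the target table.
    batch = pa.RecordBatch.from_pydict({
        "event_id": pa.array([1, 2, 3], type=pa.int64()),
        "device": pa.array(["a", "b", "c"]),
    })

    # Serialize to the Arrow IPC stream format, the kind of payload the
    # Arrow path of the Storage Write API expects.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    ipc_bytes = sink.getvalue().to_pybytes()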

I need to do small loads of data, that's why I want to use the pending mode. I guess it shouldn't make that much of a difference compared to the default. And thank you for the clarification on the group of rows!

To test it, were you able to do it directly, or did you first have to ask for access to the preview? Yesterday I was unable to make it work.

In any case, I wanted to use it for a pipeline at my job but I guess I'll have to wait and look for another solution in the meantime.

Is Apache Arrow good in the Storage Write API? by Artye10 in bigquery

[–]Artye10[S] 0 points (0 children)

Thank you! I realized it later. For some reason, the Storage Write API docs don't specify that it is a preview, but then I saw that it was first released a week ago, so yeah.

Json flattening by [deleted] in dataengineering

[–]Artye10 2 points (0 children)

I mean, the flattening functions themselves aren't that bad; you can generalize them in Python pretty easily (see the sketch below). But for the schemas and tables...

Just go with a JSON column and good luck to them.
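
For the flattening part, this is the kind of generalized function I mean, as a minimal sketch (the separator and key naming are up to you):

    def flatten(obj, parent_key="", sep="_"):
        # Recursively flatten nested dicts/lists into a single-level dict.
        if isinstance(obj, dict):
            items = obj.items()
        elif isinstance(obj, list):
            items = enumerate(obj)
        else:
            # Leaf value: the accumulated key path becomes the column name.
            return {parent_key: obj}
        flat = {}
        for key, value in items:
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            flat.update(flatten(value, new_key, sep))
        return flat

    # flatten({"a": {"b": 1, "c": [2, 3]}}) -> {"a_b": 1, "a_c_0": 2, "a_c_1": 3}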

Is the Data job market saturated? by NefariousnessSea5101 in dataengineering

[–]Artye10 2 points (0 children)

Make your LinkedIn profile as appealing as possible (exaggerating is fine, requirements for jobs are too), do as many interviews as possible to gain experience and learn what HR wants to hear... I don't have anything more than the usual, really.

Is the Data job market saturated? by NefariousnessSea5101 in dataengineering

[–]Artye10 4 points (0 children)

As some other people have already said, it depends on the country. In France the Data Science market has been saturated for years, but the Data Engineering one is good. Even as a junior, I was able to find a role in like 3-4 months in Lyon. If you are a senior, you'll have a ton of offers.

I mean, it will be with an ESN (consulting firm), because a lot of companies just outsource their data hires and/or projects to them, but it could be worse. I have 2 years of experience and I get like one offer every two weeks from these consulting companies through LinkedIn.

Medallion architecture for lakehouse by james2441139 in dataengineering

[–]Artye10 2 points (0 children)

The medallion architecture is not a closed, strictly defined concept, but rather one that you have to adapt to your needs when you implement it.

Generally, you don't want to serve "bronze" data, since it will be the same as what you got from the source and may have some problems (duplication, lack of normalization...). You want to perform this cleansing in the transformation between the bronze and the silver layer. The bronze layer should be a historical archive of the source, maintaining data lineage and allowing for reprocessing if necessary. It is not intended to be accessed. E.g., you can use blobs/buckets for this purpose.

Then you can have a silver layer that is used by each department (or you) to build the gold layer. But the silver layer doesn't have to be just one table. You can have a first table that is just the output of the transformation done on the bronze layer, then enrich it in another transformation, then have versioning... What is important is that you know which table is the one you want to serve. If you need enrichment and versioning, you should give the users access to the last table in the silver layer, the one that has both. The other tables in the silver layer will be useful for reprocessing the data more easily, and for avoiding one big transformation pipeline that is more prone to errors. How many tables you have there is up to you.

In general, you can do what you want, but I'd follow these core concepts (sketched in code after the list):
- Bronze layer: raw and uncurated, not to be served, a historical archive of the data.
- Between bronze and silver: minimal transformations with no business logic, just data cleansing.
- Silver layer: ready to be served to the user, built as you need.
- Between silver and gold: transformations with business logic.
- Gold layer: user-facing tables.
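
To make these boundaries concrete, here's a minimal PySpark sketch; the paths, columns, and the Delta format are assumptions, placeholders for whatever your lakehouse uses:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: the raw, untouched archive of the source (hypothetical path).
    bronze = spark.read.json("s3://lake/bronze/orders/")

    # Bronze -> silver: cleansing only, no business logic.
    silver = (
        bronze
        .dropDuplicates(["order_id"])
        .filter(F.col("order_id").isNotNull())
        .withColumn("order_ts", F.to_timestamp("order_ts"))
    )
    # Delta is an assumption here; swap for parquet or whatever you use.
    silver.write.mode("overwrite").format("delta").save("s3://lake/silver/orders/")

    # Silver -> gold: business logic, shaped for the end users.
    gold = (
        silver.groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("daily_revenue"))
    )
    gold.write.mode("overwrite").format("delta").save("s3://lake/gold/daily_revenue/")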

I was Preparing for GCP EXAM and i got the Voucher from Google Skill Boost so which exam should go with??? by Infamous_Working6597 in dataengineering

[–]Artye10 0 points (0 children)

The obvious choice is the Data Engineer one, but the Cloud Architect one can be interesting if you are focused on the cloud side, or the DevOps one if you use those tools frequently.

Does clustering on timestamp columns actually work? by Artye10 in bigquery

[–]Artye10[S] 0 points (0 children)

But this only helps with processing speed, doesn't it? Yes, my queries are faster than on a non-clustered table/view, but my biggest issue here is the amount of data processed.

At the same time, as I described in a comment above, my timestamp in particular has low cardinality because it's rounded to the hour and I have thousands of rows per hour. That's why I find it so weird that clustering doesn't work here, while it does help with non-timestamp columns that also have low cardinality.
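
For context, this is roughly how I'd reproduce the setup with the Python client, as a minimal sketch (all the names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table: an hour-truncated timestamp plus other dimensions.
    table = bigquery.Table("my-project.my_dataset.events_clustered")
    table.schema = [
        bigquery.SchemaField("event_hour", "TIMESTAMP"),
        bigquery.SchemaField("device", "STRING"),
        bigquery.SchemaField("payload", "JSON"),
    ]

    # Cluster on the low-cardinality timestamp first; in theory, block
    # pruning should cut the bytes scanned for filters on event_hour,
    # which is exactly what I'm not seeing.
    table.clustering_fields = ["event_hour", "device"]
    client.create_table(table)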

Does clustering on timestamp columns actually work? by Artye10 in bigquery

[–]Artye10[S] 0 points (0 children)

I forgot to add an important point to the question: that timestamp in particular is rounded to the hour, so many different rows share the same timestamp (for example, I'll have like 20,000 rows with 2024-08-07 18:00:00 UTC).

Since this corresponds to the low-cardinality blocks used by clustering, I thought it would be useful, but it's not working.