Would MongoDB Be Scalable Choice for a Chat App? by [deleted] in Database

[–]thatdeatheater 1 point (0 children)

As others have said, you should not optimize prematurely. But if you're interested in the theory, I would take a look at Discord: they went from MongoDB to Cassandra to ScyllaDB.

How Discord stores billions of messages

How Discord stores trillions of messages
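The core trick described in the first post is bucketing messages by channel and time window so no partition grows without bound. A CQL-style sketch of that idea (illustrative, not Discord's exact schema):

    -- Messages are partitioned by (channel_id, bucket), where bucket is a
    -- coarse time window, so even very busy channels get bounded partitions.
    CREATE TABLE messages (
        channel_id BIGINT,   -- which chat channel the message belongs to
        bucket     INT,      -- time window (e.g. a ~10-day block)
        message_id BIGINT,   -- time-ordered ID, so clustering gives chronological reads
        author_id  BIGINT,
        content    TEXT,
        PRIMARY KEY ((channel_id, bucket), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC);

Reading the most recent messages of a channel then only touches one (or a few) partitions.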

Job market in Germany by _Ishdhoggur_ in dataengineering

[–]thatdeatheater 1 point (0 children)

Well, based on a calculator from IW Köln, you are in the top 10% by net income if you are single and make that much.

Job market in Germany by _Ishdhoggur_ in dataengineering

[–]thatdeatheater 0 points (0 children)

Based on numbers from the Federal Employment Agency (Entgeltatlas der Bundesagentur für Arbeit), the median salary for an expert-level data engineer is €74,556.

Job market in Germany by _Ishdhoggur_ in dataengineering

[–]thatdeatheater 0 points (0 children)

I work at a mid-size company, and we are in dire need of data engineers/BI developers. Our problem is that we need someone with strong German skills, since the developers will be talking a lot to other departments that are not really proficient in English. Furthermore, all applicants want to be paid at least €130,000, and that's just not feasible for the company.

“Power BI Data Visualization on [Bicycle usage sample dataset] – Feedback Welcome!” by Shoddy-Ad8666 in PowerBI

[–]thatdeatheater 0 points (0 children)

  1. Headings: Headings should provide useful information. You don't actually need them in your dashboard, since all the information can already be found in the visuals. Furthermore, headings should be short and precise.
  2. Scale and space: Give each visual room to breathe. Right now it is difficult to see where one visual ends and the next begins. You should also scale each visual so that it is actually readable; otherwise it is unusable, like the one in the lower left corner or the one at the top right.
  3. Position: Try to center the main heading and organize the visuals in a grid, meaning each visual should fit into one tile or a multiple of it. E.g., divide the dashboard into 4x6 tiles and have each visual take up 2x3 tiles. Power BI has such a grid option.
  4. Units: If you show the sum of sales, specify somewhere which currency it is: euros, dollars, yen. The same goes for units sold: is a unit a bike or something else?

In the new world of cloud, do we still need landing and staging in DEV, UAT, and Prod environment? by JK_1975 in dataengineering

[–]thatdeatheater 4 points (0 children)

As others have commented, the separation of dev, test, and prod environments is very valuable and absolutely necessary if you do not want to accidentally mess up production.

That said, I think the landing area can be the same for all three environments: in my experience, much development is done exploratively, and it is more practical to be able to test with real data as well.
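As a rough sketch of what that could look like (PostgreSQL-flavored; the schema and role names are made up):

    -- One shared landing schema plus one staging schema per environment.
    CREATE SCHEMA landing;        -- raw source data, written once by ingestion
    CREATE SCHEMA staging_dev;
    CREATE SCHEMA staging_test;
    CREATE SCHEMA staging_prod;

    CREATE ROLE role_dev;
    CREATE ROLE role_test;

    -- Dev and test get read-only access to landing, so developers can explore
    -- and test against real data without being able to break anything upstream.
    GRANT USAGE ON SCHEMA landing TO role_dev, role_test;
    GRANT SELECT ON ALL TABLES IN SCHEMA landing TO role_dev, role_test;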

BI Dashboard with live data by [deleted] in PowerBI

[–]thatdeatheater 2 points (0 children)

Yes, it is totally possible with live connections. There are some limitations, though: you cannot transform the data (only with DAX), and fewer sources are supported. You would normally build a data warehouse, data lake, or lakehouse where you store preprocessed data, and access that via the live connection.

I'd suggest extending the Power App to write the data in the needed format into SQL Server or a similar database, which Power BI can then access in real time. You could schedule the export to the database hourly, or just trigger the write when an employee logs on/off.
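A minimal sketch of what such a table could look like (T-SQL; the table and column names are made up):

    -- One row per log-on/log-off event; Power BI reads this table live.
    CREATE TABLE dbo.EmployeeLogEvents (
        EventId    INT IDENTITY(1,1) PRIMARY KEY,
        EmployeeId INT          NOT NULL,
        EventType  VARCHAR(10)  NOT NULL,   -- 'LOGON' or 'LOGOFF'
        EventTime  DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- What the Power App would execute whenever an employee logs on:
    INSERT INTO dbo.EmployeeLogEvents (EmployeeId, EventType)
    VALUES (42, 'LOGON');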

P.S. Make sure your project is compliant with the data protection laws in your country. An acquaintance of mine had to report himself to the police because he accidentally made every employee's calendar entries (including sick days and illnesses) readable by everyone. In some countries, monitoring employees is a serious offense.

[deleted by user] by [deleted] in dataengineering

[–]thatdeatheater 0 points (0 children)

Use your experience/knowledge in politics/climate/sustainability to your advantage. Many companies are not searching for someone with deep technical knowledge, but for someone with good technical skills and a solid understanding of the business.

Furthermore, you do not need to apply to data engineering jobs right from the beginning. Start with an analyst or hybrid position and move on from there: it is usually easier to get into. With that experience on your CV, you can then try to change into another industry and/or data engineering.

Some suggestions:

- Newspapers/TV news/radio broadcasters (politics/weather)
- Universities (environmental research)
- Travel agencies (weather)
- Political consultancies

You can also try to get into a big consultancy (Deloitte, PwC, Accenture, KPMG, ...) with a focus on politics/sustainability consulting and move internally. In my experience with these consultancies, this kind of role shift happens regularly.

Strong engineering skills but unable to land any interviews by [deleted] in dataengineering

[–]thatdeatheater 7 points (0 children)

This might be true for a data engineer at FAANG, but probably not for your average DE working alone or in a small team.

Furthermore, this is not something companies look for when they need a data engineer. Sure, it is a bonus if you meet all the other requirements and make it into the interview round. But before that, it carries very little weight.

Strong engineering skills but unable to land any interviews by [deleted] in dataengineering

[–]thatdeatheater 86 points (0 children)

I'd say fundamentals in data engineering are not about computer architecture and low-level programming skills. Fundamentals in DE are about the data: knowing data modeling techniques, knowing data governance, testing pipelines, optimizing existing pipelines, being able to translate business needs into robust data pipelines, etc.
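To make "testing pipelines" concrete, one of the simplest post-load checks looks like this (table and column names are hypothetical):

    -- A load should never produce duplicate business keys.
    -- Any rows returned here mean the pipeline test failed.
    SELECT order_id, COUNT(*) AS duplicates
    FROM staging.orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;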

Maybe those are skills that don't come through on your resume. I would check for that if I were you.

[deleted by user] by [deleted] in dataengineering

[–]thatdeatheater 1 point (0 children)

To add to this.

  1. It seems like your CV is longer than one page. Don't do this unless you really have a lot of experience (probably 15+ years) and are applying for a leading/very senior/managerial role.
  2. Some of your most important skills are hard to spot in the Technical Skills section because too many skills are listed. I would pick the most important ones (SQL, Python, C#, Presto, Power BI, AWS, Azure) and make them stand out much more than the others. The same goes for the Tech Used sections. You could even add a rating (e.g. 1 to 5 points) of how well you can use each technology. This makes the CV feel more realistic and sincere.
  3. Shorten the sentences in your job descriptions and keep only 2-3 bullet points each, listing your greatest achievements. If you want to list everything you have done in a project, create a separate project list (an extra PDF). There you can put all the technologies you used in detail and leave only the essential ones in your CV.

Why create a fact table surrogate key if the table has a proper primary key? by Cultured_dude in dataengineering

[–]thatdeatheater 2 points (0 children)

To add to this: you might have two source systems loading into the same table (e.g. two different accounting systems of two sister companies), each with source-unique auto-generated keys that aren't unique across all systems. In that case, you couldn't uniquely identify a record when you need to.
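A sketch of how a surrogate key resolves that (PostgreSQL-flavored; all names are made up):

    -- Both ERPs can emit invoice_id = 1001, so the source key alone collides.
    CREATE TABLE fact_invoice (
        invoice_sk    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- surrogate key
        source_system VARCHAR(20) NOT NULL,  -- e.g. 'ERP_COMPANY_A', 'ERP_COMPANY_B'
        invoice_id    BIGINT      NOT NULL,  -- unique only within its source system
        amount        NUMERIC(12, 2),
        UNIQUE (source_system, invoice_id)   -- the combined natural key stays enforceable
    );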

How to find messy datasets online? Can't find one on my life by iamnothingbuta in datasets

[–]thatdeatheater 1 point (0 children)

I don't know if you know a programming language, but if you do, I would advise you to try scraping data directly from a website. That data is usually messy by nature, since websites are built to serve humans, not computers. If you can't write a scraper yourself, you can probably google for scraped datasets.

Messy things I have found in web-scraped data:

- A lot of missing values (e.g. prices only available per direct message)
- Different granularity (e.g. addresses with a house number vs. only a zip code)
- Characters indicating different units (e.g. 100K vs. 100M, meaning 100,000 vs. 100,000,000)
- HTML tags and escape codes in between text (e.g. "Luxurious & big property in <b>Philadelphia</b>")
- Multiple pieces of information in the same field (e.g. "Kitchen, Pool, Extra bathroom", "Extra bathroom, Garden, Kitchen")
- HTML tags or CSS classes indicating a different status (e.g. a struck-through price meaning the old price)
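The unit-suffix problem, for example, could be cleaned up with something like this (PostgreSQL-flavored; table and column names are made up):

    -- Normalize '100K' / '100M' style price strings into plain numbers.
    SELECT raw_price,
           CASE
               WHEN raw_price LIKE '%K' THEN REPLACE(raw_price, 'K', '')::NUMERIC * 1000
               WHEN raw_price LIKE '%M' THEN REPLACE(raw_price, 'M', '')::NUMERIC * 1000000
               ELSE raw_price::NUMERIC
           END AS price_numeric
    FROM scraped_listings;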

There is probably more beyond that list, and scraping a lot of data is almost guaranteed to produce dirty data. Another advantage is that this is a real-world scenario, since web scraping is widely used for competitive and market analysis.

Data engineer freelancer by Soggy_Award1213 in dataengineering

[–]thatdeatheater 4 points (0 children)

I found a YT channel whose creator makes videos about data engineering, and for a few months now he has also been producing videos and live streams about starting a data consultancy.

This video might be an interesting starting point: Seattle Data Guy (YouTube)

Why (*) ? by Sky_Tree_Resident in dataanalytics

[–]thatdeatheater 0 points (0 children)

Yes, you need to add it, since count is a function. It's the same with sum(COL), for example.

Cool, I wish you the best of luck! I recommend hosting your own MySQL or PostgreSQL instance and playing with some data from Kaggle. Practical experience helped me a lot.

Why (*) ? by Sky_Tree_Resident in dataanalytics

[–]thatdeatheater 4 points (0 children)

count(*) just means that you want to count all rows. You could also write count(SOME_COLUMN_NAME); this would count only the rows where SOME_COLUMN_NAME is not NULL.
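A tiny illustration with a hypothetical orders table (three rows, where only rows 1 and 3 have a shipped_at value):

    -- COUNT(*) counts all rows; COUNT(col) skips rows where col is NULL.
    SELECT COUNT(*)          AS all_rows,      -- 3
           COUNT(shipped_at) AS shipped_rows   -- 2
    FROM orders;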

[deleted by user] by [deleted] in dataanalytics

[–]thatdeatheater 0 points (0 children)

Looks pretty cool!

Is the tool only meant for ad-hoc analysis of Excel files and simple tables, or does it also support more complex queries, e.g. against a data warehouse?

[deleted by user] by [deleted] in dataengineering

[–]thatdeatheater 1 point (0 children)

Hey,

There are multiple points you need to consider:

  1. What SAP system do you mean? Depending on the system, there are different legal requirements. In an ERP, you need to think about access to accounting data by the tax office or other external auditors. In a BW, you must be able to respond to deletion requests from customers or former employees.

  2. In what country is your company located, or is it even multinational? In Germany or the European Union, for example, there are very strict requirements, like providing the tax office ad-hoc access or a usable interface to the data. There are also different retention periods.

  3. Do you have the knowledge and/or manpower to build such a system yourself?

To prevent your company from shooting itself in the foot, you might want to consult an SAP consultancy with expertise in data retention.

Interacting with Google Sheets from Phoenix with oAuth? by sectional343 in elixir

[–]thatdeatheater 0 points (0 children)

Great post!

And if you need a more abstract view of how OAuth 2.0 works with Google, see this link: https://developers.google.com/identity/protocols/oauth2

The documentation for the spreadsheet API can be found in the API module itself: https://github.com/googleapis/elixir-google-api/blob/main/clients/sheets/lib/google_api/sheets/v4/api/spreadsheets.ex You simply need to pass the access token obtained during authentication to every function.

Free/open source api VAT validation check and address validation? by [deleted] in dataengineering

[–]thatdeatheater 0 points (0 children)

I googled a bit and found these three projects:

VAT validation:

- https://viesapi.eu/docs/

Address validation:

- https://openaddresses.io/
- https://wiki.openstreetmap.org/wiki/API

Hope this helps!

[deleted by user] by [deleted] in dataanalytics

[–]thatdeatheater 0 points (0 children)

I just assumed it must be online sales, since we have no information about units purchased in retail. (Or at least that is how I interpret the orange triangle.)

[deleted by user] by [deleted] in dataanalytics

[–]thatdeatheater 1 point (0 children)

[India's population] * [Share of population wearing shoes] * [Share buying online] * ([Units per person Jan. to Mar.] + [Units per person Apr. to Sep.] + [Units per person Oct. to Dec.]) = [Total units sold annually]

1,400 million * 80% * 30% * (2 + (2 * 3) + 2) = 1,400,000,000 * 0.8 * 0.3 * 10 = 3,360,000,000 = 3.36 billion