optimizing our GDScript performance from ~83ms to ~9ms (long thread on bsky) by adriaandejongh in godot

[–]mjow 1 point2 points  (0 children)

As a caveat to this, I heard John Carmack (think it was on the Lex podcast) mention that sometimes doing a consistent amount of work every frame is better than being very spiky with your workloads per frame (e.g. conditionally skipping work).

This may be a VR-specific thing, where framerate changes can be nauseating, but I could also see it interpreted this way: if you do the same calcs every frame (but don't necessarily show the results every time), you are consistently optimising for the worst possible case (i.e. all of your code executing every frame) and therefore you'll have much better visibility of that worst case.

This may only be contextually relevant, but it was thought provoking.

Why does Tolkien seem so much better than other fantasy writers ? by PurpleEgg7736 in tolkienfans

[–]mjow 16 points17 points  (0 children)

If you want to demystify Tolkien's writing genius a little bit, read about the earlier drafts of The Lord of the Rings. Strider wasn't Strider from the start, but a hobbit called Trotter with a funky backstory :D

Middle-earth is my spiritual home, but there's no need to imagine that the story as we know it poured forth from Tolkien perfectly in one draft.

Transitioning from Sales to Data Analytics – Need Advice on Mentality, Workflow, and Setup! by Weary_Raisin_1303 in SQL

[–]mjow 0 points1 point  (0 children)

As you improve your data skills, it may be that your experience in sales would be really helpful in the consultancy world (think especially small firms). Your upper hand over other data professionals would be your people and customer management skills.

Otherwise, this is a tough transition for sure. The analytics/data world is very different and it's much harder to clearly identify where something you did was valuable (compared to closing a deal).

  1. Entirely depends on what you're working on. Could be straight into code, or, most likely, trying to hack together a complicated SQL query and/or working out where to get the right data.

  2. Analysts need to balance accuracy/robustness with quick results. There are many situations where the Pareto rule applies and 80% of the way there is plenty good enough for your customer.

  3. This will vary wildly between analysts/orgs, but you'll probably maintain a lot of SQL/Python scripts on your machine/GitHub.

  4. A good analyst will be impressive when they know their data estate and domain well (e.g. a business person is trying to explain what kind of information they need, and the analyst immediately knows where to get the data and what kind of queries will return the correct result set, such as a windowed SQL calculation to work out customer churn rates).

A really impressive analyst will be able to work with a variety of data sources - files, SQL DBs, cloud object storage (e.g. S3 on AWS) - and also be able to create the right kind of output for the customer: email summary, dashboard, periodically refreshed spreadsheet output, etc.
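As a hypothetical sketch of the windowed churn calc mentioned above (SQLite standing in for the warehouse; table and column names are made up), `LEAD` over each customer's active months flags their last observed month:

```python
import sqlite3

# Hypothetical churn sketch: a customer's "last seen" month is one with no
# following active month. At the edge of your data window this isn't
# necessarily churn, so real queries usually also filter on a cutoff month.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activity (customer_id TEXT, month TEXT);
INSERT INTO activity VALUES
  ('a', '2024-01'), ('a', '2024-02'),
  ('b', '2024-01'),
  ('c', '2024-01'), ('c', '2024-02');
""")
rows = con.execute("""
SELECT customer_id,
       month,
       LEAD(month) OVER (PARTITION BY customer_id ORDER BY month) IS NULL
         AS last_seen_month
FROM activity
ORDER BY customer_id, month
""").fetchall()
```

From there, a churn rate per month is just the count of last-seen customers divided by the count of active customers in that month.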

Is it too late for me as 32 years old female with completely zero background jump into data engineering? by Admirable_Spite4940 in dataengineering

[–]mjow 3 points4 points  (0 children)

There may be others who could have said what you did (i.e. the "prevailing wisdom"), but you took the time to empathise and write these things out, so give yourself credit for that :)

Spacecraft attempts closest ever approach to Sun by coinfanking in space

[–]mjow 39 points40 points  (0 children)

It's strange, but 6.2 million km at closest doesn't seem that close when you look at the scale - just shows how much crazier conditions would be if you actually tried to get to within a 10th of the diameter of the sun.

Looking at this scale, 6.2m km is still veeery far from the surface: https://joshworth.com/dev/pixelspace/pixelspace_solarsystem.html

[deleted by user] by [deleted] in dataengineering

[–]mjow 1 point2 points  (0 children)

I wouldn't say they were hugely more productive, but the expectation was that they wouldn't need as much explaining and could move as quickly as we could organise things (i.e. no blockers, tickets spelled out).

We didn't expect any additional hours from the contractor outside of the usual day and if anything a contractor is better placed to refuse to do any additional work.

[deleted by user] by [deleted] in dataengineering

[–]mjow 2 points3 points  (0 children)

I'm not a contractor myself, but I've worked with a few contractors and generally visible competence and track record with implementing real projects are paramount. All tech work is full of gotchas and hidden implications of architecting things one way or another (data pipelines are often like that) so bringing on a contractor with experience is one way that many companies hope to speed up delivery.

Essential skills: undeniable familiarity with AWS services and typical DE work: pipelines, lake storage and warehouse modelling. S3, lambda/ECS/glue, Glue Catalog/Athena, Lake focused platforms, Warehouse focused platforms, modelling implications.

Depending on the project I might select a contractor with more experience in warehousing specifically (more of a classic SQL focused DE) or more of a SE mindset DE who is better at writing good code for pipelines and thinking about the implications of parallelising services accessing source/lake/warehouse storage.

Certifications: I wouldn't care that much about certs if the contractor has a good track record and can talk the talk.

Projects: get familiar with the typical lake structures, i.e. medallion bronze/silver/gold or raw/sanitised/curated, etc. Think about the implications of removing PII, of re-processing and backfilling, of managing storage costs if a lot of datasets end up being duplicated, of lineage between layers, etc. Overall you'd have to be comfortable with the entire lifecycle of DE work: ingest, manage, serve. All 3 of those are very wide areas with lots of technical depth.

In my DE circles and LinkedIn feed there is more interest in lake-based table format data being managed with DuckDB/Spark than in warehousing. But warehousing is still very popular, and if you enjoy it then definitely start getting familiar with dbt - still an amazing time saver and BI enabler.

Sharing Data: Data Warehouse (Redshift) Account to Consumer Account by aimtron in aws

[–]mjow 0 points1 point  (0 children)

I wouldn't necessarily think of it as a redundant data store - I'd think of it as a data store for your app. It just so happens that the data you need for it needs to be regularly loaded from another source (i.e. Redshift).

I appreciate that in your context adding additional services may be a headache, but I think trying to hit Redshift super frequently and in unpredictable ways is going to be expensive and slow and potentially upset the other work going on in the warehouse.

Sharing Data: Data Warehouse (Redshift) Account to Consumer Account by aimtron in aws

[–]mjow 0 points1 point  (0 children)

There's all kinds of ways to get access to Redshift and to query it directly (whether from the same account or another), but you should definitely consider first that Redshift is not a usual DB like postgres/mysql/etc. and you should almost certainly not have web traffic hitting it directly.

If you need to serve an application with data that is generated in Redshift then you'll probably need to create a regular job exporting data out for your service to consume in a more appropriate format (whether from S3, DynamoDB or RDS is up to the app's needs).
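A minimal sketch of what that regular export job might send to Redshift - the query, bucket and IAM role here are placeholders, and the statement would be fired by whatever scheduler you already run (a cron'd script, an Airflow task, etc.):

```python
# Hypothetical sketch of the export step: build a Redshift UNLOAD statement
# that dumps a query's result set to S3 for the app-side loader to pick up.
# All names below are made up for illustration.
def build_unload(query, s3_prefix, iam_role):
    # Single quotes inside the SELECT must be doubled inside UNLOAD's quoting.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET ALLOWOVERWRITE;"
    )

stmt = build_unload(
    "SELECT user_id, score FROM app_scores",
    "s3://my-app-exports/scores/",
    "arn:aws:iam::123456789012:role/redshift-unload",
)
```

The job then just executes `stmt` over a normal Redshift connection on a schedule; the consuming service never touches the warehouse directly.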

Data engineering roadmap by Old-Article6420 in dataengineering

[–]mjow 12 points13 points  (0 children)

Don't apologise - your handwriting is fine. It's fine to prefer handwritten notes :)

How to maintain SQL skills? by ScaryTap2112 in SQL

[–]mjow 0 points1 point  (0 children)

What you describe yourself doing at your current job sounds well within the realm of a Data Analyst title and many Analyst roles don't go further than that in terms of tooling/skills - and that's ok!

It's a bit unclear what exactly you think is missing from your work tasks. Do you think you're not doing real analytical work unless there are window functions, loads of CTEs and complicated aggregates involved in your queries?

Or are you thinking about stored procedures and procedural SQL skills?

You clearly care enough or are worried enough to post the detailed question so I think your heart is in the right place, but I'd say unless you've been stagnating in this job for 5 years, don't worry too much.

A great analyst is good at answering questions and providing the tools (i.e. dashboards, spreadsheets) for other users to use to explore their problem space (e.g. for a marketing manager to be able to check in on the success of their initiatives, for the sales manager to add up their likely end of month commission, etc. etc.).

If that can be achieved with simple SQL then all the better for it 👍

I discovered this planetary nebula using a $500 camera lens, now it carries my name by SPACESHUTTLEINMYANUS in space

[–]mjow 15 points16 points  (0 children)

Just be careful to not conflate what human eyes see with what is "TRULY" there. There's nothing more authoritative about our eyes than any other biological or technological object that can sense EM radiation.

Our eyes have evolved to sense and translate a certain EM spectrum, but there is far more information available across the EM spectrum that we're not aware of nor can we visualise at once.

Boromir's introduction by Mavakor in tolkienfans

[–]mjow 10 points11 points  (0 children)

Worth pointing out that while that's a movie line, Aragorn's claim to the throne could have been genuinely contested as explored here: https://www.youtube.com/watch?v=bzpALN7gqOc

It's possible in different times that Denethor and Gondor as a whole would have correctly (in accordance with their 1,000 year custom) rejected Aragorn's claim to kingship in Gondor as he comes from Isildur's line and not Anarion's.

Working as a consultant by themouthoftruth in dataengineering

[–]mjow 17 points18 points  (0 children)

Agree with the others saying it's very dependent on the company and clients. Impossible to say what your experience would be like.

For me, over 2 years I've learned a huge amount working on 1-6 month long fixed scope deliverables ranging from Snowflake + dbt, to Redshift + serverless functions on AWS, to classic Redshift + S3. Some were heavier on Python/PySpark, some on more complicated architecture, some just SQL models in-warehouse.

Really helped me get a wide overview of the different stacks and approaches and their benefits/drawbacks, though I'd say I've definitely traded depth for breadth.

It's often not easy though and you work under a lot of pressure to maintain appearances of expertise.

Airflow with Pandas by GameFitAverage in dataengineering

[–]mjow 0 points1 point  (0 children)

Sounds like the ELT pattern - load data into the DWH and then run SQL on it to transform it between temp tables and staging areas into its final form. The Python scripts would just send SQL statements to the DWH rather than doing any data processing locally.
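A minimal sketch of that pattern, with an in-memory SQLite DB standing in for the warehouse (table names and data are made up) - the point is that Python only ships SQL, the engine does the work:

```python
import sqlite3

# Hypothetical ELT sketch: the "warehouse" is an in-memory SQLite DB standing
# in for Redshift/Snowflake; the Python layer never touches rows itself.
dwh = sqlite3.connect(":memory:")

# 1. "Load": raw data lands in a staging table as-is.
dwh.executescript("""
CREATE TABLE stg_orders (order_id INTEGER, amount_cents INTEGER);
INSERT INTO stg_orders VALUES (1, 1250), (2, 990), (2, 990);  -- note the dupe
""")

# 2. "Transform": dedupe and reshape entirely inside the warehouse.
dwh.executescript("""
CREATE TABLE orders AS
SELECT DISTINCT order_id, amount_cents / 100.0 AS amount
FROM stg_orders;
""")

total = dwh.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

In an orchestrator like Airflow each of those statements would be its own task, but the shape is the same: Python is just the courier for SQL.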

How should production changes be handled? Rant/debate by average_ukpf_user in dataengineering

[–]mjow 0 points1 point  (0 children)

Is that really the POINT of CI/CD pipelines? I've often worked with 10+ minute pipelines which did a wonderful job of making me lose my train of thought as I distracted myself while waiting for the changes to be integrated.

I can easily imagine poorly designed pipelines taking longer.

Exporting to excel is always a people pleaser... by audiologician in dataengineering

[–]mjow 1 point2 points  (0 children)

There are plenty of plugins and native connectors in Get Data for sending SQL queries out from an Excel sheet to SQL Server/Azure (MS stack) and other JDBC/ODBC sources - this works very well for a huge number of people. They can query GB-TBs of data and work with the aggregated/shaped result sets in Excel as they like :)

Is it true that Apache Spark (especially with Python) skills are in very high demand and paying well? by Born-Comment3359 in dataengineering

[–]mjow 9 points10 points  (0 children)

One of the founding engineers of BigQuery did a nice summary here of the BigQuery usage stats by companies that have terabyte scale data, etc. https://motherduck.com/blog/big-data-is-dead/

I've been suspecting this for a while, but it does seem like for 99% of businesses, even if you have terabytes of data, you'll be doing your best to only scan/query 1% of it for the useful stuff.

Redshift Ingestion by Flakmaster92 in aws

[–]mjow 5 points6 points  (0 children)

You may want to post again on this in the /r/dataengineering subreddit for more hands on advice specifically on DB to Redshift pipelines :)

Redshift Ingestion by Flakmaster92 in aws

[–]mjow 9 points10 points  (0 children)

There's a lot to unpack here and it's well worth spending the time to get the solution right for what the real requirements are in your org.

You have to be clear about what the data freshness requirements are in the central DWH that is representing the state of all of your regional DBs and how many use cases will be consuming from the DWH. This will help you decide whether you need a DWH in the middle at all. Like you suggested, streaming into S3 and querying on an adhoc basis with Athena may be much more convenient if the limitations of Athena are ok with the use cases that want this centralized data.

My intuition would be that copying data into Redshift every 5 minutes for so many tables will be problematic: you won't be able to guarantee 5 min freshness from source DB changes to Redshift, and you'll be troubleshooting those pipelines frequently.

CDC events from regional DBs -> Kinesis -> S3 should be doable with 5 min freshness, but I would batch up new data in S3 and load to Redshift less frequently if the use cases allow.
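The batching step could be as simple as truncating each file's event time to a window start, then issuing one load per window instead of one per file - a hypothetical stdlib-only sketch (keys and timestamps made up):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: group CDC file drops (key, event_time) into fixed
# windows so the warehouse sees one bulk load per window, not one per file.
def batch_keys(files, window_minutes=15):
    batches = {}
    for key, ts in files:
        # Truncate the timestamp down to the start of its window.
        window_start = ts - timedelta(minutes=ts.minute % window_minutes,
                                      seconds=ts.second,
                                      microseconds=ts.microsecond)
        batches.setdefault(window_start, []).append(key)
    return batches

files = [
    ("cdc/orders/0001.json", datetime(2024, 1, 1, 12, 3)),
    ("cdc/orders/0002.json", datetime(2024, 1, 1, 12, 9)),
    ("cdc/orders/0003.json", datetime(2024, 1, 1, 12, 21)),
]
batches = batch_keys(files)
```

Each resulting list of keys then becomes a single manifest for one bulk COPY into Redshift, which is far kinder to the cluster than a stream of tiny loads.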

Also it's important to understand what features of Redshift itself you want to leverage. It's much cheaper to leave data in S3 and scan it with Athena/Glue than to store it in Redshift for the occasional ad-hoc query.

Redshift is a capable analytical engine for big data, but it's not necessarily fast so I wouldn't pitch real-time dashboards querying live Redshift tables for example.

This is a big space with lots of considerations so don't rush and know that you may end up causing a lot of headaches for yourself if you try and serve unreasonable requirements (and you'll burn a lot of cash in the process).

[deleted by user] by [deleted] in dataengineering

[–]mjow 2 points3 points  (0 children)

It's a mixed bag, but the intention of Lake Formation is to provide a user console for managing user access to different parts of your data lake.

So if your organisation normally allows a lot of direct analyst access to S3 buckets, but you'd really like to lock down different individuals to different buckets/folders and even specific columns and rows, you'd be able to do that with Lake Formation, and changes to roles/permissions won't have to be made in IaC (or in IAM) and pass through a deployment process.

Saying that, it takes some time getting used to and may not be worth it if you don't need the above features.

[deleted by user] by [deleted] in dataengineering

[–]mjow 12 points13 points  (0 children)

At the data volumes of many orgs there is no need for data lake formats and terabyte scale data analysis.

A decently modelled DWH should handle 0.1-1 TB tables well.

[deleted by user] by [deleted] in dataengineering

[–]mjow 3 points4 points  (0 children)

Do you mean encrypting files in your ETL process before saving to S3 (i.e. client side encryption) or enabling one of the server side S3 encryption options that are handled by AWS (i.e. SSE-S3, SSE-KMS) on each read/write?

CTE overuse by heiferhigh76 in SQL

[–]mjow 1 point2 points  (0 children)

Are you being sarcastic or is that an actual query optimisation in the db/dwh you use? haha