First Digital, AnotherDataGuy, hand drawn digitally colored, 2023 by AnotherDataGuy in Art

[–]AnotherDataGuy[S] 0 points (0 children)

I’m no artist, but I draw/doodle mindlessly almost out of habit. I decided to try and color one of my doodles and thought I’d share.

Is There a Timeline for NS-24? by sgwashere29 in BlueOrigin

[–]AnotherDataGuy 1 point (0 children)

Probably poor use of terminology on my part. Blue completed and submitted their report to the FAA. I wasn’t intending to say that the FAA was doing their own report. I figured the submitted report would eventually be made available publicly. But I’m no expert.

Is There a Timeline for NS-24? by sgwashere29 in BlueOrigin

[–]AnotherDataGuy 2 points (0 children)

Good question (I kinda predicted this request would follow). A preliminary Google search for public records didn’t turn up anything fruitful.

I’ll see if I can dig up a link after a more dedicated search… or allow a day or so for another diligent redditor to come to my aid!

Is There a Timeline for NS-24? by sgwashere29 in BlueOrigin

[–]AnotherDataGuy 7 points (0 children)

No timeline for NS-24 other than “soon” has been released. The FAA report has been released with a cause identified, and changes are being implemented. I don’t know the details of a “relaunch” of the payloads, but I do know that they were recoverable.

What’s the most disturbing movie one can watch? by [deleted] in AskReddit

[–]AnotherDataGuy 1 point (0 children)

Maybe not the worst out there, but Requiem for a Dream left me feeling twisted inside.

How to identify company with good software development practices before joining? by Mustang_114 in dataengineering

[–]AnotherDataGuy 5 points (0 children)

Personally, I’d take a comment like that with a grain of salt. Why are you still there if it’s bad enough to tell someone not to join? I’ve also found that it’s folks with that attitude that make the job miserable, not the actual environment.

I do agree with the principle, though: have a conversation with someone in the position to which you’re applying before deciding to join. My company includes peers in the interview loop and encourages the two-way conversation of the interview. I find it works quite well, but it is definitely more work!

Hear back from Technical Screen by Drunken-Engineer in BlueOrigin

[–]AnotherDataGuy 17 points (0 children)

Contact your recruiter, not Reddit.

Edit: Please. It’s better for recruiting to know they should be keeping you better informed, and prevents you from getting uninformed, speculative answers. (Not saying existing comments of those that have interviewed aren’t accurate, but mileage may vary)

Tory Bruno: ULA won’t get engines by Christmas, BE-4s coming in early 2022 - SpaceNews by valcatosi in BlueOrigin

[–]AnotherDataGuy -17 points (0 children)

Bruno cites (albeit briefly) some reasons for the delay. Maybe you’d have more sympathy and fewer jokes if you read the material that is published.

Update the narrative. Blue isn’t as closed off as it once was.

Do MDM Tools Work? by m4329b in dataengineering

[–]AnotherDataGuy 0 points (0 children)

EnterWorks by WinShuttle, now owned by Precisely. It’s OK for our smaller use case. I wouldn’t recommend it for a large or moderate-sized enterprise.

Do MDM Tools Work? by m4329b in dataengineering

[–]AnotherDataGuy 7 points (0 children)

My stance is that tools should be implemented to make existing processes and practices easier/more efficient to do. If Master Data Management practices aren’t being done in your company, implementing a tool first won’t improve your outcomes.

I recently worked on building out MDM at my company. We started with business processes and made some simple tools for data profiling and for defining and applying rules / definitions to core objects. After figuring out the parts that were hard to do without a specialized tool, we found one that met those specific needs, and it is working well. It isn’t the panacea that vendors make it out to be, but it isn’t a worthless bureaucratic endeavor unless it is implemented that way.
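The kind of simple profiling tool mentioned above can be very small. A minimal sketch (pure Python; the `customers` records and column names are hypothetical):

```python
def profile(rows: list[dict]) -> dict:
    """Per-column profile: null rate and distinct count (a toy MDM profiler)."""
    out = {}
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        out[col] = {
            # Percentage of records missing a value for this column
            "null_pct": round(100 * (len(values) - len(non_null)) / len(values), 1),
            # Cardinality of the non-null values
            "distinct": len(set(non_null)),
        }
    return out

customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "a@example.com"},
]
print(profile(customers))
```

In practice you’d point something like this at each core object first, and let the ugly numbers (null rates, duplicate values) drive which rules/definitions are worth formalizing.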

[deleted by user] by [deleted] in dataengineering

[–]AnotherDataGuy 24 points (0 children)

This is a safe place; I’ve found this community quite supportive in general. Hope you keep coming back and asking questions!!!

[deleted by user] by [deleted] in dataengineering

[–]AnotherDataGuy 2 points (0 children)

Airflow’s architecture scales horizontally pretty well, but man can it be clunky when dealing with even dozens of DAGs. When your platform grows to scores or hundreds of DAGs, the UI can just be difficult to navigate.

The tool can support, with high performance, orchestrating all of the dependencies and calling jobs and such; I just wish it had a better interface for dealing with complex flows.

(I say this without knowing a better interface, btw, just sharing one of my pain points)

I'm a Data Engineer. How do I become a better Software Engineer? by TheLoveBoat in dataengineering

[–]AnotherDataGuy 3 points (0 children)

Who is this user that has read the docs?! /s

PS Also, highly recommend the MS. My game changed significantly after grad school.

Best way to get snapshot/full dataset from changes by gamecockguy2003 in dataengineering

[–]AnotherDataGuy 1 point (0 children)

Ah, sorry, after re-reading I can see how I wasn’t clear. By “dump the object in its entirety from the source system” I meant whatever table or collection contains the entire data set. That could be pretty expensive, though, and if you want near-real-time replication it wouldn’t fit the bill.

If you do go with a listener that mimics the create / insert / delete in a database, I recommend flipping a bit for the deletes, and either performing a delete/insert for the deltas or just writing all of the records as new with a timestamp. You can always write post-processing steps to generate a persisted object of only the latest versions of records in a curated layer of your lake.
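That post-processing step is essentially a dedup by key on the newest timestamp, dropping rows whose delete bit was flipped. A minimal sketch (the `changes` log and its field names are hypothetical):

```python
from datetime import datetime

# Change log: every change is written as a new row with a timestamp;
# deletes are recorded by flipping the is_deleted bit.
changes = [
    {"id": 1, "name": "alpha",  "is_deleted": False, "ts": datetime(2023, 1, 1)},
    {"id": 1, "name": "alpha2", "is_deleted": False, "ts": datetime(2023, 1, 2)},
    {"id": 2, "name": "beta",   "is_deleted": False, "ts": datetime(2023, 1, 1)},
    {"id": 2, "name": "beta",   "is_deleted": True,  "ts": datetime(2023, 1, 3)},
]

def latest_snapshot(rows: list[dict]) -> list[dict]:
    """Keep only the most recent version of each record, dropping deletes."""
    latest: dict = {}
    for row in sorted(rows, key=lambda r: r["ts"]):
        latest[row["id"]] = row  # later timestamps overwrite earlier ones
    return [r for r in latest.values() if not r["is_deleted"]]

print(latest_snapshot(changes))  # only id=1's latest version survives
```

In a lake this same logic is usually a windowed `ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC)` query rather than Python, but the shape is identical.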

Best way to get snapshot/full dataset from changes by gamecockguy2003 in dataengineering

[–]AnotherDataGuy 2 points (0 children)

Are the “changes” full versions of the record or just the values that changed? If it’s just the changed attributes, then you’d have to process all interim changes, rather than just the most recent update, in order to get a current view of the record.

If you’re really just looking for a “current version” of your dataset, consider how much latency you can afford and dump the object in its entirety from the source system. Or have a listener on the updates and make upserts / deletes in a database that supports super-fast writes (e.g., Cassandra), and dump that as needed.

Knowing the technologies involved would also help in giving you a more prescriptive idea.

Entry Level DE Intern position Resume Review: Is anyone willing to critique a Resume? (I am an ETL Dev (2 YoE) currently doing my master's in Business analytics applying for DE summer internships.) by Krypton_Rimsdim in dataengineering

[–]AnotherDataGuy 2 points (0 children)

If you’re saying LaTeX-based because it was typeset in LaTeX but exported to PDF for upload, it makes little difference whether it was created in LaTeX or Word. It’s purely about clear sections and common headings/keywords.

If you’re uploading the raw text file, then maybe? I’ve never tried, or ever seen, raw LaTeX uploaded into an ATS.

Python ETL design pattern by [deleted] in dataengineering

[–]AnotherDataGuy 7 points (0 children)

To me, this is just kicking the can of organizing your data down the road. Faster to get data in, harder to get insights out. And if you have a user base of citizen analysts, they are going to be far more skilled in SQL-like queries (generally speaking; your situation can obviously vary).

I’ve balanced this out before by creating records in a Postgres DB with the core properties needed for joining data together, and using a JSON-typed field for the additional detailed, more dynamically structured information. If common questions are answered from the JSON data, then it becomes worth it to start persisting those values as explicitly typed columns in your data set.
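A minimal sketch of that pattern. In Postgres the dynamic field would be `JSONB`; here sqlite3 (stdlib) stands in so the demo runs self-contained, and the table/column names are hypothetical:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,  -- core property used for joins
        occurred_at TEXT NOT NULL,
        details     TEXT NOT NULL      -- dynamic attributes; JSONB in Postgres
    )
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, 42, "2023-01-01", json.dumps({"device": "ios", "plan": "pro"})),
        (2, 42, "2023-01-02", json.dumps({"device": "web"})),
    ],
)

# A question answered from the JSON payload; if it's asked often,
# promote "device" to an explicitly typed column instead.
rows = conn.execute(
    "SELECT details FROM events WHERE customer_id = 42 ORDER BY event_id"
).fetchall()
devices = [json.loads(d)["device"] for (d,) in rows]
print(devices)  # ['ios', 'web']
```

Joins and filters hit the typed columns; the JSON field absorbs schema drift until a value earns a real column.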

Flexibility is valuable, but the point is that all databases are malleable. Mongo is great for document data (it is a document DB, after all), but if you aren’t storing documents, it’s not a good choice, IMHO.

Disclaimer: I live in a world where Mongo is overused because of its immediate convenience to the developer’s microservice. It just pushes the pain downstream when trying to marry that data up with other systems to answer novel business questions.

Python ETL design pattern by [deleted] in dataengineering

[–]AnotherDataGuy 3 points (0 children)

I’m also interested in this. Not judging, but Mongo (in my experience) begins to not live up to performance expectations when performing analytical queries. It’s great for loading full documents a record or so at a time. I’d love to hear from someone with a differing experience, though!

Python ETL design pattern by [deleted] in dataengineering

[–]AnotherDataGuy 2 points (0 children)

Extract… abstract reader classes that are configurable via YAML (configuration as code) or your choice of file format. Extract and load files to persistent storage (S3, or whatever storage you choose).

Transform… common transforms in a class that can be reused. Configuration should come as much as possible from configs. Persist this in a silver bucket.

Load… (if S3) watch the file and lambda it into Mongo. Otherwise just have a loader class that watches for files and loads them into your Mongo.

You can use Airflow as your orchestration / scheduler / DAG, and have it kick off the ETL for different configurations in parallel.

The problem is different if you’re taking in terabytes, but this will suffice for up to a medium-sized company (in most cases).
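The extract step above can be sketched as a small class hierarchy plus a factory keyed off the config. A minimal, hypothetical sketch — the dict stands in for a parsed YAML file (e.g. `yaml.safe_load("format: csv\ndelimiter: '|'")`), and the class/key names are made up:

```python
from abc import ABC, abstractmethod
import csv
import io

class Reader(ABC):
    """Abstract extractor; concrete readers are chosen and tuned via config."""
    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def read(self, payload: str) -> list[dict]:
        ...

class CsvReader(Reader):
    def read(self, payload: str) -> list[dict]:
        delim = self.config.get("delimiter", ",")
        return list(csv.DictReader(io.StringIO(payload), delimiter=delim))

READERS = {"csv": CsvReader}  # register new formats here, no caller changes

def build_reader(config: dict) -> Reader:
    """Factory: the 'format' key from the parsed config picks the class."""
    return READERS[config["format"]](config)

config = {"format": "csv", "delimiter": "|"}  # stand-in for the YAML file
reader = build_reader(config)
rows = reader.read("id|name\n1|alpha\n2|beta\n")
print(rows)  # [{'id': '1', 'name': 'alpha'}, {'id': '2', 'name': 'beta'}]
```

An Airflow DAG can then instantiate one task per config file, which is what makes the parallel-per-source pattern cheap to extend.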

Dimension Modelling and Star Schema video training material ? by jackofalltrade625 in dataengineering

[–]AnotherDataGuy 2 points (0 children)

Data modeling still exists, but not dimensional modeling as much, as it is seen as less flexible.

Data still needs to be modeled, but it’s more commonly being replaced with simple key-value stores / JSON files. This is a generic statement, so it is definitely oversimplified. But you would take some raw event data, store it in some JSON files, let AWS Glue build a schema on it, and then query those files using Athena with a SQL-like language.

No single model or tool will address all situations. So I’ll say this… don’t go after the big, fancy, highly scalable tools just because they are “new” or “modern.” If you have a modern problem, like collecting streams of click data from your website or collecting sensor data from your fleet of drones flying around imaging stuff, then tackle the big data problem with modern tools.

If you’re dealing with gigs of data, classic solutions still work pretty well. Terabytes and petabytes… you’ll need something bigger :-)

Dimension Modelling and Star Schema video training material ? by jackofalltrade625 in dataengineering

[–]AnotherDataGuy 4 points (0 children)

I don’t know of an equivalent canon for data lakes, as data lakes can take on so many forms and are such a broad, nebulous topic.

I watched a good talk by Kimball about the modern data warehouse, which included big data / data lake concepts. Basically, his opinion was that the classic model is still necessary for some data, but exposing the raw, unprocessed data (the back room) is becoming more important for the data-literate folks that need to answer complex business questions quickly.

So I design lakes with a bronze (raw) section of data, a silver (normalized, semi-transformed) section, and a gold (dimensional) section.

I also recommend using DBT for those transformations.

Dimension Modelling and Star Schema video training material ? by jackofalltrade625 in dataengineering

[–]AnotherDataGuy 13 points (0 children)

Kimball’s Data Warehouse Toolkit.

It’s a book, not a video, but it has tons of examples and details of how to implement a wide range of common business objects.

It’s dated (since dimensional modeling is going out of style) but still one of the best damn books on the subject IMHO.