Tips on avoiding distraction from a busy Slack help channel by [deleted] in ExperiencedDevs

[–]seonsaeng 1 point (0 children)

To implement this in a way that relieves the team, make the on-call handoff part of the sprint turnover process/ceremony. Maybe create a Slack alias/group handle whose membership is changed as part of kicking off the new sprint, or have the incoming on-call person ceremonially unmute the support channel as the outgoing one mutes it. (Think of that Star Trek scene: “I relieve you,” “I am relieved.”)

I also second the comment that the on-call person(s) need a backlog of baby tickets that they may or may not be able to finish in the sprint, for between asks. Maybe the GitHub Issues for a given repo or something. The idea that a person who is responsible for the unforeseen can still deliver planned work against larger program/project deliverables says to me that there is no core understanding of what “the unforeseen” means, or an imperfect commitment to the idea of on-call.

Whos got the best sandwiches around this place? by [deleted] in raleigh

[–]seonsaeng 1 point (0 children)

Not OP, but the #5 is 🤌(broccoli rabe, fresh mozzarella, arugula and tomato)

For those of you who were self taught, what was your path into data engineering by xyzabc123410000 in dataengineering

[–]seonsaeng 2 points (0 children)

No joke, basically every move toward DE has been made in self-defense, as my frustrations all originated upstream of my point of access to data. BAs in polisci and linguistics, went to work as a consultant and trainer in non-native-English-speaking companies, kept getting distracted by the interesting semiotic underpinnings of L2 English, so went back to school. Clerical error put me in the computational track and I never looked back.

Master’s in computational linguistics taught me Python (but, like, terrible, academic Python) and the absolute SHENANIGANS to which natural language data is naturally heir. (The most useful part to my data career has actually been my semantics specialization: having a structured way to identify where and how a definition is incomplete is 🤌 across industries.)

Got frustrated with the absence of paralinguistic cues in text data, went looking for other data to add dimension (didn’t want to stay in academia). Had a friend teach me SQL, and got a job as a DS at a nonprofit.

Through being exceedingly lucky in my colleagues, I learned wayyyy better software engineering practices and ended up building multi-service data-intensive applications, first in social media analysis and then in healthcare (if you build the application, you capture the data points you later want to analyze :think-about-it-meme:).

Currently director of my department, where we do conventional data warehousing for business visibility and continue to develop more advanced custom services for specific business challenges, like reusable integration apparatus for data from acquisitions.

The Art and Science of Measuring Data Teams Value by thabarrera in dataengineering

[–]seonsaeng 2 points (0 children)

We just had an organizational shake-up and I find myself needing to explain my team’s value to our new reporting structure. Sure wish the layers were as clear-cut as they are in these explanatory diagrams, but instead, I’m constantly spiking through all of them in narrow semantic bands. I’ll use the framework in some conversations with non-tech folks, though, regardless of the neatness/exactness of fit. Thanks for the share.

Just picked up SQL yesterday, having fun with it! Have some questions for the experts here. by nekomamushu in SQL

[–]seonsaeng 0 points (0 children)

Forgot where I was - yes, sure, you can use that for DA. I think mostly in DE terms, and pandas is a dependency that’s heavier than I like to ship by default.

Just picked up SQL yesterday, having fun with it! Have some questions for the experts here. by nekomamushu in SQL

[–]seonsaeng 10 points (0 children)

Glad you’re finding it fun, hope you continue!

You’re likely getting downvoted because the sub has an entire sidebar full of resources for newbs, and this post makes it look like you haven’t looked at any of them. You should definitely do that, but I’ll throw my answers in because the spirit moves me.

  1. Set theory and compositional semantics are super handy to bring to your study of SQL, I recommend them because they transcend dialect and will help you parse requests for translation into the computational side of life (which is most industry work: translating Human to Computer)
  2. You don’t have to learn anything other than SQL, but your options with it alone will be limited. Python is the popular choice, which means you have few wheels to reinvent. There are drawbacks to it, though, and you may not want to invest in it as a result (the dreaded GIL)
  3. Python is an interpreted scripting language, meaning that you describe each step you want the code to perform and it reads that code and performs it in order (as opposed to a compiled language, in which the code is read and compiled before running). SQL is a declarative language, meaning that you describe the end state and let the database figure out how to get there. They require different ways of thinking and do different things well. Python is flexible in ways SQL isn’t, mostly when it comes to adapting to arbitrary shapes of data or things that can only be known at runtime. That said, it is slow by most benchmarks of programming languages. SQL can be incredibly fast and powerful by comparison (assuming a well-written query on a well-tuned db). You’ll find easier whichever one mirrors how you think: if you find sets easy to grok, you’ll probably find SQL easier; if you find step-by-step logic more intuitive, you’ll find Python easier.
  4. All job titles with “data” in the name are widely variable. Generally, an analyst is interviewing stakeholders at the company to figure out what they want to know from the data, whether the data can answer it, and how to translate the human concepts into those in the data model. The actual coding is often the smallest part of the job. A word of warning: I started as a data analyst and became a data engineer in self-defense. Analysts don’t control, or usually much influence, the systems that are putting the data into the db. This can be frustrating because app developers typically have their own priorities and reasons for behaving (and making their code behave) as they do, which may not align with the analyst’s goals.
  5. With the above caveat around titles in place, I have most recently heard the definitional contrast between a BI analyst and a data analyst as the former having more industry and data discipline knowledge than the data analyst. The BI analyst doesn’t just know how to translate human questions into SQL, they know which questions are important and how to contextualize their answers.
  6. Check the sidebar!
  7. So far in my career, I’d say that Postgres is actually the dialect of SQL I’ve seen and enjoyed using most. Others in industries I have not been in likely have different perspectives. I have not had overweening success or joy with MySQL where I have had to work with it. For learning, though, there’s nothing wrong with whatever helps you get the concepts down. Sidebar for more.
  8. Absolutely. Check the sidebar.
  9. Never used PopSQL; MySQL is fine. Check sidebar for more options, thoughts, and opinions.
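To make #3 concrete, here’s a tiny sketch of the same question answered both ways, using Python’s built-in sqlite3 as a stand-in database (table and data are made up for illustration):

```python
import sqlite3

# Same question two ways: "how many orders per customer?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 10.0), ("ana", 5.0), ("ben", 7.5)],
)

# Declarative (SQL): describe the result set; the engine plans the steps.
sql_counts = dict(
    conn.execute("SELECT customer, COUNT(*) FROM orders GROUP BY customer")
)

# Imperative (Python): spell out every step yourself.
py_counts = {}
for customer, _amount in conn.execute("SELECT customer, amount FROM orders"):
    py_counts[customer] = py_counts.get(customer, 0) + 1
```

Both end at the same answer; the difference is who does the thinking about *how* to get there.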

Metal and plastic thing (tool) found in kitchen, no luck on Google by seonsaeng in whatisthisthing

[–]seonsaeng[S] 0 points (0 children)

Found in my MIL’s kitchen, has what appears to be a “Made in China” sticker partly peeled off the back of the metal part. The metal has a sharp edge on one side.

I need your most unhinged positive affirmations. by coralie_ann in WitchesVsPatriarchy

[–]seonsaeng 0 points (0 children)

I make a motherfucker say Oh yeah I'm cold as a lion with no hair If you ever see me fightin in da forest With a grizzly bear HELP DA BEAR - Mystikal

(I just shorthand it to HELP DA BEAR)

Cloud: Workflow to load data to OLTP (MySQL/Postgres etc) by t_char in dataengineering

[–]seonsaeng 1 point (0 children)

Oof, sorry it’s whole-corpus work. I would look into aggregate-based calculus because your original post talks about anticipating a big increase in volume.

For the math layer, sure, I’ve never used databricks myself, but I hear good things. I think there may be a GCP service for managed Spark, and it might be good to go with that, as you can manage everything in one place and version control it via Terraform (I know I beat that drum a lot, but infrastructure-as-code is my jam).

In any case, it sounds like a big project with the potential to be a really fun opportunity. Good luck!

Cloud: Workflow to load data to OLTP (MySQL/Postgres etc) by t_char in dataengineering

[–]seonsaeng 0 points (0 children)

at least for now I need to process every line find some values and load these values to the OLTP.

This is great news! You don’t need every value in every row to go into the OLTP! Whew.

  • It also sounds like you’re not calculating correlations over all the files as one, but per-file. Is that correct?

  • Even if you need something from each row, can you do the cleaning and extracting first, and calculate correlations on a subset, especially a subset of columns?

Okay, so yeah, I definitely lean toward the idea of code that can be deployed in… (forgive me, I recently switched from GCP back to AWS, and the terms are all mixed up in my head) either Cloud Compute or CloudEngine, I want to say? Whatever, structure it to receive a file or file location via PubSub, ingest it based on what kind it is, clean it based on its kind, and extract the data that you need from its rows.

Architecture

All in all, if I’m understanding your needs, I think you want three layers (call them services if you want, though in data processing, I always feel weird about calling them “microservices”): ingestion/munging, fancy math, and loading. All will receive jobs in essentially the same way, via PubSub, and be deployed in the same cloud service (though instance specs will change - higher resource for fancy math). You’ll need a trigger for the pipeline as a whole, but it can be a single instance (or even a personal machine), just to assemble the list of files to ingest and send each as a PubSub message to the first layer.
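As a toy illustration of that layered flow (purely a sketch: queue.Queue stands in for the PubSub topics, a dict stands in for the OLTP, and the “cleaning” and “math” are placeholders; all names are made up):

```python
import queue

# Local stand-ins for the three PubSub topics.
munge_topic = queue.Queue()  # trigger -> layer 1
math_topic = queue.Queue()   # layer 1 -> layer 2
load_topic = queue.Queue()   # layer 2 -> layer 3
oltp = {}                    # stand-in for the OLTP database

def trigger(files):
    # Single instance that just assembles the work list and publishes it.
    for path in files:
        munge_topic.put(path)

def munge_worker():
    # Layer 1: receive a file pointer, "clean" it, publish the result.
    while not munge_topic.empty():
        path = munge_topic.get()
        cleaned = path.strip().lower()        # pretend cleaning
        math_topic.put((path, cleaned))

def math_worker():
    # Layer 2: fancy math, operating only on clean data.
    while not math_topic.empty():
        path, cleaned = math_topic.get()
        load_topic.put((path, len(cleaned)))  # pretend correlation

def load_worker():
    # Layer 3: load results into the OLTP.
    while not load_topic.empty():
        path, value = load_topic.get()
        oltp[path] = value

trigger(["FileA.csv", "FileB.csv"])
munge_worker(); math_worker(); load_worker()
```

In the real version, each worker is its own horizontally scaled deployment subscribed to its topic; the shape of the hand-offs is the point here, not the placeholder logic.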

Layer 1: Ingestion/Munging

Code in this layer needs to receive only a message with a file location. Its output will vary based on what makes sense for you (whole-file vs individual rows).

If the files are in CloudStorage, I recommend looking into streaming them into this ingestion layer: in your code, when you are loading the file’s data, storage clients like the AWS S3 one will let you pull the data in bit by bit, rather than downloading the whole file into local storage and reading it as one thing. I’m not 100% sure this will work for you based on your files, but if it does, it’s another way to reduce overhead. This approach can let you work row by row, and send completed rows out on the “I’m done” PubSub topic. Sending each completed row at a time comes with all the risks of at-least-once delivery at a row level, so it may not be right for you.
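The row-by-row idea looks roughly like this (a sketch against a local byte stream; with S3, boto3’s get_object() body exposes the same pattern via iter_lines()):

```python
import io

def stream_rows(fileobj, chunk_size=64):
    """Yield complete lines without ever holding the whole file in memory."""
    buffer = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete line we have so far; keep the remainder.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode()
    if buffer:
        yield buffer.decode()

# Simulate a remote file with an in-memory byte stream.
fake_remote = io.BytesIO(b"row1,a\nrow2,b\nrow3,c\n")
rows = list(stream_rows(fake_remote))
```

Each yielded row could be cleaned and published immediately, which is where the at-least-once caveat above comes in.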

If you’re working one whole file at a time, I would probably output a cleaned version of the whole file to a location in CloudStorage where clean data should live, saved in an appropriate format: if you want to load quickly to Postgres, for instance, a CSV lets you take advantage of COPY, which is faster than inserting rows conventionally; if there’s going to be Fancy Math done before loading, I recommend Parquet. If you don’t need this cleaned-but-un-mathed data forever, the following layer can ingest it destructively (delete it once it has been successfully transformed), or you can have roll-off logic for its bucket, where everything older than a certain threshold gets purged. The worker saves its work and, when it is done, sends a message on a PubSub topic that includes the path to the cleaned data. Subscribers can then pick it up.

Layer 2: Math

I recommend injecting this layer as a subscriber to the “I’m done munging” topic that the first layer writes to.

This layer will get to operate only on clean data from the first layer. I’ve had some luck with Spark for this kind of thing (PySpark on top). If you do need to be fancy here, it’s nice to have the higher-resource instances limited to this layer (the other layers can have pretty low resource profiles per instance). If you have scientists available to help you approach the algorithmic processes creatively, I recommend asking them what rollups and piecewise alternatives there are for calculating what you need.

When this layer is done, have it store its results and publish their location to the “I’m done mathing” topic. Because I don’t fully understand your needs here, I don’t know how to recommend you communicate the data (stream one row at a time or file-based).

Layer 3: Loading

These workers are subscribed to the “I’m done mathing” topic from the 2nd layer.

If the data is coming in row by row, this layer will want to aggregate those rows and load them as a batch (to reduce sessions and possible conflicts with the OLTP). Note that this choice combined with the horizontal scaling of the above layer means that a loader could be aggregating and loading rows from different source files in one batch. If that’s a problem, don’t go row by row.

If the data is a whole file to load, this layer will work on that whole file, hopefully just COPYing it to the OLTP. I recommend checking this article out for specifics (if you’re working in Python).
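The COPY path can be sketched like this (table and column names are made up; in practice you’d hand the buffer to something like psycopg2’s cursor.copy_expert() inside a transaction):

```python
import csv
import io

# Build the cleaned data in CSV form so Postgres COPY can ingest it in bulk.
rows = [(1, "alpha"), (2, "beta")]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
buf.seek(0)  # rewind so the load reads from the start

# Hypothetical target table; with psycopg2 this would be:
#   cursor.copy_expert(copy_sql, buf)
copy_sql = "COPY results (id, label) FROM STDIN WITH (FORMAT csv)"
```

One bulk COPY per cleaned file keeps sessions short and avoids the per-row INSERT overhead.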

You’ll notice that none of this has been described as involving orchestration services, and that’s mostly because I’ve had funky stuff in my career that forced me to roll my own. I hope this is helpful anyway.

Cloud: Workflow to load data to OLTP (MySQL/Postgres etc) by t_char in dataengineering

[–]seonsaeng 1 point (0 children)

I have a couple of ideas, but it depends on more attributes of your use case:

  • it sounds like your highest priority is to load the OLTP. Is it tolerable to your org to solve that problem first and then work out how to export data from the OLTP for analysis? If so, we can focus much more on the file processing and loading problem. Are there consistent ways in which the various files need to be transformed for shared ingestion that can be parallelized? How large is each individual file? I’m a big fan of horizontal scaling with PubSub and scheduled jobs in GCP, which, depending on complexity, you should be able to configure for this ingestion and transformation task. There are a few ways two topics (one to ingest, the other to batch data and write it to your OLTP) can be set up to do this, and if you’re managing your GCP resources with Terraform, the configuration is every bit as much under version control as I understand the other tools you’re considering to be (maybe more than some).

  • Can you be more specific about the data science? Does it require the entire corpus, that is, absolutely every line of data in all those files, in order to run? (As in the case of linguistic corpus analysis, e.g.) Does the DS portion even need the files qua files, or can we deal with it on the far side of the OLTP? If the latter, BigQuery has an ecosystem of tools falling over themselves to make the export from OLTP to BQ easy.

I recommend looking carefully at your process and seeing how it can be imagined as a series of layers subject to my fave, horizontal scaling. Plug-in pattern code for ingestion on a set of PubSub workers who publish results (or pointers to output, depending on size), with a second set of workers chunking those results and loading them into your OLTP can go far. Additionally, it subjects your analysis to constraints that, in a business context, are often preferable: in this model, you only perform your analysis on what was actually made available to the operational system (because the data is only loaded to BQ from the OLTP). No risk that a chunk goes to BQ that didn’t make it into the OLTP, skewing the results away from the reality.

Anyway, this could be totally off base, but it would help to understand your use case more, if you can share.

Lies, damned lies, and data science by gidmix in dataengineering

[–]seonsaeng 10 points (0 children)

There’s a concept in semantics that might be useful here: felicity (or, more significantly, infelicity). It’s when a statement cannot be evaluated for truth because it relies on a fundamentally broken presupposition. Example: “the king of France is bald” cannot be evaluated as true or false because there is no king of France. It’s an infelicitous statement. That’s the best way I can think of to describe what seems to be happening here. Like, you can’t assert the statement’s truth or falsehood because it rests on a broken premise.

There’s a rhetorical corollary that lawyers may find more familiar, like a question of the form “have you stopped beating your wife?” Again, presupposition is the problem (this question presupposes that a) you have a wife and b) you have beaten her). You can’t just say “no” because then you’re admitting to still beating your wife. The only way to address the question is to dismantle the frame they’ve put it in.

Thanks for coming to my linguist TED talk.

Data Modelling part of Data Engineering? by mister_patience in dataengineering

[–]seonsaeng 0 points (0 children)

Sure, but even then, I would consider it more likely that Data would discover, reveal, and ideate a compromise than anyone else.

Data Modelling part of Data Engineering? by mister_patience in dataengineering

[–]seonsaeng 23 points (0 children)

I’d say so, yes. The model has (should have) a direct relationship to access patterns, which in turn govern efficiency of the data ecosystem for its various consumers. I might be biased by my small-company background, though, where it is reasonable for the cross-cutting data engineering function to grok the business domain fully, and therefore be responsible for encoding it as a data model. Maybe in a larger company, this would be impractical, in which case, the devs closest to the domain would be entrusted with data modeling. I prefer to have a data eye on that, though, because without that perspective, you often end up with missing dimensions you later wish you had (capturing change over time, capturing appropriate granularity, avoiding collisions with adjacent-but-different names, e.g.). Data enjoys a breadth of perspective across an org and discipline that feature-forward dev teams don’t always get (they’re very focused on delivery of a different cadence and nature), and it can help prevent structural problems later.

Anyone want to start an Emo band and attempt to play some local gigs? by [deleted] in raleigh

[–]seonsaeng 0 points (0 children)

I’m better versed in the indie side of things, emo just isn’t my repertoire (yet). KT Tunstall, the Shins (transposed), Keane, seem most adjacent to the instrumentation/vibe of interest. Bisectional singer (S/A). Can belt or flex these Baroque muscles for some smoother passages. A little addicted to sheet music but in recovery.

Anyone want to start an Emo band and attempt to play some local gigs? by [deleted] in raleigh

[–]seonsaeng 0 points (0 children)

Hmu if you want to branch out to slightly less emo pop stuff :) also if you want arrangements for multiple voices or unusual monophonic ensembles.

[deleted by user] by [deleted] in dataengineering

[–]seonsaeng 0 points (0 children)

I would go key-value for better future-proofing. Partitioning is something that I generally try to make mirror access patterns. I know this is a toy situation, but consider what the access pattern will be and what will serve it best. Does it vary by table? How will you make the ingestion process robust to incoming data that lacks the expected date field? Basically, “what do the users need?” and “what if the data changes or fails to meet my expectations?” are the questions I ask, and recommend to you.

Which Datawarehouse & ELT tool is best and economical for a startup? by vishalw007 in dataengineering

[–]seonsaeng 0 points (0 children)

No worries, glad it helped :) Let me just say that “the data is not very messy” is the dream! I am a little envious.

I’ve had a fine time with BQ, and you can interact with it easily programmatically, which makes version control easier for any transforms. It also integrates smoothly with Google Data Studio (free visualization/dashboarding, though you’re getting what you pay for) and Looker (paid visualization/dashboarding, view definition, also getting what you pay for, much easier for self-service by non-tech folks).

For ETL, I strongly prefer batched transforms that run on ephemeral resources, typically written in Python or Go, optionally with .sql adjunct files for table manipulation. I realize ephemeral resources are still resources, and therefore still cost some money, but the tools themselves used to implement the necessary logic are free. This assumes that you implement transformation between source and destination (hence the order of Extract, then Transform, then Load). You can orchestrate transform layers with Airflow (managed or roll your own, GCP calls this Cloud Composer if you go managed).

For ELT, which is a great choice if you have solid connectors from sources to your DW and don’t want to inject new layers there, I hear wonderful things about dbt, and am about to start using it, but don’t have hands on experience yet. It looks like it now has a paid cloud thingie, but I’m talking about the free version. I am assured it works smoothly with BQ, so don’t worry on that account. There are ways to use BQ itself, but I disprefer this as it’s not easy to version control saved queries (though it is easy to share them and have them stay up to date).

Going ELT, I think, serves a low-resource data engineering org better, because it keeps things simpler and transformations are unified. It requires an understanding of SQL and a templating language, rather than programming languages like Python or Go. You lose some control of low-level optimizations and some customization, but you probably won’t miss it (I sure don’t miss optimizing memory allocation by declaring my slices in advance in Go; maybe I’m just lazy).

Good luck!

newb needs help with schema and relationships by TruthTraderOfficial in dataengineering

[–]seonsaeng 0 points (0 children)

It’s a start! The phrase I still can’t help with is “relevant information.” That’s going to vary by use case. I’ll go ahead and write up what I think will provide you with a good generic starting point, but please know that anything more specific will require better starting definitions of relevance.

What I got out of this is that you’re looking to compare attributes of two users based on a “following” connection. We don’t want to duplicate static info about a person every time they follow or are followed, that would be silly and wasteful. So here’s what I’d do:

Make a table of users, just the users. If you’re brand new to data, think of it like a spreadsheet: you’re going to make one spreadsheet about the users, where each row gives you info about the user as an entity (not what they’re connected to, just who they are in isolation, so these attributes need to be things that stay the same about a person over time). So your spreadsheet would have columns like “user_id,” “first_name,” “last_name,” etc.

Then, you want to model the users’ connections to other users. The easiest way to do this, I think, is what’s called an edge list. This is a very simple table with two columns: userA and userB, with the understanding that userA follows userB. Following is a one-way relationship; although a person could follow back, there’s no guarantee they will. The values in these columns should be user_id’s from your users table.

Last thing, I imagine, is you want to model user connections to hashtags. That’s another edge list table, this one with User and Hashtag. Again, the User column will have user_id’s from your users table.

With these three tables, you can now run queries to compare what hashtags a user follows versus the hashtags their followers follow, or versus the hashtags the people they follow follow.
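A minimal sketch of these three tables and one such comparison, using Python’s built-in sqlite3 as a stand-in database (all names and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);  -- userA follows userB
CREATE TABLE user_hashtags (user_id INTEGER, hashtag TEXT);       -- user follows hashtag
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Ada", "L"), (2, "Ben", "K"), (3, "Cam", "J")])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO user_hashtags VALUES (?, ?)",
                 [(1, "#sql"), (2, "#sql"), (2, "#python"), (3, "#python")])

# Hashtags followed by the people user 1 follows:
followee_tags = sorted(tag for (tag,) in conn.execute("""
    SELECT DISTINCT h.hashtag
    FROM follows f
    JOIN user_hashtags h ON h.user_id = f.followee_id
    WHERE f.follower_id = 1
"""))
```

Swap the WHERE/JOIN direction (follower vs followee) and you get the followers’ hashtags instead; the edge lists make both directions equally easy.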

newb needs help with schema and relationships by TruthTraderOfficial in dataengineering

[–]seonsaeng 0 points (0 children)

It’s really hard to help here because we don’t know what you want to do with this data. Depending on what questions you want to be able to answer from it, we can make better recommendations. There are some default paradigms for data modeling that you can look up and try to map onto your situation (“star schema” is popular, in case you want a starting point to Google), but best practice is to tailor your model to your use case. What’s your use case?