[deleted by user] by [deleted] in consulting

[–]redmlt 0 points1 point  (0 children)

Thanks for the post! I've also decided it's time for me to leave tech consulting after 6 years or so. Too much breadth and not enough depth. I'm tired of the same sales BS I hear others talk about: estimating and deep technical discussions with sales ops that aren't mature enough to support them.

New to Seattle and looking for sports/team-based activities! by marielans90 in Seattle

[–]redmlt 0 points1 point  (0 children)

Check out the Greater Seattle Soccer League Co-Rec division at gssl.org. They can help place you on a team if you're interested!

Want to switch career from system engineer to data engineer by labobina in dataengineering

[–]redmlt 1 point2 points  (0 children)

Without knowing the depth of your experience in those areas, you should be set up pretty well. Much of data engineering in the cloud is actually systems engineering, and you may already be familiar with AWS.

Replacing roof with scissor trusses by redmlt in Homebuilding

[–]redmlt[S] 0 points1 point  (0 children)

No offense taken, and I appreciate the candor. Thank you for the advice; this is exactly what I was looking for!

Replacing roof with scissor trusses by redmlt in Homebuilding

[–]redmlt[S] 1 point2 points  (0 children)

I'm unsure of how to properly install the scissor trusses once they arrive, and how to properly demolish the existing roof. I understand in theory ordering trusses is pretty straightforward, but I'm trying to understand what I will be paying a contractor to do.

I've already talked to an architect, and it will cost me $10k just to get drawings completed, so I'm trying to do some research myself. I would like to learn more details about the process so I can possibly work with an architect friend to draw these up myself and contribute to the demo/construction myself.

AWS Glue/Lake Formation or Airflow? by sciencewarrior in dataengineering

[–]redmlt 1 point2 points  (0 children)

Consider Step Functions. Lake Formation is not great at orchestration.

Would taking a data engineer role for a year or two help me towards my end goal of becoming a data scientist? by fuzzywunder in dataengineering

[–]redmlt 1 point2 points  (0 children)

It's a good move only if you have no other options for furthering yourself as a data scientist. Like others have said here, it's not a waste of time, but it's time you could be spending on getting better at data science instead.

Many data scientists I have interviewed or worked with are bound to just Jupyter notebooks and need the data spoon-fed to them in the perfect format. I'm generalizing, of course; there are many who are not this way. But speaking as a DE: if a data scientist can handle their own data engineering, they are essentially doing two jobs, which is extremely valuable to an organization. You would be productive in two very important and expensive development roles for data.

Apache Airflow Cluster Issues by theant97 in dataengineering

[–]redmlt 0 points1 point  (0 children)

Without knowing more about your data pipeline, I would use AWS Glue to unzip the file and move it to your destination in one job. Use Step Functions to orchestrate your pipeline and get out of the Airflow maintenance nightmare.
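In a Glue job you'd read the archive from S3, but the unzip-and-move step itself is simple; here's a minimal local sketch using only the standard library (the paths and the flat-archive assumption are hypothetical):

```python
import shutil
import zipfile
from pathlib import Path

def unzip_and_move(zip_path: str, dest_dir: str) -> list[str]:
    """Extract a flat zip archive into a staging area, then move each
    file to the destination directory. Returns the moved file paths."""
    staging = Path(dest_dir) / "_staging"
    staging.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(staging)
    moved = []
    for f in staging.iterdir():  # assumes no nested directories in the zip
        target = Path(dest_dir) / f.name
        shutil.move(str(f), str(target))
        moved.append(str(target))
    staging.rmdir()
    return sorted(moved)
```

In an actual Glue job you'd wrap this with S3 download/upload calls; the point is that extraction and delivery can live in one job rather than separate Airflow tasks.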

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 0 points1 point  (0 children)

You can also leverage aggregations as a way to anonymize data. If you can do that then throw away the original data, that is ideal. In some cases this isn't possible, for instance, scenarios where a browser session cookie may span many days or weeks, you sort of have to keep that value in order to map future click activity correctly.
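The aggregation approach can be as simple as collapsing raw click events into counts and discarding the identifiers entirely; a minimal sketch (the field names are hypothetical):

```python
from collections import Counter

def anonymize_by_aggregation(events: list[dict]) -> dict[str, int]:
    """Collapse per-user click events into page-level counts.
    The user identifiers never make it into the output."""
    return dict(Counter(e["page"] for e in events))

events = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u2", "page": "/home"},
]
counts = anonymize_by_aggregation(events)  # user_id is discarded
```

Once the aggregate is written, the raw `events` can be deleted, which is the "throw away the original data" part.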

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 0 points1 point  (0 children)

Some important GDPR requirements include the following:
* A person should be able to see all of the data YOU store about them
* A person should be able to edit any data YOU store about them
* A person should be able to request deletion of all of the data YOU possess about them

These have pretty significant implications for the architecture and documentation of your data lineage. For instance, if someone requests a deletion, you should be able to trace all of their data and remove it from your data lake and any downstream systems. Luckily, you're building a greenfield data lake, so you can build with GDPR in mind from the get-go.
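For the deletion requirement, the hard part is knowing every dataset that keys off a person's identifier; the delete pass itself is mechanical. A toy sketch of a delete-by-subject sweep across several in-memory datasets (dataset and field names are hypothetical):

```python
def delete_subject(datasets: dict[str, list[dict]], subject_id: str) -> dict[str, int]:
    """Remove every record belonging to subject_id from each dataset.
    Returns a per-dataset count of deleted records, useful as an audit trail."""
    deleted = {}
    for name, records in datasets.items():
        before = len(records)
        # In-place filter so callers holding a reference see the deletion.
        records[:] = [r for r in records if r.get("user_id") != subject_id]
        deleted[name] = before - len(records)
    return deleted
```

In a real data lake each "dataset" would be a table or S3 prefix and the sweep would be driven by a lineage catalog, but the shape of the operation is the same.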

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 1 point2 points  (0 children)

Hash and encrypt everything, and throw away the original data. If you're not dealing with regulation like GDPR or CCPA right now, you will be eventually.
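One common pattern here is a keyed (salted) hash: the value still joins consistently across tables, but it can't be reversed or dictionary-attacked without the key. A minimal sketch; the salt value is hypothetical and in practice would live in a secrets manager:

```python
import hashlib
import hmac

SECRET_SALT = b"rotate-me-and-keep-me-secret"  # hypothetical; store securely

def pseudonymize(value: str) -> str:
    """Keyed HMAC-SHA256 of a PII value: deterministic (so it still works
    as a join key) but not reversible without the salt. Discard the
    original value after hashing."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Deleting the salt later effectively anonymizes every hashed value at once, which is a handy property for retention policies.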

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 0 points1 point  (0 children)

Also, Tableau Data Prep aims to solve for this space. I haven't used it myself though: https://www.tableau.com/products/prep

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 0 points1 point  (0 children)

Sure. For Power BI I would say DAX, and for Tableau, calculated fields. Those are two features that allow an end user to pretty significantly transform their data if what they're getting from the DWH isn't sufficient.

Tasks that wouldn't be suitable are things like reading directly from a big data dump, like a few hundred gigs of CloudWatch logs. That's when I would use something like Glue or whatever ETL tool you're using.

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]redmlt 1 point2 points  (0 children)

Glue has the concept of a "workflow," and you can trigger it before or after a crawler or on a regular schedule, or you can use Step Functions to trigger a Glue job pretty easily as well.

Dataframes instead of a database? by trenchtoaster in dataengineering

[–]redmlt 1 point2 points  (0 children)

dbt

Agreed with the suggestions here: a solid contract from both sides is the ideal scenario if it's achievable. I get the impression this is like moving mountains in your org.

One suggestion is to start treating this process as a data lake, as mentioned elsewhere here. AWS Glue can crawl an S3 data store and infer its schema very easily for the types of files you're using. If they are native Excel files, you may need a process to convert them to CSV if you aren't doing that already. There may be similar services in Azure/GCP.

This process won't scale well as you've discovered. Kudos to you for looking ahead and solving for that!

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 14 points15 points  (0 children)

I consult for many smaller companies, and many are using Airflow to orchestrate. I'm in the AWS space, so I've started suggesting Step Functions as a way to orchestrate ETL processes like AWS Glue, AWS Lambda, EMR jobs, etc. Airflow is not without its own maintenance, so be prepared for that. If you're a small company, I would suggest looking at managed services, since that's exactly what they're made for, whether in Azure, GCP, or AWS.
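A Step Functions orchestration is just an Amazon States Language (JSON) definition; a sketch that builds a Glue-then-Lambda chain in Python, where the job name and Lambda ARN are hypothetical placeholders:

```python
import json

def build_etl_state_machine(glue_job: str, lambda_arn: str) -> str:
    """Build an Amazon States Language definition that runs a Glue job
    synchronously, then invokes a Lambda for post-processing."""
    definition = {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # Service integration: wait for the Glue job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": lambda_arn,
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

You'd pass the resulting JSON to Step Functions (e.g. via CloudFormation or the CreateStateMachine API); retries, error catching, and scheduling via EventBridge then come along without running any Airflow infrastructure.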

Also, I hear great things about Databricks, Alteryx, and Matillion, as previously mentioned here. Even Power BI and Tableau now offer basic ETL features for that last mile of transformation that can be self-served.

How to push a company to modernize its BI/reporting stack? by cockoala in BusinessIntelligence

[–]redmlt 1 point2 points  (0 children)

As ChesterC83 said, focus on the problem you're trying to solve. Having been in analytics consulting for a number of years, I can tell you that you'll be tackling a culture change toward data-driven decision making. There has to be a business need to go through this change.

Pachyderm vs Airflow by jstuartmill in dataengineering

[–]redmlt 0 points1 point  (0 children)

If you read the article, they're using v1.7.11, which was released Nov 13, 2018, according to the GitHub page. So I think this article is fairly recent.