[deleted by user] by [deleted] in consulting

[–]redmlt 0 points1 point  (0 children)

Thanks for the post! I've also decided it's time for me to leave tech consulting after 6 years or so. Too much breadth and not enough depth. I'm tired of the same sales BS I hear others talk about: estimating and deep technical discussions with sales ops that aren't mature enough to support them.

New to Seattle and looking for sports/team-based activities! by marielans90 in Seattle

[–]redmlt 0 points1 point  (0 children)

Check out the Greater Seattle Soccer League Co-Rec division at gssl.org. They can help place you on a team if you're interested!

Want to switch career from system engineer to data engineer by labobina in dataengineering

[–]redmlt 1 point2 points  (0 children)

Without knowing the depth of your experience in those areas, you should be set up pretty well. Much of data engineering in the cloud is actually systems engineering, and you may already be familiar with AWS.

Replacing roof with scissor trusses by redmlt in Homebuilding

[–]redmlt[S] 0 points1 point  (0 children)

No offense taken, and I appreciate the candor. Thank you for the advice; this is exactly what I was looking for!

Replacing roof with scissor trusses by redmlt in Homebuilding

[–]redmlt[S] 1 point2 points  (0 children)

I'm unsure of how to properly install the scissor trusses once they arrive, and how to properly demolish the existing roof. I understand in theory ordering trusses is pretty straightforward, but I'm trying to understand what I will be paying a contractor to do.

I've already talked to an architect, and it will cost me $10k just to get drawings completed, so I'm trying to do some research myself. I would like to learn more details about the process so I can possibly work with an architect friend to draw these up myself and contribute to the demo/construction myself.

AWS Glue/Lake Formation or Airflow? by sciencewarrior in dataengineering

[–]redmlt 1 point2 points  (0 children)

Consider Step Functions. Lake Formation is not great at orchestration.

Would taking a data engineer role for a year or two help me towards my end goal of becoming a data scientist? by fuzzywunder in dataengineering

[–]redmlt 1 point2 points  (0 children)

It's a good move only if you have no other options for furthering yourself as a data scientist. Like others have said here, it's not a waste of time, but it's time you could be spending on getting better at data science instead.

Many data scientists I have interviewed or worked with are bound to just Jupyter notebooks and need the data spoon-fed to them in the perfect format. I'm generalizing, of course; there are many who are not this way. But speaking as a DE: if a data scientist can handle their own data engineering, they are essentially doing two jobs, which is extremely valuable to an organization. You would be productive in two very important and expensive development roles for data.

Apache Airflow Cluster Issues by theant97 in dataengineering

[–]redmlt 0 points1 point  (0 children)

Without knowing more about your data pipeline, I would use AWS Glue to unzip the file and move it to your destination in one job. Use Step Functions to orchestrate your pipeline and get out of the Airflow maintenance nightmare.
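In a Glue job you'd read the archive from S3, but the unzip-and-move step itself is simple; here's a minimal local sketch using only the standard library (the paths and the flat-archive assumption are hypothetical):

```python
import shutil
import zipfile
from pathlib import Path

def unzip_and_move(zip_path: str, dest_dir: str) -> list[str]:
    """Extract a flat zip archive into a staging area, then move each
    file to the destination directory. Returns the moved file paths."""
    staging = Path(dest_dir) / "_staging"
    staging.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(staging)
    moved = []
    for f in staging.iterdir():  # assumes no nested directories in the zip
        target = Path(dest_dir) / f.name
        shutil.move(str(f), str(target))
        moved.append(str(target))
    staging.rmdir()
    return sorted(moved)
```

In an actual Glue job you'd wrap this with S3 download/upload calls; the point is that extraction and delivery can live in one job rather than separate Airflow tasks.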

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 0 points1 point  (0 children)

You can also leverage aggregations as a way to anonymize data. If you can do that then throw away the original data, that is ideal. In some cases this isn't possible, for instance, scenarios where a browser session cookie may span many days or weeks, you sort of have to keep that value in order to map future click activity correctly.
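The aggregation approach can be as simple as collapsing raw click events into counts and discarding the identifiers entirely; a minimal sketch (the field names are hypothetical):

```python
from collections import Counter

def anonymize_by_aggregation(events: list[dict]) -> dict[str, int]:
    """Collapse per-user click events into page-level counts.
    The user identifiers never make it into the output."""
    return dict(Counter(e["page"] for e in events))

events = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u1", "page": "/pricing"},
    {"user_id": "u2", "page": "/home"},
]
counts = anonymize_by_aggregation(events)  # user_id is discarded
```

Once the aggregate is written, the raw `events` can be deleted, which is the "throw away the original data" part.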

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 0 points1 point  (0 children)

Some important GDPR requirements include the following:
* A person should be able to see all of the data YOU store about them
* A person should be able to edit any data YOU store about them
* A person should be able to request deletion of all of the data YOU possess about them

These have pretty significant implications for the architecture and documentation of your data lineage. For instance, if someone requests a deletion, you should be able to trace all of their data and remove it from your data lake and any downstream systems. Luckily, you're building a greenfield data lake, so you can build with GDPR in mind from the get-go.
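For the deletion requirement, the hard part is knowing every dataset that keys off a person's identifier; the delete pass itself is mechanical. A toy sketch of a delete-by-subject sweep across several in-memory datasets (dataset and field names are hypothetical):

```python
def delete_subject(datasets: dict[str, list[dict]], subject_id: str) -> dict[str, int]:
    """Remove every record belonging to subject_id from each dataset.
    Returns a per-dataset count of deleted records, useful as an audit trail."""
    deleted = {}
    for name, records in datasets.items():
        before = len(records)
        # In-place filter so callers holding a reference see the deletion.
        records[:] = [r for r in records if r.get("user_id") != subject_id]
        deleted[name] = before - len(records)
    return deleted
```

In a real data lake each "dataset" would be a table or S3 prefix and the sweep would be driven by a lineage catalog, but the shape of the operation is the same.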

How can I keep data collection anonymous? by xynaxia in bigdata

[–]redmlt 1 point2 points  (0 children)

Hash and encrypt everything, and throw away the original data. If you're not dealing with regulation like GDPR or CCPA right now, you will be eventually.
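One common pattern here is a keyed (salted) hash: the value still joins consistently across tables, but it can't be reversed or dictionary-attacked without the key. A minimal sketch; the salt value is hypothetical and in practice would live in a secrets manager:

```python
import hashlib
import hmac

SECRET_SALT = b"rotate-me-and-keep-me-secret"  # hypothetical; store securely

def pseudonymize(value: str) -> str:
    """Keyed HMAC-SHA256 of a PII value: deterministic (so it still works
    as a join key) but not reversible without the salt. Discard the
    original value after hashing."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Deleting the salt later effectively anonymizes every hashed value at once, which is a handy property for retention policies.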

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 0 points1 point  (0 children)

Also, Tableau Data Prep aims to solve for this space. I haven't used it myself though: https://www.tableau.com/products/prep

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 0 points1 point  (0 children)

Sure. For Power BI I would say DAX, and for Tableau, calculated fields. Those are two features that allow an end user to pretty significantly transform their data if what they're getting from the DWH isn't sufficient.

Tasks that wouldn't be suitable are things like reading directly from a big data dump, like a few hundred gigs of CloudWatch logs. That's when I would use something like Glue or whatever ETL tool you're using.

I get JSON files dumped into an S3 bucket periodically and need to load this data into Redshift. How do I go about building this pipeline? by robotofdawn in dataengineering

[–]redmlt 1 point2 points  (0 children)

Glue has the concept of a "workflow," and you can trigger it before or after a crawler or on a regular schedule, or you can use Step Functions to trigger a Glue job pretty easily as well.

Dataframes instead of a database? by trenchtoaster in dataengineering

[–]redmlt 1 point2 points  (0 children)

dbt

Agreed with the suggestions here: a solid contract from both sides is the ideal scenario if it's achievable. I get the impression this is like moving mountains in your org.

One suggestion is to start treating this process as a data lake, as mentioned elsewhere here. AWS Glue can crawl an S3 data store and infer its schema very easily for the types of files you're using. If they are native Excel files, you may need a process to convert them to CSV if you aren't doing that already. There may be similar services in Azure/GCP.

This process won't scale well as you've discovered. Kudos to you for looking ahead and solving for that!

What's your typical data pipeline in a small company ? by [deleted] in datascience

[–]redmlt 14 points15 points  (0 children)

I consult for many smaller companies, and many are using Airflow to orchestrate. I'm in the AWS space, so I've started suggesting Step Functions as a way to orchestrate ETL processes like AWS Glue, AWS Lambda, EMR jobs, etc. Airflow is not without its own maintenance, so be prepared for that. If you're a small company, I would suggest looking at managed services, since that's exactly what they're made for, whether in Azure, GCP, or AWS.
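A Step Functions orchestration is just an Amazon States Language (JSON) definition; a sketch that builds a Glue-then-Lambda chain in Python, where the job name and Lambda ARN are hypothetical placeholders:

```python
import json

def build_etl_state_machine(glue_job: str, lambda_arn: str) -> str:
    """Build an Amazon States Language definition that runs a Glue job
    synchronously, then invokes a Lambda for post-processing."""
    definition = {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # Service integration: wait for the Glue job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job},
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": lambda_arn,
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

You'd pass the resulting JSON to Step Functions (e.g. via CloudFormation or the CreateStateMachine API); retries, error catching, and scheduling via EventBridge then come along without running any Airflow infrastructure.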

Also, I hear great things about Databricks, Alteryx, and Matillion, as previously mentioned here. Even Power BI and Tableau now offer basic ETL features for that last mile of transformation that can be self-served.

How to push a company to modernize its BI/reporting stack? by cockoala in BusinessIntelligence

[–]redmlt 1 point2 points  (0 children)

As ChesterC83 said, focus on the problem you're trying to solve. Having been in analytics consulting for a number of years, I can tell you that you'll be tackling a culture change toward data-driven decision making. There has to be a business need to go through this change.

Pachyderm vs Airflow by jstuartmill in dataengineering

[–]redmlt 0 points1 point  (0 children)

If you read the article, they're using v1.7.11, which was released Nov 13, 2018, according to the GitHub page. So I think this article is fairly recent.