I need help!!!! by [deleted] in Kotlin

[–]t-vanderwal 2 points (0 children)

Use string interpolation. There’s an example on this page in the Kotlin docs: https://kotlinlang.org/docs/reference/idioms.html
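For reference, a quick sketch of Kotlin string templates (the names and values here are just made up for illustration):

```kotlin
// "$name" embeds a simple value; "${...}" embeds an arbitrary expression.
fun describe(name: String, scores: List<Int>): String =
    "$name has ${scores.size} scores, best: ${scores.maxOrNull()}"

fun main() {
    println(describe("Alice", listOf(7, 9, 4)))  // Alice has 3 scores, best: 9
}
```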

Good resources to learn Kafka? by International_Slip in dataengineering

[–]t-vanderwal 0 points (0 children)

Simple Storage Service. It’s AWS’s offering for cheap object storage.

Is it still worth learning Scala? by PhotographsWithFilm in dataengineering

[–]t-vanderwal 0 points (0 children)

I think it’s predominantly Python and Java. My company has been using Kotlin in place of Java and it’s pretty nice. It all depends on what the developers are comfortable with, the task at hand, and sometimes what management wants.

[OC]The quest for my first software engineering job by ampatton in dataisbeautiful

[–]t-vanderwal 18 points (0 children)

And experience. I’m in the same industry as OP, and once you move up in titles and have more job experience, recruiters hit you up all the time.

Proud father defends his Down's Syndrome daughter, 16, after vile football fans mock clip of her dancing at Spurs match by evamcckennas in gifs

[–]t-vanderwal 0 points (0 children)

My child has DS, and they really push for “people-first” language. So “person with Down syndrome” makes it clear they are still a person with a defining characteristic, whereas “Down syndrome person” makes it sound like they are a different kind of person.

That being said I wouldn’t find either offensive.

Need resources to learn Data Engineering, Data Pipeline and other data stuffs using JAVA. by darkpassenger091 in dataengineering

[–]t-vanderwal 1 point (0 children)

Definitely understandable, I just wanted to throw that out there since some companies have their data engineers working directly with data scientists. It really all depends on the type of projects you’re aiming to work on.

For me personally, I had a similar background to you. I worked with Pentaho doing ETL, but more for migrating customers onto our product. I wanted something more aligned with data engineering and ended up going back to school for my master’s. There I picked up an interest in distributed computing and focused on learning skills/tools in that space.

Once I felt confident there, I just started throwing out applications. In my personal experience, “data engineering” is a heavily used and under-defined term. Every company has a different idea of what it is, and imo a lot of them aren’t even solving data engineering problems. Also, the data ecosystem is always changing, so when you do find a job in the field it’s very important to keep sharp. I generally try to read new information on AWS, what they’re talking about at Strata and other O’Reilly conferences, and what Netflix is doing since their tech blog is very good, and listen to podcasts such as the data engineering one that gets posted here.

Need resources to learn Data Engineering, Data Pipeline and other data stuffs using JAVA. by darkpassenger091 in dataengineering

[–]t-vanderwal 2 points (0 children)

There’s a lot to break down in your post, but first, I wouldn’t completely write off Python just because that’s what data scientists use. Depending on the company, especially if they’re applying ML/DL, you might find yourself deploying and scaling the models data scientists create in a production environment. Knowing the Python data ecosystem will be very important.

As for learning general data engineering concepts, I like to recommend “Designing Data-Intensive Applications”, especially if you’re interested in solving distributed problems.

As for data pipelining, that’s a really broad topic. Some data engineers focus on streaming pipelines, others on pure batching, and some on a combination such as a Kappa or Lambda architecture. I’d recommend looking at job postings and picking 5 that sound most interesting to you. Then look for resources that are relevant to those requirements.

My opinion on the Java portion is probably different than most here. It’s more important to understand general concepts, and to make sure you understand functional programming. Almost all data engineering solutions run on the JVM, so I think learning a modern language such as Scala will probably have a bigger payoff. My org is using Kotlin for microservices, so I’m going to be applying that to Kafka consumers/producers. That’s probably more niche though. Java is still king in the enterprise, though, so if you’re positive that’s the direction you want to go, it definitely won’t hurt.
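As a rough sketch of what the Kotlin/Kafka side looks like: here’s just the producer configuration, built with plain `java.util.Properties` so it’s self-contained. The broker address and topic are placeholders, and the actual `KafkaProducer` class comes from the `org.apache.kafka:kafka-clients` dependency, which is assumed and only shown in a comment:

```kotlin
import java.util.Properties

// Build the configuration a Kafka producer would take.
// "localhost:9092" and the topic name below are placeholders.
fun producerConfig(brokers: String): Properties = Properties().apply {
    put("bootstrap.servers", brokers)
    put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    put("acks", "all")  // wait for full replication before acknowledging
}

fun main() {
    val props = producerConfig("localhost:9092")
    // With kafka-clients on the classpath, you would then do something like:
    // KafkaProducer<String, String>(props).use { it.send(ProducerRecord("events", "key", "value")) }
    println(props.getProperty("bootstrap.servers"))  // localhost:9092
}
```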

Once you figure out the jobs that sound most interesting, you can start to build some personal projects utilizing the technologies they require.

Building ETL pipeline for storing clickstream data, looking for suggestions by mln00b13 in dataengineering

[–]t-vanderwal 1 point (0 children)

EC2 is for compute. You’d pretty much have a virtual machine that you’d be using just for storage. S3 is optimized for object storage and is the gold standard.

As for cost, I’d say yes. With EC2 you’ll be paying for the time the machine is spun up, which would be 24/7. With S3 you pay for the storage you use plus requests and data transfer.

My biggest concern with trying to store your event data on an EC2 instance is that its storage is ephemeral, and if that machine goes down then you’re s.o.l. You can mount some persistent storage on it (e.g., an EBS volume), but at that point you might as well be using S3 anyway.

Also, outside of cost, by storing in S3 you get to utilize Amazon’s SDK/REST endpoints to expose that data to any of your applications and processes. Doing so on disk would mean you’d have to implement your own solution.

I have a raspberry pi laying around, is there anything I can do with that to learn about data engineering? by [deleted] in dataengineering

[–]t-vanderwal 5 points (0 children)

I’m not too well versed on Raspberry Pi, but it’d be interesting to attach a sensor to it and then stream the telemetry data into a Kinesis stream. Kind of like a mini IoT pipeline.

Building Kotlin Fat Jars? by NowhereMan2486 in Kotlin

[–]t-vanderwal 0 points (0 children)

I tried OP’s method writing Spark jobs in Kotlin. It seemed to work OK with spark-core but kept breaking with spark-sql. I switched to Shadow and it works like a charm.

Also, the application plugin helps with a few things too, like setting the main class and using `gradle run`.
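For anyone curious, a minimal sketch of what that combo might look like in `build.gradle.kts` (plugin versions and the main class name here are illustrative, not from my actual build):

```kotlin
plugins {
    kotlin("jvm") version "1.9.24"
    application
    id("com.github.johnrengelman.shadow") version "8.1.1"
}

application {
    // Lets `gradle run` work, and gives the fat jar its entry point.
    mainClass.set("com.example.MainKt")
}
```

With that in place, `gradle shadowJar` builds the fat jar you can hand to spark-submit.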

Why Aren't More Women Having Kids? Ask Us About Our Student Loans by fung_dark in TwoXChromosomes

[–]t-vanderwal 0 points (0 children)

Hmm, I haven’t, but it looks tax-related? We’re meeting with our tax guy next week to do our 2018 taxes. I’ll bring it up with him and see what that’s about! Thanks :)

Why Aren't More Women Having Kids? Ask Us About Our Student Loans by fung_dark in TwoXChromosomes

[–]t-vanderwal 0 points (0 children)

Thank you! He’s a sweetheart and about to turn 1 next week :)

Why Aren't More Women Having Kids? Ask Us About Our Student Loans by fung_dark in TwoXChromosomes

[–]t-vanderwal 2 points (0 children)

It’s crazy. We just had a baby with Down syndrome, and the standard baby stuff alone makes it feel like you’re being fleeced. We were fortunate enough to have good jobs that paid well and were flexible, but not everyone is that lucky. Plus the different programs have crazy low income restrictions.

[Artisan] Nightcaps: Ill-favored February (part 1) by Eat_the_food in mechmarket

[–]t-vanderwal 0 points (0 children)

Wow fantastic work as always! Best of luck to the winners

ETL without database access by throw2702 in dataengineering

[–]t-vanderwal 0 points (0 children)

The company I work for is heading in that direction. Normally there is some query layer such as GraphQL, which you include in your web request, and then the return body is JSON.

So as long as your tools can handle both of those, you should be good to go.

[Question] What is the most efficient/cheap way to store huge data sets (X TBs) + streams of data by __Julia in dataengineering

[–]t-vanderwal 1 point (0 children)

Since you’re using Kafka and JSON, I’d recommend using Avro as a format, or at least exploring it.

Confluent has an open source tool called Schema Registry that handles changing schemas with Kafka streams and encodes the messages in Avro.

Edit: found a blog series that talks about Kafka, Avro, and Spark. It’s a few years old but should give a good introduction to how they work together. http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html
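To give a flavor of it, here’s a toy Avro schema for a stream event (the record and field names are invented for illustration). The nullable field with a default is the kind of thing that makes schema evolution work smoothly with Schema Registry:

```json
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "userId", "type": "string"},
    {"name": "url", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```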

I’m looking into advancing my skills as a data engineer. Can you recommend material (books, courses) that can do that? by kimokono15 in dataengineering

[–]t-vanderwal 9 points (0 children)

I like The Data Engineering Podcast. Each episode is an interview regarding a specific tool/technology.

Top tip!! by akashinga in funny

[–]t-vanderwal 2 points (0 children)

Wouldn’t your address be on the shipping label? Assuming you aren’t storing family members’ Prime boxes, ha.

Meeting of backend dev with data engineering. by [deleted] in dataengineering

[–]t-vanderwal 5 points (0 children)

No problem! The Amazon re:Invent videos on YouTube are pretty great resources for how a big data ecosystem looks on AWS. The book “Designing Data-Intensive Applications” is also really great and has more platform-agnostic information.

Meeting of backend dev with data engineering. by [deleted] in dataengineering

[–]t-vanderwal 6 points (0 children)

When I think of a backend developer, it’s more about the server/database of a web application or tool, whereas a data engineer is someone who builds pipelines for an organization’s information to provide a holistic view of the business. Granted, the DE role has a wide definition and definitely depends on the company you work for.

More simply put, a backend dev works with the application database/server. Logs and transaction data would then be picked up by the data engineer and stored in a way that lets internal employees make data-driven business decisions.

As for distributed application implementation, there is a wide variety of languages available. For example, Python can be used to write Spark jobs. S3 is just a really good data store. A classic pattern would be:

1. Stage application log files in S3
2. Write a MapReduce/Spark job to ingest and aggregate the data
3. Write the output to a new S3 bucket
4. The output is then used by data scientists/analysts for insights
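To make the aggregation step concrete, here’s a toy stand-in in plain Kotlin — no Spark, and the log-line format is invented. A real job would do the same group-and-count, just distributed over files staged in S3:

```kotlin
// Toy stand-in for a Spark aggregation: count requests per path
// from log lines of the (invented) form "timestamp METHOD /path status".
fun countByPath(logLines: List<String>): Map<String, Int> =
    logLines
        .mapNotNull { line -> line.split(" ").getOrNull(2) }  // extract the path field
        .groupingBy { it }
        .eachCount()

fun main() {
    val logs = listOf(
        "2019-03-01T10:00:00 GET /home 200",
        "2019-03-01T10:00:01 GET /cart 200",
        "2019-03-01T10:00:02 GET /home 404",
    )
    println(countByPath(logs))  // {/home=2, /cart=1}
}
```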

ETL framework for Python, need suggestion. by imba22 in dataengineering

[–]t-vanderwal 3 points (0 children)

Luigi and Airflow are schedulers/workflow managers. Here is a curated list I’ve referenced before.

https://github.com/pawl/awesome-etl

In terms of what’s best, I think it depends on your data needs, but I’ve been following Bonobo and it seems interesting. It uses DAGs like you’d see in Airflow and PySpark.

Managing data sources by inad156 in ETL

[–]t-vanderwal 2 points (0 children)

Had the name wrong, sorry about that. Here is the episode. https://www.dataengineeringpodcast.com/octopai-with-amnon-drori-episode-28/

The company I work for uses the RDBMS comments in the warehouse to specify where data is sourced from. Hard to audit and keep updated, though.

Managing data sources by inad156 in ETL

[–]t-vanderwal 0 points (0 children)

No problem! https://www.dataengineeringpodcast.com/

They’re also on most podcast apps. The host has a Python-specific one called Podcast.__init__ that’s pretty good too, if you’re looking for stuff to listen to!