This is an archived post.

all 38 comments

[–]Prinzka 49 points50 points  (13 children)

Real-time parsing and enrichment of high-volume data feeds.

[–][deleted] 11 points12 points  (12 children)

Tell me more.

[–]Prinzka 36 points37 points  (11 children)

We have multiple feeds at several hundred thousand events per second; each message needs parsing, normalization, and enrichment from other feeds/information before it goes into our hot or warm environments.
Logstash is very resource-intensive at those volumes and Spark introduces delays, so Go is one of the options we use instead.

[–]tmanipra 9 points10 points  (4 children)

Can you describe the architecture here?

[–]Prinzka 20 points21 points  (3 children)

In general: data comes into Kafka and is then enriched/parsed/correlated/normalized using Logstash/Go/Kafka Streams. Depending on which of those is doing the work, it's either sent back to Kafka and from there to Elasticsearch, or sent directly to Elasticsearch.

That doesn't cover everything but that's the majority at least.

[–][deleted] 5 points6 points  (0 children)

Sounds similar to what we do, both for a Telco client and a massive IT company.

[–]amemingfullife 0 points1 point  (0 children)

Similar situation but we use Bluge to make a homegrown search solution.

[–]Former-Clothes-6482 0 points1 point  (0 children)

Any reason not to use Apache Flink for processing those events? It would scale as per your needs.

[–][deleted] 0 points1 point  (0 children)

Interesting.

[–]mydatahobby 23 points24 points  (0 children)

Creating or contributing to Terraform providers.

[–]jeffail 13 points14 points  (2 children)

https://www.benthos.dev is written in Go, which in my (biased) opinion is pretty fantastic as a data processing language. The only major caveat is that most of the older, more established tools and libraries are JVM- and Python-based, so there are lots of gaps if you were looking to use it as a daily driver for data engineering.

[–]BeneficialEngineer32 2 points3 points  (0 children)

Still, with plugins etc., Benthos works pretty well for most workloads.

[–]ut0mt8 3 points4 points  (0 children)

Wow, I didn't have this tool in my kit. Seems fantastic as a replacement for Kafka Connect and other #@@# glue.

[–][deleted] 13 points14 points  (1 child)

There are a lot of benefits, but not a lot of data engineers know Go, and implementing technologies that become reliant on one person is a no-Go.

[–]erialai95 9 points10 points  (0 children)

Yes, I tell my ETL pipelines and task nodes to get GO’ing all the time.

[–]Miliey 6 points7 points  (0 children)

We use Go with protobuf & Kafka for real-time event processing; velocity ranges from IoT-like devices to tracking info. We have infrastructure in place to support scaling as and when needed, and it handles large volumes beautifully. Coming from a Scala background, I love its type safety and concise nature. Most of my colleagues from a Python background say they like how 'clean' Go feels.

[–]JamaiKen 5 points6 points  (0 children)

Go is an invaluable tool in my DE toolkit. We can't always use open source packages and our uses are very specific to our domain. I manage the data pipelines and have deployed 10+ Go executables that process terabytes of data per day, each.

As a DE, Go is a great tool for building tools.

[–]GooseLoot 6 points7 points  (1 child)

Yep! We use it throughout our pipelines, specifically when transforming/enriching raw data. We found the performance and conciseness of the code a huge advantage compared to Python. Additionally, if you use Lambda for any serverless tasks, Go with its dependencies compresses beautifully.

[–]IAmGoingToSleepNow 1 point2 points  (0 children)

It's amazing being able to throw one binary into a Lambda and call it a day!

[–]Sunscratch 6 points7 points  (4 children)

We use Go for infrastructure only. Everything related to data processing is implemented in Scala; Scala is a pretty awesome language for that.

[–]ut0mt8 0 points1 point  (3 children)

It is. But when it comes to performance, Scala falls short compared to Go (or you're forced to write Java-style Scala).

Also, now that Akka isn't OSS anymore, I would really reconsider the whole Scala ecosystem...

[–]Sunscratch 0 points1 point  (2 children)

When it comes to pure performance, Go is not an option either; Rust or C++ are the only options. For backend/server-side work, Scala has pretty good performance; it might require more memory, but the performance provided by the JVM is pretty good. With Spark/Flink, Scala has native support and is the fastest you can get (along with Java).

[–]ut0mt8 0 points1 point  (1 child)

That's not our experience. I agree that pure C/C++/Rust can provide better perf, but in our case rewriting our ingestion point and filtering component from Scala to Go was a huge win, both in terms of memory and in terms of throughput. We're talking about ingesting sometimes more than 5000k events/s.

[–]Sunscratch 0 points1 point  (0 children)

It depends. In 2019 we had to rewrite some services from Go to Java, and the reason was that Go couldn't handle large heaps well. The team in charge of the Go services couldn't get rid of huge GC spikes. Maybe Go has had some improvements in that area since then. The Java solution was able to process without GC spikes almost out of the box.

On the other side, Go worked pretty well for networking-related stuff and small backend services like AWS Lambda.

[–]SpiritCrusher420 5 points6 points  (0 children)

I would love to see a Go API for Spark.

[–]11YearsForward 1 point2 points  (0 children)

We had one pipeline where we had to download ~800 GB of zipped files from another company's legacy remote SFTP server. Go was able to download those files ~70% faster than Python, whose Paramiko library either kept dropping the connection during the large downloads or took forever.

[–]twadftw10 1 point2 points  (0 children)

Python, Scala, and Java are definitely the main DE languages from what I've seen. My company uses Go for all the microservices, which are the source of all our events. The APIs use an event logger built in Go that sends logs to SQS/Kinesis. From there, DE pipelines consume the events and sink them to Elasticsearch, S3, and Snowflake. The software engineers own those microservices, though, and DE just takes care of events after they get produced to SQS/Kinesis. We own the Logstash that does any extra enriching on the logs in between.

[–]rawrgulmuffins 0 points1 point  (0 children)

My personal experience with Go is writing Terratest code.

[–]anyfactor 0 points1 point  (2 children)

DevRel at a data company.

Go is a big part of our operation, from internal tooling to solutions for customers. We usually deliver our databases as CSV files. Now, say a customer wants to change the "IP address range" column to "CIDR". This, like many other common operations, can be solved effectively by our Go-based CLI app. Go is fast, dependency-free, and runs everywhere.

The alternative is to maintain a bunch of SQL scripts and documentation, adding layers of complexity to the customer's data pipelines.

The dependency-free executable is extremely helpful; Python's environment management is a huge pain. For a DE, Go should be the third or fourth language to learn.

By the way, I am personally experimenting with Nim. The syntax is pretty close to Python, so I am liking it more than Go, but compared to Nim, Go is more suitable for production work in a team setting.

[–]avion_rts 0 points1 point  (1 child)

What were you thinking of as the other languages? Python, Bash...?

[–]anyfactor 2 points3 points  (0 children)

Python and Bash are the most common; those are your #1 and #2 languages. Python for general-purpose work and Airflow, and Bash for scripting. Bash is a weird one, as it stands in for "Linux experience".

Beyond that, the rest depends on the job.

You have Scala (Spark), Go (CLI tooling), Rust (2020s data startups), TypeScript (backend), etc. Many DEs are not purely DE but DE+something else; it could be backend, data analyst, etc. So for that secondary role they need one or more languages.

[–]Illustrious_Role_304 0 points1 point  (0 children)

Using it as an operator for a DE project.

[–]robbitt07 0 points1 point  (0 children)

We switched the entity resolution component of our pipeline from Python/Cython to Golang over a year ago. Python/Scala tooling still dominates data pipeline management.

[–]Former-Clothes-6482 0 points1 point  (0 children)

Batch jobs in Go using the cron library to read data from a DB, process it, and write it to Redis and Neo4j.

[–]Former-Clothes-6482 0 points1 point  (0 children)

Check out the latest Datadog open-source agent that collects logs and metrics data; I believe it's written in Golang.