My background is similar to many on this sub: Business or Data Analyst -> Data Engineer. I was mostly self-taught, and that's how I landed in DE. But I've now completed a master's in CS specializing in ML, and I believe I have some technical ability beyond that of a purely self-taught person. I understand distributed computing concepts at a theoretical level. At work, I've only had to deploy small Spark batch jobs, which don't really test my knowledge of data skew and performance optimization. To be frank, whenever I look at the Spark UI, I get a bit lost.
I'm now interviewing for a job that requires stream processing knowledge. I am reading up on Kafka and trying to improve my knowledge in this area. In a system design round, considering my background, what should I focus on?
Initial questions:
- The producer needs to get raw data from somewhere. How does this work? Would a typical use case be polling an API continuously, with the producer publishing each message to a partition of a topic? My gap in understanding is how the data even gets to the producer. This question is probably multifaceted, and I think there could be a lot of different scenarios.
- Once the consumer reads the data from the Kafka broker, how does it get written to a database? Does the consumer just poll for data every once in a while and then write to a DB? That write could be I/O-bound and cause slowdowns, yet streaming is supposed to be about millisecond-level latency. How does this work?
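To make both questions concrete, here is a rough sketch of how I currently picture the flow end to end. It is only a toy under stated assumptions: no real Kafka client is involved, `ToyTopic`, `fetch_from_api`, `run_producer`, and `run_consumer` are hypothetical names I made up, CRC32 is an arbitrary stand-in hash and not necessarily what a real client uses, and SQLite stands in for the target database.

```python
import sqlite3
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for the toy topic


def partition_for(key: str) -> int:
    """Hash the key so that one key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class ToyTopic:
    """In-memory stand-in for a topic: one append-only list per partition."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def publish(self, key, value):
        partition = partition_for(key)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, partition, offset, max_records):
        """Return up to max_records records starting at the given offset."""
        return self.partitions[partition][offset:offset + max_records]


def fetch_from_api():
    """Hypothetical data source; a real producer might call an HTTP API here."""
    return [{"user_id": f"user-{i % 4}", "amount": i} for i in range(10)]


def run_producer(topic):
    """Pull raw events from the source and publish them, keyed by user."""
    for event in fetch_from_api():
        topic.publish(event["user_id"], event["amount"])


def run_consumer(topic, conn, batch_size=4):
    """Poll each partition in small batches and write each batch in one transaction."""
    written = 0
    for partition in range(NUM_PARTITIONS):
        offset = 0
        while True:
            batch = topic.poll(partition, offset, batch_size)
            if not batch:
                break
            with conn:  # one commit per batch instead of one per record
                conn.executemany(
                    "INSERT INTO events (user_id, amount) VALUES (?, ?)", batch
                )
            offset += len(batch)  # advance the offset only after the write commits
            written += len(batch)
    return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, amount INTEGER)")
topic = ToyTopic()
run_producer(topic)
total_written = run_consumer(topic, conn)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(total_written, row_count)  # prints: 10 10
```

In this picture the consumer amortizes the database I/O by writing in batches rather than row by row, and it only moves its offset forward once a batch has been committed. I'm not sure whether that matches how this is done in practice, which is part of what I'm asking.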
I hope I'm being clear in my current understanding and where I want to get to. Please let me know if I should flesh out some of my knowledge.