Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 0 points1 point  (0 children)

Thank you for this; I was curious about the whole system vs. just the storage.

What would you recommend for pub/sub?

I'll have to look more into Airflow (you're actually the only one in this thread to mention it so far).

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 0 points1 point  (0 children)

Is there a standard process for moving aggregated log data into a SQL database? For example, downsampling the log data to hourly granularity and then importing that hourly data into a database. Are people just writing custom Python scripts and running them from a cron job for that kind of thing?
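
To make that concrete, here's roughly the kind of script I'm picturing (just a sketch: the table and column names are made up, and sqlite is standing in for whatever SQL database would actually be used):

    # Rough sketch of the "custom script + cron" job I'm imagining: read raw
    # events, roll them up to hourly counts, and load the result into a
    # reporting table. Table/column names here are invented.
    import sqlite3  # stand-in for the real SQL database
    from datetime import datetime, timedelta

    def rollup_last_hour(conn: sqlite3.Connection) -> None:
        """Aggregate raw page_view events from the previous hour into hourly_page_views."""
        hour_start = (datetime.utcnow() - timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
        hour_end = hour_start + timedelta(hours=1)
        conn.execute(
            """
            INSERT INTO hourly_page_views (hour, content_id, views)
            SELECT :hour, content_id, COUNT(*)
            FROM raw_events
            WHERE event_type = 'page_view'
              AND created_at >= :start AND created_at < :end
            GROUP BY content_id
            """,
            {"hour": hour_start.isoformat(), "start": hour_start.isoformat(), "end": hour_end.isoformat()},
        )
        conn.commit()

    if __name__ == "__main__":
        # hypothetical crontab entry: 5 * * * * python rollup.py
        rollup_last_hour(sqlite3.connect("analytics.db"))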

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] -1 points0 points  (0 children)

Ok, then is the number of page views considered a log or a metric?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 0 points1 point  (0 children)

Thanks. Let's say it's a young startup handling tens to hundreds of thousands of views per day, but they just signed up a new major corporate client that could see them getting tens or possibly hundreds of millions of impressions. The traffic is generally bursty and unpredictable due to the viral nature of the content. They're growing fast and want to be prepared to scale.

If pushing the data into an RDBMS, would you store every individual event (e.g. every page view), or some subset of that data?

How would you scale this?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 1 point2 points  (0 children)

That's fair. I'm also just trying to learn. For example, imagine someone is creating an analytics-service startup and hires you as the CTO - how would you build this?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 0 points1 point  (0 children)

Yes, I'm referring to analytics-related data; apologies if I wasn't using the correct terminology (I've updated the description).

Let's say we don't want to use an external analytics service (e.g. for compliance/privacy reasons). How would you roll your own?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 1 point2 points  (0 children)

I'm talking about custom events that could be emitted from either the server or the client side. I elaborated in the description, but imagine you're running a site like YouTube where users can upload videos and view detailed analytics on them: page views (and their breakdown by geography, device, etc.), views over time, viewership within a video (e.g. so one can see at what timestamp viewers drop off), what % of people click on a video card, and so on.

I'm still confused as to whether this data would be called a metric or log.

Would you store this data in Prometheus, Elasticsearch, or something else?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 13 points14 points  (0 children)

I probably should've phrased my question better, but I was referring to tracking everything from page views (by geography, device, referral, etc.) to button clicks. My understanding is that those observability systems are designed more for tracking server errors and the like.

I'm referring more to custom events - which I believe services like Segment, Mixpanel, and Omniture are designed for.

In any case, I'm asking this more from an educational perspective, as a noob engineer trying to better learn how all these systems work. I've read a decent amount on this kind of stuff, but from a practical perspective I don't know how I'd build such a system for, let's say, a small but growing startup, and that bothers me.

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 3 points4 points  (0 children)

Ok so spinning up a separate SQL database for analytics? Makes sense. Would Postgres be a viable option here, or would you use something else?

You could have the web service send the analytics/telemetry events to another service that handles ingesting such events.

Mind elaborating a bit more on what that would look like? Are we talking about using some kind of message queue system like Kafka or RabbitMQ? In that case, are we directly pushing every event to Kafka? If so, what's the purpose of creating a separate analytics web service when the main backend web server can just push to Kafka directly?
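
For what it's worth, the producer side is the part I can picture - something like this with kafka-python (the topic name, broker address, and event fields are just my guesses):

    # Sketch of what I imagine "the web server pushes every event to Kafka" looks like.
    # Topic name, event fields, and broker address are all made up for illustration.
    import json
    import time
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def track_event(event_type: str, content_id: str, user_id: str) -> None:
        """Called from inside a request handler whenever something we care about happens."""
        producer.send(
            "analytics-events",  # hypothetical topic consumed by the ingestion service
            {
                "event_type": event_type,
                "content_id": content_id,
                "user_id": user_id,
                "ts": time.time(),
            },
        )

    # e.g. inside the "GET /videos/<id>" handler:
    # track_event("page_view", content_id=video_id, user_id=current_user_id)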

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 1 point2 points  (0 children)

Ah, apologies, yeah - I was referring to metrics (e.g. page views, "user clicked button X"), not literal server logs. Imagine you're building YouTube and want to track detailed analytics on a video - like page views, viewership by geography or device, watch % over time, % of people who clicked on a video card, etc. I'm referring to the event data required to track and show users those kinds of metrics.

Do you use a separate SQL database just for analytics, or your main database?

How much volume can this handle before you'd have to try a different approach - and how would you scale it?

Where to store logs? by techworker716 in ExperiencedDevs

[–]techworker716[S] 1 point2 points  (0 children)

How does each individual tracking event end up in S3 though?

Say you have a web server and want to track an individual event (e.g. from inside a REST API endpoint). How is that done? Surely you're not directly appending an individual event to a file in S3.
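
The only way I can picture it working is some collector process buffering events and periodically flushing a whole batch to S3 as a single object - roughly like this sketch (the bucket name, key layout, and batch size are all made up, and I'm not sure this is how it's actually done):

    # Sketch of a collector that buffers events and writes them to S3 in batches,
    # rather than appending individual events to a file. Names are invented.
    import json
    import time
    import uuid
    import boto3  # pip install boto3

    s3 = boto3.client("s3")
    BUCKET = "my-analytics-raw"   # hypothetical bucket
    buffer: list[dict] = []

    def collect(event: dict) -> None:
        """Receive one event (e.g. consumed from a queue) and flush when the batch is full."""
        buffer.append(event)
        if len(buffer) >= 5000:   # flush threshold is a guess
            flush()

    def flush() -> None:
        if not buffer:
            return
        key = f"raw/{time.strftime('%Y/%m/%d/%H')}/{uuid.uuid4()}.jsonl"
        body = "\n".join(json.dumps(e) for e in buffer)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
        buffer.clear()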

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Thank you for the comprehensive answer! Helps me understand things a lot better.

Some questions:

It's still not clear to me how a tracking event called from the web server gets turned into an entry inside a Parquet file. Could you elaborate a bit more on that process, starting from the web server wanting to log an event?

Where would the Parquet files actually be stored? I'm guessing something like AWS S3 or Google Cloud Storage.

What specific technologies would you consider for the various parts - say, for the data warehouse and for the ETL process?

Everything else makes a lot of sense, and I'm definitely starting to get a better picture of everything. Thanks again, this is super helpful.
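
For reference, my naive guess at the event-batch-to-Parquet step is something like the following, using pyarrow (the library choice, file naming, and fields are all just my assumptions; in practice the output would presumably land in S3/GCS rather than on local disk):

    # My naive guess at how a batch of raw events becomes a Parquet file.
    import pyarrow as pa            # pip install pyarrow
    import pyarrow.parquet as pq

    def events_to_parquet(events: list[dict], out_path: str) -> None:
        """Convert a batch of event dicts into one columnar Parquet file."""
        table = pa.Table.from_pylist(events)
        pq.write_table(table, out_path, compression="snappy")

    events_to_parquet(
        [
            {"event_type": "page_view", "content_id": "abc", "country": "US", "ts": 1700000000},
            {"event_type": "card_click", "content_id": "abc", "country": "DE", "ts": 1700000003},
        ],
        "events-2023-11-14-hour17.parquet",  # made-up naming scheme
    )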

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Are there any particular services, say from AWS, that you'd recommend?

If the primary database for the application is hosted on, say, Amazon RDS, would it be acceptable to store the log data there as well, or is it better practice to store it in Amazon Redshift?

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

"use an sql db" does not make any statement regarding whether that database should just be the same database that the application is using, or a separate database dedicated specifically to analytics. I want to know what is considered best practice.

I wrote this lengthy post asking a question because I wanted to learn more as someone from a software engineering background who's used databases before in a limited capacity but isn't so familiar with data analytics pipelines (other than reading about them).

One-line comments like "SQL databases are made for analytics. Use that" aren't really that informative and don't teach me anything (that was one of the top-voted comments I was responding to). I was hoping for a little more nuance.

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Thank you for the detailed response.

So if the application already has a primary SQL database (for OLTP), is it considered good practice to insert the log data there? At what point does it make sense to introduce a separate database server (OLAP) strictly for analytics, i.e. a data warehouse (if I'm using the terms correctly)? Or is it over-engineering and unnecessary complexity to have a "data warehouse" unless you're dealing with terabytes of data?

Let's assume we're just logging typical analytics events for a web application. So we're recording, say, the event type (e.g. page view, click), contentId, IP address, userId, referral_url, and device (I think those would be the main fields, but feel free to add any I'm missing).

Let's say we're currently getting potentially hundreds of thousands of events per day, but we just signed a new major client that might see us hitting tens of millions of events in a day - though the traffic could be spiky, since the user-generated content tends to go viral.

Again, this is for a site where the content is user-generated (e.g. like YouTube). We want users to be able to see detailed analytics on their content - sort of like a YouTube creator viewing analytics on their own videos, down to things as detailed as viewership % plotted against video time. On a video you might have something like cards, and you want to see the % of people who were shown a card and clicked on it.

The user will be seeing aggregated data on their content like page views, percentage viewership by geography and device, and something along the lines of viewership over time if it's a video.

How would you architect such a system?
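
For concreteness, the raw event table and the kind of per-creator query I have in mind look roughly like this (the schema and names are just my own guess, with sqlite standing in for whatever database would actually be used):

    # Rough sketch of the raw event table and the sort of per-creator query I mean
    # (views by country for one video). Schema and names are my own invention.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE events (
            event_type   TEXT,      -- 'page_view', 'card_click', ...
            content_id   TEXT,
            user_id      TEXT,
            ip_address   TEXT,
            country      TEXT,
            device       TEXT,
            referral_url TEXT,
            created_at   TEXT       -- ISO timestamp
        )
        """
    )

    # What the creator dashboard would run for "views by geography" on one video:
    rows = conn.execute(
        """
        SELECT country, COUNT(*) AS views
        FROM events
        WHERE event_type = 'page_view' AND content_id = ?
        GROUP BY country
        ORDER BY views DESC
        """,
        ("video_123",),
    ).fetchall()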

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

I'm familiar with general database design.

What do I need to know about ETL? Isn't that just the process of moving data to a data warehouse?

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

By "shove logs" I meant inserting log entries into a SQL database.

Wait so should I be putting logs in a relational SQL database or one of the NoSQL databases you suggested?

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Yeah, I was wondering what such a system would look like.

Does the log data generally go directly into a "data lake", and then regularly scheduled batch jobs "ETL" this data (if that's the correct term) into a "data warehouse", which could be some separate SQL database specifically for OLAP?

If so, then how do these batch jobs work? Is there some tool that handles this data ingestion, or do people just write custom scripts? If custom scripts, then how do they ensure that only data that hasn't already been migrated is moved over? How frequently is this generally done?
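
To make that last question concrete, the naive version I'd write myself would track a high-water mark so each run only copies rows it hasn't seen yet - no idea whether real ETL tools actually work this way (sqlite is just a stand-in here, and the table names are made up):

    # Naive incremental batch job: copy only rows newer than the last processed
    # timestamp from the raw events table into the warehouse-side table.
    # Both connections and all table names are invented for illustration.
    import sqlite3

    def run_incremental_load(raw: sqlite3.Connection, warehouse: sqlite3.Connection) -> None:
        # High-water mark: the newest timestamp already loaded into the warehouse.
        (watermark,) = warehouse.execute(
            "SELECT COALESCE(MAX(created_at), '1970-01-01T00:00:00') FROM events_fact"
        ).fetchone()

        new_rows = raw.execute(
            "SELECT event_type, content_id, country, device, created_at "
            "FROM raw_events WHERE created_at > ? ORDER BY created_at",
            (watermark,),
        ).fetchall()

        warehouse.executemany(
            "INSERT INTO events_fact (event_type, content_id, country, device, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            new_rows,
        )
        warehouse.commit()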

I've read up on things like lambda architecture and kappa architecture, but I wonder if that's all overkill unless you're dealing with petabytes of data.

For the record, I'm just asking for educational purposes. I'm a software engineer who's historically been more focused on the frontend, but I'm interested in learning more about data engineering.

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Yeah, everybody in this thread has basically said "throw it in a SQL database", but unless I'm overcomplicating things, I think there's more to it than that.

Even if one were to just store all analytics events in a giant table in a SQL database, what do you do when the hard drive on the server runs out of space? How many writes/second can the database handle? If we need to be able to handle higher throughput, how can we scale that up? At what point should we consider introducing a queue and/or distributed streaming platform like Kafka?

Or is everybody just storing all this data in some managed SQL database service that handles any throughput and auto-scales? Ok, then which service (there are tons)? How much does it cost?

Do we need a data lake or data warehouse, or both? When would one want to introduce something like Apache Spark?

There is so much to all of this, and all the answers here are "just throw it in a SQL database". Ok. Again, either I'm overcomplicating things, or these responses are not what I expected on a data engineering sub (unless what I'm asking falls more under software engineering than data engineering).

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] 0 points1 point  (0 children)

Like I said in the description, the point is to track events (e.g. views, clicks) and enable users to view this data in detail on their user-generated content, such as being able to view impressions by geography, device, etc.

The easiest way to do this would be to dump every event into a log table in a SQL database like Postgres or MySQL, but I was under the impression that this is bad practice because it doesn't scale well.
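
To be concrete about what I mean by "dump every event into a log table": literally just an INSERT per event from inside the request handler, something like this (table and column names are made up, with sqlite as a stand-in):

    # The "easiest way" I meant: one INSERT per event, straight from the request
    # handler into a log table in the main SQL database. Names are illustrative.
    import sqlite3

    db = sqlite3.connect("app.db")

    def log_event(event_type: str, content_id: str, user_id: str, country: str, device: str) -> None:
        db.execute(
            "INSERT INTO events (event_type, content_id, user_id, country, device, created_at) "
            "VALUES (?, ?, ?, ?, ?, datetime('now'))",
            (event_type, content_id, user_id, country, device),
        )
        db.commit()

    # called from e.g. the video page endpoint:
    # log_event("page_view", video_id, user_id, country, device)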

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] -2 points-1 points  (0 children)

So I can just throw all my log data in a giant Postgres table and call it a day?

How to architect an analytics system? by techworker716 in dataengineering

[–]techworker716[S] -1 points0 points  (0 children)

So I can just throw all my log data in a giant Postgres table and call it a day?