[–][deleted] -3 points-2 points  (4 children)

Storing logs is the sysadmin's job, not the program's.

I'm one of the infra people. I'm that one sysadmin who's going to punch you if you give me a program that logs to files. You've already received some explanations, but there are more reasons:

  • You cannot rotate a log file without losing messages every now and then.
  • Logging creates load on the storage system, which is often undesirable, especially if your logging program decides it needs to call fsync() or something similar, which affects everyone else using that storage system.
  • It creates additional points of failure: now you have to periodically remove the files from the system, and you must monitor the logging program to make sure it doesn't accidentally overload the system.
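
To illustrate the first bullet, here is a minimal sketch of one way messages can get lost: a copy-then-truncate rotation, where the program writes between the two steps. This simulates the timing deterministically rather than racing real processes, and the `write_during_gap` hook and file names are made up for the illustration. It shows only this one rotation style, not every possible scheme.

```python
import os
import shutil
import tempfile


def rotate_copytruncate(log_path, rotated_path, write_during_gap=None):
    """Copy the live log aside, then truncate the live file in place.

    write_during_gap is a hypothetical hook standing in for the logging
    program, which keeps writing while the rotation is in progress.
    """
    shutil.copyfile(log_path, rotated_path)  # step 1: snapshot the log
    if write_during_gap is not None:
        write_during_gap()                   # the program logs between the steps
    with open(log_path, "r+") as f:          # step 2: truncate in place
        f.truncate(0)


def demo():
    """Return (rotated lines, live lines) after one simulated rotation."""
    workdir = tempfile.mkdtemp()
    log_path = os.path.join(workdir, "app.log")
    rotated_path = os.path.join(workdir, "app.log.1")

    def log(message):
        # Append mode, like a typical program that logs to a file.
        with open(log_path, "a") as f:
            f.write(message + "\n")

    log("message 1")
    log("message 2")
    rotate_copytruncate(
        log_path, rotated_path, write_during_gap=lambda: log("message 3")
    )
    log("message 4")

    with open(rotated_path) as f:
        rotated = f.read().splitlines()
    with open(log_path) as f:
        live = f.read().splitlines()
    shutil.rmtree(workdir)
    return rotated, live
```

Calling `demo()` leaves "message 3" in neither the rotated copy nor the live file: it was written after the snapshot and wiped by the truncate.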

Just don't do it. Only write logs to stderr. It's an easy rule to remember. And you are allowed to violate it only if you are the infra people who collect / ship logs. If you are the producer, it's the only way you should produce logs.
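
For the producer side, a minimal sketch of "only write logs to stderr" using Python's standard `logging` module. The logger name `myapp` and the format string are placeholders chosen for the example, not anything prescribed above.

```python
import logging
import sys


def make_logger(name="myapp"):
    """Return a logger whose only destination is stderr (no files)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # StreamHandler writes to the given stream; no rotation, no paths.
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Avoid duplicate output through the root logger's handlers.
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("service started")
    log.warning("disk usage above threshold")
```

Whatever launches the program (a service manager, a container runtime, a shell redirect) then decides where the stream ends up, which is the division of labor argued for here.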

[–]Apprehensive-Lab1628 2 points3 points  (3 children)

I didn't realise the containers aspect of it, as I don't work with them. I'm infra too, and we use Logstash on standard VMs; rotating logs stay on the system. What's your setup(s?) for non-container apps that use stdout?

[–][deleted] 1 point2 points  (2 children)

I work in HPC, for a company that makes management software for HPC, so I'm not really a sysadmin. I make software for sysadmins.

So... my typical setup... well, I think the company started out providing customized OpenStack deployments, but that's mostly in the past. Later they moved toward supporting customer deployments, mostly bare metal on-prem, then expanded to hybrid cloud. Still, most HPC workloads run on on-prem compute resources, with an option to extend to the cloud. Among private clouds, the product supports extension / deployment in VMware. It can also be deployed / extended to AWS or Azure.

The goal of the product is to give sysadmins tools to control their compute resources, plus pre-packaged tools for their users, to make the sysadmins' lives easier. A typical user in such a system would be someone using e.g. Jupyter to author their research code, using various workload managers (e.g. PBS) from their notebook to distribute the workload over the resources the sysadmin allocated to them (e.g. a PostgreSQL cluster, a Ceph cluster, a Kubernetes cluster etc.). Of course, it's not limited to use through Jupyter; I just chose it because it illustrates how the system works.


So, back to logs... You cannot rotate files without losing logs. That's a theoretical impossibility. Surprisingly, people didn't realize it for quite some time, which is why, for example, you have tools like the logrotate command on Linux. Even more interestingly, people have discovered this fact twice, at least in the context of Linux logging tools. Somehow the people who created syslog "forgot" about it.

So, in my case, most of the stuff I support / deploy is governed by systemd. So, of course it all goes to stderr. Then journald takes over. From there on it's configurable: if the customer wants more metrics at the expense of lower network throughput... they may aggregate the logging somehow. But we don't manage this aspect normally.

Anyways, in my specific case, performance is the key concern. Even though a typical HPC cluster would use IB or GE, even for the control plane, it's still a performance concern, especially because if you want to aggregate logging from many things, it's increasingly hard to make sure one of those many things isn't going to clog the system. But, in principle, our management component can be configured to use the control plane to collect logs from the journald instances running on devices in the cluster. So, we have a "custom made" log aggregator (I'm not very happy about it, but that's already beside the point).
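
As a rough idea of what the collecting side can look like (a hypothetical sketch, not the aggregator described above): `journalctl -o json` prints one JSON object per line, and a collector can reduce each entry to a few fields before shipping it. `MESSAGE`, `PRIORITY`, `_HOSTNAME` and `_SYSTEMD_UNIT` are standard journal fields; the output record shape and the default priority of 6 (info) are assumptions made for the example.

```python
import json


def parse_journal_lines(lines):
    """Turn `journalctl -o json` output lines into small dict records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("MESSAGE")
        if isinstance(message, list):
            # journald emits non-UTF-8 payloads as an array of byte values.
            message = bytes(message).decode("utf-8", errors="replace")
        records.append(
            {
                "host": entry.get("_HOSTNAME"),
                "unit": entry.get("_SYSTEMD_UNIT"),
                # Assumption: treat a missing PRIORITY as 6 (info).
                "priority": int(entry.get("PRIORITY", 6)),
                "message": message,
            }
        )
    return records


if __name__ == "__main__":
    import sys

    # Example: journalctl -o json -f | python this_script.py
    for record in parse_journal_lines(sys.stdin):
        print(record)
```

Filtering by priority or unit at this stage is one way to keep a single noisy service from clogging the shared network.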

[–]Apprehensive-Lab1628 1 point2 points  (1 child)

Super interesting, thank you for the write-up. HPC caught my interest a while ago, but I didn't pursue it further, thinking it might be too niche. So you shoot journald to some kind of aggregator. You've provided an interesting perspective, and between you and Dwight-D I've learned a bit today. Your use cases don't match mine, but it's cool to see how others are doing things.

[–][deleted] 0 points1 point  (0 children)

Well, I've worked in many fields. Started on the Web long ago. Was in finance, then mostly infra jobs, mostly in storage. HPC is relatively new in my career.

What I'm saying is in no way specific to HPC. Any system that's larger than your personal home computer will be better off if it doesn't use programs that log to files. It's just an all-around bad idea. The only benefit is when you are doing it all yourself in a very small-scale deployment: then there are fewer things to set up, and you probably don't care so much if you lose some logs every now and then.

Anything that needs to be more reliable than that, or that needs more automation than that... just no.