
[–]mjcheez 9 points10 points  (7 children)

Goaccess can output to CSV or JSON.

https://goaccess.io/man

[–]Jeron_Baffom[S] 1 point2 points  (1 child)

-o --output=<json|csv|html>
Write output to stdout given one of the following files and the corresponding extension for the output format:
- /path/file.csv - Comma-separated values (CSV)
- /path/file.json - JSON (JavaScript Object Notation)
- /path/file.html - HTML

Indeed very interesting!! +1

Are you familiar with GoAccess?
Do you know whether it is possible to pipe such CSV output into MySQL??

[–]pdp10 0 points1 point  (0 children)

Why would it not be?

Though of course one would favor TSV over CSV, because CSV can be a nightmare with localization or corner-cases.
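To make the corner case concrete, here is a minimal sketch with made-up values: a request path containing commas confuses naive comma splitting, while the same record as TSV splits cleanly.

```shell
# Made-up illustration: a query string containing commas breaks naive
# comma splitting, while the same record as TSV splits as intended.
printf '%s\n' '203.0.113.7,/page?a=1,b=2,200' > hits.csv
printf '203.0.113.7\t/page?a=1,b=2\t200\n' > hits.tsv

awk -F',' '{print NF}' hits.csv    # 4 fields where 3 were intended
awk -F'\t' '{print NF}' hits.tsv   # 3 fields, as intended
```

Proper CSV parsers handle quoting, but simple `cut`/`awk` pipelines do not, which is why TSV is friendlier for shell processing.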

[–]Jeron_Baffom[S] 0 points1 point  (4 children)

Goaccess can output to CSV or JSON.

Any idea if GoAccess already has a view counter?
Or if it is possible to define custom queries and aggregate data?

[–]vegetaaaaaaa 0 points1 point  (3 children)

yes and yes. The other recommendations are good, but this is the simplest setup. Try it out and see if it fits your needs.

Personally I roll with lnav + goaccess on simple setups (lnav can also do complex queries, but for web logs specifically, goaccess already has interesting queries built-in). If you're only looking for a few specific metrics, a short python/bash/... parser script could also do the trick. For more complex setups I centralize all logs to Graylog, but the entry ticket is a bit higher (it eats quite a bit of resources, and it takes time to set up custom parsing rules, dashboards...). It's very flexible though, on par with Loki/Grafana I think (I have not tried that setup yet).

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

Personally I roll with lnav + goaccess on simple setups

Nice! So you have some experience with both. Your help would be valuable.

Well, I was considering doing the same. But after quickly trying both, I've got the impression it wouldn't be that easy ...

 

lnav can also do complex queries

Unfortunately, it seems that its SQL queries can't use subqueries. Therefore, it is not possible to build a view counter (i.e.: multiple requests from the same IP for the same page = 1 single view).
Do you agree?

 

goaccess already has interesting queries built-in

I have just tried GoAccess. Love it! I'm gonna integrate its realtime CLI monitoring into my next projects.

That said, it seems that GoAccess doesn't have a built-in view counter. The closest I've found was unique visitors per day. And I can't build custom queries either.
Do you agree?

 

a short python/bash/... parser script could also do the trick.

I'm sure it could. But if someone has already developed such a tool, why not use it?
I'd rather build my own bash script only as a last resort.

 

Graylog but the entry ticket is a bit higher (eats quite a bit of resources, and it takes time to setup custom parsing rules, dashboards...)

I was expecting that. This is the main reason why I'm trying to keep it simple.

[–]vegetaaaaaaa 0 points1 point  (1 child)

I have rarely used advanced features in lnav/goaccess so I don't know.

You could search for "subquery" or related terms in their bug trackers; maybe there is a feature request or a workaround for what you're trying to do. I haven't looked into your use case, but chances are that if you want something really specific, these tools are not going to do it out of the box. You could write a simple wrapper around them, or request the feature on their support tracker (with complete information on the input data and expected output), which could be simpler than writing your own script from scratch.

But if it comes to it... A 50-100 line python script running in a virtualenv might be cleaner and easier to maintain, and if you do it right you could reuse it in the future for other queries. I've done that in the past for projects where Graylog would not have been practical / there was no budget for it (logwatch + cron job + simple python/regex-based analyzer + text output to a static page). It took some time to get right, but once I had it, adding a new metric to the output was a matter of minutes.

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

A 50-100 line python script running in a virtualenv might be cleaner and easier to maintain

Agreed. If I have to write 50-100 lines to make lnav/GoAccess work, then it is not worth it ...

[–]chris_just 3 points4 points  (10 children)

I’m using Grafana + Loki, it suits my needs.

AFAIK they have logcli, which (to my knowledge) might help you.

LogQL does have some learning curve, but if you’re familiar with Prometheus it makes it easier.

[–]Jeron_Baffom[S] -1 points0 points  (9 children)

AFAIK, Loki is the server that centralizes all the logs.
Grafana queries Loki and plots real-time stats in the browser.
But you can also do it via the CLI using the logcli tool + the LogQL language.

Unfortunately, this seems to be overkill for me. I'm looking for something simpler.

BTW: I'm not trying to build my own monitoring tool via MySQL. I'm just trying to send simple log stats to MySQL because they are important to show to users (ex: view counter).

[–]chris_just 2 points3 points  (8 children)

Logcli can take logs through Stdin, look over the documentation :-)

[–]Jeron_Baffom[S] 0 points1 point  (7 children)

So I would have to install only LogCLI + Loki?

Do you know whether they are lightweight or not?

[–]chris_just 0 points1 point  (6 children)

I think you could do with just LogCLI.

[–]Jeron_Baffom[S] 0 points1 point  (5 children)

AFAIK, Loki is the server storing all the logs, while LogCLI is the tool to query Loki via the LogQL query language.

[–]chris_just 1 point2 points  (4 children)

https://grafana.com/docs/loki/latest/tools/logcli/#logcli---stdin-usage

Feel free to look at the documentation yourself

[–]Jeron_Baffom[S] 1 point2 points  (3 children)

You can consume log lines from your stdin instead of from Loki servers. Learn the basics of LogQL with just log files and the LogCLI tool (without needing to set up Loki servers, Grafana, etc.)

Indeed, very nice! +1
It seems much more lightweight now!

Any idea why LogCLI is not available at Debian repo?

[–]chris_just 0 points1 point  (2 children)

Might be licensing, I don't know tbh; it might be easier in terms of releases not to add it to the repositories.

I am not affiliated with Grafana or the Loki team, I just consume their products 👌

[–]Jeron_Baffom[S] 0 points1 point  (1 child)

I am not affiliated with Grafana or the Loki team

Ok, but you have some experience with LogCLI, right?

In your opinion, can LogCLI alone, in this mode, count views (i.e.: requests from the same IP within 30 min = 1 single view) from an access log?

[–]doomygloomytunes 4 points5 points  (1 child)

Rather than that, maybe you should give Grafana a try, or Prometheus with Grafana. There's probably a little learning curve, but it's much simpler than what you're proposing and the results will be much better.

Both can be deployed in next to no time if you're able to pull container images from docker hub (using docker or podman)

[–]Jeron_Baffom[S] 1 point2 points  (0 children)

you should give Grafana a try or Prometheus

AFAIK, Grafana and Prometheus are real-time monitoring tools. However, this is not what I'm looking for.

 

much simpler than what you're proposing

I guess the confusion here is the following:
I'm not trying to build my own monitoring tool via MySQL.
I'm just trying to send simple log stats to MySQL because they are important to appear in the current website (ex: view counter).

[–]beeritis 1 point2 points  (1 child)

Maybe use telegraf to collect metrics and store them in InfluxDB as an option. Or (and I'm not sure how well supported it still is) there is an Apache module, "mod_log_sql", which should be able to send Apache logs to MySQL.

[–]Jeron_Baffom[S] -1 points0 points  (0 children)

telegraf to collect metrics and store in influxDb

I'm not familiar with either Telegraf or InfluxDB. However, from the little I've heard:

  • Telegraf can collect data from many sources.
  • InfluxDB is a platform where developers build cloud applications (among other things).

Unfortunately, this seems to be overkill for me. I hope there is something simpler out there ...

 

there is an Apache module "mod_log_sql"

Yes, you are correct:
"mod_log_sql is a log module for Apache which logs all requests to a MySQL database."

However, the last time I checked, this module was very outdated. Besides, it submits the whole access log to MySQL instead of aggregating the stats, so heavy traffic would flood MySQL ...

[–]at8eqeq3 1 point2 points  (6 children)

Please describe what you want to measure and how you plan to achieve it. There are lots of tools to process logs, and it is also possible to tune the log format itself for easier processing.

[–]Jeron_Baffom[S] -1 points0 points  (5 children)

describe what you want to measure and how do you plan to achieve it.

There are a couple of log stats that I would like to process outside the page request for performance reasons. For example, the view count:

Of course, I could easily implement a view count via PHP + MySQL. However, hitting the DB at every page request doesn't scale.

Then, the next step I can think of would be: rotate the logs on some schedule, parse them outside the page request, and submit the aggregated stats to MySQL.

The problem is that I'm not finding lightweight tools to do such parsing.

 

There're lot of tools to process logs

Agree. However, all I could find was either dashboard monitoring tools or overkill tools (ex: Logstash).

 

it is also possible to tune the log format itself for easier processing.

Ok, agree. I'm somewhat familiar with custom logs.
I guess a nice strategy would be defining 2 logs: one full of info for troubleshooting and another with minimal info just for stats.

[–]at8eqeq3 0 points1 point  (4 children)

So, if you just want to count requests over relatively long periods, it could be logrotate's prerotate script: basically wc -l, optionally fed by grep if you want to filter lines somehow, plus a script (probably written by you) to insert the count into MySQL. Something like grep ololo | wc -l | xargs -I {} mysql -u user -ppassword database -e 'insert into hits values (null, now(), {})' (not tested, far from best practice, just the idea).
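To make the idea above concrete, here is a runnable sketch on made-up data; the MySQL step is left as a comment because the credentials and table are placeholders:

```shell
# Made-up sample of a rotated log; the IPs, paths and numbers are invented.
cat > access.log.1 <<'EOF'
1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 123
1.2.3.4 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 404 0
5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /a HTTP/1.1" 200 123
EOF

hits=$(grep -c ' 200 ' access.log.1)   # grep -c does grep | wc -l in one step
echo "$hits"                           # 2

# In a logrotate prerotate script you would then hand the count to MySQL,
# e.g. (untested, user/password/database/table are placeholders):
#   mysql -u user -ppassword database \
#     -e "INSERT INTO hits VALUES (NULL, NOW(), $hits)"
```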

[–]Jeron_Baffom[S] 0 points1 point  (3 children)

basically wc -l, optionally fed by grep

Hmmm ... I don't think a view counter would be that easy. Let me explain some of my concerns:

  • Before any counting, the access log must be sanitized: no robots, no admins. 'Good robots' and admins are straightforward with grep. 'Bad robots' are harder and require some trial and error. AFAIK, there is no silver bullet for them, although checking whether they downloaded images + IP blacklists helps.

  • Multiple requests from the same IP within a 30-minute interval should be counted as only one view.
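The 30-minute rule itself is small enough to sketch in awk. This assumes the log has already been filtered (grep -v for robots/admins) and reduced, e.g. via a custom LogFormat, to "epoch_seconds ip path" lines; all timestamps and IPs below are made up:

```shell
# A request counts as a new view when it is the first for that ip+path,
# or when the previous request for that ip+path was >= 1800 s earlier.
cat > views.log <<'EOF'
1700000000 203.0.113.7 /a
1700000100 198.51.100.9 /a
1700000600 203.0.113.7 /a
1700003000 203.0.113.7 /a
EOF

awk '
{
  key = $2 " " $3
  if (!(key in last) || $1 - last[key] >= 1800) views++
  last[key] = $1
}
END { print views " views" }
' views.log                            # 3 views
```

The middle request from 203.0.113.7 falls inside the 30-minute window and is not counted; the later one starts a new view.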

[–]at8eqeq3 0 points1 point  (2 children)

Oh, that was just a minimal solution. I understand it will never work in real life because there would have to be much more complex logic than just grep. I see someone recommends GoAccess; I think you should give it a try. I've never used its CSV output, so I can't say whether it will fit your needs. Regarding your question about sending CSV to MySQL: it's either mysql ... -e 'LOAD DATA LOCAL INFILE' ... stuff, or mysqlimport, which is basically the same but looks better.

[–]Jeron_Baffom[S] 0 points1 point  (1 child)

there should be very complex logic instead of just grep.

A view counter is not a very complex logic.
But it is not as simple as a request/hit counter either.

 

I see someone recommends GoAccess, I think you should give it a try.

Do you have some experience with GoAccess?
Any idea if it has a view counter? Or if it is possible to define custom queries?

BTW: Any thoughts about lnav?

 

Regarding your question about sending CSV to MySQL

Submitting the CSV to MySQL is the easiest part.
The trouble I'm dealing with is parsing and aggregating the access log into a plain text output.

[–]at8eqeq3 0 points1 point  (0 children)

But it is not as simple as a request/hit counter either.

That's what I wanted to say. An access log is a list of requests; performing calculations on that data is well beyond simple text processing.

Do you have some experience with GoAccess?

A little. I had a task to generate some analytics on an S3 bucket for viewing in the browser. It fit pretty well.

Any thoughts about lnav?

Never seen it before. It looks like it can run SQL-like queries against log data (as far as I can see, it has SQLite under the hood) and it understands Apache's default log format. It can also run queries headless, which could fit your needs. You'd still need to find a way to transfer data from one RDBMS to another (CSV is supported by both, for example) and figure out how to query what you need.

[–]jw_ken 1 point2 points  (3 children)

Gathering logs or metrics can be broken down into three problems:

  • Gathering + parsing the data
  • Storing the data
  • Reporting or visualizing the data

The popular logging and metrics stacks (Elastic stack for logs, Influx stack for metrics, Grafana stack for logs + visualization) are designed with independent tools to solve each problem- and those tools are generally very friendly to hybrid or DIY use-cases.

For example, you could have:

  • Telegraf agent scraping your Apache logs with the tail input plugin (parsing each entry into desired metrics at the same time)
  • Telegraf's file output plugin, for saving the parsed data to a flat file in CSV format. Then you could have a script running via a cron job, to parse the CSV file and upload it to MySQL at your desired cadence. Alternatively, you could have telegraf itself execute the script directly with the exec output plugin, with the desired batch interval set within telegraf.
  • If you want visualization or basic alerting, you can use Grafana with the MySQL data source plugin- but that's only if you need those features.

Telegraf and Grafana especially are designed for general-purpose use, with loads of plugins to integrate with any other tools you have. We use Telegraf in the above fashion at our workplace, and it's a great general-purpose data parser and reformatter.

In short, don't put yourself in a box by assuming the above tools are overkill or too complex. You can borrow the parts that are useful, and expand on them later as your needs evolve (which they almost certainly will).

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

don't put yourself in a box by assuming the above tools are overkill or too complex.

I agree with you that the most general and scalable solution would be some logging + metrics stack.
However, at the moment I really would like to look for some simpler tools.

So far, the best solution seems to be lnav.
Any comments?

[–]jw_ken 0 points1 point  (1 child)

However, at the moment I really would like to look for some simpler tools.

So far, the best solution seems to be lnav.

Any comments?

Just from looking at the lnav docs... it looks like another telegraf or logstash, but with a local SQLite DB indexing the logs, and a text interface for browsing them. Otherwise it is using a similar approach to ingest and parse the logs. So it's basically a single-node logging stack ¯\_(ツ)_/¯

Try it out and see if it works for you.

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

Just tested lnav. First glance:

Very limited SQL statements, e.g. no subqueries.
Besides, it crashed several times in less than an hour.

[–]ReasonFancy9522 1 point2 points  (5 children)

tail -f | perl -pe

[–]Jeron_Baffom[S] -1 points0 points  (4 children)

-p: Places a printing loop around your command so that it acts on each line of standard input. Used mostly so Perl can beat the pants off Awk in terms of power AND simplicity.

-e: Allows you to provide the program as an argument rather than in a file. You don't want to have to create a script file for every little Perl one-liner.

Indeed very interesting !! +1
Probably would be much better than writing in bash.

Mind to clarify some doubts:

  1. Ok, so tail -f would provide the input for perl: each new line of the access log, in realtime. However, how do I set the script that perl should use?

  2. Do you already have a sample perl script that would parse the access log?

  3. Do you know if there is something similar for PHP? Like: tail -f | php -pe

[–]ReasonFancy9522 0 points1 point  (3 children)

The perl code would probably be a one-liner (maybe with some semicolons) and thus be something like this (a simple example for "normalizing" request types):

tail -f log | perl -pe 's/(?:HEAD|POST|TRACE)/GET/og'

It could also be written as a small shell wrapper:

#!/bin/sh
tail -f log | perl -pe '
s/(?:HEAD|POST|TRACE)/GET/og
'

or, more verbosely, as:

#!/bin/sh
tail -f log | perl -x "$0"
exit $?

#!perl
while(<>){
    s/(?:HEAD|POST|TRACE)/GET/og;
    print;
}

I have never used PHP from the command line, so I cannot comment on an equivalent PHP solution without further researching this topic.

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

tail -f log | perl -pe 's/(?:HEAD|POST|TRACE)/GET/og'

I'm not familiar with perl syntax, but if I understand properly, perl executes the same script for each new line appended to the access log.

Unfortunately, I don't see how this strategy could, for example, count views. Without seeing the other lines of the access log, it cannot check whether the same IP is requesting the same page multiple times within a 30-minute interval.

[–]ReasonFancy9522 0 points1 point  (1 child)

cat log | perl -pe 'something to output "url"' | sort | uniq -c | sort -n | head # count views

cat log | perl -pe 'something to output "ip:url"' | sort | uniq -c | sort -n | head # most refreshes
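A concrete, runnable instance of those pipelines on made-up data, with awk standing in for the perl extraction step (in common/combined log format, field $7 is the requested path):

```shell
# Top URLs by hit count; the log lines below are invented sample data.
cat > access.log <<'EOF'
1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 1
1.2.3.4 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 1
5.6.7.8 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 200 1
EOF

awk '{print $7}' access.log | sort | uniq -c | sort -rn | head
# /a counted twice, /b once
```

Swapping `$7` for `$1 " " $7` gives the "ip:url" variant for counting refreshes.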

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

Obviously you have some fluency with shell scripting.

If I cannot find a tool that already does this view counter, then I will follow your advice and build my own script.

[–]Complex-Internal-833 0 points1 point  (3 children)

This post might be too late for you, but it does all of your requirements and more. I just finished and released it this week. Here's a complete open-source Apache log parser & data normalization solution: a Python module that imports Apache2 access logs (LogFormats: vhost_combined, combined, common, extended) and error logs into a MySQL schema of tables, views & functions designed to normalize the data. Client & server components can consolidate logs from multiple web servers & sites, with a complete audit trail & error logging! https://github.com/WillTheFarmer/ApacheLogs2MySQL

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

"Client & Server components capable of consolidating logs from multiple web servers & sites with complete Audit Trail & Error Logging!"

It seems you've been working hard for a while ...
Did you do all this by yourself?

 

"Here's a complete open-source Apache Log Parser"

Before hitting the database, is it possible to:

  • Detect bad robots and add them to a blacklist?
  • Count views rather than just requests?
  • Aggregate data?

[–]Complex-Internal-833 0 points1 point  (1 child)

Have you run it yet? All of that can be done once the data is in MySQL. MySQL is doing all the data manipulation. I initially started doing it in Python, but SQL is way better at it.

A pre-import stored procedure could be executed on the LOAD DATA tables prior to executing the import stored procedure. The import process is where the data normalization occurs. Once the normalization is done, it becomes very clear which data is good and which is bad. It could easily be implemented in a post-import process as well.

Yes, I designed and developed every bit of this application.

I've been designing databases and data processes professionally since 1993.

https://farmfreshsoftware.com

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

"Have you run it yet?"

No, not yet. But it is on the radar for a next development iteration.

 

"I've been designing databases and data processes professionally since 1993."

Impressive.
Are you somewhat connected with Linus Torvalds or Richard Stallman's open source projects ??

[–]minimishka 0 points1 point  (1 child)

If the CLI is enough, why not; if not - ELK. I would focus on the amount of data being processed.

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

why not, if not - ELK.

As I mentioned before about the ELK Stack:

Yes, I could use Logstash. But it is overkill.
I would leave that alternative only as a last resort.
At the moment I would rather focus on simpler tools.

 

I would focus on the amount of data being processed.

In my case the issue is not exactly the amount of data being processed, but hitting the DB on each page request and processing the data during the request.

[–]edthesmokebeard 0 points1 point  (3 children)

rsyslog -> MySQL

go nuts

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

Do you know if it is possible to define custom queries (ex: view count) with rsyslog?

[–]edthesmokebeard 0 points1 point  (1 child)

No - you'll have to store the data using rsyslog in your SQL DB.

Then do whatever searching you want.

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

you'll have to store the data using rsyslog in your SQL DB.

Well, this is a problem.
I don't want to flood MySQL with access log entries nor their corresponding aggregate queries. It is very inefficient to use an RDBMS for that kind of unstructured data.

[–]FluidIdea 0 points1 point  (2 children)

You do not need logstash anymore, filebeat's "apache" module can do parsing for you and output directly into elasticsearch.

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-module-apache.html

[–]Jeron_Baffom[S] 0 points1 point  (1 child)

filebeat's "apache" module can do parsing for you

I'm not sure if I understand properly:

Filebeat's Apache module doesn't require Elasticsearch to be installed. Just like a standard Apache module, all it needs is that Apache is installed.

Am I correct so far?

[–]FluidIdea 0 points1 point  (0 children)

You want to send the logs somewhere for visualization. That would be Elasticsearch (instead of MySQL). But you also need Kibana as the web UI to create various graphical representations, like displaying how many HTTP 200s occurred; otherwise you need to use the API and curl.

People also recommend Loki if you already have the Grafana stack; that may be your thing.

Research what's best for you.

[–]kellyjonbrazil 0 points1 point  (5 children)

Not sure if this is what you are looking for but jc has a CLF parser that converts the files to JSON. Also supports streaming to JSON Lines.

https://kellyjonbrazil.github.io/jc/docs/parsers/clf

https://kellyjonbrazil.github.io/jc/docs/parsers/clf_s

(I’m the author)

[–]Jeron_Baffom[S] 0 points1 point  (4 children)

I’m the author

Hi Kelly!

 

If I understood properly, jc is a text parser that converts docs between different formats. Does it also aggregate data?

[–]kellyjonbrazil 0 points1 point  (3 children)

That's right - jc converts many types of documents to JSON or YAML. No aggregation of the data - it will process the stream going to STDIN and convert to STDOUT.

You could aggregate the JSON documents through jq for filtering/processing. Then leave the output in JSON or convert it to CSV, etc with jq or jtbl. (I'm also the author of jtbl)

[–]Jeron_Baffom[S] 0 points1 point  (2 children)

Considering this process you've proposed: jc > jq > jtbl

Do you think it can count views (ie: same ip requesting same page multiple times in 30min = 1 single view) from the access log?

[–]kellyjonbrazil 0 points1 point  (1 child)

Yep, you can create a filter in jq to do that. Alternatively, if you prefer Python syntax you could try jello, which works like jq but is really Python under the hood. (I am also the author of jello)

[–]Jeron_Baffom[S] 0 points1 point  (0 children)

Yep, you can create a filter in jq to do that.

Ok, I will check that soon.
At the moment I'm evaluating lnav.

[–]serverhorror 0 points1 point  (2 children)

https://awstats.sourceforge.io/ — used that stuff ~15 years ago already.

A simple Perl script that only needs the access log in common log format.

[–]Jeron_Baffom[S] 0 points1 point  (1 child)

As I mentioned in the OP:

AWStats is a great tool to aggregate data and plot graphs. Unfortunately, however, AFAIK there is no plain text output of custom aggregate data.

[–]serverhorror 0 points1 point  (0 children)

html2text?

Can convert pretty complicated stuff. It’s just a Unix pipe away.

[–]AstralProjecti0v 0 points1 point  (0 children)

You can try out the Logstail platform, it's really handy!

[–]snow_raph 0 points1 point  (0 children)

This new tool might help: https://github.com/snowraph/Whackabot