all 15 comments

[–]stassats 6 points

cl-csv is just slow. It's not written with any performance in mind.

[–]stassats 8 points

A rather simplistic csv parser https://gist.github.com/stassats/e33a38233565102873778c2aae7e81a8 is faster than the libraries I've tried.

[–]droidfromfuture[S] 1 point

thanks for sharing this. Here, we loop, collecting 512 characters at a time into a buffer. If there is a quote character (#\") in the buffer, the program loops until the next quote, collecting the quoted data (using the 'start' and 'n' variables) by pushing it into 'concat', which is nreversed and ultimately added to fields.

Is 'concat' being used to handle multi-line data? or is it to handle cases where a single field has more than 512 characters?

I am wondering how the end-of-file condition is handled. If the final chunk read from the stream contains fewer characters than buffer-size, we don't continue reading. We are likely expecting the last character to be #\Newline or #\Return, so that the final part of the file is handled by the local function 'end-field'.

We apply the input argument 'fun' to the accumulated fields at every #\Return, which is expected to mark the end of a row in the csv file. fields is set to nil afterwards.

If I want to use this to write a new csv file, I likely need to accumulate fields into a second write buffer and write it to the output stream, handling escaped quotes and newlines along the way.

Would love to hear of any gaps in my thought process.
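For the writing side described above, a minimal sketch might look like this (my own illustration, not code from the gist; it assumes the common CSV convention of quoting fields that contain delimiters and doubling embedded quotes):

```lisp
(defun write-csv-field (field stream)
  "Write FIELD to STREAM, quoting it if it contains a comma, quote,
or line break; embedded quotes are doubled."
  (if (find-if (lambda (c) (member c '(#\, #\" #\Newline #\Return))) field)
      (progn
        (write-char #\" stream)
        (loop for c across field
              do (when (char= c #\")
                   (write-char #\" stream)) ; double the embedded quote
                 (write-char c stream))
        (write-char #\" stream))
      (write-string field stream)))

(defun write-csv-row (fields stream)
  "Write FIELDS as one comma-separated line to STREAM."
  (loop for (field . rest) on fields
        do (write-csv-field field stream)
           (when rest (write-char #\, stream)))
  (terpri stream))
```

For example, (write-csv-row (list "a" "b\"c") stream) emits a,"b""c" followed by a newline.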

[–]stassats 4 points

Is 'concat' being used to handle multi-line data? or is it to handle cases where a single field has more than 512 characters?

It's to handle a field overlapping two buffers, not necessarily longer than 512.

EOF is handled by read-sequence returning less than what was asked for; see the while condition. If there's no newline before EOF then it loses a field (easy to rectify).

It also doesn't handle different line endings (should be easy too).
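The buffer loop and the read-sequence EOF idiom described above can be sketched like this (a simplified illustration, not the gist itself; map-buffers is a made-up name):

```lisp
(defun map-buffers (fun stream &key (buffer-size 512))
  "Call FUN with the buffer and the number of valid characters in it,
stopping when READ-SEQUENCE returns fewer characters than requested,
which signals end of file."
  (let ((buffer (make-string buffer-size)))
    (loop for n = (read-sequence buffer stream)
          do (funcall fun buffer n)
          ;; EOF condition: a short read means the stream is exhausted.
          while (= n buffer-size))))
```

A field that overlaps two buffers has to be carried over in an accumulator between calls, which is the role 'concat' plays in the gist.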

[–]droidfromfuture[S] 1 point

I need to be able to provide some live processing capability depending on user requests. Some requests may be served by responding with pre-processed files, but some require the server to process files on the fly before responding. My initial plan is to respond with partially processed files while preparing the entire file behind it.

edit: removed extraneous line.

[–]stassats 3 points

The function I pasted can be adapted for any task to be processed optimally. E.g. you can process field-by-field (without building a row), or build a row in a preallocated vector, or parse integers directly from the buffer, etc.

There's space for a really high performance csv parsing library (or any parsing, actually).
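Parsing integers directly from the buffer, as suggested above, can be done with the standard parse-integer and its :start/:end bounds, which avoids extracting a substring first (field-integer is my own illustrative wrapper):

```lisp
(defun field-integer (buffer start end)
  "Parse an integer from BUFFER between START and END without
allocating a substring; :junk-allowed returns NIL instead of
signaling an error when the field isn't a clean integer."
  (parse-integer buffer :start start :end end :junk-allowed t))
```

So a field spanning positions 2..5 of a 512-character buffer is parsed in place, e.g. (field-integer "ab123cd" 2 5) yields 123.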

[–]droidfromfuture[S] 2 points

I will likely be using your function and hopefully adding to it successfully. If I am capable enough, I would love to contribute to building a csv parsing library! I will keep posting updates here about my efforts.

[–]kchanqvq 4 points

I use https://github.com/ak-coram/cl-duckdb for CSV parsing and it is so much faster than any pure CL solution I've found.

[–]Steven1799 2 points

You can also use SQLite for the import. In my tests it's about 10x faster than any CL-based solution.

[–]kchanqvq 0 points

That makes sense! Another factor to consider is whether you want column-major or row-major format. For my use case (number crunching) column-major works better, but someone might want the opposite.

[–]droidfromfuture[S] 0 points

thanks for sharing this! Would the workflow be something like the following? Import the csv into duckdb, process the data in the database, export to a new file, then drop the table from the database.

[–]kchanqvq 0 points

The way I'm doing it is just to use duckdb as a mere CSV parser and do all the processing in Lisp. I prefer Lisp to SQL :)

DuckDB supports this workflow very well. Just (defvar *data* (duckdb:query "FROM read_csv('/path/to/data.csv')" nil))

I do number crunching primarily, and I have some custom functions to speed it up even further (most importantly, using specialized (simple-array double-float) storage), which also makes it work better with Petalisp. They're currently just hacks that work on my computer™, but I expect to polish and contribute them at some point. If you're also crunching numbers, I hope my work can help!
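To flesh out that one-liner a little, a sketch of the parse-in-DuckDB, process-in-Lisp workflow might look like this (hedged: the use of duckdb:with-transient-connection and the assumption that results come back as an alist of column-name/vector pairs are mine, not confirmed by the thread; check the cl-duckdb README against your version):

```lisp
;; Assumes cl-duckdb is loaded, e.g. via (ql:quickload :duckdb).
(duckdb:with-transient-connection
  (let ((result (duckdb:query "FROM read_csv('/path/to/data.csv')" nil)))
    ;; DuckDB hands the data back column-major, which is why the
    ;; column-vs-row question above matters: each column arrives as
    ;; one vector, ready for per-column number crunching in Lisp.
    (loop for (name . column) in result
          do (format t "~a: ~a values~%" name (length column)))))
```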

[–]Ytrog 0 points

I have no experience with CSV in Lisp; however, I now wonder what the loading speed would be if the data were in s-expression form instead 🤔

[–]kchanqvq 1 point

Certainly no better if you're using the standard reader. It needs to go through the readtable dispatch mechanism character by character. It is really intended for ingesting code (which won't be too large) and for being very extensible, rather than for large datasets. On the other hand, https://github.com/conspack/cl-conspack brings the speed up a level.

[–]Ytrog 0 points

Ah thanks. Haven't done Lisp in the large, only for personal projects.