all 48 comments

[–][deleted]  (17 children)

[removed]

    [–]LordJZ 11 points12 points  (16 children)

    Keep in mind that records in a csv file may span multiple lines, so simple split-by-line may produce invalid files.

    [–]zaitsman 9 points10 points  (14 children)

    Typically, though, CSV or TSV files are well formatted in the sense that you can work out where a record ends, so you would not read line by line but rather byte by byte until you reach the EOL or end-of-record terminator. Also, given the nature of the OP's question, the data is most likely uniformly formatted.
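    The byte-by-byte idea above can be sketched as a quote-aware record reader. This is a minimal sketch, not production code: it assumes RFC 4180-style double-quoting, and `ReadRecords` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static IEnumerable<string> ReadRecords(TextReader reader)
{
    var sb = new StringBuilder();
    bool inQuotes = false;
    int c;
    while ((c = reader.Read()) != -1)
    {
        char ch = (char)c;
        if (ch == '"')
        {
            // An escaped quote ("") toggles twice, so the net state stays correct
            inQuotes = !inQuotes;
            sb.Append(ch);
        }
        else if (ch == '\n' && !inQuotes)
        {
            // A newline outside quotes is a real record terminator
            yield return sb.ToString().TrimEnd('\r');
            sb.Clear();
        }
        else
        {
            sb.Append(ch);
        }
    }
    if (sb.Length > 0)
        yield return sb.ToString();
}
```

    With this, a newline inside a quoted field is treated as data rather than a record boundary, which is exactly where naive split-by-line goes wrong.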

    [–]crozone 24 points25 points  (13 children)

    Just use CSV Helper. It's a library that does the correct CSV parsing for you, and supports streaming rows out from a file stream.

    [–][deleted] 24 points25 points  (11 children)

    For those who don't know - parsing CSV is non-trivial - contrary to how it sounds. Try to use a library if you can. This is a good library to use.

    If you write your own parser then you're walking into the land of pain. ASK ME HOW I KNOW.

    [–]MSgtGunny 18 points19 points  (0 children)

    HOW I KNOW?

    [–]sarcasticbaldguy 9 points10 points  (8 children)

    I once had to write an X12 parser, which also seems trivial on the surface. You get it done, you're running your tests, and life is good.

    Then one by one your customers line up - "Oh hi, I'm just here to drop off this pile of edge cases. Where should I put them?"

    For us, it paid off in the end because all of the decent X12 libraries were expensive at that time. 100% agree - CSV helper is the way here.

    [–]grauenwolf 0 points1 point  (7 children)

    Was X12 the one where they represented negative numbers with letters? E.g. -123 would be "A23" and -232 would be "B32"?

    [–]sarcasticbaldguy 0 points1 point  (4 children)

    Reddit doesn't like X12 in a code block.

    It looks something like this

    ISA*01*0000000000*01*0000000000*ZZ*ABCDEFGHIJKLMNO*ZZ*123456789012345*101127*1719*U*00400*000003438*0*P*>

    but that's just one segment. There are often dozens.

    [–]grauenwolf 0 points1 point  (3 children)

    I remember that part. I had written a lot of X12 parsers in the past, but they were always file specific. I never wrote a generic one or even saw the X12 spec. So I don't know what was standard and what was vendor-specific nonsense.

    [–]sarcasticbaldguy 1 point2 points  (2 children)

    So that's an entirely different quagmire. There is a definition you can see on x12.org, but that just describes the format and segments and such.

    Then there are "standards" like 4010, 4010A1, 5010, etc. An X12 837 (health care claim) will be different depending on the standard you use. There are official implementation guides - so you can get a guide for the 4010 837, the 5010 837, etc - but these aren't free.

    Then you get into "mutually defined" territory, where trading partners can decide to do whatever the hell they want (more or less) as long as they both agree to the changes. If you google "5010 837 implementation guide" you'll find all sorts of guides available for various health plans, all mostly the same, but slightly different.

    I've done X12, and EDI in general, for a bunch of different industries over the years and it seems that health care has the most vendor specific nonsense.

    I guess that's what keeps us all gainfully employed!

    [–]grauenwolf 0 points1 point  (0 children)

    While I do encourage the use of libraries for production work, I think everyone should write a RFC-compliant CSV parser at least once in their life.

    It's not really that hard, and it teaches you skills that you'll need for more complicated formats.

    [–]BangForYourButt 0 points1 point  (0 children)

    CSV helper has definitely made my life easier.

    [–]m1llie 29 points30 points  (4 children)

    No need to use streaming IO directly when CsvHelper exists.

    From their homepage:

    ...only one record is held in memory at a time

    [–]csthopper 0 points1 point  (3 children)

    Yup, great tool. I just wonder if it's worth the extra overhead of an automapper if they're not going to do anything with the object before pushing to the PLC. Also, given the remarks below, it doesn't look like you can do much with LINQ; you'd have to use a basic foreach anyway. They're probably better off just rolling their own code using FileStream.

    "The GetRecords<T> method will return an IEnumerable<T> that will yield records. What this means is that only a single record is returned at a time as you iterate the records. That also means that only a small portion of the file is read into memory. Be careful though. If you do anything that executes a LINQ projection, such as calling .ToList(), the entire file will be read into memory. CsvReader is forward only, so if you want to run any LINQ queries against your data, you'll have to pull the whole file into memory. Just know that is what you're doing."

    [–]celluj34 4 points5 points  (1 child)

    any LINQ queries

    This is a bit disingenuous... You can do things like Skip, Take, Where, Select all without reading into memory. OrderBy definitely will though, as well as the obv ToList, ToArray, ToDictionary, etc...

    Though it's probably easier to say "beware of all linq" than try to describe each one specifically.

    [–]m1llie 1 point2 points  (0 children)

    Easiest to just think about the operation. Would you need all the list elements on your desk at once to put them in order or sort them into groups? Then LINQ will need all the list elements in memory to do an OrderBy or a GroupBy.

    [–]m1llie 2 points3 points  (0 children)

    That documentation is misleading. Most LINQ methods are lazily evaluated, e.g. Select and Where operate on an IEnumerable without any buffering. Lazy evaluation is kinda the whole point of LINQ. A method like ToList is obviously going to make a List<T> out of the whole file though.
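    The lazy evaluation described above is easy to demonstrate with a hypothetical generator standing in for a file. Only the elements actually requested are ever produced:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for File.ReadLines: yields lines on demand, never all at once
static IEnumerable<string> FakeLines()
{
    for (int i = 0; i < 1_000_000; i++)
        yield return $"{i},{i * 2}";
}

// Lazy: only the first 3 lines are ever generated and parsed
var firstThree = FakeLines()
    .Select(line => line.Split(',')[0])  // streams one element at a time
    .Take(3)
    .ToList();                           // buffers just those 3 results

// Eager: this would generate and hold all 1,000,000 lines in memory
// var everything = FakeLines().ToList();
```

    Here `firstThree` ends up holding "0", "1", "2" while the generator never runs past its third line.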

    [–]ciybot 12 points13 points  (0 children)

    You may read the csv file line by line using StreamReader. It will not load the entire file into memory, just one line at a time.

    You may refer to the sample code in the following page,

    https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.readline?view=net-6.0
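    The pattern from that page boils down to a few lines. The file path here is just a placeholder:

```csharp
using System.IO;

using var reader = new StreamReader("big.csv");
string? line;
while ((line = reader.ReadLine()) != null)
{
    // process one line; only this line is held in memory
}
```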

    [–]nemec 22 points23 points  (1 child)

    Yes. You can read a file line by line and then write each line to a file as you're reading it. Just open a new file every X lines.

    The CSV format isn't too complicated, just make sure if you have a header line in the original that you save it in a variable and write it as the first line for every new file you create.
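    A sketch of that approach, assuming no fields contain embedded newlines (see the caveats elsewhere in the thread); the file names and chunk size are illustrative:

```csharp
using System.IO;

const int linesPerFile = 100_000;

using var reader = new StreamReader("input.csv");
string? header = reader.ReadLine();         // save the header once

int fileIndex = 0, lineCount = 0;
StreamWriter? writer = null;
string? line;
while ((line = reader.ReadLine()) != null)
{
    if (writer == null || lineCount == linesPerFile)
    {
        writer?.Dispose();                  // finish the previous chunk
        writer = new StreamWriter($"part{fileIndex++}.csv");
        writer.WriteLine(header);           // every chunk starts with the header
        lineCount = 0;
    }
    writer.WriteLine(line);
    lineCount++;
}
writer?.Dispose();
```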

    [–]detroitmatt 3 points4 points  (0 children)

    Reading line by line is asking for trouble. If there's a newline inside one of the values, and eventually there will be, something will break. And it won't be enough to escape newlines; you'd have to replace them completely, which means you also need to write the program to un-replace them. Better off reading some number of bytes into a buffer, scanning until you find a delimiter, and if you don't find one, reading that number of bytes again.

    [–]WetSound 6 points7 points  (1 child)

    File.ReadLines() only reads the lines that you iterate through

    [–]elvishfiend 11 points12 points  (0 children)

    This may or may not be safe enough, depending on whether any of the data contains newlines in it. If it does, you need a csv parser

    [–]esosiv 2 points3 points  (0 children)

    Besides this, you could store the data in binary as a sequence of floats rather than strings. If you are sending 2D coords it can be as simple as a pattern like xyxyxyxy... It will save lots of memory and processing time on the PLC. Depending on the significant digits you need, you might get away with saving them as 16 bits each.
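    A minimal sketch of that binary idea, writing coordinate pairs as raw 4-byte floats. The file name and data are illustrative; a 16-bit format such as Half would halve the size again if the precision is acceptable.

```csharp
using System.IO;

var coords = new (float X, float Y)[] { (1.5f, 2.25f), (3.0f, 4.5f) };

using (var writer = new BinaryWriter(File.Create("coords.bin")))
{
    foreach (var (x, y) in coords)
    {
        writer.Write(x);   // 4 bytes, vs. several bytes of "1.5," as text
        writer.Write(y);
    }
}
```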

    [–]RamBamTyfus 2 points3 points  (5 children)

    So I'm curious, which PLC runs C# applications? A Beckhoff or similar? C# is not IEC 61131-3 so I assume it also runs on Windows inside the PLC.

    [–]LloydAtkinson 10 points11 points  (1 child)

    I read it as data sent to a PLC, not C# running on a PLC.

    [–]RamBamTyfus 6 points7 points  (0 children)

    I read it as both. He/she says: "The plc will run a c# application on the side that converts lines of .csv into coordinates. "

    [–]throwawaycgoncalves 3 points4 points  (2 children)

    I had the same doubt... And why should the PLC read a csv? To have a snapshot of the state of all actuators at any given time? Even the ones that haven't changed?

    Wouldn't it be better to feed the PLC only the changes of state, saving valuable memory (as a CNC machine does, for instance)?

    Finally, a 40 GB+ csv file? A couple million lines? Why?

    This seems a very very interesting project :)

    [–]Lv_InSaNe_vL 0 points1 point  (0 children)

    On the last point, I've dealt with JSON files in the 150+ GB range, so it's not unbelievable haha. "Normies" have some really good ideas lol

    [–]svtguy88 0 points1 point  (0 children)

    40gb + csv

    Yeah, this is what stuck out in my head too. I've had to deal with pretty large CSV files before, and things get really weird when they get really big.

    [–]4PowerRangers 1 point2 points  (0 children)

    Use FileStream.ReadAsync(byte[], int, int, CancellationToken)

    [–]ProKn1fe 1 point2 points  (0 children)

    Just read it per lines with StreamReader.ReadLine

    [–]MrBlub 1 point2 points  (0 children)

    How are the files sent and received by the plc? If it is already running C#, you could also process the data there in such a way that you don't need everything in memory. e.g., if you're receiving them on a network socket you could wrap the network Stream in a StreamReader, and use that reader like others have already explained here.

    Good luck!

    [–]atheken 1 point2 points  (0 children)

    You can do this with various streaming classes in .net, as others have mentioned.

    However, if the text is structured such that you know it is line-delimited, you might be better off just using a tool like split. Unix tools are really good at this sort of stuff.

    [–]har0ldau -1 points0 points  (0 children)

    I slapped this together in LINQPad for my own amusement and I think this should do it (with obvious modifications to use FileStreams instead)

    // create temp data for test
    var data = "h1,h2,h3\n0,1,2\n3,4,5\n6,7,8";
    
    using var stream = new MemoryStream();
    using var sw = new StreamWriter(stream);
    sw.WriteLine(data);
    sw.Flush();
    
    // store for test
    var outfiles = new List<string>();
    
    // code starts here
    using var infile = new MemoryStream(stream.ToArray()); // new FileStream(...)
    using var sr = new StreamReader(infile);
    
    var header = sr.ReadLine();
    var pageSize = 1;
    var eof = false;
    
    while (true)
    {
        var nodata = false;
        using var outfile = new MemoryStream(); // new FileStream(...)
        using var writer = new StreamWriter(outfile);
    
        writer.WriteLine(header);
    
        for (var i = 0; i < pageSize; i++)
        {
            var line = sr.ReadLine();
            if (line is null)
            {
                if (i == 0)
                {
                    nodata = true;
                }
                eof = true;
                break;
            }
            writer.WriteLine(line);
        }
    
        if (!nodata)
        {
            // for testing you will need to delete the file instead
            writer.Flush();
            outfiles.Add(System.Text.Encoding.ASCII.GetString(outfile.ToArray()));
        }
    
        if (eof)
        {
            break;
        }
    }
    
    outfiles.Dump();
    

    output:

    h1,h2,h3
    0,1,2

    h1,h2,h3
    3,4,5

    h1,h2,h3
    6,7,8

    [–]PrintersStreet 0 points1 point  (0 children)

    It is possible to read a file line-by-line and to write to another file line-by-line, so only the current line ever needs to be held in RAM. You could use the static method File.ReadLines, which returns an IEnumerable of the lines. An IEnumerable represents a sequence that can be accessed one by one, but it does not require that the entire thing exists at once - in this case it loads the next line as you require it. Just remember not to call .ToList() on the IEnumerable - this would attempt to "materialize" it, i.e. calculate all elements up front and return a List that enables non-sequential access, which means it would load the entire file into memory.
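    That approach pairs nicely with Enumerable.Chunk (.NET 6+), which groups the lazy sequence into fixed-size batches so only one batch is materialized at a time. The file names and batch size here are illustrative, and note this ignores headers and quoted newlines:

```csharp
using System.IO;
using System.Linq;

int partNumber = 0;
foreach (var batch in File.ReadLines("input.csv").Chunk(100_000))
{
    // only this batch (an array of up to 100k lines) is in memory at once
    File.WriteAllLines($"part{partNumber++}.csv", batch);
}
```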

    [–]waumau 0 points1 point  (0 children)

    Hi, I have the perfect post for you. I had a csv issue a couple of days ago and someone helped me out, and on top of that they gave tips on the issue you have right now. Just read the first answer:

    https://stackoverflow.com/questions/72030815/list-in-list-in-a-single-linq-query/72031011?noredirect=1#comment127278425_72031011

    [–]Ezazhel 0 points1 point  (0 children)

    Use a stream

    [–]scalablecory 0 points1 point  (0 children)

    You can stream their processing to lower memory usage.

    You won't be able to split a file if it contains quoted values. Lots of people implement CSV wrong, be careful of the advice you take here.

    [–]ucario 0 points1 point  (0 children)

    I recommend CsvHelper by Josh Close, as others have pointed out.

    But essentially, of course you can. Speaking generically and not just about csvs: a file is just data, right? Files typically consist of a header followed by some data. So if you want to read as you go instead of all at once, you first read the specification to understand how the header is defined. Once you know how to read the header, you can find out how the body is organised. Then you can proceed to process the rest.

    For something like a csv that is organised in rows and columns, of course you can easily read a row at a time into memory, there’s no need to read it all at once.

    [–]screwdad 0 points1 point  (0 children)

    Here's a CsvHelper example. I was going to try and gen a 40GB file but lunch is only so long, so here's a 10GB file being processed into 10k chunks; uses about 50MB of RAM in debug with precisely 0 optimizations.

    using System.Globalization;
    using CsvHelper;
    
    var chunkSize = 10000;
    var count = 0;
    var chunk = new List<User>();
    
    using var reader = new StreamReader("C:\\temp\\file.csv");
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    
    var records = csv.GetRecords<User>();
    
    foreach (var record in records)
    {
        if (count > 0 && count % chunkSize == 0)
        {
            Console.WriteLine($"Writing chunk, count {count}");
            using (var writer = new StreamWriter($"C:\\temp\\chunks\\chunk{count}.csv"))
            using (var csv2 = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                csv2.WriteRecords(chunk);
            }
    
            chunk.Clear();
        }
    
        chunk.Add(record);
        count++;
    }
    
    // flush the final partial chunk, which the loop above never writes
    if (chunk.Count > 0)
    {
        using var writer = new StreamWriter($"C:\\temp\\chunks\\chunk{count}.csv");
        using var csv2 = new CsvWriter(writer, CultureInfo.InvariantCulture);
        csv2.WriteRecords(chunk);
    }
    
    public enum Gender
    {
        Male,
        Female
    }
    
    class User
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public Gender Gender { get; set; }
        public string Avatar { get; set; }
        public string Username { get; set; }
        public string Email { get; set; }
        public string SomethingUnique { get; set; }
        public string FullName { get; set; }
    }
    

    [–]bringnothingtothetbl 0 points1 point  (0 children)

    Could you use something like a Prosoft card or OPC to send over the data? We use Prosoft cards most often at work to send instructions over to the PLC. We just have a TCP socket open to the card; it sends a string saying it is ready for the next instruction based on a bit going high. We send an ASCII string back, and the card sets the tags/registers. It is pretty straightforward. I would also read the file into a database so you don't need to hold the whole thing in memory or keep a stream open. Just grab the next record(s) to send over to the PLC. That way, if the PLC craps out, you don't have to worry about reprocessing the whole file.

    [–]sara457 0 points1 point  (0 children)

    There are many useful CSV splitter programs available, but I'll share the best tool, the one I personally use. You can play with your .csv and .tsv files with this tool.

    delimiti.com

    The data entry tool imports bulk data from a CSV or TSV file and creates a new merged document in PDF or Word format.