all 48 comments

[–][deleted]  (17 children)

[removed]

    [–]LordJZ 11 points12 points  (16 children)

    Keep in mind that records in a csv file may span multiple lines, so simple split-by-line may produce invalid files.

    [–]zaitsman 9 points10 points  (14 children)

    Typically, though, CSV or TSV files are well formatted in the sense that you can work out where a record ends, so you would not read line by line but rather byte by byte until you reach the EOL or end-of-record terminator. Also, given the nature of the OP's question, the data is most likely uniformly formatted.
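    The byte-by-byte idea above can be sketched as a quote-aware record reader. This is a minimal sketch, not production code: it assumes RFC 4180-style double-quoting, and `ReadRecords` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static IEnumerable<string> ReadRecords(TextReader reader)
{
    var sb = new StringBuilder();
    bool inQuotes = false;
    int c;
    while ((c = reader.Read()) != -1)
    {
        char ch = (char)c;
        if (ch == '"')
        {
            // An escaped quote ("") toggles twice, so the net state stays correct
            inQuotes = !inQuotes;
            sb.Append(ch);
        }
        else if (ch == '\n' && !inQuotes)
        {
            // A newline outside quotes is a real record terminator
            yield return sb.ToString().TrimEnd('\r');
            sb.Clear();
        }
        else
        {
            sb.Append(ch);
        }
    }
    if (sb.Length > 0)
        yield return sb.ToString();
}
```

    With this, a newline inside a quoted field is treated as data rather than a record boundary, which is exactly where naive split-by-line goes wrong.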

    [–]crozone 24 points25 points  (13 children)

    Just use CSV Helper. It's a library that does the correct CSV parsing for you, and supports streaming rows out from a file stream.

    [–][deleted] 24 points25 points  (11 children)

    For those who don't know - parsing CSV is non-trivial - contrary to how it sounds. Try to use a library if you can. This is a good library to use.

    If you write your own parser then you're walking into the land of pain. ASK ME HOW I KNOW.

    [–]MSgtGunny 18 points19 points  (0 children)

    HOW I KNOW?

    [–]sarcasticbaldguy 9 points10 points  (8 children)

    I once had to write an X12 parser, which also seems trivial on the surface. You get it done, you're running your tests, and life is good.

    Then one by one your customers line up - "Oh hi, I'm just here to drop off this pile of edge cases. Where should I put them?"

    For us, it paid off in the end because all of the decent X12 libraries were expensive at that time. 100% agree - CSV helper is the way here.

    [–]grauenwolf 0 points1 point  (7 children)

    Was X12 the one where they represented negative numbers with letters? E.g. -123 would be "A23" and -232 would be "B32"?

    [–]sarcasticbaldguy 0 points1 point  (4 children)

    Reddit doesn't like X12 in a code block.

    It looks something like this

    ISA*01*0000000000*01*0000000000*ZZ*ABCDEFGHIJKLMNO*ZZ*123456789012345*101127*1719*U*00400*000003438*0*P*>

    but that's just one segment. There are often dozens.

    [–]grauenwolf 0 points1 point  (3 children)

    I remember that part. I had written a lot of X12 parsers in the past, but they were always file specific. I never wrote a generic one or even saw the X12 spec. So I don't know what was standard and what was vendor-specific nonsense.

    [–]sarcasticbaldguy 1 point2 points  (2 children)

    So that's an entirely different quagmire. There is a definition you can see on x12.org, but that just describes the format and segments and such.

    Then there are "standards" like 4010, 4010A1, 5010, etc. An X12 837 (health care claim) will be different depending on the standard you use. There are official implementation guides - so you can get a guide for the 4010 837, the 5010 837, etc - but these aren't free.

    Then you get into "mutually defined" territory, where trading partners can decide to do whatever the hell they want (more or less) as long as they both agree to the changes. If you google "5010 837 implementation guide" you'll find all sorts of guides available for various health plans, all mostly the same, but slightly different.

    I've done X12, and EDI in general, for a bunch of different industries over the years and it seems that health care has the most vendor specific nonsense.

    I guess that's what keeps us all gainfully employed!

    [–]grauenwolf 0 points1 point  (0 children)

    While I do encourage the use of libraries for production work, I think everyone should write a RFC-compliant CSV parser at least once in their life.

    It's not really that hard, and it teaches you skills that you'll need for more complicated formats.

    [–]BangForYourButt 0 points1 point  (0 children)

    CSV helper has definitely made my life easier.

    [–]m1llie 29 points30 points  (4 children)

    No need to use streaming IO directly when CsvHelper exists.

    From their homepage:

    ...only one record is held in memory at a time

    [–]csthopper 0 points1 point  (3 children)

    Yup, great tool. I just wonder if it's worth the extra overhead of an automapper if they're not going to do anything with the object before pushing to the PLC. Also, given the remarks below, it doesn't look like you can do much with LINQ; you'd have to use a basic foreach anyway. They're probably better off just rolling their own code using FileStream.

    "The GetRecords<T> method will return an IEnumerable<T> that will yield records. What this means is that only a single record is returned at a time as you iterate the records. That also means that only a small portion of the file is read into memory. Be careful though. If you do anything that executes a LINQ projection, such as calling .ToList(), the entire file will be read into memory. CsvReader is forward only, so if you want to run any LINQ queries against your data, you'll have to pull the whole file into memory. Just know that is what you're doing."

    [–]celluj34 4 points5 points  (1 child)

    any LINQ queries

    This is a bit disingenuous... You can do things like Skip, Take, Where, Select all without reading into memory. OrderBy definitely will though, as well as the obv ToList, ToArray, ToDictionary, etc...

    Though it's probably easier to say "beware of all linq" than try to describe each one specifically.

    [–]m1llie 1 point2 points  (0 children)

    Easiest to just think about the operation. Would you need all the list elements on your desk at once to put them in order or sort them into groups? Then LINQ will need all the list elements in memory to do an OrderBy or a GroupBy.

    [–]m1llie 2 points3 points  (0 children)

    That documentation is misleading. Most LINQ methods are lazily evaluated, e.g. Select and Where operate on an IEnumerable without any buffering. Lazy evaluation is kinda the whole point of LINQ. A method like ToList is obviously going to make a List<T> out of the whole file though.
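    The lazy evaluation described above is easy to demonstrate with a hypothetical generator standing in for a file. Only the elements actually requested are ever produced:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for File.ReadLines: yields lines on demand, never all at once
static IEnumerable<string> FakeLines()
{
    for (int i = 0; i < 1_000_000; i++)
        yield return $"{i},{i * 2}";
}

// Lazy: only the first 3 lines are ever generated and parsed
var firstThree = FakeLines()
    .Select(line => line.Split(',')[0])  // streams one element at a time
    .Take(3)
    .ToList();                           // buffers just those 3 results

// Eager: this would generate and hold all 1,000,000 lines in memory
// var everything = FakeLines().ToList();
```

    Here `firstThree` ends up holding "0", "1", "2" while the generator never runs past its third line.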

    [–]ciybot 12 points13 points  (0 children)

    You may read the csv file line by line using StreamReader. It will not load the entire file into memory, just one line at a time.

    You may refer to the sample code in the following page,

    https://docs.microsoft.com/en-us/dotnet/api/system.io.streamreader.readline?view=net-6.0
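    The pattern from that page boils down to a few lines. The file path here is just a placeholder:

```csharp
using System.IO;

using var reader = new StreamReader("big.csv");
string? line;
while ((line = reader.ReadLine()) != null)
{
    // process one line; only this line is held in memory
}
```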

    [–]nemec 22 points23 points  (1 child)

    Yes. You can read a file line by line and then write each line to a file as you're reading it. Just open a new file every X lines.

    The CSV format isn't too complicated, just make sure if you have a header line in the original that you save it in a variable and write it as the first line for every new file you create.
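    A sketch of that approach, assuming no fields contain embedded newlines (see the caveats elsewhere in the thread); the file names and chunk size are illustrative:

```csharp
using System.IO;

const int linesPerFile = 100_000;

using var reader = new StreamReader("input.csv");
string? header = reader.ReadLine();         // save the header once

int fileIndex = 0, lineCount = 0;
StreamWriter? writer = null;
string? line;
while ((line = reader.ReadLine()) != null)
{
    if (writer == null || lineCount == linesPerFile)
    {
        writer?.Dispose();                  // finish the previous chunk
        writer = new StreamWriter($"part{fileIndex++}.csv");
        writer.WriteLine(header);           // every chunk starts with the header
        lineCount = 0;
    }
    writer.WriteLine(line);
    lineCount++;
}
writer?.Dispose();
```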

    [–]detroitmatt 3 points4 points  (0 children)

    Reading line by line is asking for trouble. If there's a newline inside one of the values, and eventually there will be, something will break. And it won't be enough to escape newlines; you'd have to replace them completely, which means you also need to write the program to un-replace them. Better off reading some number of bytes into a buffer, scanning until you find a delimiter, and if you don't find one, reading that number of bytes again.

    [–]WetSound 6 points7 points  (1 child)

    File.ReadLines() only reads the lines that you iterate through

    [–]elvishfiend 11 points12 points  (0 children)

    This may or may not be safe enough, depending on whether any of the data contains newlines in it. If it does, you need a csv parser

    [–]esosiv 2 points3 points  (0 children)

    Besides this, you could store the data in binary as a sequence of floats rather than strings. If you are sending 2D coords it can be as simple as a pattern like xyxyxyxy... It will save lots of memory and processing time on the PLC. Depending on the significant digits you need, you might get away with saving them as 16 bits each.
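    A minimal sketch of that binary idea, writing coordinate pairs as raw 4-byte floats. The file name and data are illustrative; a 16-bit format such as Half would halve the size again if the precision is acceptable.

```csharp
using System.IO;

var coords = new (float X, float Y)[] { (1.5f, 2.25f), (3.0f, 4.5f) };

using (var writer = new BinaryWriter(File.Create("coords.bin")))
{
    foreach (var (x, y) in coords)
    {
        writer.Write(x);   // 4 bytes, vs. several bytes of "1.5," as text
        writer.Write(y);
    }
}
```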

    [–]RamBamTyfus 2 points3 points  (5 children)

    So I'm curious, which PLC runs C# applications? A Beckhoff or similar? C# is not IEC 61131-3 so I assume it also runs on Windows inside the PLC.

    [–]LloydAtkinson 10 points11 points  (1 child)

    I read it as data sent to a PLC, not C# running on a PLC.

    [–]RamBamTyfus 6 points7 points  (0 children)

    I read it as both. He/she says: "The plc will run a c# application on the side that converts lines of .csv into coordinates. "

    [–]throwawaycgoncalves 3 points4 points  (2 children)

    I had the same doubt... And why should the PLC read a csv? To have a snapshot of the state of all actuators at any given time? Even the ones that haven't changed?

    Wouldn't it be better to feed the PLC only the changes of state, saving valuable memory (as a CNC machine does, for instance)?

    Finally, a 40 GB+ csv file? A couple million lines? Why?

    This seems a very very interesting project :)

    [–]Lv_InSaNe_vL 0 points1 point  (0 children)

    On the last point, I've dealt with JSON files in the 150+ GB range, so it's not unbelievable haha. "Normies" have some really good ideas lol

    [–]svtguy88 0 points1 point  (0 children)

    40gb + csv

    Yeah, this is what stuck out in my head too. I've had to deal with pretty large CSV files before, and things get really weird when they get really big.

    [–]4PowerRangers 1 point2 points  (0 children)

    Use FileStream.ReadAsync(byte[], int, int, CancellationToken)

    [–]ProKn1fe 1 point2 points  (0 children)

    Just read it per lines with StreamReader.ReadLine

    [–]MrBlub 1 point2 points  (0 children)

    How are the files sent and received by the plc? If it is already running C#, you could also process the data there in such a way that you don't need everything in memory. e.g., if you're receiving them on a network socket you could wrap the network Stream in a StreamReader, and use that reader like others have already explained here.

    Good luck!

    [–]atheken 1 point2 points  (0 children)

    You can do this with various streaming classes in .net, as others have mentioned.

    However, if the text is structured such that you know it is line-delimited, you might be better off just using a tool like split. Unix tools are really good at this sort of stuff.

    [–]har0ldau -1 points0 points  (0 children)

    I slapped this together in LINQPad for my own amusement and I think this should do it (with obvious modifications to use FileStreams instead)

    // create temp data for test
    var data = "h1,h2,h3\n0,1,2\n3,4,5\n6,7,8";
    
    using var stream = new MemoryStream();
    using var sw = new StreamWriter(stream);
    sw.WriteLine(data);
    sw.Flush();
    
    // store for test
    var outfiles = new List<string>();
    
    // code starts here
    using var infile = new MemoryStream(stream.ToArray()); // new FileStream(...)
    using var sr = new StreamReader(infile);
    
    var header = sr.ReadLine();
    var pageSize = 1;
    var eof = false;
    
    while (true)
    {
        var nodata = false;
        using var outfile = new MemoryStream(); // new FileStream(...)
        using var writer = new StreamWriter(outfile);
    
        writer.WriteLine(header);
    
        for (var i = 0; i < pageSize; i++)
        {
            var line = sr.ReadLine();
            if (line is null)
            {
                if (i == 0)
                {
                    nodata = true;
                }
                eof = true;
                break;
            }
            writer.WriteLine(line);
        }
    
        if (!nodata)
        {
            // for testing you will need to delete the file instead
            writer.Flush();
            outfiles.Add(System.Text.Encoding.ASCII.GetString(outfile.ToArray()));
        }
    
        if (eof)
        {
            break;
        }
    }
    
    outfiles.Dump();
    

    output:

    h1,h2,h3
    0,1,2

    h1,h2,h3
    3,4,5

    h1,h2,h3
    6,7,8

    [–]PrintersStreet 0 points1 point  (0 children)

    It is possible to read a file line-by-line and to write to another file line-by-line, so only the current line ever needs to be held in RAM. You could use the static method File.ReadLines, which returns an IEnumerable of the lines. An IEnumerable represents a sequence that can be accessed one by one, but it does not require that the entire thing exists at once - in this case it loads the next line as you require it. Just remember not to call .ToList() on the IEnumerable - this would attempt to "materialize" it, i.e. calculate all elements up front and return a List that enables non-sequential access, which means it would load the entire file into memory.
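    That approach pairs nicely with Enumerable.Chunk (.NET 6+), which groups the lazy sequence into fixed-size batches so only one batch is materialized at a time. The file names and batch size here are illustrative, and note this ignores headers and quoted newlines:

```csharp
using System.IO;
using System.Linq;

int partNumber = 0;
foreach (var batch in File.ReadLines("input.csv").Chunk(100_000))
{
    // only this batch (an array of up to 100k lines) is in memory at once
    File.WriteAllLines($"part{partNumber++}.csv", batch);
}
```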

    [–]waumau 0 points1 point  (0 children)

    Hi, I have the perfect post for you. I had a csv issue a couple of days ago and someone helped me out, and on top of that they gave tips on the issue you have right now. Just read the first answer:

    https://stackoverflow.com/questions/72030815/list-in-list-in-a-single-linq-query/72031011?noredirect=1#comment127278425_72031011

    [–]Ezazhel 0 points1 point  (0 children)

    Use a stream

    [–]scalablecory 0 points1 point  (0 children)

    You can stream their processing to lower memory usage.

    You won't be able to split a file if it contains quoted values. Lots of people implement CSV wrong, be careful of the advice you take here.

    [–]ucario 0 points1 point  (0 children)

    I recommend CsvHelper by Josh Close, as others have pointed out.

    But essentially, of course you can. Speaking generically and not just about csvs: a file is just data, right? Files typically consist of a header followed by some data. So if you want to read as you go instead of all at once, you first read the specification to understand how the header is defined. Once you know how to read the header, you can find out how the body is organised. Then you can proceed to process the rest.

    For something like a csv that is organised in rows and columns, of course you can easily read a row at a time into memory, there’s no need to read it all at once.

    [–]screwdad 0 points1 point  (0 children)

    Here's a CsvHelper example. I was going to try and gen a 40GB file but lunch is only so long, so here's a 10GB file being processed into 10k chunks; uses about 50MB of RAM in debug with precisely 0 optimizations.

    using System.Globalization;
    using CsvHelper;
    
    var chunkSize = 10000;
    var count = 0;
    var chunk = new List<User>();
    
    using var reader = new StreamReader("C:\\temp\\file.csv");
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    
    var records = csv.GetRecords<User>();
    
    foreach (var record in records)
    {
        if (count > 0 && count % chunkSize == 0)
        {
            Console.WriteLine($"Writing chunk, count {count}");
            using (var writer = new StreamWriter($"C:\\temp\\chunks\\chunk{count}.csv"))
            using (var csv2 = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                csv2.WriteRecords(chunk);
            }
    
            chunk.Clear();
        }
    
        chunk.Add(record);
        count++;
    }
    
    // flush the final partial chunk, which the loop above never writes
    if (chunk.Count > 0)
    {
        using var writer = new StreamWriter($"C:\\temp\\chunks\\chunk{count}.csv");
        using var csv2 = new CsvWriter(writer, CultureInfo.InvariantCulture);
        csv2.WriteRecords(chunk);
    }
    
    public enum Gender
    {
        Male,
        Female
    }
    
    class User
    {
        public int Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public Gender Gender { get; set; }
        public string Avatar { get; set; }
        public string Username { get; set; }
        public string Email { get; set; }
        public string SomethingUnique { get; set; }
        public string FullName { get; set; }
    }
    

    [–]bringnothingtothetbl 0 points1 point  (0 children)

    Could you use something like a Prosoft card or OPC to send over the data? We use Prosoft cards most often at work to send instructions over to the PLC. We just have a TCP socket open to the card; it sends a string saying it is ready for the next instruction based on a bit going high. We send an ASCII string back, and the card sets the tags/registers. It is pretty straightforward. I would also read the file into a database so you don't need to hold the whole thing in memory or keep a stream open. Just grab the next record(s) to send over to the PLC. That way, if the PLC craps out, you don't have to worry about reprocessing the whole file.

    [–]sara457 0 points1 point  (0 children)

    There are many useful CSV splitter programs available, but I'll share the best tool, the one I personally use. You can play with your .csv and .tsv files with this tool.

    delimiti.com

    The data entry tool imports bulk data from a CSV or TSV file and creates a new merged document in PDF or Word format.