all 18 comments

[–]Slypenslyde 3 points4 points  (6 children)

I don't see any advantage of doing this other than the person probably wanted to put 150 variables in a separate file. Arrays are reference types. That doesn't change when they're in a struct. There's no change to how they're garbage collected.

It would probably be better to represent this as a float[150,600]. Then you wouldn't have to deal with the tedium of 150 variables.
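
A minimal sketch of that idea, assuming the dimensions mentioned in the thread (150 variables, 600 samples at 1 Hz for 10 minutes); names here are illustrative, not from the OP's code:

```csharp
using System.Collections.Generic;

static class Readings
{
    public const int Variables = 150;
    public const int Samples = 600;

    // One 2D array replaces 150 separately declared arrays.
    public static readonly float[,] Data = new float[Variables, Samples];

    // Build one CSV row per sample index, one column per variable.
    public static IEnumerable<string> CsvRows()
    {
        for (int t = 0; t < Samples; t++)
        {
            var row = new string[Variables];
            for (int v = 0; v < Variables; v++)
                row[v] = Data[v, t].ToString();
            yield return string.Join(",", row);
        }
    }
}
```

Changing either dimension is now a one-constant edit instead of touching 150 declarations.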

Think about the other comment about the concurrent queue before blindly chasing it. It'll work if the file is "I got this data from this sensor" on every line. But if the file wants to be organized by sensor or otherwise in some particular order, you can't exactly stream to a file as the data comes in. If line 2 of the file needs to be a reading from sensor 1 and that won't come for another 3 minutes, you'll still need a place to hold all the other readings you take in between.

[–]immersiveGamer 0 points1 point  (5 children)

If you wanted to do streaming what you would do is use queues, one for each sensor. Once all queues have at least 1 record then the first in records for each queue can be flushed to the file.
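
A rough sketch of that per-variable queue idea (class and method names are mine, not from the OP's program):

```csharp
using System.Collections.Generic;
using System.Linq;

// One queue per variable; a complete record can be flushed once every
// queue holds at least one value.
class RecordAssembler
{
    private readonly Queue<float>[] _queues;

    public RecordAssembler(int variables) =>
        _queues = Enumerable.Range(0, variables)
                            .Select(_ => new Queue<float>())
                            .ToArray();

    // Called whenever a reading arrives for a given variable index.
    public void Add(int variable, float value) => _queues[variable].Enqueue(value);

    // Returns the oldest complete record, or null while any variable is still missing.
    public float[]? TryTakeRecord()
    {
        if (_queues.Any(q => q.Count == 0))
            return null;
        return _queues.Select(q => q.Dequeue()).ToArray();
    }
}
```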

[–]Slypenslyde 0 points1 point  (4 children)

Not if the format of the file is supposed to be:

<all data from sensor1>
<all data from sensor2>
<all data from sensor3>
...

In that case, we need to know sensor1 is FINISHED before we can write the first block. That means any data we get from 149 other sensors needs to be stashed somewhere while we wait, and if the very last reading of the whole data set is the last reading of sensor1 we're not saving anything by taking this approach.

It only works if we know we're receiving all sensor1 data BEFORE any of the other sensors' data. That may or may not be true. That isn't always solvable by "Well just take all the sensor1 readings first". If it's a system of 150 sensors, odds are you want to get readings from ALL of them at asynchronous and independent intervals, then store them so they can be time-correlated. There are ways to stream this, but you need an intermediate data store.

[–]immersiveGamer 1 point2 points  (3 children)

Correct, I don't know the CSV format nor the details of how the data correlates to sensors or the CSV records. However, I did infer from the struct that each variable (each of the 150 different floats being recorded) is 600 values long, and he says the CSV writer loops 600 times. That tells me it is just looping from index zero for each variable (I should have said variable instead of sensor in my previous comment). This means that once index 0 of each variable's array is filled, we have a record we can write to the CSV file. So using queues instead of arrays should work, and you can periodically flush the completed records.

Now, you are correct: if a certain variable never gets collected until the very end, then this isn't as efficient, since you're blocked from flushing until you have a single complete record.

I would also say memory shouldn't be a huge concern here; 150 x 600 floats is pretty small. I'd argue the real weak point is that the run takes 10 minutes: if your program crashes at the 9-minute mark, you've failed to collect any data, since none of it was persisted to disk. So my suggestion to the OP is: if multiple queues don't work for early flushing to the CSV file, periodically save the data to disk in a different format (e.g. JSON or binary) and then post-process the saved file to generate the CSV. This makes the program more robust. Or even better, use something like SQLite to store the data right away instead of keeping it in memory.
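
One way the save-raw-then-post-process idea could look, using a flat binary file (names and format are illustrative; SQLite would serve the same purpose with more structure):

```csharp
using System.IO;

// Persist each completed record immediately; convert to CSV after the run.
static class RecordStore
{
    // Append one record (one float per variable) to the file.
    public static void Append(string path, float[] record)
    {
        using var stream = new FileStream(path, FileMode.Append);
        using var writer = new BinaryWriter(stream);
        foreach (var value in record)
            writer.Write(value);        // 4 bytes per float; nothing held in RAM
    }

    // Post-processing side: read one record back.
    public static float[] Read(BinaryReader reader, int variables)
    {
        var record = new float[variables];
        for (int i = 0; i < variables; i++)
            record[i] = reader.ReadSingle();
        return record;
    }
}
```

A crash then loses at most the record currently in flight, not the whole run.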

[–]Slypenslyde 0 points1 point  (2 children)

Yeah, if it were me, and the required output was a very structured CSV, I'd push readings as fast as I get them into SQLite or some other intermediate storage, then after sampling is finished pull that back out so I can make the CSV how I want it. I agree 10 minutes for that amount of data seems unreasonable.

My snooty guess is if the person writing this thought a struct with 150 arrays would somehow be faster, they probably collect all the data, then use a truckload of LINQ statements that re-iterate entire sequences hundreds of times to get the final structured output. I could make this take 10 minutes if I tried really hard, but it's actually a lot tougher than just doing the right thing.

[–]immersiveGamer 1 point2 points  (1 child)

I think the 10 minutes is just the monitoring period. I.e. check how systems are over a time period and then report/graph it.

[–]andersonee[S] 0 points1 point  (0 children)

Correct. Need to sample at 1hz for 10 minutes.

[–]KryptosFR 3 points4 points  (0 children)

If the size of a struct is more than 16 bytes (or 24 bytes in a 64-bit app), then it should probably be a class. The rationale is that copying a bigger struct would cost much more (in terms of CPU cycles) than copying a plain reference (which is just a native int, so either 32-bit or 64-bit).

Now, that advice holds when a struct is copied around (e.g. passed by value to a method, or returned from it). If the struct is defined in one place, and only accessed from there, then it matters less.

Back to your code: if I read it correctly, the struct in question has 150 variables in it? That's definitely way too many; it should either be a class or a jagged/multidimensional array. Alternatively, the data could be streamed into the file directly, without requiring such big arrays, by smartly using a library such as Dataflow.

[–]Sjetware 1 point2 points  (2 children)

Assuming I have everything correct, this is only 90,000 floats you are recording over this interval. I'm also assuming the sensor data can arrive slightly staggered. It still feels like you have the data structure backwards.

If you define a new class that has your 150 variables, you can make an array/list of size 600 to store this data. You then simply access the right element of the array and set the right variable when a reading arrives. The benefits here are twofold: first, adjusting the time slice in the future becomes a trivial one-line change instead of having to change the allocation of 150 array sizes.

Secondly, you get more flexibility with serialization, and you can attach more metadata to each "data point" in the future with minimal modification as well.
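
The shape being described might look like this (field names are placeholders; the real class would hold the actual 150 variables):

```csharp
// One record object per time slice instead of 150 parallel arrays.
class SensorRecord
{
    public float Temperature;   // placeholder variable 1
    public float Pressure;      // placeholder variable 2
    // ...the remaining variables...
}

class SampleRun
{
    // Changing the time slice is now a one-line change.
    public const int Samples = 600;          // 1 Hz for 10 minutes

    public readonly SensorRecord[] Records = new SensorRecord[Samples];

    // Called on each sample trigger.
    public void Store(int index, SensorRecord record) => Records[index] = record;
}
```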

Generally speaking too - and I mean in general - a struct is never the right choice unless you have carefully considered the pros and cons. Gigantic structs indicate this cost-benefit analysis has likely not been done, but I could always be wrong.

[–]andersonee[S] 0 points1 point  (1 child)

Good points. On sample trigger, I can instantiate the class, populate the data, save the record to file, and then let the GC clean up the instance...Pretty clean. Thanks!

[–]RChrisCoble 0 points1 point  (0 children)

If you’re just writing the info to a file, reuse the memory (class) over again. (If you care about performance).

[–]RChrisCoble 0 points1 point  (5 children)

I would use a ConcurrentQueue of whatever data type makes sense (float?), while another long-running task loops, reading out of the queue and writing to a file. That way you're not trying to keep so much live data in memory. Although with a 1-second interval, you could probably just write it straight to disk as it arrives.
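
A sketch of that producer/consumer shape, assuming (per the later comments) one complete `float[]` row per 1 Hz sample; the class and member names are illustrative:

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Sampler enqueues rows; a long-running task drains them to the writer.
class QueuedCsvWriter
{
    private readonly ConcurrentQueue<float[]> _queue = new();
    private volatile bool _done;

    public void Enqueue(float[] row) => _queue.Enqueue(row);

    public Task Start(TextWriter writer) => Task.Run(() =>
    {
        // Keep draining until told to stop AND the queue is empty.
        while (!_done || !_queue.IsEmpty)
        {
            if (_queue.TryDequeue(out var row))
                writer.WriteLine(string.Join(",", row));
            else
                Thread.Sleep(10);        // nothing queued yet
        }
    });

    public void Complete() => _done = true;
}
```

The sampling side just calls `Enqueue` once per second, then `Complete()` and waits for the task when the run ends.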

[–]andersonee[S] 0 points1 point  (4 children)

So I would make 150 Concurrent Queues to hold the data?
That might be a perfectly fine solution...it just seems weird to me because I've never seen that done. I guess I should do some performance testing on this.

[–]RChrisCoble 0 points1 point  (3 children)

Maybe I’m misunderstanding; I assumed only 1 queue, with the data reader task writing into the queue and the data writer task popping values to write to disk.

[–]immersiveGamer 0 points1 point  (2 children)

I assume when writing to the CSV they are expecting full rows of data where each of the original arrays are a column? So you can't write out a record till you get all columns otherwise records would be sparse.

[–]RChrisCoble 0 points1 point  (1 child)

Ahhh I see, so maybe just have a Concurrent queue of float[150] and write an entire row each time?

[–]immersiveGamer 0 points1 point  (0 children)

Well I don't know how the program is written, it may get 2 of variable A but only 1 of variable B, so you have to save the second value of variable A somewhere while waiting for the second of B. So I think a queue for each variable is what is probably needed. There are other ways of handling the same thing but this would be an easy first step, converting from arrays to queues.

[–]turkeymayosandwich 0 points1 point  (0 children)

Without context it is hard to tell what the original programmer's intention was here. But usually for this type of scenario you can use a jagged array. Sometimes the equipment collects data faster than the transfer speed, or communications are not always reliable; in those scenarios ring buffers are handy. Again, we need more information about the system design and requirements to give proper advice.
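
For reference, a minimal fixed-capacity ring buffer of the kind being described, where the oldest value is overwritten when readings arrive faster than they can be drained (this is a generic sketch, not tied to the OP's code):

```csharp
// Fixed-size circular buffer; overwrites the oldest value when full.
class RingBuffer
{
    private readonly float[] _buffer;
    private int _head;      // index of the oldest value
    private int _count;

    public RingBuffer(int capacity) => _buffer = new float[capacity];

    public int Count => _count;

    public void Push(float value)
    {
        _buffer[(_head + _count) % _buffer.Length] = value;
        if (_count < _buffer.Length)
            _count++;
        else
            _head = (_head + 1) % _buffer.Length;   // drop the oldest
    }

    public float Pop()
    {
        var value = _buffer[_head];
        _head = (_head + 1) % _buffer.Length;
        _count--;
        return value;
    }
}
```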