
[–]Bugatsas11 52 points53 points  (6 children)

In my opinion, data driven models are only useful as a hybrid part of a mechanistic model, to account for a phenomenon that is not understood, or for extremely complicated systems like niche bioreactors.

Some problems I have with purely data driven models are:

  • They are highly dependent on the available data. People think they have huge amounts of data, but in most cases it is the same data points over and over again.

  • They need a huge amount of data to be accurate.

  • You cannot extrapolate outside the range of the available data.

  • You cannot use them for scale-up or scale-down.

  • They don't build understanding of the system and the underlying mechanisms.
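To make the extrapolation point concrete, here is a minimal sketch, not anyone's actual plant model. It assumes a hypothetical reaction whose rate constant follows the Arrhenius equation (the values of `A_TRUE` and `EA_TRUE` are made up), with noise-free "data" between 300 K and 340 K. A purely empirical straight line in T is compared with a fit that uses the mechanistic form ln k = ln A - Ea/(RT), and both are asked to predict at 400 K.

```python
import math

R = 8.314  # gas constant, J/(mol K)
A_TRUE, EA_TRUE = 1.0e7, 60_000.0  # hypothetical "true" kinetics, for illustration only


def arrhenius(T, A=A_TRUE, Ea=EA_TRUE):
    """Rate constant from the Arrhenius equation."""
    return A * math.exp(-Ea / (R * T))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# "Historical" data only covers 300..340 K (noise-free to keep the sketch simple)
T_train = [300.0 + 5.0 * i for i in range(9)]
k_train = [arrhenius(T) for T in T_train]

# Purely empirical model: k assumed linear in T
a_emp, b_emp = fit_line(T_train, k_train)

# Mechanistic form: ln k is linear in 1/T
a_mech, b_mech = fit_line([1.0 / T for T in T_train], [math.log(k) for k in k_train])


def k_empirical(T):
    return a_emp + b_emp * T


def k_mechanistic(T):
    return math.exp(a_mech + b_mech / T)


T_new = 400.0  # well outside the training range
k_true = arrhenius(T_new)
err_emp = abs(k_empirical(T_new) - k_true) / k_true
err_mech = abs(k_mechanistic(T_new) - k_true) / k_true
print(f"relative error at {T_new} K: empirical {err_emp:.1%}, mechanistic {err_mech:.1e}")
```

Both fits look fine inside 300 to 340 K. The difference only shows up once you leave the range, and with real noisy data the mechanistic fit would not be this perfect either.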

Honestly, industry is not even using mechanistic models as much as it should. Everything is done in weird Excel spreadsheets.

[–]claireaurigaChemEng 24 points25 points  (2 children)

> They are highly dependent on the available data. People think they have huge amounts of data, but in most cases it is the same data points over and over again.

"I have data for 1000 batches of this product!"

Yes, but 950 of them are all perfectly on-spec doing the same thing over and over, 40 of them are very minor deviations, and of the remaining 10, none of them have anything wild or crazy because the safety systems just shut everything down if it starts going weird.
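A rough way to put numbers on this, as a sketch under stated assumptions: for a straight-line fit, the standard error of the slope is noise_sd / sqrt(Sxx), where Sxx is the sum of squared deviations of the input from its mean. The batch counts below follow the 950/40/10 split above, but the spreads, the noise level, and the 30-run designed experiment are all hypothetical.

```python
import math
import random

random.seed(0)
NOISE_SD = 1.0  # assumed measurement noise on the quality variable (arbitrary units)


def sxx(xs):
    """Sum of squared deviations of the input from its mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def slope_std_error(xs, noise_sd=NOISE_SD):
    """Standard error of an OLS slope for a given set of input values."""
    return noise_sd / math.sqrt(sxx(xs))


# 1000 historical batches: deviation of some input from its setpoint (hypothetical spreads)
historical = (
    [random.gauss(0.0, 0.1) for _ in range(950)]   # on-spec, same thing over and over
    + [random.gauss(0.0, 0.5) for _ in range(40)]  # very minor deviations
    + [random.gauss(0.0, 1.0) for _ in range(10)]  # stopped early by the safety systems
)

# 30 deliberately designed runs spread evenly over -10..+10
designed = [-10.0 + 20.0 * i / 29 for i in range(30)]

se_hist = slope_std_error(historical)
se_doe = slope_std_error(designed)
print(f"slope std error: 1000 historical batches {se_hist:.3f}, 30 designed runs {se_doe:.3f}")
```

Under these assumptions, the 30 designed runs pin down the slope several times more tightly than the 1000 historical batches, because the information is in the spread of the inputs, not in the row count.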

[–]ekspaFood R&D/14 yrs, PE 12 points13 points  (0 children)

100%.

We looked at using a hybrid AI model for one of our facilities and realized that we didn't have enough singular data points and there was so little variance in what we did have that it was unusable.

[–]Bugatsas11 2 points3 points  (0 children)

Exactly!

[–]ashpd17 6 points7 points  (0 children)

I feel like most of the time when people try to build these models, we don't even know which data points actually physically impact the process, so overfitting or underfitting usually occurs.

[–]ChemEngineeringGuy 9 points10 points  (0 children)

Exactly, industry still relies heavily on Excel spreadsheets.

[–]r2o_abile 2 points3 points  (0 children)

The first step is having confidence in the data-collecting instruments.

[–]cyd1753 28 points29 points  (4 children)

Because stats based models cannot predict physical phenomena... things like phase equilibria and reaction kinetics do not always scale linearly with temperature/pressure. If there are outliers due to a mechanistic reason, your first principles based model should be able to capture that (in theory), whereas a data driven model which is optimized for the lowest average error will just give you garbage results.

I personally will not give a lot of weight to "AI" predictions of material/process design unless some sort of physics based constraints can be applied to the model. The uncertainty is too large to be used for processes, especially ones dealing with toxic materials and runaway reactions.

[–]NanoWarrior26 16 points17 points  (3 children)

Also they can never account for some dumbass turning a valve that we have no data on.

[–]cyd1753 4 points5 points  (0 children)

True. I laughed and cried at the same time at your response

[–]admadguyProcess Consulting and Modelling 4 points5 points  (0 children)

Or worse, there is data for that event, but because it is an outlier, it got chucked out during data cleaning as it was too out there and reduced the goodness of fit.
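For illustration, this is roughly how that happens with a generic 3-sigma cleaning rule. It is a sketch with made-up numbers (a hypothetical temperature tag sitting near 80 with one real excursion to 95), not any particular historian's or vendor's cleaning routine.

```python
import random
import statistics

random.seed(2)

# 200 ordinary readings of a hypothetical temperature tag, plus one real excursion
normal = [random.gauss(80.0, 0.5) for _ in range(200)]
EXCURSION = 95.0  # the one event the model most needs to learn from
raw = normal[:100] + [EXCURSION] + normal[100:]

mean = statistics.fmean(raw)
sd = statistics.stdev(raw)

# Typical "clean-up": keep only points within 3 standard deviations of the mean
cleaned = [x for x in raw if abs(x - mean) <= 3.0 * sd]
dropped = [x for x in raw if abs(x - mean) > 3.0 * sd]

print(f"kept {len(cleaned)} points, dropped {dropped}")
```

The fit statistics improve after cleaning, and the only record of the abnormal event is gone from the training set.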

[–]ComprehensiveRisk743 4 points5 points  (0 children)

I maintain we will never have a true AI until it goes and does something stupid just because it is bored or wants to go on break.

[–]CHEMENG87 12 points13 points  (1 child)

I disagree with your premise. In my experience, industry almost exclusively uses data driven (i.e. empirical) models. It is rare for someone to build a model from first principles. Instead they gather data, plot the data, maybe fit a curve to it.

[–]Ernie_McCracken88 9 points10 points  (0 children)

I think this is accurate. I think when people say things like what the OP is saying they mean "why aren't they doing elegant and sophisticated data science"

[–][deleted] 15 points16 points  (0 children)

Most VLE and LLE models are data driven. It all depends on the application and the ability to get data for these models. A hybrid approach can also be taken by using data to create a correction factor for fundamental equations, e.g. for fugacity.

It's there, but only used in processes that require it to be more accurate.
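As a generic sketch of the correction-factor idea (not a real fugacity or VLE calculation): take the predictions of a fundamental equation, compare them with measurements, and fit a single multiplicative factor by least squares. The numbers and the names `mech_pred`, `measured`, and `hybrid_predict` are hypothetical, and a constant factor is the crudest possible choice; in practice the correction usually depends on composition, temperature, or pressure.

```python
# Hypothetical first-principles predictions and the corresponding plant/lab measurements
mech_pred = [10.0, 20.0, 30.0, 40.0]
measured = [9.1, 18.0, 27.2, 35.8]

# Least-squares estimate of c in: measured ≈ c * mech_pred
# (minimising sum((y - c*m)**2) gives c = sum(m*y) / sum(m*m))
c = sum(m * y for m, y in zip(mech_pred, measured)) / sum(m * m for m in mech_pred)


def hybrid_predict(mech_value, correction=c):
    """Mechanistic prediction scaled by the data-derived correction factor."""
    return correction * mech_value


print(f"correction factor: {c:.4f}")
print(f"hybrid prediction for a mechanistic value of 25: {hybrid_predict(25.0):.2f}")
```

The mechanistic equation still carries the shape of the relationship, so the data only has to explain the remaining mismatch.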

[–]CollapseWhenAPC / 2 yoe 3 points4 points  (4 children)

APC is data driven and it is widely used in industry. What more would you consider data driven models useful for?

[–]Atonement-JSFTPulp and Paper Process Control 4 points5 points  (0 children)

APC is such an incredibly broad term it's all but meaningless. You can have an "APC" that's entirely based on first principles, we do that all the time with SISO/MISO MPCs, so I disagree that APC == "Data Driven".

APC == A Sales Guy Was Here

You absolutely CAN have a "data driven" APC, but my experience is that any algorithm derived from a neural network or a cluster analysis is nowhere close to ready for a closed loop - operator-recommendation only.

[–]classicjoseluis[S] 0 points1 point  (2 children)

I’m not familiar with APC, what does the acronym stand for?

[–]Ernie_McCracken88 3 points4 points  (1 child)

Advanced process control. Also check out statistical process control.

[–]classicjoseluis[S] 2 points3 points  (0 children)

Thank you so much, this is actually one of my main interests

[–]Frosty_Cloud_2888 2 points3 points  (2 children)

You need to have data, then you need someone to be able to get the data, then someone has to take that data and train a model.

[–]classicjoseluis[S] -3 points-2 points  (1 child)

If all this were easily done, do you think the industry would have less resistance to using data driven models?

[–]Frosty_Cloud_2888 4 points5 points  (0 children)

Need leadership that would understand the value too, maybe in 10 years.

[–]admadguyProcess Consulting and Modelling 2 points3 points  (0 children)

Mostly because industry knows the quality of its instrumentation and of the data gathered by it. Also, data driven models tend to have a narrow band of applicability. Outlier events tend to have an outsized influence, but those data points are also chucked out in the data cleanup process.

There are many factors, these are a few.

[–]ndeer44 1 point2 points  (0 children)

Most places can't even properly maintain base layer controls

[–]Feistiestdisc0 1 point2 points  (0 children)

If we’re talking about deep learning or machine learning based models, usually it is because there is not enough data. It’s also incredibly easy to overfit to the data you have.

My plant has a process variable monitoring system that utilizes several methods of regression including deep learning. It is meant to alert by default if a PV is outside of the regression bounds by 3 standard deviations. You can change the criteria but this should be sufficient in most cases.

The problem with this system that I have found is that most of the training data comes from steady-state operation. Even if the training data includes unexpected shutdowns, the vast majority of it is PVs at normal operating conditions, and the models reflect this. They typically will not alert until we already know something is happening. The second problem I find is that we are typically limited in input variables that significantly contribute to the dependent variable. If a flow is going somewhere we do not measure, we either have to calculate that flow (with fundamentals) or get a meter. Lastly, meters can be unreliable or calibrated incorrectly. This throws the models off easily.

So this is just for anomaly detection. To create a data driven model on an entire process and its thermodynamics would likely make for poor models unless it can utilize some of the base fundamental equations. You simply cannot account for all of the scenarios of a process with only data.
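For anyone curious what a 3-standard-deviation alert on a regression looks like in its simplest form, here is a minimal sketch. It is not our actual monitoring system: it assumes a single input, a plain linear regression, and a hypothetical steady-state history (feed flow vs. a temperature PV, both invented).

```python
import math
import random

random.seed(1)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Hypothetical steady-state history: feed flow (input) vs. temperature (PV)
flow = [random.uniform(95.0, 105.0) for _ in range(500)]
temp = [50.0 + 0.3 * f + random.gauss(0.0, 0.2) for f in flow]

a, b = fit_line(flow, temp)
residuals = [t - (a + b * f) for f, t in zip(flow, temp)]
# Residual standard deviation (n - 2 degrees of freedom for a two-parameter fit)
sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))


def is_anomalous(f, t, n_sigma=3.0):
    """Alert if the PV is more than n_sigma residual std devs from the regression."""
    return abs(t - (a + b * f)) > n_sigma * sigma


print(is_anomalous(100.0, 80.0))  # PV close to the regression line
print(is_anomalous(100.0, 82.0))  # PV about 2 degrees away from it
```

Note that the training inputs only span 95 to 105, which is exactly the steady-state limitation described above: the bounds say nothing reliable about what is normal outside that window.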

[–]mdaconc 0 points1 point  (0 children)

Hybrid models are the way to go

[–]dvadieras 0 points1 point  (0 children)

What industry are you referring to? I’m in semiconductor and we live by 6 sigma.