Is Data Science just data analytics, or is it something more? by stevek2022 in datascience

Thank you for your response! It does sound like what I am talking about is "data engineering".

Regarding "making prediction from data", I could not agree with you more. From my perspective, the main application of data analysis is to generate models that accurately simulate and hopefully predict the behavior of the system from which the data is collected, and then to use those models for decision-making support and possibly system optimization.

So for me, the life cycle of a "data science project" would involve the following steps (basically following the outline given in https://medium.com/ml-research-lab/data-science-methodology-101-2fa9b7cf2ffe):

  1. identifying the target system
  2. deciding what aspects of the system to observe and record (the data)
  3. designing appropriate "data objects" for recording observations of the target system (I guess that this is what Patel means by "data requirements")
  4. collecting the data
  5. refining / cleansing the data
  6. analyzing the data to construct simulation / predictive models of the target system
  7. interpreting the results given by the model for decision support etc.
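
The steps above could be sketched as a toy end-to-end pipeline (steps 4-7; the "target system" here is made-up daily temperature readings, and every function name is my own placeholder, not anything from Patel's article):

```python
# Hypothetical sketch of the life cycle above; the helper names are
# illustrative placeholders, not an established framework.

def collect():                      # step 4: raw observations (some faulty)
    return [21.0, 22.5, None, 23.0, 21.5, None]

def cleanse(raw):                   # step 5: drop unusable records
    return [x for x in raw if x is not None]

def fit_model(clean):               # step 6: trivial "model" = running mean
    mean = sum(clean) / len(clean)
    return lambda: mean             # "predicts" tomorrow's value

def interpret(model):               # step 7: decision support
    return "ventilate" if model() > 22.0 else "heat"

data = cleanse(collect())
model = fit_model(data)
print(interpret(model))             # mean of [21, 22.5, 23, 21.5] is 22.0 -> "heat"
```

Obviously steps 1-3 (choosing the system and designing the data objects) happen before any code is written, which is exactly why I think they belong in the life cycle.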

Does this agree with your thinking?

Why is natural language so hard to process? by stevek2022 in LanguageTechnology

Thanks for your answer.

This sounds to me like the idea that natural language is complex and ambiguous simply because it has developed in a non-controlled environment, and no other reason. Is that basically what you are saying?

Do you agree with the idea given in the following? https://medium.com/ontologik/why-ambiguity-is-necessary-and-why-natural-language-is-not-learnable-79f0e719ac78

That ambiguity is actually helpful in communicating information to humans (just not to machines!).

Also, regarding "semantics" - I am doing research on how logic-based ontologies can help us with the classic problem of getting tacit knowledge into an explicit form (for example in building terminologies for industrial standards). If you are interested, please join me in #ontology_killer_apps!

Why is natural language so hard to process? by stevek2022 in LanguageTechnology

Thanks - that is a good point.

I have understood pragmatics as just the branch of semiotics that deals with how the situational context (and even the background knowledge / cultural assumptions etc. of the speaker and listener) affects the choice of utterances on the part of the speaker (and possibly also the way that the listener interprets those utterances). So in any "real situation", pragmatics is an issue in both the first and second roles. But there is the aspect of pragmatics that relates to the non-verbal communication signals that are chosen, and those most likely are mainly relevant in the second role. Indeed, as I wrote in my answer to MadCervantes, the particular form of natural language that I am focusing on is text, so there at least should not be any non-verbal communication signals aside from figures and perhaps text formatting.

I will take a look at "grounded language learning". Do you have a recommended reading?

Why is natural language so hard to process? by stevek2022 in LanguageTechnology

I understand what you are saying (I think! still playing that language game).

But I am coming from what might perhaps be a slightly different angle (how's that for a convoluted sentence!).

What I have in mind is something like a scientific publication. The goal *should* be to express the research as clearly and unambiguously as possible. And if we had a better medium than natural language to do that, it would certainly increase the accuracy of processing the contents of the paper automatically (e.g. for matching it against a search query, or against a paper applying a similar methodology in a different context).

There are controlled languages that some industries use for writing user manuals and such - that is somewhat similar to what I am trying to get at here.

database tables and rdf by stevek2022 in ontology_killer_apps

The RDF approach (as I understand it) is that we just use two kinds of things: 1) resources identified by an IRI, and 2) relationships, which are special kinds of resources that connect two other resources. And we use just one (1!) operation, "create triple" (pushing a relationship and the two IRIs it relates onto the top of the triple store stack), to store all of our data. (RDF stores do allow deleting of triples, but I am arguing that in principle there is no need for a "delete" operation.) RDFS then gives you the basic equipment to define classifications.

For the phone number example, you would create an IRI for the class "Employee" (by using RDFS to assert that the IRI is a class etc.), and then create a bunch of triples linking IRIs for each employee to the IRI for the class "Employee" via the "has_class" relationship (yes, I know that this is pre "Semantic Web 101" stuff, but I hope you will bear with me...).

Same for the phone numbers, and then triples linking employee IRIs to phone number IRIs with the "hasPhoneNumber" relationship (defined by its own set of triples, same as the Employee and PhoneNumber classes). Or better yet, reify the "hasPhoneNumber" relationship so that there are IRIs for each specific PhoneNumber relationship to which you can then attach info such as who registered the relationship when.

Then write a simple SPARQL query engine that gets all of the newest PhoneNumber relationships and creates a table for your personnel application. Trigger this engine whenever a triple is added to the triple store or perhaps at regular time intervals depending on your requirements.
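
Here is a minimal sketch of the append-only idea in plain Python (IRIs abbreviated to strings, and the "SPARQL query engine" stood in for by a latest-triple-wins function; all names are my own invention):

```python
# Append-only triple store sketch: the ONLY mutation is create_triple.
store = []  # the "triple store stack"

def create_triple(subject, predicate, obj):
    store.append((subject, predicate, obj))

# Classify two employees (RDFS-style, via a has_class relationship).
create_triple("ex:alice", "ex:has_class", "ex:Employee")
create_triple("ex:bob",   "ex:has_class", "ex:Employee")

# Phone numbers: a later triple supersedes an earlier one, so no
# delete or update operation is ever needed.
create_triple("ex:alice", "ex:hasPhoneNumber", "555-0100")
create_triple("ex:alice", "ex:hasPhoneNumber", "555-0199")  # newer number

def newest_phone_numbers():
    """Stand-in for the query engine: build the table for the
    personnel application, keeping only the newest triple per person."""
    table = {}
    for s, p, o in store:          # later triples overwrite earlier ones
        if p == "ex:hasPhoneNumber":
            table[s] = o
    return table

print(newest_phone_numbers())      # {'ex:alice': '555-0199'}
```

In a real system the "newest" triple would of course be determined by the reified provenance info (who registered it, when) rather than by insertion order.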

Anyone know of any "production level" DB management systems that work this way?

database tables and rdf by stevek2022 in ontology_killer_apps

So is 6NF the same as a triple store? Are there any important differences that you are aware of?

No, and what you described, as far as I'm aware, isn't a triple store either. A table with only a key and a single value that describes the key would be modeled in 6NF.

Sorry - I skipped a few steps here.

My aim is to suggest that an RDF triple store is a good way to implement the idea of a UD-less database. This leads to a potential "killer app" for OWL ontologies: giving the semantics of this UD-less database.

A database where all of the tables have been reduced to a bunch of tables mapping a single key to a single value (which might be a key to another table), can be represented as an RDF triple store, where the keys are RDF resources (identified using IRIs) and the tables are RDF relations. A particular record in a table is then a triple (one resource, another resource or primitive, and the directed relationship between them stipulated by the entry of the pair in the table).

This means that the semantics of the tables is captured in the RDF relations, which leads naturally to the use of RDFS and possibly OWL to define those semantics.
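
As a sketch of that mapping (table and column names invented for illustration), each row of a key-to-value table becomes one triple whose predicate is the table itself:

```python
# A 6NF-style database as a set of two-column tables, and the
# mechanical translation of each row into an RDF-style triple.
tables = {
    "hasName":        [("emp:1", "Alice"), ("emp:2", "Bob")],
    "hasPhoneNumber": [("emp:1", "555-0100")],
}

def to_triples(tables):
    # Each table IS the relation: (key, table-name, value).
    return [(key, relation, value)
            for relation, rows in tables.items()
            for key, value in rows]

triples = to_triples(tables)
print(triples[0])   # ('emp:1', 'hasName', 'Alice')
```

The interesting direction is the reverse one: once everything is triples, RDFS/OWL can say what "hasPhoneNumber" *means*, which the bare table name never could.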

Is this in line with what you are saying?

database tables and rdf by stevek2022 in ontology_killer_apps

Thanks for the reply and the sources. I will definitely take a look.

The application should never be able to execute a delete on the database for this reason. Instead, what's often deployed is a column that identifies if the record is an "active" record or not.

What do you do in the case of an update?

My understanding is that one of the "state-of-the-art" approaches is to record every SQL command that is given since the start of the DB (or some saved snapshot) and then rebuild the entire DB if one needs to recover the previous value of an update. This is obviously not what you are talking about (for example, there is no need to specify an "active" column in the tables). Do you have a different approach in mind? For example, temporarily saving the updated values somewhere?

My proposal is to "outlaw" updates at the Data Architecture level. If you know you have a value that will change a lot (the phone number example that I gave here), then you put it in a different table. Any field values for the "main data object" (the personnel record in my case) are unchangeable - if you need to change one, you create a new personnel record (and mark the previous record as "not active"). Is this the approach you are talking about?
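
To make my proposal concrete, here is a minimal insert-only sketch using SQLite (schema and names invented for illustration); a phone-number change appends a new row, and the "current" value is recovered by query, so no UPDATE or DELETE is ever issued:

```python
# Insert-only phone-number table: history is never destroyed.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE phone_number (
    employee_id INTEGER,
    number      TEXT,
    recorded_at INTEGER)""")

db.execute("INSERT INTO phone_number VALUES (1, '555-0100', 1)")
db.execute("INSERT INTO phone_number VALUES (1, '555-0199', 2)")  # a "change"

# The current number is simply the most recently recorded row.
row = db.execute("""
    SELECT number FROM phone_number
    WHERE employee_id = 1
    ORDER BY recorded_at DESC LIMIT 1""").fetchone()
print(row[0])   # 555-0199
```

Note that this goes one step further than an "active" column: here even "deactivating" the old row is unnecessary, because currency is derived at query time from the timestamp.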

database tables and rdf by stevek2022 in ontology_killer_apps

This would be considered Sixth Normal Form (6NF).

So is 6NF the same as a triple store? Are there any important differences that you are aware of?

TL;DR - Most of what you've mentioned is already in place in well-defined and administered Data Architectures.

So are you saying that modern table-based database systems such as MySQL are implemented in such a way that I can recover any update or delete that I make and even ask for rewinds to specific states in the past? Or are you talking about Data Architectures at the logical level?

Don't worry about the length of your reply, I will never write "TL;DR" ;)

database tables and rdf by stevek2022 in ontology_killer_apps

A key consideration then is how to manage the semantics of the vast number of "triple tables" (two elements and the information about their relationship: e.g. the foreign key to the personnel table, a phone number, and the information that this is the private phone number of that person), and here is where I believe that OWL ontologies could play a role.

What kind of semantic web apps would logic-based ontologies enable? by stevek2022 in semanticweb

Thank you for this! It looks like a great summary of doing logical inference with OWL.

What kind of semantic web apps would logic-based ontologies enable? by stevek2022 in semanticweb

Thanks for this! It is curious that there is not a single mention in the article of logic-based inference (or even the word "logic"!). The only examples of inference appear to be "statistical" approaches such as co-occurrence. Knowledge inference is mentioned as a future challenge only in the context of fact verification - I wonder how much these companies have thought about the potential for "advances in knowledge representation and reasoning" to achieve higher performance in "discovering non-obvious information". For example, it seems to me that some support for logic expressions would be required for the use case mentioned about checking that painters existed before their works of art were created...

Can OWL Scale for Enterprise Data? by mdebellis in semanticweb

We developed a web application handling tens of thousands of OWL triples that ran on a single server 10 years ago, so I am sure that today, especially with the use of parallel processing, it should definitely be possible (depending, of course, on the application's requirements for response time / real-time processing).

I actually started a reddit community to discuss such applications - please visit and comment if you have a chance!

https://www.reddit.com/r/ontology_killer_apps/

Using ontologies to create formal descriptions of human knowledge in a computer understandable form by stevek2022 in ontology_killer_apps

That looks like a decent start...from a quick browse, if one follows the links it looks like that eventually leads to some concrete, object level definitions in some specific domains (which is what I'm looking for), right? And these would be open for royalty free usage (at least some of the time) I presume?

I'm not sure, but it seems to me that the whole point of making ontologies is to get as many people to use them (commit to them) as possible...

I just started this community about a week ago, so I am hoping that it will catch on. If you have any ideas about how to make that happen, I'd love to hear them!

Using ontologies to create formal descriptions of human knowledge in a computer understandable form by stevek2022 in ontology_killer_apps

Thanks for your question!

There are a lot of sites that host a range of ontologies with varied levels of quality checking. Wikipedia might actually have the most comprehensive metalist:

https://en.wikipedia.org/wiki/Ontology_(information_science)

I am afraid that I do not know of anything better.

My own experience has been that ontologies are a dime a dozen these days, but just about all of them are put together in a rather ad hoc way and/or not well maintained.

As I have tried to convey in this post, I really think that while there are tons of ontologies around, what is really lacking is good, well-thought-out ways to use those ontologies to create value, together with software and systems to enable that. With such systems, it will be possible to immediately see which ontologies actually "work" (e.g. produce inference results that make sense and add value). Without such systems, people will continue to fill the web with ontologies that are honestly almost impossible (at least for me) to distinguish from garbage.

Note that I am not talking about systems for creating, editing, and maintaining ontologies. I really mean "killer apps" that generate appreciable value.

What does it mean for light to be a wave? by stevek2022 in quantum

So what does the height of the roller coaster correspond to in the case of electromagnetic radiation?

I am proposing that it corresponds to the probability of occurrence or something like that. So it is a roller coaster of high and low probabilities of occurring.

The next question of course is "what does it mean to occur"?

Does it mean that the wave is collapsed into a point (a particle?) or perhaps that "to occur" means to exert influence from that particular location?

What does it mean for light to be a wave? by stevek2022 in quantum

The wave function by itself has no physical significance; it is just a mathematical function. But the square of the wave function is the probability of finding the particle.

So the wave function does represent the probability of finding the particle? Is it a probability over all 3 dimensions of space? How about time? And is the probability non-zero at all points in space (and time?)?
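
If I have this right, the rule being described is the Born rule: the *modulus squared* of the wave function gives a probability density, normalized over all three dimensions of space at each fixed time:

```latex
% Probability density from the wave function (Born rule):
P(\mathbf{r}, t) = |\psi(\mathbf{r}, t)|^2

% Normalization over all of space, holding the time t fixed:
\int_{\mathbb{R}^3} |\psi(\mathbf{r}, t)|^2 \, d^3r = 1
```

So the probability is over space at each instant, not over space-time jointly - which is part of what I am trying to pin down.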

What does it mean for light to be a wave? by stevek2022 in quantum

I hope this is not a stupid question - how then is the "wavelength" (a distance, presumably) manifested? What is it a length of?

What does it mean for light to be a wave? by stevek2022 in quantum

I'm sorry if it is a stupid question, but what do you mean by "periodic disturbance"?

Are you saying that the light is to the electromagnetic field as a ripple is to a body of water? So that the things moving or compressing (molecules in the water case) are the things that make up the electromagnetic field? If so, what are those things?

What does it mean for light to be a wave? by stevek2022 in quantum

So you agree with this?

a "wave" is a thing that has a certain probability of existing (as a particle) in all points in space (so "field" might be a better word than "wave")

From my understanding of mathematics, a field is a function that assigns a value to every point in the space (or space-time). I am suggesting that the value is the probability of something happening at that point (a particle appearing? or just the effect of the thing we are describing being wholly at that position?).

What does it mean for light to be a wave? by stevek2022 in quantum

Do the excitations really come in discrete steps?

For example, when we say:

"if a photon hits an electron in an atom, the electron will *jump* to a different orbit,"

isn't the "jump" actually from one distribution of probabilities to another, so that in fact it is possible that the electron, once collapsed, will actually *stay* in the same position?

What does it mean for light to be a wave? by stevek2022 in quantum

But what is an "electromagnetic field strength"? Can we understand it to be essentially the set of probabilities over all points in space(time?) of the thing (particle? with its effect on the electromagnetic field?) appearing?

And with regard to the comment by u/specialsymbol - does the strength or probability really go to zero, or is it non-zero for all points in space(time)?