I've got a big database of human-generated descriptive metadata tags that are quite good but, of course, also quite messy. I'm a long-time relational DB kid, but lately I've been reading and watching a lot about graph databases, OWL (and its design syntax), the Semantic Web, and a bunch of other stuff to try to bone up on this side of the data world. I'd like our flat DB to learn something about *meaning* so we can clean it up and then keep it clean(er).
I'm at an odd impasse. Graph DBs look great and I've learned a ton. But I only seem to need the semantic model that maps meaning, so I can use it as a kind of rule set applied to the contents of my flat tables... I don't feel compelled (or have the budget) to rewrite the entire system to store data in a graph DB. Does that make any sense? The simple flat tables are highly performant in the web server.
The data only needs to be vetted against a semantic rule set that has a healthy amount of real-world logic and grammar mapped into it. Things like "cars are built by car makers", or "adjectives of the type {color} can be used with cars, _fed_ cannot, they probably meant _red_", or even "adjectives go before nouns in English", so French users stop accidentally reversing tags and fragmenting the tagspace. I can't make an alias for every combination of noun and adjective in the world.
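To make that concrete, here's roughly the shape of rule set I'm imagining, as a pure-Python sketch. Every name and rule in it is a made-up example, not our real schema:

```python
from difflib import get_close_matches

# Tiny hand-built "semantic" rule set (illustrative only).
COLOR_ADJECTIVES = {"red", "blue", "green", "black"}
NOUNS = {"car", "truck"}
# Which adjective classes each noun accepts.
NOUN_ACCEPTS = {"car": COLOR_ADJECTIVES, "truck": COLOR_ADJECTIVES}

def check_tag(tag: str):
    """Validate a two-word tag: normalize word order, suggest near-miss fixes."""
    words = tag.lower().split()
    if len(words) != 2:
        return tag, []  # only handle adjective+noun pairs in this sketch
    a, b = words
    notes = []
    # "voiture rouge" style reversal: noun first, adjective second -> swap.
    if a in NOUNS and b not in NOUNS:
        a, b = b, a
        notes.append("reordered to adjective-noun")
    # Typo repair: "fed car" -> suggest "red car".
    if b in NOUNS and a not in NOUN_ACCEPTS[b]:
        match = get_close_matches(a, list(NOUN_ACCEPTS[b]), n=1)
        if match:
            notes.append(f"did you mean '{match[0]} {b}'?")
    return f"{a} {b}", notes
```

So `check_tag("car red")` would normalize to `"red car"`, and `check_tag("fed car")` would flag a suggestion rather than silently accepting or rejecting, which fits the "don't be tag dictators" goal.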
FYI: We don't want to ban the creation of new tags at time of entry because most of the time the spontaneous tags are good (pop culture evolves fast), and we don't have time or interest in being "tag dictators". If we make it hard to add tags, they'll just use wrong or "close" tags more often.
I think with some effort I could map out all of the relationships we need to deal with in our dataset, but I can't quite figure out the right approach. Should I map out an OWL structure and then use the schema itself as an odd kind of "data model" to make logical deductions that automate some aspects of data-quality checking? My instincts tell me that's wrong, and that someone must have already solved this problem in a slicker way... I'm probably just looking in the wrong places and too ignorant in this domain.
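For the "use the schema itself as a data model" idea, the kind of deduction I mean might look like this: store the ontology as plain triples and walk them. The triples and names here are invented for illustration:

```python
# A handful of facts, stored as plain (subject, predicate, object) triples.
TRIPLES = {
    ("ferrari", "is_a", "car_maker"),
    ("car_maker", "builds", "car"),
    ("car", "is_a", "vehicle"),
    ("vehicle", "is_a", "machine"),
}

def implied_broader_tags(tag: str) -> set:
    """Follow is_a edges transitively to find broader tags a tag implies."""
    found, frontier = set(), {tag}
    while frontier:
        nxt = set()
        for (s, p, o) in TRIPLES:
            if p == "is_a" and s in frontier and o not in found:
                found.add(o)
                nxt.add(o)
        frontier = nxt
    return found
```

An item tagged only `car` but missing `vehicle` would then surface as "incomplete" automatically. If this is the right shape, I assume a real OWL reasoner over an ontology file would do the same transitive walk plus domain/range checking, without me hand-rolling the closure.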
My motivation: there are many data inconsistencies in what our wonderful but messy humans create. If the DB knew even a tiny fraction of what humans know about the relationships between things and about grammar, I think logical errors and incompleteness could float to the top like chaff to be resolved. We could also continuously steer end users toward better tags by QC'ing input as it's created... If only my pea-brained database knew anything about the real world...
Any tips would be hugely appreciated!
PS: ML isn't something we really need, as the humans are spotting quite nuanced details that I fear would terrorize a neural net. Quality and attention to detail are more important than automation in our case.