Thursday, 14 May 2015

Semantic Web for the Working Ontologist: chapter 6

This week's chapter (it's nice and short: or, at least, short) is called "RDF and inferencing".

It covers the way that data modelers can ensure that, when someone searches the web overall – or a specific site – the results include the relevant examples of the thing(s) they're searching for, even when these haven't been mentioned, by name, in the search. The example the authors use here is a search for "shirts" that should return results including "Henleys" (which, apparently, are a type of shirt. Who knew?).

What's particularly significant about the Semantic Web approach is that it enables a data modeler to create/define data so that "the data can describe something about the way they should be used". But, more than that, "given some stated information, we can determine other, related information that we can also consider as if it had been stated".

This is inference: the ability to model data in such ways that we are "able to add relationships into the data that constrain how the data is used".

?Thing1 rdfs:subClassOf ?Thing2 .
?x rdf:type ?Thing1 .
?x rdf:type ?Thing2 .

In other words (or to be more accurate, in words), if Thing1 is a subset of Thing2, and x is an example of Thing1, then x must also be an example of Thing2.

We won't need to specify anywhere that x is an example of Thing2. The query engine will infer that itself.

This is not a remotely helpful illustration, as Thing 1 is clearly not a subclass (or subset) of Thing 2, but it is what the inference quoted above made me think of…
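To convince myself that the subclass rule quoted above really does what it says, here is a minimal Python sketch of it. This is my own illustration, not anything from the book: the triples are plain tuples, the ex: names (including ex:ChamoisHenley) are made up for the example, and the function name infer_types is hypothetical.

```python
# Triples are plain (subject, predicate, object) tuples.
# All ex: names below are invented for illustration.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(asserted):
    """Return the asserted triples plus every rdf:type triple the rule implies."""
    triples = set(asserted)
    changed = True
    while changed:  # keep going until a full pass adds nothing new
        changed = False
        for (thing1, p1, thing2) in list(triples):
            if p1 != SUBCLASS_OF:
                continue
            for (x, p2, cls) in list(triples):
                # ?x rdf:type ?Thing1  =>  ?x rdf:type ?Thing2
                if p2 == RDF_TYPE and cls == thing1:
                    new = (x, RDF_TYPE, thing2)
                    if new not in triples:
                        triples.add(new)
                        changed = True
    return triples


asserted = {
    ("ex:Henleys", SUBCLASS_OF, "ex:Shirts"),
    ("ex:ChamoisHenley", RDF_TYPE, "ex:Henleys"),
}
inferred = infer_types(asserted) - asserted
print(inferred)
```

Nobody ever stated that ex:ChamoisHenley is a shirt, but the one triple left in inferred says exactly that.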

Anyway, this is one of the uses of CONSTRUCT (see last week's blog) in SPARQL, which "provides a precise and compact way to express inference rules". All of this means that SPARQL can form "the basis for an inference language", such as SPIN (SPARQL Inferencing Notation), which, according to its web page, "has become the de-facto industry standard to represent SPARQL rules and constraints on Semantic Web models".
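As a rough picture of how a CONSTRUCT query can act as an inference rule, the sketch below imitates the WHERE-then-CONSTRUCT shape in plain Python. It is a toy under stated assumptions, not a SPARQL engine: the helper names match and construct are hypothetical, the ex: data is invented, and the SPARQL in the comment is only there for comparison (prefix declarations omitted) and is never executed.

```python
# The subclass rule, written as a SPARQL CONSTRUCT query (shown for comparison only):
#
#   CONSTRUCT { ?x rdf:type ?Thing2 . }
#   WHERE     { ?Thing1 rdfs:subClassOf ?Thing2 .
#               ?x rdf:type ?Thing1 . }


def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def construct(template, where, triples):
    """Imitate CONSTRUCT { template } WHERE { where } over a set of triples."""
    solutions = [{}]
    for pattern in where:
        extended = []
        for bindings in solutions:
            for triple in triples:
                new_bindings = match(pattern, triple, bindings)
                if new_bindings is not None:
                    extended.append(new_bindings)
        solutions = extended
    # Fill the template in once per solution; constants pass through unchanged.
    return {tuple(b.get(term, term) for term in template) for b in solutions}


data = {
    ("ex:Henleys", "rdfs:subClassOf", "ex:Shirts"),
    ("ex:ChamoisHenley", "rdf:type", "ex:Henleys"),
}
where = [("?Thing1", "rdfs:subClassOf", "?Thing2"), ("?x", "rdf:type", "?Thing1")]
template = ("?x", "rdf:type", "?Thing2")

result = construct(template, where, data)
print(result)
```

The point of the shape is that the rule itself is just data (a template plus some patterns), which is roughly why a query language can double as a rule language.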

Overall, "the strategy of basing the meaning of our terms on inferencing, provides a robust solution to understanding the meaning of novel combinations of terms". In practice, it means that any "deployment architecture" will require not merely the functionality of a Query Engine, but of something that functions as an Inference and Query Engine. Which, in other words, will work both with "triples that have been asserted" (i.e. those explicitly stated in the data) and with triples that are inferred from those that have been asserted (see above). Incidentally, when these relationships are represented graphically, the convention is to print asserted triples with unbroken lines, and inferred triples with broken ones.

In some instances, the querying and inferencing are both done by the query engine. In other formulations, "the data are preprocessed by an inferencing engine and then queried directly", as sometimes it's "convenient to think about inferencing and queries as separate processes". This means that inferencing can happen at different points in the storing and querying process, depending on the implementation. The decisions around this have implications. When the inferring is done early in the storing and querying process, there are consequences for storage, and choices to be made about which inferred triples to retain and which to discard when data sources change. A "just in time inferencing" approach, where all inferencing happens only in response to queries, "risks duplicating inference work".
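The trade-off between inferring early and inferring just in time can be sketched in a few lines of Python. Everything here is my own assumption for illustration: the class names MaterializedStore and JustInTimeStore, the inference_runs counter, and the ex: data are all hypothetical, and the only rule applied is the subclass one.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer(asserted):
    """Apply the subclass rule until nothing new appears; return only inferred triples."""
    known = set(asserted)
    while True:
        new = {(x, RDF_TYPE, sup)
               for (sub, p, sup) in known if p == SUBCLASS_OF
               for (x, q, cls) in known if q == RDF_TYPE and cls == sub} - known
        if not new:
            return known - set(asserted)
        known |= new


class MaterializedStore:
    """Infers when data is loaded or changed, and stores the inferred triples."""

    def __init__(self, asserted):
        self.asserted = set(asserted)
        self.inferred = infer(self.asserted)  # extra storage for inferred triples
        self.inference_runs = 1

    def add(self, triple):
        self.asserted.add(triple)
        self.inferred = infer(self.asserted)  # stored inferences must be refreshed
        self.inference_runs += 1

    def instances_of(self, cls):
        return {s for (s, p, o) in self.asserted | self.inferred
                if p == RDF_TYPE and o == cls}


class JustInTimeStore:
    """Stores only asserted triples and infers afresh for every query."""

    def __init__(self, asserted):
        self.asserted = set(asserted)
        self.inference_runs = 0

    def add(self, triple):
        self.asserted.add(triple)

    def instances_of(self, cls):
        self.inference_runs += 1  # the same inference work is repeated per query
        return {s for (s, p, o) in self.asserted | infer(self.asserted)
                if p == RDF_TYPE and o == cls}


data = {
    ("ex:Henleys", SUBCLASS_OF, "ex:Shirts"),
    ("ex:ChamoisHenley", RDF_TYPE, "ex:Henleys"),
}
eager, lazy = MaterializedStore(data), JustInTimeStore(data)
for _ in range(3):
    assert eager.instances_of("ex:Shirts") == lazy.instances_of("ex:Shirts")

print(eager.inference_runs, lazy.inference_runs)  # 1 3
```

Both stores give the same answers. The eager one pays in storage and has to redo its inferences whenever the data changes, while the lazy one stores nothing extra but repeats the same work on every query.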

And that is about that for this week. Next week we're moving on to RDF Schema. Which is obviously going to be a challenge, as I'm not sure I even understand the chapter title. Still, I've already made it to page 125, which makes this the longest relationship I've ever sustained with a technical manual…
