Problems with combining several data sets

The problem

If you look at the data that is available on the Internet today, you will come across the following things:

  • Every data set seems to use a different format, and when we say a lot of formats, we really mean a lot!
  • Several data sets provide the same type of information without joining forces.
  • Everyone uses their own vocabulary.

The point here is simple: there is no standard way to structure your data set so that it can interact with other data sets!

How we tried to tackle this problem

We tackled this problem using Linked Open Data, a standard developed with a simple purpose: structure your data set and link it with other data sets. The Linked Open Data approach lets us publish our data in a format like JSON-LD and automatically transform it into other Linked Open Data formats. Of course, this is only possible if we define our vocabulary first! To unify vocabularies, companies and communities have created several websites where they publish their standardized vocabularies.
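As a small sketch of what such a structured record looks like, the hypothetical snippet below builds a JSON-LD document with Python's standard json module. The schema.org vocabulary terms it maps to are real; the record itself and its identifier are made-up examples.

```python
import json

# A minimal JSON-LD document: the @context maps our local field names
# to terms from the schema.org vocabulary, so any consumer that
# understands schema.org can interpret the record.
record = {
    "@context": {
        "name": "http://schema.org/name",
        "address": "http://schema.org/address",
    },
    "@id": "http://example.org/buildings/1",  # hypothetical identifier
    "name": "Town Hall",
    "address": "Main Street 1, Brussels",
}

print(json.dumps(record, indent=2))
```

Because the meaning of each field lives in the @context rather than in the field names themselves, the same document can be re-serialized into other Linked Open Data formats without losing that meaning.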

On these websites you can search for the vocabulary you need; they show you which vocabulary you may use for your data sets and when each vocabulary is applicable.

For example, here is the building address described by the Flemish OSLO vocabulary (I summarized the text in English; the full description is available in Dutch in the vocabulary itself):

URI: http://data.vlaanderen.be/ns/adres#Adres
Specialisation of: http://www.w3.org/2000/01/rdf-schema#Resource
Definition: Information that allows us to uniquely identify the street, number, … of a building.
Usage: Belgian address of a building

In the table above you can see that the vocabulary is strictly defined: you know exactly when to use it, what it means, and so on. This fixes the mixed-vocabulary problem and lets us easily combine several data sets! Data sets can extend one another, and matches between them surface without writing complex code to map the data sets or check for duplicates.
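To illustrate why no complex mapping code is needed, the hypothetical sketch below merges two made-up data sets that both use the OSLO adres#Adres term from the table and shared URIs as identifiers; the example URIs and field names are assumptions, not part of the vocabulary.

```python
# Two hypothetical data sets that identify records by the same URIs
# and type them with the OSLO adres#Adres vocabulary term.
ADRES = "http://data.vlaanderen.be/ns/adres#Adres"

dataset_a = [
    {"@id": "http://example.org/addr/1", "@type": ADRES,
     "street": "Kerkstraat", "number": "1"},
]
dataset_b = [
    {"@id": "http://example.org/addr/1", "@type": ADRES,
     "municipality": "Gent"},  # same address, extra information
    {"@id": "http://example.org/addr/2", "@type": ADRES,
     "street": "Stationsstraat", "number": "5"},
]

# Because both sets share identifiers and vocabulary, merging is just
# a dictionary union keyed on @id: duplicates collapse automatically
# and records extend each other with their extra fields.
merged = {}
for record in dataset_a + dataset_b:
    merged.setdefault(record["@id"], {}).update(record)

print(len(merged))  # → 2 unique addresses
```

The record for addr/1 now carries the street and number from the first data set and the municipality from the second, without any custom per-data-set mapping.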

Written on July 11, 2018