DBpedia Extraction

Introduction

DBpedia was born in 2007 from a collaboration between the University of Mannheim, the Free University of Berlin, and the software company OpenLink. The goal of the project was to make Wikipedia content queryable like a database through Semantic Web standards. Originally built to extract the English Wikipedia, it has since been extended to numerous languages; DBpedia Fr is one of many examples of this.


From Wikipedia to DBpedia

Wikipedia articles are structured pages: an article is first identified by a title, and it can contain links to homonyms, links to other language versions, and geo-coordinates.

Besides that, the body of an article generally contains an abstract, a table of contents, and links. You may also have noticed two types of special inserts: one dedicated to categorization, and the other called the infobox. This regularity of structure is the foundation of DBpedia. The Paris Wikipedia article serves as the running example.


The wikicode

Behind each article lies markup called wikicode. Every Wikimedia Foundation project uses this markup system. To learn more about its syntax, visit the dedicated Wiki Help page. This language controls the article layout and can also load data via Lua scripts.

To see the wikicode of a page, click on the "Edit" tab at the top of the article. For the Paris article, this displays the wikicode source of the page.


The infobox

The infobox is the insert located at the top right of an article. It is very important because it lets you quickly figure out the topic of an article. There are plenty of infoboxes: one for each type of object that Wikipedia can describe (art, culture, biography, legal concepts, economy, management, media, military, politics, science, logistics...). These inserts are defined by the Wikipedian community of each language chapter. You can find here the complete list of infoboxes related to Wikipedia Fr.

Each infobox summarizes the content of an article using the properties assigned to that type of article. In the wikicode, this amounts to filling in, or leaving empty, each available field of the template used.


The Wikipedia article on Paris contains the Commune de France infobox, which assigns to a city the name of its region, an INSEE code, a postal code, a mayor, an altitude, a surface area, a population... In our example, this information is well filled in.
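To make the idea of "filled in or left empty" template fields more concrete, here is a minimal Python sketch that reads the fields of an infobox from a wikicode string. The excerpt is hypothetical: the field names and values are illustrative and are not copied from the real Paris article, and two fields are deliberately left empty. The function `parse_infobox` is an assumption made for this example and only handles a flat template with one field per line.

```python
import re

# Hypothetical wikicode excerpt: field names and values are illustrative only.
SAMPLE_WIKICODE = """{{Infobox Commune de France
| nom = Paris
| région = Île-de-France
| maire =
| altitude =
}}"""


def parse_infobox(wikicode: str):
    """Return (template name, dict of non-empty fields).

    Simplified on purpose: assumes a single flat template written with
    one `| key = value` pair per line (no nested templates).
    """
    match = re.search(r"\{\{\s*(Infobox[^|\n}]*)", wikicode)
    if match is None:
        raise ValueError("no infobox template found")
    name = match.group(1).strip()

    fields = {}
    for line in wikicode.splitlines():
        pair = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        # Keep only the fields that actually carry a value.
        if pair and pair.group(2).strip():
            fields[pair.group(1)] = pair.group(2).strip()
    return name, fields


if __name__ == "__main__":
    template, values = parse_infobox(SAMPLE_WIKICODE)
    print(template, values)
```

Fields without a value are skipped, which mirrors the fact that a contributor may leave any field of the template undefined.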

The DBpedia Ontology

The DBpedia ontology lists and describes the concepts and objects that can be found on Wikipedia (classes), as well as the relations that can be used to associate them (properties). The classes are organized hierarchically; you can see this by exploring this page, which shows how the DBpedia classes are organized relative to one another. They are defined in a Semantic Web format: OWL. It is possible to contribute to the enrichment of classes and properties. If you want to do so, please refer to the article "How can I contribute to DBpedia Fr?" in our documentation.


In our example, searching the DBpedia ontology suggests that Paris could be related to the class Settlement.
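The hierarchical organization of classes can be pictured with a small Python sketch. The excerpt of the hierarchy below (City, Settlement, PopulatedPlace, Place) is an assumption made for illustration and should be checked against the ontology itself; the helper names `ancestors` and `is_a` are hypothetical.

```python
# Assumed excerpt of the class hierarchy (child -> parent).
# Check the DBpedia ontology for the authoritative tree.
SUBCLASS_OF = {
    "City": "Settlement",
    "Settlement": "PopulatedPlace",
    "PopulatedPlace": "Place",
    "Place": "owl:Thing",
}


def ancestors(cls: str) -> list:
    """Return the chain of superclasses of `cls`, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls: str, target: str) -> bool:
    """True if `cls` is `target` or one of its subclasses."""
    return cls == target or target in ancestors(cls)


if __name__ == "__main__":
    print(ancestors("Settlement"))
    print(is_a("Settlement", "Place"))
```

Under this assumed excerpt, typing Paris as a Settlement also makes it a PopulatedPlace and a Place, which is the practical benefit of a hierarchical ontology.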

Mappings

As you have seen, Wikipedia articles can be classified and described on the basis of the infoboxes they contain. This is what DBpedia does, by aligning infoboxes with the ontology via the collaborative DBpedia wiki mentioned above. There, DBpedia contributors have defined the type of object that each infobox describes and assigned a DBpedia ontology property to each infobox property.

As with the ontology, it is possible to contribute to the definition of the French mappings. If you want to get involved in this task, in addition to the "How can I contribute to DBpedia Fr?" article, we recommend that you read the "How to create a mapping?" article in our documentation.

Consider, for example, the Commune de France infobox used in the Paris article.
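The alignment described in this section can be sketched in Python as a lookup table from infobox fields to ontology properties. This is only an illustration of the principle, not the mapping syntax used on the DBpedia wiki: the class `TemplateMapping`, the function `apply_mapping`, and the field-to-property pairs are all assumptions chosen for this example.

```python
from dataclasses import dataclass

DBO = "http://dbpedia.org/ontology/"
RESOURCE = "http://fr.dbpedia.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class TemplateMapping:
    template: str        # infobox name
    ontology_class: str  # class assigned to pages using this infobox
    properties: dict     # infobox field -> ontology property


# Hypothetical mapping: the pairs below are illustrative, not the real one.
COMMUNE_MAPPING = TemplateMapping(
    template="Infobox Commune de France",
    ontology_class="Settlement",
    properties={"région": "region", "maire": "mayor", "cp": "postalCode"},
)


def apply_mapping(page_title: str, fields: dict, mapping: TemplateMapping) -> list:
    """Turn infobox fields into (subject, predicate, object) triples."""
    subject = RESOURCE + page_title.replace(" ", "_")
    # The infobox itself determines the type of the resource.
    triples = [(subject, RDF_TYPE, DBO + mapping.ontology_class)]
    for field_name, value in fields.items():
        prop = mapping.properties.get(field_name)
        # Unmapped or empty fields produce no triple.
        if prop is not None and value:
            triples.append((subject, DBO + prop, value))
    return triples


if __name__ == "__main__":
    sample_fields = {"nom": "Paris", "région": "Île-de-France", "maire": ""}
    for triple in apply_mapping("Paris", sample_fields, COMMUNE_MAPPING):
        print(triple)
```

Only fields that are both mapped and filled in yield a triple, which is why the quality of the extracted data depends on both the mapping and the infobox content.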


DBpedia extraction framework

The DBpedia extraction workflow is run by an open-source software framework, to which it is also possible to contribute. It is written in Java and Scala and is available in this GitHub repository.


The framework bases the extraction on the dumps provided by the Wikimedia Foundation. These archives contain all the articles of a given language chapter of Wikipedia, formatted in wikicode. The DBpedia framework scans an entire dump and performs four types of extraction:

  • generic extraction (made directly on the wikicode)
  • mapping-based extraction
  • text extraction
  • Wikidata extraction

Developers of the DBpedia community have built many extractors. The main extractors used for DBpedia Fr are briefly presented in the sections that follow.
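The dump-scanning step described above can be sketched in Python. This is not the framework's code (which is written in Java and Scala); it is a minimal illustration, assuming the usual MediaWiki XML export layout with `<page>`, `<title>` and `<text>` elements. The names `iter_pages`, `run_extractors` and the toy extractor `page_length` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; real dumps are large compressed XML files.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Paris</title>
    <revision><text>{{Infobox Commune de France}} [[Catégorie:Paris]]</text></revision>
  </page>
</mediawiki>""".encode("utf-8")


def local_name(tag: str) -> str:
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_stream):
    """Stream (title, wikicode) pairs without loading the whole dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(dump_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # free the memory used by the finished page


def page_length(title, wikicode):
    """Toy extractor: one (title, length of wikicode) record per page."""
    return [(title, len(wikicode))]


def run_extractors(dump_stream, extractors: dict) -> dict:
    """Apply every extractor to every page of the dump."""
    results = {name: [] for name in extractors}
    for title, wikicode in iter_pages(dump_stream):
        for name, extract in extractors.items():
            results[name].extend(extract(title, wikicode))
    return results


if __name__ == "__main__":
    print(run_extractors(io.BytesIO(SAMPLE_DUMP), {"length": page_length}))
```

For a compressed dump file, a stream opened with `bz2.open(path, "rb")` could be passed to `iter_pages` in place of the in-memory sample.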

Generic extractors

The generic extractors work directly on the wikicode.
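As an illustration of what working directly on the wikicode can look like, here is a minimal Python sketch that pulls internal links and categories out of a wikicode string with a regular expression. It is not the framework's implementation; the function name `extract_links`, the sample text, and the simplified link pattern are assumptions made for this example.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
CATEGORY_PREFIXES = ("Catégorie:", "Category:")

# Hypothetical wikicode sample, for illustration only.
SAMPLE = (
    "'''Paris''' est la capitale de la [[France]], sur la "
    "[[Seine (fleuve)|Seine]]. [[Fichier:Exemple.jpg|thumb]] "
    "[[Catégorie:Capitale]]"
)


def extract_links(wikicode: str):
    """Return (page links, categories) found in the wikicode."""
    page_links, categories = [], []
    for target in LINK_PATTERN.findall(wikicode):
        target = target.strip()
        if target.startswith(CATEGORY_PREFIXES):
            categories.append(target.split(":", 1)[1].strip())
        elif ":" not in target:
            # Targets in other namespaces (files, templates...) are skipped.
            page_links.append(target)
    return page_links, categories


if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

No mapping is involved here: the structure of the markup alone is enough to produce data, which is what distinguishes generic extraction from mapping-based extraction.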

Mapping-based extractors

These extractors parse the syntax of the mappings defined here. Describing their internals would not be very informative. To understand how this transformation is carried out, please refer to the mapping guidelines and explore the mappings defined for DBpedia Fr.

Text extractors

Two extractors handle text retrieval.

Wikidata extractors

Since 2014, DBpedia has also extracted data from Wikidata and provides mappings between Wikidata properties and DBpedia properties.

The Databus

The DBpedia Association regularly runs a complete extraction of Wikipedia in over 140 languages and hosts the results on the Databus.

However, not every chapter has a SPARQL endpoint. The complete list of available DBpedia endpoints can be found here.

For the moment, the French DBpedia chapter mainly relies on a selection of data coming from the Databus. We structure these data with named graphs, compute a large number of statistics on them, and reshape the Wikidata triples. For more information, please read the article "DBpedia Fr structure".
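For a chapter that does expose a SPARQL endpoint, a query can be sent over HTTP following the SPARQL protocol. The Python sketch below uses only the standard library. The endpoint URL is an assumption for the French chapter and should be checked against the endpoint list; the query and the function names `build_request` and `run_query` are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL for the French chapter; verify it in the endpoint list.
ENDPOINT = "https://fr.dbpedia.org/sparql"

# Illustrative query: a few properties of the Paris resource.
QUERY = """
SELECT ?property ?value WHERE {
  <http://fr.dbpedia.org/resource/Paris> ?property ?value .
} LIMIT 10
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


if __name__ == "__main__":
    for binding in run_query(ENDPOINT, QUERY):
        print(binding["property"]["value"], binding["value"]["value"])
```

Building the request separately from sending it makes it easy to point the same query at another chapter's endpoint by changing only the URL.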