1st International Workshop on
Challenges and Experiences
from Data Integration to Knowledge Graphs

August 5, 2019

Anchorage, Alaska

Held in conjunction with KDD 2019

About the Workshop

General info about DI2KG Workshop

Data integration and knowledge graph construction are complex processes that have been studied by different communities, including data management, machine learning, statistics, data science, natural language processing and information retrieval, typically in isolation. As holistic solutions are emerging, we claim the need for a more cross-disciplinary community that pushes research toward the creation of the next generation of data integration and knowledge graph construction methods. We aim for DI2KG to be a long-term venue that fosters new research based on the availability of an end-to-end benchmark, designed up-front for dealing with the complexity of every integration task, while building a community that can contribute to the evolution of the benchmark.

In order to stimulate advances in this direction we will also host the DI2KG challenge: a set of fundamental integration tasks leading to the construction of a knowledge graph from a collection of product specifications extracted from the Web with their own manually-checked ground truth.

Papers

Submit your paper

We welcome submissions that can stimulate discussion, including papers by the challenge participants. Topics of interest include but are not limited to the following:

  • Source selection and discovery.
  • Data and information extraction.
  • Data cleaning and fusion.
  • Schema extraction and alignment.
  • Algorithmic and statistical techniques for entity resolution.
  • Machine learning methods for data integration.
  • Benchmarking and performance measurement.
  • Knowledge graph augmentation.
  • Knowledge graph embedding techniques.

Submissions may be up to 4 pages in length (plus bibliography) in KDD 2019 format.

Submission Paper Categories


Challenge Papers

  • Experience papers, which provide new insights into the strengths and weaknesses of existing integration systems, inspired by experimental activities on the benchmark.



Research Papers

  • Position papers, which discuss requirements for a benchmark platform and the role of benchmarks in driving integration research.
  • Vision papers, which anticipate new challenges in integration and future research directions.
  • Application papers, which describe challenging use cases of modern data integration and knowledge graphs, with strong economic and social impact.
  • Technical papers, which present advances in topics related to integration.



Important Dates

(All deadlines are Alofi Time, UTC-11)

Challenge track


  • Benchmark publication: April 21, 2019
  • Preliminary paper submissions: May 20, 2019
  • Paper notifications: June 1, 2019
  • Benchmark results submission: July 16, 2019



Research track


  • Paper submissions: May 20, 2019 (extended from May 5, 2019)
  • Paper notifications: June 1, 2019



Challenge

Join our challenge

See below for challenge 2019

Overview


We would like to bring together people from different communities because we believe that a more synergistic approach can lead to the definition of more effective integration methods. This year we will release the first version of our benchmark and host a challenge on different integration tasks. Attendees are invited to participate in our benchmark-supported DI2KG challenge and submit a paper describing their experience with the benchmark and the new insights it yields into the strengths and weaknesses of existing integration systems.

Tasks Definition


Our end-to-end benchmark will evaluate participants' solutions to a selection of integration tasks leading to the construction of a knowledge graph.

The challenge comprises three main tasks:

  • Entity Resolution
  • Schema Alignment
  • Knowledge Graph Augmentation

Each task requires participants to build a knowledge graph consisting of a set of predefined entities and properties. The predefined properties correspond to the specification names that our benchmark considers during score calculation.

Dataset


Participants will be provided with a set of selected HTML pages regarding products from a variety of sources, each page accompanied by a JSON file containing the result of an automated specification-extraction process.

The JSON files consist of a set of key/value pairs extracted from the associated HTML page, e.g.:

                  
{
"<page title>": "Samsung Smart WB50F Digital Camera White Price in India with Offers & Full Specifications | PriceDekho.com",
"additional features": "Color\nWhite",
"brand": "Samsung",
"connectivity system req": "USB\nUSB 2.0",
"dimension": "Dimensions\n101 x 68 x 27.1 mm\nWeight\n157 gms",
"display": "Display Type\nLCD\nScreen Size\n3 Inches",
"general features": "Brand\nSamsung\nAnnounced\n2014, February\nStatus\nAvailable",
"lens": "Auto Focus\nCenter AF, Face Detection, Multi AF\nFocal Length\n4.3 - 51.6 mm (35 mm Equivalent to 24 - 288 mm)",
"media software": "Memory Card Type\nSD, SDHC, SDXC",
"optical sensor resolution in megapixel": "16.2 MP",
"other features": "ISO Rating\nAuto / 80 / 100 / 200 / 400 / 800 / 1600 / 3200\nSelf Timer\n2 sec, 10 sec\nFace Detection\nYes\nImage Stabilizer\nOptical\nMetering\nCenter, Multi, Spot\nExposure Compensation\n1/3 EV Steps, +/-2.0 EV\nMacro Mode (Exposure Mode)\n5 - 80 cm (W)\nRed Eye Reduction\nYes\nWhite Balancing\nAuto\nMicrophone\nBuilt-In Monaural Microphone",
"pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP",
"sensor": "Sensor Type\nCCD Sensor\nSensor Size\n1/2.3 Inches",
"sensor type": "CCD Sensor",
"shutter speed": "Maximum Shutter Speed\n1/2000 sec\nMinimum Shutter Speed\n2 sec",
"zoom": "Optical Zoom\n12x\nDigital Zoom\n2x"
}
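Note that many values pack several label/value pairs separated by newlines (e.g., the "display" entry above). A minimal post-processing sketch, assuming an alternating label/value layout — a heuristic for illustration, not part of the benchmark specification:

```python
import json

def flatten_spec(spec: dict) -> dict:
    """Split values that embed 'Label\\nValue' pairs into their own keys.

    The alternating label/value assumption is a heuristic about how these
    extracted strings are laid out, not a rule stated by the benchmark.
    """
    flat = {}
    for key, value in spec.items():
        parts = value.split("\n")
        # An even number of lines (>= 2) suggests alternating label/value pairs.
        if len(parts) >= 2 and len(parts) % 2 == 0:
            for label, val in zip(parts[0::2], parts[1::2]):
                flat[f"{key} / {label.lower()}"] = val
        else:
            flat[key] = value
    return flat

# A fragment of the sample specification above.
spec = json.loads(
    '{"brand": "Samsung", '
    '"display": "Display Type\\nLCD\\nScreen Size\\n3 Inches"}'
)
print(flatten_spec(spec))
```

This keeps plain values (such as "brand") untouched while exposing the nested labels as separate keys.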

Participants will also be provided with a set of records from our ground truth so that they can train models; more information about the training data will be released at the launch of the challenge.

Download dataset


N.B.: Ground Truth Data download available at the end of the next section



Ground Truth Data and Challenge Instructions


We manually built ground truth data, providing the solution to popular integration tasks in a unified way. Available tasks are:

  • Entity Resolution
  • Schema Alignment
  • Knowledge Graph Augmentation

Participants can either focus on a single task or attempt all of them.

We partition our ground truth data into two parts. One part is available in the download, for training or testing by participants. The second part will be used by us for evaluating submitted solutions (see the Submission & Scoring section) and is not disclosed to participants.

Submitted solutions need to use the JSON format described below, although depending on the task of choice, some attributes might be unnecessary, as detailed in the following instructions. Instructions for submitting your JSON file will be available soon.
We consider different classes of resources, specified as the value of the key “resource_class”; most of them are assigned a globally unique ID (“resource_id”). Available classes are:

  • source, that is, a website (e.g., www.camerashop.com)
  • json_file, that is, a JSON file in our dataset (e.g., 100.json), corresponding to an HTML page displaying the specification of a product
  • source_attribute, that is, a property name used in a website (e.g., battery in www.camerashop.com)
  • target_attribute, that is, a property of interest (e.g., battery)
  • provenance, that is, a key/value specification in a certain JSON file (e.g., battery: AAA in 100.json)
  • entity, that is, a certain product (e.g., Canon EOS 400d) that can appear in different HTML pages of our dataset

Each class is described below, highlighting which attributes are available in the download and which attributes are left to challenge participants to complete, depending on the task of choice.

Class source


Each resource of this class represents a single source -- i.e., website -- from our dataset.
                  
{
"resource_class": "source",
"resource_id": "SOURCE#1",
"source_name": "www.camerashop.com"
}

Class json_file


Each resource of this class corresponds to a json file of our dataset, that is, a set of extracted key-value pairs from a single HTML page. Note that we omit the value, which can be retrieved from the original json file in our dataset.
                  
{
"resource_class": "json_file",
"resource_id": "JSON#1",
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"json_number": 100
}

Class target_attribute


Each resource of this class represents a different property of interest. Such properties can be thought of as attributes in the integrated schema or predicates in the knowledge graph. All our target attributes are included in the download.
                  
{
"resource_class": "target_attribute",
"resource_id": "TARGETATTRIBUTE#1",
"target_attribute_name": "battery_type"
}

Class source_attribute


Each resource of this class represents an attribute name at the source level, which can correspond to a set of target attributes. All the source attributes are available in the download, but only some of them come with their own target_attribute_ids. Completion of target_attribute_ids for every source attribute is left to participants in the Schema Alignment task. We note that some source attributes can correspond to multiple target attributes, as in the example below, where the attribute battery of www.camerashop.com provides information about both the battery type (e.g., AA) and the chemistry (e.g., Li-Ion). Other source attributes may correspond to none of our target attributes (e.g., whether a product is used or new).
                  
{
"resource_class": "source_attribute",
"resource_id": "SOURCEATTRIBUTE#1",
"source_attribute_name": "battery",
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"target_attribute_ids": [ "TARGETATTRIBUTE#1", "TARGETATTRIBUTE#2" ]
}
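As an illustration of the Schema Alignment task, a naive baseline might map source attributes to target attributes by name similarity. The records and the threshold below are hypothetical stand-ins for the benchmark's resource files, not taken from them:

```python
from difflib import SequenceMatcher

# Hypothetical resource records in the format shown above.
source_attrs = [
    {"resource_id": "SOURCEATTRIBUTE#1", "source_attribute_name": "battery"},
    {"resource_id": "SOURCEATTRIBUTE#2", "source_attribute_name": "screen size"},
]
target_attrs = [
    {"resource_id": "TARGETATTRIBUTE#1", "target_attribute_name": "battery_type"},
    {"resource_id": "TARGETATTRIBUTE#2", "target_attribute_name": "display_size"},
]

def align(source_attrs, target_attrs, threshold=0.5):
    """Map each source attribute to the target attributes whose names are similar."""
    mapping = {}
    for s in source_attrs:
        matches = []
        for t in target_attrs:
            score = SequenceMatcher(
                None,
                s["source_attribute_name"].lower(),
                t["target_attribute_name"].replace("_", " ").lower(),
            ).ratio()
            if score >= threshold:
                matches.append(t["resource_id"])
        mapping[s["resource_id"]] = matches
    return mapping

print(align(source_attrs, target_attrs))
```

A real solution would have to handle synonyms (e.g., "screen" vs. "display") and one-to-many correspondences, which pure string similarity cannot capture.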

Class provenance


Each resource of this class represents an attribute name from a specific json file of a specific source. All the provenance resources are included in the download.
                  
{
"resource_class": "provenance",
"resource_id": "PROVENANCE#1",
"json_id": "JSON#1",
"json_number": 100,
"source_id": "SOURCE#1",
"source_name": "www.camerashop.com",
"source_attribute_id": "SOURCEATTRIBUTE#1",
"source_attribute_name": "battery"
}

Class entity


Each resource of this class represents a real-world product, which can appear in a set of JSON files and can be associated with a set of target attributes (claims). Only some entities are available in the download. Completion of all the entities, together with their corresponding JSON files, is left to participants in the Entity Resolution task.
It is worth noticing that:

  • multiple JSON files (even from the same source) can correspond to the same entity, as in the classic “dirty Entity Resolution” setting
  • a JSON file corresponds to one and only one entity

Depending on which sources each entity appears in, target attributes can correspond to different JSON attributes. The entity in the example below consists of two JSON files, and the fact that the first JSON file contains the battery type (that is, TARGETATTRIBUTE#1) in the “battery” attribute is represented by the resource PROVENANCE#1 in the claims. Some entities already come with their own provenance claims. Completion of provenance claims for every entity is left to participants in the Knowledge Graph Augmentation task.
                  
{
"resource_class": "entity",
"claims": [ { "target_attribute_id": "TARGETATTRIBUTE#1", "target_attribute_name": "battery_type", "provenances": [ "PROVENANCE#1", "PROVENANCE#2" ] } ],
"instances": [ "JSON#1", "JSON#2" ]
}

It is important to note that Schema Alignment and Knowledge Graph Augmentation are related tasks: an attribute A of a source S -- which we refer to as S.A -- corresponds to a target attribute T iff there exists an entity whose T value is included in the attribute A of a JSON file J in S -- which we refer to as S.J.A. However, the correspondence does not always carry over to the instance level: if S.A corresponds to T1 and T2, a certain S.J.A can be relevant to T1 only. In other words, the solution to the Knowledge Graph Augmentation task yields the solution to the Schema Alignment task (and the Entity Resolution task as well), but not vice versa.
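The reduction from Knowledge Graph Augmentation to Schema Alignment can be sketched directly: given an entity's claims and the provenance records they cite, the schema-level correspondences can simply be read off. The records below are illustrative stand-ins for the benchmark's resource files:

```python
# Hypothetical provenance records: provenance id -> source attribute it refers to.
provenances = {
    "PROVENANCE#1": {"source_attribute_id": "SOURCEATTRIBUTE#1"},
    "PROVENANCE#2": {"source_attribute_id": "SOURCEATTRIBUTE#2"},
}

# A Knowledge Graph Augmentation solution: entities with provenance claims.
entities = [
    {
        "claims": [
            {
                "target_attribute_id": "TARGETATTRIBUTE#1",
                "provenances": ["PROVENANCE#1", "PROVENANCE#2"],
            }
        ],
        "instances": ["JSON#1", "JSON#2"],
    }
]

def schema_alignment_from_kga(entities, provenances):
    """S.A corresponds to T iff some entity claim for T cites a provenance in S.A."""
    alignment = {}
    for entity in entities:
        for claim in entity["claims"]:
            target = claim["target_attribute_id"]
            for pid in claim["provenances"]:
                source_attr = provenances[pid]["source_attribute_id"]
                alignment.setdefault(source_attr, set()).add(target)
    return alignment

print(schema_alignment_from_kga(entities, provenances))
```

The converse direction is not mechanical: knowing that S.A aligns with T1 and T2 does not tell you which target attribute a particular S.J.A value supports.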

For any question, you can contact us at the e-mail address: di2kg@inf.uniroma3.it

N.B.: Please, download the latest available version of the ground truth data!



Submission & Scoring


We use classic precision and recall to evaluate submitted solutions. In addition, in the spirit of the workshop, we invite participants to propose their own evaluation practices in their challenge papers.
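For instance, treating an Entity Resolution submission as a set of unordered pairs of JSON files that refer to the same entity, precision and recall could be computed as follows. This is a sketch of the stated metric, not the official evaluation script:

```python
def precision_recall(predicted, truth):
    """Classic precision/recall over sets of predicted vs. ground-truth items."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical entity-resolution output: unordered pairs of JSON files.
pred = {frozenset(p) for p in [("JSON#1", "JSON#2"), ("JSON#1", "JSON#3")]}
gold = {frozenset(p) for p in [("JSON#1", "JSON#2")]}
p, r = precision_recall(pred, gold)
print(p, r)  # precision 0.5, recall 1.0
```

The same function applies unchanged to Schema Alignment (predicted source/target attribute pairs) and Knowledge Graph Augmentation (predicted provenance claims).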



Committees

Program Committee Co-Chairs and Organizers

Organizers


  • Valter Crescenzi, Roma Tre University
  • Xin Luna Dong, Amazon
  • Donatella Firmani, Roma Tre University
  • Paolo Merialdo, Roma Tre University
  • Divesh Srivastava, AT&T Labs-Research
  • Andrea De Angelis, Roma Tre University
  • Maurizio Mazzei, Roma Tre University

Program Chairs


  • Donatella Firmani, Roma Tre University
  • Divesh Srivastava, AT&T Labs-Research

Program Committee


  • Denilson Barbosa, University of Alberta
  • Valter Crescenzi, Roma Tre University
  • Xin Luna Dong, Amazon
  • Laura Haas, University of Massachusetts
  • Colin Lockard, University of Washington
  • Paolo Merialdo, Roma Tre University
  • Renée Miller, Northeastern University
  • Mourad Ouzzani, Qatar Computing Research Institute
  • Themis Palpanas, Paris Descartes University

Challenge Leaders


  • Andrea De Angelis, Roma Tre University
  • Maurizio Mazzei, Roma Tre University

Speakers

Keynote speakers

AnHai Doan - University of Wisconsin.

"Toward a System Building Agenda for Knowledge Graph Construction"

Abstract. I worked on knowledge graph construction in academia from 2005 to 2009, and then in industry from 2010 to 2014. Based on that experience, I concluded that building systems and engaging with real users are critical for advancing the field (and more broadly the field of data integration). So in the past five years, my group at Wisconsin has embarked on extensive system building efforts. In this talk, I will first describe these efforts. I describe the Magellan project that builds an ecosystem of tools for entity matching (EM). Magellan targets both on-prem and cloud settings, and provides self-service tools for lay users as well as sophisticated tools for power users. These tools exploit techniques from the fields of databases, machine learning, big data scaling, efficient user interaction, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. I describe how we are developing case studies as well as a Wisconsin benchmark for EM. While Magellan focuses only on EM, Columbus is a recent project (joint with many other groups) that builds on Magellan to develop an ecosystem of tools for other tasks in data exploration, cleaning, and integration. Finally, I discuss the lessons learned, and suggestions for a future agenda for knowledge graph construction (and more generally for data integration), touching on the role of system building, engaging with real users, benchmarks, and challenge problems.

Bio. AnHai Doan is Vilas Distinguished Achievement Professor of Computer Science at the University of Wisconsin-Madison. His interests cover databases, AI, and Web, with a current focus on data integration, data science, and machine learning. AnHai received the ACM Doctoral Dissertation Award in 2003, a CAREER Award in 2004, and a Sloan Fellowship in 2007. He co-authored "Principles of Data Integration", a textbook by Morgan-Kaufmann in 2012. AnHai was on the Advisory Board of Transformic, a Deep Web startup acquired by Google in 2005, and was Chief Scientist of Kosmix, a social media startup acquired by Walmart in 2011. From 2011 to 2014 he was Chief Scientist of WalmartLabs, the R&D arm of Walmart devoted to analyzing and integrating data for e-commerce. AnHai serves on the SIGMOD Advisory Board and co-chairs SIGMOD 2020.

Andrew McCallum - University of Massachusetts Amherst

"Embedded Representation and Reasoning in KBs and Natural Language"

Abstract. Work in knowledge representation has long struggled to design schemas of entity- and relation-types that capture the desired balance of specificity and generality while also supporting reasoning and information integration from various sources of input evidence. In our "universal schema" approach to knowledge representation we operate on the union of all input schemas (from structured KBs to textual patterns) while also supporting integration and generalization by learning vector embeddings whose neighborhoods capture semantic implicature. In this talk I will briefly review our past work on a knowledge graph with universal schema relations and entity types, then describe new research in (1) chains of reasoning, using reinforcement learning to guide the efficient search for meaningful chains, (2) aligning taxonomies and representing common sense with box-shaped embeddings, and (3) entity resolution with large-scale non-greedy clustering via Poincaré embeddings.

Bio. Andrew McCallum is a Distinguished Professor and Director of the Information Extraction and Synthesis Laboratory, as well as Director of the Center for Data Science in the College of Information and Computer Science at University of Massachusetts Amherst. He has published over 300 papers in many areas of AI, including natural language processing, machine learning and reinforcement learning; his work has received over 60,000 citations. He obtained his PhD from University of Rochester in 1995 with Dana Ballard and a postdoctoral fellowship from CMU with Tom Mitchell and Sebastian Thrun. In the early 2000's he was Vice President of Research and Development at WhizBang Labs, a 170-person start-up company that used machine learning for information extraction from the Web. He is an AAAI Fellow, ACM Fellow, the recipient of the UMass Chancellor's Award for Research and Creative Activity, the UMass NSM Distinguished Research Award, the UMass Lilly Teaching Fellowship, and research awards from Google, IBM, Microsoft, Yahoo, and others. He was the General Chair for the International Conference on Machine Learning (ICML) 2012, and from 2014 to 2017 served as President of the International Machine Learning Society. He is a member of the editorial board of the Journal of Machine Learning Research. For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, entity resolution, social network analysis, structured prediction, semi-supervised learning, and deep neural networks for knowledge representation. His work on open peer review can be found at http://openreview.net. McCallum's web page is http://www.cs.umass.edu/~mccallum.

Program

Half day workshop

Tentative schedule

Welcome

Keynote AnHai Doan.

"Toward a System Building Agenda for Knowledge Graph Construction".

Paper Session 1


Daniel Obraczka, Alieh Saeedi and Erhard Rahm.

"Knowledge Graph Completion with FAMER"


Mengshu Liu, Jingya Wang, Kareem Abdelfatah and Mohammed Korayem

"Tripartite Vector Representations for Better Job Recommendation"


Behnam Rahdari and Peter Brusilovsky

"Building a Knowledge Graph for Recommending Experts"


Kobkaew Opasjumruskit, Sirko Schindler, Philipp Matthias Schaefer and Laura Thiele

"Towards Learning from User Feedback for Ontology-based Information Extraction"

Break

Keynote Andrew McCallum.

"Embedded Representation and Reasoning in KBs and Natural Language"

Paper Session 2


Bahar Ghadiri Bashardoost, Renée Miller and Kelly Lyons

"Towards a Benchmark for Knowledge Base Exchange"


Daniel Caminhas, Daniel Cones, Natalie Hervieux and Denilson Barbosa

"Detecting and Correcting Typing Errors in DBpedia"


Tomoya Yamazaki, Kentaro Nishi, Takuya Makabe, Mei Sasaki, Chihiro Nishimoto, Hiroki Iwasawa, Masaki Noguchi and Yukihiro Tagami

"A Scalable and Plug-in Based System to Construct A Production-Level Knowledge Base"

Panel Denilson Barbosa, Xin Luna Dong, Renée Miller, AnHai Doan, Andrew McCallum. Moderator: Paolo Merialdo.

"The role of benchmarking in DI and KG"

Closing

Networking Drinks Place TBD in proximity of the conference building


Proceedings

Contacts

Contact us: di2kg@inf.uniroma3.it