Expert Systems with Applications, volume 51, pages 259-275
Predicate enrichment of aligned XPaths for wrapper induction
Publication type: Journal Article
Publication date: 2016-06-01
Journal: Expert Systems with Applications
Q1
SJR: 1.875
CiteScore: 13.8
Impact factor: 7.5
ISSN: 09574174, 18736793
Computer Science Applications
General Engineering
Artificial Intelligence
Abstract
Proposes XPath predicate enrichments for a wrapper induction approach.
Builds on a generalisation strategy that aligns and merges XPaths.
Focuses on taking full advantage of XPath syntax for wrapper construction.
Test data (a PostgreSQL database) is supplied, based on the work of Hao et al. (2011).
The method can be used to merge data from various heterogeneous sources.

Extracting data from various semi-structured sources is a topic that has received a lot of attention. Wrapper induction specifically has been studied extensively: users annotate a couple of data sources with examples of the data they want, after which a procedure (wrapper) is constructed that can optimally extract similar data as well. In this paper, a novel wrapper induction approach is proposed, exploiting the premise of the general applicability of the XPath query language, studied specifically within the context of web pages. After a user annotates a limited set of web pages with the required data, a generalised XPath is constructed that is capable of extracting the examples and, optimally, similar data as well. This generalised baseline XPath is then enriched with predicates, based on the context and structure of the data sources, to optimise the precision/recall balance of the data extraction capability of the wrapper. Six variations of such limiting predicates are introduced and investigated. It is shown that the baseline approach often generalises the samples too much, leading to decreased precision. Enriching the baseline wrapper with predicates limits the generalisation power of the queries in an intelligent manner. Experimental results show a significant improvement in the overall precision of the generalised query, without an excessive loss in recall. Documented tests and real-world experience with a large amount of data show that the technique is flexible, easily understood and applicable in a broad range of applications. It is not only of interest in the field of web information retrieval, but can also be used in contexts such as reverse engineering of databases, ontology expansion and deep web data mining, as both simple lists of data and complex structures can be extracted.
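
The pipeline described in the abstract (annotate examples, generalise their XPaths, then add predicates to recover precision) can be illustrated with a small sketch. This is not the authors' algorithm: the alignment below is a naive step-by-step comparison of two equal-length XPaths, and the single attribute-equality predicate is only one hypothetical kind of enrichment, since the abstract does not detail the six variations. The page content and the names `PAGE`, `generalise`, `shared_attributes` and `enrich` are all assumptions made for illustration. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: the second item has an extra leading "badge" span,
# so the annotated titles sit at different positions.
PAGE = (
    "<html><body><ul>"
    "<li><span class='title'>Book A</span><span class='price'>10</span></li>"
    "<li><span class='badge'>New</span><span class='title'>Book B</span>"
    "<span class='price'>12</span></li>"
    "<li><span class='title'>Book C</span><span class='price'>9</span></li>"
    "</ul></body></html>"
)

# One XPath step: a tag name with an optional positional index, e.g. "li[2]".
STEP = re.compile(r"^(?P<tag>[^\[\]]+)(?:\[(?P<idx>\d+)\])?$")


def generalise(xpath_a, xpath_b):
    """Naively merge two equal-length XPaths; drop indices that differ."""
    steps_a = xpath_a.strip("/").split("/")
    steps_b = xpath_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        raise ValueError("this sketch only aligns XPaths of equal length")
    merged = []
    for a, b in zip(steps_a, steps_b):
        ma, mb = STEP.match(a), STEP.match(b)
        if ma is None or mb is None:
            raise ValueError("unsupported step syntax in this sketch")
        if ma.group("tag") != mb.group("tag"):
            merged.append("*")              # different tags: wildcard
        elif ma.group("idx") == mb.group("idx"):
            merged.append(a)                # identical step: keep as-is
        else:
            merged.append(ma.group("tag"))  # same tag, different index
    return "./" + "/".join(merged)


def shared_attributes(nodes):
    """Return attributes that have the same value on every example node."""
    common = dict(nodes[0].attrib)
    for node in nodes[1:]:
        common = {k: v for k, v in common.items() if node.get(k) == v}
    return common


def enrich(xpath, attributes):
    """Append one attribute-equality predicate per shared attribute."""
    return xpath + "".join(
        f"[@{k}='{v}']" for k, v in sorted(attributes.items())
    )


root = ET.fromstring(PAGE)
examples = ["body/ul/li[1]/span[1]", "body/ul/li[2]/span[2]"]
annotated = [root.find(path) for path in examples]

baseline = generalise(*examples)
enriched = enrich(baseline, shared_attributes(annotated))

baseline_hits = [node.text for node in root.findall(baseline)]
enriched_hits = [node.text for node in root.findall(enriched)]
print(baseline, baseline_hits)
print(enriched, enriched_hits)
```

On this toy page the baseline query matches all seven spans (titles, prices and the badge), while the enriched query keeps only the three titles. That mirrors the over-generalisation and the precision gain the abstract describes, though only on constructed data.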
Cite this
GOST
Nielandt J., Bronselaer A., De Tré G. Predicate enrichment of aligned XPaths for wrapper induction // Expert Systems with Applications. 2016. Vol. 51. pp. 259-275.
RIS
TY - JOUR
DO - 10.1016/j.eswa.2015.12.040
UR - https://doi.org/10.1016/j.eswa.2015.12.040
TI - Predicate enrichment of aligned XPaths for wrapper induction
T2 - Expert Systems with Applications
AU - Nielandt, Joachim
AU - Bronselaer, Antoon
AU - De Tré, Guy
PY - 2016
DA - 2016/06/01
PB - Elsevier
SP - 259-275
VL - 51
SN - 0957-4174
SN - 1873-6793
ER -
BibTeX
@article{2016_Nielandt,
author = {Joachim Nielandt and Antoon Bronselaer and Guy De Tré},
title = {Predicate enrichment of aligned XPaths for wrapper induction},
journal = {Expert Systems with Applications},
year = {2016},
volume = {51},
publisher = {Elsevier},
month = {jun},
url = {https://doi.org/10.1016/j.eswa.2015.12.040},
pages = {259--275},
doi = {10.1016/j.eswa.2015.12.040}
}