A Dataset of Vietnamese Documents for Text Detection

Anh Le 1
Dang Tran Hai Mai 2
Thanh Lam 2
Publication type: Book Chapter
Publication date: 2023-11-17
Scimago quartile: Q4
SJR: 0.182
CiteScore: 1.1
ISSN: 1865-0929, 1865-0937
Abstract
Document analysis and recognition is a crucial technique for automating the input of forms, receipts, and other documents at banks, government agencies, and companies. Driven by demand from both research and industry, datasets for document analysis and recognition are available for English, Chinese, Arabic, and Indic languages. However, there is no publicly available dataset for Vietnamese document analysis and recognition. In this paper, we introduce VNDoc, a new dataset for Vietnamese document analysis, which aims to serve as a standard dataset for researching and developing Vietnamese document analysis systems. The dataset contains 226 documents captured with mobile phones and scanners. The documents are collected from diverse categories, such as legal and administrative documents, invoices, resumes, and handwritten forms, and target various applications. In the first stage, we provide ground truth for text lines, which enables research on text detection and layout analysis. Moreover, we present a statistical analysis of text length and bounding boxes in the dataset, along with initial experiments using existing text detection methods. We plan to provide text transcriptions and make the dataset available to the research community.
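The kind of bounding-box statistics mentioned in the abstract could be computed along the following lines. This is a minimal sketch, not the authors' procedure: the ground-truth format of VNDoc is not described here, so the `(x_min, y_min, x_max, y_max)` tuple layout, the function name `summarize_boxes`, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, median

def summarize_boxes(boxes):
    """Summarize text-line boxes given as (x_min, y_min, x_max, y_max) tuples.

    The tuple layout is an assumption for illustration; it is not the
    documented VNDoc annotation format. Boxes must have positive height.
    """
    widths = [x_max - x_min for x_min, _, x_max, _ in boxes]
    heights = [y_max - y_min for _, y_min, _, y_max in boxes]
    ratios = [w / h for w, h in zip(widths, heights)]
    return {
        "count": len(boxes),
        "mean_width": mean(widths),
        "mean_height": mean(heights),
        "median_aspect_ratio": median(ratios),
    }

# Made-up sample boxes in pixel coordinates, not taken from the dataset.
sample_boxes = [(10, 10, 210, 40), (10, 50, 110, 80), (10, 90, 310, 120)]
stats = summarize_boxes(sample_boxes)
print(stats)
```

Width, height, and aspect ratio are chosen here because text-line detectors are often sensitive to long, thin boxes; other summaries, such as boxes per page, could be added in the same way.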
GOST
Le A., Mai D. T. H., Lam T. A Dataset of Vietnamese Documents for Text Detection // Communications in Computer and Information Science. 2023. pp. 418-429.
RIS
TY - GENERIC
DO - 10.1007/978-981-99-8296-7_30
UR - https://doi.org/10.1007/978-981-99-8296-7_30
TI - A Dataset of Vietnamese Documents for Text Detection
T2 - Communications in Computer and Information Science
AU - Le, Anh
AU - Mai, Dang Tran Hai
AU - Lam, Thanh
PY - 2023
DA - 2023/11/17
PB - Springer Nature
SP - 418-429
SN - 1865-0929
SN - 1865-0937
ER -
BibTeX
@incollection{2023_Le,
author = {Anh Le and Dang Tran Hai Mai and Thanh Lam},
title = {A Dataset of Vietnamese Documents for Text Detection},
publisher = {Springer Nature},
year = {2023},
pages = {418--429},
month = {nov}
}