Communications in Computer and Information Science, pages 418-429

A Dataset of Vietnamese Documents for Text Detection

Anh Le ¹

Dang Tran Hai Mai ²

Thanh Lam ²


NTT Hi-tech Institute, Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam

Deep Learning and Applications, Ho Chi Minh City, Vietnam

Publication type: Book Chapter

Publication date: 2023-11-17

Springer Nature

Communications in Computer and Information Science

scimago Q4

SJR: 0.182

CiteScore: 1.1

Impact factor: —

ISSN: 1865-0929 (Print), 1865-0937 (Electronic)

DOI: 10.1007/978-981-99-8296-7_30


Abstract

Document analysis and recognition is a crucial technique for automating the input of forms, receipts, and other documents at banks, government agencies, and companies. Driven by demand from both research and industry, datasets for document analysis and recognition are available for English, Chinese, Arabic, and Indic languages. However, there is no publicly available dataset for Vietnamese document analysis and recognition. In this paper, we introduce a new dataset for Vietnamese document analysis, named VNDoc, which aims to provide a standard dataset for researching and developing Vietnamese document analysis systems. The dataset contains 226 documents captured with mobile phones and scanners. The documents are collected from diverse categories, such as legal and administrative documents, invoices, resumes, and handwritten forms, among others, and target various applications. In this first stage, we provide ground truth for text lines, which enables research in text detection and layout analysis. We also present a statistical analysis of text length and bounding boxes in the dataset, together with initial experiments using existing text detection methods. We plan to add text transcriptions and make the dataset available to the research community.
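The bounding-box statistics mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the VNDoc annotation format, so the `(x_min, y_min, x_max, y_max)` pixel tuples, the function name `box_stats`, and the sample boxes below are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, median


def box_stats(boxes):
    """Summarize axis-aligned text-line boxes.

    `boxes` is a list of (x_min, y_min, x_max, y_max) tuples in pixels.
    This layout is an assumption; the real VNDoc format may differ.
    """
    widths = [x2 - x1 for x1, _, x2, _ in boxes]
    heights = [y2 - y1 for _, y1, _, y2 in boxes]
    # Skip degenerate boxes with zero height to avoid division by zero.
    ratios = [w / h for w, h in zip(widths, heights) if h > 0]
    return {
        "count": len(boxes),
        "mean_width": mean(widths),
        "mean_height": mean(heights),
        "median_aspect_ratio": median(ratios),
    }


# Hypothetical text-line boxes, not taken from the dataset.
example_boxes = [(10, 10, 210, 40), (10, 50, 310, 80), (10, 90, 110, 120)]
stats = box_stats(example_boxes)
print(stats)
```

The same summary could be extended to text length once transcriptions are available, for example by replacing the width list with character counts per line.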


Cite this

GOST

Le A., Mai D. T. H., Lam T. A Dataset of Vietnamese Documents for Text Detection // Communications in Computer and Information Science. 2023. pp. 418-429.


RIS

TY  - CHAP
DO  - 10.1007/978-981-99-8296-7_30
UR  - https://doi.org/10.1007/978-981-99-8296-7_30
TI  - A Dataset of Vietnamese Documents for Text Detection
T2  - Communications in Computer and Information Science
AU  - Le, Anh
AU  - Mai, Dang Tran Hai
AU  - Lam, Thanh
PY  - 2023
DA  - 2023/11/17
PB  - Springer Nature
SP  - 418
EP  - 429
SN  - 1865-0929
SN  - 1865-0937
ER  - 

BibTeX

@incollection{2023_Le,
  author = {Anh Le and Dang Tran Hai Mai and Thanh Lam},
  title = {A Dataset of Vietnamese Documents for Text Detection},
  booktitle = {Communications in Computer and Information Science},
  publisher = {Springer Nature},
  year = {2023},
  month = nov,
  pages = {418--429},
  doi = {10.1007/978-981-99-8296-7_30}
}
