Constructing a public meeting corpus

May 2020

Abstract

In this paper, we propose a method for constructing a large corpus covering about a century of public meetings reported in historical Australian newspapers, and we analyze the resulting corpus. The construction method is based on image processing and Optical Character Recognition (OCR): we digitize and transcribe newspaper texts on the specific topic of public meetings. Experiments show that the proposed method achieves an F-score of 71.5% with a high recall of 97.5% for corpus construction. The resulting corpus can feed a content search tool for temporal and semantic content analysis.
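The abstract reports an F-score and a recall but no precision. As a rough illustration only, and assuming the reported F-score is the balanced F1 measure (the abstract does not say which variant is used), the implied precision can be recovered by solving F1 = 2PR / (P + R) for P. The function name below is hypothetical and not taken from the paper.

```python
def implied_precision(f_score: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F * R / (2R - F).

    Assumes the F-score is the balanced F1 measure; this is an
    assumption, since the abstract does not name the variant.
    """
    denom = 2 * recall - f_score
    if denom <= 0:
        raise ValueError("no valid precision exists for these values")
    return f_score * recall / denom


# Values reported in the abstract: F-score 71.5%, recall 97.5%.
precision = implied_precision(0.715, 0.975)
print(f"{precision:.3f}")  # prints 0.564
```

Under that assumption, the figures correspond to a precision of roughly 56%, which suggests a pipeline that favors recall and retrieves nearly all relevant articles at the cost of some false positives.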

Published in

Proceedings - the 12th International Conference on Language Resources and Evaluation (LREC 2020)