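Setting up and using pyocr

This article explains how to set up an environment for using tesseract-ocr through pyocr, and how to use it.

Environment setup on Windows

Environment

Windows 10 with Anaconda.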

> python -V
Python 3.7.13
> conda -V
conda 4.13.0

Installation

Run the following commands to install tesseract and pyocr.

conda install -c conda-forge tesseract
conda install -c conda-forge pyocr

Verify the installation.

> tesseract --version
tesseract 5.1.0

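Japanese language support

On Windows 10 with Anaconda, the default installation probably does not include the data needed to recognize Japanese.
The languages that are available can be listed with the following command.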

> tesseract --list-langs
List of available languages in "C:\Users\{Username}\Anaconda3\Library\bin/tessdata/" (1):
eng

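Note: {Username} stands for your user name.
Note: the path depends on where Anaconda is installed.

To recognize Japanese or other languages, download the trained data from GitHub
and place it in the directory reported by the command above.

Directory for the trained data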

C:/Users/{Username}/Anaconda3/Library/bin/tessdata

tesseract trained data repository on GitHub
https://github.com/tesseract-ocr/tessdata/blob/main/jpn.traineddata

In my environment, I downloaded the following three files.

eng.traineddata
jpn.traineddata
jpn_vert.traineddata

Running tesseract --list-langs again then gives

List of available languages in "C:\Users\{Username}\Anaconda3\Library\bin/tessdata/" (3):
eng
jpn
jpn_vert

which shows that Japanese has been added.
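
The manual download can also be scripted. The following is a minimal sketch, not the procedure used in this article. It assumes that replacing "blob" with "raw" in the repository URL above gives a direct download link, and the helper names traineddata_url and download_traineddata are hypothetical. The tessdata directory must be the one reported by tesseract --list-langs on your machine.

```python
import urllib.request
from pathlib import Path

# Assumption: the "raw" form of the repository URL serves the file itself.
BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


def traineddata_url(lang):
    """Return the download URL for one language code, e.g. 'jpn'."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_traineddata(langs, tessdata_dir):
    """Download <lang>.traineddata for each language into tessdata_dir."""
    dest_dir = Path(tessdata_dir)
    saved = []
    for lang in langs:
        dest = dest_dir / f"{lang}.traineddata"
        if not dest.exists():  # skip files that are already present
            urllib.request.urlretrieve(traineddata_url(lang), str(dest))
        saved.append(dest)
    return saved
```

For example, calling download_traineddata(["jpn", "jpn_vert"], tessdata_dir) with tessdata_dir set to the directory above would fetch the two Japanese files; writing there may require suitable permissions.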

Environment setup on Linux

Environment

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian

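Installation

Install pyocr with pip. On Ubuntu, the tesseract command itself is provided by the tesseract-ocr system package, so install it with apt rather than pip.

pip3 install pyocr
sudo apt install tesseract-ocr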

Verify the installation.

$ pip3 list | grep pyocr
pyocr       0.8.3
$ tesseract --version
tesseract 4.0.0-beta.1

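Japanese language support

As on Windows 10, download the trained data from GitHub and place it in the tessdata directory.

Directory for the trained data
/usr/share/tesseract-ocr/4.00/tessdata/
Note: the number in the path depends on the installed version.

tesseract trained data repository on GitHub
https://github.com/tesseract-ocr/tessdata/blob/main/jpn.traineddata

Check which languages are supported.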

$ tesseract --list-langs 
List of available languages (3):
jpn
eng
osd
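
The same check can be made from Python before running OCR. This is a minimal sketch using only the standard library; parse_langs and installed_langs are hypothetical helper names, and the parser assumes the output format shown above (one header line followed by one language code per line).

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions may print the list to stderr
        check=True,
    )
    return parse_langs(result.stdout.decode("utf-8", errors="replace"))
```

A script could then test "jpn" in installed_langs() and stop early with a clear message when the Japanese data is missing.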

Sample code for pyocr + tesseract-ocr

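import pyocr
import pyocr.builders
from PIL import Image

# Get the OCR tools that pyocr can find
tools = pyocr.get_available_tools()
if not tools:
    raise RuntimeError("No OCR tool found; check that tesseract is on PATH")

# Select tesseract-ocr (here, the first tool in the list)
tool = tools[0]

# Recognize the text; layout 7 treats the image as a single text line
txt = tool.image_to_string(
    Image.open('sample.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=7)
)
print(txt)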

Input image: sample.png (figure not reproduced here)

The output is shown below; the text is recognized correctly.

光学的文字認識
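
The sample takes the first entry of the tool list. When pyocr detects more than one backend, selecting by name is a little more explicit. The sketch below assumes each tool reports its name through get_name() and that the Tesseract backends have names containing "Tesseract"; pick_tool and get_tesseract are hypothetical helper names.

```python
def pick_tool(tools, keyword="Tesseract"):
    """Return the first tool whose name contains keyword, or None."""
    for tool in tools:
        if keyword in tool.get_name():
            return tool
    return None


def get_tesseract():
    """Look up Tesseract among the tools pyocr detects."""
    import pyocr  # imported here so pick_tool works without pyocr installed

    tool = pick_tool(pyocr.get_available_tools())
    if tool is None:
        raise RuntimeError("Tesseract was not found by pyocr")
    return tool
```

With this, tool = get_tesseract() could replace the tools[0] lookup in the sample.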

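builder settings

Builder types

The available builders are listed below.

Builder	Function
TextBuilder	Recognizes text (letters and digits)
WordBoxBuilder	Recognizes the position and content of each word
LineBoxBuilder	Recognizes the position and content of each line
DigitBuilder	Recognizes digits and symbols
DigitLineBoxBuilder	Recognizes the position and content of each line of digits and symbols

Builders with Box in their name also return positions.
If the target contains only digits, DigitBuilder is a good choice.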
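
As an illustration of the Box builders, the sketch below reads an image line by line and formats each result. It assumes the returned objects expose content (the text) and position (a pair of corner coordinates in pixels); describe_boxes and read_lines are hypothetical helper names.

```python
def describe_boxes(boxes):
    """Format each box as 'content @ ((x1, y1), (x2, y2))'."""
    return [f"{box.content} @ {box.position}" for box in boxes]


def read_lines(image_path, lang="jpn"):
    """Recognize an image line by line and return the line boxes."""
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found")
    return tools[0].image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.LineBoxBuilder(),
    )
```

For example, printing each entry of describe_boxes(read_lines('sample.png')) would list every recognized line together with its position.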

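Meaning of the tesseract_layout setting

The meaning of each tesseract_layout value can be viewed with the following command.
Recognition accuracy changes considerably with this setting, so choose the mode that matches what you are reading.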

> tesseract --help-extra
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
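
These numbers are Tesseract's page segmentation modes, and on the command line the same value is passed with --psm. One way to find a suitable mode is to run the same image through a few candidates and compare the text. The following is a minimal sketch that calls the tesseract command directly rather than going through pyocr; build_command and compare_modes are hypothetical helper names, and it assumes a Tesseract version that accepts --psm.

```python
import subprocess


def build_command(image_path, lang="jpn", psm=7):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def compare_modes(image_path, modes=(3, 6, 7), lang="jpn"):
    """Run one image through several modes and collect the output per mode."""
    results = {}
    for psm in modes:
        done = subprocess.run(
            build_command(image_path, lang, psm),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        # Decode explicitly as UTF-8 instead of relying on the locale.
        results[psm] = done.stdout.decode("utf-8", errors="replace").strip()
    return results
```

Calling compare_modes('sample.png') returns a dictionary keyed by mode number, which makes it easy to see which mode reads a given image best.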
