[Python] Extracting Specific Information from Images with OCR | Receipts and Statements

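There are many cases where you want to pull only specific fields, such as the date, the amount, or the store name, out of images of receipts, invoices, and statements. By combining Python with an OCR engine, you can automate analyzing the image and extracting only the data you need.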

This article explains how to extract specific information from images using Tesseract OCR and regular expressions.

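Prerequisite: Installing Tesseract OCR
Extracting the Full Text with OCR
Extracting Specific Information with Regular Expressions
Improving Accuracy with Preprocessing
Application: Batch Processing Multiple Images
Summary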

Prerequisite: Installing Tesseract OCR

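Windows: Install the Windows build of Tesseract and add its install folder to PATH
macOS: brew install tesseract
Python libraries: pip install pytesseract Pillow
Japanese language data: the examples use lang="jpn", so the jpn trained data must also be installed (on macOS with Homebrew, brew install tesseract-lang)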
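
A quick way to confirm the installation is to check that the tesseract executable is reachable on PATH and that the jpn language data is listed. The following is a minimal sketch using only the standard library. The helper names find_tesseract, parse_language_list, and installed_languages are hypothetical, and the parsing assumes that `tesseract --list-langs` prints one header line followed by one language code per line on standard output, which may differ on older Tesseract versions.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def parse_language_list(output):
    """Parse the text printed by `tesseract --list-langs`.

    Assumption: the first non-empty line is a header and every following
    non-empty line is one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_languages(executable):
    """Ask the given tesseract executable which languages it has installed."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_language_list(result.stdout)
```

If find_tesseract() returns None, the executable is not on PATH. Otherwise, checking "jpn" in installed_languages(path) tells you whether lang="jpn" can be used.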

Extracting the Full Text with OCR

import pytesseract
from PIL import Image

# Set the path to the executable explicitly if needed (Windows)
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image = Image.open("receipt.jpg")
text = pytesseract.image_to_string(image, lang="jpn")
print("Full OCR result:\n", text)

Extracting Specific Information with Regular Expressions

import re

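# Extract the amount (e.g. ¥1,200 or 1200円)
# \d+(?:,\d{3})* accepts digits with or without thousands separators,
# so "1200円" is matched whole instead of being cut down to "200円"
price_match = re.search(r"¥\s?\d+(?:,\d{3})*|\d+(?:,\d{3})*円", text)
if price_match:
    print("Amount:", price_match.group())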

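# Extract the date (e.g. 2025/07/18 or 2025年7月18日)
# The hyphen is escaped so that [/\-年] is not read as a character range from "/" to "年"
date_match = re.search(r"\d{4}[/\-年]\d{1,2}[/\-月]\d{1,2}日?", text)
if date_match:
    print("Date:", date_match.group())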

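# Extract a store-name candidate: the first line that contains
# a keyword such as "株式会社" (Co., Ltd.) or "店" (store)
lines = text.splitlines()
for line in lines:
    if "株式会社" in line or "店" in line:
        print("Store name (estimated):", line.strip())
        break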
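
The three extraction steps above can be gathered into one function that returns structured values, which is easier to reuse when processing many images. This is only a sketch: the function name extract_fields, the dictionary keys, and the choice to normalize the amount to an integer and the date to YYYY-MM-DD are assumptions for illustration, not part of the original procedure. The optional \s? between tokens is a further assumption, added because OCR output sometimes contains stray spaces.

```python
import re

# Amount written as "¥1,200" (group 1) or "1200円" (group 2)
PRICE_RE = re.compile(r"¥\s?(\d+(?:,\d{3})*)|(\d+(?:,\d{3})*)\s?円")
# Date written as 2025/07/18, 2025-07-18, or 2025年7月18日
DATE_RE = re.compile(r"(\d{4})\s?[/\-年]\s?(\d{1,2})\s?[/\-月]\s?(\d{1,2})")


def extract_fields(text):
    """Return the amount, date, and store-name candidate found in OCR text.

    Each value is None when nothing matching was found.
    """
    fields = {"amount": None, "date": None, "store": None}

    price_match = PRICE_RE.search(text)
    if price_match:
        digits = price_match.group(1) or price_match.group(2)
        fields["amount"] = int(digits.replace(",", ""))

    date_match = DATE_RE.search(text)
    if date_match:
        year, month, day = (int(g) for g in date_match.groups())
        fields["date"] = f"{year:04d}-{month:02d}-{day:02d}"

    # Same heuristic as above: first line containing a store-like keyword
    for line in text.splitlines():
        if "株式会社" in line or "店" in line:
            fields["store"] = line.strip()
            break

    return fields


sample = "サンプル商店 駅前店\n2025年7月18日\n合計 ¥1,200\n"
print(extract_fields(sample))
```

Returning None for missing fields lets the caller tell "not found" apart from an empty string or a zero amount. The date is not checked for calendar validity here, so a misread such as month 13 would pass through unchanged.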

Improving Accuracy with Preprocessing

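Preprocessing the image is an effective way to improve OCR accuracy. The example in this section uses OpenCV (pip install opencv-python). The following techniques help: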

Grayscale conversion
Binarization (thresholding)
Noise removal
Resizing (upscaling)

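import cv2

# Load the image directly as grayscale
img = cv2.imread("receipt.jpg", cv2.IMREAD_GRAYSCALE)
if img is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("receipt.jpg could not be read")
# Binarize with a fixed threshold of 120 (tune this value for your images)
_, thresh = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)
cv2.imwrite("preprocessed.jpg", thresh)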
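
A fixed threshold such as 120 works only when lighting is similar across images. One common alternative is Otsu's method, which picks the threshold from the image histogram. OpenCV provides it through the cv2.THRESH_OTSU flag. The sketch below is a pure-Python illustration of the idea, not OpenCV's implementation. The function names otsu_threshold and binarize are hypothetical, and the input is assumed to be a flat list of 8-bit grayscale values.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance.

    pixels is a flat sequence of integers in the range 0-255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_threshold = 0
    best_variance = -1.0
    count_low = 0  # number of pixels with value <= t
    sum_low = 0    # sum of the values of those pixels
    for t in range(256):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so no split at this t
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        variance = count_low * count_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = t
    return best_threshold


def binarize(pixels, threshold):
    """Map values above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]
```

For an image made of dark text on a light background, the chosen threshold falls between the two brightness clusters, so the split adapts to each image instead of relying on one hand-tuned value.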

Application: Batch Processing Multiple Images

import os

for file in os.listdir("./receipts"):
    # Compare in lowercase so that .JPG and .PNG files are also picked up
    if file.lower().endswith((".jpg", ".png")):
        image = Image.open(os.path.join("./receipts", file))
        text = pytesseract.image_to_string(image, lang="jpn")
        # Repeat the extraction steps here
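
When processing a whole folder, it helps to collect the results in one place and to keep going when a single image fails. The following is a minimal sketch under a few assumptions: the function names collect_ocr_rows and write_rows, the CSV column names, and the idea of passing the OCR step in as a function (ocr_func) are all hypothetical choices made for illustration.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_ocr_rows(folder, ocr_func):
    """Run ocr_func on every image file in folder and return one row per file.

    ocr_func takes a pathlib.Path and returns the recognized text.
    """
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        try:
            text = ocr_func(path)
        except Exception as exc:
            # Record the failure and continue with the remaining images
            rows.append({"file": path.name, "text": "", "error": str(exc)})
            continue
        rows.append({"file": path.name, "text": text, "error": ""})
    return rows


def write_rows(rows, csv_path):
    """Write the collected rows to a UTF-8 CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "text", "error"])
        writer.writeheader()
        writer.writerows(rows)
```

With the imports from the earlier sections, ocr_func could be something like lambda p: pytesseract.image_to_string(Image.open(p), lang="jpn"). Keeping the OCR call separate from the folder loop also makes the loop easy to try out with a stand-in function before running it on real images.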