qshino's Diary

PowerShell notes and other idle thoughts

PDF text extraction

PDF text extraction tools

  1. PyPDF / English text only
  2. PDFminer.six / supports Japanese text
  3. Apache Tika / Java
  4. Tesseract / OCR

PyPDF

import PyPDF2

# PyPDF2 3.x removed PdfFileReader / getPage / extractText; use the names below
with open("sample.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    page = reader.pages[0]  # first page
    print(page.extract_text())
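
To read every page rather than only the first, the reader's pages can be looped over. This is a minimal sketch assuming PyPDF2 2.x or later (PdfReader, pages, extract_text). The helper names extract_all_pages and extract_pdf_text are hypothetical, and the PyPDF2 import is deferred so the first helper only needs an object with a pages attribute.

```python
def extract_all_pages(reader):
    """Join the text of every page in a PdfReader-like object."""
    texts = []
    for page in reader.pages:
        # A page without a text layer may yield an empty value; treat it as ""
        texts.append(page.extract_text() or "")
    return "\n".join(texts)


def extract_pdf_text(path):
    """Open a PDF with PyPDF2 and return the text of all pages."""
    import PyPDF2  # imported lazily; assumes `pip install PyPDF2`

    with open(path, "rb") as f:
        return extract_all_pages(PyPDF2.PdfReader(f))
```

Calling extract_pdf_text("sample.pdf") would return one string with the pages separated by newlines.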

PDFminer.six

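  • pip install pdfminer.six

import sys
from pathlib import Path
from subprocess import call

# Path to pdf2txt.py (pip puts it in the Scripts folder on Windows)
py_path = Path(sys.exec_prefix) / "Scripts" / "pdf2txt.py"
# Run pdf2txt.py with the current interpreter.
# Each option and its value must be a separate list item.
call([sys.executable, str(py_path), "-o", "extract-sample.txt", "-p", "1", "extract-sample.pdf"])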
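
pdfminer.six can also be used as a library instead of going through pdf2txt.py. The following is a rough sketch, assuming the pdfminer.high_level.extract_text function and its zero-based page_numbers argument. The helper names to_zero_based and extract_pages are hypothetical.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers (as passed to pdf2txt.py -p) to 0-based indices."""
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return [p - 1 for p in pages]


def extract_pages(pdf_path, pages):
    """Return the text of the given 1-based pages as a single string."""
    # Imported lazily; assumes `pip install pdfminer.six`
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=to_zero_based(pages))
```

For example, extract_pages("extract-sample.pdf", [1]) would return the text of the first page, which can then be written to extract-sample.txt with an explicit encoding such as UTF-8.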

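Apache Tika

  • pip install tika

from tika import parser

# Requires a Java runtime; the tika package talks to a Tika server
file_data = parser.from_file("extract-sample.pdf")
# "content" holds the extracted text (it may be None if nothing was extracted)
text = file_data["content"]
print(text)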
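
The "content" value often carries leading blank lines and may be missing altogether, so a small wrapper can help. This is only a sketch under those assumptions; content_or_empty and extract_with_tika are hypothetical helper names, not part of the tika package.

```python
def content_or_empty(parsed):
    """Return stripped text from a parser.from_file() result dict, or "" if there is none."""
    content = parsed.get("content")
    return content.strip() if content else ""


def extract_with_tika(path):
    """Parse a file with Tika and return its cleaned-up text."""
    # Imported lazily; assumes `pip install tika` and an installed Java runtime
    from tika import parser

    return content_or_empty(parser.from_file(path))
```

With this, extract_with_tika("extract-sample.pdf") always returns a string, so the result can be printed or written to a file without a None check.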

Tesseract

  • pip install pyocr
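
pyocr is only a wrapper, so Tesseract itself and its language data have to be installed separately. The following is a minimal sketch of OCR on a single page image, assuming pyocr, Pillow, and a Tesseract install on PATH. The helper names pick_language and ocr_image are hypothetical, and rendering PDF pages to image files is outside this sketch.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred OCR language that is actually installed."""
    for lang in preferred:
        if lang in available:
            return lang
    raise RuntimeError("none of the preferred languages are installed")


def ocr_image(image_path):
    """Run OCR on one image file and return the recognized text."""
    # Imported lazily; assumes `pip install pyocr pillow`
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("no OCR tool found; install Tesseract and add it to PATH")
    tool = tools[0]
    lang = pick_language(tool.get_available_languages())
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.TextBuilder(),
    )
```

Preferring "jpn" over "eng" matches the Japanese-text use case here; the order can be changed through the preferred argument.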

References

https://gammasoft.jp/blog/python-parse-pdf-contents/