PDF text extraction software
- PyPDF / English text only
- pdfminer.six / supports Japanese text
- Apache Tika / Java-based
- Tesseract / OCR
PyPDF
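import PyPDF2

# Open in binary mode and keep the file open while pages are read.
with open("sample.pdf", "rb") as f:
    # PdfReader replaces the deprecated PdfFileReader (removed in PyPDF2 3.0).
    reader = PyPDF2.PdfReader(f)
    page = reader.pages[0]  # first page (0-based index)
    print(page.extract_text())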
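To read every page rather than only the first, the reader's pages can be looped over. A minimal sketch, assuming a PyPDF2 release that provides `PdfReader`, `reader.pages` and `page.extract_text()`; the helper names `extract_all_pages` and `print_pdf_text` and the page separator are illustrative and not part of PyPDF2.

```python
def extract_all_pages(reader, separator="\n\n"):
    """Join the text of every page of a PyPDF2-style reader.

    `reader` is assumed to expose `.pages`, and each page `.extract_text()`.
    """
    texts = []
    for page in reader.pages:
        # Pages without a text layer may yield an empty string (or None).
        texts.append(page.extract_text() or "")
    return separator.join(texts)


def print_pdf_text(path):
    """Hypothetical usage: print the text of all pages of one PDF file."""
    import PyPDF2  # imported lazily so the helper above has no hard dependency

    with open(path, "rb") as f:
        print(extract_all_pages(PyPDF2.PdfReader(f)))
```

Calling `print_pdf_text("sample.pdf")` would print the pages in order, separated by a blank line.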
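pdfminer.six
import sys
from pathlib import Path
from subprocess import call

# pdf2txt.py is installed into the Scripts folder on Windows.
py_path = Path(sys.exec_prefix) / "Scripts" / "pdf2txt.py"
# Each option and its value must be a separate list element.
# "py" is the Windows Python launcher.
call(["py", str(py_path), "-o", "extract-sample.txt", "-p", "1", "extract-sample.pdf"])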
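When pdf2txt.py is called through subprocess, each option and its value needs to be its own list element; an element such as "-o extract-sample.txt" reaches the script as one token, leading space included. A minimal sketch of building the argument list, assuming only the `-o` and `-p` options used in these notes; the helper name `build_pdf2txt_args` is hypothetical, and it runs the script with the current interpreter instead of the `py` launcher.

```python
import sys
from pathlib import Path


def build_pdf2txt_args(pdf_path, out_path, page=None, script=None):
    """Return an argv list for pdf2txt.py with one token per element."""
    if script is None:
        # Assumed location: the Scripts folder of a Windows Python install.
        script = Path(sys.exec_prefix) / "Scripts" / "pdf2txt.py"
    args = [sys.executable, str(script), "-o", str(out_path)]
    if page is not None:
        # Option name and value are appended as two separate tokens.
        args += ["-p", str(page)]
    args.append(str(pdf_path))
    return args
```

The result can be passed straight to `subprocess.call(build_pdf2txt_args("extract-sample.pdf", "extract-sample.txt", page=1))`.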
Apache Tika
from tika import parser

# from_file returns a dict; the extracted text is under the "content" key.
file_data = parser.from_file("extract-sample.pdf")
text = file_data["content"]
print(text)
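Text returned by Tika often starts with a run of blank lines, and `content` can be `None` when the PDF has no text layer. A minimal sketch of tidying the result, assuming the dict shape used in the snippet above; `clean_tika_text` is a hypothetical helper and not part of the tika package.

```python
def clean_tika_text(file_data):
    """Return the "content" text without blank lines, or "" if it is missing."""
    text = file_data.get("content") or ""
    # Trim each line and drop the ones that end up empty.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

With this helper, `print(clean_tika_text(file_data))` prints only the non-empty lines.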
Tesseract
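Tesseract works on images, so a scanned PDF page has to be rendered to an image first. A minimal sketch of calling the tesseract command-line tool from Python, assuming the page already exists as an image file and that the tesseract executable and the requested language data are installed; the file names and the helper names `build_tesseract_cmd` and `ocr_image` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Return the argv list: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract; the recognized text is written to OUTPUTBASE.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    return f"{output_base}.txt"
```

For Japanese text, `ocr_image("page-1.png", "page-1", lang="jpn")` would write the result to page-1.txt, provided the Japanese language data is available.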
References
https://gammasoft.jp/blog/python-parse-pdf-contents/