Magic-PDF Setup with GPU Support
Create and Activate a Conda Environment
conda create -n py3.10_mineru python=3.10
conda activate py3.10_mineru
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
Install PaddlePaddle with GPU Support
To enable GPU support, install PaddlePaddle as follows:
python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
Download Models
wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
Edit the Magic-PDF Configuration File
Create and edit a configuration file:
/root/magic-pdf.json
{
"bucket_info": {
"bucket-name-1": [
"ak",
"sk",
"endpoint"
],
"bucket-name-2": [
"ak",
"sk",
"endpoint"
]
},
"models-dir": "/mnt/data1/hf/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/38e484355b9acf5654030286bf72490e27842a3c/models",
"layoutreader-model-dir": "/mnt/data1/hf/hub/models--hantian--layoutreader/snapshots/641226775a0878b1014a96ad01b9642915136853",
"device-mode": "cuda",
"layout-config": {
"model": "layoutlmv3"
},
"formula-config": {
"mfd_model": "yolo_v8_mfd",
"mfr_model": "unimernet_small",
"enable": true
},
"table-config": {
"model": "rapid_table",
"enable": true,
"max_time": 400
},
"config_version": "1.0.0"}
Verify Installation
Download a demo PDF and run Magic-PDF to test:
# 配置LD_LIBRARY_PATH 路径export LD_LIBRARY_PATH=/opt/data1/app/anaconda3/envs/py3.10_mineru/lib:$LD_LIBRARY_PATH# 下载测试文件wget https://gitee.com/myhloli/MinerU/raw/master/demo/small_ocr.pdf# 执行测试magic-pdf --path small_ocr.pdf --output-dir /tmp/ --method auto
GPU Parallel Processing Setup
For running on multiple GPUs, follow the instructions below:
pip install -U litserve python-multipart filetype
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com# CUDA 是12 的版本也是用下面的命令pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118# 本地英伟达显卡配置如下nvidia-smi
+-----------------------------------------------------------------------------------------+| NVIDIA-SMI 550.120 Driver Version: 550.120 CUDA Version: 12.4 ||-----------------------------------------+------------------------+----------------------+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC || Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. || | | MIG M. ||=========================================+========================+======================|| 0 NVIDIA GeForce RTX 4090 Off | 00000000:18:00.0 Off | Off || 30% 26C P8 7W / 450W | 18MiB / 24564MiB | 0% Default || | | N/A |+-----------------------------------------+------------------------+----------------------+| 1 NVIDIA GeForce RTX 4090 Off | 00000000:51:00.0 Off | Off || 30% 24C P8 11W / 450W | 18MiB / 24564MiB | 0% Default || | | N/A |+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+| Processes: || GPU GI CI PID Type Process name GPU Memory || ID ID Usage ||=========================================================================================|| 0 N/A N/A 2925 G /usr/lib/xorg/Xorg 4MiB || 1 N/A N/A 2925 G /usr/lib/xorg/Xorg 4MiB |+-----------------------------------------------------------------------------------------+
Run server and client scripts:
# git hub 地址如下; 原理是按照文件级别的并行,分发到不同的显卡上。如果单个文件,还是一个显卡上执行# https://github.com/opendatalab/MinerU/tree/master/projects/multi_gpupython server.py
python client.py
The operations are executed in parallel based on file granularity.
Image Operations Directory
The directory for image operations is:
/opt/data1/app/anaconda3/envs/py3.10_mineru/lib/python3.10/site-packages/magic_pdf/tools
CLI Script (cli.py
)
import osfrom pathlib import Pathimport clickfrom loguru import loggerimport pymupdfimport uuidimport magic_pdf.model as model_configfrom magic_pdf.libs.version import __version__from magic_pdf.rw.AbsReaderWriter import AbsReaderWriterfrom magic_pdf.rw.DiskReaderWriter import DiskReaderWriterfrom magic_pdf.tools.common import do_parse, parse_pdf_methods@click.command()@click.version_option(__version__,
'--version',
'-v',
help='display the version and exit')@click.option(
'-p',
'--path',
'path',
type=click.Path(exists=True),
required=True,
help='local pdf filepath or directory',)@click.option(
'-o',
'--output-dir',
'output_dir',
type=click.Path(),
required=True,
help='output local directory',)@click.option(
'-m',
'--method',
'method',
type=parse_pdf_methods,
help="""the method for parsing pdf.ocr: using ocr technique to extract information from pdf.txt: suitable for the text-based pdf only and outperform ocr.auto: automatically choose the best method for parsing pdf from ocr and txt.without method specified, auto will be used by default.""",
default='auto',)@click.option(
'-l',
'--lang',
'lang',
type=str,
help=""" Input the languages in the pdf (if known) to improve OCR accuracy. Optional. You should input "Abbreviation" with language form url: https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations """,
default=None,)@click.option(
'-d',
'--debug',
'debug_able',
type=bool,
help='Enables detailed debugging information during the execution of the CLI commands.',
default=False,)@click.option(
'-s',
'--start',
'start_page_id',
type=int,
help='The starting page for PDF parsing, beginning from 0.',
default=0,)@click.option(
'-e',
'--end',
'end_page_id',
type=int,
help='The ending page for PDF parsing, beginning from 0.',
default=None,)def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
model_config.__use_inside_model__ = True
model_config.__model_mode__ = 'full'
os.makedirs(output_dir, exist_ok=True)
def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path))
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def to_pdf(file_path):
with pymupdf.open(file_path) as f:
if f.is_pdf:
return file_path
else:
pdf_bytes = f.convert_to_pdf()
# 将pdfbytes 写入到uuid.pdf中
# 生成唯一的文件名
unique_filename = f"{uuid.uuid4()}.pdf"
# 构建完整的文件路径
tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
# 将字节数据写入文件
with open(tmp_file_path, 'wb') as tmp_pdf_file:
tmp_pdf_file.write(pdf_bytes)
return tmp_file_path
def parse_doc(doc_path: str):
try:
doc_path_new = to_pdf(doc_path)
file_name = str(Path(doc_path_new).stem)
pdf_data = read_fn(doc_path_new)
do_parse(
output_dir,
file_name,
pdf_data,
[],
method,
debug_able,
start_page_id=start_page_id,
end_page_id=end_page_id,
lang=lang
)
except Exception as e:
logger.exception(e)
if os.path.isdir(path):
for doc_path_new in Path(path).glob('*'):
parse_doc(doc_path_new)
else:
parse_doc(path)if __name__ == '__main__':
cli()
Image To Text
magic-pdf --path /home/ubuntu/123.jpg --output-dir /tmp/ --method ocr
推荐本站淘宝优惠价购买喜欢的宝贝:
本文链接:https://hqyman.cn/post/9558.html 非本站原创文章欢迎转载,原创文章需保留本站地址!
休息一下~~