21
2025
03
01:27:02

【LLM|Action】实战操作:用MinerU处理最难识别的PDF,看看OCR能做到什么程度!

Magic-PDF Setup with GPU Support

Create and Activate a Conda Environment

conda create -n py3.10_mineru python=3.10
conda activate py3.10_mineru
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple

Install PaddlePaddle with GPU Support

To enable GPU support, install PaddlePaddle as follows:

python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

Download Models

wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py

Edit the Magic-PDF Configuration File

Create and edit a configuration file:

/root/magic-pdf.json

{
    "bucket_info": {
        "bucket-name-1": [
            "ak",
            "sk",
            "endpoint"
        ],
        "bucket-name-2": [
            "ak",
            "sk",
            "endpoint"
        ]
    },
    "models-dir": "/mnt/data1/hf/hub/models--opendatalab--PDF-Extract-Kit-1.0/snapshots/38e484355b9acf5654030286bf72490e27842a3c/models",
    "layoutreader-model-dir": "/mnt/data1/hf/hub/models--hantian--layoutreader/snapshots/641226775a0878b1014a96ad01b9642915136853",
    "device-mode": "cuda",
    "layout-config": {
        "model": "layoutlmv3"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true
    },
    "table-config": {
        "model": "rapid_table",
        "enable": true,
        "max_time": 400
    },
    "config_version": "1.0.0"}

Verify Installation

Download a demo PDF and run Magic-PDF to test:

# 配置LD_LIBRARY_PATH 路径export LD_LIBRARY_PATH=/opt/data1/app/anaconda3/envs/py3.10_mineru/lib:$LD_LIBRARY_PATH# 下载测试文件wget https://gitee.com/myhloli/MinerU/raw/master/demo/small_ocr.pdf# 执行测试magic-pdf --path small_ocr.pdf --output-dir /tmp/ --method auto

GPU Parallel Processing Setup

For running on multiple GPUs, follow the instructions below:

pip install -U litserve python-multipart filetype
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com# CUDA 是12 的版本也是用下面的命令pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118# 本地英伟达显卡配置如下nvidia-smi

+-----------------------------------------------------------------------------------------+| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     ||-----------------------------------------+------------------------+----------------------+| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC || Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. ||                                         |                        |               MIG M. ||=========================================+========================+======================||   0  NVIDIA GeForce RTX 4090        Off |   00000000:18:00.0 Off |                  Off || 30%   26C    P8              7W /  450W |      18MiB /  24564MiB |      0%      Default ||                                         |                        |                  N/A |+-----------------------------------------+------------------------+----------------------+|   1  NVIDIA GeForce RTX 4090        Off |   00000000:51:00.0 Off |                  Off || 30%   24C    P8             11W /  450W |      18MiB /  24564MiB |      0%      Default ||                                         |                        |                  N/A |+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+| Processes:                                                                              ||  GPU   GI   CI        PID   Type   Process name                              GPU Memory ||        ID   ID                                                               Usage      ||=========================================================================================||    0   N/A  N/A      2925      G   /usr/lib/xorg/Xorg                              4MiB ||    1   N/A  N/A      2925      G   /usr/lib/xorg/Xorg                              4MiB |+-----------------------------------------------------------------------------------------+

Run server and client scripts:

# git hub 地址如下; 原理是按照文件级别的并行,分发到不同的显卡上。如果单个文件,还是一个显卡上执行# https://github.com/opendatalab/MinerU/tree/master/projects/multi_gpupython server.py
python client.py
  • The operations are executed in parallel based on file granularity.

Image Operations Directory

The directory for image operations is:

/opt/data1/app/anaconda3/envs/py3.10_mineru/lib/python3.10/site-packages/magic_pdf/tools

CLI Script (cli.py)

import osfrom pathlib import Pathimport clickfrom loguru import loggerimport pymupdfimport uuidimport magic_pdf.model as model_configfrom magic_pdf.libs.version import __version__from magic_pdf.rw.AbsReaderWriter import AbsReaderWriterfrom magic_pdf.rw.DiskReaderWriter import DiskReaderWriterfrom magic_pdf.tools.common import do_parse, parse_pdf_methods@click.command()@click.version_option(__version__,
                      '--version',
                      '-v',
                      help='display the version and exit')@click.option(
    '-p',
    '--path',
    'path',
    type=click.Path(exists=True),
    required=True,
    help='local pdf filepath or directory',)@click.option(
    '-o',
    '--output-dir',
    'output_dir',
    type=click.Path(),
    required=True,
    help='output local directory',)@click.option(
    '-m',
    '--method',
    'method',
    type=parse_pdf_methods,
    help="""the method for parsing pdf.ocr: using ocr technique to extract information from pdf.txt: suitable for the text-based pdf only and outperform ocr.auto: automatically choose the best method for parsing pdf from ocr and txt.without method specified, auto will be used by default.""",
    default='auto',)@click.option(
    '-l',
    '--lang',
    'lang',
    type=str,
    help="""    Input the languages in the pdf (if known) to improve OCR accuracy.  Optional.    You should input "Abbreviation" with language form url:    https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations    """,
    default=None,)@click.option(
    '-d',
    '--debug',
    'debug_able',
    type=bool,
    help='Enables detailed debugging information during the execution of the CLI commands.',
    default=False,)@click.option(
    '-s',
    '--start',
    'start_page_id',
    type=int,
    help='The starting page for PDF parsing, beginning from 0.',
    default=0,)@click.option(
    '-e',
    '--end',
    'end_page_id',
    type=int,
    help='The ending page for PDF parsing, beginning from 0.',
    default=None,)def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
    model_config.__use_inside_model__ = True
    model_config.__model_mode__ = 'full'
    os.makedirs(output_dir, exist_ok=True)

    def read_fn(path):
        disk_rw = DiskReaderWriter(os.path.dirname(path))
        return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)

    def to_pdf(file_path):
        with pymupdf.open(file_path) as f:
            if f.is_pdf:
                return file_path
            else:
                pdf_bytes = f.convert_to_pdf()
                # 将pdfbytes 写入到uuid.pdf中
                # 生成唯一的文件名
                unique_filename = f"{uuid.uuid4()}.pdf"

                # 构建完整的文件路径
                tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)

                # 将字节数据写入文件
                with open(tmp_file_path, 'wb') as tmp_pdf_file:
                    tmp_pdf_file.write(pdf_bytes)

                return tmp_file_path

    def parse_doc(doc_path: str):
        try:
            doc_path_new = to_pdf(doc_path)
            file_name = str(Path(doc_path_new).stem)
            pdf_data = read_fn(doc_path_new)
            do_parse(
                output_dir,
                file_name,
                pdf_data,
                [],
                method,
                debug_able,
                start_page_id=start_page_id,
                end_page_id=end_page_id,
                lang=lang
            )

        except Exception as e:
            logger.exception(e)

    if os.path.isdir(path):
        for doc_path_new in Path(path).glob('*'):
            parse_doc(doc_path_new)
    else:
        parse_doc(path)if __name__ == '__main__':
    cli()

Image To Text

magic-pdf --path /home/ubuntu/123.jpg --output-dir /tmp/ --method ocr

【LLM|Action】实战操作:用MinerU处理最难识别的PDF,看看OCR能做到什么程度!




推荐本站淘宝优惠价购买喜欢的宝贝:

【LLM|Action】实战操作:用MinerU处理最难识别的PDF,看看OCR能做到什么程度!

本文链接:https://hqyman.cn/post/9558.html 非本站原创文章欢迎转载,原创文章需保留本站地址!

分享到:
打赏





休息一下~~


« 上一篇 下一篇 »

发表评论:

◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。

请先 登录 再评论,若不是会员请先 注册

您的IP地址是: