
Advanced PDF: Stamp Recognition

Posted by obaby on 2024-08-22 07:22:29

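It is called PDF stamp recognition, but strictly speaking it is stamp recognition on images. This feature continues the earlier topic of workflow automation. In short: once a user has uploaded images of all the stamped documents, the images have to be stitched into a PDF, and the uploaded images also have to be checked for a stamp. The reason for automatic detection: looking at what users upload today, many images carry no stamp at all, in the hope of slipping through. There is a review step later on, but rather than add to the reviewers' workload it is better to block this at the source: no stamp, no finishing the workflow.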

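Searching GitHub for stamp recognition does turn up some projects. But, mind you, here comes my "however": many open-source projects are only half open-sourced, which is absurd. For example, this one: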

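I pulled the code, huffed and puffed through setting up the environment, and at run time it complained that the data directory does not exist. In other words, the trained weights file is missing, and what was published is a pile of useless code.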

The weights were to be "released soon". That "soon" is now four months old and they are still not out. Wonderful.

I found another codebase: https://github.com/lian112233/OCR-seal

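Compared with the previous one, this project shows a bit more good faith: at the very least it publishes a download link for the weights file, which is the biggest improvement.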

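The whole project has just 3 files. From the code I first assumed it was built on torch, then found that it also pulls in PaddlePaddle (飞桨) and yolov5 functionality. The real problem is that no virtual-environment requirements are given, which is a hassle. Besides, I only need to detect whether a stamp is present and do not care about the stamp text, so there is no need to bring in PaddlePaddle's OCR.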

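As for yolov5, I have written a few posts about it before; search for them if you are interested. The main reason I did not want to train a model myself this time is laziness.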

https://github.com/ultralytics/yolov5

After cloning the three-file project, clone yolov5:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

At this point the yolov5 dependencies are ok.

For the remaining environment dependencies, refer to the requirements below:

aliyun-python-sdk-core==2.14.0
aliyun-python-sdk-imm==1.24.0
aliyun-python-sdk-kms==2.16.2
Babel==2.14.0
backports.tarfile==1.2.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
ci-info==0.3.0
click==8.1.7
configobj==5.0.8
configparser==7.1.0
contourpy==1.2.1
crcmod==1.7
cryptography==42.0.4
cycler==0.12.1
docutils==0.21.2
docxcompose==1.4.0
docxtpl==0.16.7
etelemetry==0.3.1
filelock==3.15.4
fonttools==4.53.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
httplib2==0.22.0
idna==3.6
importlib_metadata==8.4.0
importlib_resources==6.4.3
isodate==0.6.1
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.0.2
Jinja2==3.1.3
jmespath==0.10.0
keyring==25.3.0
kiwisolver==1.4.5
looseversion==1.3.0
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplot==0.1.9
matplotlib==3.9.2
mdurl==0.1.2
more-itertools==10.4.0
mpmath==1.3.0
networkx==3.2.1
nh3==0.2.18
nibabel==5.2.1
nipype==1.8.6
numpy==1.26.4
opencv-python==4.10.0.84
oss2==2.18.4
packaging==24.1
pandas==2.2.2
pathlib==1.0.1
pillow==10.4.0
pkginfo==1.10.0
prov==2.0.1
psutil==6.0.0
py-cpuinfo==9.0.0
pycparser==2.21
pycryptodome==3.20.0
pydot==3.0.1
Pygments==2.18.0
pyloco==0.0.139
PyMuPDF==1.24.9
PyMuPDFb==1.24.9
pyparsing==3.1.2
PyPDF2==3.0.1
python-dateutil==2.9.0.post0
python-docx==1.1.0
pytz==2024.1
pyxnat==1.6.2
PyYAML==6.0.2
rdflib==6.3.2
readme_renderer==44.0
requests==2.31.0
requests-toolbelt==1.0.0
rfc3986==2.0.0
rich==13.7.1
scipy==1.13.1
seaborn==0.13.2
simplejson==3.19.3
SimpleWebSocketServer==0.1.2
six==1.16.0
smmap==5.0.1
sympy==1.13.2
thop==0.1.1.post2209072238
torch==2.4.0
torchvision==0.19.0
tqdm==4.66.5
traits==6.3.2
twine==5.1.1
typing==3.7.4.3
typing_extensions==4.9.0
tzdata==2024.1
ultralytics==8.2.79
ultralytics-thop==2.0.5
urllib3==2.2.1
ushlex==0.99.1
websocket-client==1.8.0
zipp==3.20.0

As for the detection part, it does not need to be that complicated either. Just write a new function:

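import os

import cv2
import torch

# Assumed values: the post does not show these definitions, adjust to your layout.
repo = 'yolov5'          # path to the local yolov5 clone
model_path = 'best.pt'   # downloaded weights file (hypothetical file name)
IMG_FORMATS = {'bmp', 'jpg', 'jpeg', 'png', 'tif', 'tiff', 'webp'}

# Load the custom weights through the local repo
model = torch.hub.load(repo, 'custom', path=model_path,
                       source='local')  # local repo


def resize_img(img, img_size=640):
    # Assumed helper (not shown in the post): scale the shorter side to img_size
    h, w = img.shape[:2]
    scale = img_size / min(h, w)
    return cv2.resize(img, (round(w * scale), round(h * scale)))


def predict(source='train', img_size=640):
    files = []

    if os.path.isdir(source):
        files = sorted(os.path.join(source, x) for x in os.listdir(source))  # dir
    elif os.path.isfile(source):
        files = [source]

    # Compare the real file extension instead of splitting on '.'
    images = [x for x in files if os.path.splitext(x)[1][1:].lower() in IMG_FORMATS]

    for path in images:
        print("Current pic: " + path)
        img = cv2.imread(path)
        if img is None:  # unreadable or corrupt file
            continue
        img = resize_img(img, img_size)
        # cv2 loads BGR, while the hub model treats numpy input as RGB
        result = model(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        result_pd = result.pandas()

        xyxy = result_pd.xyxy[0]  # one row per detected stamp
        print('result=', result)
        print(result_pd.names)
        print('xy=', xyxy)
        print('count=', len(xyxy))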
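
Since all I care about is whether a stamp is present, the printed detections can be collapsed into a yes/no answer. Below is a minimal sketch of that step, not the code from my project. It assumes the rows come from something like `result.pandas().xyxy[0].to_dict('records')`, where each row has a `name` and a `confidence` field. The function name `has_stamp`, the `min_conf=0.5` default, and the sample numbers are all made up for illustration.

```python
from typing import Iterable, Mapping


def has_stamp(detections: Iterable[Mapping], min_conf: float = 0.5,
              label: str = 'stamp') -> bool:
    """Return True if at least one row is a 'stamp' with enough confidence."""
    return any(
        d.get('name') == label and float(d.get('confidence', 0.0)) >= min_conf
        for d in detections
    )


# Illustrative rows only (invented values), shaped like YOLOv5's pandas output
rows = [
    {'xmin': 412.0, 'ymin': 300.5, 'xmax': 590.2, 'ymax': 470.1,
     'confidence': 0.91, 'class': 0, 'name': 'stamp'},
    {'xmin': 20.0, 'ymin': 15.0, 'xmax': 60.0, 'ymax': 48.0,
     'confidence': 0.12, 'class': 0, 'name': 'stamp'},
]

print(has_stamp(rows))      # True: the first row passes the threshold
print(has_stamp(rows[1:]))  # False: only the low-confidence row is left
```

The threshold is a judgment call: set it too low and smudges pass as stamps, too high and faint stamps get rejected, so it is worth tuning against real uploads.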

Actual detection results:

/Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/venv/bin/Python /Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/stamp_detection.py 
/Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/venv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
YOLOv5  v7.0-356-g2070b303 Python-3.9.6 torch-2.4.0 CPU

Fusing layers... 
YOLOv5m summary: 308 layers, 21037638 parameters, 0 gradients
Adding AutoShape... 
/Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/yolov5/models/common.py:869: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(autocast):
Current pic: /Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/yolov5/test/20240103-182329.jpeg
result= image 1/1: 640x1137 (no detections)
Speed: 1.8ms pre-process, 153.8ms inference, 0.2ms NMS per image at shape (1, 3, 384, 640)
{0: 'stamp'}
count= 0
Current pic: /Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/yolov5/test/WechatIMG5.jpg
/Users/zhongling/PycharmProjects/djangoProject/LaoshanReport/yolov5/models/common.py:869: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with amp.autocast(autocast):
result= image 1/1: 640x905 1 stamp
Speed: 1.8ms pre-process, 202.5ms inference, 0.6ms NMS per image at shape (1, 3, 480, 640)
{0: 'stamp'}
count= 1
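
With per-image counts like the ones in the log above, the "block it at the source" check could look roughly like the sketch below. This is an assumption about how the gate might be wired up, not my production code: the function names are hypothetical, and whether every uploaded image must carry a stamp or only at least one of them depends on the business rule, so it is left as a parameter.

```python
from typing import Dict, List


def unstamped_files(counts: Dict[str, int]) -> List[str]:
    """Return the files whose detection count is zero, in input order."""
    return [name for name, n in counts.items() if n == 0]


def can_finish(counts: Dict[str, int], require_all: bool = True) -> bool:
    """Decide whether the upload may complete the workflow.

    require_all=True  -> every image must contain a stamp (assumed rule)
    require_all=False -> one stamped image in the upload is enough
    """
    if not counts:  # nothing uploaded, nothing to approve
        return False
    missing = unstamped_files(counts)
    if require_all:
        return not missing
    return len(missing) < len(counts)


# Counts taken from the two images in the log above
counts = {'20240103-182329.jpeg': 0, 'WechatIMG5.jpg': 1}

print(unstamped_files(counts))                # ['20240103-182329.jpeg']
print(can_finish(counts))                     # False
print(can_finish(counts, require_all=False))  # True
```

Returning the list of unstamped files, rather than just a flag, makes it easy to tell the user exactly which image to re-upload.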