1. Function
#!/usr/local/bin/python3
# -*- coding: utf-8 -*-
'''
Date: Nov.10.2020
Author: Xiao-Le Deng
Function: extract all DOIs from the input text
Input: text (str)
Output: list of DOI strings
'''
import re
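def extract_doi(text):
    # Prefix "10.", a registrant code of 4+ digits (optionally dotted),
    # a slash, then a suffix of non-space characters excluding " & ' < >.
    DOI_REGEX = r"""\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b"""
    # With a single capture group, findall returns a list of the captured strings.
    res = re.findall(DOI_REGEX, text, re.IGNORECASE)
    return res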
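A minimal usage sketch of the function, repeated here so the snippet runs on its own. The sample sentence is made up for illustration and is not from the original note. It suggests that the trailing \b lets the match stop before sentence-final punctuation.

```python
import re

def extract_doi(text):
    DOI_REGEX = r"""\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b"""
    return re.findall(DOI_REGEX, text, re.IGNORECASE)

# Hypothetical sample text, used only for illustration.
sample = "See https://doi.org/10.1000/xyz123 and doi:10.1234/abc.def."
dois = extract_doi(sample)
print(dois)  # ['10.1000/xyz123', '10.1234/abc.def']
```

The result may contain duplicates if the same DOI appears more than once in the text; deduplicating is left to the caller.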
2. DOI regular expression patterns
\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b
\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])[[:graph:]])+)\b
/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
(10.(\d)+/([^(\s\>\"\<)])+)
(10[.][0-9]{4,}[^\s"/<>]*/[^\s"<>]+)
\b10\.(\d+\.*)+[\/](([^\s\.])+\.*)+\b
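Not every pattern in the list can be pasted into Python unchanged. The [[:graph:]] variant uses a POSIX character class, which Python's re module does not support, and the /.../i variant is written as a regex literal with a case-insensitive flag. Below is a rough sketch of how the anchored /.../i pattern might be carried over to Python to check a whole string, as opposed to searching inside longer text. The names ANCHORED_DOI and is_doi are hypothetical and only for illustration.

```python
import re

# The /.../i literal rewritten as a Python pattern plus re.IGNORECASE.
# The dot after "10" is left unescaped, as in the list above, so it
# matches any character, not only a literal ".".
ANCHORED_DOI = re.compile(r"^10.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

def is_doi(candidate):
    """Return True if the whole string looks like a DOI (hypothetical helper)."""
    return ANCHORED_DOI.match(candidate) is not None

print(is_doi("10.1000/xyz123"))      # True
print(is_doi("see 10.1000/xyz123"))  # False: ^ requires the DOI at the start
print(is_doi("10.12/abc"))           # False: fewer than 4 digits after "10."
```

Because of the ^ and $ anchors, this form suits validating a single field, while the \b-delimited patterns suit extraction from running text.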
3. References