
    Introduction to spaCy, a Natural Language Processing Toolkit

    Posted by 52nlp on 2016-11-12 15:23:55

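    spaCy is a Python natural language processing toolkit that first appeared in mid-2014. It bills itself as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it practical for production use.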

    Installing and building spaCy is straightforward. On Ubuntu, install the build prerequisites and then install spaCy directly with pip:

    sudo apt-get install build-essential python-dev git
    sudo pip install -U spacy

    After installation you still need to download the model data. Taking the English models as an example, the "all" argument downloads everything:

    sudo python -m spacy.en.download all

    Alternatively, you can download the models and the pre-trained GloVe word vectors separately:

    # Downloads the English models for tokenization, POS tagging, dependency parsing, and named entity recognition
    python -m spacy.en.download parser

    # Downloads the pre-trained GloVe word vectors
    python -m spacy.en.download glove

    The downloaded data is stored in the data directory under the spaCy installation path. On my Ubuntu machine, for example:

    textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
    776M    en-1.1.0
    774M    en_glove_cc_300_1m_vectors-1.0.0

    Inside the English model directory:

    textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
    424M    deps
    8.0K    meta.json
    35M     ner
    12M     pos
    84K     tokenizer
    300M    vocab
    6.3M    wordnet

    You can check that the model data was installed correctly with the following command:

    textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
    OK
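
The same check can be wrapped in a small function that reports failure instead of printing a traceback. This is only a sketch: the helper name `model_available` is made up for illustration, and because the exception raised for a missing model differs between spaCy versions, it simply treats any error as "not available".

```python
def model_available(name="en"):
    """Return True if spaCy can be imported and spacy.load(name) succeeds.

    Hypothetical helper, not part of spaCy. Any exception (spaCy not
    installed, model data missing, unknown model name) counts as
    "not available", since the exact error type varies by version.
    """
    try:
        import spacy
        spacy.load(name)
    except Exception:
        return False
    return True


# Computed once so the result can be inspected or printed.
english_ready = model_available("en")
```

Calling `print(english_ready)` then shows whether the one-liner above would have printed OK on this machine.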

    You can also run the test suite with pytest:

    # First, find the spaCy installation path:
    python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
    /usr/local/lib/python2.7/dist-packages/spacy

    # Then install pytest:
    sudo python -m pip install -U pytest

    # Finally, run the tests:
    python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
    ============================= test session starts ==============================
    platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
    rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
    collected 318 items

    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....
    ......
    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
    ../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

    ============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

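    Now we can quickly try out spaCy's features, using the English data as an example. spaCy currently supports mainly English and German, and support for other languages is being added gradually: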

    textminer@textminer:~$ ipython
    Python 2.7.12 (default, Jul  1 2016, 15:12:24)
    Type "copyright", "credits" or "license" for more information.

    IPython 2.4.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: import spacy          

    # Load the English model data; this takes a moment
    In [2]: nlp = spacy.load('en')

    Word tokenization. spaCy 1.2 added a Chinese tokenization interface based on the Jieba Chinese word segmenter:

    In [3]: test_doc = nlp(u"it's word tokenize test for spacy")            

    In [4]: print(test_doc)
    it's word tokenize test for spacy

    In [5]: for token in test_doc:
       ...:     print(token)
       ...:
    it
    's
    word
    tokenize
    test
    for
    spacy
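
To see why the contraction does not stay a single token the way it would with a plain whitespace split, here is a toy regex tokenizer. It is not spaCy's algorithm (spaCy uses language-specific rules and exception tables); the function name `toy_tokenize` and the pattern are assumptions made purely for illustration.

```python
import re

# Toy pattern: the clitic 's, then runs of word characters, then any
# single punctuation character. Order matters: 's is tried first.
TOKEN_PATTERN = re.compile(r"'s|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, clitic, and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


sentence = "it's word tokenize test for spacy"
whitespace_tokens = sentence.split()      # keeps "it's" as one piece
regex_tokens = toy_tokenize(sentence)     # separates the clitic from "it"
```

Comparing `len(whitespace_tokens)` with `len(regex_tokens)` shows the extra token that appears once the contraction is split.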

    English sentence segmentation:

    In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

    In [7]: for sent in test_doc.sents:
       ...:     print(sent)
       ...:
    Natural language processing (NLP) deals with the application of computational models to text or speech data.
    Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
    NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
    From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.


    Lemmatization:

    In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

    In [9]: for token in test_doc:
       ...:     print(token, token.lemma_, token.lemma)
       ...:
    (you, u'you', 472)
    (are, u'be', 488)
    (best, u'good', 556)
    (., u'.', 419)
    (it, u'it', 473)
    (is, u'be', 488)
    (lemmatize, u'lemmatize', 1510296)
    (test, u'test', 1351)
    (for, u'for', 480)
    (spacy, u'spacy', 173783)
    (., u'.', 419)
    (I, u'i', 570)
    (love, u'love', 644)
    (these, u'these', 642)
    (books, u'book', 1011)
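
In the output above, token.lemma_ is the lemma as a string and token.lemma is an integer ID for that same string, which is why both occurrences of 'be' share the ID 488. The same pairing shows up with pos_/pos and label_/label. The toy class below sketches the idea of interning strings as integers. `ToyStringStore` is a hypothetical name, its IDs are simple insertion indices, and spaCy's real string store and ID values work differently.

```python
class ToyStringStore(object):
    """Map each distinct string to a stable integer ID (illustrative only)."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def add(self, string):
        """Return the ID for string, assigning a new one on first sight."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, string_id):
        """Return the string that was assigned string_id."""
        return self._strings[string_id]


store = ToyStringStore()
lemmas = ["you", "be", "good", ".", "it", "be"]
lemma_ids = [store.add(lemma) for lemma in lemmas]
```

Here the two occurrences of "be" receive the same ID, and `store.lookup(lemma_ids[1])` gives back the string, mirroring how the integer and string attributes relate.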

    Part-of-speech (POS) tagging:

    In [10]: for token in test_doc:
       ....:     print(token, token.pos_, token.pos)
       ....:
    (you, u'PRON', 92)
    (are, u'VERB', 97)
    (best, u'ADJ', 82)
    (., u'PUNCT', 94)
    (it, u'PRON', 92)
    (is, u'VERB', 97)
    (lemmatize, u'ADJ', 82)
    (test, u'NOUN', 89)
    (for, u'ADP', 83)
    (spacy, u'NOUN', 89)
    (., u'PUNCT', 94)
    (I, u'PRON', 92)
    (love, u'VERB', 97)
    (these, u'DET', 87)
    (books, u'NOUN', 89)

    Named entity recognition (NER):

    In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

    In [12]: for ent in test_doc.ents:
       ....:     print(ent, ent.label_, ent.label)
       ....:
    (Rami Eid, u'PERSON', 346)
    (Stony Brook University, u'ORG', 349)
    (New York, u'GPE', 350)
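
A common next step is to group the recognized entities by label. The sketch below works on plain (text, label) pairs copied from the output above, so it runs without spaCy. The helper name `group_by_label` is made up for illustration; with a real document, the pairs could presumably be built from each entity's text and its label_ attribute.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Pairs taken from the NER output shown above.
entities = [
    ("Rami Eid", "PERSON"),
    ("Stony Brook University", "ORG"),
    ("New York", "GPE"),
]
by_label = group_by_label(entities)
```

Looking up `by_label["ORG"]` then returns the list of organization names found in the sentence.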

    Noun phrase extraction:

    In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')


    In [14]: for np in test_doc.noun_chunks:
       ....:     print(np)
       ....:
    Natural language processing
    Natural language processing (NLP) deals
    the application
    computational models
    text
    speech
    data
    Application areas
    NLP
    automatic (machine) translation
    languages
    dialogue systems
    a human
    a machine
    natural language
    information extraction
    the goal
    unstructured text
    structured (database) representations
    flexible ways
    NLP technologies
    a dramatic impact
    the way
    people
    computers
    the way
    people
    the use
    language
    the way
    people
    the vast amount
    linguistic data
    electronic form
    a scientific viewpoint
    NLP
    fundamental questions
    formal models
    example
    natural language phenomena
    algorithms
    these models
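
Several chunks in this list repeat, for example "the way" and "people". As a rough sketch of how the extracted phrases might be used, the snippet below counts repeated chunks with the standard library. The list is a hand-copied excerpt of the output above (the chunks from the third sentence), not something produced by running spaCy here.

```python
from collections import Counter

# Excerpt of the noun chunks printed above (third sentence only).
chunks = [
    "NLP technologies", "a dramatic impact", "the way", "people",
    "computers", "the way", "people", "the use", "language",
    "the way", "people", "the vast amount", "linguistic data",
    "electronic form",
]

counts = Counter(chunks)

# Keep only the phrases that occur more than once.
repeated = {chunk: n for chunk, n in counts.items() if n > 1}
```

With real spaCy output, the input list could presumably be built by taking the text of each item in test_doc.noun_chunks.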

    Computing the similarity of two words from their word vectors:

    In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

    In [16]: apples = test_doc[0]

    In [17]: print(apples)
    Apples

    In [18]: oranges = test_doc[2]

    In [19]: print(oranges)
    oranges

    In [20]: boots = test_doc[6]

    In [21]: print(boots)
    Boots

    In [22]: hippos = test_doc[8]

    In [23]: print(hippos)
    hippos

    In [24]: apples.similarity(oranges)
    Out[24]: 0.77809414836023805

    In [25]: boots.similarity(hippos)
    Out[25]: 0.038474555379008429
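
By default, spaCy's similarity is the cosine of the angle between the two word vectors. The sketch below computes that quantity in plain Python on three-dimensional toy vectors. The vectors and the names `toy_apples`, `toy_oranges`, and `toy_boots` are invented for illustration and have nothing to do with the real 300-dimensional GloVe vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up toy vectors: the first two point in nearly the same
# direction, the third is orthogonal to the first.
toy_apples = [1.0, 2.0, 0.5]
toy_oranges = [0.9, 2.1, 0.4]
toy_boots = [-2.0, 1.0, 0.0]

close = cosine_similarity(toy_apples, toy_oranges)
far = cosine_similarity(toy_apples, toy_boots)
```

Vectors pointing in similar directions score close to 1 and orthogonal vectors score 0, which matches the pattern in the session above: a high value for Apples and oranges, a value near zero for Boots and hippos.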

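    Of course, spaCy also offers dependency parsing and other features. It is also worth noting that, starting with version 1.0, spaCy added support for deep learning tools such as TensorFlow and Keras. For details, see the sentiment analysis example in the official documentation: Hooking a deep learning model into spaCy.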

    References:
    spaCy official documentation
    Getting Started with spaCy

    Note: this is an original article. When reposting, please credit the source and keep the link to "我爱自然语言处理" (52nlp): http://www.52nlp.cn

    Permalink: Introduction to spaCy, a Natural Language Processing Toolkit http://www.52nlp.cn/?p=9386
