development

nltk.data.load로 english.pickle을 (를) 불러 오지 못했습니다.

big-blog 2020. 7. 2. 07:16
반응형

nltk.data.load로 english.pickle을 (를) 불러 오지 못했습니다.


punkt토크 나이저 를로드하려고 할 때 ...

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

... LookupError가 제기되었습니다.

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

나는이 같은 문제가 있었다. 파이썬 쉘로 이동하여 다음을 입력하십시오.

>>> import nltk
>>> nltk.download()

그런 다음 설치 창이 나타납니다. '모델'탭으로 이동하여 '식별자'열에서 '펑크'를 선택하십시오. 그런 다음 다운로드를 클릭하면 필요한 파일이 설치됩니다. 그런 다음 작동합니다!


import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

토크 나이저 사용 :)


이것이 바로 지금 저에게 효과적이었습니다.

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

votes_tokenized는 토큰 목록의 목록입니다.

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

"Mining the Social Web, 2nd Edition"책과 함께 제공되는 예제 ipython 노트북 에서 문장을 가져 왔습니다.


bash 명령 행에서 다음을 실행하십시오.

$ python -c "import nltk; nltk.download('punkt')"

이것은 나를 위해 작동합니다 :

>>> import nltk
>>> nltk.download()

Windows에서는 NLTK 다운로더도 제공됩니다.

NLTK 다운로더


Simple nltk.download() will not solve this issue. I tried the below and it worked for me:

in the nltk folder create a tokenizers folder and copy your punkt folder into tokenizers folder.

This will work.! the folder structure needs to be as shown in the picture!1


nltk have its pre-trained tokenizer models. Model is downloading from internally predefined web sources and stored at path of installed nltk package while executing following possible function calls.

E.g. 1 tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2 nltk.download('punkt')

If you call above sentence in your code, Make sure you have internet connection without any firewall protections.

I would like to share some more better alter-net way to resolve above issue with more better deep understandings.

Please follow following steps and enjoy english word tokenization using nltk.

Step 1: First download the "english.pickle" model following web path.

Goto link "http://www.nltk.org/nltk_data/" and click on "download" at option "107. Punkt Tokenizer Models"

Step 2: Extract the downloaded "punkt.zip" file and find the "english.pickle" file from it and place in C drive.

Step 3: copy paste following code and execute.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])

    wordToken.append(subSentToken)

for token in wordToken:
    print token

Let me know, if you face any problem


On Jenkins this can be fixed by adding following like of code to Virtualenv Builder under Build tab:

python -m nltk.downloader punkt

여기에 이미지 설명을 입력하십시오


i came across this problem when i was trying to do pos tagging in nltk. the way i got it correct is by making a new directory along with corpora directory named "taggers" and copying max_pos_tagger in directory taggers.
hope it works for you too. best of luck with it!!!.


In Spyder, go to your active shell and download nltk using below 2 commands. import nltk nltk.download() Then you should see NLTK downloader window open as below, Go to 'Models' tab in this window and click on 'punkt' and download 'punkt'

창문


Check if you have all NLTK libraries.


The punkt tokenizers data is quite large at over 35 MB, this can be a big deal if like me you are running nltk in an environment such as lambda that has limited resources.

If you only need one or perhaps a few language tokenizers you can drastically reduce the size of the data by only including those languages .pickle files.

If all you only need to support English then your nltk data size can be reduced to 407 KB (for the python 3 version).

Steps

  1. Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. Somewhere in your environment create the folders: nltk_data/tokenizers/punkt, if using python 3 add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case I created these folders at the root of my project.
  3. Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder. With your language files loaded it should look something like: example-folder-stucture
  4. Now you just need to add your nltk_data folder to the search paths, assuming your data is not in one of the pre-defined search paths. You can add your data using either the environment variable NLTK_DATA='path/to/your/nltk_data'. You can also add a custom path at runtime in python by doing:
from nltk import data
data.path += ['/path/to/your/nltk_data']

참고 : 런타임시 데이터를로드하거나 코드와 함께 데이터를 번들로 묶을 필요가없는 경우 nltk가 찾는 내장 위치에nltk_data 폴더 를 만드는 것이 가장 좋습니다 .


nltk.download()이 문제는 해결되지 않습니다. 나는 아래를 시도했고 그것은 나를 위해 일했다 :

'...AppData\Roaming\nltk_data\tokenizers'폴더를 추출 다운로드 punkt.zip같은 위치에 폴더.

참고 URL : https://stackoverflow.com/questions/4867197/failed-loading-english-pickle-with-nltk-data-load

반응형