Python NLTK Snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev


I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below.

    stemmer = EnglishStemmer()
    # Stem, lowercase, substitute punctuation, remove stopwords.
    attribute_names = [stemmer.stem(token.lower())
                       for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc))
                       if token.lower() not in stopwords.words('english')]

When I run this on the documents using PyDev in Eclipse, I receive no errors. When I run it in the terminal (Mac OS X), I receive the error below. Can someone please help?

      File "data_processing.py", line 171, in __filter__
        attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
      File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

This works in PyDev because it configures Python to work in the encoding of the console (which is UTF-8).

You can reproduce the same error in PyDev if you go to the run configuration (Run > Run Configurations) and, on the 'Common' tab, say that you want the encoding to be ASCII.
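To compare the two environments, it can help to print the encodings Python reports in each of them. This is only a diagnostic sketch: `encoding_report` is a hypothetical helper name, and which of these values actually differs between PyDev and your terminal is something you would need to check on your own machine.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that commonly differ between an IDE and a terminal."""
    return {
        # Encoding used for implicit str <-> unicode coercion in Python 2.
        "default": sys.getdefaultencoding(),
        # Encoding of the console; may be None when output is piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(encoding_report().items()):
    print("%-10s %s" % (name, value))
```

Running the same script from Eclipse and from the terminal and comparing the two outputs should show where the environments disagree.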

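This happens because your word is a byte string (`str`), and you're replacing unicode chars in it. To do that, Python 2 first has to decode the byte string implicitly using the default encoding, and that decoding fails on any non-ASCII byte.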

I hope the code below sheds some light on this for you.

This is considering ASCII as the default encoding:

    >>> 'íã'.replace(u"\u2019", u"\x27")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)
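The implicit step can also be written out explicitly, which makes the failure easier to see. The sketch below is my own illustration, not code from NLTK: `coerce_like_python2` is a hypothetical name that mimics the decode Python 2 performs behind the scenes when a byte string meets a unicode argument, and the sample word is made up.

```python
# Bytes as they might arrive from a UTF-8 encoded document: the word
# contains U+2019 (right single quotation mark), which is three bytes in UTF-8.
raw = u"don\u2019t".encode("utf-8")


def coerce_like_python2(data, default_encoding="ascii"):
    """Decode a byte string the way Python 2 does implicitly (assumed default: ASCII)."""
    return data.decode(default_encoding)


try:
    coerce_like_python2(raw)
except UnicodeDecodeError as exc:
    error = exc  # the same exception type the stemmer raised
else:
    error = None

print(error)

# With the correct encoding the same bytes decode without trouble.
print(repr(coerce_like_python2(raw, "utf-8")))
```

The only difference between the failing and the working call is the encoding used for the decode, which is exactly what changes between the terminal and PyDev.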

But if you do everything in unicode, it works (you may need to encode afterwards to the encoding you expect, if you expect to deal with strings and not unicode).

    >>> u'íã'.replace(u"\u2019", u"\x27")
    u'\xed\xe3'

So, you can make your string unicode before the replace:

    >>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
    u'\xed\xe3'
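Applied to the stemming code, that could look roughly like the sketch below. The helper names `to_unicode` and `normalize_apostrophe` are hypothetical, UTF-8 is only an assumed encoding for the documents, and the replace call copies the line shown in the traceback rather than calling NLTK itself.

```python
def to_unicode(token, encoding="utf-8"):
    """Return token as unicode text, decoding it first if it is a byte string.

    The UTF-8 default is an assumption; pass the real encoding of your data.
    """
    if isinstance(token, bytes):  # `bytes` is an alias of `str` on Python 2
        return token.decode(encoding)
    return token


def normalize_apostrophe(token, encoding="utf-8"):
    """Same replace as in snowball.py's stem(), but always done on unicode."""
    return to_unicode(token, encoding).replace(u"\u2019", u"\x27")


raw_token = u"don\u2019t".encode("utf-8")
print(repr(normalize_apostrophe(raw_token)))
```

In the list comprehension from the question, this would amount to passing `to_unicode(token).lower()` to `stemmer.stem`, or, probably simpler, decoding `doc` once before tokenizing it.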

Or you can encode the replace chars:

    >>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
    '\xa1\xc6'

Note that you must know the actual encoding you're working with at each place (so, although I'm using cp850 or utf-8 in the examples, the encodings you have to use may be different).
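To illustrate why the encoding matters when replacing at the byte level, here is a small sketch. `replace_quote_bytes` is a hypothetical helper, and the cp1252 case is just an assumed example of data that is not UTF-8.

```python
QUOTE = u"\u2019"       # right single quotation mark
APOSTROPHE = u"\x27"    # plain ASCII apostrophe


def replace_quote_bytes(data, encoding):
    """Replace the quote inside a byte string, encoding the pattern to match."""
    return data.replace(QUOTE.encode(encoding), APOSTROPHE.encode(encoding))


utf8_word = u"don\u2019t".encode("utf-8")      # quote stored as 3 bytes
cp1252_word = u"don\u2019t".encode("cp1252")   # quote stored as 1 byte

right_guess = replace_quote_bytes(cp1252_word, "cp1252")
wrong_guess = replace_quote_bytes(cp1252_word, "utf-8")

print(repr(replace_quote_bytes(utf8_word, "utf-8")))
print(repr(right_guess))
print(repr(wrong_guess))  # unchanged: the UTF-8 pattern never matches
```

A wrong guess here does not raise an error; the pattern simply never matches and the quote is left in place, which is one reason decoding to unicode up front tends to be the safer route.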

