Python NLTK snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev

February 15, 2012

I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below.

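    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem.snowball import EnglishStemmer
    from nltk.tokenize import wordpunct_tokenize

    stemmer = EnglishStemmer()
    # Stem, lowercase, strip punctuation, remove stopwords.
    # doc is the raw text of one document.
    attribute_names = [stemmer.stem(token.lower())
                       for token in wordpunct_tokenize(
                           re.sub('[%s]' % re.escape(string.punctuation), '', doc))
                       if token.lower() not in stopwords.words('english')]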

When I run this on documents using PyDev in Eclipse, I receive no errors. When I run it in the terminal (Mac OS X), I receive the error below. Can anyone please help?

      File "data_processing.py", line 171, in __filter__
        attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
      File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

This works in PyDev because it configures Python to work in the encoding of the console (which is UTF-8).

You can reproduce the same error in PyDev if you go to the run configuration (Run > Run Configurations) and, on the 'Common' tab, say that you want the encoding to be ASCII.
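To compare the two environments directly, it may help to print the encodings Python reports in each one. This is only a minimal sketch: the helper name describe_encodings is made up for illustration, and which of these values PyDev actually changes is an assumption you should check on your own setup.

```python
from __future__ import print_function

import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for the current environment."""
    return {
        # Codec used for implicit str <-> unicode conversions in Python 2.
        "default": sys.getdefaultencoding(),
        # Encoding of the console stream; may be None when output is piped.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding suggested by the user's locale settings.
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%-8s %s" % (name, value))
```

Running the same script once from the terminal and once from PyDev should show where the two setups differ.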

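This happens because your word is a byte string (str) and you're replacing unicode chars in it, so Python 2 first tries to decode the byte string using the default encoding (ASCII).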

I hope the code below sheds some light on this for you:

This is considering ASCII as the default encoding:

    >>> 'íã'.replace(u"\u2019", u"\x27")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)
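The decode step that Python 2 performs behind the scenes can also be written out explicitly, which makes it easier to see where the message comes from. The sample bytes below are hypothetical (they are not taken from the original documents); they were chosen so that a 0xc2 byte sits at index 7, as in the traceback from the question.

```python
from __future__ import print_function

# Hypothetical sample: UTF-8 bytes for "section§"; the byte 0xc2 is at index 7.
raw = b"section\xc2\xa7"

try:
    # Python 2 attempts this decode implicitly when a byte string is
    # combined with unicode arguments and the default encoding is ASCII.
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The printed message has the same form as the one in the question, which suggests the documents contain UTF-8 encoded non-ASCII characters.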

But if you do it all in unicode, it works (you may need to encode the result afterwards to the encoding you expect, if you expect to deal with strings and not unicode).

    >>> u'íã'.replace(u"\u2019", u"\x27")
    u'\xed\xe3'

So, you can make your string unicode before the replace:

    >>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
    u'\xed\xe3'

Or you can encode the replace chars:

    >>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
    '\xa1\xc6'

Note that you must know the actual encoding you're working with in each place (so, although I'm using cp850 or utf-8 in the examples, they may be different from the encodings you have to use).
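Putting the advice together, one possible pattern is to decode each token to unicode as early as possible, do the replacement in unicode, and encode again only if byte strings are needed afterwards. The sketch below assumes the input is UTF-8. The helpers to_unicode and normalize_apostrophes are hypothetical names and are not part of NLTK; the second one only mirrors the replace call shown in the traceback.

```python
from __future__ import print_function


def to_unicode(text, encoding="utf-8"):
    """Return text as unicode, decoding byte strings explicitly."""
    if isinstance(text, bytes):  # bytes is an alias of str in Python 2
        return text.decode(encoding)
    return text


def normalize_apostrophes(word, encoding="utf-8"):
    """Replace the right single quotation mark with a plain apostrophe."""
    return to_unicode(word, encoding).replace(u"\u2019", u"\x27")


# Hypothetical sample: UTF-8 bytes for "don’t" (U+2019 is E2 80 99 in UTF-8).
raw = b"don\xe2\x80\x99t"
clean = normalize_apostrophes(raw)

print(repr(clean))
print(repr(clean.encode("utf-8")))  # encode again only if bytes are required
```

In the snippet from the question, the equivalent would be to decode doc once before tokenizing, so that every token reaching the stemmer is already unicode.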
