Python NLTK snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev
I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below.

    stemmer = EnglishStemmer()
    # Stem, lowercase, substitute all punctuation, remove stopwords.
    attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]

When I run this on the documents using PyDev in Eclipse, I receive no errors. When I run it in the terminal (Mac OS X), I receive the error below. Can someone please help?
      File "data_processing.py", line 171, in __filter__
        attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
      File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)
This works in PyDev because it configures Python to work in the encoding of the console (which is UTF-8).
You can reproduce the same error in PyDev if you go to the run configuration (Run > Run Configurations) and, on the 'Common' tab, say that you want the encoding to be ASCII.
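To compare the two environments, it may help to print the encodings each one is actually using. The snippet below is only a diagnostic sketch: `describe_encodings` is a made-up helper name, and which value differs between PyDev and the terminal is an assumption based on the explanation above, not something verified here.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings the running interpreter reports (hypothetical helper)."""
    return {
        # Used for implicit str <-> unicode conversions on Python 2.
        "default": sys.getdefaultencoding(),
        # May be None when output is piped or captured.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the locale settings suggest for text data.
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(describe_encodings().items()):
    print("%s: %s" % (name, value))
```

Running this once from PyDev and once from the terminal should show whether the "default" entry is the one that changes.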
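This happens because your word is a byte string (str) while the chars you're replacing are unicode, so Python 2 first tries to decode the word using the default encoding.
I hope the code below sheds some light on this for you.
This is with ASCII as the default encoding: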
    >>> 'íã'.replace(u"\u2019", u"\x27")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 0: ordinal not in range(128)

But if you do everything in unicode, it works (you may need to encode the result afterwards to the encoding you expect, if you expect to deal with byte strings and not unicode):

    >>> u'íã'.replace(u"\u2019", u"\x27")
    u'\xed\xe3'

So, you can make your string unicode before the replace:

    >>> 'íã'.decode('cp850').replace(u"\u2019", u"\x27")
    u'\xed\xe3'

Or you can encode the replacement chars:

    >>> 'íã'.replace(u"\u2019".encode('utf-8'), u"\x27".encode('utf-8'))
    '\xa1\xc6'

Note that you must know the actual encoding you're working with at each point (so, although I'm using cp850 or utf-8 in the examples, these may be different from the encodings you have to use).
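Applied to the question's pipeline, the "make it unicode first" option could look roughly like the sketch below. `to_text` is a hypothetical helper, and the UTF-8 default is an assumption; use whatever encoding your documents are really stored in. The code is written to run on both Python 2.7 and Python 3.

```python
def to_text(token, encoding="utf-8"):
    """Return token as unicode text, decoding byte strings explicitly.

    The default encoding is an assumption, not a property of your data.
    """
    if isinstance(token, bytes):  # on Python 2, bytes is an alias of str
        return token.decode(encoding)
    return token


# UTF-8 bytes for "don't" written with U+2019 (right single quotation mark).
raw = b"don\xe2\x80\x99t"

# The same replace the stemmer performs, now on unicode, so no implicit ASCII decode happens.
word = to_text(raw).replace(u"\u2019", u"\x27")
print(word == u"don't")
```

In the original list comprehension, this would mean decoding `doc` once (for example `doc = to_text(doc)`) before it is tokenized, so that every token handed to `stemmer.stem` is already unicode.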