django - Python doesn't interpret UTF8 correctly

I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.

I have a Django application, in which I've created a management script. This script reads some text files and outputs them to the terminal (it will do more useful things with the contents later, but I'm still testing it out), and the characters come out as escape sequences like \xc3\xa5 instead of the intended å. Since that escape sequence means Ã¥, which is a common misinterpretation of å caused by encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where; I've checked all the possible culprits I can think of:

The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8.
The text files are encoded in UTF-8; running file * in the directory where they reside lists every entry as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment with some special characters in my management script to force it to be saved as UTF-8, but that did not change the behavior. The above observations on the text files apply to the Python script files as well.
The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding it is #!/usr/bin/python3, but I've tried both changing that to .../python for Python 2.7 and removing it entirely to leave the choice to Django, without results.
According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.

I can't think of anywhere else a non-UTF-8 link in the chain could be. Where could I possibly have missed a setting that needs to change to UTF-8?

For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.

Update:

In response to questions in the comments:

print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are marked with a prepended u and contain \xe5 (the correct escape sequence for å) instead of the odd characters from before, but I can't figure out how to print them as regular strings, with no escape characters. I've tested another call to .decode('utf-8') as well as wrapping in str(), but both fail with a UnicodeEncodeError complaining that \xe5 can't be encoded in ASCII. Since a single string works correctly, I don't know what else to test.

SSCCE:

# -*- coding: utf-8 -*-

import os, sys

for root, dirs, files in os.walk('txt-songs'):
    for filename in files:
        with open(os.path.join(root, filename)) as f:
            print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)

            lines = f.readlines()
            print(lines[0].strip())  # works
            print(lines[0].strip(), lines[1].strip())  # does not work

The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.

I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.

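The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this: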

print '({}, {})'.format(lines[0].strip(), lines[1].strip())

In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each like this:

print tuple(map(str, my_tuple))

… or print the str of each directly like this:

print '({})'.format(', '.join(map(str, my_tuple)))
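The same repr-inside-a-collection behavior can be checked without a terminal. The following is a minimal sketch in Python 3 syntax (unlike the Python 2 print statements above); the sample values are assumptions chosen for illustration, written as escapes so the snippet stays ASCII-only.

```python
# Python 3 sketch; '\xe5' is å and '\xf8' is ø (illustrative sample values).
my_tuple = ('\xe5', '\xf8')

# str() of a tuple formats each element with repr(), so the quotes show up.
as_collection = str(my_tuple)

# Formatting the elements yourself uses their str() instead: no quotes.
as_text = '({})'.format(', '.join(map(str, my_tuple)))

# Compare the two representations by length rather than printing them,
# so the sketch does not depend on the terminal encoding.
print(len(as_collection), len(as_text))
```

On Python 2 the repr of a byte string additionally escapes every non-ASCII byte, which is where output such as \xc3\xa5 comes from.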

Notice that the print statements above use Python 2 syntax. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.

You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
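To make the text-versus-bytes distinction concrete, here is a small Python 3 sketch (the variable names are illustrative). It also reproduces the Ã¥ from the question: that is what the two UTF-8 bytes of å look like when they are decoded as Latin-1 instead.

```python
# Python 3 sketch: text (str) and encoded data (bytes) are separate types.
text = '\xe5'                    # one code point: å
raw = text.encode('utf-8')       # the two bytes 0xC3 0xA5

# Decoding with the right codec round-trips; the wrong one gives mojibake.
round_trip = raw.decode('utf-8')
misread = raw.decode('latin-1')  # two characters: Ã and ¥

print(len(text), len(raw), len(misread))
```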

And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it prints the two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.

If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding extra parentheses to make sure your tuples are tuples in both versions.
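The difference between the two readings of that line can be sketched in Python 3 by passing the tuple explicitly. This is only an illustration: the in-memory buffer stands in for the terminal, and on Python 2.7 the same function-call behavior would require from __future__ import print_function as the first statement of the module.

```python
import io

buf = io.StringIO()  # in-memory stand-in for the terminal

# Print function with two arguments: the strings are joined by a space.
print('a', 'b', file=buf)

# One tuple argument, which is what Python 2's print statement sees
# when given print('a', 'b'): the tuple's str(), built from reprs.
print(('a', 'b'), file=buf)

output_lines = buf.getvalue().splitlines()
print(output_lines)
```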

So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or avoid printing a tuple in the first place… but that still won't help anything.

Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is a TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.

So, the pseudocode for what happens is:

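# 3.x: the text wrapper encodes with the stream's own encoding
sys.stdout.wrapped_binary_file.write(s.encode(sys.stdout.encoding, sys.stdout.errors))

# 2.x: str() on a unicode object encodes with the default encoding
sys.stdout.write(s.encode(sys.getdefaultencoding()))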

And, as you saw, those will do different things, because:

print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
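The effect of those two encodings can be reproduced in isolation. The sketch below runs on Python 3 and uses a hypothetical helper, encode_like_stdout, that pushes text through an io.TextIOWrapper over an in-memory buffer. It is an approximation of what a text-mode stdout does, not the interpreter's actual code path.

```python
import io

def encode_like_stdout(text, encoding):
    """Write text through a TextIOWrapper and return the bytes produced."""
    binary = io.BytesIO()
    wrapper = io.TextIOWrapper(binary, encoding=encoding)
    wrapper.write(text)
    wrapper.flush()
    data = binary.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer later
    return data

# With the stream's encoding (UTF-8 in the question), å becomes two bytes.
utf8_bytes = encode_like_stdout('\xe5', 'utf-8')

# With the default encoding reported in the question (ASCII), it fails.
try:
    encode_like_stdout('\xe5', 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, ascii_failed)
```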

You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:

print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))
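As a rough sketch of the StreamWriter route: codecs.getwriter returns a writer class that encodes text before it reaches the underlying binary stream. Here an io.BytesIO stands in for a binary stdout (an assumption for illustration), and the sample text is hypothetical. The snippet is written so that it should run on both 2.7 and 3.x.

```python
import codecs
import io

raw = io.BytesIO()                       # stands in for a binary stdout
writer = codecs.getwriter('utf-8')(raw)  # encodes text as it is written

# u'\xe5' is å and u'\xf8' is ø; the u prefix keeps this text on 2.7 too.
writer.write(u'(\xe5, \xf8)\n')

written = raw.getvalue()
print(len(written))
```

Wrapping the real stream the same way means the encoding is chosen once, in one place, instead of at every print.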

But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.

If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
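For a sense of what that extra work looks like, here is one possible sketch of the file-reading part written to behave the same on 2.7 and 3.x. It relies on io.open, which accepts an encoding argument on both versions and returns text rather than bytes. The helper name, the temporary file, and its contents are all hypothetical stand-ins for the real song files.

```python
import io
import os
import shutil
import tempfile

def first_two_lines(path, encoding='utf-8'):
    """Return the first two lines of a text file, joined by a space."""
    with io.open(path, encoding=encoding) as f:
        lines = f.readlines()
    # Build one text string instead of handing print a tuple.
    return u'{} {}'.format(lines[0].strip(), lines[1].strip())

# Create a small UTF-8 sample file (contents are made up for the sketch).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'song.txt')
with io.open(sample_path, 'w', encoding='utf-8') as f:
    f.write(u'bl\xe5b\xe6r\nsm\xf8r\n')

joined = first_two_lines(sample_path)
shutil.rmtree(tmp_dir)  # clean up the temporary directory

print(len(joined))
```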

For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.
