django - Python doesn't interpret UTF8 correctly
I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.
I have a Django application, in which I've created a management script. This script reads some text files and outputs them to the terminal (it will do more useful stuff with the contents later, but I'm still testing it out), and the characters come out as escape sequences like \xc3\xa5 instead of the intended å. Since that escape sequence means Ã¥, which is a common misinterpretation of å caused by encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where. I've checked all the possible culprits I can think of:
- The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8.
- The text files are encoded in UTF-8; running file * in the directory where they reside results in all entries being listed as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
- The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment with some special characters in my management script to force it to save as UTF-8, but that did not change the behavior. The above observations on the text files apply to all the Python script files as well.
- The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding it is #!/usr/bin/python3, but I've tried both changing it to .../python for Python 2.7 and removing it entirely to leave it up to Django, without results.
- According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.
I can't think of anywhere else there could be a non-UTF-8 link in the chain. What could I possibly have missed setting to UTF-8?
For completeness: I'm reading the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
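To make the symptom concrete, here is a minimal sketch (written for Python 3, and not part of the original script) of why \xc3\xa5 corresponds to both å and Ã¥. The variable names are purely illustrative.

```python
# UTF-8 encodes "å" (U+00E5) as the two bytes 0xC3 0xA5.
utf8_bytes = "\u00e5".encode("utf-8")

# Misreading those two bytes as Latin-1 gives the familiar mojibake "Ã¥".
mojibake = utf8_bytes.decode("latin-1")

# Decoding with the matching codec restores the original character.
roundtrip = utf8_bytes.decode("utf-8")

print(utf8_bytes, mojibake, roundtrip)
```

So seeing \xc3\xa5 on screen suggests the bytes themselves are correct UTF-8, and that something is showing their escaped form rather than decoding them.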
Update:
In response to questions in the comments:
- print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
- I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the values in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are marked with a prepended u and contain \xe5 (the correct escape sequence for å) instead of the odd characters from before, but I can't figure out how to print them as regular strings, with no escape characters. I've tested another call to .decode('utf-8') as well as wrapping in str(), but both fail with a UnicodeEncodeError complaining that \xe5 can't be encoded in ascii. Since a single string works correctly, I don't know what else to test.
SSCCE:

    # -*- coding: utf-8 -*-
    import os, sys

    for root, dirs, files in os.walk('txt-songs'):
        for filename in files:
            with open(os.path.join(root, filename)) as f:
                print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)
                lines = f.readlines()
                print(lines[0].strip())  # works
                print(lines[0].strip(), lines[1].strip())  # does not work
The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.
"I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the values in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not."
The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this:

    print '({}, {})'.format(lines[0].strip(), lines[1].strip())

In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each like this:

    print tuple(map(str, my_tuple))

… or print the str of each directly like this:

    print '({})'.format(', '.join(map(str, my_tuple)))

Notice that I'm using Python 2 syntax above. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.
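The str-versus-repr behavior described above can be seen without any encoding in play. The following is a small sketch that runs under Python 3 (the rule is the same in 2.x); the Loud class is a hypothetical helper invented for this illustration, whose str and repr differ visibly.

```python
class Loud(object):
    """Hypothetical helper whose str() and repr() are easy to tell apart."""

    def __str__(self):
        return "via str"

    def __repr__(self):
        return "via repr"


item = Loud()

# str() of a single object uses its __str__ ...
as_single = str(item)

# ... but str() of a tuple formats each element with repr().
as_tuple = str((item, item))

# Joining the str() of each element explicitly avoids the repr.
joined = "({})".format(", ".join(map(str, (item, item))))

print(as_single)
print(as_tuple)
print(joined)
```

With Python 2 unicode objects the repr is the escaped u'\xe5' form, which is why the tuple output looks so different from the output for a single string.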
You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.
If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.
So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or avoid printing a tuple in the first place… but that still won't help anything.
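As a rough illustration of the difference, the sketch below runs under Python 3 and captures what the print function emits for two arguments versus a single tuple argument. The render helper is a hypothetical name introduced only for this example.

```python
import io

# On Python 2.6+, putting `from __future__ import print_function` at the
# very top of the file makes print behave as the function used here.


def render(*args):
    """Return exactly what print(*args) would write, as a string."""
    buf = io.StringIO()
    print(*args, file=buf)
    return buf.getvalue()


# Two separate arguments: each is converted with str(), joined by a space.
two_args = render("a", "b")

# One tuple argument: the tuple's own str() is printed, repr of elements included.
one_tuple = render(("a", "b"))

print(repr(two_args))
print(repr(one_tuple))
```

The second form is what the Python 2 print statement receives when the code is written as print(x, y).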
Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is a TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.
So, the pseudocode for what ultimately happens is:

    # Python 3: the text wrapper encodes with the stream's own encoding
    sys.stdout.buffer.write(s.encode(sys.stdout.encoding, sys.stdout.errors))

    # Python 2: str() on a unicode object encodes with the default encoding
    sys.stdout.write(s.encode(sys.getdefaultencoding()))

And, as you saw, those will do different things, because:
"print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)"
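The practical difference between those two encodings is easy to reproduce. This is a minimal sketch; the two codec names are taken from the output quoted above, and the variable names are illustrative.

```python
text = u"\xe5"  # the character å

# Encoding with the terminal's encoding (UTF-8) succeeds.
utf8_result = text.encode("utf-8")

# Encoding with Python 2's default encoding (ASCII) cannot represent å.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_result, ascii_failed)
```

This is the same UnicodeEncodeError reported in the question when str() was applied to the decoded strings.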
You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:

    print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))

But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.
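For the io.TextIOWrapper option mentioned above, a minimal sketch might look like the following. It wraps an in-memory io.BytesIO as a stand-in for a binary stdout so that the written bytes can be inspected; in a real script the wrapped object would be the actual binary output stream, which is an assumption about your setup rather than something shown in the question.

```python
import io

# Stand-in for a binary output stream such as Python 2's sys.stdout.
raw = io.BytesIO()

# The wrapper accepts text and encodes it on the way out.
# newline="\n" keeps line endings untranslated on every platform.
wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

wrapper.write(u"(\xe5, \xe5)\n")
wrapper.flush()  # push buffered text through to the underlying stream

written = raw.getvalue()
print(written)
```

The stream receives properly encoded UTF-8 bytes, and no explicit .encode() call is needed at each print site.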
If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.
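As one possible starting point for code that behaves the same in both versions, the sketch below reads a file through io.open with an explicit encoding, so the lines are text (unicode in 2.x, str in 3.x) either way. The first_lines helper and the throwaway demo file are assumptions made for this illustration, not part of the original script.

```python
import io
import os
import tempfile


def first_lines(path, count=2):
    """Return the first `count` stripped lines of a UTF-8 text file as text."""
    # io.open takes an encoding argument on both Python 2.6+ and Python 3.
    with io.open(path, encoding="utf-8") as f:
        return [line.strip() for line in f.readlines()[:count]]


# Build a small throwaway UTF-8 file to demonstrate on.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"\xe5 one\n\xe5 two\n".encode("utf-8"))

demo_lines = first_lines(demo_path)
os.remove(demo_path)

print(demo_lines)
```

Printing those lines portably still needs one of the approaches discussed earlier, such as from __future__ import print_function plus an output stream that encodes to UTF-8.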