Free Pascal: how to get the real file contents using TFileStream?
I am trying to read a file's contents using TFileStream:
    procedure ShowFileCont(myfile : string);
    var
      tr : string;
      fs : TFileStream;
    begin
      fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
      SetLength(tr, fs.Size);
      fs.Read(tr[1], fs.Size);
      ShowMessage(tr);
      fs.Free;
    end;
I made a little text file whose only contents are: aaaaaaaj“њРЉtщЂ®8ЈЏvд"Ј¦aИaaaaaaa
- and saved the file (using AkelPad) in the 1251 (ANSI) codepage
- and saved it again in the 65001 (UTF-8) codepage.
These files have different sizes but their contents are equal: I opened them both in Notepad, and both show the same contents.
But when I run the ShowFileCont procedure, it shows me different results:
- aaaaaaaj?Њt?8?v?"?a?aaaaaaa
- aaaaaaaj“њРЉtщЂ®8ЈЏvд"Ј¦aИaaaaaaa
Questions:
- How do I get the real file contents using TFileStream?
- How do you explain that these 2 files have different sizes although their content (in Notepad) is equal?
Add: sorry, I didn't say that I use Lazarus FPC, where string = UTF8String.
Why do the files have different sizes?
Because they use different encodings. The 1251 encoding maps each character to a single byte. UTF-8 uses a variable number of bytes for each character.
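To make the size difference concrete, here is a minimal sketch that writes one Cyrillic letter out as raw bytes in both encodings. The constant names are made up for illustration; the byte values are the standard Windows-1251 and UTF-8 encodings of 'Ж' (U+0416).

```pascal
program EncodingSizes;
{$mode objfpc}{$H+}

const
  // The Cyrillic letter 'Ж' (U+0416) written out as raw bytes.
  ZheCp1251: array[0..0] of Byte = ($C6);       // Windows-1251: one byte
  ZheUtf8:   array[0..1] of Byte = ($D0, $96);  // UTF-8: two bytes

begin
  WriteLn('1251 bytes:  ', Length(ZheCp1251));
  WriteLn('UTF-8 bytes: ', Length(ZheUtf8));
end.
```

Every non-ASCII character in the sample text grows in a similar way when saved as UTF-8, which is why the two files differ in size while looking identical in an editor that decodes each one correctly.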
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded. In that case the code in the question does what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load it into an AnsiString variable and, so long as your system's locale is 1251, any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load the file into a byte array and then convert from 1252, say, to UTF-8 so that you can store the UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from the LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
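A minimal sketch of that approach, assuming the LazUtils package (which provides LConvEncoding) is available to the project. ReadRawBytes is a hypothetical helper name, and the file is assumed to be Windows-1251 text passed as the first command-line argument.

```pascal
program LoadCp1251;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, LConvEncoding;

// Reads the whole file as raw bytes. No encoding conversion happens here;
// the string is used purely as a byte buffer.
function ReadRawBytes(const FileName: string): string;
var
  fs: TFileStream;
  n: Integer;
begin
  Result := '';
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    n := fs.Size;
    SetLength(Result, n);
    if n > 0 then               // avoid touching Result[1] on an empty file
      fs.ReadBuffer(Result[1], n);
  finally
    fs.Free;
  end;
end;

begin
  // Assumption: the file named by the first argument is Windows-1251 text.
  WriteLn(CP1251ToUTF8(ReadRawBytes(ParamStr(1))));
end.
```

Because the conversion names the source code page explicitly, the result should not depend on the locale of the machine running the code. Whether the console displays the UTF-8 output correctly is a separate matter; in a Lazarus GUI application the converted string can be passed to ShowMessage instead.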
How can I determine the file encoding used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is impossible to identify the encoding of an array of bytes that is meant to represent text.
It is possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text, and so you can rule those encodings out for such files. But for encodings like 1251, 1252 etc., any byte stream is valid. There's no way to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
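As an illustration of ruling an encoding out, here is a simplified sketch of a structural UTF-8 check. LooksLikeUtf8 is a hypothetical name, not a library routine, and the check is deliberately incomplete: it verifies lead and continuation byte patterns but does not reject every overlong form or UTF-16 surrogate range.

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

// Returns True if every byte sequence in s has a well-formed UTF-8 shape.
// Simplified: some overlong forms and surrogates are not rejected.
function LooksLikeUtf8(const s: string): Boolean;
var
  i, need: Integer;
  b: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(s) do
  begin
    b := Ord(s[i]);
    if b < $80 then need := 0                         // ASCII
    else if (b >= $C2) and (b <= $DF) then need := 1  // 2-byte lead
    else if (b >= $E0) and (b <= $EF) then need := 2  // 3-byte lead
    else if (b >= $F0) and (b <= $F4) then need := 3  // 4-byte lead
    else Exit;                     // stray continuation or invalid lead byte
    Inc(i);
    while need > 0 do
    begin
      // Every continuation byte must have the bit pattern 10xxxxxx.
      if (i > Length(s)) or ((Ord(s[i]) and $C0) <> $80) then Exit;
      Inc(i);
      Dec(need);
    end;
  end;
  Result := True;
end;

begin
  WriteLn(LooksLikeUtf8(#$D0#$96));  // 'Ж' encoded as UTF-8
  WriteLn(LooksLikeUtf8(#$C6));      // 'Ж' encoded as Windows-1251
end.
```

A False result rules UTF-8 out. A True result only means the bytes could be UTF-8; it does not prove that they are.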
The LConvEncoding unit has GuessEncoding, which sounds like it may be of use.
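A possible way to combine the guess with a conversion, again assuming LazUtils is available. ReadRaw is a hypothetical helper, and the guessed encoding name is only a heuristic result, so the converted text may still be wrong for some files.

```pascal
program GuessAndConvert;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, LConvEncoding;

// Loads the file into a string used as a plain byte buffer.
function ReadRaw(const FileName: string): string;
var
  ms: TMemoryStream;
begin
  Result := '';
  ms := TMemoryStream.Create;
  try
    ms.LoadFromFile(FileName);
    SetString(Result, PChar(ms.Memory), ms.Size);
  finally
    ms.Free;
  end;
end;

var
  raw, enc: string;
begin
  raw := ReadRaw(ParamStr(1));   // assumption: file path given as argument
  enc := GuessEncoding(raw);     // heuristic guess, not a guarantee
  WriteLn('Guessed encoding: ', enc);
  WriteLn(ConvertEncoding(raw, enc, EncodingUTF8));
end.
```

If the encoding is actually known in advance, passing it explicitly (for example via CP1251ToUTF8) is more reliable than guessing.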