hadoop - How to handle multiline records for an InputSplit?
I have a 100 TB text file that contains multiline records. We are not told how many lines each record takes: one record may be 5 lines, another 6, another 4. The number of lines can vary from record to record.
So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is this: when the splits are computed, I am not sure that each split will contain full records. Part of a record can end up in split 1 and the rest in split 2, which is wrong.
So, can you suggest how to handle this scenario so that a full record is guaranteed to go into a single InputSplit?
Thanks in advance -je
You need to know whether the records are delimited by some known sequence of characters.
If they are, you can set the textinputformat.record.delimiter config parameter to separate the records.
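On the split-boundary worry: a delimiter-based reader does not need each split to contain whole records. The usual convention, which as far as I understand is how Hadoop's built-in line reader behaves, is that every split except the first skips ahead to the first delimiter, and every split reads past its end to finish the last record it started. Here is a minimal sketch of that rule in plain Java with no Hadoop dependency. `SplitBoundaryDemo`, `readSplit`, the one-character `#` delimiter and the use of character offsets instead of byte offsets are all assumptions made for illustration. In a real job the delimiter would be set with something like `conf.set("textinputformat.record.delimiter", "#")` on the job's `Configuration`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo; this class is not part of Hadoop.
class SplitBoundaryDemo {

    // Returns the records owned by the split covering offsets [start, end).
    // A record is owned by the split in which it starts, even if it ends later.
    static List<String> readSplit(String data, char delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: everything up to the first delimiter is the
            // tail of a record that the previous split will finish, so skip it.
            int d = data.indexOf(delim, start);
            if (d < 0) {
                return records; // no record starts inside this split
            }
            pos = d + 1;
        }
        // Read every record that starts at or before 'end', running past
        // 'end' if needed to complete the last one.
        while (pos <= end && pos < data.length()) {
            int d = data.indexOf(delim, pos);
            int recordEnd = (d < 0) ? data.length() : d;
            records.add(data.substring(pos, recordEnd));
            pos = (d < 0) ? data.length() : d + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Four multiline records of different lengths, separated by '#'.
        String data = "a1\na2#b1\nb2\nb3#c1#d1\nd2";
        int splitSize = 7; // deliberately cuts through the middle of records
        for (int s = 0; s < data.length(); s += splitSize) {
            int e = Math.min(s + splitSize, data.length());
            System.out.println("split [" + s + "," + e + ") -> " + readSplit(data, delim(), s, e));
        }
    }

    private static char delim() {
        return '#';
    }
}
```

With this rule a record can physically straddle two splits and still be handed to exactly one mapper, complete. The cost is that the reader for one split may fetch some bytes that live in the next block.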
If the records aren't character-delimited, you'll need logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, prone to error, and slow, as there's a lot of text processing going on.
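To give an idea of what the field-counting approach could look like, here is a rough sketch in plain Java. It assumes, purely for illustration, that every field ends with a `|` terminator, that every record has exactly three fields, and that a record may wrap across any number of lines. `FieldCountAssembler` and `assemble` are made-up names, and none of this is Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups lines into records by counting field terminators.
class FieldCountAssembler {

    static List<String> assemble(List<String> lines, int fieldsPerRecord, char terminator) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int fieldsSeen = 0;
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                current.append(c);
                // A record is complete once the expected number of fields has ended.
                if (c == terminator && ++fieldsSeen == fieldsPerRecord) {
                    records.add(current.toString());
                    current.setLength(0);
                    fieldsSeen = 0;
                }
            }
            // Keep the line break when the record continues on the next line.
            if (current.length() > 0) {
                current.append('\n');
            }
        }
        // Any incomplete trailing text is silently dropped in this sketch;
        // a real reader would have to report or handle it.
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "id1|alice|", "london|", "id2|bob|paris|", "id3|", "carol|", "rome|");
        for (String record : assemble(lines, 3, '|')) {
            System.out.println("[" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

Note that with this scheme a reader that starts in the middle of the file has no obvious way to tell where a record begins, because it does not know how many fields of the current record it has already missed. That is a big part of why this route is more error-prone than a real delimiter.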
Try to determine whether the records are delimited. Perhaps posting a short example of a few records would help.