hadoop - How to handle multiline record for inputsplit?

February 15, 2011

I have a 100 TB text file that contains multiline records. We are not told how many lines each record takes: one record may be 5 lines long, another 6 lines, another 4. The number of lines can vary from record to record.

So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but my confusion is this: when the splits are created, I am not sure that each split will contain only full records. Part of a record can end up in split 1 and the rest in split 2, which is wrong.

So, can anyone suggest how to handle this scenario so that a full record is guaranteed to go into a single InputSplit?

Thanks in advance -je

You need to know whether the records are delimited by some known sequence of characters.

If you know this, you can set the textinputformat.record.delimiter config parameter to separate the records.
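The parameter is an ordinary job configuration property, so in a driver it would be set with something like conf.set("textinputformat.record.delimiter", "@@\n"), where the "@@\n" marker is only an assumed example. With a known delimiter, a reader can cope with records that cross split boundaries by a convention along these lines: every split except the first skips ahead to the first record that begins inside it, and every split reads past its own end to finish its last record. Below is a minimal, Hadoop-free sketch of that idea on an in-memory string. The class and method names are hypothetical, it is not Hadoop's actual reader code, and it assumes delimiter occurrences never overlap each other.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    // Returns the records owned by the split [start, end) of data.
    // A record is owned by the split in which its first character falls.
    static List<String> readSplit(String data, String delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos;
        if (start == 0) {
            pos = 0; // the first split owns the record at offset 0
        } else {
            // Back up by the delimiter length so a delimiter that straddles
            // the boundary is still seen, then skip the partial record in
            // front of it (that record belongs to an earlier split).
            int from = Math.max(0, start - delim.length());
            int hit = data.indexOf(delim, from);
            if (hit < 0) {
                return records; // no record begins at or after this split
            }
            pos = hit + delim.length();
        }
        // Emit every record that begins before end, reading past end
        // when that is needed to finish the last one.
        while (pos < end && pos < data.length()) {
            int hit = data.indexOf(delim, pos);
            int recordEnd = (hit < 0) ? data.length() : hit;
            records.add(data.substring(pos, recordEnd));
            pos = (hit < 0) ? data.length() : hit + delim.length();
        }
        return records;
    }

    // Reads the whole input split by split, as independent map tasks would.
    static List<String> readAll(String data, String delim, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            all.addAll(readSplit(data, delim, start, end));
        }
        return all;
    }
}
```

Calling readAll with any split size should give the same records as reading the input in one piece, which is the property the question asks for: no record is lost or cut in half, even though the split boundaries themselves fall in the middle of records.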

If the records aren't character delimited, you'll need extra logic that, for example, counts a known number of fields (if there is a known number of fields) and presents that as a record. This usually makes things more complex, prone to error, and slow, as there's a lot of text processing going on.
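As a rough illustration of the field-counting idea, here is a minimal sketch that groups lines into records once a fixed number of fields has been collected. The field separator, the field count, and all names are assumptions made for the example and do not come from the question's data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FieldCountSketch {

    // Groups the fields found on consecutive lines into records of exactly
    // fieldsPerRecord fields each. sep is a literal separator (assumed).
    static List<List<String>> groupByFieldCount(List<String> lines, String sep,
                                                int fieldsPerRecord) {
        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String sepRegex = Pattern.quote(sep); // treat the separator literally
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            for (String field : line.split(sepRegex, -1)) {
                current.add(field);
                if (current.size() == fieldsPerRecord) {
                    records.add(current); // record complete, start a new one
                    current = new ArrayList<>();
                }
            }
        }
        // A trailing incomplete record is dropped here; a real reader would
        // probably want to report it as an error instead.
        return records;
    }
}
```

Note that this only works when counting starts at a true record boundary. A reader that begins in the middle of a file has no marker to align on, which is one reason this route is more fragile than the delimiter one.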

Try determining whether the records are delimited. Perhaps posting a short example of a few records would help.
