Python - Process very large (>20GB) text file line by line
I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in 7 fields and I want to remove the last 3 characters from each of the first 3 fields - which should reduce the file size by about 20%.

I am brand new to Python and have code which will do what I need to do at about 3.4 GB per hour, but for this to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? The code doesn't come close to challenging the processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive?
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")   # must be a different file from the input
    l = r.readline()
    while l:
        # split the line three times to pick out the first 3 fields
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        # trim the last 3 characters of each of those fields
        w.write(l.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
        l = r.readline()
    r.close()
    w.close()
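To test whether the hard drive really is the bottleneck, one option is to time a read-only pass over the input, which gives an upper bound for disk-limited throughput (an illustrative sketch, not part of the original code; the function name and the 1MB chunk size are arbitrary choices):

import time

def MeasureReadSpeed(path):
    # time a pure read pass; if this is not much faster than the full
    # script, the job is I/O bound rather than CPU bound
    start = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            nbytes += len(chunk)
    elapsed = time.time() - start
    print("%.1f MB/s" % (nbytes / (1024.0 * 1024.0) / elapsed))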
Any help would be appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?
Edit: here is an extract of the file to be processed.
70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158
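For illustration (a hypothetical snippet, not part of the original post), this is what the intended transformation does to the first sample line:

line = "70700.642014 31207.277115 -0.054123 -1585 255 255 255"
fields = line.split(' ')
fields[:3] = [f[:-3] for f in fields[:3]]   # trim last 3 chars of first 3 fields
print(' '.join(fields))
# 70700.642 31207.277 -0.054 -1585 255 255 255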
It's more idiomatic to write your code like this:
def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            # split once and take the first 3 fields
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
The main saving here is to only do the split once, but if the CPU is not being taxed, it is likely to make very little difference.
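To see what the extra splits actually cost, the two approaches can be timed on a sample line with the standard timeit module (an illustrative micro-benchmark, not from the original answer; the helper names are made up):

import timeit

line = "70700.642014 31207.277115 -0.054123 -1585 255 255 255\n"

def split_three_times():
    # the original approach: three separate splits per line
    x = line.split(' ')[0]
    y = line.split(' ')[1]
    z = line.split(' ')[2]

def split_once():
    # split a single time and slice the result
    x, y, z = line.split(' ')[:3]

print(timeit.timeit(split_three_times, number=1000000))
print(timeit.timeit(split_once, number=1000000))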
It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of the hard drive. A million lines is only 54MB of RAM (1,000,000 lines x 54 bytes per line)!
def ProcessLargeTextFile():
    bunchsize = 1000000     # experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
            # flush a full bunch to disk in one hit
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        # write whatever is left over
        w.writelines(bunch)
As suggested by @Janne, an alternative way to generate the lines:
def ProcessLargeTextFile():
    bunchsize = 1000000     # experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            # split only the first 3 fields off; 'rest' keeps the remainder
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
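If the manual batching feels clumsy, a similar effect can be had by giving the output file a large write buffer via the buffering argument of the built-in open() (a sketch under that assumption; the 64MB buffer size is an arbitrary starting point to experiment with):

def ProcessLargeTextFile():
    # let the file object batch the writes itself via a large buffer
    with open("filepath", "r") as r, \
         open("outfilepath", "w", buffering=64 * 1024 * 1024) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))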