Python - Process very large (>20GB) text file line by line
I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in 7 fields and I want to remove the last 3 characters from each of the first 3 fields - which should reduce the file size by about 20%.

I am brand new to Python and have code which will do what I need to do at about 3.4 GB per hour, but for this to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? The code doesn't come close to challenging the processor, so I am making an uneducated guess that it is limited by the read and write speed of the internal hard drive?
def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")   # must be a different file from the input
    l = r.readline()
    while l:
        # split the line three times to pick out the first 3 fields
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        # trim the last 3 characters of each of those fields
        w.write(l.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
        l = r.readline()
    r.close()
    w.close()
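To test whether the hard drive really is the bottleneck, one option is to time a read-only pass over the input, which gives an upper bound for disk-limited throughput (an illustrative sketch, not part of the original code; the function name and the 1MB chunk size are arbitrary choices):

import time

def MeasureReadSpeed(path):
    # time a pure read pass; if this is not much faster than the full
    # script, the job is I/O bound rather than CPU bound
    start = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            nbytes += len(chunk)
    elapsed = time.time() - start
    print("%.1f MB/s" % (nbytes / (1024.0 * 1024.0) / elapsed))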
Any help would be appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?
Edit: here is an extract of the file to be processed.
70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158
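For illustration (a hypothetical snippet, not part of the original post), this is what the intended transformation does to the first sample line:

line = "70700.642014 31207.277115 -0.054123 -1585 255 255 255"
fields = line.split(' ')
fields[:3] = [f[:-3] for f in fields[:3]]   # trim last 3 chars of first 3 fields
print(' '.join(fields))
# 70700.642 31207.277 -0.054 -1585 255 255 255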
It's more idiomatic to write your code like this:
def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            # split once and take the first 3 fields
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
The main saving here is to only do the split once, but if the CPU is not being taxed, it is likely to make very little difference.
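To see what the extra splits actually cost, the two approaches can be timed on a sample line with the standard timeit module (an illustrative micro-benchmark, not from the original answer; the helper names are made up):

import timeit

line = "70700.642014 31207.277115 -0.054123 -1585 255 255 255\n"

def split_three_times():
    # the original approach: three separate splits per line
    x = line.split(' ')[0]
    y = line.split(' ')[1]
    z = line.split(' ')[2]

def split_once():
    # split a single time and slice the result
    x, y, z = line.split(' ')[:3]

print(timeit.timeit(split_three_times, number=1000000))
print(timeit.timeit(split_once, number=1000000))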
It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of the hard drive. A million lines is only 54MB of RAM (1,000,000 lines x 54 bytes per line)!
def ProcessLargeTextFile():
    bunchsize = 1000000     # experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
            # flush a full bunch to disk in one hit
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        # write whatever is left over
        w.writelines(bunch)
As suggested by @Janne, an alternative way to generate the lines:
def ProcessLargeTextFile():
    bunchsize = 1000000     # experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            # split only the first 3 fields off; 'rest' keeps the remainder
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
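If the manual batching feels clumsy, a similar effect can be had by giving the output file a large write buffer via the buffering argument of the built-in open() (a sketch under that assumption; the 64MB buffer size is an arbitrary starting point to experiment with):

def ProcessLargeTextFile():
    # let the file object batch the writes itself via a large buffer
    with open("filepath", "r") as r, \
         open("outfilepath", "w", buffering=64 * 1024 * 1024) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))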