Python memory error for a large data set
I want to generate a 'bag of words' matrix containing the documents with the corresponding counts of the words in each document. In order to do this I run the code below to initialise the bag of words matrix. Unfortunately I receive a memory error after x amount of documents, in the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a large amount of documents (~2,000,000) with only 8 GB of RAM.
    def __init__(self, paths, words_count, normalize_matrix=False, trainingset_size=None, validation_set_words_list=None):
        '''
        Open the documents at the given paths and initialize the variables needed
        in order to construct the word matrix.

        Parameters
        ----------
        paths: the paths to the documents.
        words_count: the number of words in the bag of words.
        trainingset_size: the proportion of the data that should be the training set.
        validation_set_words_list: the attributes for validation.
        '''
        print '################ Data processing started ################'

        self.max_words_matrix = words_count

        print '________________ Reading docs from the file system ________________'
        timer = time()
        for folder in paths:
            self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
            print '____ Data processing category '+folder

            if trainingset_size == None:
                docs = os.listdir(folder)
            elif not trainingset_size == None and validation_set_words_list == None:
                docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
            else:
                docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]

            count = 1
            length = len(docs)
            for doc in docs:
                if doc.endswith('.txt'):
                    d = open(folder+'/'+doc).read()
                    # Append a filtered version of the document to the document list.
                    self.docs_list.append(self.__filter__(d))
                    # Append the name of the document to the list of document names.
                    self.docs_names.append(doc)
                    # Increase the class indices counter.
                    self.class_indices.append(len(self.class_names)-1)
                    print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
                    count += 1
What you're asking for isn't possible. Also, Python doesn't automatically give you the space benefits you're expecting from a bag-of-words model. Plus, I think you're doing a key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
    self.docs_list.append(self.__filter__(d))
… is wrong.
All you want to store for each document is a count vector. In order to get that count vector, you need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the bag-of-words model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is about as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:

    a = np.zeros(len(self.word_dict), dtype='i2')
    for word in split_into_words(d):
        try:
            idx = self.word_dict[word]
        except KeyError:
            idx = len(self.word_dict)
            self.word_dict[word] = idx
            a = np.resize(a, idx+1)   # grow the vector by one slot for the new word
            a[idx] = 1
        else:
            a[idx] += 1
    self.doc_vectors.append(a)
But even that still won't be enough. Unless you only have on the order of 1K unique words, you can't fit all those counts in memory.

For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation would take 20GB.
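A quick back-of-the-envelope check of that figure, using the same assumed numbers (2M documents, a 5000-word vocabulary, 2-byte counts):

    docs = 2000000           # number of documents
    vocab = 5000             # unique words across the corpus
    bytes_per_count = 2      # 'i2' (int16) counts
    print(docs * vocab * bytes_per_count / 1e9)   # -> 20.0, i.e. about 20GB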
Since most documents won't have most words, you will get some benefit from using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you got absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
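If you do try the sparse route, here's a minimal sketch of what it could look like with scipy.sparse; split_into_words is a hypothetical tokenizer like the one above, and note that the triplet lists themselves still live in memory, so this only helps if documents really are sparse:

    import numpy as np
    from scipy.sparse import csr_matrix

    def build_sparse_bow(documents, split_into_words):
        '''Build a docs-by-vocabulary sparse count matrix from an iterable of raw texts.'''
        word_dict = {}                 # word -> column index, shared across all documents
        rows, cols, data = [], [], []
        n_docs = 0
        for row, text in enumerate(documents):
            n_docs = row + 1
            counts = {}
            for word in split_into_words(text):
                idx = word_dict.setdefault(word, len(word_dict))
                counts[idx] = counts.get(idx, 0) + 1
            for idx, c in counts.items():
                rows.append(row)
                cols.append(idx)
                data.append(c)
        matrix = csr_matrix((data, (rows, cols)),
                            shape=(n_docs, len(word_dict)), dtype=np.int16)
        return matrix, word_dict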
So, you can't store all of the document vectors in memory.

If you can process them iteratively instead of all at once, that's the obvious answer.
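A sketch of the iterative approach, assuming whatever you do downstream (training a classifier, accumulating per-class totals, etc.) can consume one document at a time; doc_paths and split_into_words are hypothetical stand-ins for your own path handling and tokenizer:

    def iter_doc_vectors(doc_paths, word_dict, split_into_words):
        '''Yield one (path, {column_index: count}) pair per document, so only a
        single document's counts are ever held in memory at a time.'''
        for path in doc_paths:
            with open(path) as f:
                text = f.read()
            counts = {}
            for word in split_into_words(text):
                idx = word_dict.setdefault(word, len(word_dict))
                counts[idx] = counts.get(idx, 0) + 1
            yield path, counts   # consume it, then let it be garbage-collected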
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).
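For the paging-to-disk option, a rough sketch with PyTables: it assumes a two-pass scheme where the vocabulary (and so the row width) is already known from a first pass over the corpus, and it can consume the iter_doc_vectors generator from the previous sketch:

    import numpy as np
    import tables

    def write_bow_to_disk(filename, doc_vector_iter, vocab_size):
        '''Append one fixed-width int16 count row per document to an HDF5 file.'''
        h5 = tables.open_file(filename, mode='w')
        counts = h5.create_earray(h5.root, 'counts',
                                  atom=tables.Int16Atom(),
                                  shape=(0, vocab_size))   # extensible along axis 0
        for _, sparse_counts in doc_vector_iter:
            row = np.zeros((1, vocab_size), dtype=np.int16)
            for idx, c in sparse_counts.items():
                row[0, idx] = c
            counts.append(row)     # flushed out to the file, not kept in RAM
        h5.close()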