GATE - How to randomly divide a huge corpus into 3?
I have a corpus (held in a JSerial datastore) of thousands of documents with annotations. I need to divide it into 3 smaller ones, picking documents at random. What is the easiest way to do this in GATE?
A piece of running code or a detailed guide would be welcome!
I would use the Groovy console for this (load the "Groovy" plugin, then start the console from the Tools menu).
The following code assumes that:
- you have opened the datastore in GATE Developer
- you have loaded the source corpus, and its name is "fullcorpus"
- you have created 3 (or however many you need) other empty corpora and saved them (empty) to the same datastore. These will receive the partitions
- you have no other corpora open in GATE Developer apart from these four
- you have no documents open
Then you can run the following in the Groovy console:
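    def rnd = new Random()
    // the source corpus is found by name; every other open corpus is a target
    def fullCorpus = corpora.find { it.name == 'fullcorpus' }
    def parts = corpora.findAll { it.name != 'fullcorpus' }
    fullCorpus.each { doc ->
      // pick one of the target corpora at random for this document
      def targetCorpus = parts[rnd.nextInt(parts.size())]
      targetCorpus.add(doc)
      // unload the document again so memory use stays low
      targetCorpus.unloadDocument(doc)
    }
    return null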
The way this works is to iterate over the documents and, for each one, pick a corpus at random for it to be added to. The target sub-corpora should end up roughly (but not exactly) the same size.
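The sizes are only roughly equal because each document is assigned independently of the others. If the parts need to be as equal as possible, one alternative is to shuffle the documents first and then deal them out in turn. Here is a minimal sketch in plain Groovy. The strings standing in for GATE documents, the lists standing in for corpora, and the fixed seed are all assumptions for illustration and are not part of the script above:

```groovy
// Stand-ins for the documents of the source corpus (hypothetical names).
def docs = (1..10).collect { "doc${it}".toString() }

// Three empty lists standing in for the target sub-corpora.
def parts = [[], [], []]

// Shuffle a copy of the document list. The fixed seed is only there to
// make the run repeatable; drop it for a different split each time.
def shuffled = new ArrayList(docs)
Collections.shuffle(shuffled, new Random(42))

// Deal the shuffled documents out in turn, like dealing cards, so the
// part sizes can differ by at most one.
shuffled.eachWithIndex { doc, i ->
  parts[i % parts.size()] << doc
}

parts.eachWithIndex { part, i ->
  println "part ${i}: ${part.size()} documents"
}
```

With 10 documents and 3 parts this gives sizes 4, 3 and 3. To adapt the idea to the GATE script, it is probably safer to shuffle a list of document indices rather than the documents themselves, so that documents are still loaded from the datastore one at a time. I have not tested that variant against a real datastore.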
The script does not save the final sub-corpora, so if it messes up you can just close them, re-open them (empty) from the original datastore, fix the problem and re-run the script. Once you're happy with the final result, right click on each sub-corpus in turn in the left hand tree and choose "Save to its datastore" to write everything to disk.