GATE - How to randomly divide a huge corpus into 3?


I have a corpus (held in a serial datastore) of thousands of documents with annotations. I need to divide it into 3 smaller ones by picking documents at random. What is the easiest way to do this in GATE?

A piece of running code or a detailed guide would be most welcome!

I would use the Groovy console for this (load the "Groovy" plugin, then start the console from the Tools menu).

The following code assumes that

  • you have opened the datastore in GATE Developer
  • you have loaded the source corpus, and its name is "fullcorpus"
  • you have created three (or however many you need) other empty corpora and saved them (empty) to the same datastore; these will receive the partitions
  • you have no other corpora open in GATE Developer apart from these four
  • you have no documents open

Then you can run the following in the Groovy console:

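    def rnd = new Random()
    // the source corpus, looked up by name
    def fullCorpus = corpora.find { it.name == 'fullcorpus' }
    // every other open corpus is a target partition
    def parts = corpora.findAll { it.name != 'fullcorpus' }

    fullCorpus.each { doc ->
      // pick a target at random, add the document, then unload it again
      def targetCorpus = parts[rnd.nextInt(parts.size())]
      targetCorpus.add(doc)
      targetCorpus.unloadDocument(doc)
    }

    return null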

The way this works is to iterate over the documents and, for each one, pick a corpus at random for that document to be added to. The target sub-corpora should end up roughly (but not exactly) the same size.
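If the parts need to be as close to equal in size as possible, one alternative (not what the script above does) is to shuffle the documents once and then deal them out round-robin. Below is a minimal sketch of that idea on plain lists, so it runs in any Groovy console without GATE. The helper name `splitEvenly` and the stand-in document names are hypothetical, chosen only for illustration.

```groovy
// Hypothetical helper, not part of GATE: shuffles a copy of the items and
// deals them round-robin into n lists, so part sizes differ by at most one.
def splitEvenly(List items, int n, Random rnd = new Random()) {
    def shuffled = new ArrayList(items)      // copy, so the input is left untouched
    Collections.shuffle(shuffled, rnd)       // random order
    def parts = (0..<n).collect { [] }       // n separate empty lists
    shuffled.eachWithIndex { item, i -> parts[i % n] << item }
    return parts
}

// Stand-in document names; in GATE these would come from the source corpus.
def names = (1..10).collect { "doc$it".toString() }
def parts = splitEvenly(names, 3, new Random(42))
println parts*.size()   // prints [4, 3, 3]
```

To apply this to the real corpora, each resulting part would still need the same add and unloadDocument calls as in the script above, one target corpus per part.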

The script does not save the final sub-corpora, so if it messes up you can just close them and re-open them (empty) from the original datastore, fix the problem, and re-run the script. Once you're happy with the final result, right-click on each sub-corpus in turn in the left-hand tree and choose "Save to its Datastore" to write it to disk.

