Create a term frequency matrix using 2 columns from a csv file, in R?

April 15, 2012

I'm new to R. I'm mining data in a CSV file: summaries of reports in one column, the date of the report in another column, and the report's agency in a third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new CSV file.

How can I create a term frequency matrix with years as rows and terms as columns, so that I can find the top-frequency terms and do some clustering?

Basically, I need to create a term frequency matrix of terms against year.

Input data (CSV):

    year    summary (around 300 words each)
    1945    <text>
    1985    <text>
    2011    <text>

Desired output (term frequency matrix):

            term1   term2   term3   term4   ...
    1945    3       5       7       8       ...
    1985    1       2       0       7       ...
    2011    .       .       .

Any help would be appreciated.

In the future, please provide a minimal working example.

This doesn't use tm but qdap instead, as it fits this type of data better:

    library(qdap)

    # create a fake data set (please do this yourself in the future);
    # DATA is a small example data set that ships with qdap
    dat <- data.frame(year = 1945:(1945 + 10), summary = DATA$state)

    ##    year                               summary
    ## 1  1945         Computer is fun. Not too fun.
    ## 2  1946               No it's not, it's dumb.
    ## 3  1947                    What should we do?
    ## 4  1948                  You liar, it stinks!
    ## 5  1949               I am telling the truth!
    ## 6  1950                How can we be certain?
    ## 7  1951                      There is no way.
    ## 8  1952                       I distrust you.
    ## 9  1953           What are you talking about?
    ## 10 1954         Shall we move on?  Good then.
    ## 11 1955 I'm hungry.  Let's eat.  You already?
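The example above builds `dat` from a data set that ships with qdap. For the two-column CSV described in the question, a minimal base-R sketch might look like the following. The file contents and the names `csv_path`, `reports` and `dat_csv` are made up for illustration; a real file would be read from its own path and may use different column names.

```r
# Hypothetical file: write a tiny CSV to a temp file so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,summary",
  "1945,\"Fraud was reported in the audit.\"",
  "1985,\"No fraud found, but waste was noted.\"",
  "2011,\"Fraud and waste were both reported.\""
), csv_path)

# stringsAsFactors = FALSE keeps the summaries as character strings, not factors
reports <- read.csv(csv_path, stringsAsFactors = FALSE)

# Keep only the two columns of interest (assumed here to be named year and summary)
dat_csv <- data.frame(year = reports$year,
                      summary = reports$summary,
                      stringsAsFactors = FALSE)
str(dat_csv)
```

A data frame shaped like `dat_csv` has the same two columns as `dat`, so it could be used in its place in the calls that follow.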

Now create a word frequency matrix (similar to a term document matrix):

    t(with(dat, wfm(summary, year)))

    ##      ...
    ## 1945     0       0  0   0  0       0
    ## 1946     0       0  0   0  0       0
    ## 1947     0       0  0   0  0       0
    ## 1948     0       0  0   0  0       1
    ## 1949     0       0  1   0  0       0
    ## 1950     0       0  0   0  1       0
    ## 1951     0       0  0   0  0       0
    ## 1952     0       0  0   0  0       1
    ## 1953     1       0  0   1  0       1
    ## 1954     0       0  0   0  0       0
    ## 1955     0       1  0   0  0       1
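To see what such a year-by-term count matrix contains without relying on qdap, here is a rough base-R sketch. It is not how `wfm` is implemented, and its tokenisation is much cruder. The names `dat_demo`, `tokenise`, `long` and `tf` are hypothetical, and the text is invented.

```r
# Invented example data; note that 1945 appears twice and is pooled into one row
dat_demo <- data.frame(
  year = c(1945, 1945, 1985, 2011),
  summary = c("Fraud was reported.", "More fraud, more waste.",
              "Waste was noted.", "Fraud and waste reported."),
  stringsAsFactors = FALSE
)

# Lower-case, replace anything that is not a letter, apostrophe or space,
# then split on whitespace and drop empty strings
tokenise <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  words <- strsplit(x, "\\s+")[[1]]
  words[nzchar(words)]
}

tokens <- lapply(dat_demo$summary, tokenise)

# One row per (year, term) occurrence
long <- data.frame(
  year = rep(dat_demo$year, lengths(tokens)),
  term = unlist(tokens),
  stringsAsFactors = FALSE
)

# Cross-tabulate: years as rows, terms as columns
tf <- unclass(table(long$year, long$term))
tf
```

Pooling all summaries from the same year into one row is a design choice of this sketch; it matches the question's goal of one row per year.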

Or, as of qdap version 1.1.0, you can create a true DocumentTermMatrix:

    with(dat, dtm(summary, year))

    ## > with(dat, dtm(summary, year))
    ## A document-term matrix (11 documents, 41 terms)
    ##
    ## Non-/sparse entries: 51/400
    ## Sparsity           : 89%
    ## Maximal term length: 8
    ## Weighting          : term frequency (tf)
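The question also asks about top-frequency terms and clustering. Once a numeric matrix with years as rows and terms as columns exists, one possible follow-up in base R is sketched below. The counts are made up (the 2011 row in particular is invented), and `tf`, `top_terms`, `top_by_year`, `hc` and `groups` are illustrative names. Euclidean distance with average linkage is just one reasonable choice, not something the answer prescribes.

```r
# Illustrative counts only: years as rows, terms as columns
tf <- matrix(
  c(3, 5, 7, 8,
    1, 2, 0, 7,
    4, 4, 6, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1945", "1985", "2011"),
                  c("term1", "term2", "term3", "term4"))
)

# Overall most frequent terms across all years
top_terms <- sort(colSums(tf), decreasing = TRUE)
head(top_terms, 2)

# Most frequent term within each year
top_by_year <- colnames(tf)[apply(tf, 1, which.max)]
names(top_by_year) <- rownames(tf)
top_by_year

# Hierarchical clustering of years by the distance between their count profiles
hc <- hclust(dist(tf), method = "average")
groups <- cutree(hc, k = 2)
groups
```

Calling `plot(hc)` would draw the dendrogram. With real data it may be worth converting counts to proportions per year first, so that years with longer summaries do not dominate the distances.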
