Create a term frequency matrix using 2 columns from a csv file, in R?
I'm new to R. I'm mining data in a CSV file: summaries of reports in one column, the date of the report in another column, and the report's agency in a third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new CSV file.
How can I create a term frequency matrix with years as rows and terms as columns, so that I can find the top-frequency terms and do some clustering?
Basically, I need to create a term frequency matrix of terms against year.
Input data (CSV):

    year    summary (around 300 words each)
    1945    <text>
    1985    <text>
    2011    <text>

Desired output (term frequency matrix):

            term1  term2  term3  term4  ...
    1945    3      5      7      8      ...
    1985    1      2      0      7      ...
    2011    .      .      .

Any help would be appreciated.
In future, please provide a minimal working example.
This isn't using tm but qdap instead, as it fits your data type better:
    library(qdap)

    # Create a fake data set (please do this yourself in future).
    # DATA is a sample data set that ships with qdap.
    dat <- data.frame(year=1945:(1945+10), summary=DATA$state)

    ##    year                               summary
    ## 1  1945         Computer is fun. Not too fun.
    ## 2  1946               No it's not, it's dumb.
    ## 3  1947                    What should we do?
    ## 4  1948                  You liar, it stinks!
    ## 5  1949               I am telling the truth!
    ## 6  1950                How can we be certain?
    ## 7  1951                      There is no way.
    ## 8  1952                       I distrust you.
    ## 9  1953           What are you talking about?
    ## 10 1954         Shall we move on?  Good then.
    ## 11 1955 I'm hungry.  Let's eat.  You already?
Now create the word frequency matrix (similar to a term document matrix):
    # wfm() builds a terms-by-years matrix; t() flips it to years-by-terms
    t(with(dat, wfm(summary, year)))

    ##      ...
    ## 1945 0 0 0 0 0 0
    ## 1946 0 0 0 0 0 0
    ## 1947 0 0 0 0 0 0
    ## 1948 0 0 0 0 0 1
    ## 1949 0 0 1 0 0 0
    ## 1950 0 0 0 0 1 0
    ## 1951 0 0 0 0 0 0
    ## 1952 0 0 0 0 0 1
    ## 1953 1 0 0 1 0 1
    ## 1954 0 0 0 0 0 0
    ## 1955 0 1 0 0 0 1
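For readers who want to see what such a matrix contains without installing qdap, here is a minimal base R sketch of the same idea. It is not how `wfm` works internally; the `reports` data frame, the `tokenize` helper, and the simple letters-and-apostrophes tokenization rule are all assumptions made for illustration.

```r
# Hypothetical toy data: one row per report, with a year and a summary.
reports <- data.frame(
  year = c(1945, 1945, 1985, 2011),
  summary = c("Fraud in the audit report.",
              "The audit found no fraud.",
              "Fraud, waste and abuse.",
              "Audit of grant fraud; fraud confirmed."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lowercase the text, then split on anything
# that is not a letter or an apostrophe, dropping empty pieces.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z']+"))
  words[nzchar(words)]
}

tokens <- lapply(reports$summary, tokenize)

# One (year, term) pair per token, then cross-tabulate the pairs.
long <- data.frame(
  year = rep(reports$year, lengths(tokens)),
  term = unlist(tokens),
  stringsAsFactors = FALSE
)
tfm <- table(year = long$year, term = long$term)

# Rows are years, columns are terms, cells are counts.
tfm[, c("audit", "fraud")]
```

Summaries that share a year are pooled into one row, which is the same effect as passing `year` as the grouping variable in the qdap call above.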
Or you can create a true DocumentTermMatrix as of qdap version 1.1.0:
    with(dat, dtm(summary, year))

    ## A document-term matrix (11 documents, 41 terms)
    ##
    ## Non-/sparse entries: 51/400
    ## Sparsity           : 89%
    ## Maximal term length: 8
    ## Weighting          : term frequency (tf)
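The question also asks about top-frequency terms and clustering. A rough sketch of both steps on a plain year-by-term count matrix follows. The matrix here is hypothetical (its first two rows are taken from the desired output in the question, the third is made up), and the choices of Euclidean distance, `hclust` defaults, and two clusters are assumptions rather than recommendations. With a tm-style DocumentTermMatrix, calling `as.matrix()` on it first should give a matrix of this shape.

```r
# Hypothetical year-by-term count matrix (rows = years, columns = terms).
tfm <- matrix(
  c(3, 5, 7, 8,
    1, 2, 0, 7,
    4, 4, 6, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1945", "1985", "2011"),
                  c("term1", "term2", "term3", "term4"))
)

# Total count of each term across all years, highest first.
term_totals <- sort(colSums(tfm), decreasing = TRUE)
top_terms <- names(head(term_totals, 2))

# Hierarchical clustering of the years, using Euclidean distance
# between their rows of term counts.
hc <- hclust(dist(tfm))

# Cut the tree into 2 groups (an arbitrary choice for this sketch).
groups <- cutree(hc, k = 2)
groups
```

Calling `plot(hc)` draws the dendrogram. To cluster terms instead of years, the same steps can be applied to `t(tfm)`.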