Data munging in Python compared with R (from an Excel sheet)

I have a hypothetical example here (Excel file attached). I'm loading in an Excel file whose formatting I need to work around before I can either analyse the data or store it more permanently.
In R, the following few lines make it usable:

```r
library(gdata)
tmp = read.xls('~/Desktop/sample.xlsx', 1, stringsAsFactors = F)
tmp = tmp[, !sapply(tmp, function(x) all(is.na(x)))]
tmp$date = tmp[1, 2]
for (i in 2:ncol(tmp)) {
  ifelse(tmp[2, i] == '', tmp[2, i] <- tmp[2, i - 1], tmp[2, i] <- tmp[2, i])
}
tmp[2, 1] = tmp[1, 1]
names(tmp) = paste(tmp[2, ], tmp[3, ], sep = '.')
tmp = tmp[-c(1:3), ]
names(tmp)[10] = 'date'
```
In Python I've gotten this far:

```python
import xlrd

myf = '/home/me/desktop/sample.xlsx'
sheet = xlrd.open_workbook(myf).sheet_by_index(0)
s1 = []
r = sheet.nrows
c = sheet.ncols
for row in range(0, r):
    t1 = []
    for col in range(0, c):
        t1.append(sheet.cell_value(row, col))
    s1.append(t1)
```
But I've had little success getting rid of the empty rows and columns. All of the following fail:

```
>>> s1[0]
['', '', '', '', '', '', '', '', '', '', '', '']
>>> s1[0] == []
False
>>> s1[0] == ''
False
>>> s1[0] == all('')
False
```
So I'm not clear on how to check whether an entire list is empty.
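For reference, one way to test for an all-empty-string row is `all()` over a generator expression, and `zip(*rows)` transposes the list of lists so the same test works on columns. A small self-contained sketch with made-up data:

```python
# Sketch: all() with a generator tests for an all-empty row;
# zip(*rows) transposes so the same predicate drops blank columns.
rows = [
    ['', '', ''],
    ['', u'cat1', u'a'],
    ['', '', ''],
    ['', u'cat2', u'b'],
]

def is_blank(row):
    # True when every cell in the row is the empty string
    return all(cell == '' for cell in row)

rows = [r for r in rows if not is_blank(r)]
cols = [list(c) for c in zip(*rows) if not is_blank(c)]  # transpose, drop blank columns

print(rows)  # blank rows removed
print(cols)  # remaining columns, blanks removed
```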
I can zip rows 2 & 3 (rows 5 & 6 in the Python list):

```
>>> zip(s1[5], s1[6])
[('', ''), (u'cat1', u'a'), ('', u'b'), ('', ''), (u'cat2', u'c'), ('', u'd'),
 ('', ''), (u'cat3', u'e'), ('', u'f'), ('', ''), (u'cat4', u'g'), ('', u'h')]
```
But I don't know how I'd paste the category names forwards in a for loop.
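For what it's worth, a plain for loop can carry the last non-empty category forward and join each pair with `'.'`, much like R's `paste(..., sep='.')`. A sketch using pairs shaped like the zipped output above (the data is illustrative):

```python
# Sketch: forward-fill the first element of each (category, sub) pair,
# then join the pair with '.' to build combined column names.
pairs = [('', ''), (u'cat1', u'a'), ('', u'b'),
         ('', ''), (u'cat2', u'c'), ('', u'd')]

names = []
last = ''
for cat, sub in pairs:
    if cat == '' and sub == '':
        continue            # skip the blank spacer columns
    if cat != '':
        last = cat          # remember the most recent category
    names.append(last + '.' + sub)

print(names)
```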
A very n00b question, reflective of where my understanding of Python is. Thoughts welcome. Submitted with some trepidation, because I recognise the question has a 'homework' feel to it, though it is purely a personal learning exercise. Thanks.
Edit: after a bit of messing about I've worked up the rough-and-ready working example below. I'd be grateful for pointers on how to do this more efficiently.

I had a try at pandas but found the learning curve quite steep. If someone posts a working MWE using it, I'd be pleased to mark the question as answered.
```python
import os
import xlrd
import pandas as pd
import pprint
import re
import csv

'''
Create a few helper functions to save time finding things,
picking out empties and selecting items.
'''

def nulls(x):
    # indices of rows (or columns) made up entirely of empty strings
    g = lambda r: all(i == '' for i in r)
    out = [i for i, j in enumerate(x) if g(j)]
    return out

def fill(x):
    # carry the last non-empty value forward
    for i in range(1, len(x)):
        if x[i] == '':
            x[i] = x[i - 1]
    return x

def which(x, y):
    out = [i for i, j in enumerate(x) if y(j)]
    return out

def t(x):
    # transpose a list of lists
    out = [list(i) for i in zip(*x)]
    return out

def rbind(x, y, sep=None):
    if sep is None:
        sep = '.'
    out = [str(i[0]) + sep + str(i[1]) for i in zip(x, y)]
    return out

# def csvit(x):
#     tmp = next(key for key, val in globals().items()
#                if val == x and not key.startswith('_'))
#     f = open('/home/me/desktop/csvs/' + tmp + '.csv', 'wb')
#     writer = csv.writer(f, quotechar='"', quoting=csv.QUOTE_ALL, dialect=csv.excel)
#     [writer.writerow(i) for i in x]
#     f.close()

#
# Load the spreadsheet file and convert it to a Python list
#
sheet = xlrd.open_workbook('/home/me/downloads/sample.xlsx').sheet_by_index(0)
s = [sheet.row_values(i) for i in range(sheet.nrows)]

#
# Get rid of unnecessary Excel formatting and spacing
#

# rows first
s = [j for i, j in enumerate(s) if i not in nulls(s)]
# transpose and do the columns (surely there is a more elegant way?)
s = t(s)
s = [j for i, j in enumerate(s) if i not in nulls(s)]

# title from the primary category column
title = s[0][0]
# date from the secondary category column
date = [j[1] for j in s if str(j[0]) == 'date']

#
# Combine the columns into a single category variable
# (could have left them separate)
#
cols = ['category']
s = t(s)
cols.extend(rbind(fill(s[2]), s[3])[1:])
s = s[4:len(s)]
s = t(s)
category = [str(i) for i in s[0]]
s = s[1:len(s)]
c1 = [date for i in range(len(s[0]))]  # create date column
c2 = [title for i in range(len(s[0]))]  # create title column
cols.insert(0, 'title')
cols.insert(1, 'date')
s.insert(0, c2)
s.insert(1, c1)
s = t(s)
```
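Since a pandas MWE was asked for: a minimal, hedged sketch of just the two awkward steps (the file path and data below are made up, not the attached sheet). `dropna(axis=1, how='all')` corresponds to the R `sapply`/`is.na` column drop, and `replace`/`ffill` corresponds to the `ifelse` header-filling loop:

```python
# Hedged pandas sketch with invented data: drop all-empty columns and
# forward-fill a merged-cell header row.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, np.nan, np.nan],   # an entirely empty column
    'B': [1.0, 2.0, 3.0],
    'C': [4.0, np.nan, 6.0],
})
df = df.dropna(axis=1, how='all')    # drop columns where every value is NaN
print(df.columns.tolist())           # column A is gone

# Forward-fill a header row that only names the first column of each group
header = pd.Series(['cat1', '', '', 'cat2', ''])
filled = header.replace('', np.nan).ffill()
print(filled.tolist())
```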
One suggestion: if it's possible to export the Excel worksheet to CSV, you might want to have a look at numpy.genfromtxt:

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

It seems to have similar capabilities to pandas without the steep learning curve. It has `delimiter`, `autostrip`, `missing_values` and `filling_values` options, handles column names, and returns the columns in numpy.array form.
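To illustrate, a minimal sketch of genfromtxt on CSV-shaped data (fed from a string here rather than an exported file; the column names and values are invented):

```python
# Sketch: numpy.genfromtxt reading comma-separated data with a header row.
# names=True takes the column names from the first line.
import numpy as np
from io import StringIO

csv_data = StringIO("date,a,b\n20130101,1.5,2.0\n20130102,3.5,4.0")
arr = np.genfromtxt(csv_data, delimiter=',', names=True)

print(arr.dtype.names)   # column names parsed from the header
print(arr['a'])          # a column accessed by name
```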