Sunday, May 29, 2011

Re: [Biostat Blog] 2011 O'Reilly Strata Conf

On May 28, 2011, at 1:53 PM, Yarko wrote:
> Has anyone heard of Data Wrangler before?

Interesting -- they clearly have spent a lot of time thinking about
the interface (whether you agree with their approach or not). My
question is: who is their intended audience? It's possible that
people working in Excel (or similar) without any programming
experience might find something like this beneficial (this is an
empirical question). However, I doubt that anyone with any
programming experience (or solid knowledge of a data analysis package)
would find this useful. For instance, consider the example they
describe in their technical report. This dataset can easily be read
into Stata with

insheet using <filename>
gen state = regexs(1) if regexm(v1,"Reported crime in ([A-Za-z ]+)")
replace state = state[_n-1] if mi(state)
drop if missing(v2)
destring v1, replace

or Python with

data = []
with open('<filename>') as f:
    for line in f:
        items = line.rstrip('\n').split(',')
        if items[0].startswith('Reported crime'):
            state = items[0].replace('Reported crime in ', '')
        elif items[0]:
            data.append([state] + items)

Now, compare this to the code generated by Data Wrangler:

extract('Year').on(/.*/).after(/in /)
delete('Year starts with "Reported"')

which, I would argue, is both longer and considerably more difficult
to read (moreover, this code does not even read in the data file, nor
handle transferring the data into an environment (e.g., Stata) where
it can be analyzed). Now, I suppose that the whole point here is
that with Data Wrangler this can be done via a GUI; however, in my case
I could definitely do this in Stata or Python faster, and when I'm
done, I have a routine that can more easily be used/extended to handle
subsequent (similar) datasets.

This strikes me as similar in some ways to AppleScript. The idea was
to simplify scripting so that anyone could do it, but in the end non-
programmers still find it too difficult, and programmers prefer to use
a standard, more capable scripting language (e.g., bash, Python, Ruby).
-- Phil

Saturday, May 28, 2011

2011 O'Reilly Strata Conf

Strata 2011 | Exploring the Data Universe
Report on Feb. Conference.

Has anyone heard of Data Wrangler before?

I've just tried this, and would like to hear from non-engineering/CS staff on their experience.

In a nutshell:

In particular, they are collecting samples of how people are using their tool (so clean any sensitive values out of your data before pasting it in).

Also, note the "large data" comment:

  • You should not depend on Data Wrangler itself for the actual, bulk data conversion (see the "Export" link near the script it builds, lower left)
  • Instead, feed it a sample of your incoming data so Data Wrangler can generate a Python script that you run locally to filter your actual data

RCG staff may be able to help you get started with this, if you need.

NOTES for RCG staff

  • This application is a full-on JavaScript application (it runs in your browser).
  • It will export CSV results (not so interesting);
  • It will export your particular filtering steps as a program (very interesting)
    • Either JavaScript or Python code (a Python library is available for download)

I recommend a look at the design paper:


Friday, May 13, 2011

Analysis of PCR array data

I found what seems like a very useful Bioconductor package for the analysis of PCR data, in particular high-throughput data, i.e., PCR arrays. The package is called HTqPCR. It has data management capabilities (presumably it can read raw data files), normalization and visualization methods, and limma-type models for analysis (which I presume could replace the one-gene-at-a-time mixed-models approach of Yuan). The reference manual is here.
