Sunday, May 29, 2011

Re: [Biostat Blog] 2011 O'Reilly Strata Conf

On May 28, 2011, at 1:53 PM, Yarko wrote:
> Has anyone heard of Data Wrangler before?


Interesting -- they clearly have spent a lot of time thinking about
the interface (whether you agree with their approach or not). My
question is: who is their intended audience? It's possible that
people working in Excel (or similar) without any programming
experience might find something like this beneficial (this is an
empirical question). However, I doubt that anyone with any
programming experience (or solid knowledge of a data analysis package)
would find this useful. For instance, consider the example they
describe in their technical report. This dataset can easily be read
into Stata with


insheet using <filename>
gen state = regexs(1) if regexm(v1,"Reported crime in ([A-Za-z ]+)")
replace state = state[_n-1] if mi(state)
drop if missing(v2)
destring v1, replace


or Python with


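data = []
state = None
with open('<filename>') as f:
    for line in f:
        # Split the row on commas and trim whitespace/newlines
        items = [item.strip() for item in line.split(',')]
        if items[0].startswith('Reported crime in '):
            # A header row names the state for the rows that follow
            state = items[0].replace('Reported crime in ', '')
        elif items[0].isdigit():
            # A data row: year, property crime rate
            data.append((state, int(items[0]), float(items[1])))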


Now, compare this to the code generated by Data Wrangler:


split('data').on(NEWLINE).max_splits(NO_MAX)
split('split').on(COMMA).max_splits(NO_MAX)
columnName().row(0)
delete(isEmpty())
extract('Year').on(/.*/).after(/in /)
columnName('extract').to('State')
fill('State').method(COPY).direction(DOWN)
delete('Year starts with "Reported"')
unfold('Year').above('Property_crime_rate')


which, I would argue, is both longer and considerably more difficult
to read. Moreover, this code does not even read in the data file, nor
does it handle transferring the data into an environment (e.g., Stata)
where they can be analyzed. Now, I suppose that the whole point here is
that with Data Wrangler this can be done via a GUI. However, in my case
I could definitely do this faster in Stata or Python, and when I'm
done, I have a routine that can more easily be reused or extended to
handle subsequent (similar) datasets.
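
To make the "reused or extended" point concrete, here is a rough sketch
of how the Python approach might be wrapped into a routine. The function
name parse_crime_report, the header pattern, and the sample rows are all
made up for illustration, and it assumes the file layout matches the
example in the technical report (a "Reported crime in <state>" row
followed by year/rate rows).

```python
import re

# Assumed layout: a "Reported crime in <state>" row introduces each block.
HEADER = re.compile(r'Reported crime in (.+)')

def parse_crime_report(lines, header=HEADER):
    """Return (state, year, rate) tuples from an iterable of CSV lines."""
    data = []
    state = None
    for line in lines:
        items = [item.strip() for item in line.split(',')]
        match = header.match(items[0])
        if match:
            # Remember the state until the next header row appears
            state = match.group(1)
        elif items[0].isdigit() and len(items) > 1 and items[1]:
            # A data row: year in the first column, rate in the second
            data.append((state, int(items[0]), float(items[1])))
    return data

# Made-up sample rows, for illustration only
sample = [
    "Year,Property_crime_rate\n",
    "Reported crime in Alabama,\n",
    "2004,4000.0\n",
    "2005,3900\n",
    "\n",
    "Reported crime in New Hampshire,\n",
    "2004,2000.0\n",
]
rows = parse_crime_report(sample)
```

Because the function takes any iterable of lines, an open file can be
passed in directly, and a similar dataset with a different header row
only needs a different pattern for the header argument.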

This strikes me as similar in some ways to AppleScript. The idea was
to simplify scripting so that anyone could do it, but in the end
non-programmers still find it too difficult, and programmers prefer to
use a standard, more capable scripting language (e.g., bash, Python, or
Ruby).


-- Phil

1 comment:

  1. I think that doing this visually - when you are not sure what you might want to do - is beneficial.

    I agree with your comment about the utter unreadability / unmaintainability of the generated code, as far as human consumption goes.
