Wednesday, December 21, 2011

SAS: Merging Log and Output in a single file

SAS typically displays the code, log and procedure output in separate windows.  You can use PROC PRINTTO to re-direct log and procedure output into a single file. Since the code run is typically mirrored in the log file, you have a nice single file documenting your code and output, along with any warnings and other useful information from the log file.  The trick is to specify the same file in the log= and print= options in PROC PRINTTO

For example:

proc printto log="I:\sasods\printto_test.txt" print="I:\sasods\printto_test.txt";
data numbers;
   input x y z;
 14.2   25.2   96.8
 10.8   51.6   96.8
  9.5   34.2  138.2
  8.8   27.6   83.2
 11.5   49.4  287.0
  6.3   42.0  170.7
proc print data=numbers;
   title 'Listing of NUMBERS Data Set';

proc printto;

creates the following file:

NOTE: PROCEDURE PRINTTO used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

3    data numbers;
4       input x y z;
5       datalines;

NOTE: The data set WORK.NUMBERS has 6 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

12   ;
14   proc print data=numbers;
15      title 'Listing of NUMBERS Data Set';
16   run;

Listing of NUMBERS Data Set

Obs      x       y       z

 1     14.2    25.2     96.8
 2     10.8    51.6     96.8
 3      9.5    34.2    138.2
 4      8.8    27.6     83.2
 5     11.5    49.4    287.0
 6      6.3    42.0    170.7
NOTE: There were 6 observations read from the data set WORK.NUMBERS.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

18   proc printto;
19   run;

 The last PROC PRINTTO resets to the default output and log destinations.

Wednesday, December 14, 2011

Running Stata in batch from SAS

Admit it - some statistical software packages have an easier way of doing something than others. That's why one may want to use two or more packages on the same data.  Here is an easy way to run Stata code from SAS.

1.  Output the data set from SAS using PROC EXPORT.
proc export data=out

2.  Make a Stata .do file that would read in data1.csv and perform the desired analysis.  Save it as 'C:\Clients\StataFromSAS\".  For example:
insheet using 'C:\Clients\StataFromSas\data1.csv', clear
set linesize 100
summarize var1
tab var1

3.  In SAS, the X statement issues an operating-environment command from within a SAS session (e.g. performs a DOS command under Windows).  Since Stata can be run from a DOS prompt in batch mode, we can call its .do file from SAS using X statement.  First, change DOS directory to where the Stata log file should go, and then execute in batch mode (for more on Stata in batch mode, see appendix C.5 of Stata [GS] manual).  The following SAS statements accomplish this:
x "cd C:\Clients\StataFromSAS\";
x '"C:\Program Files (x86)\Stata12\StataSE-64.exe" /e do "C:\Clients\StataFromSAS\"';

4. Enjoy :)

Tuesday, December 13, 2011

Spam & HIV share behaviors?

A perhaps interesting curiosity:

Statistical Computing Platform for Python

I came over this talk proposal for the 2012 Python Conference:


pandas is a Python library providing fast, expressive data structures for working with structured or relational data sets. In addition to being used for general purpose data manipulation and data analysis, it has also been designed to enable Python to become a competitive statistical computing platform.
In contrast to other tools in domain-specific data analysis languages like R, it features deeply integrated array axis indexing which enables intuitive data alignment, pivoting and reshaping, joining and merging, and other kinds of standard relational data manipulations. 
 While I've not looked at nor used the library at this point, if anyone would like to explore or try it out for their work, please let me know - I'd be happy to dive into it with you.

- Yarko

Sunday, December 11, 2011

Monday, November 14, 2011

Re: [Biostat Blog] Find and Navigation Pane in Word 2010 does not work with Mendeley Word plugin

On Nov 14, 2011, at 6:09 PM, MK wrote:
> Is that a pun? Word 2011 is only available for Mac :)

Like I said, why not just upgrade?

-- Phil

URL for REDCap at U of C

As requested, here is the URL that people should be referred to for information about REDCap:

-- Phil

Re: [Biostat Blog] Find and Navigation Pane in Word 2010 does not work with Mendeley Word plugin

On Nov 13, 2011, at 6:30 PM, Masha Kocherginsky wrote:
> Find and Navigation Pane in Word 2010 does not work with Mendeley Word plugin

It seems to work fine for me with Word 2011. Why not just upgrade?

-- Phil

Sunday, November 13, 2011

"Find" in Word 2010 does not work with Mendeley Word plugin

I've been having issues with using Find in Word 2010 - the results show up in the navigation pane for only a second, then disappear. And then there is no way to move to the next search result.

I searched online a little, and found it it's because of the Mendeley Word plugin! I uninstalled it, and Find in Word now works. Sad but true.  See here:

Monday, November 7, 2011

How to deal with hard-coded file paths in Windows 7

Problem: you inherit a project from a colleague where the file paths to data in source code files are hard-coded. You don't want to update the paths and re-save the file because that would change the time stamp. The Windows 7 solution is to create a symbolic link that will point to the directory where you now have the data. For example, your colleague stored the data in "C:\Documents and Settings\OldUser\My Documents\Project1\".  You store it in "I:\Clients\Client1\Project1\".  To set up the symbolic link you need to first create the directory with the same path as the path that your colleague had - this is where the symbolic link will reside. In Windows 7, the directory "Documents and Settings" is not accessible directly, but here is a work-around.

1. Open an administrator Command Prompt.  Enter “command” in the Start menu search, right-click on Command Prompt, and select “Run as administrator”.

2. If the directory does not exist, make it using DOS mkdir command. E.g.
     cd C:\Documents and Settings
     mkdir OldUser
     cd OldUser
     mkdir "My Documents"
     cd "My Documents"

3. Now you're ready to create the link; remember the syntax is mklink /D locationOfLink targetDir.
     mklink /D "C:\Documents and Settings\OldUser\My Documents\" "I:\Clients\Client1\Project1\"

4. Go ahead, run your Stata command:
     use "C:\Documents and Settings\OldUser\My Documents\Project1\data.dta", clear

For other ways to create symbolic links in Windows 7, see here:

Thanks, PS!!!

Friday, October 28, 2011

Stata: curve labels for sts graph

I'm plotting KM curves in Stata using

sts graph, by(iGroup)

and here is the plot that I get:
Note that the default curve labels in the legend are of the type "varname=label" for each group. Clearly, this isn't the desired labeling because there is no need for the "varname=" part.  I didn't find much on help on the web regarding how to deal with this (except this post which is cumbersome, to say the least).  There also does not appear to be an option in sts graph or in legend() to change this behavior. Also, it's not the behavior of other plot types (e.g. graph bar age, over(iGroup) labels everything with just the value label).

So here is my solution - it is to define the value labels using global macros:

global iGroup1 "Group -/-"
global iGroup2 "Group -/+"
global iGroup3 "Group +/+"
label def iGroup 1 "$iGroup1" 2 "$iGroup2" 3 "$iGroup3", modify
label val iGroup iGroup
stset DFS, fail(Relapse)
sts graph, by(iGroup) legend(label(1 "$iGroup1") label(2 "$iGroup2") label(3 "$iGroup3"))

And voilà!

Any thoughts?

Monday, October 17, 2011

Mendeley Workshop Tuesday, 10/18 noon

I really like using Mendeley for reference management, and as soon as they add the capability to deal with PMCID's, it will be absolutely perfect in my view.  Our library has a training workshop on Mendeley - here is the link.

Monday, October 10, 2011

Waterfall plot in Stata

*** Waterfall plot
gsort - recist_pct
gen i=_n
tw bar recist_pct i , ytitle("RECIST %Change") xtitle("")

Stata code that produces this plot:

Gaming & Research (decoding aids protein)

For more than 20 years, I have stubbornly preached about the importance of teamwork, functional relationships.  It has been intuitively clear to me for years that non-experts are an essential contributors in a properly structured team environment.  In my engineering work I have also demonstrated the parallels between the structure of functioning human social systems and good technical architecture (as well as taught teams to see, and apply this).

Here is an example from a different field which, while it doesn't surprise me, does thrill me.

It's about the foldit game developed at the University of Washington, and how non-experts decoded an aids protein.

A lay overview appears here:

A paper outlining how experts structured involving others in a way which worked (and which may have some simple lessons for broader application):

Sunday, October 9, 2011

Source Tree (Easy Repositories)

I realize this isn't for everyone, so please filter if it doesn't apply to you.

If you:

  • use a macintosh  (Snow Leopard or newer OS/X);
  • use source repositories (any of SVN,  git, or mercurial / hg )
then you might want to have a look at a free (for the time being) desktop tool to manage, view your repositories:
It's from Atlassian, the people who run   At first look, it seems to be reasonable.


- Yarko

Oh, Fun!

Just thought I'd share this curiosity:

And for those not interested in mapping their own, this related note is also interesting:

Tuesday, September 13, 2011


Even though this is a preview app, it's rather interesting already:

Be sure to use the limits and calendar icons.

Mobile app versions are available from the "full featured" site (which Google recently acquired):

Thursday, July 21, 2011

Fermi Lab Discovers New Particle

This isn't about BioStat, but it is about a cousin organization, and the shear numbers are interesting (25 observations in 500 Trillion?  really?  :-)


Tuesday, June 28, 2011

New Journal for Biomed Research (online)

This sounds potentially interesting
As the journal will only exist online, it offers an opportunity to create a journal and article format that will exploit the potential of new technologies to allow for improved data presentation.
which is consistent with the research site structure we are only just beginning to explore.
I expect this won't be the first journal heading in this direction, so we're keeping our eyes open.

- Yarko

Sunday, May 29, 2011

Re: [Biostat Blog] 2011 O'Reilly Strata Conf

On May 28, 2011, at 1:53 PM, Yarko wrote:
> Has anyone heard of Data Wrangler before?

Interesting -- they clearly have spent a lot of time thinking about
the interface (whether you agree with their approach or not). My
question is: who is their intended audience? It's possible that
people working in Excel (or similar) without any programming
experience might find something like this beneficial (this is an
empirical question). However, I doubt that anyone with any
programming experience (or solid knowledge of a data analysis package)
would find this useful. For instance, consider the example they
describe in their technical report. This dataset can easily be read
into Stata with

insheet using <filename>
gen state = regexs(1) if regexm(v1,"Reported crime in ([A-z]+)")
replace state = state[_n-1] if mi(state)
drop if missing(v2)
destring v1, replace

or Python with

data = []
with open('<filename>') as f:
for line in f:
items = line.split(',')
if items[0].startswith('Reported crime'):
state = items[0].replace('Reported crime in ','')
elif items[0]:

Now, compare this to the code generate by Data Wrangler:

extract('Year').on(/.*/).after(/in /)
delete('Year starts with "Reported"')

which, I would argue, is both longer and considerably more difficult
to read (moreover, this code does not even read in the data file, nor
handle transferring the data into an environment (e.g., Stata) where
they can be analyzed). Now, I suppose that the whole point here is
that with Data Wrangler this can be done via a GUI, however in my case
I could definitely do this in Stata or Python faster, and when I'm
done, I have a routine that can more easily be used/extended to handle
subsequent (similar) datasets.

This strikes me as similar in some ways to Applescript. The idea was
to simplify scripting so that anyone could do it, but in the end non-
programmers still find it too difficult, and programmers prefer to use
a standard, more capable scripting language (e.g., bash, Python, Ruby,

-- Phil

Saturday, May 28, 2011

2011 O'Reilly Strata Conf

Strata 2011 | Exploring the Data Universe
Report on Feb. Conference.

Has anyone heard of Data Wrangler before?

I've just tried this, and would like to hear from non-engineering/CS staff on their experience.

In a nutshell:

In particular, they are looking at samples of how people are using their tool (so clean your data appropriately).

Also, note the "large data" comment:

  • You should not (really) depend on the actual, bulk data conversion (see the "Export" link near the script that is developed, lower left)
  • You want to really use a sample of your incoming data so Data Wrangler can generate a python script for you to locally filter your actual data

RCG staff may be able to help you get started with this, if you need.

NOTES for RCG staff

  • This application is a full-on javascript application (runs in your browser).
  • It will export CSV results (not so interesting);
  • It will export your particular filtering steps as a program (very interesting)
    • Either javascript, or python code  (python library available for download)

I recommend a look at the design paper:

On programming, statistics, & developing expertise

Friday, May 13, 2011

Analysis of PCR array data

I found what seems like a very useful Bioconductor package for the analysis of PCR data, in particular high-throughput, i.e. PCR arrays. The package is called HTqPCR.  It has data management capabilities (presumably can read raw data files), normalization and visualization methods, and limma-type models for analysis (which I presume could replace the one-gene-at-a-time mixed models approach of Yuan). The reference manual is here.

Friday, April 22, 2011

"Death by PowerPoint"

Given all the presentation preparation going on this week, I thought some might appreciate (or get some comic releif) from this.   Drawing by The Hiking Artist.

- Yarko

Wednesday, April 20, 2011

Sample Size for Proportional Odds Models

Over the summer we needed to determine the sample size for a SNP-related grant where response is 0/1/2. Turns out Hmisc package in R has a function "popower". Here is the descripton:

"popower computes the power for a two-tailed two sample comparison of ordinal outcomes under the proportional odds ordinal logistic model. The power is the same as that of the Wilcoxon test but with ties handled properly. posamsize computes the total sample size needed to achieve a given power. Both functions compute the efficiency of the design compared with a design in which the response variable is continuous. print methods exist for both functions. Any of the input arguments may be vectors, in which case a vector of powers or sample sizes is returned. These functions use the methods of Whitehead (1993)."

Whitehead J (1993): Sample size calculations for ordered categorical data. Stat in Med 12:2257–

R in Action - discussion on Java Ranch

Some might find this useful (a discussion with author Rob Kabacoff):

R in Action's  author, Rob Kabacoff, also runs the site, Quick R referenced in Hacky's post:

Tuesday, April 19, 2011

Social Rejection Hurting? Pain Meds & Brain Activity

This is what I'd mentioned at lunch, but I am not at all surprised - teams, our striving for sense of belonging, punishment by banishment - these are all items related to our health, just as you would expect for social animals (i.e. it's always playing some real role).


Friday, April 1, 2011

R lmer: how to to extract fixed effects with p-values

coeffun = function(fit)
  vc <- vcov(fit, useScale = FALSE)
  b <- fixef(fit)
  se <- sqrt(diag(vc))
  z <- b / sqrt(diag(vc))
  P <- 2 * (1 - pnorm(abs(z)))
  return(cbind(b, se, z, P))

This assumes normality of estimated coefficients

Wednesday, March 30, 2011

MacOS: bash command line editing

This will let you use emacs or vi capabilities in the command line

set -o vi # for vi mode
set -o emacs # for emacs mode

Emacs command line editing cheat-sheet:

Tuesday, March 29, 2011

MacOS: How to change terminal directory to finder

To open the current directory in a graphical user window (finder) on Mac, from a terminal you simply use the command "open ."  (replace '.' with whatever directory name or file you want).   Mac will then use the default program (the finder) to open the target.  [Note: in windows, you can do a similar thing from the command shell,  cmd, with "explorer ."]

But how do you do the complementary operation - change directories of your terminal to the finder location?

Put this in your ~/.bash_login  file, and then you can  use "cdf" to change terminal directory to the last finder which had focus:

# cdf: cd's to frontmost window of Finder
cdf ()
    currFolderPath=$( /usr/bin/osascript <<"    EOT"
        tell application "Finder"
                set currFolder to (folder of the front window as alias)
            on error
                set currFolder to (path to desktop folder as alias)
            end try
            POSIX path of currFolder
        end tell
    # ignore the cd if it's my desktop
    if [ "$currFolderPath" != "$HOME/Desktop/" ]
        echo "cd to \"$currFolderPath\""
        cd "$currFolderPath"
        echo "no finder path to cd to..."

One final piece to round this out:   to open a terminal in the current window, you can add "Go2Shell" to your finder - see for more info.

Thursday, March 24, 2011

R: how to plot histogram with normal density

histnorm = function(xx,...)
  h<-hist(xx, ...)
  yfit <- yfit*diff(h$mids[1:2])*length(xx)
  lines(xfit, yfit, col="blue", lwd=2)

Sunday, February 20, 2011

Regular Expressions in SAS

A few days ago we were discussing which statistical packages have regular expressions. Here is a quick cheat-sheet for SAS:

Friday, February 11, 2011

Friday, January 14, 2011

Stringr package in R

A new package from Hadley Wickham - a string processing package that "provides string functions that are simpler and more consistent [than existing functions in R], and also fixes some functionality that R is missing compared to other programming languages".

Thursday, January 13, 2011

R: how to extract parameter from AR1 fit in gls/nlme

fit = gls(y~pop,data=data,corr = corAR1(.5, form = ~ 1 | trio))

> coef(fit$model$corStruct,unconstrained = FALSE)

** without the unconstrained = FALSE option, it gives some other (?) number

Alternatively, intervals(fit)$corStruct gives a vector with confidence intervals 

> intervals(fit)$corStruct
      lower   est. upper
Phi1 -0.421 0.0384 0.482
[1] "Correlation structure:"

Subscribe via email

Enter your email address:

Delivered by FeedBurner


Blog Archive