Friday, December 17, 2010

SAS Keyboard Shortcuts

Comment out the selection (line by line with /* ... */) :  Ctrl + /
Uncomment the selection                                       :  Ctrl + Shift + /

Here is the full list of other shortcuts.

Friday, December 10, 2010

LaTeX in Google Docs

There is a new feature (or an update of a once-available feature, I cant quite tell) in Google Docs that allows you to type equations using LaTeX shortcuts.  The only caveat is that is that you can't see the LaTeX code and edit itdirectly -- maybe in the next version?

Here is the description of LaTeX equations in Google Docs:

Friday, December 3, 2010

R: Sweave Tutorials

Sweave provides a flexible framework for mixing text and S code for automatic report generation. The basic idea is to replace the S (or R) code with its output, such that the final document only contains the text and the output of the statistical analysis. It is great for automatic report generation and mixing word processing (text, or LaTeX) and R code, output and graphics. Here are some tutorials on Sweave:

From Jeromy Anglim (University of Melbourne):
Tutorial 1: Using Sweave, R, and Make to Generate a PDF of Multiple Choice Questions
Tutorial 2: Batch Individual Personality Reports using R, Sweave, and LaTeX
Tutorial 3: Console Input and Output - Multiple Choice Test Analysis

Monday, November 1, 2010

Webinar: Sample Size Re-estimation for Time-to-Event Trials

This webinar by Cyrus Mehta (President and Co-Founder of Cytel Inc) looks interesting. It is tomorrow (November 2) at 8am CDT.

Monday, October 11, 2010

Sample Size calculations for Longitudinal and Mixed Effects Models (RMASS)

Here is a link to RMASS online (much easier to use than the DOS version):

Directions to the Biostatistics Consulting Laboratory

We are located on the third floor of the R corridor in the Medical Center. 

Take the stairs next to the Student Care Center to the third floor. The R corridor on the third floor is disconnected from the rest of the building so these are the only stairs that will lead you to our offices.

Thursday, October 7, 2010

how to drop unused levels in R

problem.factor <- problem.factor[, drop = TRUE]

To drop unused levels from all factors in a data.frame, use this function:

drop.levels <- function(dat){
  # Drop unused factor levels from all factors in a data.frame
  # Author: Kevin Wright.  Idea by Brian Ripley.
  dat[] <- lapply(dat, function(x) x[,drop=TRUE])


Wednesday, October 6, 2010

how to paste data into R

copy the data into the clipboard (Ctrl-C or Command-C)

in the R console run: mydata = read.table(file="clipboard")

For some versions of R, this does not work. Try

mydata = read.table(pipe("pbpaste"))

EndNote Training

I've been using Mendeley reference management software quite a bit, but it's just too early to give up EndNote completely.  Fortunately, Crerar Library offers EndNote Training session next week - I went to one of those last year, and it eased some of  frustration.

Monday, October 4, 2010

Special Characters in Stata Graphs

Here is a code snippet showing how to insert symbols (e.g. Greek letters or subscripts) in Stata graphs:

local xt2="Max({&Delta}T{subscript:1}-{&Delta}T{subscript:0})"
scatter avegr a1diff1, xt(`xt2') yt("Ave Grade") ///
  ti("Ave Grade vs. Max({&Delta}T{subscript:1}-{&Delta}T{subscript:0})")

Friday, October 1, 2010

Quantile Regression

Just came across this non-technical introduction to quantile regression by Roger Koenker and Kevin Hallock.

R working environment in Windows: Eclipse + StatET plugin

I currently use Tinn-R as my main editor, but it has limitation and there are some things that I don't like about it.  Here is a description of how to set up an R working environment in Windows using Eclipse and the StatET plugin (I haven't tried it yet, but it looks very promising).

What is Eclipse, you ask? Here is an FAQ on the subject.

Wednesday, September 29, 2010

World Statistics Day - 20 October 2010

On 20 October 2010, the World will celebrate the first World Statistics Day, to raise awareness of the many achievements of official statistics premised on the core values of service, professionalism and integrity.

The World Statistics Day (20 October 2010) logo is a stylized statistical chart encircled by five coloured wreaths symbolizing a dynamic, fast changing world as represented by the five continents. The three blue and green bars at the center represent the three core principles of official statistics: Service, Professionalism and Integrity. Together and intertwined, they lay a strong foundation for the national and global statistical system.

Efficient Programming in R

Martin Morgan's slides on "Efficient R Programming"

(reposting from Revolutions blog:

codebook function for R

codebook is a nice and convenient function in Stata to get a snapshot of a dataset. I couldn't find a counterpart in R within the standard packages. But Frank Harrell's Hmisc package has describe function which pretty much does the same thing.

## install
## load the library


Thursday, September 23, 2010

How to merge pdf files into one from command line

Put all the pdf files into a directory
I named them as ch00.pdf, ch01.pdf, etc.
Run on linux command(mac, cygwin, you need Ghostscript installed):
  gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf *.pdf

This method preserves the table of contents of each pdf file

How to add text at specified coordinates using ggplot2

Suppose we want to plot x1 vs. y1 contained in a data.frame data, and then add text directly on the graph.

> head(data[,c(1, 10:11)])
  ptid          x1       y1
1    1  0.51666667 2.142857
2    2  0.59999999 2.400000
3    3  1.35000000 3.000000
4    4 -0.08333333 2.142857
5    5          NA 2.500000
6    6  0.96666668 2.666667

Here is a code snippet:

ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(method="lm") +
       opts(title = "Approach 1") +
       xlab("X1") + ylab("Y1") +
       geom_text(aes(x2,y2,label = texthere), 
          data.frame(x2=2, y2=2.8, texthere="Text Here"))

Note, that we use two different data frames in ggplot() and in geom_text(). The data frame in geom_text() specifies the coordinates (x2=2 and y2=2.8) and the label to place at that point. 

Wednesday, September 22, 2010

How to combine multiple plots with ggplot2 graphics

Here is a great article on how to combine multiple plots (no, these are not facets) when using plots in ggplot2.  Using the arrange() function is then equivalent to using par(mfrow=c(nr,nc)) with regular graphics.

Deleting columns in a data.frame in R

To delete a single column: 
data["colname"] <-NULL
data <- data[, -grep("colname",names(data))]

To delete multiple columns:

# by column number (e.g. columns with indices c1, c2 and c3 )
data <- data[,-c(c1,c2,c3)] 

# by column name:
data <- data[-match(c("var1","var2"), names(data))]
data <- data[-which(names(data) %in% c("muc3", "muc4"))]
data[c("var1","var2")] = list()

# all columns with the same root (e.g. variables "time1", "time2", ..., "timeN")
data <- data[-grep("time",names(data))]

Monday, September 13, 2010

post-mortem debugging in R

## to start debugging (browser) right after an error
## set error option to recover

options(error=recover) ## default is NULL

after the error, select which function you want to debug (there will be more than one if the error happened in a function called from the main one)

where: tells you where in the function you are
ls(): lists all the local variables

## to start debugging at a given point in the function
## insert browser()

myfunction = function(x)

Friday, September 10, 2010

standard deviation given MAF using Hardy Weinberg Equilibrium

## compute variance of X with maf
## assuming Hardy Weinberg equilibrium

funvarx = function(maf)
    paa = maf^2
    pAA = (1-maf)^2
    pq2 = 2*maf*(1-maf)
    varx = pAA*(1-pAA) + paa*(1-paa) + 2*pAA*paa

funsdx = function(maf) return(sqrt(funvarx(maf)))

maf = 0.25

Tuesday, September 7, 2010

get rid of annoying online ads

Try the bookmarklet from here. It shows you just the main content of the page without the annoying ads around it.

Friday, September 3, 2010

Apoptosis (or should we say apotosis?)

according to this introductory book (Molecular biology made simple and  
fun by David Clark):

apoptosis in Greek for "dropping off" and if you want to fool other  
people into thinking you are educated, you must not pronounce the  
second "p". 

Apparently, that's the mechanism by which tadpoles lose their tails when becoming frogs.

biostatistics vs lab research - you have to look at this

Thursday, August 5, 2010

Mind Mapping tool

I found that this is quite useful to organize complex problems/projects.
It will let you quickly generate flow diagrams like the one below

Wednesday, August 4, 2010

PROC Power in SAS

Most of us use PASS for sample size calculations, but because PASS is menu-driven, calculations cannot be reproduced without typing in all the parameters from scratch. PROC POWER in SAS is now available and can do many sample size calculations that we commonly use PASS for.  There is also a menu-driven GUI similar to PASS that creates SAS code for later use (just go to Start->Programs->SAS-> SAS Power and Sample Size).

Here is a code snippet for a two-sample t-test and a paired t-test. 

proc power;
   TwoSampleMeans test=diff_satt
      Alpha = 0.05
      Sides = 2
      GroupMeans = (73 -85)    (30 -30)
      GroupStdDevs = (58 70.5) (55.8 68.3) (53.4 66)
      Power = 0.8 0.9
      NPerGroup = .

proc power;
   PairedMeans test = diff
      Alpha = 0.05
      Sides = 2
      PairedMeans = (161.6 106)
      PairedStdDevs = (10.6 73)
      Corr = 0.1 0.3 0.5 0.7
      Power = 0.8 0.9
      NPairs = .


quote on brevity

It came up in yesterday's meeting with Helena and Nikolai. It was Pascal.

"Je N'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte"
"I have only made this letter rather long because I have not had time to make it shorter"
Pascal. Lettres provinciales, 16, Dec.14,1656.

Monday, August 2, 2010

Midwest SAS Users Group Meeting (in Milwaukee)

MWSUG is happening October 10-12 in Milwaukee. Early registration before September 3 is $185, not too bad. For more details and the program see here:

Friday, July 30, 2010

Duke Suspends Researcher and Halts Cancer Studies - NYtimes


As I was doing some literature searching, I came across an article in The Lancet discussing the use of NSAIDs and the risk of oral cancer. The article had been retracted! Here is the summary of this fraudulent work in Wikipedia. Certainly an interesting read, and a relevant topic for the hn spore.

Thursday, July 29, 2010

Obama on "The View"

President Obama, who appeared on "The View" today, said:

     "Let's not assume [the worst about other people]; let's make sure we get the facts straight."

I think he deserves an honorary degree in Statistics.

Wednesday, July 21, 2010

statistician questioned by the New York Times

This statistician was questioned by the New York Times about her analysis of a Science paper on genetics of old age. We should think twice before putting our names in some papers.

Thursday, July 15, 2010

Subscribing to this blog

There is now a widget at the bottom of this screen that allows readers to subscribe to this blog via email.  Scroll down, and subscribe away! :)

Friday, July 9, 2010

RECORD Trial scandal

Fascinating story about the RECORD clinical trial, involving statistics, conspiracy, secret tape recordings, and industry (GSK) intimidation.  Will there be a summer blockbuster based on this true story? 

NYTimes article about the FDA review of drug safety (Avandia, a diabetes drug)

Letter from Steven Nissen (Cleveland Clinic) in JAMA:

And of course the trial publication itself (Stuart Pocock is the statistician):


Thursday, July 1, 2010

supercomputing resources

If you need some serious computing power, we have these resources available. Among other clusters they have a windows network with 16CPU and 128GB RAM. It has SAS, SPSS, S-Plus, Matlab, R, Stata, Mathematica, etc. Another linux cluster has 256GB RAM with bioinformatic and statistical packages preinstalled.

You can find additional information at

Friday, June 25, 2010

debug() in R

While I'm on a (posting) roll, here is one more useful function, one that I'm a little embarrassed about discovering only now. It's debug() function in R. You tag you favorite function, foo(), with debug(foo). The next time foo() is executed, either by itself or is called from another function, debug executes foo() line by line, and provides an interactive prompt at each step so you can check values of the variables at the current evaluation, plot things, define things, or do whatever else you want. There are some built-in commands to be used at the prompt (if you can't think of something better):

n - advance to the nextstep
c - continue to the end of the current context: e.g. to the end of the loop if within a loop or to the end of the function
Q - quit debugging

In the end you can run undebug() to stop tracking the function.

For more, see here:

Taste of Chicago (off-topic)

Yeah, those of us who live here think it's lame, but I think I'm going to go anyway. They do have quite a few interesting restaurants (Ukrainian Shokolad Pastry & Cafe anyone?), so why not!

Here is a map (from Chicago Tribune):

fancyvrb package in LaTeX

In order to insert a code snippet (e.g. from R) into a LaTeX document, one could use a "verbatim" environment (e.g. with \begin{verbatim} and \end{verbatim}). This works, but isn't very customizable, e.g. cant' change font or fontsize easily.

Instead, download a package called fancyvrb. The manual is at  For example (and note capital V in Verbatim):


  function(X, tol = sqrt(.Machine$double.eps)) {
    ## Generalized Inverse of a Matrix
    dnx <- dimnames(X)
    if(is.null(dnx)) dnx <- vector("list", 2)
    s <- svd(X)
    nz <- s$d > tol * s$d[1]
      if(any(nz)) s$v[, nz] %*% (t(s$u[, nz])/s$d[nz]) else X,
      dimnames = dnx[2:1])


To install a LaTeX package (at least in my setup of WinEdt + MikTeX under Windows), do the following:
- Got to MikTeX Package Manager
- Wait for it to update the list of available packages
- Find the one you want to install, and double click on it.

Wednesday, June 23, 2010

Stata: stset and missing values

If a censoring indicator variable has missing values (but the actual time variable does not), one would expect Stata to omit that observation from subsequent survival analyses, or at least to give a warning that values are missing. However, it treats these observations as censored! For example: 

. tab ios, miss

        ios |      Freq.     Percent        Cum.
          0 |         53       46.09       46.09
          1 |         55       47.83       93.91
          . |          7        6.09      100.00
      Total |        115      100.00 

. count if osmo==. 

. stset osmo, fail(ios)

     failure event:  ios != 0 & ios < .
obs. time interval:  (0, osmo]
 exit on or before:  failure

      115  total obs.
        0  exclusions
      115  obs. remaining, representing
       55  failures in single record/single failure data
 2609.162  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =  93.92876 

. stci

         failure _d:  ios
   analysis time _t:  osmo

             |    no. of 
             |  subjects         50%     Std. Err.     [95% Conf. Interval]
       total |       115    29.42466      3.419791       22.126    35.9014

. tab  ios _d, miss

           |          _d
       ios |         0          1 |     Total
         0 |        53          0 |        53
         1 |         0         55 |        55
         . |         7          0 |         7
     Total |        60         55 |       115


Wednesday, June 16, 2010

Thursday, June 10, 2010

R: spaghetti plot using ggplot2 package

## color and group can be based on different variables


Friday, June 4, 2010

R: Spaghetti Plots for growth curves using lattice package

xyplot(vol~time, dataname, group = id, type="a")

xyplot(vol~time|trt, dataname, group=id, type="a")

SAS: Spaghetti Plots for growth curves using SAS v.9.2 PROC SGPLOT and PROC SGPANEL

Reposted from

proc sort data=dat1;
by trt id time;

proc sgplot data=dat1;
series x=time y=vol/group=trt break;

proc sgpanel data=dat1;
panelby trt;
series x=time y=vol/break group=trt;

UPDATE: how to embed pdf or ppt in the blog or webpage

If the PDF file that you want to embed is not on the web already and therefore does not have a URL that ends with *.pdf (e.g. it's a file on your computer), you can still embed it using third-party websites. I used Scribd, which is a document sharing site. 

Here are the steps:

1. Go to
2. Click the "Upload" button at the top of the screen
3. Enter your email (they promise not to spam)
4. Once your file is uploaded, click the "Share" link next to it, and copy the HTML link in the "Embed this document" section
5. Paste the link into your blogger post

Note, that this will create a Scrbd account for you and will email you password for future use. That's essentially it for how to create an account. 

Mathematica: Clearing values for typeset variables

I was surprised that a subscripted variable (e.g. Z_{k}) cannot be cleared using Clear[] function in Mathematica. Instead, you need to use Unset[]. See below.
Unset Subscripts                                                            

Thursday, June 3, 2010

how to embed pdf or ppt in the blog or webpage

Go to
and get the code

Stata 11.1

Stata 11.1 was released today (type update all to update, and then help whatsnew to see what's new). Of particular note given the past couple of posts on this blog is the following:

  • Mixed models. Linear mixed (multilevel) models have new covariance structures for the residuals: exponential, banded, and Toeplitz. See xtmixed.

lme4: Mixed-effects modeling with R

partial draft of the book by Douglas Bates

Older print version can be found here

Mixed-Effects Models in Sand S-PLUS
José C. Pinheiro and Douglas M. Bates

Wednesday, May 26, 2010

New project at BCL

Ted, Masha, and Haky will write a book titled

The statistics of crappy data.

Any suggestions or comments are welcome.

Tuesday, May 25, 2010

Order from here and help Agassiz

Sorry for the ad but...
If you buy through this link, Amazon will pay Friends of Agassiz 4 to 15% of the total purchase made within 24 hours. It doesn't cost you anything and you'll be helping a public school.


Friday, May 14, 2010

boxplot with data points jittered

## CAN BE REPLACED BY qplot in ggplot2 package


## R function that plots boxplot and adds data points (jittered horizontally) 

## it can be improved so that it can take a model formula as argument
## like boxplot does
boxjitter =
    boxplot(y~x,border=border,ylim = range(y,na.rm=T),outline=F,...)
    meanvec = tapply(y,x,
  } else
    x = x[!]
    boxplot(x,border=border,ylim = range(x),outline=F,...)

Monday, April 26, 2010

Gene Expression Atlas (GXA)

This database is about 1 year old, and looks really cool.

Gene Expression Atlas:

Download PPT from Array Express/Atlas section of

EBI Roadshow

Several of us are attending the European Bioinformatics Institute (EBI) Roadshow workshop at NWU today and tomorrow.

"European Bioinformatics Institute Roadshows provide hands-on bioinformatics training for scientists.  Using presentations and practical sessions, expert trainers guide participants through web interfaces and key functionalities of a selected set of the extensive databases and tools developed at the EBI, including: Ensembl, BioMart, ArrayExpress, ArrayExpress, Gene Expression Atlas, Reactome."

Here are some useful links:

EMBL-EBI website:

Workshop Presentations:

Friday, April 23, 2010

A bit of wisdom from one of the greats

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

John Tukey

about data analysis

Frank Harrell warns that using the data to guide the data analysis is almost as dangerous as not doing so.

ROC curve and Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation

ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank 
correlation between pred. P[Y=1] and Y) is a measure of pure 
discrimination, not a measure of accuracy per se.

Chicago R User Group Established

A Chicago R User group has been established recently - it does not appear that they have any meet-ups scheduled yet, though.


Subscribe via email

Enter your email address:

Delivered by FeedBurner


Blog Archive

google analytics