Randomise v2.1

Permutation-based nonparametric inference
Permutation methods (also known as randomisation methods) are used for inference (thresholding) on statistic maps when the null distribution is not known. The null distribution is unknown either because the noise in the data does not follow a simple distribution, or because non-standard statistics are used to summarize the data. randomise is a simple permutation program enabling modelling and inference using the standard GLM design setup as used, for example, in FEAT. It can output voxelwise and cluster-based tests, and also offers variance smoothing as an option. For more detail on permutation testing in neuroimaging see Nichols and Holmes (2002).
Test Statistics in Randomise
randomise produces a test statistic image
(e.g., ADvsNC_tstat1
, if your chosen output rootname
is ADvsNC
) and sets of P-value images (stored as 1-P for
more convenient visualization, as bigger is then "better"). The table
below shows the filename suffixes for each of the different test
statistics available.
Voxel-wise uncorrected P-values are generally only useful when a single voxel is selected a priori (i.e., you don't need to worry about multiple comparisons across voxels). The significance of suprathreshold clusters (defined by the cluster-forming threshold) can be assessed either by cluster size or cluster mass. Size is just cluster extent measured in voxels. Mass is the sum of all statistic values within the cluster. Cluster mass has been reported to be more sensitive than cluster size (Bullmore et al, 1999; Hayasaka & Nichols, 2003).
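For example, a suprathreshold cluster of 10 voxels whose statistic values sum to 35 has extent 10 and mass 35; a more compact 6-voxel cluster with values summing to 40 would rank lower by extent but higher by mass.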
Accounting for Repeated Measures
Permutation tests do not easily accommodate correlated datasets (e.g., temporally smooth timeseries), as null-hypothesis exchangeability is essential. However, the case of "repeated measurements", or more than one measurement per subject in a multisubject analysis, can sometimes be accommodated.
randomise allows the definition of exchangeability blocks,
as specified by the group_labels
option. If specified,
the program will only permute observations within block, i.e., only
observations with the same group label will be exchanged. See
the repeated measures example below for
more detail.
Confound Regressors
Unlike with the previous version of randomise, you no longer need to treat confound regressors in a special way (e.g. putting them in a separate design matrix). You can now include them in the main design matrix, and randomise will work out from your contrasts how to deal with them. For each contrast, an "effective regressor" is formed using the original full design matrix and the contrast, as well as a new set of "effective confound regressors", which are then pre-removed from the data before the permutation testing begins. One side-effect of the new, more powerful, approach is that the full set of permutations is run for each contrast separately, increasing the time that randomise takes to run.
More information on the theory behind randomise can be found in the Background Theory section below.
A typical simple call to randomise uses the following syntax:
randomise -i <4D_input_data> -o <output_rootname> -d design.mat -t design.con -m <mask_image> -n 500 -D -T
design.mat
and design.con
are text files
containing the design matrix and list of contrasts required; they
follow the same format as generated by FEAT (see below for
examples). The -n 500
option tells randomise to generate
500 permutations of the data when building up the null distribution
to test against. The -D
option tells randomise to demean
the data before continuing - this is necessary if you are not
modelling the mean in the design matrix. The -T
option
tells randomise that the test statistic that you wish to use is TFCE
(threshold-free cluster enhancement - see below for more on this).
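For illustration, a minimal sketch of the two design files for an unpaired comparison of two groups of two subjects (hypothetical values; FEAT-generated files may contain extra header fields such as /PPheights):

design.mat:

/NumWaves 2
/NumPoints 4
/Matrix
1 0
1 0
0 1
0 1

design.con:

/ContrastName1 group1>group2
/NumWaves 2
/NumContrasts 1
/Matrix
1 -1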
There are two programs that make it easy to create the design
matrix, contrast and exchangeability-block files (design.mat / design.con / design.grp). The
first is the Glm
GUI which allows the specification of
designs in the same way as in FEAT, and the second is a simple script
to allow you to easily generate design files for the two-group
unpaired t-test case, called design_ttest2
.
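For example, assuming the script's usual rootname-then-group-sizes argument order, the following should generate design.mat and design.con for two groups of 8 subjects:

design_ttest2 design 8 8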
randomise has the following thresholding/output options:
- Voxel-based thresholding. Uncorrected outputs are: <output>_vox_p_tstat / <output>_vox_p_fstat. Corrected outputs are: <output>_vox_corrp_tstat / <output>_vox_corrp_fstat. To use this option, use -x.

- Threshold-Free Cluster Enhancement (TFCE), selected with the -T option. You can use the -tfce option in fslmaths to test this on an existing stats image. See the TFCE research page for more information. The "E", "H" and neighbourhood-connectivity parameters have been optimised and should be left unchanged. These optimisations differ with the "dimensionality" of your data: for normal, 3D data (such as in an FSL-VBM analysis) you should just use the -T option, while for TBSS analyses (which in effect run on the mostly "2D" white matter skeleton) you should use the --T2 option.

- Cluster-based (extent) thresholding. Corrected outputs are: <output>_clustere_corrp_tstat / <output>_clustere_corrp_fstat. To use this option, use -c <thresh> for t contrasts and -F <thresh> for F contrasts, where the threshold is used to form supra-threshold clusters of voxels.

- Cluster-mass-based thresholding. Corrected outputs are: <output>_clusterm_corrp_tstat / <output>_clusterm_corrp_fstat. To use this option, use -C <thresh> for t contrasts and -S <thresh> for F contrasts.

The filenames of the resulting output images are summarised in the following table:
| | Voxel-wise | TFCE | Cluster-wise Extent | Cluster-wise Mass |
|---|---|---|---|---|
| Raw test statistic | _tstat / _fstat | _tfce_tstat / _tfce_fstat | n/a | n/a |
| 1 - Uncorrected P | _vox_p_tstat / _vox_p_fstat | _tfce_p_tstat / _tfce_p_fstat | n/a | n/a |
| 1 - FWE-Corrected P | _vox_corrp_tstat / _vox_corrp_fstat | _tfce_corrp_tstat / _tfce_corrp_fstat | _clustere_corrp_tstat / _clustere_corrp_fstat | _clusterm_corrp_tstat / _clusterm_corrp_fstat |
"FWE-corrected" means that the family-wise error rate is controlled. If only FWE-corrected P-values less than 0.05 are accepted, the chance of one more false positives occurring over space space is no more than 5%. Equivalently, one has 95% confidence of no false positives in the image.
Note that these output images are 1-P images, where a value of 1 is therefore most significant (arranged this way to make display and thresholding convenient). Thus to "threshold at p<0.01", threshold the output images at 0.99 etc.
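For example, to binarise a TFCE FWE-corrected output at p<0.05 (illustrative filename, matching the one-sample example below):

fslmaths OneSampT_tfce_corrp_tstat1 -thr 0.95 -bin OneSampT_sig_mask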
If your design is simply all 1s (for example, a single group of
subjects) then randomise needs to work in a different way. Normally it
generates random samples by randomly permuting the rows of the design;
however in this case it does so by randomly inverting the sign of the
1s. In this case, then, instead of specifying design and contrast
matrices on the command line, use the -1
option.
You can potentially improve the estimation of the variance that
feeds into the final "t" statistic image by using the variance
smoothing option -v <std>
where you need to specify
the spatial extent of the smoothing in mm.
One-Sample T-test
To perform a nonparametric 1-sample t-test (e.g., on COPEs created
by FEAT FMRI analysis), create a 4D image of all of the images.
There should be no repeated measures, i.e., there should only be one
image per subject. Because this is a single group simple design you
don't need a design matrix or contrasts. Just use:
randomise -i OneSamp4D -o OneSampT -1 -T
Note you do not need the -D
option (as the mean is in the
model), and omit the -n
option, so that 5000 permutations
will be performed.
If you have fewer than 20 subjects (approx. 20 DF), then you will
usually see an increase in power by using variance smoothing, as in
randomise -i OneSamp4D -o OneSampT -1 -v 5 -T
which does a 5mm HWHM variance smoothing.
Note also that randomise will automatically select one-sample mode for
appropriate design/contrast combinations.
Two-Sample Unpaired T-test
To perform a nonparametric 2-sample t-test, create a 4D image of all of the images, with the subjects in the right order! Create appropriate design.mat and design.con files.
Once you have your design files run:
randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T
Two-Sample Unpaired T-test with Nuisance Variables
To perform a nonparametric 2-sample t-test in the presence of
nuisance variables, create a 4D image of all of the images. Create
appropriate design.mat
and design.con
files,
where your design matrix has additional nuisance variables that are
(appropriately) ignored by your contrast.
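For example, if columns 1 and 2 of the design matrix encode group membership and column 3 contains (demeaned) age, the contrast 1 -1 0 tests the group difference while age is treated as a confound, handled automatically as described in the Confound Regressors section above.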
Once you have your design files the call is as before:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T
Repeated measures ANOVA
Following
the ANOVA:
1-factor 4-levels (Repeated Measures) example from the FEAT
manual, assume we have 2 subjects with 1 factor at 4 levels. We
therefore have eight input images and we want to test if there is any
difference over the 4 levels of the factor. The design matrix looks
like
1 0 1 0 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 0
0 1 1 0 0
0 1 0 1 0
0 1 0 0 1
0 1 0 0 0

where the first two columns model subject means and the 3rd through 5th columns model the categorical effect (note the different arrangement of rows relative to the FEAT example). Three t-contrasts for the categorical effect

0 0 1 0 0
0 0 0 1 0
0 0 0 0 1

are selected together into a single F-contrast (stored in design.fts)

1 1 1

Modify the exchangeability-block information in design.grp to match

1
1
1
1
2
2
2
2

This will ensure that permutations will only occur within subject, respecting the repeated measures structure of the data.

The number of permutations can be computed for each group and then multiplied together to find the total number of permutations. Using the one-way ANOVA formula below with 4 levels gives (1+1+1+1)!/(1!×1!×1!×1!) = 24 possible permutations for one subject, and hence 24 × 24 = 576 total permutations.

The call is then similar to the above examples:

randomise -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -f design.fts -m mask -e design.grp -T

BACKGROUND THEORY

A standard nonparametric test is exact, in that the false positive rate is exactly equal to the specified α level. Using randomise with a GLM that corresponds to one of the following simple statistical models will result in exact inference:

- one-sample t-test on difference measures
- two-sample t-test
- one-way ANOVA
- simple correlation

Use of almost any other GLM will result in approximately exact inference. In particular, when the model includes both the effect tested (e.g., difference in FA between two groups) and nuisance variables (e.g., age), exact tests are not generally available. Permutation tests rely on an assumption of exchangeability; with the models above, the null hypothesis implies complete exchangeability of the observations. When there are nuisance effects, however, the null hypothesis no longer assures the exchangeability of the data (e.g., even when the null hypothesis of no FA difference is true, age effects imply that you can't permute the data without altering its structure).
Permutation tests for the General Linear Model
For an arbitrary GLM, randomise uses the method of Freedman & Lane (1983). Based on the contrast (or set of contrasts
defining an F
test), the design matrix is automatically partitioned into tested
effects and nuisance (confound) effects.
The data are first fit to the nuisance effects alone and
nuisance-only residuals are formed. These residuals are permuted,
and then the estimated nuisance signal is added back on,
creating
an (approximate) realization of data under the null hypothesis. This
realization is fit to the full model and the desired test
statistic is computed as usual. This process is repeated to build a
distribution of test statistics equivalent under the null hypothesis
specified by the contrast(s). For the simple models above, this
method is equivalent to the standard exact tests; otherwise, it
accounts for nuisance variation present under the null.
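In outline, using generic GLM notation (not randomise-specific names): write the model as Y = Xβ + Zγ + ε, where X holds the tested effects and Z the nuisance effects. Fit Y on Z alone to obtain residuals e = Y − Zγ̂; for each permutation P, form the null realization Y* = Pe + Zγ̂; then refit Y* to the full model [X Z] and compute the test statistic.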
Note that randomise v2.0 and earlier used a method due to Kennedy (1995). While both the Freedman-Lane and Kennedy methods are accurate for large n, for small n the Kennedy method can tend to falsely inflate significance. For a review of these issues and even more possible methods, see Anderson & Robinson (2001).
This approximate permutation test is asymptotically exact, meaning that
the results become more accurate with an ever-growing sample size
(for a fixed number of regressors). For large sample sizes, with
50-100 or more degrees of freedom, the P-values should be highly
accurate. When the sample size is low and there are many
nuisance regressors, accuracy could be a problem. (The accuracy is
easily assessed by generating random noise data and fitting it to your
design; the uncorrected P-values should be uniformly spread between
zero and one; the test will be invalid if there is an excess of small
P-values and conservative if there is a deficit of small P-values.)
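As a minimal sketch of such a check (illustrative filenames; fslmaths' -randn adds zero-mean, unit-variance Gaussian noise, and randomise's P-value images are stored as 1-P, which does not affect a uniformity check):

# make pure-noise data with the same geometry as the real 4D input
fslmaths input4D -mul 0 -randn noise4D
# fit the real design to the noise, requesting voxelwise outputs
randomise -i noise4D -o noisecheck -d design.mat -t design.con -m mask -x
# inspect the in-mask histogram of the (1-P) values for uniformity
fslstats noisecheck_vox_p_tstat1 -k mask -H 20 0 1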
Monte Carlo Permutation Tests
A proper "exact" test arises from evaluating every possible
permutation. Often this is not feasible, e.g., a simple correlation
with 12 scans has nearly a half a billion possible permutations.
Instead, a random sample of possible permutations can be used,
creating a Monte Carlo permutation test. On average the Monte Carlo
test is exact and will give similar results to carrying out all
possible permutations.
If the number of possible permutations is large, one can show that a true, exhaustive P-value of p will produce Monte Carlo P-values between p ± 2√(p(1-p)/n) about 95% of the time, where n is the number of Monte Carlo permutations.

| n | Confidence limits for p=0.05 |
|---|---|
| 100 | 0.0500 ± 0.0436 |
| 1,000 | 0.0500 ± 0.0138 |
| 5,000 | 0.0500 ± 0.0062 |
| 10,000 | 0.0500 ± 0.0044 |
| 50,000 | 0.0500 ± 0.0019 |

Use the -n option to set the number of permutations (if this number is greater than or equal to the number of possible permutations, an exhaustive test is run).
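For example, with n = 5,000 permutations and a true P-value of 0.05, the half-width is 2√(0.05 × 0.95 / 5000) ≈ 0.0062, as in the table above.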
'Draft' Analyses to Check for Any Significance
To minimize computational expense, a very short 'draft' analysis can be run to screen for any significance. Since the draft analysis won't be very accurate, you should use a generous threshold to ensure you'll catch a significant result. Specifically, if you run with 200
permutations and use a FWE significance threshold of p ≤ 0.1,
you will almost always (greater than 99.9% chance) detect a true
p-value of 0.05 (where "truth" corresponds to running every
permutation). If you find anything
significant you should re-run with at least 5,000 (ideally 10,000)
permutations to get the final result.
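For example, a draft version of the one-sample analysis above would be:

randomise -i OneSamp4D -o OneSampT_draft -1 -T -n 200

If anything survives at corrected p ≤ 0.1, re-run with -n 10000 for the final result.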
Counting Permutations
Exchangeability under the null hypothesis justifies the permutation
of the data. For n scans, there are n! (n factorial,
n×(n-1)×(n-2)×...×2) possible ways of
shuffling the data. For some designs, though, many of these shuffles
are redundant. For example, in a two-sample t-test, permuting two
scans within a group will not change the value of the test statistic.
The number of possible permutations for different designs is given below.

| Model | Sample Size(s) | Number of Permutations |
|---|---|---|
| One-sample t-test on difference measures | n | 2^n |
| Two-sample t-test | n1, n2 | (n1+n2)! / (n1! × n2!) |
| One-way ANOVA | n1, ..., nk | (n1+...+nk)! / (n1! × ... × nk!) |
| Simple correlation | n | n! |
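For example, a two-sample t-test with 8 subjects per group has 16! / (8! × 8!) = 12,870 possible permutations, while a one-sample t-test on difference measures from 12 subjects has 2^12 = 4,096 possible sign-flips.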
Parallelising Randomise
If you have an SGE-capable system then a randomise job can be split in parallel with randomise_parallel, which takes the same input options as the standard randomise binary and then calculates and batches an optimal number of randomise sub-tasks. The parallelisation has two stages: first the randomise sub-jobs are run, and then the sub-job results are combined in the final output.
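For example, the two-sample call above can be parallelised simply by swapping the binary name:

randomise_parallel -i TwoSamp4D -o TwoSampT -d design.mat -t design.con -m mask -T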
Exchangeability Blocks

The pre-FSL4.1.8 version of randomise had a bug in the -e option that could generate incorrect permutations of scans over subject blocks. Incorrectly permuting scans over subject blocks does two things: under permutation it randomly induces big positive or negative effects (inflating the variability in the numerator of the t statistic over permutations), but it ALSO increases the residual standard deviation of each fit (inflating the denominator of the t statistic on each permutation). Hence it isn't clear a priori which will dominate - whether the permutation distribution of t values will be artifactually expanded (wrongly decreasing significance) or artifactually contracted (wrongly inflating significance).

With some simulations, and with some re-analysis of real data, it appears that the effect of the bug was always to wrongly deflate significance. Thus, it is anticipated that results with the corrected code will have improved significance.
REFERENCES

MJ Anderson & J Robinson. Permutation tests for linear models. Aust. N.Z. J. Stat. 43(1):75-88, 2001.

ET Bullmore, J Suckling, S Overmeyer, S Rabe-Hesketh, E Taylor & MJ Brammer. Global, voxel, and cluster tests, by theory and permutation, for a difference between two groups of structural MR images of the brain. IEEE TMI 18(1):32-42, 1999.

D Freedman & D Lane. A nonstochastic interpretation of reported significance levels. J. Bus. Econom. Statist. 1:292-298, 1983.

S Hayasaka & TE Nichols. Validating cluster size inference: random field and permutation methods. NeuroImage 20:2343-2356, 2003.

PE Kennedy. Randomization tests in econometrics. J. Bus. Econom. Statist. 13:84-95, 1995.

TE Nichols & AP Holmes. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping 15:1-25, 2002.
Copyright © 2004-2007, University of
Oxford. Written by T. Behrens, S. Smith, M. Webster and
T. Nichols.