1. Hierarchical clustering.
2. K-Means / K-Medians clustering.
3. Software for microarray data analysis.
A microarray is an array of DNA molecules that permits many hybridization experiments to be performed in parallel. It can monitor the expression levels of thousands of genes simultaneously.
Microarrays emerged about 10 years ago as a high-throughput technology for gene expression analysis and have become a powerful tool for biomedical research. The number of published papers referring to microarrays (in their titles or abstracts) has increased rapidly over the last decade.

Microarray Data Matrix
Microarray data are usually presented in an expression matrix. Each column represents all the gene expression levels from a single experiment, and each row represents the expression of a gene across all experiments. Each element is a log ratio, defined as log2(T/R), where T is the gene expression level in the testing sample and R is the gene expression level in the reference sample.

Why log transformation?
| Original ratio: Up | Original ratio: Down | Log2-transformed ratio: Up | Log2-transformed ratio: Down |
| --- | --- | --- | --- |
| 2 | 0.5 | 1 | -1 |
| 4 | 0.25 | 2 | -2 |
| 8 | 0.125 | 3 | -3 |
| 16 | 0.0625 | 4 | -4 |
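The table suggests why the log transformation is used: on the original scale, up-regulation ranges from 1 upward while down-regulation is squeezed between 0 and 1, whereas after the log2 transformation an n-fold increase and an n-fold decrease have the same magnitude and opposite signs. A minimal sketch in Python that reproduces the table rows; the function name `log_ratio` is illustrative and not part of any microarray package.

```python
import math

def log_ratio(test_level, reference_level):
    """Return log2(T/R) for one gene in one experiment (hypothetical helper)."""
    return math.log2(test_level / reference_level)

# Fold changes from the table: each up-regulated ratio and its down-regulated mirror.
for fold in (2, 4, 8, 16):
    up = log_ratio(fold, 1)    # T is `fold` times R
    down = log_ratio(1, fold)  # R is `fold` times T
    print(f"{fold}-fold change: up = {up:+.0f}, down = {down:+.0f}")
```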
The expression matrix can be presented as a matrix of colored rectangles. Each rectangle represents an element of the expression matrix.

Hierarchical clustering is the most popular method for microarray data analysis. In hierarchical clustering, genes with similar expression patterns are grouped together and connected by a series of ‘branches’, which together form a clustering tree (or dendrogram). Experiments with similar expression profiles can also be grouped together using the same method.




(Adapted from the documentation of MeV)
The two important problems in hierarchical clustering are:
1. How to determine the similarity between two genes?
2. How to determine the similarity between two clusters?
To solve the first problem,
we calculate the distance between two expression vectors. A Gene Expression
Vector consists of the expression of a gene over a set of experimental
conditions. A distance is used as a measure of dissimilarity between genes.
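As an illustration, the distance between two expression vectors can be computed as sketched below. The Euclidean distance is the metric named in the homework of this tutorial; the Pearson correlation distance (1 minus the correlation coefficient) is shown as a second common choice and is an assumption here, as are the function names and the toy vectors.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1 - cov / math.sqrt(var_x * var_y)

# Toy log-ratio profiles over four experiments (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 3.0, 4.0, 5.0]  # same shape as gene_a, shifted up by 1
gene_c = [4.0, 3.0, 2.0, 1.0]  # mirror image of gene_a

print(euclidean_distance(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

The two metrics answer different questions: Euclidean distance is sensitive to the absolute expression levels, while the correlation-based distance only compares the shape of the two profiles.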


(From the documentation of MeV)
The second problem is how to determine the similarity between two clusters. The method for determining the cluster-to-cluster distance is called the linkage method.
Three linkage methods:



(Adapted from the documentation of MeV)
There is no theoretical
guideline for selecting the best linkage method. In practice, people almost
always use the average linkage method.
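A minimal sketch of how a linkage method turns gene-to-gene distances into a cluster-to-cluster distance. It assumes the three methods are single, complete and average linkage (the usual trio; only average linkage is named in the text) and uses Euclidean distance; the function names and toy clusters are illustrative.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distances(cluster_a, cluster_b):
    """All distances between a member of cluster_a and a member of cluster_b."""
    return [euclidean(x, y) for x, y in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest members."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest members."""
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    """Average of all member-to-member distances."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Two toy clusters of two expression vectors each (made-up values).
cluster_1 = [[0.0, 0.0], [0.0, 1.0]]
cluster_2 = [[3.0, 0.0], [4.0, 0.0]]

for linkage in (single_linkage, average_linkage, complete_linkage):
    print(linkage.__name__, round(linkage(cluster_1, cluster_2), 3))
```

By construction the average linkage distance always lies between the single and the complete linkage distance.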

(Shannon W. et al. (2003) Analyzing microarray data using cluster analysis. Pharmacogenomics 4:41-51.)
Genes A and B are merged first, at a level of 1.58. The position of the “splitting point” shows the distance (or dissimilarity) between two genes (or clusters). A low “splitting point” means a short distance and high similarity.
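The merge levels in a dendrogram can be reproduced with a short agglomerative loop: start with every gene in its own cluster, repeatedly merge the two closest clusters, and record the distance at which each merge happens. The sketch below assumes Euclidean distance and average linkage; the gene names and profiles are made up and are not the data of the figure, and this is not MeV's implementation.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(profiles, members_a, members_b):
    """Average distance over all pairs with one gene from each cluster."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))

def hierarchical_clustering(profiles):
    """Return the merges as (cluster_a, cluster_b, height), in merge order."""
    clusters = [(name,) for name in profiles]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(profiles, *pair))
        merges.append((a, b, average_linkage(profiles, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy expression profiles (made-up values).
profiles = {
    "A": [1.0, 2.0],
    "B": [1.0, 3.0],
    "C": [5.0, 5.0],
    "D": [6.5, 5.0],
}

merges = hierarchical_clustering(profiles)
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height:.2f}")
```

Each recorded height is the position of the corresponding “splitting point” in the tree, so the earliest merges join the most similar genes.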
What is Mean? What is Median?
Mean is the average.
Median is the “middle number”, i.e. the middle of the distribution. For an odd number of values, the median is simply the middle value. For example, the median of 2, 4 and 7 is 4. For an even number of values, the median is the average of the two middle values. Thus, the median of 2, 4, 7 and 12 is (4+7)/2 = 5.5.
The median is less sensitive to extreme values than the mean. For example, the mean of 5, 6, 7, 8 and 9 is 7, and the median of 5, 6, 7, 8 and 9 is also 7. But the mean of 5, 6, 7, 8 and 99 is 25, while the median of 5, 6, 7, 8 and 99 is still 7.
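These numbers can be checked with Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

typical = [5, 6, 7, 8, 9]
with_outlier = [5, 6, 7, 8, 99]

print(mean(typical), median(typical))            # both are 7
print(mean(with_outlier), median(with_outlier))  # mean jumps to 25, median stays 7
print(median([2, 4, 7, 12]))                     # average of the two middle values: 5.5
```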


(Adapted from MeV documentation)
MultiExperiment Viewer (MeV) is open-source software for microarray data analysis. MeV is distributed under The Artistic License, which means you can freely download the software or get a copy from another user.
MeV requires the Java Runtime Environment (JRE) to run. Detailed instructions for installing the JRE and MeV can be found at http://compbio.uthsc.edu/microarray/install.htm
The program is in C:\MEV. Double click the batch
file (TMEV.bat) to start the program. Use the File menu to open a new
Multiple Array Viewer.
Select Load data from the File menu to
launch the file-loading dialog. At the top of this dialog, use the drop-down
menu to select the type of expression files to load. Use the file browser to
locate the files to be loaded.
Type of the Input file: Stanford Files (*.txt)
Name of the Input file: Stanford_Large.txt
Select Options:
1. Average Linkage;
2. Cluster both genes and experiments.
Clusters of interest can be stored:
1. Click the dendrogram to select the cluster;
2. Open a menu by right-clicking in the viewer and select the store cluster option;
3. Input the name of the cluster and select a color to label the cluster.
A color bar is displayed along the right side of the cluster.


Options:
1. Cluster genes;
2. Use mean;
3. Number of clusters = 5;
4. Number of iterations = 50;
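The options above map onto the parameters of the standard K-means procedure: pick k initial centers, assign every expression vector to its nearest center, recompute each center from its members, and repeat. The sketch below is a generic version of that procedure and not MeV's implementation. The parameter names (`k`, `iterations`, `use_mean`, `seed`) are hypothetical, and switching `use_mean` to False gives K-medians by using the median instead of the mean for each center. The toy data uses k = 2 rather than 5 to stay small.

```python
import random
from statistics import mean, median

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(vectors, k, iterations=50, use_mean=True, seed=0):
    """Partition `vectors` into `k` clusters; return one cluster index per vector."""
    center_of = mean if use_mean else median   # K-means vs. K-medians
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)           # initial centers: k of the input vectors
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector joins the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:               # nothing moved: converged early
            break
        labels = new_labels
        # Update step: recompute each center coordinate-wise from its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:                        # an empty cluster keeps its old center
                centers[c] = [center_of(column) for column in zip(*members)]
    return labels

# Toy expression vectors forming two obvious groups (made-up values).
vectors = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
           [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9]]

labels = k_means(vectors, k=2, iterations=50, use_mean=True)
print(labels)
```

Unlike hierarchical clustering, the number of clusters must be chosen in advance, and the result can depend on the initial centers.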





Further Reading
The review articles on microarray analysis in the
three special issues of Nature Genetics
a. The Chipping Forecast I (http://www.nature.com/ng/journal/v21/n1s/index.html)
b. The Chipping Forecast II (http://www.nature.com/ng/journal/v32/n4s/index.html)
c. The Chipping Forecast III (http://www.nature.com/ng/journal/v37/n6s/index.html)
Homework
Due Date: Wednesday, February 17. Submit to Dr. Yan Cui via
email (ycui2@uthsc.edu). The solution will
be posted at http://compbio.uthsc.edu/MSCI814/Solution1.htm
on February 18.
Background: Microarrays have shown great promise in studying complex diseases such as cancer. The genome-wide gene expression profiles of tumor tissues are considered the “molecular portraits” of various cancers. For example, the figure below shows a clustering of breast and ovarian carcinoma cases: 68 breast and 57 ovarian cases were co-clustered to discern both similarities and disparities between the two sample sets. The common reference control consisted of equal amounts of mRNA from 11 human cancer cell lines. (Schaner, M. et al., Gene Expression Patterns in Ovarian Carcinomas, Mol Biol Cell. 2003 Nov;14:4376-86).

Data: Download the dataset from http://compbio.uthsc.edu/MSCI814/Homework1.txt
(Right click on the link and select “Save Target As…”). The dataset contains
gene expression profiles of 16 tumor samples. Each of the 16 samples
is associated with one of the two cancers.
1. Analyze the data with hierarchical clustering (HCL)
You should use the average linkage method and the Euclidean distance metric, and cluster experiments only.
Please infer from the dendrogram (clustering tree) the two groups of samples (each associated with a type of cancer).
For example,
Cancer 1: Sample 1, 3, 5, 7, 9, 11, 13, 15
Cancer 2: Sample 2, 4, 6, 8, 10, 12, 14, 16
2. Analyze the data with K-means clustering (KMC)
Use the K-means clustering method to group the 16 samples into two clusters.
For example,
Cluster 1: Sample 1, 3, 5, 7, 9, 11, 13, 15
Cluster 2: Sample 2, 4, 6, 8, 10, 12, 14, 16
What should be included in the email:
1. HCL
Cancer 1: Sample…
Cancer 2: Sample…
2. KMC
Cluster 1: Sample…
Cluster 2: Sample…