Biostatistics of cigarette smoking
Stay away from smoking! This is my analysis of the GSE5056 dataset from the Gene Expression Omnibus (GEO). The dataset is part of a study aimed at understanding how cigarette smoking modifies neuroendocrine cells. Comparing smokers to non-smokers with statistical methods reveals clear trends despite the large volume of gene expression data.
What were the findings?
Genes involved in the following responses are strongly associated with smokers:
- Alcohol detoxification/response (2 genes)
- Inflammatory response (3 genes)
- Oxidative stress response (2 genes)
- Cholesterol homeostasis (2 genes)
- Bile acid binding (1 gene)
- Organophosphorus response (1 gene)
- Immune response (1 gene)
- Toxin response (1 gene)
How is this related to smoking?
- Alcohol detoxification genes are strongly expressed, possibly in response to ethanol and a number of other alcohols found in cigarettes.
- Inflammatory response and oxidative stress are very highly correlated with cigarette smoking.
- Cholesterol homeostasis may be needed because cigarette smoking lowers HDL (good cholesterol).
- Bile acid binding could be associated with bile salt concentration being increased by cigarette smoking.
- An organophosphorus response could be the result of organophosphorus compounds in the pesticides used on the tobacco found in cigarettes.
- Immune response is suppressed by smoking cigarettes.
- Toxin response is triggered by smoking cigarettes.
How was this done?
The bulk of the analysis is done using R. Before loading the dataset into R, some pre-filtering is done to remove the annotations.
Using R, outliers are examined using a correlation plot, a cluster dendrogram and a CV versus mean plot. To further filter the genes, two approaches are taken: remove genes with an average expression value less than 50, and remove genes whose coefficient of variation (CV) falls in the lowest 5% quantile. A t-test with multiple test correction is used for feature selection. Principal Component Analysis (PCA) is used for dimensionality reduction, and a scree plot shows the effectiveness of this approach. Using the previously identified principal components, quadratic discriminant analysis (QDA) is used to classify the samples into their respective smoking or non-smoking classes.
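The original analysis was done in R. As a rough illustration of the two gene filters only, here is a minimal Python sketch. It makes several assumptions that the write-up does not confirm: the filters are applied one after the other, the coefficient of variation is the sample standard deviation divided by the mean, and the 5% cutoff is taken by rank rather than by an interpolated quantile. The function name, gene IDs and numbers are made up for the example.

```python
import statistics


def filter_genes(expr, min_mean=50.0, cv_fraction=0.05):
    """Return the genes that pass both filters.

    expr maps a gene ID to its expression values across samples.
    """
    # Filter 1: drop genes whose average expression is below min_mean.
    kept = {g: v for g, v in expr.items() if statistics.mean(v) >= min_mean}

    # Coefficient of variation = sample standard deviation / mean.
    # The mean is at least min_mean here, so it is nonzero when min_mean > 0.
    cv = {g: statistics.stdev(v) / statistics.mean(v) for g, v in kept.items()}

    # Filter 2 (assumed rank-based cutoff): drop the cv_fraction of genes
    # with the smallest CV, rounding the count down.
    n_drop = int(len(cv) * cv_fraction)
    lowest = set(sorted(cv, key=cv.get)[:n_drop])
    return {g: v for g, v in kept.items() if g not in lowest}


# Hypothetical toy data, not values from GSE5056.
toy = {
    "g_low": [10, 20, 30],      # mean 20, removed by the mean filter
    "g_flat": [100, 101, 99],   # nearly constant, lowest CV
    "g_a": [100, 200, 300],
    "g_b": [60, 120, 90],
    "g_c": [500, 50, 275],
}

# A 25% cutoff is used only so that the tiny toy set loses one gene.
passed = filter_genes(toy, min_mean=50.0, cv_fraction=0.25)
print(sorted(passed))
```

With only a handful of genes, the default 5% cutoff would remove nothing, which is why the toy call uses a larger fraction. On a full expression matrix the defaults mirror the thresholds described above.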
The filtering, feature selection and dimensionality reduction narrowed the analysis from the original 7,129 genes to just 23 genes. Finally, using fold change, the 10 most discriminating genes are selected: the 5 genes with the greatest positive fold change and the 5 genes with the greatest negative fold change. The result of this analysis is summarized in the findings above, which correlate these 10 genes with smokers.
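The fold change selection can be sketched in a few lines of Python. This is an illustration rather than the original R code. It assumes fold change means the log2 ratio of mean smoker expression to mean non-smoker expression, which the write-up does not specify. The function names and the toy values are hypothetical.

```python
import math
import statistics


def log2_fold_change(smoker_vals, nonsmoker_vals):
    """Assumed definition: log2(mean smoker / mean non-smoker)."""
    return math.log2(statistics.mean(smoker_vals) / statistics.mean(nonsmoker_vals))


def top_discriminant(fold_changes, k=5):
    """Return (up, down): up to k genes with the largest positive fold
    change and up to k genes with the most negative fold change."""
    ranked = sorted(fold_changes, key=fold_changes.get)  # ascending
    down = [g for g in ranked[:k] if fold_changes[g] < 0]
    up = [g for g in reversed(ranked[-k:]) if fold_changes[g] > 0]
    return up, down


# Hypothetical toy data, not values from GSE5056.
smokers = {"A": [400, 400], "B": [100, 100], "C": [50, 50], "D": [200, 200]}
nonsmokers = {"A": [100, 100], "B": [100, 100], "C": [200, 200], "D": [100, 100]}

fc = {g: log2_fold_change(smokers[g], nonsmokers[g]) for g in smokers}
up, down = top_discriminant(fc, k=2)
print(up, down)
```

The sign checks keep a gene with no change (such as B in the toy data) out of both lists when fewer than k genes move in a given direction.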
Here is the R code and PDF presentation.
