%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
R DESeq2 패키지 python으로 포팅
R
rpy2를 활용하여 DESeq2 패키지 파이썬에서 사용하기
DESeq2을 파이썬의 rpy2 라이브러리를 통해 포팅하는 방법은 여기를 참고하였다.
Dependencies
- pandas
- rpy2
- tzlocal
- biopython
- ReportLab
- pytest-cov
- bioconductor-deseq2
- codecov
Install Guide
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -q -n diffexpr python=3.6 \
\
pandas tzlocal rpy2 biopython ReportLab pytest-cov
bioconductor-deseq2 codecovconda activate diffexpr # activate diffexpr environment
Rscript setup.R #to install DESeq2 correctly
python setup.py install
Note: 여기에서는 MAQC 데이터의 샘플 A와 B에서 ERCC transcript의 일부 예제를 사용하였다. 아래 실습에서 사용된 파일들은 여기에서 확인하실 수 있다.
필요한 패키지 로드…
= pd.read_table('ercc.txt')
df df.head()
id | A_1 | A_2 | A_3 | B_1 | B_2 | B_3 | |
---|---|---|---|---|---|---|---|
0 | ERCC-00002 | 111461 | 106261 | 107547 | 333944 | 199252 | 186947 |
1 | ERCC-00003 | 6735 | 5387 | 5265 | 13937 | 8584 | 8596 |
2 | ERCC-00004 | 17673 | 13983 | 15462 | 5065 | 3222 | 3353 |
3 | ERCC-00009 | 4669 | 4431 | 4211 | 6939 | 4155 | 3647 |
4 | ERCC-00012 | 0 | 2 | 0 | 0 | 0 | 0 |
그리고 여기에서는 유전자 발현의 정량화된 값이 포함된 테이블(count table)의 샘플을 기반으로 design matrix를 생성한다.
참고로, 샘플 이름은 pd.DataFrame의 인덱스로 사용해야 한다.
= pd.DataFrame({'samplename': df.columns}) \
sample_df 'samplename != "id"')\
.query(= lambda d: d.samplename.str.extract('([AB])_', expand=False)) \
.assign(sample = lambda d: d.samplename.str.extract('_([123])', expand=False))
.assign(replicate = sample_df.samplename
sample_df.index sample_df
samplename | sample | replicate | |
---|---|---|---|
samplename | |||
A_1 | A_1 | A | 1 |
A_2 | A_2 | A | 2 |
A_3 | A_3 | A | 3 |
B_1 | B_1 | B | 1 |
B_2 | B_2 | B | 2 |
B_3 | B_3 | B | 3 |
DESeq2 패키지가 R에서 실행되는 방식과 유사하지만 count table의 유전자 ID인 row.name
대신에 어떤 열이 유전자 ID인지 알려야 한다.
from py_deseq import py_DESeq2
= py_DESeq2(count_matrix = df,
dds = sample_df,
design_matrix = '~ replicate + sample',
design_formula = 'id') # <- telling DESeq2 this should be the gene ID column
gene_column
dds.run_deseq() = ['sample','B','A'])
dds.get_deseq_result(contrast = dds.deseq_result
res res.head()
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating size factors
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating dispersions
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: gene-wise dispersion estimates
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: mean-dispersion relationship
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: final dispersion estimates
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: fitting model and testing
INFO:DESeq2:Using contrast: ['sample', 'B', 'A']
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | id | |
---|---|---|---|---|---|---|---|
ERCC-00002 | 167917.342729 | 0.808857 | 0.047606 | 16.990537 | 9.650176e-65 | 1.102877e-63 | ERCC-00002 |
ERCC-00003 | 7902.634073 | 0.521731 | 0.058878 | 8.861252 | 7.912104e-19 | 4.868987e-18 | ERCC-00003 |
ERCC-00004 | 10567.048228 | -2.330122 | 0.055754 | -41.792764 | 0.000000e+00 | 0.000000e+00 | ERCC-00004 |
ERCC-00009 | 4672.573043 | -0.195660 | 0.061600 | -3.176286 | 1.491736e-03 | 3.616329e-03 | ERCC-00009 |
ERCC-00012 | 0.384257 | -1.565491 | 4.047562 | -0.386774 | 6.989237e-01 | NaN | ERCC-00012 |
#DESeq2 normalized count dds.normalized_count()
INFO:DESeq2:Normalizing counts
A_1 | A_2 | A_3 | B_1 | B_2 | B_3 | id | |
---|---|---|---|---|---|---|---|
ERCC-00002 | 115018.353297 | 122494.471246 | 128809.545168 | 218857.357008 | 207880.854689 | 214443.474968 | ERCC-00002 |
ERCC-00003 | 6949.952086 | 6209.970889 | 6305.915138 | 9133.911628 | 8955.740754 | 9860.313944 | ERCC-00003 |
ERCC-00004 | 18237.045763 | 16119.180051 | 18518.909755 | 3319.456296 | 3361.532701 | 3846.164804 | ERCC-00004 |
ERCC-00009 | 4818.014297 | 5107.922964 | 5043.534405 | 4547.622357 | 4334.937422 | 4183.406812 | ERCC-00009 |
ERCC-00012 | 0.000000 | 2.305540 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ERCC-00012 |
... | ... | ... | ... | ... | ... | ... | ... |
ERCC-00164 | 2.063831 | 1.152770 | 5.988523 | 3.276857 | 1.043306 | 2.294163 | ERCC-00164 |
ERCC-00165 | 269.329992 | 246.692736 | 287.449123 | 513.811202 | 484.094095 | 489.803869 | ERCC-00165 |
ERCC-00168 | 1.031916 | 3.458309 | 0.000000 | 4.587600 | 4.173225 | 1.147082 | ERCC-00168 |
ERCC-00170 | 137.244785 | 148.707304 | 135.340629 | 26.870229 | 10.433062 | 32.118286 | ERCC-00170 |
ERCC-00171 | 8707.304484 | 9622.169484 | 8818.699555 | 7691.439109 | 7691.253592 | 6892.813691 | ERCC-00171 |
92 rows × 7 columns
# show coefficients for GLM dds.comparison
['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']
# from the last cell, we see the arrangement of coefficients,
# so that we can now use "coef" for lfcShrink
# the comparison we want to focus on is 'sample_B_vs_A', so coef = 4 will be used
= dds.lfcShrink(coef=4, method='apeglm')
lfc_res lfc_res.head()
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: using 'apeglm' for LFC shrinkage. If used in published research, please cite:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
sequence count data: removing the noise and preserving large differences.
Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
id | baseMean | log2FoldChange | lfcSE | pvalue | padj | |
---|---|---|---|---|---|---|
0 | ERCC-00002 | 167917.342729 | 0.807316 | 0.047609 | 9.650176e-65 | 1.102877e-63 |
1 | ERCC-00003 | 7902.634073 | 0.519944 | 0.058823 | 7.912104e-19 | 4.868987e-18 |
2 | ERCC-00004 | 10567.048228 | -2.328037 | 0.055783 | 0.000000e+00 | 0.000000e+00 |
3 | ERCC-00009 | 4672.573043 | -0.194594 | 0.061466 | 1.491736e-03 | 3.616329e-03 |
4 | ERCC-00012 | 0.384257 | -0.052326 | 0.820696 | 6.989237e-01 | NaN |