R DESeq2 패키지 python으로 포팅

rpy2를 활용하여 DESeq2 패키지 파이썬에서 사용하기

저자

William Jeong

공개

2022년 2월 18일

DESeq2을 파이썬의 rpy2 라이브러리를 통해 포팅하는 방법은 여기를 참고하였다.

Dependencies

pandas
rpy2
tzlocal
biopython
ReportLab
pytest-cov
bioconductor-deseq2
codecov

Install Guide

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -q -n diffexpr python=3.6 \
    pandas tzlocal rpy2 biopython ReportLab pytest-cov \
    bioconductor-deseq2 codecov
conda activate diffexpr # activate diffexpr environment
Rscript setup.R #to install DESeq2 correctly 
python setup.py install

Note: 여기에서는 MAQC 데이터의 샘플 A와 B에서 ERCC transcript의 일부 예제를 사용하였다. 아래 실습에서 사용된 파일들은 여기에서 확인하실 수 있다.

필요한 패키지 로드…

%load_ext autoreload
%autoreload 2
import pandas as pd 
import numpy as np

df = pd.read_table('ercc.txt')
df.head()

	id	A_1	A_2	A_3	B_1	B_2	B_3
0	ERCC-00002	111461	106261	107547	333944	199252	186947
1	ERCC-00003	6735	5387	5265	13937	8584	8596
2	ERCC-00004	17673	13983	15462	5065	3222	3353
3	ERCC-00009	4669	4431	4211	6939	4155	3647
4	ERCC-00012	0	2	0	0	0	0

그리고 여기에서는 유전자 발현의 정량화된 값이 포함된 테이블(count table)의 샘플을 기반으로 design matrix를 생성한다.
참고로, 샘플 이름은 pd.DataFrame의 인덱스로 사용해야 한다.

sample_df = pd.DataFrame({'samplename': df.columns}) \
        .query('samplename != "id"')\
        .assign(sample = lambda d: d.samplename.str.extract('([AB])_', expand=False)) \
        .assign(replicate = lambda d: d.samplename.str.extract('_([123])', expand=False)) 
sample_df.index = sample_df.samplename
sample_df

	samplename	sample	replicate
samplename
A_1	A_1	A	1
A_2	A_2	A	2
A_3	A_3	A	3
B_1	B_1	B	1
B_2	B_2	B	2
B_3	B_3	B	3

DESeq2 패키지가 R에서 실행되는 방식과 유사하지만 count table의 유전자 ID인 row.name 대신에 어떤 열이 유전자 ID인지 알려야 한다.

from py_deseq import py_DESeq2

dds = py_DESeq2(count_matrix = df,
               design_matrix = sample_df,
               design_formula = '~ replicate + sample',
               gene_column = 'id') # <- telling DESeq2 this should be the gene ID column
    
dds.run_deseq() 
dds.get_deseq_result(contrast = ['sample','B','A'])
res = dds.deseq_result 
res.head()

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating size factors

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating dispersions

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: gene-wise dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: mean-dispersion relationship

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: final dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: fitting model and testing

INFO:DESeq2:Using contrast: ['sample', 'B', 'A']

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	id
ERCC-00002	167917.342729	0.808857	0.047606	16.990537	9.650176e-65	1.102877e-63	ERCC-00002
ERCC-00003	7902.634073	0.521731	0.058878	8.861252	7.912104e-19	4.868987e-18	ERCC-00003
ERCC-00004	10567.048228	-2.330122	0.055754	-41.792764	0.000000e+00	0.000000e+00	ERCC-00004
ERCC-00009	4672.573043	-0.195660	0.061600	-3.176286	1.491736e-03	3.616329e-03	ERCC-00009
ERCC-00012	0.384257	-1.565491	4.047562	-0.386774	6.989237e-01	NaN	ERCC-00012

dds.normalized_count() #DESeq2 normalized count

INFO:DESeq2:Normalizing counts

	A_1	A_2	A_3	B_1	B_2	B_3	id
ERCC-00002	115018.353297	122494.471246	128809.545168	218857.357008	207880.854689	214443.474968	ERCC-00002
ERCC-00003	6949.952086	6209.970889	6305.915138	9133.911628	8955.740754	9860.313944	ERCC-00003
ERCC-00004	18237.045763	16119.180051	18518.909755	3319.456296	3361.532701	3846.164804	ERCC-00004
ERCC-00009	4818.014297	5107.922964	5043.534405	4547.622357	4334.937422	4183.406812	ERCC-00009
ERCC-00012	0.000000	2.305540	0.000000	0.000000	0.000000	0.000000	ERCC-00012
...	...	...	...	...	...	...	...
ERCC-00164	2.063831	1.152770	5.988523	3.276857	1.043306	2.294163	ERCC-00164
ERCC-00165	269.329992	246.692736	287.449123	513.811202	484.094095	489.803869	ERCC-00165
ERCC-00168	1.031916	3.458309	0.000000	4.587600	4.173225	1.147082	ERCC-00168
ERCC-00170	137.244785	148.707304	135.340629	26.870229	10.433062	32.118286	ERCC-00170
ERCC-00171	8707.304484	9622.169484	8818.699555	7691.439109	7691.253592	6892.813691	ERCC-00171

92 rows × 7 columns

dds.comparison # show coefficients for GLM

['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']

# from the last cell, we see the arrangement of coefficients, 
# so that we can now use "coef" for lfcShrink
# the comparison we want to focus on is 'sample_B_vs_A', so coef = 4 will be used
lfc_res = dds.lfcShrink(coef=4, method='apeglm')
lfc_res.head()

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

	id	baseMean	log2FoldChange	lfcSE	pvalue	padj
0	ERCC-00002	167917.342729	0.807316	0.047609	9.650176e-65	1.102877e-63
1	ERCC-00003	7902.634073	0.519944	0.058823	7.912104e-19	4.868987e-18
2	ERCC-00004	10567.048228	-2.328037	0.055783	0.000000e+00	0.000000e+00
3	ERCC-00009	4672.573043	-0.194594	0.061466	1.491736e-03	3.616329e-03
4	ERCC-00012	0.384257	-0.052326	0.820696	6.989237e-01	NaN

라이센스

https://creativecommons.org/licenses/by/4.0/deed.ko