R DESeq2 패키지 python으로 포팅

R
rpy2를 활용하여 DESeq2 패키지 파이썬에서 사용하기
저자

William Jeong

공개

2022년 2월 18일

DESeq2을 파이썬의 rpy2 라이브러리를 통해 포팅하는 방법은 여기를 참고하였다.

Dependencies

  1. pandas
  2. rpy2
  3. tzlocal
  4. biopython
  5. ReportLab
  6. pytest-cov
  7. bioconductor-deseq2
  8. codecov

Install Guide

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -q -n diffexpr python=3.6 \
    pandas tzlocal rpy2 biopython ReportLab pytest-cov \
    bioconductor-deseq2 codecov
conda activate diffexpr # activate diffexpr environment
Rscript setup.R #to install DESeq2 correctly 
python setup.py install

Note: 여기에서는 MAQC 데이터의 샘플 A와 B에서 ERCC transcript의 일부 예제를 사용하였다. 아래 실습에서 사용된 파일들은 여기에서 확인하실 수 있다.

필요한 패키지 로드…

%load_ext autoreload
%autoreload 2
import pandas as pd 
import numpy as np
df = pd.read_table('ercc.txt')
df.head()
id A_1 A_2 A_3 B_1 B_2 B_3
0 ERCC-00002 111461 106261 107547 333944 199252 186947
1 ERCC-00003 6735 5387 5265 13937 8584 8596
2 ERCC-00004 17673 13983 15462 5065 3222 3353
3 ERCC-00009 4669 4431 4211 6939 4155 3647
4 ERCC-00012 0 2 0 0 0 0

그리고 여기에서는 유전자 발현의 정량화된 값이 포함된 테이블(count table)의 샘플을 기반으로 design matrix를 생성한다.
참고로, 샘플 이름은 pd.DataFrame의 인덱스로 사용해야 한다.

sample_df = pd.DataFrame({'samplename': df.columns}) \
        .query('samplename != "id"')\
        .assign(sample = lambda d: d.samplename.str.extract('([AB])_', expand=False)) \
        .assign(replicate = lambda d: d.samplename.str.extract('_([123])', expand=False)) 
sample_df.index = sample_df.samplename
sample_df
samplename sample replicate
samplename
A_1 A_1 A 1
A_2 A_2 A 2
A_3 A_3 A 3
B_1 B_1 B 1
B_2 B_2 B 2
B_3 B_3 B 3

DESeq2 패키지가 R에서 실행되는 방식과 유사하지만 count table의 유전자 ID인 row.name 대신에 어떤 열이 유전자 ID인지 알려야 한다.

from py_deseq import py_DESeq2

dds = py_DESeq2(count_matrix = df,
               design_matrix = sample_df,
               design_formula = '~ replicate + sample',
               gene_column = 'id') # <- telling DESeq2 this should be the gene ID column
    
dds.run_deseq() 
dds.get_deseq_result(contrast = ['sample','B','A'])
res = dds.deseq_result 
res.head()
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating size factors

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: estimating dispersions

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: gene-wise dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: mean-dispersion relationship

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: final dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: fitting model and testing

INFO:DESeq2:Using contrast: ['sample', 'B', 'A']
baseMean log2FoldChange lfcSE stat pvalue padj id
ERCC-00002 167917.342729 0.808857 0.047606 16.990537 9.650176e-65 1.102877e-63 ERCC-00002
ERCC-00003 7902.634073 0.521731 0.058878 8.861252 7.912104e-19 4.868987e-18 ERCC-00003
ERCC-00004 10567.048228 -2.330122 0.055754 -41.792764 0.000000e+00 0.000000e+00 ERCC-00004
ERCC-00009 4672.573043 -0.195660 0.061600 -3.176286 1.491736e-03 3.616329e-03 ERCC-00009
ERCC-00012 0.384257 -1.565491 4.047562 -0.386774 6.989237e-01 NaN ERCC-00012
dds.normalized_count() #DESeq2 normalized count
INFO:DESeq2:Normalizing counts
A_1 A_2 A_3 B_1 B_2 B_3 id
ERCC-00002 115018.353297 122494.471246 128809.545168 218857.357008 207880.854689 214443.474968 ERCC-00002
ERCC-00003 6949.952086 6209.970889 6305.915138 9133.911628 8955.740754 9860.313944 ERCC-00003
ERCC-00004 18237.045763 16119.180051 18518.909755 3319.456296 3361.532701 3846.164804 ERCC-00004
ERCC-00009 4818.014297 5107.922964 5043.534405 4547.622357 4334.937422 4183.406812 ERCC-00009
ERCC-00012 0.000000 2.305540 0.000000 0.000000 0.000000 0.000000 ERCC-00012
... ... ... ... ... ... ... ...
ERCC-00164 2.063831 1.152770 5.988523 3.276857 1.043306 2.294163 ERCC-00164
ERCC-00165 269.329992 246.692736 287.449123 513.811202 484.094095 489.803869 ERCC-00165
ERCC-00168 1.031916 3.458309 0.000000 4.587600 4.173225 1.147082 ERCC-00168
ERCC-00170 137.244785 148.707304 135.340629 26.870229 10.433062 32.118286 ERCC-00170
ERCC-00171 8707.304484 9622.169484 8818.699555 7691.439109 7691.253592 6892.813691 ERCC-00171

92 rows × 7 columns

dds.comparison # show coefficients for GLM
['Intercept', 'replicate_2_vs_1', 'replicate_3_vs_1', 'sample_B_vs_A']
# from the last cell, we see the arrangement of coefficients, 
# so that we can now use "coef" for lfcShrink
# the comparison we want to focus on is 'sample_B_vs_A', so coef = 4 will be used
lfc_res = dds.lfcShrink(coef=4, method='apeglm')
lfc_res.head()
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
id baseMean log2FoldChange lfcSE pvalue padj
0 ERCC-00002 167917.342729 0.807316 0.047609 9.650176e-65 1.102877e-63
1 ERCC-00003 7902.634073 0.519944 0.058823 7.912104e-19 4.868987e-18
2 ERCC-00004 10567.048228 -2.328037 0.055783 0.000000e+00 0.000000e+00
3 ERCC-00009 4672.573043 -0.194594 0.061466 1.491736e-03 3.616329e-03
4 ERCC-00012 0.384257 -0.052326 0.820696 6.989237e-01 NaN

송금하실 때 메세지에 어떤 글을 보았는지 남겨주시면 게시글에 남겨드립니다.
기부해주신 돈은 커피를 마시는데 사용됩니다.