The simulation in Fig. 1A was carried out with N = 50 and with β − α following a Gaussian distribution with a standard deviation of 0 (no PCR bias) or 1 (high PCR bias). The simulation indicated that the variance of the distribution was highly dependent on PCR bias and that the mode of the distribution corresponded to the composition of SNP alleles. Allele frequencies in several sets of RNA-seq data from various cell types, obtained from public databases, were examined, and the results agreed with the simulation (Fig. 1B). Peaks at 0 and 100% might result from homozygous SNPs in the observed cells. An artificial contamination scenario was also generated by randomly sampling RNA-seq datasets from two cell categories, pure C57BL/6 (B6) hematopoietic stem cells (HSCs) and embryonic stem cells with a mixed 129 and B6 background (129B6F1 ESCs), at various ratios. The curve shape and peak positions varied with the ratio, as in the mathematical simulation (Fig. 1C, gray line).
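As a rough illustration of how a simulation of this kind might be set up, the sketch below draws N = 50 reads per SNP and perturbs the allele probability with a Gaussian bias term. How β − α enters the distribution is not restated in this passage, so the odds scaling by 2**b, the function name simulate_allele_frequencies, and the number of simulated SNPs are assumptions made for illustration, not the author's implementation.

```python
import random
import statistics


def simulate_allele_frequencies(alt_fraction, n_reads=50, bias_sd=0.0,
                                n_snps=5000, seed=0):
    """Return simulated alternative-allele frequencies, one per SNP.

    alt_fraction must lie strictly between 0 and 1.
    Assumption: PCR bias multiplies the odds of the alternative allele by
    2**b with b ~ Normal(0, bias_sd). This stands in for the beta - alpha
    term and is not taken from the paper.
    """
    rng = random.Random(seed)
    freqs = []
    for _ in range(n_snps):
        b = rng.gauss(0.0, bias_sd) if bias_sd > 0 else 0.0
        odds = alt_fraction / (1.0 - alt_fraction) * 2.0 ** b
        p = odds / (1.0 + odds)
        # Binomial draw: count reads that carry the alternative allele.
        alt_reads = sum(rng.random() < p for _ in range(n_reads))
        freqs.append(alt_reads / n_reads)
    return freqs


no_bias = simulate_allele_frequencies(0.5, bias_sd=0.0)
high_bias = simulate_allele_frequencies(0.5, bias_sd=1.0)
print(statistics.pvariance(no_bias), statistics.pvariance(high_bias))
```

Under these assumptions the centre of the distribution stays near the allele composition while the spread widens as the bias term grows, which is the qualitative behaviour described for Fig. 1A.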
Figure 1. Allele frequency analysis of RNA-seq data. (A) Simulation of SNP allele frequencies using a modified binomial distribution. Peak position was determined by the composition of two alleles, and variance of the distribution was dependent on sd, standard deviation of simulated PCR bias. (B) SNP distributions in several cell types. ESCs (red, SRR1047502, 129B6F1 background), iPSs derived from fibroblasts (yellow, SRR1047504, 129B6F1), MEFs (blue, SRR104220, 129B6F1), normal fibroblasts (NFs; green, SRR1191170, B6 x BALB/c), cancer-associated fibroblasts (CAFs; purple, SRR1191171, B6 x BALB/c), and HSCs (gray, SRR892995, B6). The number of applied SNPs for each cell type is shown in parentheses in each box. (C) Allele frequency of HSC samples contaminated with different percentages of ESCs as shown.
Re-analysis of STAP paper: Genotype analysis of fibroblast growth factor-induced stem cells (FI-SCs)
This study examined how SNP allele frequencies in RNA-seq data can be used to show properties of the dataset. Obokata et al. recently reported the phenomenon of STAP, the induced cellular reprogramming of committed somatic cells into pluripotent stem cells that can produce embryonic and placental tissues when injected into blastocysts (Obokata et al. 2014a,b). The allele frequency approach described above was used to examine the NGS dataset provided by the researchers. Allele frequencies between reference allele (equivalent to B6 genotype for dbSNP) and alternative allele (corresponding to 129 genotype in this study) were examined in RNA-seq data from seven replicate experiments obtained using the TruSeq reagent (Figs 2A and S1 in Supporting Information).
The allele distributions of six of the seven experiments showed the equal representation of parental chromosomes expected from Fig. 1A. For the experiments that involved ESCs, STAP cells, and STAP stem cells (STAP-SCs), there were no 0% peaks (Fig. S1 in Supporting Information), possibly because the cells were obtained from mice backcrossed in the laboratory, which may have a different genotype from those in the public database (J. Sharif and K. Isono, personal communication).
Figure 2. SNPs detected in FI-SC mRNAs indicating contamination. (A) Allele distributions obtained from ESC and FI-SC RNA-seq experiments used in the STAP paper. Both ESCs (blue) and FI-SCs (red) are annotated as having a 129B6F1 genetic background. The number of applied SNPs for each experiment is shown in parentheses in the boxes. (B) SNPs detected in Sall4 and Klf4, which are highly expressed in ESCs. B6-type alleles are shown in blue and 129-type alleles (i.e., non-B6) are in yellow. (C) SNPs detected in the TSC-specific genes Elf5 and Sox21. (D) The number of homozygous/heterozygous SNPs observed in the stem cells used in the original paper. Only the composition observed in FI-SCs would be predicted to affect gene expression. P-values were calculated using Fisher's exact test of genotype distribution between TSC-specific genes and ESC-specific genes. Rep1 and rep2 denote two replicated experiments. (E) Heatmap of representative cytokine and extracellular matrix genes that are highly expressed in MEFs. Normalized log ratios of fragments per kilobase of exon per million reads (FPKM) against the medians of all samples are shown.
Surprisingly, FI-SCs that were annotated as coming from the F1 129Sv (129) and B6 cell populations did not show the allele distribution pattern of unbiased nonimprinting genes (Fig. S1 in Supporting Information). The distribution was more similar to that of cells with unequal chromosomes. These FI-SCs were reported to be induced from STAP cells with Fgf4 and to have characteristics similar to trophoblast stem cells (TSCs), such as their gene expression profiles and potential to contribute to the placenta (Obokata et al. 2014a).
The obvious difference in the FI-SC curve from the 129B6F1 genotype, combined with the fact that the majority of SNPs were similar to B6, suggested that the FI-SCs originated from neonatal mice of a nearly pure B6 background. Further analysis of gene expression patterns suggested that the heterogeneity of SNPs between B6-type allele and non-B6 could be caused by the expression characteristics of genes. As shown in Fig. 2B, SNPs expected to be heterogeneous between 129 (i.e., non-B6) and B6 were examined in several ESC marker genes. ESCs carried alleles from both the 129 and B6 backgrounds at these loci, but the FI-SCs, although described as having the same genetic background as the ESCs (Obokata et al. 2014a), carried only SNPs from B6. This dominance of the B6 genotype was not observed in TSC marker genes (Fig. 2C).
The FI-SC specificity was not limited to the genes shown in Fig. 2B and C. When all heterogeneous SNPs were classified into three groups, SNPs in ESC-specific genes, SNPs in TSC-specific genes, and SNPs in other genes, only FI-SCs had widely heterozygous SNPs for these groups (Fig. 2D). If all included cells in a sample share the same cellular features, one would not expect to see this phenomenon of particular gene sets having different genotypes.
Because the FI-SCs showed a specific genotype at some TSC markers, they may have been contaminated with TSCs. Feeder cells, however, could be another source of contamination, as the FI-SCs were cultured with mouse embryonic fibroblast (MEF) feeder cells whose genotype was not described in the original paper. For this study, the expression of marker genes for MEFs was examined and compared with the expression of ESC and TSC markers, and the results indicated the absence of expression of these MEF genes in FI-SCs (Fig. 2E). The probability of contamination by MEFs is therefore negligible, and the most likely explanation for the skewed distribution of allele frequencies detected in the duplicated RNA-seq experiments is that the FI-SC population originated from two cell types: ESC-like cells having a B6 genetic background and TSC-like cells having a genotype similar to that of CD1, which is a mouse strain other than B6 and 129.
Re-analysis of a retracted paper: Detection of chromosomal aberrations
SNP distribution analysis can also be applied to detect aneuploidy. When allele frequencies are examined for each chromosome, abnormal chromosomes are expected to have skewed distributions if the paired chromosomes are not present in equal copy numbers. As shown in Fig. 3A, chromosomal analysis showed an abnormality of chromosome 8 in the STAP cells used in the original study. We can expect the peak allele frequency to occur at approximately 50% if a cell has two chromosomes from parents of different strains. The peak allele frequency of the STAP cells, however, was at approximately 33%, meaning that one of the two chromosomes appears to be duplicated, leading to three copies of chromosome 8.
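The arithmetic behind the 50% and 33% peaks can be written out as a small sketch. It assumes that every chromosome copy is expressed equally and that the plotted value is the frequency of the reference (B6) allele; the function name expected_ref_allele_peak is hypothetical.

```python
from fractions import Fraction


def expected_ref_allele_peak(ref_copies, alt_copies):
    """Expected reference-allele frequency at a heterozygous SNP.

    Idealization: each chromosome copy contributes the same number of reads.
    """
    total = ref_copies + alt_copies
    if total == 0:
        raise ValueError("at least one chromosome copy is required")
    return Fraction(ref_copies, total)


diploid = expected_ref_allele_peak(1, 1)       # one reference, one alternative copy
extra_alt = expected_ref_allele_peak(1, 2)     # alternative-strain copy gained
extra_ref = expected_ref_allele_peak(2, 1)     # reference-strain copy gained
print(float(diploid), float(extra_alt), float(extra_ref))
```

Under this idealization a peak near one third or two thirds points to a third copy, and the direction of the shift suggests which parental chromosome was gained.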
The STAP cells used in the experiments were derived from 129 and B6 cells, and the duplicated chromosome is presumed to be from the 129 parent because it contained only non-B6 SNP alleles. It is notable that trisomy 8 is the most common chromosomal aberration in mouse ESCs, with 31 of 97 examined cell lines reported to carry this abnormality (Kim et al. 2013). ESCs having trisomy 8 have a growth advantage, but chimeras will not transmit the mutation to the germ line (Liu et al. 1997), and trisomy 8 in mice results in prenatal death by day 12 or 13 (Gropp 1982).
Figure 3. Trisomy detected by SNP analysis of RNA-seq data. (A) Allele frequency distributions of whole chromosomes and chromosome 8. Only chromosome 8 of STAP cells had a peak that was not centered at approximately 50%, indicating that chromosome 8 originating from the 129 strain was duplicated to produce trisomy of the chromosome. Unlike the RNA-seq data analyzed in Figs 1 and 2, the RNA-seq data analyzed in this figure were generated using the SMARTer reagent kit. (B) Expression analysis by chromosome. Only genes on chromosome 8 were significantly more expressed, and genes on chromosome 13 were significantly less expressed. P-values were calculated using two-sided Student t-tests.
Because aneuploidy detection using transcriptome analysis has been reported (Mayshar et al. 2010; Ben-David et al. 2013), the expression of genes on each chromosome was analyzed in this study. Genes on chromosome 8 and chromosome 13 had significantly different expression patterns between STAP cells and ESCs. Chromosome 8 gene expression was 1.3 times higher in STAP cells than in ESCs (P-value = 2.89 × 10^-25; Fig. 3B). This result is concordant with trisomy of chromosome 8 as detected by SNP analysis. The SNP allele frequency method used here can, therefore, be used to detect aneuploidy if we know the SNP genotype of the cells and if the control cells have a normal karyotype.
Availability of SNP allele frequencies
The SNP allele frequency method detected contaminating cells in samples used to generate RNA-seq data, but the sensitivity of this method is dependent on the depth of sequence reads. Based on the limited number of RNA-seq datasets here, the graphs become too noisy to detect differences in the distributions if the number of available SNPs is fewer than approximately one thousand. For example, the MEF data, with only 5948 SNPs, generate noticeably more spikes than the ESC data with 23 838 SNPs. SNPs assigned to each chromosome also suggested that the distributions could be very noisy when the number of SNPs becomes <1000 (Fig. S2 in Supporting Information). Cover ratios are also important for the sensitivity because the variance of a binomial distribution depends on the number of trials. Therefore, when the average cover ratio is required to be 20× for all genes, the required read count will be approximately 1.3 × 10^9 nucleotides, roughly corresponding to the read count of approximately 1.1 × 10^9 bases that was used to derive the distribution for MEFs (Fig. 1B).
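The dependence on the number of trials can be made concrete with the standard binomial formula. The sketch below ignores PCR bias and assumes that reads are sampled independently; the function name binomial_frequency_sd is hypothetical.

```python
import math


def binomial_frequency_sd(depth, p=0.5):
    """Standard deviation of an observed allele frequency when `depth`
    reads are drawn independently with allele probability p."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return math.sqrt(p * (1.0 - p) / depth)


# Spread of the observed frequency at a heterozygous SNP for several depths.
for depth in (10, 20, 50, 100):
    print(depth, round(binomial_frequency_sd(depth), 3))
```

At a depth of 20 reads the sampling spread alone is about 11 percentage points around a 50% peak, so shallower sites blur the distribution quickly.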
Figure 1B also indicates some interesting aspects of the genome stability of induced pluripotent stem (iPS) cells. The iPS cells generated from 129B6F1 had more homozygous SNPs of B6-type alleles. This may be the result of cellular contamination, as noted earlier, or of differences in the properties of the cells used in the two experiments, but there is also the intriguing possibility that the experimental process induced a transition of genotypes. As iPS cell engineering has been reported to induce genomic and/or epigenomic instability (Hussein et al. 2011; Chang et al. 2014), it will be important to examine allele frequencies of iPS cells in future studies.
SNP frequency differences detected in RNA-seq or other NGS experiments are probably not more accurate than those detected by direct observation of karyotypes; however, the method described in this study can be used to detect chromosomal abnormalities even if the test cells are no longer available, and from both genotype (SNPs) and phenotype (gene expression), which is expected to provide more reliable evidence than genotype or phenotype alone.
One advantage of the SNP allele frequency method of this study is its potential ability to detect chromosome duplications and determine their parental origin. The other advantage over virtual karyotyping (Ben-David et al. 2013) is that it does not require control cells without aneuploidy, because the simulated models predict that the allele frequencies of diploid cells have peaks at approximately 50% (Fig. 1A).
Contamination in FI-SC samples
Examination of allele frequencies in RNA-seq fragments enables detection of specific characteristics of the sampled cells. When a set of genes specifically expressed in a particular cell type shows a distinct genotype, the origins of the cells can be assumed. This analysis of SNPs in mRNA sequence reads from the Obokata et al. study showed that the origins of the STAP and FI-SC samples were likely different than originally reported. The simulation showed that the mode of the distribution depends both on the cellular composition of the sample and possible variance caused by PCR bias. When we ignore the PCR bias of binomial distribution, the ratio of contaminating cells can be roughly estimated by the peak position in the distribution.
The most likely cell type to be contaminating the FI-SCs in the original study, TSCs, had many more heterozygous (B6/non-B6) alleles than non-B6 homozygous alleles. SNPs of FI-SCs were counted at sites where B6 and 129 have different nucleotides, and the duplicated experiments yielded 6859 and 7243 heterozygous SNPs and 24 and 14 non-B6 homozygous alleles, respectively. The percentage of contaminating cells can be estimated from the observed genotypes: the peak allele frequencies were at approximately 95–96%, which is the position expected when approximately 10% of the population consists of contaminating cells. This result was concordant with the expression of TSC marker genes. All of the TSC marker genes examined were expressed in FI-SCs at approximately 10% of their levels in TSCs (Fig. 4A).
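One way to read a peak position as a contamination estimate is sketched below. It assumes that the main population is homozygous at the counted SNPs, that the contaminating cells are heterozygous there, and that both cell types contribute reads in proportion to their numbers; PCR bias and expression differences are ignored. The function name contaminant_fraction and its parameters are hypothetical.

```python
def contaminant_fraction(major_allele_peak, contaminant_minor_share=0.5):
    """Estimate the fraction of reads that come from contaminating cells.

    major_allele_peak: peak frequency of the allele carried by the main
        population (for example 0.95 for the B6 allele).
    contaminant_minor_share: share of the other allele within the
        contaminating cells; 0.5 means heterozygous, 1.0 homozygous.
    """
    if not 0.0 < contaminant_minor_share <= 1.0:
        raise ValueError("contaminant_minor_share must be in (0, 1]")
    fraction = (1.0 - major_allele_peak) / contaminant_minor_share
    # Clamp to a valid proportion in case of noisy peak estimates.
    return min(max(fraction, 0.0), 1.0)


for peak in (0.95, 0.96):
    print(peak, round(contaminant_fraction(peak), 2))
```

With heterozygous contaminants, a peak at 95% corresponds to roughly 10% contaminating cells and a peak at 96% to roughly 8%, in line with the estimate of approximately 10%.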
Figure 4. Examination of incorporated mRNAs in FI-SCs. (A) Expression of TSC marker genes in ESCs, TSCs, and FI-SCs. The solid line indicates the average TSC gene expression and dashed line indicates 10% of the average. (B) Heterozygous SNPs detected in Des, Grb2, Setd7, Fbxo21, and Chd4 where B6 and 129 share the same SNPs but TSCs did not. (C) Gene-wise analysis of distribution of heterozygous/homozygous SNPs having TSC-specific alleles.
FI-SC-specific SNPs were also examined where B6- and 129-derived sequences shared the same nucleotide but FI-SC-derived ones did not. Most of these SNPs matched with the CD1 background of the TSCs used in the experiment and were different from B6 or 129. Gene-by-gene representation (Fig. 4B) and whole-SNP analysis (Fig. 4C) showed that FI-SCs shared TSC-specific SNPs. As the proportion of contaminating TSCs in the FI-SCs was minor, most of the TSC-specific SNPs appeared heterozygous. These results support the view that the RNA-seq data contained transcripts from two major cell populations, with approximately 90% B6 cells having an ESC-like expression pattern and approximately 10% CD1-like cells having a TSC-like pattern. Therefore, the claim in the Obokata et al. paper that FI-SCs contribute to the placenta might have been based in error on contamination by TSCs, which are known to be organ stem cells (Tanaka et al. 1998).
Interpretation of the STAP phenomenon
Detection of aneuploidy using SNPs also suggested contamination in the original experiments. Both genotype and phenotype analyses suggested that the STAP cells used in the Obokata et al. experiments had trisomy of chromosome 8, and a transcription assay indicated atypical expression of genes on chromosome 13. Trisomy 8 is the most common chromosomal abnormality in mice, and chromosome 8 has been reported to fuse to the end of chromosome 13 (Kim et al. 2013). These observations might explain the results of virtual karyotyping in Fig. 3B. The RNA-seq data used in Fig. 3 were annotated as being derived from neonatal mouse spleen cells cultured in conditions under which the STAP cells did not grow. This description does not agree with the dominance of cells having trisomy because mice carrying pure trisomy 8 are embryonic lethal. This therefore leads to the conclusion that the cells were cultured cells that possessed expression characteristics that were very similar to those of ESCs.
The SNP allele frequency method described here is limited by the fact that if the contaminating and examined cells share a common genetic background, the allele frequencies will not be different enough to detect the contamination. However, this method is, in principle, applicable to any RNA-seq data that contain polymorphisms and will be useful for both prospective and retrospective quality control of RNA-seq experiments, especially for studies using cultured cells such as ESCs and iPS cells and their derivatives.
Dataset
Mouse variation data were obtained from the Sanger Mouse Genomes Project (http://www.sanger.ac.uk/resources/mouse/genomes/), and the version 137 VCF-formatted dataset was retrieved from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/).
The locations of genes and included exons were obtained from iGenomes (http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn). SNPs outside exons were excluded from the original VCF file using this annotation, and thus, 1 016 227 SNPs were used for the entire dataset.
Raw sequencing data from the original RNA-seq experiments examined in this study were downloaded from the short read archive (SRA) at NCBI. The accession number of the project is SRP038104. Genome sequences of mouse (version 38, mm10) obtained from B6 mouse strain were downloaded from NCBI GenBank and encoded into a bowtie database using the bowtie-build (for colored space fastq files) or bowtie2-build (for fastq files) program. The accession numbers (i.e., SRA ID) of the RNA-seq experiments are listed in Table S1 (Supporting Information), and the checksums of archived sequences were confirmed by Dr Teruhiko Wakayama, Yamanashi University, one of the corresponding authors of the original paper.
The sra-format files from the SRA database were converted into fastq format using sratoolkit.2.3.4-2. The Bowtie2 (version 2.1.0, bowtie-bio.sourceforge.net/bowtie2/index.shtml) and tophat2 (tophat.cbcb.umd.edu) programs were used for sequence alignment. Because this study did not consider the structure of the mRNAs, all sequences were fragmented into 50-bp reads and aligned using tophat or tophat2 with the parameters “--no-coverage-search -G genes.gtf”, allowing two mismatches. The tophat program was used only to analyze SOLiD colored space fastq files. Levels of gene expression were evaluated with fragments per kilobase of exon per million reads (FPKM) values calculated using cufflinks (version 2.1.1). A program written in C++ was developed to detect and enumerate SNP alleles in BAM files. The program is open-source software available in a public repository (github.com/takaho/snpexp/).
SNP identification and heterozygosity tests
Sequence fragments aligned on the mouse genome were analyzed using the SNP detection and enumeration program mentioned above, and only SNPs with cover ratios ≥20 were retained. If ≥95% of the SNP sequence was the same on the two strands, the allele was designated as homozygous.
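The filtering and homozygosity rule can be sketched as a small classifier. This is an illustration only, not the C++ program mentioned above: it treats the 95% criterion as a share of reads carrying one allele, ignores reads with any third base, and uses the hypothetical name classify_snp.

```python
def classify_snp(ref_reads, alt_reads, min_depth=20, homozygous_share=0.95):
    """Classify one SNP site from its allele read counts.

    Returns None when coverage is below min_depth, otherwise
    'ref_homozygous', 'alt_homozygous', or 'heterozygous'.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None  # site discarded: too few reads
    if ref_reads >= homozygous_share * depth:
        return "ref_homozygous"
    if alt_reads >= homozygous_share * depth:
        return "alt_homozygous"
    return "heterozygous"


print(classify_snp(10, 5))    # below the depth threshold
print(classify_snp(38, 2))    # 95% of reads carry the reference allele
print(classify_snp(12, 13))   # both alleles well represented
```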
Genomewide heterozygosity between B6 and 129 was analyzed using a subset of the SNPs described above with different alleles between B6 and 129. SNPs were classified by known expression characteristics (TSC-specific, ESC-specific or other) and genotypes (B6-type homozygous, 129-type homozygous or other). SNP distributions were tested with P-values obtained using Fisher's exact test with Monte-Carlo Markov chain approximation.
Expression analysis by chromosome to identify trisomy
FPKM values originating from ESCs (SRR1171574 and SRR1171575) and STAP cells (SRR1171578 and SRR1171579) were calculated, and genes having >0.01 FPKM in all four original experiments were selected. Genes, excluding pseudogenes, were grouped by chromosome, and the log ratio of the averages of the two experiments using the same cells was determined for the genes on each chromosome. The distribution of relative FPKM values was evaluated using a two-sided t-test against the mean log ratio of all genes.
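A minimal sketch of the per-chromosome comparison is given below, using only the Python standard library. The input layout (a mapping from chromosome name to per-gene log ratios), the function name chromosome_t_statistics, and the toy numbers are all assumptions for illustration; converting the t statistic into a P-value would need a t-distribution function such as scipy.stats.t.sf, which is not used here.

```python
import math
import statistics


def chromosome_t_statistics(log_ratios_by_chrom):
    """One-sample t statistic of each chromosome's log ratios against the
    mean log ratio of all genes.

    Returns a dict: chromosome -> (mean log ratio, t statistic, degrees of freedom).
    """
    all_values = [v for values in log_ratios_by_chrom.values() for v in values]
    genome_mean = statistics.fmean(all_values)
    results = {}
    for chrom, values in log_ratios_by_chrom.items():
        n = len(values)
        if n < 2:
            continue  # a variance estimate needs at least two genes
        mean = statistics.fmean(values)
        sem = statistics.stdev(values) / math.sqrt(n)
        if sem == 0:
            continue  # identical values give an undefined t statistic
        results[chrom] = (mean, (mean - genome_mean) / sem, n - 1)
    return results


# Toy data (made up): chr8 genes shifted upward, the others near zero.
toy = {
    "chr1": [0.02, -0.05, 0.01, 0.04, -0.03, 0.00],
    "chr8": [0.40, 0.35, 0.38, 0.41, 0.36, 0.37],
    "chr13": [-0.01, 0.03, -0.02, 0.02, 0.00, -0.04],
}
chrom_stats = chromosome_t_statistics(toy)
print({chrom: round(t, 1) for chrom, (_, t, _) in chrom_stats.items()})
```

For reference, a full extra copy expressed at the same level would give an expected ratio of 3/2; the toy chr8 values are centred near log2(1.3) only to echo the 1.3-fold figure reported in the text.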
MEF marker genes
Marker genes of feeder cells were identified using unpublished RNA-seq data provided by Dr Jafar Sharif and Dr Kyoichi Isono. Gene expression differences among ESCs, TSCs and MEFs were compared using the cuffdiff program, and genes that were significantly highly expressed in MEFs were selected. Genes encoding cytokines and extracellular matrix-related genes were selected to illustrate the features of feeder cells in Fig. 2E.
I would like to acknowledge Dr Akihiko Yoshimura, Keio University, who initially suggested on his website that NGS data could be evaluated by SNP analysis. Dr Ichiro Taniuchi and Dr Nyambayar Dashtsoodol, RIKEN-IMS, offered insightful information for interpreting the experimental procedures. Dr Norihito Hayatsu, RIKEN-IMS, provided unpublished inbred/outbred mice sequences to validate the genotype analysis. Dr Jafar Sharif and Dr Kyoichi Isono, RIKEN-IMS, provided unpublished transcriptome data to identify marker genes. Dr Shinichi Nakagawa, RIKEN, provided critical comments on my manuscript, and Mr David Gifford, RIKEN-CSRS, edited the text. I especially thank the authors of the retracted paper, Dr Teruhiko Wakayama, University of Yamanashi and Dr Hitoshi Niwa, RIKEN-CDB, for commenting on this manuscript.
Ben-David, U., Mayshar, Y. & Benvenisty, N. (2013) Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. Nat. Protoc. 8, 989–997.
Chang, G., Gao, S., Hou, X. et al. (2014) High-throughput sequencing reveals the disruption of methylation of imprinted gene in induced pluripotent stem cells. Cell Res. 24, 293–306.
DeVeale, B., van der Kooy, D. & Babak, T. (2012) Critical evaluation of imprinted gene expression by RNA-Seq: a new perspective. PLoS Genet. 8, e1002600.
Gropp, A. (1982) Value of an animal model for trisomy. Virchows Arch. A Pathol. Anat. Histol. 395, 117–131.
Hussein, S.M., Batada, N.N., Vuoristo, S. et al. (2011) Copy number variation and selection during reprogramming to pluripotency. Nature 471, 58–62.
Kim, Y.M., Lee, J., Xia, L., Mulvihill, J.J. & Li, S. (2013) Trisomy 8: a common finding in mouse embryonic stem (ES) cell lines. Mol. Cytogenet. 6, 3–7.
Lagarrigue, S., Martin, L., Hormozdiari, F., Roux, P.F., Pan, C., van Nas, A., Demeure, O., Cantor, R., Ghazalpour, A., Eskin, E. & Lusis, A.J. (2013) Analysis of allele-specific expression in mouse liver by RNA-Seq: a comparison with cis-eQTL identified using genetic linkage. Genetics 195, 1157–1166.
Liu, X., Wu, H., Loring, J., Hormuzdi, S., Disteche, C.M., Bornstein, P. & Jaenisch, R. (1997) Trisomy eight in ES cells is a common potential problem in gene targeting and interferes with germ line transmission. Dev. Dyn. 209, 85–91.
Mayshar, Y., Ben-David, U., Lavon, N., Biancotti, J.C., Yakir, B., Clark, A.T., Plath, K., Lowry, W.E. & Benvenisty, N. (2010) Identification and classification of chromosomal aberrations in human induced pluripotent stem cells. Cell Stem Cell 7, 521–531.
(Filename) gtc12178-sup-0001-FigS1.pdf
(Format) application/PDF
(Size) 265K
(Description) Figure S1 Allele distributions from the RNA-seq data obtained for the cell lines reported in the Obokata et al. study. CD45+ cells (gray), ESCs (yellow), STAP cells (blue), STAP-SCs (green), TSCs (orange), EpiSCs (light blue), and FI-SCs (red). The ESCs, STAP cells, STAP-SCs, FI-SCs, and epiblast stem cells (EpiSCs) were annotated as being derived from a 129B6F1 strain, and the TSCs as from a CD1 strain.
(Filename) gtc12178-sup-0002-FigS2.pdf
(Format) application/PDF
(Size) 214K
(Description) Figure S2 Allele frequencies of all chromosomes. SNPs on all autosomes and the X chromosome were counted, and their distributions are indicated. The RNA-seq data from the CD45+ and STAP cells are identical to those used in Fig. 3. Numbers after each chromosome name are those of applied SNPs of CD45+ rep1, CD45+ rep2, STAP rep1, and STAP rep2, respectively.
(Filename) gtc12178-sup-0003-TableS1.xlsx
(Format) application/msexcel
(Size) 11K
(Description) Table S1 RNA-seq raw data used in this study
Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.