Soybean (Glycine max) is one of the most important crops worldwide and it is providing oil and useful protein. Despite of many advantages, soybean has not been researched in genomic details than other model plants because of its complex genome structures, high haploid chromosome number (n=20), and large genome size. The 20 haploid chromosomes in soybean were almost doubled such as rice (n=12), maize (n=10) and Arabidopsis thaliana (n=5). After the soybean genome sequence of Williams 82 had been established (Schmutz et al., 2010), some papers have shown whole-genome resequencing studies of Glycine max and Glycine soja (Kim et al., 2010; Lam et al., 2010; Chung et al., 2014). The reference genome allows us to find more single nucleotide polymorphisms (SNP) (Zhu et al., 2003; Hyten et al., 2006; Lam et al., 2010, Chung et al., 2014), structural variation (SV), such as copy number variation (CNV), large insertions/deletions (InDels), inversions and translocations (McHale et al., 2012; Haun et al., 2011). Next generation sequencing (NGS) is also an important technology in the genomic studies and molecular breeding. In addition to its availability of the soybean genome sequence, NGS technologies and bioinformatics provides many opportunities to reveal genetic diversities such as SNP and InDel among soybean accessions. These technologies are valuable resources that have facilitated whole-genome genotyping, genome-wide association studies, predictions of gene functions, and analysis of genetic diversities in plants (Lam et al., 2010; Chung et al., 2014; Li et al., 2013).
The genome re-sequencing provides enormous number of new information onthe record of SNPs. Over 3.8 million SNPs were identified through high depth of re-sequencing of 10 cultivated and 6 wild accessions (Chung et al., 2014). Recently, SoySNP50k platform was developed, which comprise over 19,000 accessions (Song et al., 2013). Genome-wide identification of SNPs among soybean accessions was also reported and developed as markers for breeding (Zhu et al., 2003; Van et al., 2005). In contrast to SNPs which have been studied extensively, other forms of highly abundant genetic variation such as InDel, have received relatively less attention. Small InDels, a type of polymorphisms are common but functionally important type of polymorphism. Especially InDels in coding region cause single or more amino acid substitutions such as frameshift mutations and affect sequence conservations, structure and function of proteins. They are easily detected by based on mismatching in common beans (Zou et al., 2014). However many studies focus on single nucleotide variants (SNVs) or large structural variants (Wu et al., 2010; McHale et al., 2012; Lee et al., 2015). DNA polymorphisms positioned within coding regions are particularly important as it has a potential to change the protein function itself. Discovery of polymorphisms related to functional changes of genes is important in investigations on differences among various accessions and its main cause of diversions.
In this study, we have obtained RNA-sequencing data of 2 soybean accessions (Daewon and Hwangkeum) using NGS. The sequence data were analyzed to discover a large number of DNA polymorphisms, including single nucleotide variation (SNV), multiple nucleotide variation (MNV), insertion/deletion (InDel) and replacements. This data can be used in future studies on genotyping and marker development for molecular breeding.
MATERIALS & METHODS
Plant materials and RNA-sequencing
To detect polymorphisms, two Korean soybean accessions were analyzed in this study (Daewon and Hwangkeum). The soybean seeds were obtained from the National Institute of Crop Science, RDA, Suwon, Korea. The plants were grown until V5 stage at 2 5℃ under 16 h light / 8 h dark conditions. For all samples, total RNA was prepared from leaf tissues using an RNeasy Plant Mini Kit (Qiagen, Germany). The RNA quality was checked for integrity before performing the RNA sequencing process using a 2100 Bioanaylzer RNA Nanochip (Agilent Technologies). After a quality check, RNA sequencing using Illumina HiSeq platform was performed by Theragen, Inc. (Korea).
Alignments and variants detection
Raw reads from Illumina Hiseq 2000 sequencing platforms were imported to CLC Genomics Workbench 7.0 (CLC Bio) and read statistics were assessed using sequencing data quality control, followed by read trimming for quality. The raw read were trimmed with a quality score limit of 0.01. The CLC Genomics Workbench was used to map trimmed reads to the soybean genome sequence of Williams 82 from Phytozome v.9 (http://www.phytozome.net) used as the reference sequences. The mapping process followed defaults parameters, and global alignment setting. Consensus sequence were aligned to the reference sequence and analyzed for SNVs, MNVs, InDels and replacements. The variants were screened using the CLC Genomics Workbench 7.0 with the following parameters (minimum count: 3; minimum coverage: 10; minimum variants frequency: 35%).
RESULTS
RNA-Seq and assembly
A total of 78,994,236 reads were obtained from two accessions. Number of reads was mapped 27,534,208 and 51,460,028 in Daewon and Hwangkeum, respectively (Table 1). The total number of reads in Daewon was lower than the reads in Hwangkeum. After quality filtering and trimming, the 23,646,188 reads in Daewon (85.85%) were mapped to the reference genome (contained 56,044 genes and 177,294 transcripts), and 45,715,040 reads (88.84%) in Hwangkeum. The mapped reads uniquely covered 82.72% of the reference genome in Daewon and 86.23% reads were uniquely mapped in Hwangkeum.
Table 1.
RNA sequencing data assembled to soybean reference genome.
Discovery of variants on transcriptome
Using the default parameters implemented in CLC genomics workbench, we identified a total 89,955 genetic variants between the reference genome and the RNA sequence of two soybean accessions. The number of variants was 34,411, comprising 29,777 SNPs, 1,311 MNVs, 1,754 insertions, 1,458 deletions and 111 replacements in Daewon; 55,544, comprising 47,418 SNPs, 2,293 MNVs, 3,230 insertions, 2,396 deletions and 207 replacements in Hwangkeum (Table 2).
Table 2.
Variants against reference genome.
The distribution of DNA polymorphisms was detected across the 20 soybean genomes and 21 scaffolds. The largest number of variants was observed in chromosome 18 of both accessions. However the smallest number of variation was observed on different chromosome, chromosome 17 of Daewon and chromosome 11 of Hwangkeum (Table 3).
Table 3.
Variants against each chromosome of reference genome.
The highest SNP and MNV frequencies were found on chromosome 18 in both accessions, also on the other hand the lowest frequencies of SNP and MNV were found on different chromosomes. Other variants had different distributions in whole chromosomes. A relative high frequency of InDels was observed on chromosome 8 in both accessions, in comparison to other small number of InDel frequencies in different chromosomes. The 390 variants were found in location of the 21 scaffolds in the Williams 82 genome. The frequency of replacement showed to have a pretty even distribution among 20 chromosomes.
Annotation of variants
Some variants were detected in non-coding regions, introns and large repeat regions. These variants were excluded by amino acid change test with default parameter using CLC genomic workbench 7. After filtering out the variants, the remaining number of variants located in CDS regions was 13,225 nonsynonymous variants, comprising 11,522 SNVs, 665 MNVs, 481 insertions, 511 deletions, and 46 replacements in Hwangkeum and 9,611 non-synonymous variants were detected in Daewon that comprised 7,908 SNVs, 665 MNVs (Table 4), 481 insertions, 511 deletions and 46 replacements. The 4,290 genes and 5,672 genes were predicted to change protein functions through nonsynonymous variants in Daewon and Hwangkeum, respectively. Additionally, we observed that non-synonymous variants were not uniformly distributed within the chromosomes. The distributions of non-synonymous variants within coding regions and expression values of each accession are shown across the 20 soybean chromosomes (Fig. 1).

Fig. 1.
A circos plot showing differentially expressed genes and non-synonymous variants from RNA-sequencing. The outermost circle represents 20 soybean chromosomes (Chr 1-20) in different colours, and inner two circles represent gene expression value from RPKM (black) and distribution of variants (SNV and MNV: red, InDel: blue, Replacement: green) in Daewon (I) and Hwangkeum (II)DISCUSSION
This study provides non-synonymous variants within coding regions (CDS) of two Korean soybean accessions using RNA-seq. We analyzed the transcripts and obtained a total 19,430 nonsynonymous SNPs, 1,114 MNVs, 1,686 InDels and 75 replacements against Williams 82 genome as a reference. These non-synonymous variants and expression data will possibly provide information for markers development and functional genetics. The reads from each Korean soybean accession were mapped on to reference genome. The mapping ratio of reads was shown highly coverage in each accession, which varied from 85.82 % in Daewon to 88.84 % in Hwangkeum. This coverage is similar to the result of whole genome sequence in various soybean accession studies (Zou et al., 2014; Chung et al., 2014).
Short reads was used to map on reference genome in RNA-seq. The short reads were possible to assemble large part of DNA sequence, especially the heterochromatic regions around centromeres and other highly repetitive regions. The plant genomes contain large repetitive sequences that more over 35% regions of genome are observed in rice and Arabidopsis (Arabidopsis Genome Initiative 2000; International Rice Genome sequencing project 2005). Soybean genome also has 57% heterochromatic and large repetitive regions in its genome (Schmutz et al., 2010). To selectively choose the non-synonymous variants on CDS, these variants within repetitive regions were removed according to other genotyping studies in soybean (Wu et al., 2010; Zou et al., 2014). After the basic variants were detected with default parameter, we classified the variants into two types, homozygous and heterozygous. A number of variants was 34,411 comprising 22,586 homozygous types and 9,831 heterozygous types in Daewon. The 35,849 homozygous variants and 19,695 heterozygous variants were shown in Hwangkeum. Because soybeans have been reproduced by self-pollination, the detected variants were expected to be predominantly homozygous variants. But the homozygous variants were found only 65% in both accessions. According to Pinson and Rutger (1993), genetic doubling event, fusion of genotypically different cells in chimeric callus, and abnormal meiosis that resulted in heterozygous diploid microspores could occur.
We also found that variants present in 4,290 genes and 5,672 genes in Daewon and Hwangkeum, respectively are likely to have a major impact on gene function. Similar results were found in soybean genome according to resequencing of wild and cultivated soybeans (4,600 genes, approximately) (Lam et al., 2010; Chung et al., 2014). The variants were classified as nonsynonymous and synonymous by amino acid change from CLC genomics workbench 7. The distribution of non-synonymous variants was similar to genome-wide variants. The SNVs and MNV were observed with highest frequency in chromosome 18 of both accessions. But the other variants had different distributions in whole chromosomes. The ratio of non-synonymous to synonymous SNP was 1.45 and 1.39 in Daewon and Hwangkeum, respectively (Table 4). This ratio is similar to ratio shown in soybean (1.40) which is higher than those in Arabidopsis, maize, and sorghum and rice (Clark et al., 2007; Hufford et al., 2012; Lam et al., 2010; McNally et al., 2009; Nelson et al., 2011).
Except partial regions of chromosome 4, 9 and 18, both accessions showed similar patterns of gene expression and non-synonymous polymorphisms against reference genome. In comparison to Hwangkeum, Daewon had lower variants density regions in chromosome 4. Hwangkeum had lower density of variants in chromosome 9 and 18. Large regions of repetitive centromeric and pericentromeric regions in each chromosome had low expression genes and non-synonymously variants. Found in other studies, centromeric regions showed low expressions of genes, in contrast with telomere regions with highly gene expression (Lee et al., 2013). The distribution of variants corresponds with non-centromeric regions, which is similar to the result from SNP genotyping research (Lee et al., 2015) (Fig. 1).
In conclusion, this study reveals large number of polymorphism and non-synonymous variants via RNA-seq of two Korean soybean accessions. The variants included SNVs, MNVs, InDels and replacements were detected in genome and coding regions. The genome-wide variants and non-synonymous variants will serve as valuable resources for genetic and genomic study. In addition, expression profiles and non-synonymous variants within coding regions can be used for identification of agronomic traits with analysis of functional relevance. We believe that this information will be help to find genes associated with useful agronomic trait in future studies.


