High-throughput muscle fiber typing from RNA sequencing data

Skeletal muscle fiber type distribution has implications for human health, muscle function, and performance. This knowledge has been gathered using labor-intensive and costly methodology that limited these studies. Here, we present a method based on muscle tissue RNA sequencing data (totRNAseq) to estimate the distribution of skeletal muscle fiber types from frozen human samples, allowing for a larger number of individuals to be tested. By using single-nuclei RNA sequencing (snRNAseq) data as a reference, cluster expression signatures were produced by averaging gene expression of cluster gene markers and then applying these to totRNAseq data and inferring muscle fiber nuclei type via linear matrix decomposition. This estimate was then compared with fiber type distribution measured by ATPase staining or myosin heavy chain protein isoform distribution of 62 muscle samples in two independent cohorts (n = 39 and 22). The correlation between the sequencing-based method and the other two were rATPas = 0.44 [0.13–0.67], [95% CI], and rmyosin = 0.83 [0.61–0.93], with p = 5.70 × 10–3 and 2.00 × 10–6, respectively. The deconvolution inference of fiber type composition was accurate even for very low totRNAseq sequencing depths, i.e., down to an average of ~ 10,000 paired-end reads. This new method (https://github.com/OlaHanssonLab/PredictFiberType) consequently allows for measurement of fiber type distribution of a larger number of samples using totRNAseq in a cost and labor-efficient way. It is now feasible to study the association between fiber type distribution and e.g. health outcomes in large well-powered studies.


Introduction
Our bodies constitute to ~ 30-40% of the skeletal muscle, and it is the most abundant form of the three types of muscle, the others being smooth and cardiac. The skeletal muscle is composed of different fiber types (i.e., muscle cell types), and the relative proportions of these types vary among the muscles, locations within the muscles, individuals, and the sex of individuals [1][2][3][4]. The oxidative and glycolytic potential and the contractile properties differ considerably between fiber types, with the mitochondria-rich slow-twitch fibers (type I) having higher oxidative capacity, and fast-twitch fibers (type IIa and type IIx) having higher glycolytic capacity [1]. The proportions also change as people age, with type II fibers being preferentially affected by sarcopenia [5]. Exercising the skeletal muscle is a major site for catabolic metabolism of the blood glucose and lipids and the metabolic characteristics of this tissue influence both the Open Access *Correspondence: Ola.Hansson@med.lu.se performance of elite athletes [6,7] and an individual's predisposition to disease, e.g., impairments in glucose and lipid transportation into myocytes can promote diabetes and atherosclerotic vascular disease [8][9][10]. Although environmental factors, such a training or sedentary behavior and aging, lead to adaptive alteration in capillary density and fiber composition (an increase in type IIa vs type IIx) in the skeletal muscle, these features are partially under genetic control [11,12]. However, the extent to which transition of type I to type II fibers and vice-versa occurs remains uncertain [6]. The muscle tissues are usually classified according to the predominant myosin heavy chain (MyHC) isoforms, but this is a simplified classification that disregards the large number of other proteins expressed in the skeletal muscle [2,13]. It is thus not straightforward to estimate and compare muscle fiber types in humans, e.g., owing to large heterogeneity and limitations of the methodology, and it is even more challenging in frozen, postmortem samples.
In addition to cells from differing types of muscle fibers, a biopsy sample also contain cells such as fibroblasts, endothelial cells, adipocytes, smooth muscle cells, and neuron-associated Schwann cells. Gene expression analysis of the skeletal muscle is thus rendered difficult by the complexity of this tissue. Although sequencing of transcriptomes from muscle biopsies may still provide a perspective on functional differences between individuals, targeted RNA sequencing from isolated cells provides the opportunity to reveal differences specific to the different cell populations [14]. Previous studies investigating the skeletal muscle have applied single-cell RNA sequencing (scRNAseq) to extracted mono-nucleated cells [15][16][17][18][19][20]. These studies have for example described a complex landscape of different cell types, e.g., two different populations of muscle progenitor cells [15], provided detailed knowledge concerning muscle regeneration [19] and muscle disease etiology [20]. A challenge for singlecell genomic studies in the muscle and other solid tissues is to capture cell types that are difficult to isolate in suspension [21], e.g., muscle fibers have not been sequenced in the abovementioned scRNAseq studies. However, a few studies have isolated and sequenced single nuclei including the nuclei from poly-nucleated primary muscle fibers in mice [22][23][24] and e.g. found that myofiber types predominantly express either slow or one of the fast isoforms of MyHC proteins, while only a small proportion of hybrid fibers can express more than one MyHC [22]. However, no study has investigated poly-nucleated primary muscle fibers from humans.

Results
Here, we present a method to estimate the proportion of skeletal muscle fiber types using only muscle tissue RNA sequencing (totRNAseq) data that could be used on frozen samples. The method is based on snRNAseq information of one human individual and then evaluated in two independent larger totRNAseq data sets of the human skeletal muscle (the Muscle SATellite cell study, MSAT, n = 39 and the Juntendo Muscle Study, JMS, n = 23).

A novel muscle fiber type prediction model derived from single-nuclei RNA sequencing
After filtering and quality control (see the "Material and methods" for details), the snRNAseq data set consisted of 2699 nuclei, with 22,084 expressed genes. On average, ~ 500 genes were found per nuclei ( Supplementary  Fig. 1). Five major clusters of nuclei were identified using graph-based clustering built on Louvain modularity optimization [25]. For visualization of the nuclei populations, t-distributed stochastic neighbor embedding (tSNE) non-linear dimension reduction was applied (Fig. 1a). A separation of type I (cluster B) and type II (cluster A) fiber nuclei is clearly observed, i.e., gene markers of type I and type II fibers (e.g., ATP2A2, TPM3, MYH7B versus ATP2A1, MYBPC2, MYH2) display a distinct expression pattern in different clusters (Fig. 1b). Cluster D is likely representing endothelial cells, i.e., enriched for gene markers like LDB2, VWF, BTNL9, and FLT1. The identity of clusters C and E is less clear but could possibly represent fibroblasts respectively pericytes. The complete list of 48 gene markers for the five clusters are shown in Fig. 1c.

Muscle fiber typing using total RNA sequencing from frozen samples
To evaluate the efficacy of estimating proportions of type I versus type II fiber nuclei in muscle samples from totRNAseq data, a deconvolution analysis [26] was performed in a data set consisting of 39 human subjects, i.e., the MSAT study (Table 1). Briefly, by using the snR-NAseq data as a reference, cluster expression signatures were produced by averaging gene expression of cluster gene markers and then applying these to the totRNA MSAT data set by inferring muscle fiber nuclei type via linear matrix decomposition [26] and then finally compare this estimate with the fiber type proportions measured by ATPase staining of the same muscle samples. The correlation between the estimated proportions of muscle fiber nuclei types from totRNAseq data and muscle fiber types from ATPase staining was r = 0.44 [0.13-0.67], [95% CI] at p spearman = 5.70 × 10 -3 , n = 39 (Fig. 2a). The same deconvolution analysis was performed on a second dataset consisting of 23 human subjects from Japan, i.e., the JMS ( Table 2). The correlation between the estimated proportions of muscle fiber nuclei types from totRNAseq data and muscle fiber types measured by MyHC protein isoform distribution in the JMS was r = 0.83 [0.61-0.93], [95% CI] at p spearman = 2.00 × 10 -6 , n = 22 (Fig. 2b). One individual was excluded due to low sequencing quality. To further validate the method skeletal muscle totRNAseq data was downloaded from the Genotype-Tissue Expression project (n = 569) and tested for differences in fiber type distribution between men and women. As expected, we observed a larger proportion of type I fibers in women compared to men (68% [54-76%] versus 56% [39-71%], median [95% CI], p Mann-Whitney = 3.1 × 10 -7 , n women = 171, and n men = 398, Fig. 2c).
To test the possibility of implementing this method for fiber typing of a large number of samples, we estimated the needed minimal sequencing depth of totRNAseq data that accurately would infer skeletal muscle fiber type composition. An increasing number of randomly selected fractions of type I fibers were calculated. The accuracy of deconvolution inference of fiber type composition was similar to the 35 million PE reads level even at very low sequencing depths, i.e., down to an of average ~ 10,000 PE reads ( Fig. 2d and Supplementary Fig. 2).

Discussion
Here, we present a method to estimate the proportion of skeletal muscle fiber types from frozen samples allowing for a larger number of samples to be measured in a standardized, cost, and labor-efficient way. Skeletal muscle fiber type distribution is a determinant of physical performance [27][28][29][30] and overall health [31][32][33][34][35] and is highly heritable in humans [11,12]. For example, a reduced proportion of oxidative slow-twitch type I fibers is associated with lower insulin sensitivity in the diabetic muscle [31][32][33][34] and muscle atrophy, e.g., agerelated sarcopenia is progressing in a fiber type-specific manner [35]. Recently, it has also been shown that skeletal muscle response and recovery from exercise training is dependent on fiber type composition and is thus an important factor to consider in the development of individualized training advice [36][37][38]. However, many of the abovementioned associations with fiber type distribution are based on low-powered studies. One of the limiting factors is that the methodology for determining fiber type distribution is labor-intensive and thus relatively expensive. The method presented here allows for a larger number of samples to be analyzed using totRNAseq in a standardized and relatively fast way. With this new method, the distribution of type I versus type II fibers can be estimated, but it cannot distinguish between type IIa and type IIx fibers. However, this is achievable with only a small amount of muscle tissue (~ 5-10 mg), e.g., from sampling with the minimally invasive microbiopsy technique [39]. The deconvolution inference of fiber type composition was accurate even for very low sequencing depths, i.e., down to an average of ~ 10,000 PE reads. This means that a shallow-coverage totRNAseq experiment (or targeted RNAseq) will be sufficient to accurately estimate skeletal muscle fiber type composition at a low cost per sample (< 1 dollar/sample). This new method consequently allows for the measurement of fiber type distribution of a larger number of samples and makes it feasible to study the connection between fiber type distribution and health with wellpowered studies. It can also be used for estimating fiber type distribution in public repositories of totRNAseq data (e.g., Genotype-Tissue Expression project) and perform in silico analyses of fiber type associations. In conclusion, totRNAseq can efficiently be used to estimate skeletal muscle fiber type distributions of frozen samples.

Materials and methods snRNAseq data generation
For nuclei isolation from frozen tissue, all following steps were performed on ice with precooled buffers and centrifugation steps were performed at 4 °C. The tissue was disrupted and nuclei liberated through dounce homogenization in ice-cold homogenization buffer (0.32 M sucrose, 3 mM CaCl2, 2 mM magnesium acetate, 0.1 mM EDTA, 1 mM DTT, 10 mg/ml BSA, 10 mM Tris-HCL, protease inhibitors (Sigma-Aldrich)) in the presence of 0.1% NP40. The nuclei suspensions were sequentially passed through 40-, 30-, and 20-μm cell filters (Miltenyi Biotec) and centrifuged at 1000 g for 10 min. The pellet was resuspended in 1% RNase-free BSA in PBS and stained using DAPI (1:1000, BD Pharmingen). Intact single nuclei were sorted in bulk using the DAPI-positive event population, at single-cell sort precision and using a 100-µm nozzle (BD FACS AriaIII) into PBS/1% BSA. Nuclei were counted and loaded on a 10 × chromium microfluidic chip, aiming for the maximum possible number of nuclei to be targeted obtained from the sorting. Single-nucleus experiments were performed using the 10 × genomics single cell 3′ kit v.2 to encapsulate nuclei along with barcode tagged beads, generate, and amplify cDNA and to generate sequencing libraries. Each pooled library was barcoded using i7 barcodes provided by 10 × Genomics. cDNA and sequencing library quality and quantity were determined using Agilent's High Sensitivity DNA Assay. Libraries were pooled and sequenced in 150-bp paired-end mode on Illumina's HiSeq platform.

snRNAseq data processing and analysis
Post-processing pipeline Cell Ranger (https:// suppo rt. 10xge nomics. com/ single-cell-gene-expre ssion/ softw are/) provided by 10X Genomics was used for demultiplexing, alignment, filtering, barcode, and unique molecular identifier (UMI) counting. The pipeline produced files in FASTQ and BAM formats, as well as the matrix of UMI counts. We used Seurat workflow for further quality control and downstream analysis of the snRNAseq gene expression data. The initial data set contained 2937 nuclei and 22,336 expressed genes. Nuclei with a high fraction of their counts coming from mitochondrial and ribosomal genes were removed. Next, genes with at least one UMI count present in at least one nucleus were selected. After these steps, 2699 nuclei and 22,084 genes were included in the downstream analysis. We normalized the gene expression data with the LogNormalize method of Seurat and standardized the count values prior to performing the Principal Component Analysis (Supplementary Fig. 3). JackStraw procedure [40] was applied as a denoising step in order to select an optimal number of Principal Components (PCs), indicating that 3 PCs to keep for further downstream analysis, which can be viewed as identifying the true intrinsic dimensionality of the snRNAseq data. Further, the number of cells predicted to be proliferative was investigated using a list of the genes annotated as functioning in the cell cycle according to [41]. The vast majority of cells were not detected to be proliferating and the cycling cells did not form any separate cluster ( Supplementary Fig. 4a-b). Graph-based clustering based on Louvain modularity optimization [25] with resolution parameter equal to 0.1 was used for detecting boundaries between different populations of nuclei, and tSNE non-linear dimension reduction with perplexity 50, that was chosen as a square root of the number of nuclei according to the k-nearest neighbors rule of thumb, was applied for visualization of the nuclei populations (Fig. 1a). The snRNAseq data was uploaded to Gene Expression Omnibus (GEO) and is available under accession number GSE190489.

Deconvolution analysis
Deconvolution analysis [26] was performed using the snRNAseq data as a reference. For this purpose, the snRNAseq data was used to produce cluster signatures, which are marker gene expressions averaged across cells from each cluster. After that, the gene expression of each type of the skeletal muscle fibers in a bulk RNAseq sample was inferred via linear matrix decomposition [42]. The deconvolution analysis was performed using Decon-RNASeq R/Bioconductor package, and the results are presented in Fig. 2a, b.

Study subjects
The MSAT cohort: 39 Swedish male subjects (Table 1) were enrolled in the study by advertising on social media and through local cycling clubs. Inclusion criteria were as follows: (1) male, (2) healthy, not on any medications, and (3) age range between 20 and 55. Subjects were given both oral and written information about the experimental procedures before giving their written informed consent. Each participant went through three visits at different time points. All subjects completed all three visits. The first visit involved a regular doctor's examination with blood samples and measuring anthropometric characteristics. The second visit consisted, after an overnight fast, of a Wingate test followed by muscle biopsy and VO 2max was measured during the third and last visit. The study was approved by the local Ethics committee, Lund University (Dnr 2015/593). For determination of peak anaerobic power (Wingate) and VO 2max , subjects were instructed to perform only easy training 48 h prior to each test. To determine peak anaerobic power, a 30-s all-out Wingate test [43] was conducted on a cycle ergometer (Monark Peak power). Before the test, a 5-min low intensity ~ 150w warm-up, with instructions to perform a 5 s high cadence drill each minute was performed. The test started with the subject pedaling as fast as possible. When a cadence of 120 rpm was reached, a braking resistance equivalent to 0.7 N × m × kg −1 was applied to the freewheel and remained constant during the 30 s. Subjects were instructed to sit down throughout the test. Strong verbal encouragement was given throughout to ensure a maximal effort was provided. An incremental test to exhaustion was performed to determine VO 2max . The test started with 3 min of cycling at 3 W × kg −1 (rounded down to nearest 10 W) and then increased by 35 W every 2 min until voluntary exhaustion or failure to maintain ≥ 60 rpm. Strong verbal encouragement was given throughout. VO 2 was measured using an Oxycon Pro (Jaeger GmbH, Germany) with a mixing chamber and a 30 s sampling time. Gas sensors were calibrated according to instructions by the vendor before every test.
Maximal oxygen uptake was determined as the mean of the two highest values attained during exercise from any 30-s period. The Juntendo cohort: 23 Japanese subjects ( Table 2) were recruited to examine the associations between RNA expression profiles and muscle fiber composition in the Japanese population. All subjects gave their signed informed consent before inclusion in the study. The study protocols were approved by the Ethics Committees of Juntendo University and were performed in accordance with the Declaration of Helsinki. No additional tests were performed in this cohort.

Muscle biopsies and histology
In the MSAT cohort (Table 1), muscle biopsies were taken from the vastus lateralis muscle under sterile conditions and local anesthesia (1% lidocaine) by using a 5-mm Bergström needle and frozen in liquid nitrogen. The biopsies were taken within 5 min after the Wingate test. Serial Sects. (10 μm) were cut using a cryostat at − 20 °C. Myofibrillar ATPase histochemistry was performed by preincubation at pH 4.4, 4.6, and 10.3 to identify fiber types [44]; the proportion of fiber types (i.e., type I, IIa, or IIx) were calculated as the number of each fiber type, divided by the total number of fibers in the section. Computer image analysis was performed using image analysis equipment (BioPix IQ 2.0.16 software, BioPix AB, Sweden).
In the Juntendo cohort (Table 2), muscle biopsies were taken from the vastus lateralis muscle under sterile conditions and local anesthesia (1% lidocaine) by using a disposal needle biopsy instrument (Max Core; C. R. Bard, Covington, GA). The biopsies were collected from approximately 15 cm above the patella in both legs of each subject under ultrasound imaging (Noblus; Aloka, Tokyo, Japan) and avoided the inclusion of subcutaneous fat and the subfascial and myotendinous parts as far as possible. In addition, any visible non-muscle tissues (e.g., adipose tissue) were removed from the biopsy samples. Samples were frozen immediately in liquid nitrogen and stored at − 80℃ until further analysis. Myosin heavy chain (MyHC) protein isoforms were assessed as markers of muscle fiber composition. Frozen muscle samples were homogenized in ice-cold lysis buffer [50 mM HEPES (pH 7.4), 10 mM EDTA, 4 mM EGTA, 50 mM β-glycerophosphate, 25 mM NaF, 5 mM Na 3 VO 4 ], containing a phosphatase inhibitor (PhosSTOP tablet; Roche Diagnostics, Indianapolis, IN), and a protease inhibitor (Complete tablet; Roche Diagnostics). The lysates obtained were centrifuged at 10,000 g for 10 min at 4℃. An insoluble pellet, obtained after centrifugation, was suspended in a sufficient volume of SDS sample buffer [30% glycerol, 5% β-mercaptoethanol, 2.3% SDS, 0.05% bromophenol blue, and 62.5 mM Tris-HCl (pH 6.8)] and boiled at 95℃ for 5 min. MyHC composition was determined by glycerol SDS-PAGE, according to Kakigi et al. [45]. Briefly, protein samples were resolved by performing glycerol SDS-PAGE [stacking gel: 4% acrylamide, 34.7% glycerol, and 125 mM Tris-HCl (pH 6.8); separating gel: 8% acrylamide, 33.3% glycerol, and 375 mM Tris-HCl (pH 8.3)]. Electrophoresis was started at 60 V with stacking gel at 8℃. The voltage was set to 150 V and run for 18 h at 8℃ when the tracking dye had entered the separating gel completely. After separation, the gels were stained with Coomassie brilliant blue (Biosafe G250; Bio-Rad Laboratories, Hercules, CA) and rinsed repeatedly with water. Each gel was scanned using a calibrated densitometer (ChemiDoc Touch Imaging System; Bio-Rad Laboratories), and the relative proportion of MyHC-I, MyHC-IIa, and MyHC-IIx were determined using the calibrated densitometer (ChemiDoc Touch Imaging System) and analytical software (Image Laboratory software version 5.2.1; Bio-Rad Laboratories).

RNA extraction and totRNAseq analysis
In the MSAT cohort, RNA was extracted from 25 to 30 mg of muscle biopsies using a TissueLyser II (Qiagen) and the miRneasy Mini Kit (Qiagen). The RNA concentration was determined using a NanoDrop ND-1000 spectrophotometer (A260/A280 > 1.8 and A260/A230 > 1.0 (NanoDrop Technologies, Wilmington, DE, USA). RNA integrity was verified using the 2200 TapeStation instrument (Agilent Technologies, CA, USA), where all samples had an average RNA integrity number (RIN) above 8. In the Juntendo cohort, frozen muscle samples were crushed with 5.0-mm zirconia beads using a Micro Smash MS-100R (Tomy Seiko, Japan) at 3000 rpm twice for 15 s at 2 °C. The total RNA was extracted from muscle samples using TRIzol ® Reagent (Thermo Fisher Scientific, Waltham, MA, USA) according to the manufacturer's protocol. RNA concentration and purity were checked using a NanoDrop 8000 UV-Vis Spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA). RNA integrity was verified using the 2200 TapeStation instrument (Agilent Technologies, CA, US), where all samples had an average RNA integrity number (RIN) above 8.
All samples from both the MSAT and Juntendo cohorts were sequenced at Lund University using 800 ng input RNA. Library preparation was made using the TruSeq Stranded Total RNA Library Prep Kit with Ribo-Zero Human/Mouse/Rat Set A (Illumina), and the 75 bp paired-end sequencing was performed on a NextSeq instrument using the NextSeq ® 500/550 High Output Kit v2 (150 cycles) (Illumina). The sequencing quality was checked with fastQC v0.11.9 (http:// www. bioin forma tics. babra ham. ac. uk/ proje cts/ fastqc) and multiQC [46] v1.9. Gene expression was assessed using Salmon [47] v1.2.1. Exon expression was obtained by mapping reads with STAR [48] v2.7.6 and counting with featureCounts [49] v2.0.1 (options: -p -f -C -O). We used the GRCh38 Ensembl v77 [50] as a reference genome. We used the Seqtk tool (https:// github. com/ lh3/ seqtk) for gradual random downsampling of the MSAT data and applied Salmon for quantifying gene expression of the downsampled RNAseq data. The quantified gene expression for each downsampling iteration was normalized with TMM [51], and deconvolution analysis was performed using the gene markers identified for slow-and fast-twitch human clusters in the snRNAseq data as described above. We computed the Spearman correlation coefficient and mean square deviation between predicted and true fiber type composition for each downsampling iteration.
Additional file 1: Figure S1. Seurat snRNAseq workflow quality control statistics. From left to right, first figure reports the number of expressed genes per cell, second figure reports the number of UMIs (library size) per cell, third figure shows the fraction of reads mapped to mitochondrial genes per cell. Figure S2. The initial average sequencing depth of 35 million paired-end (PE) reads were ´down-sampled´ and mean square errors between the ATPase and totRNAseq predicted fractions of Type I fibers were calculated. Figure S3. Principal Component Analysis (PCA) plot of gene exppression from human skeletal muscle snRNAseq colored by the cluster annotation, see Figure 1A, obtained from the Louvain clustering on the number of significant principal components (according to Seurat workflow). Figure S4.