To infer orthology relationships between genes in the S. lacustris de-novo transcriptome (see below) with genes of other animals, we generated a comprehensive phylome containing trees for all genes in the Spongilla transcriptome. Briefly, each inferred S. lacustris protein was used as a seed to construct a gene/protein tree. For each seed, identification of the 250 closest homologs in 53 proteomes from other metazoan and unicellular relatives (listed in Data S1) was performed using BLAST with e-value and query coverage cutoffs of E10-3 and 33%, respectively. Our choice of 250 homologs as the inclusion threshold came from our observation that this nearly always included the immediate orthology group, as well as several related paralogs that could serve as outgroups for orthology inference. It is important to note that our goal for the phylome pipeline was strictly to provide rigorous orthology inference for a given protein, and not to reconstruct the entire tree for large gene families. Phylogenetic workflow was executed using the workflow template named “clustalo_default-trimal001-none-raxml_default” from the ETE Toolkit v3.1.1, which consisted of multiple sequence alignment reconstruction using clustalOmega v1.2.1 with default parameters, Trimal for alignment cleaning by removing columns with more than 1% gaps, and Maximum Likelihood tree inference using RAXML v8.1.20 with default parameters.
All the resulting trees were rooted to the farthest leaf to each seed sequence using the function ‘get_farthest_oldest_leaf’ from the ETE Toolkit and using the NCBI Taxonomy tree as a reference to established relative dating of the species included. Rooted trees were automatically processed using ETE’s `get_descendant_evol_events` function to infer speciation and duplication nodes. Speciation nodes were used to infer fine-grained orthology relationships (one-to-one, one-to-many, many-to-many) between seed sequences and other species.
Spongillade novo transcriptome construction and refinement
RNA was extracted from juvenile S. lacustris using Trizol (Thermofisher #15596026), and the resulting extracts were quantified using a Nanodrop 2100 and quality controlled on a Bioanalyzer 2100. Using strand-specific sequencing, we conducted 100PE and 125PE on a HiSeq 1500 using an RNA extract with RIN 10, generating 10,459,248 and 12,198,962 read pairs for the two samples, respectively.
Bulk RNA-seq reads were adapter trimmed, quality filtered, and assembled using Trinity, with the in-silico normalization option selected and other parameters set as default. This resulted in a transcriptome N50 of 905 base pairs. A BUSCO search using the metazoa dataset indicated that the reference transcriptome was fairly complete (95.2% BUSCOs found). We identified putative proteins using Transdecoder (version 3.0.1), requiring a minimum open reading frame length of 100 amino acids.
We next refined the assignment of transcripts to genes in our de-novo transcriptome using information derived from our phylome. We identified S. lacustris genes whose transcripts and proteins did not align to each other but mapped as one-to-one orthologs to the same gene in the closely related freshwater demosponge Ephydatia muelleri. Visual inspection of many individual cases suggested these S. lacustris Trinity genes were fragments (e.g., 5‘ and 3‘) of the same gene. In a few cases, these also represented genes that were incorrectly fused in the Ephydatia muelleri transcriptome, although this represented only a small minority of cases (<5%). Based on this, we grouped those trinity genes that we inferred to represent fragments of the same gene into a single merged gene. The final merged gene set was then used to construct a GTF file for mapping our single-cell 3‘ RNAseq data and quantifying UMI counts for each gene. The assignment of trinity genes to their final merged gene IDs, which are used in the single-cell expression matrix, is available in the file “spongilla_merged_genes_and_names.tsv” on the project’s GitLab repository. We also provide full Spongilla transcriptome and proteome fasta files in the repository, in which sequence headers include the protein/transcript ID, automated gene name, and manually-curated gene name.
In total, our refined transcriptome contained 39,562 genes. Of these we identified 14,197 protein-coding genes containing an open reading frame of at least 100 amino acids. Our automated annotation pipeline assigned names to 9,809 protein-coding genes.