API Documentation¶
- class Topyfic.train.Train(name, k, n_runs=100, random_state_range=None)[source]¶
A class used to train a reproducible latent Dirichlet allocation (rLDA) model
- Parameters
name (str) – name of the Train class
k (int) – number of topics to learn in each LDA model, using the sklearn package
n_runs (int) – number of LDA runs used to define the rLDA model (default: 100)
random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))
top_models (list of TopModel) – list of TopModel objects storing all LDA models
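The defaults above imply one random state per LDA run. A minimal sketch of how that default could be resolved follows; `resolve_random_states` is a hypothetical helper written for illustration, not a Topyfic function, and the length check is an assumption rather than documented behaviour.

```python
def resolve_random_states(n_runs=100, random_state_range=None):
    """Return one random state per LDA run (hypothetical helper, not part of Topyfic)."""
    if random_state_range is None:
        # Documented default: range(n_runs)
        random_state_range = range(n_runs)
    states = list(random_state_range)
    # Assumption: each run needs exactly one random state.
    if len(states) != n_runs:
        raise ValueError(
            f"expected {n_runs} random states, got {len(states)}"
        )
    return states


print(resolve_random_states(n_runs=3))
print(resolve_random_states(n_runs=2, random_state_range=[7, 11]))
```

Passing an explicit list makes a set of runs repeatable independently of `n_runs`, which is presumably why the parameter is exposed.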
- combine_LDA_models(data, single_trains=[])[source]¶
combine the top models from a list of single Train objects
- Parameters
data (anndata) – data you used to learn model
single_trains (list) – list of single train object
- make_LDA_models_attributes()[source]¶
build the attributes needed to define an LDA model (sklearn.decomposition.LatentDirichletAllocation) by combining the attributes of all single LDA models
- Returns
three dataframes: the first gathers all components from all LDA runs, the second gathers exp_dirichlet_component from all LDA runs, and the last combines the remaining LDA attributes
- Return type
pandas dataframe, pandas dataframe, pandas dataframe
- make_single_LDA_model(data, random_state, name, learning_method, batch_size, max_iter, n_jobs, kwargs)[source]¶
train a single LDA model using the sklearn package and embed it in a TopModel class
- Parameters
name (str) – name of LDA model
data (anndata) – processed expression data along with cells and genes/region information
random_state (int) – Pass an int for reproducible results across multiple function calls
max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)
batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)
learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default=’online’)
n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)
- Returns
LDA model embedded in TopModel class
- Return type
TopModel
- run_LDA_models(data, learning_method='online', batch_size=1000, max_iter=10, n_jobs=None, n_thread=1, **kwargs)[source]¶
train LDA models
- Parameters
max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)
batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)
learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default=’online’)
data (anndata) – expression data embedded in anndata format use to train LDA model
n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)
n_thread (int) – number of threads used to learn the LDA models (default=1)
**kwargs – other parameters of sklearn.decomposition.LatentDirichletAllocation (more info: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- Returns
None
- Return type
None
- save_train(name=None, save_path='', file_format='pickle')[source]¶
save Train class as a pickle/HDF5 file
- Parameters
name (str) – name of the file (default: train_Train.name)
save_path (str) – directory in which to save the file (default: next to the script)
file_format (str) – format of the file you want to save (options: pickle (default), HDF5)
- class Topyfic.topic.Topic(topic_id, topic_name=None, topic_gene_weights=None, gene_information=None, topic_information=None)[source]¶
A class that stores a topic along with other useful information
- Parameters
topic_id (str) – unique ID of the topic
topic_name (str) – name of the topic (default: topic_id)
topic_gene_weights (pandas dataframe) – dataframe that contains the weight of each gene in the topic
gene_information (pandas dataframe) – dataframe that contains information about the genes, e.g. gene biotype
topic_information (pandas dataframe) – dataframe that contains information about the topic, e.g. cell state / cell type
- GSEA(gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]¶
Perform Gene Set Enrichment Analysis based on the topic weights using the GSEAPY package.
- Parameters
gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)
plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: GSEA)
kwargs – Argument to pass to gseapy.prerank(). more info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank
- Returns
dataframe with these columns: Term: gene set name; ES: enrichment score; NES: normalized enrichment score; NOM p-val: nominal p-value (from the null distribution of the gene set); FDR q-val: FDR q-value (adjusted false discovery rate); FWER p-val: family-wise error rate p-value; Tag %: percent of gene set before running enrichment peak (ES); Gene %: percent of gene list before running enrichment peak (ES); Lead_genes: leading edge genes (gene hits before running enrichment peak)
- Return type
pandas dataframe
- functional_enrichment_analysis(type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]¶
Perform functional enrichment analysis, including GO, KEGG and REACTOME
- Parameters
type (str) – type of database, which should be one of “GO” or “REACTOME”
organism (str) – name of the organism you want to do functional enrichment analysis for
sets (str, list, tuple) – Enrichr Library name(s) (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats); only needs to be filled if the type is GO
p_value (float) – Defines the pValue threshold. (default: 0.05)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: functional_enrichment_analysis)
- gene_weight_variance(save=True)[source]¶
calculate the gene weight variance
- Parameters
save (bool) – indicate if you want to add the variance to the Topic information (default: True)
- Returns
Gene weight variance for given topic
- Return type
float
- update_gene_information(gene_information)[source]¶
update or add gene information for the topic
- Parameters
gene_information (pandas dataframe) – dataframe containing the gene information we would like to add/update (its index should match the index of gene_information in the class)
- write_topic_yaml(topic_id=None, model_yaml_path='model.yaml', topic_yaml_path='topic.yaml', save=True)[source]¶
write topic in YAML format
- Parameters
topic_id (str) – unique topic ID (default is topic ID)
model_yaml_path (str) – model yaml path that has information about the dataset you use
topic_yaml_path (str) – path that you use to save topic
save (bool) – indicate if you want to save the yaml file (True) or just show it (False) (default: True)
- class Topyfic.topModel.TopModel(name, N, topics=None, gene_weights=None, gene_information=None, model=None)[source]¶
A class that stores a model
- Parameters
name (str) – name of class
N (int) – number of topics
gene_weights (pandas dataframe) – dataframe holding the weight of each gene in each topic; genes are indexes and topics are columns
topics (Dictionary of Topics) – dictionary containing all topics of the TopModel
model (sklearn.decomposition.LatentDirichletAllocation) – the stored reproducible LDA model
- MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]¶
plot MA based on the gene weights on given topics
- Parameters
topic1 (str) – first topic to be compared
topic2 (str) – second topic to be compared
size (pandas dataframe) – table containing the dot size for each gene (genes are the index)
pseudocount (float) – pseudocount that you want to add (default: 1)
threshold (float) – threshold to filter genes based on A values (default: 1)
cutoff (float) – cutoff for categorizing genes by modified z-score (default: 2)
consistency_correction (float) – the factor that converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
topN (int) – number of genes to consider when calculating the z-score based on the A value (if None, the average number of genes in both topics with weights above the threshold is used)
labels (list) – list of gene names you wish to show in the MA plot
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: MA_plot)
- Returns
return M and A values
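As orientation for the pseudocount, cutoff and consistency_correction parameters, the sketch below computes M and A values and modified z-scores in plain Python. It assumes the conventional definitions (M is the difference of log2 weights, A is their mean, and the modified z-score divides the deviation from the median by consistency_correction times the MAD). The helper names are hypothetical and the code is not taken from the Topyfic source, so the library's exact computation may differ.

```python
import math
import statistics


def ma_values(weights1, weights2, pseudocount=1.0):
    """Return {gene: (M, A)} for genes present in both topics.

    Assumes M = log2(w1 + pc) - log2(w2 + pc) and A = mean of the two logs.
    """
    result = {}
    for gene in sorted(weights1.keys() & weights2.keys()):
        x = math.log2(weights1[gene] + pseudocount)
        y = math.log2(weights2[gene] + pseudocount)
        result[gene] = (x - y, (x + y) / 2)
    return result


def modified_z_scores(values, consistency_correction=1.4826):
    """Median/MAD-based z-scores; the MAD is rescaled to estimate a standard deviation."""
    values = list(values)
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    scale = consistency_correction * mad
    return [(v - median) / scale for v in values]


topic1 = {"geneA": 3.0, "geneB": 0.0}  # illustrative gene weights
topic2 = {"geneA": 1.0, "geneB": 7.0}
print(ma_values(topic1, topic2))
print(modified_z_scores([1, 2, 3, 4, 100]))
```

Under these assumptions, genes whose absolute modified z-score exceeds cutoff would be the ones flagged as differing between the two topics.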
- gene_weight_rank_heatmap(genes=None, topics=None, show_rank=True, scale=None, save=True, show=True, figsize=None, file_format='pdf', file_name='gene_weight_rank_heatmap')[source]¶
plot selected genes weights and their ranks in selected topics
- Parameters
genes (list) – list of genes you want to see their weights (default: all genes)
topics (list) – list of topics
show_rank (bool) – indicate if you want to show the rank of significant genes or not (default: True)
scale (str) – indicate if you want to plot the values as log2 or log10 (default: None, which shows the actual values; other options are log2 and log10)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: gene_weight_rank_heatmap)
- get_feature_name()[source]¶
get feature(gene) name
- Returns
list of feature(gene) name
- Return type
list
- get_gene_weights()[source]¶
get feature(gene) weights
- Returns
dataframe contains feature(gene) weights; genes are indexes and topics are columns
- Return type
pandas dataframe
- get_ranked_gene_weight()[source]¶
get sorted feature(gene) weights; each value holds a gene and its weight in the given topic
- Returns
dataframe containing features (genes) and their weights; ranks are indexes and topics are columns
- Return type
pandas dataframe
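The rank-by-topic layout described above can be pictured with a small plain-Python sketch. `rank_gene_weights` is a hypothetical helper, and the descending sort order is an assumption; the real method returns a pandas dataframe whose exact cell format is not specified here.

```python
def rank_gene_weights(gene_weights):
    """Sort genes by weight within each topic (hypothetical helper).

    gene_weights: dict mapping topic -> {gene: weight}.
    Returns dict mapping topic -> list of (gene, weight), highest weight first,
    so list position plays the role of the rank index.
    """
    return {
        topic: sorted(weights.items(), key=lambda item: item[1], reverse=True)
        for topic, weights in gene_weights.items()
    }


weights = {
    "Topic_1": {"geneA": 0.2, "geneB": 5.0, "geneC": 1.5},
    "Topic_2": {"geneA": 3.0, "geneB": 0.1, "geneC": 0.4},
}
for topic, ranked in rank_gene_weights(weights).items():
    print(topic, ranked)
```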
- get_top_model_attributes()[source]¶
get the top model attributes needed to build a sklearn.decomposition.LatentDirichletAllocation
- Returns
three dataframes: the first holds the components, the second holds exp_dirichlet_component, and the last combines the remaining LDA attributes
- Return type
pandas dataframe, pandas dataframe, pandas dataframe
- save_rLDA_model(name='rLDA', save_path='', file_format='joblib')[source]¶
save rLDA model (instance of LDA model in sklearn) as a joblib/HDF5 file.
- Parameters
name (str) – name of the joblib file (default: rLDA)
save_path (str) – directory in which to save the file (default: next to the script)
file_format (str) – format of the file you want to save (option: joblib (default), HDF5)
- save_topModel(name=None, save_path='', file_format='pickle')[source]¶
save TopModel class as a pickle/HDF5 file
- Parameters
name (str) – name of the file (default: topModel_TopModel.name)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
file_format (str) – format of the file you want to save (option: pickle (default), HDF5)
- class Topyfic.analysis.Analysis(Top_model, colors_topics=None, cell_participation=None)[source]¶
A class used to investigate the topics and gene weights compositions
- Parameters
Top_model (TopModel) – top model used for analysing topics and gene weight compositions and for calculating cell participation
colors_topics (pandas dataframe) – dataframe that maps colors to topics
cell_participation (anndata) – anndata that stores cell participation along with cell information in obs
- TopicTraitRelationshipHeatmap(metaData, alternative='two-sided', annotation=False, save=True, show=True, file_format='pdf', file_name='topic-traitRelationships')[source]¶
plot topic-trait relationship heatmap
- Parameters
metaData (list) – traits you would like to see the relationship with topics (must be column name of cell_participation.obs)
alternative (str) – Defines the alternative hypothesis for calculating correlation for module-trait relationship. Default is ‘two-sided’. The following options are available: ‘two-sided’: the correlation is nonzero, ‘less’: the correlation is negative (less than zero), ‘greater’: the correlation is positive (greater than zero)
annotation (bool) – indicate if you want to add correlation and p_values as a text in each square (default:False)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: topic-traitRelationships)
- average_cell_participation(label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='average_cell_participation')[source]¶
barplot showing average of cell participation in each topic
- Parameters
label (dict) – dictionary mapping each topic name to the name you want to display, if you want to change the default topic names
color (str) – color of bar plot (default: blue)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: average_cell_participation)
- average_cell_participation_line_plot(topic, color, category, color_pallet=None, save=True, show=True, figsize=None, file_format='pdf', file_name='line_average_cell_participation')[source]¶
line plot showing average of cell participation in topic divided by two features of cells (i.e. cell type and time point)
- Parameters
topic (str) – name of the topic
color (str) – name of the feature for which you want one line per group (it should be a column name of cell_participation.obs)
color_pallet (dict) – color of each category of color (if None, colors are assigned randomly)
category (str) – name of the feature you want to have on the x axis (it should be a column name of cell_participation.obs)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: line_average_cell_participation)
- calculate_cell_participation(data)[source]¶
Calculate cell participation for the given data
- Parameters
data (anndata) – processed expression data along with cells and genes/region information
- cell_participation_distribution(plot_type='violin', threshold=0.05, max_topic=True, label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='dist_cell_participation')[source]¶
plot showing distribution of max/all topics in cell participation for each topic
- Parameters
plot_type (str) – type of the plot, which can be “violin” or “box”
threshold (float) – threshold used to filter out cells with low participation in each topic (default: 0.05)
max_topic (bool) – indicate if you want to consider all topics for each cell (False) or only the topic with the highest participation (max topic) for each cell (True) (default: True)
label (dict) – dictionary mapping each topic name to the name you want to display, if you want to change the default topic names
color (str) – color of bar plot (default: blue)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: dist_cell_participation)
- static convertDatTraits(data)[source]¶
get the data-trait module based on sample information
- Returns
a dataframe containing the information in a format suitable for plotting the module-trait relationship heatmap
- Return type
pandas dataframe
- extract_cells(level, category, top_cells=0.05, min_cell_participation=0.05, min_cells=50, file_name=None, save=False)[source]¶
extract subset of cells and cells participation with specific criteria
- Parameters
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to extract, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])
top_cells (float) – fraction of the cells you want to be considered (default: 0.05)
min_cell_participation (float) – minimum participation a cell must have in a topic to be counted (default: 0.05)
min_cells (int) – minimum number of cells a topic must have to be reported (default: 50)
file_name (str) – name and path used to save the tables (default: selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv and cellParticipation_selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv)
save (bool) – indicate if you want to save the data or not (default: False)
- Returns
table containing the IDs of cells that pass the threshold for each topic, table containing the cell participation of cells that pass the threshold for each topic (same order as the first table)
- Return type
pandas dataframe, pandas dataframe
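One way the three thresholds could interact is sketched below in plain Python. The order of operations (take the top fraction of cells per topic, drop cells below min_cell_participation, then report the topic only if at least min_cells remain) is an assumption made for illustration, and `select_cells` is a hypothetical helper, not the Topyfic implementation.

```python
import math


def select_cells(participation, top_cells=0.05, min_cell_participation=0.05, min_cells=50):
    """Pick high-participation cells per topic (hypothetical helper).

    participation: dict mapping topic -> {cell_id: participation}.
    Returns dict mapping topic -> list of (cell_id, participation), highest first.
    """
    selected = {}
    for topic, cells in participation.items():
        # Assumption: the top fraction is taken per topic and rounded up.
        n_top = math.ceil(len(cells) * top_cells)
        ranked = sorted(cells.items(), key=lambda item: item[1], reverse=True)[:n_top]
        kept = [(cell, value) for cell, value in ranked if value >= min_cell_participation]
        # Assumption: topics with too few remaining cells are not reported.
        if len(kept) >= min_cells:
            selected[topic] = kept
    return selected


toy = {
    "Topic_1": {f"cell{i}": i / 10 for i in range(10)},
    "Topic_2": {f"cell{i}": 0.01 for i in range(10)},
}
print(select_cells(toy, top_cells=0.5, min_cell_participation=0.05, min_cells=2))
```

In this toy input only Topic_1 is reported, because none of the Topic_2 cells reach the participation floor.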
- max_topic_cell_participation(cutoff=10, color='blue', title='Maximum cell topic participation for each cells', save=True, show=True, figsize=None, file_format='pdf', file_name='max_topic_cell_participation')[source]¶
step plot showing maximum cell participation
- Parameters
cutoff (float) – cells with maximum participation below this value are eliminated (default: 10)
color (str) – color of bar plot (default: blue)
title (str) – title to add to the plot (default: Maximum cell topic participation for each cells)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: max_topic_cell_participation)
- pie_structure_Chart(level, category=None, ascending=None, n=5, save=True, show=True, figsize=None, file_format='pdf', file_name='piechart_topicAvgCell')[source]¶
plot pie charts that show the contribution of each topic to each category (e.g. cell type)
- Parameters
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to plot pie charts for, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])
ascending (list of bool) – for each pie chart on which order you want to sort your data (default is descending for all pie charts)
n (int) – number of topics you want to annotate in pie charts (default: 5)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: piechart_topicAvgCell)
- plot_topic_composition(category, level='topic', biotype='biotype', label=False, save=True, show=True, file_format='pdf', file_name='gene_composition')[source]¶
plot gene composition dividing by gene biotype or topics
- Parameters
category (str) – topic name or gene biotype name you want to see gene composition for
level (str) – indicate whether you want to show it within each topic or gene biotype (options: “topic” or “gene_biotype”) (default: topic)
biotype (str) – name of the column in gene_weight to look for gene_biotype (default: biotype)
label (bool) – show label of each line within plot or not (default: False)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: gene_composition)
- save_analysis(name=None, save_path='')[source]¶
save Analysis class as a pickle file
- Parameters
name (str) – name of the pickle file (default: analysis_Analysis.top_model.name)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- structure_plot(level, category=None, topic_order=None, ascending=None, metaData=None, metaData_palette=None, width=None, n=2, order_cells=['hierarchy'], save=True, show=True, figsize=None, file_format='pdf', file_name='structure_topicAvgCell')[source]¶
plot structure which shows contribution of each topics for each cells in given categories
- Parameters
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to plot, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])
topic_order (list of str) – specific order of topics, given as topic names; if None, topics are sorted by cell participation
ascending (list of bool) – for each structure plot on which order you want to sort your data (default is descending for all structure plot)
metaData (list) – if you want to add an annotation for each cell, add the column name of that information (make sure you have that information in your cell_participation.obs)
metaData_palette (dict) – color palette for each metaData you add
width (list of int) – width ratios of each category (default is based on the number of the cells we have in each category)
n (int) – number of topics you want to sum if order_cells includes ‘sum’ (default: 2)
order_cells (list) – determine which kind of sorting options you want to use (‘sum’, ‘hierarchy’, sort by metaData); sum: sort cells by sum of top n topics; hierarchy: sort data by hierarchical clustering; metaData: sort by metaData (default: [‘hierarchy’])
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: structure_topicAvgCell)
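To make the ‘sum’ ordering concrete, here is a plain-Python sketch. It assumes that the "top n topics" are the n topics with the largest total participation over the plotted cells and that cells are then sorted by their summed participation in those topics. This is one plausible reading of the parameter description, and `order_cells_by_top_topics` is a hypothetical helper rather than Topyfic code.

```python
def order_cells_by_top_topics(participation, n=2, ascending=False):
    """Order cells by their summed participation in the n dominant topics.

    participation: dict mapping cell_id -> {topic: participation}.
    Returns (ordered_cell_ids, top_topics).
    """
    totals = {}
    for topics in participation.values():
        for topic, value in topics.items():
            totals[topic] = totals.get(topic, 0.0) + value
    # Assumption: "top" topics are those with the largest total participation.
    top_topics = sorted(totals, key=totals.get, reverse=True)[:n]

    def score(cell):
        return sum(participation[cell].get(topic, 0.0) for topic in top_topics)

    ordered = sorted(participation, key=score, reverse=not ascending)
    return ordered, top_topics


cells = {
    "a": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
    "b": {"T1": 0.1, "T2": 0.1, "T3": 0.8},
    "c": {"T1": 0.5, "T2": 0.5, "T3": 0.0},
}
print(order_cells_by_top_topics(cells, n=2))
```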
- Topyfic.utilsMakeModel.calculate_leiden_clustering(trains, data, n_top_genes=50, resolution=1, max_iter_harmony=10, min_cell_participation=None, file_format='pdf')[source]¶
Perform Leiden clustering, with or without harmony depending on the number of assays you have, and then remove low-participation topics
- Parameters
trains (list of Train) – list of train class
data (anndata) – gene-count data with cells and genes information
n_top_genes (int) – Number of highly-variable genes to keep (default: 50)
resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)
max_iter_harmony (int) – number of iteration for running harmony (default: 10)
min_cell_participation (float) – minimum cell participation a topic must have to be kept; when None, topics with cell participation greater than 1% of the number of cells (#cells / 100) are kept
file_format (str) – indicate the format of plot (default: pdf)
- Returns
final TopModel instance after clustering and trimming, dataframe containing which run goes to which topic
- Return type
TopModel, pandas dataframe
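The default trimming rule for min_cell_participation can be sketched in plain Python. The sketch assumes a topic's "cell participation" is its participation summed over all cells, compared against #cells / 100 when no threshold is given; `keep_topics` is a hypothetical helper, not the function's actual implementation.

```python
def keep_topics(cell_participation, min_cell_participation=None):
    """Return the topics whose total participation exceeds the threshold.

    cell_participation: list with one {topic: participation} dict per cell.
    """
    n_cells = len(cell_participation)
    if min_cell_participation is None:
        # Documented default: 1% of the number of cells.
        min_cell_participation = n_cells / 100
    totals = {}
    for cell in cell_participation:
        for topic, value in cell.items():
            totals[topic] = totals.get(topic, 0.0) + value
    return [topic for topic, total in totals.items() if total > min_cell_participation]


# 200 illustrative cells: Topic_2 sums to about 1, below the default threshold of 2.
cells = [{"Topic_1": 0.995, "Topic_2": 0.005} for _ in range(200)]
print(keep_topics(cells))
print(keep_topics(cells, min_cell_participation=0.5))
```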
- Topyfic.utilsMakeModel.combine_topModels(topModels, name='Combined_TopModel', data=None, min_cell_participation=None)[source]¶
Combine TopModels. No method is applied during the combination; all models are simply merged as they are
- Parameters
topModels (list of TopModel) – list of topmodels you want to combine
name (str) – name of the combined topmodels
data (anndata) – if you want to remove topics with low cell participation, you can pass the data you used to train models
min_cell_participation (float) – minimum cell participation a topic must have to be kept; when None, topics with cell participation greater than 1% of the number of cells (#cells / 100) are kept
- Returns
return the combined TopModel, number of topics, gene weights
- Return type
TopModel, int, pandas DataFrame
- Topyfic.utilsMakeModel.filter_LDA_model(main_lda, keep)[source]¶
filter LDA based on the topics we want to keep
- Parameters
main_lda (sklearn.decomposition.LatentDirichletAllocation) – Latent Dirichlet Allocation with online variational Bayes algorithm.
keep (pandas dataframe) – dataframe that define which topics we want to keep
- Returns
Latent Dirichlet Allocation with online variational Bayes algorithm, weights of genes in each topics (indexes are topics and columns are genes)
- Return type
sklearn.decomposition.LatentDirichletAllocation, pandas dataframe
- Topyfic.utilsMakeModel.initialize_lda_model(components, exp_dirichlet_component, others)[source]¶
Initialize LDA model by passing all necessary attributes
- Parameters
components (pandas dataframe) – Variational parameters for topic gene distribution
exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution
others (pandas dataframe) – dataframe contains remaining necessary attributes including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.
- Returns
Latent Dirichlet Allocation with online variational Bayes algorithm.
- Return type
sklearn.decomposition.LatentDirichletAllocation
- Topyfic.utilsMakeModel.initialize_rLDA_model(all_components, all_exp_dirichlet_component, all_others, clusters)[source]¶
Initialize reproducible LDA model by calculating all necessary attributes using clustering.
- Parameters
all_components (pandas dataframe) – Variational parameters for topic gene distribution from all single LDA models
all_exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution from all single LDA models
all_others (pandas dataframe) – dataframe contains remaining necessary attributes including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.
clusters (pandas dataframe) – dataframe that mapped each LDA run to each clusters
- Returns
Latent Dirichlet Allocation with online variational Bayes algorithm.
- Return type
sklearn.decomposition.LatentDirichletAllocation
- Topyfic.utilsMakeModel.make_analysis_class(top_model, data, colors_topics=None, save_path='')[source]¶
Creating Analysis object
- Parameters
top_model (TopModel) – top model used for analysing topics and gene weight compositions and for calculating cell participation
data (anndata) – processed expression data along with cells and genes/region information
colors_topics (pandas dataframe) – dataframe that maps colors to topics
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- Topyfic.utilsMakeModel.make_topModel(trains, data, n_top_genes=50, resolution=1, file_format='pdf', save_path='')[source]¶
Create a TopModel based on the train data and save it along with the clustering information
- Parameters
trains (list of Train) – list of train class
data (anndata) – expression data embedded in anndata format along with cells and genes/region information
n_top_genes (int) – Number of highly-variable genes to keep (default: 50)
resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)
file_format (str) – indicate the format of plot (default: pdf)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- Topyfic.utilsMakeModel.plot_cluster_contribution(clustering, feature, show_all=False, portion=True, save=True, show=True, file_format='pdf', file_name='cluster_contribution')[source]¶
barplot shows number of topics contribute to each cluster
- Parameters
clustering (pandas dataframe) – dataframe that map each single LDA run to each cluster
feature (str) – name of the feature you want to see the cluster contribution (should be one of the columns name of clustering df)
show_all (bool) – Indicate if you want to show all clusters or only the ones that pass threshold (default: False)
portion (bool) – Indicate if you want to normalized the bar to show percentage instead of actual value (default: True)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: cluster_contribution)
- Topyfic.utilsMakeModel.read_analysis(file)[source]¶
reading analysis pickle file
- Parameters
file (str) – path of the pickle file
- Returns
analysis instance
- Return type
Analysis class
- Topyfic.utilsMakeModel.read_model_yaml(model_yaml_path='model.yaml', topic_yaml_path=None, cell_topic_participation_path=None, save=True)[source]¶
read YAML files and build the TopModel object
- Parameters
model_yaml_path (str) – model yaml path
topic_yaml_path (str) – path that you use to save all topics information
cell_topic_participation_path (str) – path of cell-topic participation
save (bool) – indicate if you want to save objects (topmodel and analysis) as a pickle file (default: True)
- Returns
Topmodel and analysis objects
- Return type
TopModel, Analysis
- Topyfic.utilsMakeModel.read_topModel(file)[source]¶
reading topModel pickle/HDF5 file
- Parameters
file (str) – path of the pickle/HDF5 file
- Returns
topModel instance
- Return type
TopModel class
- Topyfic.utilsMakeModel.read_train(file)[source]¶
reading train pickle file
- Parameters
file (str) – path of the pickle file
- Returns
train instance
- Return type
Train class
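The read_* functions above restore objects that were previously saved to disk. As a rough illustration of what reading a pickle file involves, here is a standard-library round trip. It assumes the file was written with Python's `pickle` module and uses a plain dictionary as a hypothetical stand-in for a saved object; it is not the Topyfic code.

```python
import os
import pickle
import tempfile


def read_pickle(file):
    """Load and return whatever object was pickled at the given path."""
    with open(file, "rb") as handle:
        return pickle.load(handle)


# Write a stand-in object to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_demo.p")  # hypothetical file name
    with open(path, "wb") as handle:
        pickle.dump({"name": "demo", "k": 5}, handle)
    restored = read_pickle(path)
```

Pickle files can execute code when loaded, so only read files that come from a source you trust.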
- Topyfic.utilsMakeModel.subset_data(data, keep, loc='var')[source]¶
Subsetting data
- Parameters
data (anndata) – data we want to subset
keep (list) – values in the obs/var_names
loc (str) – axis to subset along, ‘obs’ or ‘var’ (default: ‘var’)
- Returns
data we want to keep
- Return type
anndata
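To make the role of loc concrete, the sketch below subsets a tiny table along either axis using only the standard library. It is an illustration under assumptions: the data is a list of rows rather than an anndata object, and `subset_table` is a hypothetical helper, not part of Topyfic.

```python
def subset_table(matrix, obs_names, var_names, keep, loc="var"):
    """Keep only the columns (loc='var') or rows (loc='obs') named in keep."""
    wanted = set(keep)
    if loc == "var":
        cols = [j for j, name in enumerate(var_names) if name in wanted]
        kept = [[row[j] for j in cols] for row in matrix]
        return kept, list(obs_names), [var_names[j] for j in cols]
    if loc == "obs":
        rows = [i for i, name in enumerate(obs_names) if name in wanted]
        return [matrix[i] for i in rows], [obs_names[i] for i in rows], list(var_names)
    raise ValueError("loc must be 'obs' or 'var'")


# Two cells (rows) by three genes (columns).
matrix = [[1, 0, 3],
          [0, 2, 5]]
obs_names = ["cell1", "cell2"]
var_names = ["GeneA", "GeneB", "GeneC"]

by_gene = subset_table(matrix, obs_names, var_names, keep=["GeneA", "GeneC"])
by_cell = subset_table(matrix, obs_names, var_names, keep=["cell2"], loc="obs")
```

With loc='var' the cells are untouched and only the listed genes remain; with loc='obs' the genes are untouched and only the listed cells remain.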
- Topyfic.utilsMakeModel.train_model(name, data, k, n_runs=100, random_state_range=None, n_thread=5, save_path='')[source]¶
Train the model and save it
- Parameters
name (str) – name of the Train class
k (int) – number of topics to learn one LDA model using sklearn package
n_runs (int) – number of runs used to define the rLDA model (default: 100)
random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))
data (anndata) – data in anndata format used to train the LDA model
n_thread (int) – number of threads used to learn the LDA models (default: 5)
save_path (str) – directory used to save the pickle file (default: next to the script)
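The interaction between n_runs, random_state_range and n_thread can be sketched as follows. This is a simplified illustration, not Topyfic's training code: `fit_single_run` is a hypothetical placeholder for fitting one LDA model, and a thread pool is assumed only to show how one seed per run could be dispatched to a fixed number of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_single_run(random_state):
    """Hypothetical placeholder for fitting one LDA model with a fixed seed."""
    return {"random_state": random_state}


def run_all(n_runs=100, random_state_range=None, n_thread=5):
    # Documented default: one seed per run, i.e. range(n_runs).
    if random_state_range is None:
        random_state_range = range(n_runs)
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        # map() preserves input order, so results line up with the seeds.
        return list(pool.map(fit_single_run, random_state_range))


runs = run_all(n_runs=4, n_thread=2)
custom = run_all(random_state_range=[7, 11], n_thread=2)
```

Fixing the seed of every run is what makes the ensemble reproducible: repeating the call with the same random_state_range refits the same set of models.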
- Topyfic.utilsAnalyseModel.GSEA(gene_list, gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]¶
Run Gene Set Enrichment Analysis based on the topic weights using the GSEApy package.
- Parameters
gene_list (pandas series) – pandas series with gene names as the index and their ranks/weights as values
gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)
plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: GSEA)
kwargs – arguments passed to gseapy.prerank(). More info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank
- Returns
dataframe containing these columns: Term: gene set name; ES: enrichment score; NES: normalized enrichment score; NOM p-val: nominal p-value (from the null distribution of the gene set); FDR q-val: FDR q-value (adjusted False Discovery Rate); FWER p-val: family-wise error rate p-value; Tag %: percent of gene set before running enrichment peak (ES); Gene %: percent of gene list before running enrichment peak (ES); Lead_genes: leading edge genes (gene hits before running enrichment peak)
- Return type
pandas dataframe
- Topyfic.utilsAnalyseModel.MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2.0, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]¶
MA plot based on the gene weights of the given topics
- Parameters
topic1 (pandas.series) – gene weight of first topic to be compared
topic2 (pandas.series) – gene weight of second topic to be compared
size (pandas dataframe) – table containing the dot size for each gene (genes are the index)
pseudocount (float) – pseudocount that you want to add (default: 1)
threshold (float) – threshold to filter genes based on A values (default: 1)
cutoff (float) – cutoff for categorizing genes by modified z-score (default: 2.0)
consistency_correction (float) – the factor that converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
topN (int) – number of genes to consider when calculating the z-score based on the A value (if None, the average number of genes with weights above threshold in both topics is used)
labels (list) – list of gene names wish to show in MA-plot
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: MA_plot)
- Returns
M and A values
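For readers unfamiliar with MA plots, the sketch below computes M and A with the conventional definition: M is the log2 ratio of the two weights and A is the mean of their log2 values, each taken after adding the pseudocount. Whether Topyfic uses exactly this formula is an assumption; `ma_values` is a hypothetical helper operating on plain dictionaries rather than pandas series.

```python
import math


def ma_values(topic1, topic2, pseudocount=1):
    """Return {gene: (M, A)} for genes present in both topics.

    M = log2(w1 + pseudocount) - log2(w2 + pseudocount)
    A = mean of the two log2 values
    """
    result = {}
    for gene in topic1.keys() & topic2.keys():
        x = math.log2(topic1[gene] + pseudocount)
        y = math.log2(topic2[gene] + pseudocount)
        result[gene] = (x - y, (x + y) / 2)
    return result


topic1 = {"GeneA": 7.0, "GeneB": 1.0, "GeneC": 3.0}
topic2 = {"GeneA": 1.0, "GeneB": 1.0, "GeneD": 5.0}
ma = ma_values(topic1, topic2)
```

Genes with M near zero have similar weights in both topics, while a large absolute M marks a gene that is specific to one of them; A indicates how strongly the gene is weighted overall.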
- Topyfic.utilsAnalyseModel.compare_topModels(topModels, comparison_method='Jensen–Shannon divergence', output_type='graph', threshold=0.8, topModels_color=None, topModels_label=None, ignore_genes=True, save=False, plot_show=True, figsize=None, plot_format='pdf', file_name='compare_topics')[source]¶
compare topModels using topic gene weights
- Parameters
topModels (list of TopModel class) – list of TopModel objects you want to compare with each other
comparison_method (str) – method used to compare topics. For Jensen–Shannon divergence, -log2 of the value is shown (options: pearson correlation, spearman correlation, Jensen–Shannon divergence, cosine similarity)
output_type (str) – type of output you want. graph: plot as a graph, heatmap: plot as a heatmap, table: table containing the correlations. Note: when plotting Jensen–Shannon divergence as a graph, the values are converted with -log2(), so take that into account when defining threshold
threshold (float) – only applies when you choose circular; only correlations above this value are shown
topModels_color (dict) – dictionary of colors mapping each topic to a color (default: blue)
topModels_label (dict) – dictionary of labels mapping each topic to a label
ignore_genes (bool) – indicate how to treat genes that are present in only one of the topics. “True” ignores those genes; “False” assumes a weight of zero for genes that have no weight in one of the topModels
save (bool) – indicate if you want to save the plot or not (default: False)
plot_show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
plot_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: compare_topics)
- Returns
table containing the correlation between topics; returned only when output_type is table and save is False
- Return type
pandas dataframe
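The ignore_genes option and the Jensen–Shannon comparison can be illustrated with a small standard-library sketch. It is not Topyfic's implementation: `align` and `js_divergence` are hypothetical helpers, topics are plain dictionaries of gene weights, and the divergence is computed in base 2 so that it lies between 0 (identical) and 1 (no overlap).

```python
import math


def align(topic1, topic2, ignore_genes=True):
    """Return two weight lists over a common gene order."""
    if ignore_genes:
        genes = sorted(topic1.keys() & topic2.keys())  # shared genes only
    else:
        genes = sorted(topic1.keys() | topic2.keys())  # union, missing weights are 0
    return [topic1.get(g, 0.0) for g in genes], [topic2.get(g, 0.0) for g in genes]


def _normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def _kl(p, q):
    # Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two weight vectors."""
    p, q = _normalize(p), _normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


topic1 = {"GeneA": 1.0, "GeneB": 1.0}
topic2 = {"GeneA": 1.0, "GeneB": 1.0, "GeneC": 2.0}

shared = js_divergence(*align(topic1, topic2, ignore_genes=True))
union = js_divergence(*align(topic1, topic2, ignore_genes=False))
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The two topics look identical when GeneC is ignored and different when it is counted with a weight of zero in the first topic, which is why the choice of ignore_genes matters. Because smaller divergences mean more similar topics, a -log2 transform turns small divergences into large scores, and the threshold has to be chosen on that transformed scale.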
- Topyfic.utilsAnalyseModel.functional_enrichment_analysis(gene_list, type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]¶
Run functional enrichment analysis, including GO, KEGG and REACTOME
- Parameters
gene_list (list) – list of gene names
type (str) – type of database to use; should be one of “GO”, “REACTOME”
organism (str) – name of the organism you want to do functional enrichment analysis for
sets (str, list, tuple) – Enrichr Library name(s) (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats); only needed if the type is GO
p_value (float) – Defines the pValue threshold. (default: 0.05)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: functional_enrichment_analysis)
- Topyfic.utilsAnalyseModel.modified_zscore(data, consistency_correction=1.4826)[source]¶
Returns the modified z score and Median Absolute Deviation (MAD) from the scores in data. The consistency_correction factor converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
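A minimal sketch of the computation described above, using only the standard library. It follows the usual definition, where MAD is the median of absolute deviations from the median and each score is the deviation divided by consistency_correction times MAD. The function name, the list input and the handling of a zero MAD are assumptions for illustration, not the Topyfic source.

```python
import statistics


def modified_zscore_sketch(data, consistency_correction=1.4826):
    """Return (modified z-scores, MAD) for a sequence of numbers."""
    values = list(data)
    median = statistics.median(values)
    # Median Absolute Deviation: median distance of the values from the median.
    mad = statistics.median([abs(x - median) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    scores = [(x - median) / (consistency_correction * mad) for x in values]
    return scores, mad


scores, mad = modified_zscore_sketch([1, 2, 3, 4, 100])
```

Because the median and the MAD are barely affected by the outlier 100, its score stays very large, whereas an ordinary z-score would be dampened by the inflated standard deviation.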
- Topyfic.utilsAnalyseModel.summarize_GO_Term(GO_terms, p_value=0.05, file_format='html', file_name='GO_sum')[source]¶
Summarize long, unintelligible lists of GO terms by finding a representative subset of the terms showing more unique (child) GO terms. We suggest saving it as html, since it is plotted with plotly and you can then take advantage of plotly
- Parameters
GO_terms (pandas dataframe) – dataframe containing the results of gene ontology analysis performed by GSEApy (https://gseapy.readthedocs.io/en/latest/index.html)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
file_format (str) – indicate the format of plot (default: html)
file_name (str) – name and path used to save the plot (default: GO_sum)
- Returns
dataframe used to plot the results
- Return type
pandas dataframe