API Documentation
- class Topyfic.train.Train(name, k, n_runs=100, random_state_range=None)[source]¶
A class used to train a reproducible latent Dirichlet allocation (rLDA) model
- Parameters:
name (str) – name of the Train class
k (int) – number of topics to learn in each LDA model using the sklearn package
n_runs (int) – number of runs used to define the rLDA model (default: 100)
random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))
top_models (list of TopModel) – list of TopModel class to save all LDA models
- combine_LDA_models(data, single_trains=[])[source]¶
combine the top models of single Train objects
- Parameters:
data (anndata) – data you used to learn model
single_trains (list) – list of single train object
- make_LDA_models_attributes()[source]¶
build the LDA attributes by combining the attributes of all single LDA models; these are needed to define an LDA model (sklearn.decomposition.LatentDirichletAllocation)
- Returns:
three dataframes: the first gathers the components from all LDA runs, the second gathers exp_dirichlet_component from all LDA runs, and the last combines the remaining LDA attributes
- Return type:
pandas dataframe, pandas dataframe, pandas dataframe
- make_single_LDA_model(data, random_state, name, learning_method, batch_size, max_iter, n_jobs, kwargs)[source]¶
train a single LDA model using the sklearn package and embed it in a TopModel class
- Parameters:
name (str) – name of LDA model
data (anndata) – processed expression data along with cells and genes/region information
random_state (int) – Pass an int for reproducible results across multiple function calls
max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)
batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)
learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default=’online’)
n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)
- Returns:
LDA model embedded in TopModel class
- Return type:
TopModel
- run_LDA_models(data, learning_method='online', batch_size=1000, max_iter=10, n_jobs=None, n_thread=1, **kwargs)[source]¶
train LDA models
- Parameters:
max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)
batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)
learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default=’online’)
data (anndata) – expression data embedded in anndata format use to train LDA model
n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)
n_thread (int) – number of threads you used to learn LDA models (default=1)
**kwargs – other parameters of sklearn.decomposition.LatentDirichletAllocation (more info: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- Returns:
None
- Return type:
None
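The relationship between n_runs, random_state_range and n_thread can be sketched as follows. This is only an illustration under stated assumptions: fit_one is a hypothetical stand-in for a single LDA fit, run_all is a hypothetical helper, and the thread pool is an assumed way of parallelising the runs, not necessarily what Topyfic does internally.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_one(random_state, k=3):
    """Hypothetical stand-in for one LDA fit: k numbers seeded by random_state."""
    rng = random.Random(random_state)
    return [rng.random() for _ in range(k)]


def run_all(n_runs=4, random_state_range=None, n_thread=2):
    """Run one fit per random state on a pool of n_thread workers (assumed design)."""
    if random_state_range is None:
        # Documented default: one random state per run, range(n_runs).
        random_state_range = list(range(n_runs))
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        # map() returns results in the order of random_state_range.
        results = list(pool.map(fit_one, random_state_range))
    return dict(zip(random_state_range, results))


first = run_all()
second = run_all()
```

Because every run is tied to its own random state, repeating the whole procedure yields the same per-run results, which is what makes the set of runs reproducible.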
- save_train(name=None, save_path='', file_format='pickle')[source]¶
save Train class as a pickle file
- Parameters:
name (str) – name of the pickle file (default is train_Train.name)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
file_format (str) – format of the file you want to save (option: pickle (default), HDF5)
- class Topyfic.topic.Topic(topic_id, topic_name=None, topic_gene_weights=None, gene_information=None, topic_information=None)[source]¶
A class that stores a topic along with other useful information
- Parameters:
topic_id (str) – ID of topic which is unique
topic_name (str) – name of the topic (default: topic_id)
topic_gene_weights (pandas dataframe) – dataframe that contains the weights of the topic for each gene
gene_information (pandas dataframe) – dataframe that contains information about the genes, e.g. gene biotype
topic_information (pandas dataframe) – dataframe that contains information about the topic, e.g. cell state / cell type
- GSEA(gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]¶
Run Gene Set Enrichment Analysis based on the topic weights using the GSEAPY package.
- Parameters:
gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)
plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: GSEA)
kwargs – Arguments to pass to gseapy.prerank(). more info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank
- Returns:
dataframe containing these columns: Term: gene set name; ES: enrichment score; NES: normalized enrichment score; NOM p-val: nominal p-value (from the null distribution of the gene set); FDR q-val: FDR q-value (adjusted False Discovery Rate); FWER p-val: family-wise error rate p-value; Tag %: percent of gene set before running enrichment peak (ES); Gene %: percent of gene list before running enrichment peak (ES); Lead_genes: leading edge genes (gene hits before running enrichment peak)
- Return type:
pandas dataframe
- functional_enrichment_analysis(type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]¶
Doing functional enrichment analysis including GO, KEGG and REACTOME
- Parameters:
type (str) – indicate the type of databases which it should be one of “GO”, “REACTOME”
organism (str) – name of the organism you want to do functional enrichment analysis for
sets (str, list, tuple) – str, list, tuple of Enrichr Library name(s). (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats) only need to fill if the type is GO
p_value (float) – Defines the pValue threshold. (default: 0.05)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: functional_enrichment_analysis)
- gene_weight_variance(save=True)[source]¶
calculate the gene weight variance
- Parameters:
save (bool) – added as an information to the Topic (default: True)
- Returns:
Gene weight variance for given topic
- Return type:
float
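As a small illustration of the quantity returned, the sketch below computes the variance of a hypothetical list of gene weights with the standard library. Whether gene_weight_variance uses the population or the sample variance is not stated here, so both are shown; the weights are made up for the example.

```python
import statistics

# Hypothetical gene weights of one topic.
weights = [1.0, 2.0, 3.0, 4.0]

# Population variance (divides by n).
population_variance = statistics.pvariance(weights)

# Sample variance (divides by n - 1).
sample_variance = statistics.variance(weights)
```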
- update_gene_information(gene_information)[source]¶
update/add gene information for the topic
- Parameters:
gene_information (pandas dataframe) – dataframe containing the gene information we would like to add/update (its index should match the index of gene_information in the class)
- write_topic_yaml(topic_id=None, model_yaml_path='model.yaml', topic_yaml_path='topic.yaml', save=True)[source]¶
write topic in YAML format
- Parameters:
topic_id (str) – unique topic ID (default is topic ID)
model_yaml_path (str) – model yaml path that has information about the dataset you use
topic_yaml_path (str) – path that you use to save topic
save (bool) – indicate if you want to save the yaml file (True) or just show it (False) (default: True)
- class Topyfic.topModel.TopModel(name, N, topics=None, gene_weights=None, gene_information=None, model=None)[source]¶
A class that stores a model
- Parameters:
name (str) – name of class
N (int) – number of topics
gene_weights (pandas dataframe) – dataframe that has the weights of genes for each topic; genes are indexes and topics are columns
topics (Dictionary of Topics) – dictionary contains all topics for the topmodel
model (sklearn.decomposition.LatentDirichletAllocation) – store reproducible LDA model
- MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]¶
plot MA based on the gene weights on given topics
- Parameters:
topic1 (str) – first topic to be compared
topic2 (str) – second topic to be compared
size (pandas dataframe) – table contains size of dot for each genes (genes are index)
pseudocount (float) – pseudocount that you want to add (default: 1)
threshold (float) – threshold to filter genes based on A values (default: 1)
cutoff (float) – cutoff for categorized genes by modified z-score (default: 2)
consistency_correction (float) – the factor converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
topN (int) – number of genes to consider when calculating the z-score based on the A value (if None, it is the average number of genes in both topics with weights above the threshold)
labels (list) – list of gene names wish to show in MA-plot
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: MA_plot)
- Returns:
return M and A values
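The quantities behind this plot can be sketched as follows. The formulas are the conventional MA-plot definitions (M as a log2 ratio, A as a mean log2 value) and a MAD-based modified z-score scaled by consistency_correction; that MA_plot computes them exactly this way is an assumption. ma_values and modified_z_scores are hypothetical helper names, and the gene weights are made up.

```python
import math
import statistics


def ma_values(weights1, weights2, pseudocount=1.0):
    """Return {gene: (M, A)} for genes present in both topics.

    M is the log2 ratio and A the mean log2 weight (conventional
    definitions, assumed rather than taken from the Topyfic source).
    """
    result = {}
    for gene in weights1.keys() & weights2.keys():
        x = math.log2(weights1[gene] + pseudocount)
        y = math.log2(weights2[gene] + pseudocount)
        result[gene] = (x - y, (x + y) / 2)
    return result


def modified_z_scores(values, consistency_correction=1.4826):
    """Modified z-score: (v - median) / (consistency_correction * MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [(v - med) / (consistency_correction * mad) for v in values]


# Hypothetical gene weights for two topics.
topic1 = {"geneA": 7.0, "geneB": 1.0, "geneC": 3.0}
topic2 = {"geneA": 1.0, "geneB": 1.0, "geneC": 1.0}

ma = ma_values(topic1, topic2)
genes = sorted(ma)
z = modified_z_scores([ma[g][0] for g in genes])
```

Genes whose absolute modified z-score exceeds cutoff would then be the ones highlighted as differing between the two topics.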
- gene_weight_rank_heatmap(genes=None, topics=None, show_rank=True, scale=None, save=True, show=True, figsize=None, file_format='pdf', file_name='gene_weight_rank_heatmap')[source]¶
plot selected genes weights and their ranks in selected topics
- Parameters:
genes (list) – list of genes you want to see their weights (default: all genes)
topics (list) – list of topics
show_rank (bool) – indicate if you want to show the rank of significant genes or not (default: True)
scale (str) – indicate if you want to plot values as log2 or log10 (default: None, which shows the actual values; other options are log2 and log10)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: gene_weight_rank_heatmap)
- get_feature_name()[source]¶
get feature(gene) name
- Returns:
list of feature(gene) name
- Return type:
list
- get_gene_weights()[source]¶
get feature(gene) weights
- Returns:
dataframe contains feature(gene) weights; genes are indexes and topics are columns
- Return type:
pandas dataframe
- get_ranked_gene_weight()[source]¶
get sorted feature(gene) weights; each value is a gene and its weight in the given topic
- Returns:
dataframe contains feature(gene) and their weights; ranks are indexes and topics are columns
- Return type:
pandas dataframe
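The idea of a ranked gene-weight table can be sketched without pandas: for each topic, genes are sorted by decreasing weight, so position 0 holds the top gene. ranked_gene_weights is a hypothetical helper and the nested dictionaries stand in for the dataframe; the real method returns a pandas dataframe.

```python
def ranked_gene_weights(gene_weights):
    """Map {topic: {gene: weight}} to {topic: [(gene, weight), ...]}.

    Each list is sorted by decreasing weight, so its index is the rank.
    """
    ranked = {}
    for topic, weights in gene_weights.items():
        ranked[topic] = sorted(
            weights.items(), key=lambda item: item[1], reverse=True
        )
    return ranked


# Hypothetical gene weights for two topics.
weights = {
    "Topic_1": {"geneA": 0.2, "geneB": 5.0, "geneC": 1.5},
    "Topic_2": {"geneA": 3.0, "geneB": 0.1, "geneC": 0.4},
}
ranked = ranked_gene_weights(weights)
```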
- get_top_model_attributes()[source]¶
get top model attributes to be able to make sklearn.decomposition.LatentDirichletAllocation
- Returns:
three dataframes: the first is components, the second is exp_dirichlet_component and the last combines the remaining LDA attributes
- Return type:
pandas dataframe, pandas dataframe, pandas dataframe
- save_rLDA_model(name='rLDA', save_path='', file_format='joblib')[source]¶
save rLDA model (instance of LDA model in sklearn) as a joblib/HDF5 file.
- Parameters:
name (str) – name of the joblib file (default: rLDA)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
file_format (str) – format of the file you want to save (option: joblib (default), HDF5)
- save_topModel(name=None, save_path='', file_format='pickle')[source]¶
save TopModel class as a pickle/HDF5 file
- Parameters:
name (str) – name of the file (default: topModel_TopModel.name)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
file_format (str) – format of the file you want to save (option: pickle (default), HDF5)
- class Topyfic.analysis.Analysis(Top_model, colors_topics=None, cell_participation=None)[source]¶
A class used to investigate the topics and gene weights compositions
- Parameters:
Top_model (TopModel) – top model that is used for analysing topics and gene weight compositions and for calculating cell participation
colors_topics (pandas dataframe) – dataframe that maps colors to topics
cell_participation (anndata) – anndata that stores cell participation along with cell information in obs
- TopicTraitRelationshipHeatmap(metaData, alternative='two-sided', annotation=False, save=True, show=True, file_format='pdf', file_name='topic-traitRelationships')[source]¶
plot topic-trait relationship heatmap
- Parameters:
metaData (list) – traits you would like to see the relationship with topics (must be column name of cell_participation.obs)
alternative (str) – Defines the alternative hypothesis for calculating correlation for module-trait relationship. Default is ‘two-sided’. The following options are available: ‘two-sided’: the correlation is nonzero, ‘less’: the correlation is negative (less than zero), ‘greater’: the correlation is positive (greater than zero)
annotation (bool) – indicate if you want to add correlation and p_values as a text in each square (default:False)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: topic-traitRelationships)
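Each square of such a heatmap summarises how strongly a topic's cell participation tracks a trait. The sketch below computes a plain Pearson correlation for one topic and one numeric trait; that Pearson correlation is the statistic used is an assumption, pearson_r is a hypothetical helper, and the p-value calculation controlled by alternative is omitted.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical participation of four cells in one topic and a numeric trait.
participation = [0.1, 0.2, 0.3, 0.4]
trait = [1, 2, 3, 4]

r_positive = pearson_r(participation, trait)
r_negative = pearson_r(participation, trait[::-1])
```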
- average_cell_participation(label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='average_cell_participation')[source]¶
barplot showing average of cell participation in each topic
- Parameters:
label (dict) – dictionary mapping each default topic name to the name you want to show (only needed if you want to change the default topic names)
color (str) – color of bar plot (default: blue)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: average_cell_participation)
- average_cell_participation_line_plot(topic, color, category, color_pallet=None, save=True, show=True, figsize=None, file_format='pdf', file_name='line_average_cell_participation')[source]¶
line plot showing average of cell participation in topic divided by two features of cells (i.e. cell type and time point)
- Parameters:
topic (str) – name of the topic
color (str) – name of the feature you want to have one line per group of (it should be a column name of cell_participation.obs)
color_pallet (dict) – color of each category of color (if None, colors are assigned randomly)
category (str) – name of the feature you want to have on the x axis (it should be a column name of cell_participation.obs)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: line_average_cell_participation)
- calculate_cell_participation(data)[source]¶
Calculate cell participation for the given data
- Parameters:
data (anndata) – processed expression data along with cells and genes/region information
- cell_participation_distribution(plot_type='violin', threshold=0.05, max_topic=True, label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='dist_cell_participation')[source]¶
plot showing distribution of max/all topics in cell participation for each topic
- Parameters:
plot_type (str) – type of the plot which can be “violin” or “box”
threshold (float) – indicate the threshold to filter out cells with low participation in each topic (default: 0.05)
max_topic (bool) – indicate if you want to consider all topics for each cell (False) or only the topic with the highest participation (max topic) for each cell (True)
label (dict) – dictionary mapping each default topic name to the name you want to show (only needed if you want to change the default topic names)
color (str) – color of bar plot (default: blue)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: dist_cell_participation)
- static convertDatTraits(data)[source]¶
get the data trait module based on sample information
- Returns:
a dataframe contains information in suitable format for plotting module trait relationship heatmap
- Return type:
pandas dataframe
- extract_cells(level, category, top_cells=0.05, min_cell_participation=0.05, min_cells=50, file_name=None, save=False)[source]¶
extract subset of cells and cells participation with specific criteria
- Parameters:
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to plot which are subsets of cell_participation.obs[level](default: all the unique items in cell_participation.obs[level])
top_cells (float) – fraction of the cells you want to be considered (default: 0.05)
min_cell_participation (float) – minimum cell participation a cell should have in a topic to be counted (default: 0.05)
min_cells (int) – minimum number of cells a topic should have to be reported (default: 50)
file_name (str) – name and path of the plot use for save (default: selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv and cellParticipation_selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv)
save (bool) – indicate if you want to save the data or not (default: False)
- Returns:
table containing the cell IDs that pass the threshold for each topic, table containing the cell participation of the cells that pass the threshold for each topic (same order as the first table)
- Return type:
pandas dataframe, pandas dataframe
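How the three thresholds could interact is sketched below. The order of the steps (filter by min_cell_participation, keep the top fraction of cells, then drop topics with fewer than min_cells cells) and the rounding of the fraction are assumptions, and top_cells_per_topic is a hypothetical helper working on plain dictionaries rather than on an anndata object.

```python
import math


def top_cells_per_topic(participation, top_cells=0.05,
                        min_cell_participation=0.05, min_cells=50):
    """Select the highest-participation cells of each topic.

    participation maps {topic: {cell_id: value}}. The step order and the
    use of ceil() are assumptions made for this sketch.
    """
    selected = {}
    for topic, cells in participation.items():
        # Drop cells below the minimum participation.
        passing = [(c, v) for c, v in cells.items()
                   if v >= min_cell_participation]
        passing.sort(key=lambda item: item[1], reverse=True)
        # Keep at most the requested fraction of all cells.
        n_keep = math.ceil(top_cells * len(cells))
        kept = passing[:n_keep]
        # Report the topic only if enough cells remain.
        if len(kept) >= min_cells:
            selected[topic] = kept
    return selected


# Hypothetical participation of ten cells in two topics.
cells = [f"cell{i}" for i in range(10)]
participation = {
    "Topic_1": {c: i / 10 for i, c in enumerate(cells)},
    "Topic_2": {c: (0.9 if i < 2 else 0.01) for i, c in enumerate(cells)},
}
selected = top_cells_per_topic(
    participation, top_cells=0.5, min_cell_participation=0.05, min_cells=3
)
```

In this toy input Topic_2 has only two cells above the participation threshold, so it is not reported.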
- max_topic_cell_participation(cutoff=10, color='blue', title='Maximum cell topic participation for each cells', save=True, show=True, figsize=None, file_format='pdf', file_name='max_topic_cell_participation')[source]¶
step plot showing maximum cell participation
- Parameters:
cutoff (float) – indicate if you want to eliminate any cells with maximum participation less than this
color (str) – color of bar plot (default: blue)
title (str) – title to add to the plot (default: Maximum cell topic participation for each cells)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: max_topic_cell_participation)
- pie_structure_Chart(level, category=None, ascending=None, n=5, save=True, show=True, figsize=None, file_format='pdf', file_name='piechart_topicAvgCell')[source]¶
plot pie charts that show the contribution of each topic to each category (e.g. cell type)
- Parameters:
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to plot pie charts which are subsets of cell_participation.obs[level](default: all the unique items in cell_participation.obs[level])
ascending (list of bool) – for each pie chart on which order you want to sort your data (default is descending for all pie charts)
n (int) – number of topics you want to annotate in pie charts (default: 5)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: piechart_topicAvgCell)
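The numbers behind one pie chart per category can be sketched as the mean participation of each topic over the cells of that category. That the pies show means (rather than sums or medians) is an assumption, and mean_topic_participation is a hypothetical helper using dictionaries in place of cell_participation and cell_participation.obs[level].

```python
from collections import defaultdict


def mean_topic_participation(participation, labels):
    """Average topic participation per category.

    participation maps {cell: {topic: value}}; labels maps {cell: category}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for cell, topics in participation.items():
        category = labels[cell]
        counts[category] += 1
        for topic, value in topics.items():
            sums[category][topic] += value
    return {
        category: {t: s / counts[category] for t, s in topics.items()}
        for category, topics in sums.items()
    }


# Hypothetical participation of three cells in two topics.
participation = {
    "c1": {"Topic_1": 0.75, "Topic_2": 0.25},
    "c2": {"Topic_1": 0.25, "Topic_2": 0.75},
    "c3": {"Topic_1": 1.0, "Topic_2": 0.0},
}
labels = {"c1": "T cell", "c2": "T cell", "c3": "B cell"}
means = mean_topic_participation(participation, labels)
```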
- plot_topic_composition(category, level='topic', biotype='biotype', label=False, save=True, show=True, file_format='pdf', file_name='gene_composition')[source]¶
plot gene composition dividing by gene biotype or topics
- Parameters:
category (str) – topic name or gene biotype name you want to see gene composition for
level (str) – indicate whether you want to show it within each topic or gene biotype (options: “topic” or “gene_biotype”) (default: topic)
biotype (str) – name of the column in gene_weight to look for gene_biotype (default: biotype)
label (bool) – show label of each line within plot or not (default: False)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: gene_composition)
- save_analysis(name=None, save_path='')[source]¶
save Analysis class as a pickle file
- Parameters:
name (str) – name of the pickle file (default: analysis_Analysis.top_model.name)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- structure_plot(level, category=None, topic_order=None, ascending=None, metaData=None, metaData_palette=None, width=None, n=2, order_cells=['hierarchy'], save=True, show=True, figsize=None, file_format='pdf', file_name='structure_topicAvgCell')[source]¶
structure plot showing the contribution of each topic to each cell in the given categories
- Parameters:
level (str) – name of the column from cell_participation.obs
category (list of str) – list of items you want to plot which are subsets of cell_participation.obs[level](default: all the unique items in cell_participation.obs[level])
topic_order (list of str) – specific order of topics, given as topic names. If None, topics are sorted by cell participation
ascending (list of bool) – for each structure plot on which order you want to sort your data (default is descending for all structure plot)
metaData (list) – if you want to add annotation for each cell, add the column name of that information (make sure you have that information in your cell_participation.obs)
metaData_palette (dict) – color palette for each metaData you add
width (list of int) – width ratios of each category (default is based on the number of the cells we have in each category)
n (int) – number of topics you want to sum if you use ‘sum’ in order_cells (default: 2)
order_cells (list) – determine which kind of sorting options you want to use (‘sum’, ‘hierarchy’, sort by metaData); sum: sort cells by the sum of the top n topics; hierarchy: sort data by hierarchical clustering; metaData: sort by metaData (default: [‘hierarchy’])
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: structure_topicAvgCell)
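The ‘sum’ ordering can be sketched as follows, under the assumption that "top n topics" means the n largest participation values of each individual cell; if the top n topics are instead chosen once for the whole category, the ordering would differ. order_cells_by_top_n_sum is a hypothetical helper.

```python
def order_cells_by_top_n_sum(participation, n=2):
    """Sort cell IDs by the sum of their n largest topic participations.

    participation maps {cell_id: [participation per topic]}. Taking the
    top n per cell is an assumption made for this sketch.
    """
    def top_n_sum(values):
        return sum(sorted(values, reverse=True)[:n])

    return sorted(
        participation,
        key=lambda cell: top_n_sum(participation[cell]),
        reverse=True,
    )


# Hypothetical participation of three cells in three topics.
participation = {
    "cellA": [0.5, 0.3, 0.2],
    "cellB": [0.4, 0.35, 0.25],
    "cellC": [0.9, 0.05, 0.05],
}
order = order_cells_by_top_n_sum(participation, n=2)
```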
- Topyfic.utilsMakeModel.calculate_leiden_clustering(trains, data, n_top_genes=None, resolution=1, max_iter_harmony=10, min_cell_participation=None, file_format='pdf')[source]¶
Perform Leiden clustering, with or without Harmony depending on the number of assays you have, and then remove low-participation topics
- Parameters:
trains (list of Train) – list of train class
data (anndata) – gene-count data with cells and genes information
n_top_genes (int) – Number of highly-variable genes to keep (default: None)
resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)
max_iter_harmony (int) – number of iteration for running harmony (default: 10)
min_cell_participation (float) – minimum cell participation a topic needs across cells to be kept; when None, topics with cell participation of more than 1% of #cells (#cells / 100) are kept
file_format (str) – indicate the format of plot (default: pdf)
- Returns:
final TopModel instance after clustering and trimming, dataframe containing which run goes to which topic
- Return type:
TopModel, pandas dataframe
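The default trimming rule for min_cell_participation can be sketched as below. Reading "cell participation across" as the sum of a topic's participation over all cells is an assumption, and keep_topics is a hypothetical helper operating on plain lists.

```python
def keep_topics(cell_participation, min_cell_participation=None):
    """Return the topics whose total participation exceeds the threshold.

    cell_participation maps {topic: [participation per cell]}. With no
    threshold given, the documented default of #cells / 100 is used.
    """
    n_cells = len(next(iter(cell_participation.values())))
    if min_cell_participation is None:
        min_cell_participation = n_cells / 100
    return [
        topic
        for topic, values in cell_participation.items()
        if sum(values) > min_cell_participation
    ]


# Hypothetical participation of 200 cells in two topics (threshold: 2.0).
cell_participation = {
    "Topic_1": [0.5] * 200,
    "Topic_2": [0.005] * 200,
}
kept_default = keep_topics(cell_participation)
kept_custom = keep_topics(cell_participation, min_cell_participation=0.5)
```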
- Topyfic.utilsMakeModel.combine_topModels(topModels, name='Combined_TopModel', data=None, min_cell_participation=None)[source]¶
Combine several topmodels. No additional method is applied when combining them; the models are simply merged
- Parameters:
topModels (list of TopModel) – list of topmodels you want to combine
name (str) – name of the combined topmodels
data (anndata) – if you want to remove topics with low cell participation, you can pass the data you used to train models
min_cell_participation (float) – minimum cell participation a topic needs across cells to be kept; when None, topics with cell participation of more than 1% of #cells (#cells / 100) are kept
- Returns:
return the combined TopModel, number of topics, gene weights
- Return type:
TopModel, int, pandas DataFrame
- Topyfic.utilsMakeModel.filter_LDA_model(main_lda, keep)[source]¶
filter LDA based on the topics we want to keep
- Parameters:
main_lda (sklearn.decomposition.LatentDirichletAllocation) – Latent Dirichlet Allocation with online variational Bayes algorithm.
keep (pandas dataframe) – dataframe that define which topics we want to keep
- Returns:
Latent Dirichlet Allocation with online variational Bayes algorithm, weights of genes in each topics (indexes are topics and columns are genes)
- Return type:
sklearn.decomposition.LatentDirichletAllocation, pandas dataframe
- Topyfic.utilsMakeModel.initialize_lda_model(components, exp_dirichlet_component, others)[source]¶
Initialize LDA model by passing all necessary attributes
- Parameters:
components (pandas dataframe) – Variational parameters for topic gene distribution
exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution
others (pandas dataframe) – dataframe contains remaining necessary attributes including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.
- Returns:
Latent Dirichlet Allocation with online variational Bayes algorithm.
- Return type:
sklearn.decomposition.LatentDirichletAllocation
- Topyfic.utilsMakeModel.initialize_rLDA_model(all_components, all_exp_dirichlet_component, all_others, clusters)[source]¶
Initialize reproducible LDA model by calculating all necessary attributes using clustering.
- Parameters:
all_components (pandas dataframe) – Variational parameters for topic gene distribution from all single LDA models
all_exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution from all single LDA models
all_others (pandas dataframe) – dataframe contains remaining necessary attributes including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.
clusters (pandas dataframe) – dataframe that maps each LDA run to a cluster
- Returns:
Latent Dirichlet Allocation with online variational Bayes algorithm.
- Return type:
sklearn.decomposition.LatentDirichletAllocation
- Topyfic.utilsMakeModel.make_analysis_class(top_model, data, colors_topics=None, save_path='')[source]¶
Creating Analysis object
- Parameters:
top_model (TopModel) – top model that is used for analysing topics and gene weight compositions and for calculating cell participation
data (anndata) – processed expression data along with cells and genes/region information
colors_topics (pandas dataframe) – dataframe that maps colors to topics
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- Topyfic.utilsMakeModel.make_topModel(trains, data, n_top_genes=50, resolution=1, file_format='pdf', save_path='')[source]¶
Create a topModel based on the train data and save it along with the clustering information
- Parameters:
trains (list of Train) – list of train class
data (anndata) – expression data embedded in anndata format along with cells and genes/region information
n_top_genes (int) – Number of highly-variable genes to keep (default: 50)
resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)
file_format (str) – indicate the format of plot (default: pdf)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- Topyfic.utilsMakeModel.plot_cluster_contribution(clustering, feature, show_all=False, portion=True, save=True, show=True, file_format='pdf', file_name='cluster_contribution')[source]¶
barplot showing the number of topics contributing to each cluster
- Parameters:
clustering (pandas dataframe) – dataframe that map each single LDA run to each cluster
feature (str) – name of the feature you want to see the cluster contribution (should be one of the columns name of clustering df)
show_all (bool) – Indicate if you want to show all clusters or only the ones that pass threshold (default: False)
portion (bool) – Indicate if you want to normalize the bars to show percentages instead of actual values (default: True)
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: cluster_contribution)
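The effect of portion can be sketched with a small counting example: for each cluster, the runs are counted per value of feature and, if requested, divided by the cluster total. The "cluster" key, the "assay" feature and the helper cluster_contribution are hypothetical; the real function takes a pandas dataframe and draws a barplot.

```python
from collections import Counter


def cluster_contribution(clustering, feature, portion=True):
    """Count rows per cluster and feature value, optionally as fractions.

    clustering is a list of dict rows with a 'cluster' key (hypothetical
    column name) and a key named by feature.
    """
    counts = {}
    for row in clustering:
        counts.setdefault(row["cluster"], Counter())[row[feature]] += 1
    result = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        result[cluster] = {
            value: (n / total if portion else n)
            for value, n in counter.items()
        }
    return result


# Hypothetical clustering table: five single-run topics in two clusters.
clustering = [
    {"cluster": "0", "assay": "A"},
    {"cluster": "0", "assay": "A"},
    {"cluster": "0", "assay": "B"},
    {"cluster": "0", "assay": "B"},
    {"cluster": "1", "assay": "A"},
]
portions = cluster_contribution(clustering, "assay", portion=True)
raw_counts = cluster_contribution(clustering, "assay", portion=False)
```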
- Topyfic.utilsMakeModel.read_analysis(file)[source]¶
reading analysis pickle file
- Parameters:
file (str) – path of the pickle file
- Returns:
analysis instance
- Return type:
Analysis class
- Topyfic.utilsMakeModel.read_model_yaml(model_yaml_path='model.yaml', topic_yaml_path=None, cell_topic_participation_path=None, save=True)[source]¶
read YAML files and build the topmodel and analysis objects
- Parameters:
model_yaml_path (str) – model yaml path
topic_yaml_path (str) – path that you use to save all topics information
cell_topic_participation_path (str) – path of cell-topic participation
save (bool) – indicate if you want to save objects (topmodel and analysis) as a pickle file (default: True)
- Returns:
Topmodel and analysis objects
- Return type:
TopModel, Analysis
- Topyfic.utilsMakeModel.read_topModel(file)[source]¶
reading topModel pickle/HDF5 file
- Parameters:
file (str) – path of the pickle/HDF5 file
- Returns:
topModel instance
- Return type:
TopModel class
- Topyfic.utilsMakeModel.read_train(file)[source]¶
reading train pickle file
- Parameters:
file (str) – path of the pickle file
- Returns:
train instance
- Return type:
Train class
- Topyfic.utilsMakeModel.subset_data(data, keep, loc='var')[source]¶
Subsetting data
- Parameters:
data (anndata) – data we want to subset
keep (list) – values in the obs/var_names
loc (str) – direction in which to subset, ‘obs’ or ‘var’ (default: ‘var’)
- Returns:
data we want to keep
- Return type:
anndata
- Topyfic.utilsMakeModel.train_model(name, data, k, n_runs=100, random_state_range=None, n_thread=5, save_path='')[source]¶
Train the model and save it
- Parameters:
name (str) – name of the Train class
k (int) – number of topics to learn one LDA model using sklearn package
n_runs (int) – number of run to define rLDA model (default: 100)
random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))
data (anndata) – data in anndata format used to train the LDA model
n_thread (int) – number of threads used to learn the LDA models (default: 5)
save_path (str) – directory you want to use to save pickle file (default is saving near script)
- Topyfic.utilsAnalyseModel.GSEA(gene_list, gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]¶
Perform Gene Set Enrichment Analysis based on the topic weights using the GSEAPY package.
- Parameters:
gene_list (pandas series) – pandas series with gene names as index and their ranks/weights as values
gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)
plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: GSEA)
kwargs – Argument to pass to gseapy.prerank(). more info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank
- Returns:
dataframe containing these columns: Term: gene set name, ES: enrichment score, NES: normalized enrichment score, NOM p-val: nominal p-value (from the null distribution of the gene set), FDR q-val: FDR q-value (adjusted False Discovery Rate), FWER p-val: family-wise error rate p-value, Tag %: percent of gene set before running enrichment peak (ES), Gene %: percent of gene list before running enrichment peak (ES), Lead_genes: leading edge genes (gene hits before running enrichment peak)
- Return type:
pandas dataframe
- Topyfic.utilsAnalyseModel.MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2.0, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]¶
MA plot based on the gene weights of the given topics
- Parameters:
topic1 (pandas.series) – gene weight of first topic to be compared
topic2 (pandas.series) – gene weight of second topic to be compared
size (pandas dataframe) – table containing the dot size for each gene (genes are the index)
pseudocount (float) – pseudocount that you want to add (default: 1)
threshold (float) – threshold to filter genes based on A values (default: 1)
cutoff (float) – cutoff for categorized genes by modified z-score (default: 2)
consistency_correction (float) – the factor converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
topN (int) – number of genes to consider when calculating the z-score based on the A values (if None, the average number of genes in both topics with weights above the threshold is used)
labels (list) – list of gene names to show in the MA plot
save (bool) – indicate if you want to save the plot or not (default: True)
show (bool) – indicate if you want to show the plot or not (default: True)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: MA_plot)
- Returns:
return M and A values
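For orientation, the M and A values of an MA plot are conventionally the log2 ratio and the mean log2 value of the two inputs. The sketch below assumes that standard definition with the pseudocount added before taking logs; it does not reproduce Topyfic's code. The name ma_values is hypothetical, plain dicts stand in for the pandas series, and treating a gene missing from one topic as weight 0 is an assumption.

```python
import math


def ma_values(topic1, topic2, pseudocount=1.0):
    """Return (M, A) dicts keyed by gene (hypothetical helper).

    M = log2(w1 + pseudocount) - log2(w2 + pseudocount)
    A = (log2(w1 + pseudocount) + log2(w2 + pseudocount)) / 2
    Genes missing from one topic are given weight 0 (an assumption).
    """
    m_values, a_values = {}, {}
    for gene in sorted(set(topic1) | set(topic2)):
        x = math.log2(topic1.get(gene, 0.0) + pseudocount)
        y = math.log2(topic2.get(gene, 0.0) + pseudocount)
        m_values[gene] = x - y
        a_values[gene] = (x + y) / 2
    return m_values, a_values


# Made-up gene weights for two topics.
topic1 = {"geneA": 3.0, "geneB": 0.0}
topic2 = {"geneA": 1.0, "geneB": 0.0, "geneC": 7.0}
M, A = ma_values(topic1, topic2, pseudocount=1.0)
```

The pseudocount keeps the logarithm defined for genes with zero weight, which is why geneB ends up at M = 0 and A = 0 in this example.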
- Topyfic.utilsAnalyseModel.compare_topModels(topModels, comparison_method='Jensen–Shannon divergence', output_type='graph', threshold=0.8, topModels_color=None, topModels_label=None, ignore_genes=True, save=False, plot_show=True, figsize=None, plot_format='pdf', file_name='compare_topics')[source]¶
compare topModels using topic gene weights
- Parameters:
topModels (list of TopModel class) – list of topModel class you want to compare to each other
comparison_method (str) – indicate the method you want to use for comparing topics. if you used Jensen–Shannon, we show -log2 (options: pearson correlation, spearman correlation, Jensen–Shannon divergence, cosine similarity)
output_type (str) – indicate the type of output you want. graph: plot as a graph, heatmap: plot as a heatmap, table: table containing the correlations. Note: if you plot Jensen–Shannon divergence as a graph, the values are converted to -log2(), so you need to take that into account when defining the threshold
threshold (float) – only applies when you choose circular; only correlations above this value are shown
topModels_color (dict) – dictionary of colors mapping each topic to a color (default: blue)
topModels_label (dict) – dictionary of labels mapping each topic to a label
ignore_genes (bool) – indicate how to handle genes that are present in only one of the topics. “True” means those genes are ignored and “False” means the weights are assumed to be zero for genes that have no weight in one of the mouse models
save (bool) – indicate if you want to save the plot or not (default: False)
plot_show (bool) – indicate if you want to show the plot or not (default: True)
figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))
plot_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: compare_topics)
- Returns:
table containing the correlation between topics, returned only when output_type is table and save is False
- Return type:
pandas dataframe
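To make the Jensen–Shannon option and the ignore_genes switch concrete, here is a small pure-Python sketch of the divergence between two topics' gene weights. It is an illustration under assumptions, not Topyfic's code: the name js_divergence is hypothetical, dicts stand in for the gene-weight vectors, weights are normalized to sum to 1, and base-2 logarithms are used so the result lies in [0, 1]. Whether Topyfic works with the divergence or its square root (the Jensen–Shannon distance) is not stated in this section.

```python
import math


def js_divergence(topic1, topic2, ignore_genes=True):
    """Jensen-Shannon divergence (base 2) between two gene-weight dicts.

    ignore_genes=True  -> compare only genes present in both topics.
    ignore_genes=False -> use all genes, with weight 0 where a gene is missing.
    """
    if ignore_genes:
        genes = sorted(set(topic1) & set(topic2))
    else:
        genes = sorted(set(topic1) | set(topic2))
    if not genes:
        raise ValueError("the two topics share no genes to compare")

    p = [topic1.get(g, 0.0) for g in genes]
    q = [topic2.get(g, 0.0) for g in genes]
    p_sum, q_sum = sum(p), sum(q)
    if p_sum == 0 or q_sum == 0:
        raise ValueError("weights must not all be zero")
    p = [v / p_sum for v in p]
    q = [v / q_sum for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up gene weights: geneC exists only in the second topic.
t1 = {"geneA": 1.0, "geneB": 1.0}
t2 = {"geneA": 1.0, "geneB": 1.0, "geneC": 2.0}
shared_only = js_divergence(t1, t2, ignore_genes=True)
with_zeros = js_divergence(t1, t2, ignore_genes=False)
```

In this toy case the topics look identical when geneC is ignored and clearly different when its missing weight is treated as zero, which is the practical effect of the ignore_genes choice.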
- Topyfic.utilsAnalyseModel.functional_enrichment_analysis(gene_list, type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]¶
Perform functional enrichment analysis including GO, KEGG and REACTOME
- Parameters:
gene_list (list) – list of gene name
type (str) – indicate the type of databases which it should be one of “GO”, “REACTOME”
organism (str) – name of the organism for which you want to do functional enrichment analysis
sets (str, list, tuple) – str, list, tuple of Enrichr Library name(s). (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats) only need to fill if the type is GO
p_value (float) – Defines the pValue threshold. (default: 0.05)
file_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path used to save the plot (default: functional_enrichment_analysis)
- Topyfic.utilsAnalyseModel.modified_zscore(data, consistency_correction=1.4826)[source]¶
Returns the modified z score and Median Absolute Deviation (MAD) from the scores in data. The consistency_correction factor converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
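As a rough illustration of the description above, a modified z-score can be computed from the median and the MAD as follows. This is a sketch under assumptions rather than the library's code: the name modified_zscore_sketch is hypothetical, a plain list replaces whatever array type Topyfic accepts, and dividing by consistency_correction * MAD is the usual convention assumed here.

```python
from statistics import median


def modified_zscore_sketch(values, consistency_correction=1.4826):
    """Return (scores, mad) for a list of numbers (hypothetical helper).

    mad   = median(|x - median(values)|)
    score = (x - median(values)) / (consistency_correction * mad)
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero, modified z-score is undefined")
    scale = consistency_correction * mad
    return [(v - med) / scale for v in values], mad


# One large outlier barely moves the median or the MAD.
scores, mad = modified_zscore_sketch([1, 2, 3, 4, 100])
```

Because both the center and the scale are medians, the outlier 100 receives a very large score while the remaining values stay close to zero, unlike a mean and standard deviation based z-score.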
- Topyfic.utilsAnalyseModel.summarize_GO_Term(GO_terms, p_value=0.05, file_format='html', file_name='GO_sum')[source]¶
Summarize long, unintelligible lists of GO terms by finding a representative subset of the terms showing more unique (child) GO terms. We suggest saving it as html since it is plotted with plotly, so you can take advantage of plotly
- Parameters:
GO_terms (pandas dataframe) – dataframe containing the results of gene ontology analysis performed by GSEAPY (https://gseapy.readthedocs.io/en/latest/index.html)
p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)
file_format (str) – indicate the format of plot (default: html)
file_name (str) – name and path used to save the plot (default: GO_sum)
- Returns:
dataframe used to plot the results
- Return type:
pandas dataframe