API Documentation

class Topyfic.train.Train(name, k, n_runs=100, random_state_range=None)[source]

A class used to train a reproducible latent Dirichlet allocation (rLDA) model

Parameters:
  • name (str) – name of the Train class

  • k (int) – number of topics learned by each LDA model (trained with the sklearn package)

  • n_runs (int) – number of runs used to define the rLDA model (default: 100)

  • random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))

  • top_models (list of TopModel) – list of TopModel objects that store all LDA models
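The documented default for random_state_range can be sketched as follows. This is a minimal illustration rather than Topyfic's own code; the helper name resolve_random_states is hypothetical.

```python
def resolve_random_states(n_runs=100, random_state_range=None):
    """Return the random states used for the LDA runs (hypothetical helper)."""
    if random_state_range is None:
        # Documented default: range(n_runs)
        random_state_range = range(n_runs)
    return list(random_state_range)


print(resolve_random_states(n_runs=3))
print(resolve_random_states(n_runs=2, random_state_range=[7, 11]))
```

Passing an explicit list keeps the individual LDA runs reproducible across sessions, since each run is tied to a known random state.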

combine_LDA_models(data, single_trains=[])[source]

combine the top_models of single Train objects

Parameters:
  • data (anndata) – data used to learn the model

  • single_trains (list) – list of single Train objects

make_LDA_models_attributes()[source]

build the LDA attributes by combining the attributes of all single LDA models; these are needed to define an LDA model (sklearn.decomposition.LatentDirichletAllocation)

Returns:

three dataframes: the first gathers the components from all LDA runs, the second gathers exp_dirichlet_component from all LDA runs, and the third combines the remaining LDA attributes

Return type:

pandas dataframe, pandas dataframe, pandas dataframe

make_single_LDA_model(data, random_state, name, learning_method, batch_size, max_iter, n_jobs, kwargs)[source]

train a simple LDA model using the sklearn package and embed it in a TopModel object

Parameters:
  • name (str) – name of LDA model

  • data (anndata) – processed expression data along with cells and genes/region information

  • random_state (int) – Pass an int for reproducible results across multiple function calls

  • max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)

  • batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)

  • learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default = ‘online’)

  • n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)

Returns:

LDA model embedded in TopModel class

Return type:

TopModel

run_LDA_models(data, learning_method='online', batch_size=1000, max_iter=10, n_jobs=None, n_thread=1, **kwargs)[source]

train LDA models

Parameters:
  • max_iter (int) – The maximum number of passes over the training data (aka epochs) (default = 10)

  • batch_size (int) – Number of documents to use in each EM iteration. Only used in online learning. (default = 1000)

  • learning_method (str) – Method used to update _component. {‘batch’, ‘online’} (default = ‘online’)

  • data (anndata) – expression data in anndata format used to train the LDA models

  • n_jobs (int) – The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. (default = None)

  • n_thread (int) – number of threads used to learn the LDA models (default = 1)

  • **kwargs – other parameters of sklearn.decomposition.LatentDirichletAllocation (more info: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

Returns:

None

Return type:

None

save_train(name=None, save_path='', file_format='pickle')[source]

save the Train object as a pickle or HDF5 file

Parameters:
  • name (str) – name of the saved file (default: train_ followed by Train.name)

  • save_path (str) – directory in which to save the file (default: next to the script)

  • file_format (str) – format of the saved file (options: pickle (default), HDF5)

class Topyfic.topic.Topic(topic_id, topic_name=None, topic_gene_weights=None, gene_information=None, topic_information=None)[source]

A class that stores a topic along with other useful information

Parameters:
  • topic_id (str) – unique ID of the topic

  • topic_name (str) – name of the topic (default: topic_id)

  • topic_gene_weights (pandas dataframe) – dataframe that contains the weights of the topic for each gene

  • gene_information (pandas dataframe) – dataframe that contains information about genes, e.g. gene biotype

  • topic_information (pandas dataframe) – dataframe that contains information about the topic, e.g. cell state / cell type

GSEA(gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]

Perform Gene Set Enrichment Analysis based on the topic weights using the GSEAPY package.

Parameters:
  • gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)

  • p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)

  • table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)

  • plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: GSEA)

  • kwargs – arguments passed to gseapy.prerank(). More info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank

Returns:

dataframe containing these columns: Term: gene set name, ES: enrichment score, NES: normalized enrichment score, NOM p-val: nominal p-value (from the null distribution of the gene set), FDR q-val: FDR q-value (adjusted False Discovery Rate), FWER p-val: family-wise error rate p-value, Tag %: percent of gene set before running enrichment peak (ES), Gene %: percent of gene list before running enrichment peak (ES), Lead_genes: leading edge genes (gene hits before running enrichment peak)

Return type:

pandas dataframe

functional_enrichment_analysis(type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]

Perform functional enrichment analysis, including GO, KEGG and REACTOME

Parameters:
  • type (str) – type of database; should be one of “GO”, “REACTOME”

  • organism (str) – name of the organism for which to run the functional enrichment analysis

  • sets (str, list, tuple) – str, list, or tuple of Enrichr Library name(s) (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats); only needed if the type is GO

  • p_value (float) – defines the p-value threshold (default: 0.05)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: functional_enrichment_analysis)

gene_weight_variance(save=True)[source]

calculate the gene weight variance

Parameters:

save (bool) – whether to add the variance to the Topic's information (default: True)

Returns:

Gene weight variance for given topic

Return type:

float
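As an illustration of what a gene weight variance looks like, the sketch below computes the population variance of a small set of made-up gene weights. Whether Topyfic uses the population or the sample variance is not stated here, so treat that choice, the function name, and the example data as assumptions.

```python
from statistics import pvariance


def gene_weight_variance_sketch(weights):
    """Population variance of one topic's gene weights (illustrative only)."""
    return pvariance(list(weights))


# Made-up gene weights for a single topic
weights = {"GeneA": 4.0, "GeneB": 2.0, "GeneC": 0.0}
variance = gene_weight_variance_sketch(weights.values())
print(variance)
```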

update_gene_information(gene_information)[source]

update or add gene information for the topic

Parameters:

gene_information (pandas dataframe) – dataframe containing the gene information to add/update (its index should match the index of gene_information in the class)

write_topic_yaml(topic_id=None, model_yaml_path='model.yaml', topic_yaml_path='topic.yaml', save=True)[source]

write topic in YAML format

Parameters:
  • topic_id (str) – unique topic ID (default is topic ID)

  • model_yaml_path (str) – path of the model YAML file, which has information about the dataset used

  • topic_yaml_path (str) – path used to save the topic

  • save (bool) – indicate if you want to save the YAML file (True) or just show it (False) (default: True)

class Topyfic.topModel.TopModel(name, N, topics=None, gene_weights=None, gene_information=None, model=None)[source]

A class that stores a model

Parameters:
  • name (str) – name of class

  • N (int) – number of topics

  • gene_weights (pandas dataframe) – dataframe that has the weights of genes for each topic; genes are the index and topics are the columns

  • topics (Dictionary of Topics) – dictionary containing all topics of the TopModel

  • model (sklearn.decomposition.LatentDirichletAllocation) – store reproducible LDA model

MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]

MA plot based on the gene weights of the two given topics

Parameters:
  • topic1 (str) – first topic to be compared

  • topic2 (str) – second topic to be compared

  • size (pandas dataframe) – table containing the dot size for each gene (genes are the index)

  • pseudocount (float) – pseudocount that you want to add (default: 1)

  • threshold (float) – threshold to filter genes based on A values (default: 1)

  • cutoff (float) – cutoff for categorizing genes by modified z-score (default: 2)

  • consistency_correction (float) – factor that converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed

  • topN (int) – number of genes to consider when calculating the z-score based on the A value (if None, it is the average number of genes in both topics with weights above the threshold)

  • labels (list) – list of gene names to show in the MA plot

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: MA_plot)

Returns:

M and A values
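The quantities behind an MA plot can be sketched with the standard definitions: M is the log2 ratio of the two weights and A is their mean log2 value, both after adding the pseudocount. The modified z-score divides the deviation from the median by consistency_correction times the MAD. This is a generic sketch, not Topyfic's code; the function names are hypothetical, and the direction of M (topic1 minus topic2) is an assumption.

```python
import math
from statistics import median


def ma_values(w1, w2, pseudocount=1):
    """Return (M, A) for one gene from its weights in two topics."""
    l1 = math.log2(w1 + pseudocount)
    l2 = math.log2(w2 + pseudocount)
    # Assumption: M is topic1 minus topic2
    return l1 - l2, (l1 + l2) / 2


def modified_z_scores(values, consistency_correction=1.4826):
    """Modified z-score: (x - median) / (consistency_correction * MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # A MAD of zero would divide by zero; a real implementation needs a guard
    return [(v - med) / (consistency_correction * mad) for v in values]


m, a = ma_values(7, 1)
print(m, a)
scores = modified_z_scores([1, 2, 3, 4, 100])
print(scores)
```

With the default cutoff of 2, a gene whose modified z-score exceeds 2 in absolute value would be flagged as an outlier under this reading.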

gene_weight_rank_heatmap(genes=None, topics=None, show_rank=True, scale=None, save=True, show=True, figsize=None, file_format='pdf', file_name='gene_weight_rank_heatmap')[source]

plot the weights of selected genes and their ranks in selected topics

Parameters:
  • genes (list) – list of genes whose weights you want to see (default: all genes)

  • topics (list) – list of topics

  • show_rank (bool) – indicate if you want to show the rank of significant genes or not (default: True)

  • scale (str) – scale applied to the plotted values: log2, log10, or None to show the actual values (default: None)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: gene_weight_rank_heatmap)

get_feature_name()[source]

get feature (gene) names

Returns:

list of feature (gene) names

Return type:

list

get_gene_weights()[source]

get feature(gene) weights

Returns:

dataframe containing feature (gene) weights; genes are the index and topics are the columns

Return type:

pandas dataframe

get_ranked_gene_weight()[source]

get sorted feature (gene) weights; each value holds a gene and its weight in the given topic

Returns:

dataframe containing features (genes) and their weights; ranks are the index and topics are the columns

Return type:

pandas dataframe
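The ranked layout described above (ranks as rows, topics as columns) can be illustrated with plain dictionaries. The sketch assumes genes are sorted from highest to lowest weight; the function name and the example weights are made up.

```python
def rank_gene_weights(gene_weights):
    """Map each topic to its (gene, weight) pairs, sorted by descending weight.

    gene_weights: {topic: {gene: weight}}
    """
    return {
        topic: sorted(weights.items(), key=lambda gw: gw[1], reverse=True)
        for topic, weights in gene_weights.items()
    }


# Made-up weights for two topics
gene_weights = {
    "Topic_1": {"GeneA": 0.5, "GeneB": 3.0, "GeneC": 1.2},
    "Topic_2": {"GeneA": 2.0, "GeneB": 0.1, "GeneC": 0.7},
}
ranked = rank_gene_weights(gene_weights)
print(ranked["Topic_1"][0])  # top-ranked gene of Topic_1 with its weight
```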

get_top_model_attributes()[source]

get the TopModel attributes needed to build a sklearn.decomposition.LatentDirichletAllocation model

Returns:

three dataframes: the first holds the components, the second holds exp_dirichlet_component, and the third combines the remaining LDA attributes

Return type:

pandas dataframe, pandas dataframe, pandas dataframe

save_rLDA_model(name='rLDA', save_path='', file_format='joblib')[source]

save rLDA model (instance of LDA model in sklearn) as a joblib/HDF5 file.

Parameters:
  • name (str) – name of the joblib file (default: rLDA)

  • save_path (str) – directory in which to save the file (default: next to the script)

  • file_format (str) – format of the file you want to save (option: joblib (default), HDF5)

save_topModel(name=None, save_path='', file_format='pickle')[source]

save TopModel class as a pickle/HDF5 file

Parameters:
  • name (str) – name of the file (default: topModel_TopModel.name)

  • save_path (str) – directory in which to save the file (default: next to the script)

  • file_format (str) – format of the file you want to save (option: pickle (default), HDF5)

class Topyfic.analysis.Analysis(Top_model, colors_topics=None, cell_participation=None)[source]

A class used to investigate the topics and gene weight compositions

Parameters:
  • Top_model (TopModel) – top model used for analysing topics and gene weight compositions and for calculating cell participation

  • colors_topics (pandas dataframe) – dataframe that maps colors to topics

  • cell_participation (anndata) – anndata that stores cell participation along with cell information in obs

TopicTraitRelationshipHeatmap(metaData, alternative='two-sided', annotation=False, save=True, show=True, file_format='pdf', file_name='topic-traitRelationships')[source]

plot topic-trait relationship heatmap

Parameters:
  • metaData (list) – traits you would like to see the relationship with topics (must be column name of cell_participation.obs)

  • alternative (str) – Defines the alternative hypothesis for calculating correlation for module-trait relationship. Default is ‘two-sided’. The following options are available: ‘two-sided’: the correlation is nonzero, ‘less’: the correlation is negative (less than zero), ‘greater’: the correlation is positive (greater than zero)

  • annotation (bool) – indicate if you want to add correlation and p_values as text in each square (default: False)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: topic-traitRelationships)
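Each square of a topic-trait heatmap summarizes how strongly a topic's cell participation tracks a trait. The sketch below computes a Pearson correlation coefficient by hand, assuming Pearson correlation is the measure used; the function name and data are illustrative, and the p-value governed by the alternative parameter is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up example: participation of four cells in one topic vs. a numeric trait
participation = [0.1, 0.4, 0.6, 0.9]
trait = [1, 2, 3, 4]
r = pearson_r(participation, trait)
print(round(r, 3))
```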

average_cell_participation(label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='average_cell_participation')[source]

bar plot showing the average cell participation in each topic

Parameters:
  • label (dict) – dictionary mapping each topic name to the name you want to display, if you want to change the default topic names

  • color (str) – color of bar plot (default: blue)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: average_cell_participation)

average_cell_participation_line_plot(topic, color, category, color_pallet=None, save=True, show=True, figsize=None, file_format='pdf', file_name='line_average_cell_participation')[source]

line plot showing average of cell participation in topic divided by two features of cells (i.e. cell type and time point)

Parameters:
  • topic (str) – name of the topic

  • color (str) – name of the feature that defines the groups, with one line per group (it should be a column name of cell_participation.obs)

  • color_pallet (dict) – color of each category of color (if None, colors are assigned randomly)

  • category (str) – name of the feature you want to have on the x axis (it should be a column name of cell_participation.obs)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: line_average_cell_participation)

calculate_cell_participation(data)[source]

Calculate cell participation for the given data

Parameters:

data (anndata) – processed expression data along with cells and genes/region information

cell_participation_distribution(plot_type='violin', threshold=0.05, max_topic=True, label=None, color='blue', save=True, show=True, figsize=None, file_format='pdf', file_name='dist_cell_participation')[source]

plot showing the distribution of cell participation for each topic, using either the max topic or all topics of each cell

Parameters:
  • plot_type (str) – type of the plot, which can be “violin” or “box”

  • threshold (float) – threshold used to filter out cells with low participation in each topic (default: 0.05)

  • max_topic (bool) – indicate if you want to consider all topics for each cell (False) or only the topic with the highest participation (max topic) for each cell (True)

  • label (dict) – dictionary mapping each topic name to the name you want to display, if you want to change the default topic names

  • color (str) – color of the plot (default: blue)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: dist_cell_participation)
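How threshold and max_topic select the values that get plotted can be sketched as follows. This is one plausible reading of the parameter descriptions, not Topyfic's implementation: the function name is hypothetical, and treating the threshold as inclusive is an assumption.

```python
def participation_values(cells, threshold=0.05, max_topic=True):
    """Collect the participation values to plot for each topic.

    cells: {cell_id: {topic: participation}}
    """
    values = {}
    for topics in cells.values():
        if max_topic:
            # Keep only the topic with the highest participation in this cell
            best = max(topics, key=topics.get)
            items = [(best, topics[best])]
        else:
            items = list(topics.items())
        for topic, value in items:
            # Assumption: values equal to the threshold are kept
            if value >= threshold:
                values.setdefault(topic, []).append(value)
    return values


cells = {
    "cell_1": {"Topic_1": 0.9, "Topic_2": 0.1},
    "cell_2": {"Topic_1": 0.04, "Topic_2": 0.03},
}
print(participation_values(cells, max_topic=True))
print(participation_values(cells, max_topic=False))
```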

static convertDatTraits(data)[source]

get trait data based on sample information

Returns:

a dataframe containing the information in a format suitable for plotting the module-trait relationship heatmap

Return type:

pandas dataframe

extract_cells(level, category, top_cells=0.05, min_cell_participation=0.05, min_cells=50, file_name=None, save=False)[source]

extract a subset of cells and their cell participation according to specific criteria

Parameters:
  • level (str) – name of the column from cell_participation.obs

  • category (list of str) – list of items you want to plot, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])

  • top_cells (float) – fraction of the cells to consider (default: 0.05)

  • min_cell_participation (float) – minimum cell participation a cell must have in a topic to be counted (default: 0.05)

  • min_cells (int) – minimum number of cells a topic must have to be reported (default: 50)

  • file_name (str) – name and path used to save the tables (default: selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv and cellParticipation_selectedCells_top{top_cells}_{min_cell_score}min_score_{min_cells}min_cells.csv)

  • save (bool) – indicate if you want to save the data or not (default: False)

Returns:

table containing the cell IDs that pass the threshold for each topic, table containing the cell participation of the cells that pass the threshold for each topic (same order as the first table)

Return type:

pandas dataframe, pandas dataframe

max_topic_cell_participation(cutoff=10, color='blue', title='Maximum cell topic participation for each cells', save=True, show=True, figsize=None, file_format='pdf', file_name='max_topic_cell_participation')[source]

step plot showing maximum cell participation

Parameters:
  • cutoff (float) – cells with maximum participation below this value are eliminated (default: 10)

  • color (str) – color of the plot (default: blue)

  • title (str) – title to add to the plot (default: Maximum cell topic participation for each cells)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: max_topic_cell_participation)

pie_structure_Chart(level, category=None, ascending=None, n=5, save=True, show=True, figsize=None, file_format='pdf', file_name='piechart_topicAvgCell')[source]

plot pie charts that show the contribution of each topic to each category (e.g. cell type)

Parameters:
  • level (str) – name of the column from cell_participation.obs

  • category (list of str) – list of items you want to plot pie charts for, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])

  • ascending (list of bool) – sort order of the data for each pie chart (default is descending for all pie charts)

  • n (int) – number of topics you want to annotate in pie charts (default: 5)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: piechart_topicAvgCell)

plot_topic_composition(category, level='topic', biotype='biotype', label=False, save=True, show=True, file_format='pdf', file_name='gene_composition')[source]

plot gene composition divided by gene biotype or topic

Parameters:
  • category (str) – topic name or gene biotype name you want to see gene composition for

  • level (str) – whether to show the composition within each topic or each gene biotype (options: “topic” or “gene_biotype”) (default: topic)

  • biotype (str) – name of the column in gene_weight to look for gene_biotype (default: biotype)

  • label (bool) – show label of each line within plot or not (default: False)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: gene_composition)

save_analysis(name=None, save_path='')[source]

save Analysis class as a pickle file

Parameters:
  • name (str) – name of the pickle file (default: analysis_Analysis.top_model.name)

  • save_path (str) – directory in which to save the pickle file (default: next to the script)

structure_plot(level, category=None, topic_order=None, ascending=None, metaData=None, metaData_palette=None, width=None, n=2, order_cells=['hierarchy'], save=True, show=True, figsize=None, file_format='pdf', file_name='structure_topicAvgCell')[source]

structure plot showing the contribution of each topic to each cell in the given categories

Parameters:
  • level (str) – name of the column from cell_participation.obs

  • category (list of str) – list of items you want to plot, which are subsets of cell_participation.obs[level] (default: all the unique items in cell_participation.obs[level])

  • topic_order (list of str) – specific order of topics, given as topic names; if None, topics are sorted by cell participation

  • ascending (list of bool) – sort order of the data for each structure plot (default is descending for all structure plots)

  • metaData (list) – column names of the information you want to add as annotation for each cell (make sure that information is in your cell_participation.obs)

  • metaData_palette (dict) – color palette for each metaData you add

  • width (list of int) – width ratios of each category (default is based on the number of cells in each category)

  • n (int) – number of topics to sum when order_cells includes ‘sum’ (default: 2)

  • order_cells (list) – sorting options to use (‘sum’, ‘hierarchy’, sort by metaData); sum: sort cells by the sum of the top n topics; hierarchy: sort data by hierarchical clustering; metaData: sort by metaData (default: [‘hierarchy’])

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path used to save the plot (default: structure_topicAvgCell)
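The ‘sum’ ordering option can be sketched as follows. The sketch assumes that the "top n topics" are the n topics with the largest total participation across the plotted cells; the description could also be read as each cell's own n largest topics, so treat this reading, the function name, and the data as assumptions.

```python
def order_cells_by_sum(participation, n=2):
    """Sort cell IDs by their summed participation in the top n topics.

    participation: {cell_id: {topic: participation}}
    """
    totals = {}
    for topics in participation.values():
        for topic, value in topics.items():
            totals[topic] = totals.get(topic, 0.0) + value
    # Assumption: "top" topics are those with the largest total participation
    top_topics = sorted(totals, key=totals.get, reverse=True)[:n]

    def score(cell):
        return sum(participation[cell].get(topic, 0.0) for topic in top_topics)

    return sorted(participation, key=score, reverse=True)


cells = {
    "c1": {"T1": 0.1, "T2": 0.4, "T3": 0.5},
    "c2": {"T1": 0.6, "T2": 0.3, "T3": 0.1},
    "c3": {"T1": 0.5, "T2": 0.1, "T3": 0.4},
}
print(order_cells_by_sum(cells, n=2))
```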

Topyfic.utilsMakeModel.calculate_leiden_clustering(trains, data, n_top_genes=None, resolution=1, max_iter_harmony=10, min_cell_participation=None, file_format='pdf')[source]

Perform Leiden clustering with or without harmony, depending on the number of assays, and then remove low-participation topics

Parameters:
  • trains (list of Train) – list of Train objects

  • data (anndata) – gene-count data with cells and genes information

  • n_top_genes (int) – Number of highly-variable genes to keep (default: 50)

  • resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)

  • max_iter_harmony (int) – number of iteration for running harmony (default: 10)

  • min_cell_participation (float) – minimum cell participation a topic must have to be kept; when None, topics with cell participation of more than 1% of #cells (#cells / 100) are kept

  • file_format (str) – indicate the format of plot (default: pdf)

Returns:

final TopModel instance after clustering and trimming, dataframe containing which run goes to which topic

Return type:

TopModel, pandas dataframe
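The default trimming rule for min_cell_participation can be sketched as below. The sketch interprets a topic's cell participation as the sum of its participation over all cells, which is an assumption; the function name and data are hypothetical.

```python
def topics_to_keep(cell_participation, min_cell_participation=None):
    """Return the topics whose total cell participation exceeds the threshold.

    cell_participation: {topic: [participation of each cell]}
    """
    n_cells = len(next(iter(cell_participation.values())))
    if min_cell_participation is None:
        # Documented default: 1% of the number of cells (#cells / 100)
        min_cell_participation = n_cells / 100
    return [
        topic
        for topic, values in cell_participation.items()
        if sum(values) > min_cell_participation
    ]


# Made-up participation of 200 cells in two topics
cell_participation = {"Topic_A": [0.5] * 200, "Topic_B": [0.005] * 200}
print(topics_to_keep(cell_participation))
print(topics_to_keep(cell_participation, min_cell_participation=0.5))
```

With 200 cells the default threshold is 2, so Topic_B (total participation of about 1) is dropped unless a lower threshold is passed explicitly.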

Topyfic.utilsMakeModel.combine_topModels(topModels, name='Combined_TopModel', data=None, min_cell_participation=None)[source]

Combine TopModels. No method is applied when combining them; all models are simply combined as they are.

Parameters:
  • topModels (list of TopModel) – list of TopModels you want to combine

  • name (str) – name of the combined TopModel

  • data (anndata) – if you want to remove topics with low cell participation, pass the data you used to train the models

  • min_cell_participation (float) – minimum cell participation a topic must have to be kept; when None, topics with cell participation of more than 1% of #cells (#cells / 100) are kept

Returns:

the combined TopModel, number of topics, gene weights

Return type:

TopModel, int, pandas DataFrame

Topyfic.utilsMakeModel.filter_LDA_model(main_lda, keep)[source]

filter LDA based on the topics we want to keep

Parameters:
  • main_lda (sklearn.decomposition.LatentDirichletAllocation) – Latent Dirichlet Allocation with online variational Bayes algorithm.

  • keep (pandas dataframe) – dataframe that defines which topics we want to keep

Returns:

Latent Dirichlet Allocation with online variational Bayes algorithm, weights of genes in each topics (indexes are topics and columns are genes)

Return type:

sklearn.decomposition.LatentDirichletAllocation, pandas dataframe

Topyfic.utilsMakeModel.initialize_lda_model(components, exp_dirichlet_component, others)[source]

Initialize LDA model by passing all necessary attributes

Parameters:
  • components (pandas dataframe) – Variational parameters for topic gene distribution

  • exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution

  • others (pandas dataframe) – dataframe containing the remaining necessary attributes, including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.

Returns:

Latent Dirichlet Allocation with online variational Bayes algorithm.

Return type:

sklearn.decomposition.LatentDirichletAllocation
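The fallback described for doc_topic_prior and topic_word_prior (1 / n_components when the value is None) can be written out as a tiny sketch. The helper name resolve_prior is hypothetical and is not part of Topyfic or sklearn.

```python
def resolve_prior(prior, n_components):
    """Return the prior, falling back to 1 / n_components when it is None."""
    return 1 / n_components if prior is None else prior


print(resolve_prior(None, n_components=4))
print(resolve_prior(0.1, n_components=4))
```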

Topyfic.utilsMakeModel.initialize_rLDA_model(all_components, all_exp_dirichlet_component, all_others, clusters)[source]

Initialize reproducible LDA model by calculating all necessary attributes using clustering.

Parameters:
  • all_components (pandas dataframe) – Variational parameters for topic gene distribution from all single LDA models

  • all_exp_dirichlet_component (pandas dataframe) – Exponential value of expectation of log topic gene distribution from all single LDA models

  • all_others (pandas dataframe) – dataframe containing the remaining necessary attributes, including: n_batch_iter: Number of iterations of the EM step. n_features_in: Number of features seen during fit. n_iter: Number of passes over the dataset. bound: Final perplexity score on training set. doc_topic_prior: Prior of document topic distribution theta. If the value is None, it is 1 / n_components. topic_word_prior: Prior of topic word distribution beta. If the value is None, it is 1 / n_components.

  • clusters (pandas dataframe) – dataframe that maps each LDA run to a cluster

Returns:

Latent Dirichlet Allocation with online variational Bayes algorithm.

Return type:

sklearn.decomposition.LatentDirichletAllocation

Topyfic.utilsMakeModel.make_analysis_class(top_model, data, colors_topics=None, save_path='')[source]

Create an Analysis object

Parameters:
  • top_model (TopModel) – top model used for analysing topics and gene weight compositions and for calculating cell participation

  • data (anndata) – processed expression data along with cells and genes/region information

  • colors_topics (pandas dataframe) – dataframe that maps colors to topics

  • save_path (str) – directory in which to save the pickle file (default: next to the script)

Topyfic.utilsMakeModel.make_topModel(trains, data, n_top_genes=50, resolution=1, file_format='pdf', save_path='')[source]

Create a TopModel based on the Train objects and save it along with clustering information

Parameters:
  • trains (list of Train) – list of Train objects

  • data (anndata) – expression data in anndata format along with cells and genes/region information

  • n_top_genes (int) – Number of highly-variable genes to keep (default: 50)

  • resolution (int) – A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. (default: 1)

  • file_format (str) – indicate the format of plot (default: pdf)

  • save_path (str) – directory in which to save the pickle file (default: next to the script)

Topyfic.utilsMakeModel.plot_cluster_contribution(clustering, feature, show_all=False, portion=True, save=True, show=True, file_format='pdf', file_name='cluster_contribution')[source]

bar plot showing the number of topics contributing to each cluster

Parameters:
  • clustering (pandas dataframe) – dataframe that maps each single LDA run to a cluster

  • feature (str) – name of the feature whose cluster contribution you want to see (should be one of the column names of the clustering dataframe)

  • show_all (bool) – indicate if you want to show all clusters or only the ones that pass the threshold (default: False)

  • portion (bool) – indicate if you want to normalize the bars to show percentages instead of actual values (default: True)

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: cluster_contribution)

Topyfic.utilsMakeModel.read_analysis(file)[source]

reading analysis pickle file

Parameters:

file (str) – path of the pickle file

Returns:

analysis instance

Return type:

Analysis class

Topyfic.utilsMakeModel.read_model_yaml(model_yaml_path='model.yaml', topic_yaml_path=None, cell_topic_participation_path=None, save=True)[source]

Read YAML files and build a TopModel object from topics written in YAML format

Parameters:
  • model_yaml_path (str) – model yaml path

  • topic_yaml_path (str) – path that you use to save all topics information

  • cell_topic_participation_path (str) – path of cell-topic participation

  • save (bool) – indicate if you want to save objects (topmodel and analysis) as a pickle file (default: True)

Returns:

Topmodel and analysis objects

Return type:

TopModel, Analysis

Topyfic.utilsMakeModel.read_topModel(file)[source]

reading topModel pickle/HDF5 file

Parameters:

file (str) – path of the pickle/HDF5 file

Returns:

topModel instance

Return type:

TopModel class

Topyfic.utilsMakeModel.read_train(file)[source]

reading train pickle file

Parameters:

file (str) – path of the pickle file

Returns:

train instance

Return type:

Train class

Topyfic.utilsMakeModel.subset_data(data, keep, loc='var')[source]

Subsetting data

Parameters:
  • data (anndata) – data we want to subset

  • keep (list) – values in the obs/var_names

  • loc – subsetting in which direction (default: ‘var’)

Returns:

data we want to keep

Return type:

anndata

Topyfic.utilsMakeModel.train_model(name, data, k, n_runs=100, random_state_range=None, n_thread=5, save_path='')[source]

Train the model and save it

Parameters:
  • name (str) – name of the Train class

  • k (int) – number of topics to learn one LDA model using sklearn package

  • n_runs (int) – number of runs used to define the rLDA model (default: 100)

  • random_state_range (list of int) – list of random states used to run the LDA models (default: range(n_runs))

  • data (anndata) – data embedded in anndata format used to train the LDA model

  • n_thread (int) – number of threads used to learn the LDA models (default=5)

  • save_path (str) – directory where the pickle file is saved (default: saved near the script)

Topyfic.utilsAnalyseModel.GSEA(gene_list, gene_sets='GO_Biological_Process_2021', p_value=0.05, table=True, plot=True, file_format='pdf', file_name='GSEA', **kwargs)[source]

Run Gene Set Enrichment Analysis based on the topic weights using the GSEAPY package.

Parameters:
  • gene_list (pandas series) – pandas series with gene names as the index and their ranks/weights as values

  • gene_sets (str, list, tuple) – Enrichr Library name or .gmt gene sets file or dict of gene sets. (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats)

  • p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)

  • table (bool) – indicate if you want to save all GO terms that passed the threshold as a table (default: True)

  • plot (bool) – indicate if you want to plot all GO terms that passed the threshold (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: GSEA)

  • kwargs – Argument to pass to gseapy.prerank(). more info: https://gseapy.readthedocs.io/en/latest/run.html?highlight=gp.prerank#gseapy.prerank

Returns:

dataframe containing these columns: Term: gene set name, ES: enrichment score, NES: normalized enrichment score, NOM p-val: nominal p-value (from the null distribution of the gene set), FDR q-val: FDR q-value (adjusted False Discovery Rate), FWER p-val: family-wise error rate p-value, Tag %: percent of gene set before running enrichment peak (ES), Gene %: percent of gene list before running enrichment peak (ES), Lead_genes: leading edge genes (gene hits before running enrichment peak)

Return type:

pandas dataframe

Topyfic.utilsAnalyseModel.MA_plot(topic1, topic2, size=None, pseudocount=1, threshold=1, cutoff=2.0, consistency_correction=1.4826, labels=None, save=True, show=True, file_format='pdf', file_name='MA_plot')[source]

MA plot based on the gene weights of the given topics

Parameters:
  • topic1 (pandas.series) – gene weight of first topic to be compared

  • topic2 (pandas.series) – gene weight of second topic to be compared

  • size (pandas dataframe) – table containing the dot size for each gene (genes are the index)

  • pseudocount (float) – pseudocount that you want to add (default: 1)

  • threshold (float) – threshold to filter genes based on A values (default: 1)

  • cutoff (float) – cutoff for categorizing genes by modified z-score (default: 2)

  • consistency_correction (float) – the factor that converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed

  • topN (int) – number of genes to consider when calculating the z-score based on the A value (if None, the average number of genes in both topics with weights above the threshold is used)

  • labels (list) – list of gene names you wish to show in the MA plot

  • save (bool) – indicate if you want to save the plot or not (default: True)

  • show (bool) – indicate if you want to show the plot or not (default: True)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: MA_plot)

Returns:

return M and A values
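
For orientation, M and A values are conventionally defined as the log ratio and the mean log value of two measurements. The sketch below applies that conventional definition to two topics given as plain dictionaries (gene name to weight). The function name ma_values, the use of log2 and the restriction to genes shared by both topics are assumptions, and the exact formula used inside MA_plot may differ.

```python
from math import log2


def ma_values(weights1, weights2, pseudocount=1.0):
    """Return {gene: (M, A)} for genes present in both topics.

    M = log2(w1 + pseudocount) - log2(w2 + pseudocount)
    A = (log2(w1 + pseudocount) + log2(w2 + pseudocount)) / 2
    """
    result = {}
    for gene in weights1.keys() & weights2.keys():
        x = log2(weights1[gene] + pseudocount)
        y = log2(weights2[gene] + pseudocount)
        result[gene] = (x - y, (x + y) / 2)
    return result


# Hypothetical gene weights for two topics.
topic1 = {"GeneA": 7.0, "GeneB": 1.0, "GeneC": 0.0}
topic2 = {"GeneA": 1.0, "GeneB": 1.0, "GeneD": 3.0}

ma = ma_values(topic1, topic2)
```

The pseudocount keeps the logarithm defined for genes with zero weight and dampens ratios between very small weights.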

Topyfic.utilsAnalyseModel.compare_topModels(topModels, comparison_method='Jensen–Shannon divergence', output_type='graph', threshold=0.8, topModels_color=None, topModels_label=None, ignore_genes=True, save=False, plot_show=True, figsize=None, plot_format='pdf', file_name='compare_topics')[source]

compare topModels using topic gene weights

Parameters:
  • topModels (list of TopModel class) – list of TopModel objects you want to compare with each other

  • comparison_method (str) – indicate the method you want to use for comparing topics. If you use Jensen–Shannon divergence, -log2 of the value is shown (options: pearson correlation, spearman correlation, Jensen–Shannon divergence, cosine similarity)

  • output_type (str) – indicate the type of output you want. graph: plot as a graph, heatmap: plot as a heatmap, table: table containing the correlations. Note: if you plot Jensen–Shannon divergence as a graph, the values are converted to -log2(), so you need to take that into account when defining the threshold

  • threshold (float) – only applies when you choose circular; only correlations above this value are shown

  • topModels_color (dict) – dictionary mapping each topic to a color (default: blue)

  • topModels_label (dict) – dictionary mapping each topic to a label

  • ignore_genes (bool) – indicate how to handle genes that are present in only one of the topics. “True” means those genes are ignored and “False” means the weights are assumed to be zero for genes that have no weight in one of the models

  • save (bool) – indicate if you want to save the plot or not (default: False)

  • plot_show (bool) – indicate if you want to show the plot or not (default: True)

  • figsize (tuple of int) – indicate the size of plot (default: (10 * (len(category) + 1), 10))

  • plot_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: compare_topics)

Returns:

table containing the correlations between topics, returned only when output_type is table and save is False

Return type:

pandas dataframe
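
As a rough illustration of the Jensen–Shannon option and of what ignore_genes changes, the following sketch works on plain dictionaries (gene name to weight) rather than TopModel objects. The helper names align_weights and js_divergence are hypothetical, and the normalization to probability vectors and the base-2 logarithm are assumptions; Topyfic's own implementation may differ.

```python
from math import log2


def align_weights(topic1, topic2, ignore_genes=True):
    """Return two weight lists over a common gene order.

    ignore_genes=True keeps only genes present in both topics;
    ignore_genes=False keeps all genes and fills missing weights with 0.
    """
    if ignore_genes:
        genes = sorted(topic1.keys() & topic2.keys())
    else:
        genes = sorted(topic1.keys() | topic2.keys())
    p = [topic1.get(g, 0.0) for g in genes]
    q = [topic2.get(g, 0.0) for g in genes]
    return p, q


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two non-negative vectors."""
    sp, sq = sum(p), sum(q)
    if sp == 0 or sq == 0:
        raise ValueError("each vector needs at least one positive weight")
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the sum.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topics that share only GeneB.
t1 = {"GeneA": 1.0, "GeneB": 1.0}
t2 = {"GeneB": 1.0, "GeneC": 1.0}

jsd_shared = js_divergence(*align_weights(t1, t2, ignore_genes=True))
jsd_union = js_divergence(*align_weights(t1, t2, ignore_genes=False))

# -log2 transform as mentioned for the graph output; it is undefined
# when the divergence is exactly 0.
neg_log_union = -log2(jsd_union)
```

Restricting the comparison to shared genes makes these two topics look identical, while zero-filling the missing genes exposes the difference, so the ignore_genes setting should be chosen before picking a threshold.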

Topyfic.utilsAnalyseModel.functional_enrichment_analysis(gene_list, type, organism, sets=None, p_value=0.05, file_format='pdf', file_name='functional_enrichment_analysis')[source]

Run functional enrichment analysis including GO, KEGG and REACTOME

Parameters:
  • gene_list (list) – list of gene names

  • type (str) – indicate the type of database, which should be one of “GO”, “REACTOME”

  • organism (str) – name of the organism for which you want to run functional enrichment analysis

  • sets (str, list, tuple) – str, list, tuple of Enrichr Library name(s). (you can add any Enrichr Libraries from here: https://maayanlab.cloud/Enrichr/#stats) only need to fill if the type is GO

  • p_value (float) – Defines the pValue threshold. (default: 0.05)

  • file_format (str) – indicate the format of plot (default: pdf)

  • file_name (str) – name and path of the plot use for save (default: functional_enrichment_analysis)

Topyfic.utilsAnalyseModel.modified_zscore(data, consistency_correction=1.4826)[source]

Returns the modified z score and Median Absolute Deviation (MAD) from the scores in data. The consistency_correction factor converts the MAD to the standard deviation for a given distribution. The default value (1.4826) is the conversion factor if the underlying data is normally distributed
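
A minimal standard-library sketch of this calculation is shown below, assuming the usual definition in which each value's deviation from the median is divided by consistency_correction times the MAD. It is an illustration only, not the Topyfic source, and the guard for a zero MAD is an added assumption.

```python
from statistics import median


def modified_zscore_sketch(values, consistency_correction=1.4826):
    """Return (modified z-scores, MAD) for a sequence of numbers."""
    med = median(values)
    # Median absolute deviation: median of |x - median(x)|.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    scale = consistency_correction * mad
    return [(v - med) / scale for v in values], mad


# Hypothetical scores with one clear outlier.
scores, mad = modified_zscore_sketch([1, 2, 3, 4, 100])
```

Because both the center and the spread are medians, the single outlier barely affects the scale, so it receives a very large score instead of inflating the spread the way it would with a mean and standard deviation.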

Topyfic.utilsAnalyseModel.summarize_GO_Term(GO_terms, p_value=0.05, file_format='html', file_name='GO_sum')[source]

Summarize long, unintelligible lists of GO terms by finding a representative subset of the terms showing more unique (child) GO terms. We suggest saving it as html, since it is plotted with plotly, so you can take advantage of plotly

Parameters:
  • GO_terms (pandas dataframe) – dataframe containing the results of a gene ontology analysis performed by GSEAPY (https://gseapy.readthedocs.io/en/latest/index.html)

  • p_value (float) – Defines the pValue threshold for plotting. (default: 0.05)

  • file_format (str) – indicate the format of plot (default: html)

  • file_name (str) – name and path of the plot use for save (default: GO_sum)

Returns:

dataframe used to plot the results

Return type:

pandas dataframe