API Documentation
- class PyWGCNA.geneExp.GeneExp(species=None, level='gene', anndata=None, geneExp=None, geneExpPath=None, sep=',', geneInfo=None, sampleInfo=None)
A class used to create gene expression anndata along with data traits, including both gene and sample information.
- Parameters:
species (str) – species of the data you use, e.g. mouse, human
level (str) – type of data you use, either gene or transcript (default: gene)
anndata (anndata) – if the expression data is in anndata format you should pass it through this parameter. X should be the expression matrix, var the gene information and obs the sample information.
geneExp (pandas dataframe) – expression matrix in which genes are in the columns and samples are in the rows
geneExpPath (str) – path of expression matrix
sep (str) – separation symbol to use for reading data in geneExpPath properly
geneInfo (pandas dataframe) – dataframe that contains gene information; it should have the same index as the gene expression column names (gene/transcript ID)
sampleInfo (pandas dataframe) – dataframe that contains sample information; it should have the same index as the gene expression index (sample ID)
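The index requirements above can be checked before building the object. The following is a minimal sketch with plain pandas, assuming toy data; the gene and sample names and the helper `inputs_are_aligned` are hypothetical and not part of PyWGCNA.

```python
import pandas as pd

# Hypothetical toy data: 3 samples (rows) x 4 genes (columns).
geneExp = pd.DataFrame(
    [[5.0, 0.0, 2.5, 9.1],
     [4.2, 0.3, 2.7, 8.8],
     [6.1, 0.1, 2.2, 9.5]],
    index=["sample1", "sample2", "sample3"],
    columns=["geneA", "geneB", "geneC", "geneD"],
)

# Gene table indexed by the expression column names (gene/transcript IDs).
geneInfo = pd.DataFrame(
    {"gene_name": ["Alpha", "Beta", "Gamma", "Delta"]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Sample table indexed by the expression index (sample IDs).
sampleInfo = pd.DataFrame(
    {"condition": ["control", "treated", "treated"]},
    index=["sample1", "sample2", "sample3"],
)


def inputs_are_aligned(expr, genes, samples):
    """Return True when both annotation tables match the expression matrix."""
    return bool(expr.columns.equals(genes.index) and expr.index.equals(samples.index))


aligned = inputs_are_aligned(geneExp, geneInfo, sampleInfo)
print(aligned)
```

`Index.equals` is order-sensitive, so this check is stricter than comparing the sets of IDs; whether the class itself needs identical ordering is not stated here.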
- static updateGeneInfo(geneExpr, geneInfo=None, path=None, sep=',')
add/update gene information in the expression anndata
- Parameters:
geneExpr (anndata) – gene expression data along with sample and genes/transcript information
geneInfo (pandas dataframe) – gene information table you want to add to your data
path (str) – path of geneInfo
sep (str) – separation symbol to use for reading data in path properly (default: ‘,’)
- Returns:
updated gene expression data along with sample and genes/transcript information
- Return type:
anndata
- static updateSampleInfo(geneExpr, sampleInfo=None, path=None, sep=',')
add/update sample metadata in the expression anndata
- Parameters:
geneExpr (anndata) – gene expression data along with sample and genes/transcript information
sampleInfo (pandas dataframe) – Sample information table you want to add to your data
path (str) – path of sampleInfo
sep (str) – separation symbol to use for reading data in path properly (default: ‘,’)
- Returns:
updated gene expression data along with sample and genes/transcript information
- Return type:
anndata
- class PyWGCNA.wgcna.WGCNA(name='WGCNA', TPMcutoff=1, powers=None, RsquaredCut=0.9, MeanCut=100, networkType='signed hybrid', TOMType='signed', minModuleSize=50, naColor='grey', cut=inf, MEDissThres=0.2, species=None, level='gene', anndata=None, geneExp=None, geneExpPath=None, sep=',', geneInfo=None, sampleInfo=None, save=False, outputPath=None, figureType='pdf')
A class used to do weighted gene co-expression network analysis.
- Parameters:
name (str) – name of the WGCNA we used to visualize data (default: ‘WGCNA’)
save (bool) – indicate if you want to save result of important steps in a figure directory (default: False)
species (str) – species of the data you use, e.g. mouse, human
level (str) – type of data you use, either gene or transcript (default: gene)
outputPath (str) – path where you want to save all your figures and objects (default: ‘’, the directory where you run your script)
anndata (anndata) – if the expression data is in anndata format you should pass it through this parameter. X should be expression matrix. var is a gene information and obs is a sample information.
geneExp (pandas dataframe) – expression matrix which genes are in the columns and samples are rows
geneExpPath (str) – path of expression matrix
sep (str) – separation symbol to use for reading data in geneExpPath properly
geneInfo (pandas dataframe) – dataframe that contains gene information; it should have the same index as the gene expression column names (gene/transcript ID)
sampleInfo (pandas dataframe) – dataframe that contains sample information; it should have the same index as the gene expression index (sample ID)
TPMcutoff (int) – cutoff for removing genes expressed below this value across samples (default: 1)
cut (float) – threshold for removing outlier samples (default: ‘inf’). By default, no samples are removed by hierarchical clustering.
powers (list of int) – different powers to test to have scale free network (default: [1:10, 11:21:2])
RsquaredCut (float) – R-squared cutoff to choose the power for having a scale free network; between 0 and 1 (default: 0.9)
MeanCut (int) – mean connectivity to choose power for having scale free network (default: 100)
networkType (str) – Type of network we can create including “unsigned”, “signed” and “signed hybrid” (default: “signed hybrid”)
TOMType (str) – Type of topological overlap matrix(TOM) including “unsigned”, “signed” (default: “signed”)
minModuleSize (int) – minimum module size; we like large modules, so it is set relatively high (default: 50)
naColor (str) – color used for genes that are not assigned to any cluster (default: “grey”)
MEDissThres (float) – dissimilarity threshold (default: 0.2)
figureType (str) – extension of figure (default: “pdf”)
MEs (ndarray) – eigengenes
geneExpr (geneExp class) – gene expression object that contains raw gene expression along with gene and sample information.
datExpr (anndata) – expression data that contains the preprocessed data
dynamicMods (list) – name of modules by clustering similar genes together
TOM (ndarray) – topological overlap measure using average linkage hierarchical clustering which inputs a measure of interconnectedness
adjacency (ndarray) – adjacency matrix calculated based on the type of network
geneTree (ndarray) – average hierarchical clustering of dissTOM matrix
power (int) – power to have scale free network (default: 6)
sft (pandas dataframe) – soft threshold table which has information for each powers
datME (pandas dataframe)
signedKME (pandas dataframe) – (signed) eigengene-based connectivity (module membership)
moduleTraitCor (pandas dataframe) – correlation between each module and metadata
moduleTraitPvalue (pandas dataframe) – p-value of correlation between each module and metadata
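The interplay of powers, RsquaredCut and MeanCut can be sketched as follows. Both the reading of the default "[1:10, 11:21:2]" as Python ranges and the selection rule (smallest power that meets both cutoffs) are assumptions for illustration; the table values and the helper `pick_power` are hypothetical.

```python
# One reading of the documented default "[1:10, 11:21:2]": 1..10, then every
# second value from 11 up to (but excluding) 21. This is an assumption.
powers = list(range(1, 11)) + list(range(11, 21, 2))

# Hypothetical soft-threshold table: (power, scale-free fit R^2, mean connectivity).
sft_rows = [
    (4, 0.71, 310.0),
    (5, 0.86, 150.0),
    (6, 0.91, 80.0),
    (7, 0.93, 45.0),
]


def pick_power(rows, rsquared_cut=0.9, mean_cut=100):
    """Return the smallest power meeting both cutoffs, or None if none qualifies."""
    for power, r2, mean_k in sorted(rows):
        if r2 >= rsquared_cut and mean_k <= mean_cut:
            return power
    return None


chosen = pick_power(sft_rows)
print(powers, chosen)
```

If no row qualifies, this sketch returns None; how the package falls back in that case is not described in this section.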
- CalculateSignedKME(exprWeights=None, MEWeights=None)
Calculation of (signed) eigengene-based connectivity, also known as module membership.
- Parameters:
exprWeights (pandas dataframe) – optional weight matrix of observation weights for datExpr, of the same dimensions as datExpr
MEWeights (pandas dataframe) – optional weight matrix of observation weights for datME, of the same dimensions as datME
- Returns:
A data frame in which rows correspond to input genes and columns to module eigengenes, giving the signed eigengene-based connectivity of each gene with respect to each eigengene.
- Return type:
pandas dataframe
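Signed eigengene-based connectivity is the correlation of each gene's expression profile with each module eigengene. The sketch below computes unweighted Pearson correlations with NumPy; it ignores the optional weight matrices, and the function name `signed_kme` and the toy data are hypothetical.

```python
import numpy as np


def signed_kme(expr, eigengenes):
    """Pearson correlation of every gene (column of expr) with every eigengene.

    expr: samples x genes array; eigengenes: samples x modules array.
    Returns a genes x modules array with values in [-1, 1].
    """
    x = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    y = (eigengenes - eigengenes.mean(axis=0)) / eigengenes.std(axis=0)
    return x.T @ y / expr.shape[0]


rng = np.random.default_rng(0)
me = rng.normal(size=(20, 2))        # two hypothetical module eigengenes
expr = np.column_stack([
    me[:, 0] * 2.0 + 1.0,            # gene that follows module 0 exactly
    -me[:, 1],                       # gene that opposes module 1 exactly
    rng.normal(size=20),             # unrelated gene
])
kme = signed_kme(expr, me)
print(kme.round(2))
```

Because the sign is kept, a gene that is anti-correlated with an eigengene gets a value near -1 rather than near 1.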
- CoexpressionModulePlot(modules, numGenes=10, numConnections=100, minTOM=0, filters=None, file_name=None)
plot Coexpression for given module
- Parameters:
modules (list of str) – name of modules you like to plot
numGenes (int) – number of genes you want to show for each module
numConnections (int) – number of connection you want to show for each module
minTOM (float) – minimum TOM to keep connections
filters (dict) – dictionary whose keys are column names of datExpr.var to filter the genes on and whose values determine which rows to keep
file_name (str) – name of the html output file (default: module names, or network if there are more than 3 modules for input)
- Returns:
save a html file with name of modules in figures directory
- PPI_network(species, moduleName=None, geneList=None, output_format='image')
retrieve an image of a STRING network of a neighborhood surrounding one or more proteins or ask STRING to show only the network of interactions between your input proteins.
- Parameters:
species (int) – NCBI taxon identifiers (e.g. Human is 9606, see: https://string-db.org/cgi/input.pl?input_page_active_form=organisms).
moduleName (str) – name of module you want to find PPI
geneList (list) – list of genes you want to find PPI
output_format (str) – format of output which can be “image”, “highres_image”, “svg” (default: “image”)
- Returns:
dataframe containing genes along with their interactions and scores
- Return type:
pandas dataframe
- static TOMsimilarity(adjMat, TOMType='signed', TOMDenom='min')
Calculation of the topological overlap matrix, and the corresponding dissimilarity, from a given adjacency matrix
- Parameters:
adjMat (pandas dataframe) – adjacency matrix, that is a square, symmetric matrix with entries between 0 and 1 (negative values are allowed if TOMType==”signed”).
TOMType (str) – one of “unsigned”, “signed”
TOMDenom (str) – a character string specifying the TOM variant to be used. Recognized values are “min” giving the standard TOM described in Zhang and Horvath (2005), and “mean” in which the min function in the denominator is replaced by mean. The “mean” may produce better results but at this time should be considered experimental.
- Returns:
A matrix holding the topological overlap.
- Return type:
pandas dataframe
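As an illustration of the “min” and “mean” denominators, here is a minimal NumPy sketch of the standard unsigned topological overlap; the signed variant is not shown, and the function name `tom_unsigned` and the toy matrix are hypothetical.

```python
import numpy as np


def tom_unsigned(adj, denom="min"):
    """Unsigned topological overlap of a symmetric adjacency matrix.

    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (f(k_i, k_j) + 1 - a_ij),
    where k_i is the connectivity of node i and f is min or mean.
    """
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)                  # self-connections do not count
    shared = a @ a                            # shared-neighbour term
    k = a.sum(axis=1)                         # connectivity of each node
    if denom == "min":
        pair = np.minimum.outer(k, k)
    elif denom == "mean":
        pair = np.add.outer(k, k) / 2.0
    else:
        raise ValueError("denom must be 'min' or 'mean'")
    tom = (shared + a) / (pair + 1.0 - a)
    np.fill_diagonal(tom, 1.0)
    return tom


adj = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.5],
                [0.2, 0.5, 1.0]])
tom_min = tom_unsigned(adj, "min")
tom_mean = tom_unsigned(adj, "mean")
diss_tom = 1.0 - tom_min                      # corresponding dissimilarity
print(tom_min.round(3))
```

The mean of two connectivities is never smaller than their minimum, so the “mean” variant gives overlaps that are at most as large as the “min” variant.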
- static adjacency(datExpr, selectCols=None, adjacencyType='unsigned', power=6, corOptions=pandas.DataFrame(), weights=None, weightArgNames=None)
Calculates (correlation or distance) network adjacency from given expression data or from a similarity
- Parameters:
datExpr (pandas dataframe) – data frame containing expression data. Columns correspond to genes and rows to samples.
selectCols (list) – for correlation networks only; can be used to select genes whose adjacencies will be calculated. Should be either a numeric list giving the indices of the genes to be used, or a boolean list indicating which genes are to be used.
adjacencyType (str) – adjacency network type. Allowed values are (unique abbreviations of) “unsigned”, “signed”, “signed hybrid”. (default = unsigned)
power (int) – soft thresholding power.
corOptions (pandas dataframe) – specifying additional arguments to be passed to the function given by corFnc.
weights (pandas dataframe) – optional observation weights for datExpr to be used in correlation calculation. A matrix of the same dimensions as datExpr, containing non-negative weights. Only used with Pearson correlation.
weightArgNames (list) – character list of length 2 giving the names of the arguments to corFnc that represent weights for variable x and y. Only used if weights are non-NULL.
- Returns:
Adjacency matrix
- Return type:
pandas dataframe
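The three adjacency types differ only in how a correlation is mapped to [0, 1] before it is raised to the soft-thresholding power. The sketch below uses the usual WGCNA definitions with plain Pearson correlation; the function name `adjacency_from_expr` is hypothetical, and weights and corOptions are ignored.

```python
import numpy as np


def adjacency_from_expr(expr, adjacency_type="unsigned", power=6):
    """Correlation-based adjacency for a samples x genes expression array."""
    cor = np.corrcoef(expr, rowvar=False)     # gene-by-gene Pearson correlation
    if adjacency_type == "unsigned":
        base = np.abs(cor)                    # |cor|
    elif adjacency_type == "signed":
        base = (1.0 + cor) / 2.0              # (1 + cor) / 2
    elif adjacency_type == "signed hybrid":
        base = np.clip(cor, 0.0, None)        # cor if positive, else 0
    else:
        raise ValueError("unknown adjacency type: " + adjacency_type)
    return base ** power


x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
expr = np.column_stack([x, -x, 2.0 * x])      # genes 0 and 1 are anti-correlated
unsigned = adjacency_from_expr(expr, "unsigned")
signed = adjacency_from_expr(expr, "signed")
hybrid = adjacency_from_expr(expr, "signed hybrid")
print(unsigned[0, 1], signed[0, 1], hybrid[0, 1])
```

Only the unsigned network treats the anti-correlated pair as strongly connected; both signed variants push it towards zero.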
- analyseWGCNA(order=None, geneList=None, show=True, alternative='two-sided')
Analysing results: 1. calculating module-trait relationships; 2. plotting module eigengene heatmaps; 3. finding GO terms for each module
- Parameters:
order (list) – indicate in which order metadata will show up in plots (should be the same as the metadata names in anndata)
geneList (pandas dataframe) – gene information you want to add (keep in mind you cannot have multiple rows for the same gene)
show (bool) – indicate if you want to see plots when you run your code
alternative (str) – Defines the alternative hypothesis for calculating correlation for module-trait relationship. Default is ‘two-sided’. The following options are available: ‘two-sided’: the correlation is nonzero, ‘less’: the correlation is negative (less than zero), ‘greater’: the correlation is positive (greater than zero)
- barplotModuleEigenGene(moduleName, metadata, combine=True, colorBar=None, show=True)
bar plot of the module eigengene for the given module
- Parameters:
moduleName (str) – module name
metadata (list) – list of metadata you want to be plotted
combine (bool) – indicate if you want to combine all metadata to show them together
colorBar – metadata you want to use to color the bar plot with
show (bool) – indicate if you want to see plots when you run your code
- static calBlockSize(matrixSize, rectangularBlocks=True, maxMemoryAllocation=None, overheadFactor=3)
find suitable block size for calculating soft power threshold
- static checkAdjMat(adjMat, min=0, max=1)
check adjacency matrix format is correct
- Parameters:
adjMat (pandas dataframe) – data we want to be checked
min (int) – minimum value to be allowed for data (default = 0)
max (int) – maximum value to be allowed for data (default = 1)
- Raises:
exit – if format is not correct
- static checkAndScaleWeights(weights, expr, scaleByMax=True)
check and scale weights of gene expression
- Parameters:
weights (pandas dataframe) – weights of gene expression
expr (pandas dataframe) – gene expression matrix
scaleByMax (bool) – if you want to scale your weights by dividing by the maximum
- Returns:
processed weights of gene expression
- Return type:
pandas dataframe
- static checkSets(data, checkStructure=False, useSets=None)
Checks whether given sets have the correct format and retrieves dimensions.
- Parameters:
data (dict) – A dict of lists; in each list there must be a component named data whose content is a matrix or dataframe or array of dimension 2.
checkStructure (bool) – If FALSE, incorrect structure of data will trigger an error. If TRUE, an appropriate flag (see output) will be set to indicate whether data has correct structure. (default = False)
useSets (list) – Optional specification of entries of the list data that are to be checked. Defaults to all components. This may be useful when data only contains information for some of the sets.
- Returns:
a dictionary containing:
“nSets”: Number of sets (length of the vector data).
“nGenes”: Number of columns in the data components in the lists. This number must be the same for all sets.
“nSamples”: A vector of length nSets giving the number of rows in the data components.
“structureOK”: Only set if the argument checkStructure equals TRUE. The value is TRUE if the parameter data passes a few tests of its structure, and FALSE otherwise. The tests are not exhaustive and are meant to catch obvious user errors rather than be bulletproof.
- Return type:
dict
- static checkSimilarity(adjMat, min=-1, max=1)
check similarity matrix format is correct
- Parameters:
adjMat (pandas dataframe) – data we want to be checked
min (int) – minimum value to be allowed for data (default = -1)
max (int) – maximum value to be allowed for data (default = 1)
- Raises:
exit – if format is not correct
- static consensusMEDissimilarityMajor(MEs, useAbs=False, useSets=None, method='consensus')
Calculates consensus dissimilarity (1-cor) of given module eigengenes realized in several sets.
- static consensusOrderMEs(MEs, useAbs=False, useSets=None, greyLast=True, greyName='MEgrey', method='consensus')
Reorder given (eigen-)vectors such that similar ones (as measured by correlation) are next to each other.
- Parameters:
MEs (dict) – Module eigengenes of several sets in a multi-set format
useAbs (bool) – Controls whether vector similarity should be given by absolute value of correlation or plain correlation. (default = False)
useSets (list) – Allows the user to specify for which sets the eigengene ordering is to be performed.
greyLast (bool) – Normally the color grey is reserved for unassigned genes; hence the grey module is not a proper module and it is conventional to put it last. If this is not desired, set the parameter to FALSE. (default = True)
greyName (str) – Name of the grey module eigengene. (default = “MEgrey”)
method (str) – A character string giving the method to be used calculating the consensus dissimilarity. Allowed values are (abbreviations of) “consensus” and “majority”. The consensus dissimilarity is calculated as the maximum of given set dissimilarities for “consensus” and as the average for “majority”.
- Returns:
A dictionary of the same type as MEs containing the re-ordered eigengenes
- Return type:
dict
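The “consensus” and “majority” methods can be illustrated with NumPy: compute the eigengene dissimilarity (1 - cor) in each set, then take the element-wise maximum or average across sets. This is a minimal sketch that takes a plain list of samples x modules arrays rather than the multi-set dictionary format; the function name `consensus_me_dissimilarity` is hypothetical.

```python
import numpy as np


def consensus_me_dissimilarity(me_sets, use_abs=False, method="consensus"):
    """Combine per-set eigengene dissimilarities (1 - cor) across sets."""
    per_set = []
    for mes in me_sets:
        cor = np.corrcoef(mes, rowvar=False)  # module-by-module correlation
        if use_abs:
            cor = np.abs(cor)
        per_set.append(1.0 - cor)
    stacked = np.stack(per_set)
    if method == "consensus":
        return stacked.max(axis=0)            # maximum of the set dissimilarities
    if method == "majority":
        return stacked.mean(axis=0)           # average of the set dissimilarities
    raise ValueError("method must be 'consensus' or 'majority'")


x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
set1 = np.column_stack([x, x])                # eigengenes agree in set 1
set2 = np.column_stack([x, -x])               # and disagree in set 2
d_consensus = consensus_me_dissimilarity([set1, set2])
d_majority = consensus_me_dissimilarity([set1, set2], method="majority")
d_abs = consensus_me_dissimilarity([set1, set2], use_abs=True)
print(d_consensus[0, 1], d_majority[0, 1], d_abs[0, 1])
```

Taking the maximum means two eigengenes count as similar only if they are similar in every set, which is the stricter of the two rules.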
- static cutree(sampleTree, cutHeight=50000.0)
Given a linkage matrix Z, return the cut tree. Remove samples/genes/modules based on hierarchical clustering.
- Parameters:
sampleTree (scipy.cluster.linkage array) – The linkage matrix.
cutHeight (array_like) – An optional height at which to cut the tree (default = 50000)
- Returns:
An array indicating group membership at each agglomeration step. I.e., for a full cut tree, in the first column each data point is in its own cluster. At the next step, two nodes are merged. Finally, all singleton and non-singleton clusters are in one group. If n_clusters or height are given, the columns correspond to the columns of n_clusters or height.
- Return type:
array
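Cutting a linkage matrix at a fixed height can be sketched with SciPy directly. The example uses `scipy.cluster.hierarchy.linkage` and `cut_tree`; whether the method wraps these exact calls is an assumption, and the rule of keeping only the largest cluster is a hypothetical way to drop outliers, shown for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage
from scipy.spatial.distance import pdist

# Five hypothetical samples on a line: three close together, two far away.
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1]])

# Average-linkage clustering of the pairwise distances.
Z = linkage(pdist(points), method="average")

# Cut the tree at height 1.0; cut_tree returns one column per requested height.
labels = cut_tree(Z, height=1.0).ravel()

# Assumption for illustration: keep only members of the largest cluster.
keep = labels == np.bincount(labels).argmax()
print(labels, keep)
```

With a very large cut height, such as the documented default, every sample ends up in a single cluster and nothing is removed.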
- static cutreeHybrid(dendro, distM, cutHeight=None, minClusterSize=20, deepSplit=1, maxCoreScatter=None, minGap=None, maxAbsCoreScatter=None, minAbsGap=None, minSplitHeight=None, minAbsSplitHeight=None, externalBranchSplitFnc=None, nExternalSplits=0, minExternalSplit=None, externalSplitOptions=pandas.DataFrame(), externalSplitFncNeedsDistance=None, assumeSimpleExternalSpecification=True, pamStage=True, pamRespectsDendro=True, useMedoids=False, maxPamDist=None, respectSmallClusters=True)
Detect clusters in a dendrogram produced by the function hclust.
- Parameters:
dendro (ndarray) – a hierarchical clustering dendrogram such as one returned by hclust.
distM (pandas dataframe) – Distance matrix that was used as input to hclust.
cutHeight (int) – Maximum joining heights that will be considered. It defaults to 99% of the range between the 5th percentile and the maximum of the joining heights on the dendrogram.
minClusterSize (int) – Minimum cluster size. (default = 20)
deepSplit (int or bool) – Either logical or integer in the range 0 to 4. Provides a rough control over sensitivity to cluster splitting. The higher the value, the more and smaller clusters will be produced. (default = 1)
maxCoreScatter (int) – Maximum scatter of the core for a branch to be a cluster, given as the fraction of cutHeight relative to the 5th percentile of joining heights.
minGap (int) – Minimum cluster gap given as the fraction of the difference between cutHeight and the 5th percentile of joining heights.
maxAbsCoreScatter (int) – Maximum scatter of the core for a branch to be a cluster given as absolute heights. If given, overrides maxCoreScatter.
minAbsGap (int) – Minimum cluster gap given as absolute height difference. If given, overrides minGap.
minSplitHeight (int) – Minimum split height given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. Branches merging below this height will automatically be merged. Defaults to zero but is used only if minAbsSplitH
minAbsSplitHeight (int) – Minimum split height given as an absolute height. Branches merging below this height will automatically be merged. If not given (default), will be determined from minSplitHeight above.
externalBranchSplitFnc – Optional function to evaluate split (dissimilarity) between two branches. Either a single function or a list in which each component is a function.
minExternalSplit (list) – Thresholds to decide whether two branches should be merged. It should be a numeric list of the same length as the number of functions in externalBranchSplitFnc above.
externalSplitOptions (pandas dataframe) – Further arguments to function externalBranchSplitFnc. If only one external function is specified in externalBranchSplitFnc above, externalSplitOptions can be a named list of arguments or a list with one component.
externalSplitFncNeedsDistance (pandas dataframe) – Optional specification of whether the external branch split functions need the distance matrix as one of their arguments. Either NULL or a logical list with one element per branch
assumeSimpleExternalSpecification (bool) – when minExternalSplit above is a scalar (has length 1), should the function assume a simple specification of externalBranchSplitFnc and externalSplitOptions. (default = True)
pamStage (bool) – If TRUE, the second (PAM-like) stage will be performed. (default = True)
pamRespectsDendro (bool) – If TRUE, the PAM stage will respect the dendrogram in the sense an object can be PAM-assigned only to clusters that lie below it on the branch that the object is merged into. (default = True)
useMedoids (bool) – if TRUE, the second stage will use object to medoid distance; if FALSE, it will use average object to cluster distance. (default = False)
maxPamDist (float) – Maximum object distance to closest cluster that will result in the object assigned to that cluster. Defaults to cutHeight.
respectSmallClusters (bool) – If TRUE, branches that failed to be clusters in stage 1 only because of insufficient size will be assigned together in stage 2. If FALSE, all objects will be assigned individually. (default = True)
- Returns:
list detailing the detected branch structure.
- Return type:
list
- findModules(kwargs_function={'cutreeHybrid': {'deepSplit': 2, 'pamRespectsDendro': False}})
Clustering genes through the original WGCNA pipeline: 1. pick soft threshold; 2. calculate adjacency matrix; 3. calculate TOM similarity matrix; 4. cluster genes based on dissTOM; 5. merge similar clusters dynamically
- Parameters:
kwargs_function (dict) – dictionary whose keys are function names and whose values are dictionaries of the parameters you want to change within that function
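A nested dictionary such as kwargs_function is typically unpacked into keyword arguments of the named step. The sketch below shows one way this could work; `cutree_hybrid` is a hypothetical stand-in and `run_step` is a hypothetical helper, so this is not a description of how findModules is implemented.

```python
def cutree_hybrid(deepSplit=1, pamRespectsDendro=True, minClusterSize=20):
    """Hypothetical stand-in that just reports the options it received."""
    return {
        "deepSplit": deepSplit,
        "pamRespectsDendro": pamRespectsDendro,
        "minClusterSize": minClusterSize,
    }


def run_step(name, func, kwargs_function, **fixed):
    """Call func with fixed arguments plus any overrides registered under name."""
    overrides = kwargs_function.get(name, {})
    return func(**{**fixed, **overrides})


# The documented default override for the cutreeHybrid step.
kwargs_function = {"cutreeHybrid": {"deepSplit": 2, "pamRespectsDendro": False}}
options = run_step("cutreeHybrid", cutree_hybrid, kwargs_function, minClusterSize=50)
print(options)
```

Steps with no entry in the dictionary simply run with their own defaults, so only the parameters you want to change need to be listed.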
- static fixDataStructure(data)
Encapsulates single-set data in a wrapper that makes the data suitable for functions working on multiset data collections.
- Parameters:
data (pandas dataframe or dict) – A dataframe, matrix or array with two dimensions to be encapsulated.
- Returns:
input data in a format suitable for functions operating on multiset data collections.
- Return type:
dict
- functional_enrichment_analysis(type, moduleName, sets=None, p_value=1, file_name=None, **kwargs)
Doing functional enrichment analysis including GO, KEGG and REACTOME
- Parameters:
type (str) – indicate the type of databases which it should be one of “GO”, “KEGG”, “REACTOME”
moduleName (str) – module name
sets (str, list, tuple) – Enrichr library name(s), or custom-defined gene sets (dict or gmt file). You can add any Enrichr libraries from here: https://maayanlab.cloud/Enrichr/#stats. Only needs to be filled if the type is GO or KEGG
p_value (float) – Defines the pValue threshold. (default: 1)
file_name (str) – name of the file you want to use to save plot (default is moduleName)
kwargs (key, value pairings) – Other keyword arguments are passed through to the underlying gseapy.enrichr() function
- getDatTraits(metaData)
get data trait module based on sample information
- Returns:
a dataframe containing information in a suitable format for plotting the module-trait relationship heatmap
- Return type:
pandas dataframe
- getGeneModule(moduleName)
get list of genes corresponding to modules
- Parameters:
moduleName (list) – name of modules
- Returns:
A dictionary containing the list of genes for the requested module(s)
- Return type:
dict
- getModulesGene(geneIds)
get list of modules corresponding to gene(s)
- Parameters:
geneIds (list or str) – gene id
- Returns:
A list containing the name of the module(s) for the requested gene(s)
- Return type:
list or str
- static goodGenesFun(datExpr, weights=None, useSamples=None, useGenes=None, minFraction=0.5, minNSamples=4, minNGenes=4, tol=None, minRelativeWeight=0.1)
Checks data for missing entries and returns a list of genes that have non-zero variance
- Parameters:
datExpr (pandas dataframe) – expression data. A data frame in which columns are genes and rows are samples.
weights (pandas dataframe) – optional observation weights in the same format (and dimensions) as datExpr.
useSamples (list of bool) – optional specifications of which samples to use for the check (Defaults to using all samples)
useGenes (list of bool) – optional specifications of genes for which to perform the check (Defaults to using all genes)
minFraction (float) – minimum fraction of non-missing samples for a gene to be considered good. (default = 1/2)
minNSamples (int) – minimum number of non-missing samples for a gene to be considered good. (default = 4)
minNGenes (int) – minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued. (default = 4)
tol (float) – An optional ‘small’ number to compare the variance against
minRelativeWeight (float) – observations whose relative weight is below this threshold will be considered missing. Here relative weight is weight divided by the maximum weight in the column (gene). (default = 0.1)
- Returns:
A logical list with one entry per gene that is TRUE if the gene is considered good and FALSE otherwise. Note that all genes excluded by useGenes are automatically assigned FALSE.
- Return type:
list of bool
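The criteria described above (enough non-missing samples and non-zero variance) can be sketched with NumPy. This is a simplified illustration that ignores weights, useSamples and useGenes; the function name `good_genes` and the default value of `tol` are assumptions, since the documented default for tol is None.

```python
import numpy as np


def good_genes(expr, min_fraction=0.5, min_n_samples=4, tol=1e-10):
    """Boolean mask with one entry per gene (column) of a samples x genes array."""
    present = ~np.isnan(expr)                       # non-missing entries
    n_present = present.sum(axis=0)
    enough = (n_present >= min_n_samples) & (
        n_present >= min_fraction * expr.shape[0]
    )
    variance = np.nanvar(expr, axis=0)              # variance over present entries
    return enough & (variance > tol)


expr = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 2.0, np.nan],
    [3.0, 2.0, 1.0],
    [4.0, 2.0, np.nan],
    [5.0, 2.0, 3.0],
    [6.0, 2.0, np.nan],
])
mask = good_genes(expr)
print(mask)
```

In this toy matrix the second gene is dropped for zero variance and the third for having too few non-missing samples.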
- static goodSamplesFun(datExpr, weights=None, useSamples=None, useGenes=None, minFraction=0.5, minNSamples=4, minNGenes=4, minRelativeWeight=0.1)
Checks data for missing entries and returns a list of samples that have non-zero variance
- Parameters:
datExpr (pandas dataframe) – expression data. A data frame in which columns are genes and rows are samples.
weights (pandas dataframe) – optional observation weights in the same format (and dimensions) as datExpr.
useSamples (list of bool) – optional specifications of which samples to use for the check (Defaults to using all samples)
useGenes (list of bool) – optional specifications of genes for which to perform the check (Defaults to using all genes)
minFraction (float) – minimum fraction of non-missing samples for a gene to be considered good. (default = 1/2)
minNSamples (int) – minimum number of non-missing samples for a gene to be considered good. (default = 4)
minNGenes (int) – minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued. (default = 4)
minRelativeWeight (float) – observations whose relative weight is below this threshold will be considered missing. Here relative weight is weight divided by the maximum weight in the column (gene). (default = 0.1)
- Returns:
A logical list with one entry per sample that is TRUE if the sample is considered good and FALSE otherwise. Note that all samples excluded by useSamples are automatically assigned FALSE.
- Return type:
list of bool
- static goodSamplesGenes(datExpr, weights=None, minFraction=0.5, minNSamples=4, minNGenes=4, tol=None, minRelativeWeight=0.1)
Checks data for missing entries, entries with weights below a threshold, and zero-variance genes. If necessary, the filtering is iterated.
- Parameters:
datExpr (pandas dataframe) – expression data. A data frame in which columns are genes and rows are samples.
weights (pandas dataframe) – optional observation weights in the same format (and dimensions) as datExpr.
minFraction (float) – minimum fraction of non-missing samples for a gene to be considered good. (default = 1/2)
minNSamples (int) – minimum number of non-missing samples for a gene to be considered good. (default = 4)
minNGenes (int) – minimum number of good genes for the data set to be considered fit for analysis. If the actual number of good genes falls below this threshold, an error will be issued. (default = 4)
tol (float) – An optional ‘small’ number to compare the variance against
minRelativeWeight (float) – observations whose relative weight is below this threshold will be considered missing. Here relative weight is weight divided by the maximum weight in the column (gene). (default = 0.1)
- Returns:
A triple containing (goodGenes, goodSamples, allOK):
goodGenes: A logical vector with one entry per gene that is TRUE if the gene is considered good and FALSE otherwise.
goodSamples: A logical vector with one entry per sample that is TRUE if the sample is considered good and FALSE otherwise.
allOK: whether everything is okay
- Return type:
list, list, bool
- static hclust(d, method='complete')
Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.
- Parameters:
d (ndarray) – a dissimilarity structure as produced by ‘pdist’.
method (str) – The linkage algorithm to use. (default = complete)
- Returns:
The hierarchical clustering encoded as a linkage matrix.
- Return type:
ndarray
- static intramodularConnectivity(mat, colors, scaleByMax=False, index=None)
Calculates intramodular connectivity, i.e., connectivity of nodes to other nodes within the same module.
- Parameters:
mat (ndarray) – adjacency which should be a square, symmetric matrix with entries between 0 and 1.
colors (list) – module labels. A list of length ncol(adjMat) giving a module label for each gene (node) of the network.
scaleByMax (bool) – should intramodular connectivity be scaled by the maximum IM connectivity in each module?
index (ndarray) – gene id or name of mat index
- Returns:
If input getWholeNetworkConnectivity is TRUE, a data frame with 4 columns giving the total connectivity, intramodular connectivity, extra-modular connectivity, and the difference of the intra- and extra-modular connectivities for all genes; otherwise a vector of intramodular connectivities
- Return type:
pandas dataframe
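The four connectivity measures named in the return value can be computed from an adjacency matrix and module labels as follows. This is a minimal NumPy sketch with a hypothetical function name and toy data; scaling by the maximum (scaleByMax) is not shown.

```python
import numpy as np


def intramodular_connectivity(adj, colors):
    """Return (kTotal, kWithin, kOut, kDiff) for each node."""
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)                        # ignore self-connections
    labels = np.asarray(colors)
    same = labels[:, None] == labels[None, :]       # True where modules match
    k_total = a.sum(axis=1)                         # connectivity to all nodes
    k_within = (a * same).sum(axis=1)               # connectivity inside own module
    k_out = k_total - k_within                      # connectivity to other modules
    return k_total, k_within, k_out, k_within - k_out


adj = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
colors = ["blue", "blue", "brown", "brown"]
k_total, k_within, k_out, k_diff = intramodular_connectivity(adj, colors)
print(k_within, k_diff)
```

A large positive difference marks a gene that is connected mainly to members of its own module.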
- static labels2colors(labels, zeroIsGrey=True, colorSeq=None, naColor='grey')
Converts a vector or array of numerical labels into a corresponding vector or array of colors corresponding to the labels.
- Parameters:
labels (list or matrix) – list or matrix of non-negative integer or other (such as character) labels.
zeroIsGrey (bool) – If TRUE, labels 0 will be assigned color grey. Otherwise, labels below 1 will trigger an error. (default = True)
colorSeq (list or matrix) – Color sequence corresponding to labels. If not given, a standard sequence will be used.
naColor (str) – Color that will encode missing values.
- Returns:
An array of character strings of the same length or dimensions as labels.
- Return type:
ndarray
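The mapping from numeric labels to colors can be sketched in plain Python. The short color sequence, the wrap-around when labels exceed its length, and the use of None for missing values are all assumptions made for this illustration; the function name `labels_to_colors` is hypothetical.

```python
# Hypothetical short color sequence; the package's standard sequence may differ.
STANDARD_COLORS = ["turquoise", "blue", "brown", "yellow", "green", "red"]


def labels_to_colors(labels, zero_is_grey=True, color_seq=None, na_color="grey"):
    """Map non-negative integer labels to color names."""
    seq = STANDARD_COLORS if color_seq is None else color_seq
    colors = []
    for label in labels:
        if label is None:                       # missing value
            colors.append(na_color)
        elif label == 0 and zero_is_grey:       # label 0 means unassigned
            colors.append("grey")
        elif label < 1:
            raise ValueError("labels below 1 are not allowed here")
        else:
            # Assumption: wrap around when there are more labels than colors.
            colors.append(seq[(label - 1) % len(seq)])
    return colors


module_colors = labels_to_colors([0, 1, 2, None, 1])
print(module_colors)
```

Label 1 maps to the first color in the sequence, so the same label always gets the same color within one call.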
- static mergeCloseModules(exprData, colors, MEs=None, useSets=None, impute=True, checkDataFormat=True, unassdColor='grey', useAbs=False, equalizeQuantiles=False, quantileSummary='mean', consensusQuantile=0, cutHeight=0.2, iterate=True, relabel=False, colorSeq=None, getNewMEs=True, getNewUnassdME=True, trapErrors=False)
Merges modules in gene expression networks that are too close as measured by the correlation of their eigengenes.
- Parameters:
exprData (pandas dataframe) – Expression data, either a single data frame with rows corresponding to samples and columns to genes, or in a multi-set format.
colors (list) – A list (numeric, character or a factor) giving module colors for genes. The method only makes sense when genes have the same color label in all sets, hence a single list.
MEs (dict) – If module eigengenes have been calculated before, the user can save some computational time by inputting them. MEs should have the same format as exprData. If they are not given, they will be calculated.
useSets (list) – A list of scalar allowing the user to specify which sets will be used to calculate the consensus dissimilarity of module eigengenes. Defaults to all given sets.
impute (bool) – Should missing values be imputed in eigengene calculation? If imputation is disabled, the presence of NA entries will cause the eigengene calculation to fail and eigengenes will be replaced by their hubgene approximation. (defualt = True)
checkDataFormat (bool) – If TRUE, the function will check exprData and MEs for correct multi-set structure. If single set data is given, it will be converted into a format usable for the function. If FALSE, incorrect structure of input data will trigger an error. (defualt = True)
unassdColor (str) – Specifies the string that labels unassigned genes. Module of this color will not enter the module eigengene clustering and will not be merged with other modules. (default = “grey”)
useAbs (bool) – Specifies whether absolute value of correlation or plain correlation (of module eigengenes) should be used in calculating module dissimilarity. (default = False)
equalizeQuantiles (bool) – Should quantiles of the eigengene dissimilarity matrix be equalized (“quantile normalized”)? The default is False for reproducibility of old code; when there are many eigengenes (e.g., at least 50), better results may be achieved if quantile equalization is used. (default = False)
quantileSummary (str) – One of “mean” or “median”. Controls how a reference dissimilarity is computed from the input ones (using mean or median, respectively). (default = “mean”)
consensusQuantile (int) – A number giving the desired quantile to use in the consensus similarity calculation. (default = 0)
cutHeight (float) – Maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging. (default = 0.2)
iterate (bool) – Controls whether the merging procedure should be repeated until there is no change. If False, only one iteration will be executed. (default = True)
relabel (bool) – Controls whether, after merging, color labels should be ordered by module size. (default = False)
colorSeq (list) – Color labels to be used for relabeling. Defaults to the standard color order used in this package if colors are not numeric, and to integers starting from 1 if colors is numeric.
getNewMEs (bool) – Controls whether module eigengenes of merged modules should be calculated and returned. (default = True)
getNewUnassdME (bool) – When doing module eigengene manipulations, the function does not normally calculate the eigengene of the ‘module’ of unassigned (‘grey’) genes. Setting this option to True will force the calculation of the unassigned eigengene in the returned newMEs, but not in the returned oldMEs. (default = True)
trapErrors (bool) – Controls whether computational errors in calculating module eigengenes, their dissimilarity, and merging trees should be trapped. If True, errors will be trapped and the function will return the input colors. If False, errors will cause the function to stop. (default = False)
- Returns:
A dictionary containing:
“colors”: Color labels for the genes corresponding to merged modules. The function attempts to mimic the mode of the input colors: if the input colors is numeric, character and factor, respectively, so is the output. Note, however, that if the function performs relabeling, a standard sequence of labels will be used: integers starting at 1 if the input colors is numeric, and a sequence of color labels otherwise.
“dendro”: Hierarchical clustering dendrogram (average linkage) of the eigengenes of the most recently computed tree. If iterate was set to True, this will be the dendrogram of the merged modules, otherwise it will be the dendrogram of the original modules.
“oldDendro”: Hierarchical clustering dendrogram (average linkage) of the eigengenes of the original modules.
“cutHeight”: The input cutHeight.
“oldMEs”: Module eigengenes of the original modules in the sets given by useSets.
“newMEs”: Module eigengenes of the merged modules in the sets given by useSets.
“allOK”: A boolean set to True.
- Raises:
trapErrors==True – If an error is trapped, a dictionary is returned containing: “colors”: A copy of the input colors. “allOK”: A boolean set to False.
- Return type:
dict
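The merging criterion described above (dissimilarity of eigengenes, i.e. 1-correlation, compared against cutHeight) can be illustrated with a minimal sketch. It does not call PyWGCNA: the helpers `pearson` and `mergeable_pairs` and the toy eigengene values are hypothetical, and the average-linkage clustering performed by the real method is replaced here by a plain pairwise check, so treat it as an illustration of the threshold only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mergeable_pairs(eigengenes, cutHeight=0.2, useAbs=False, unassdColor="grey"):
    """List module pairs whose eigengene dissimilarity is below cutHeight.

    Hypothetical helper: a pairwise check, not the hierarchical clustering
    used by the real method. The unassigned module is skipped.
    """
    names = [m for m in eigengenes if m != unassdColor]
    pairs = []
    for a, b in combinations(names, 2):
        r = pearson(eigengenes[a], eigengenes[b])
        if useAbs:
            r = abs(r)
        dissimilarity = 1 - r
        if dissimilarity < cutHeight:
            pairs.append((a, b, dissimilarity))
    return pairs


# Toy eigengenes (one value per sample); the numbers are made up.
toy_MEs = {
    "turquoise": [0.1, 0.4, 0.5, 0.9, 1.2],
    "blue": [0.2, 0.5, 0.6, 1.0, 1.1],
    "brown": [1.0, 0.2, 0.8, 0.1, 0.5],
    "grey": [0.3, 0.3, 0.2, 0.4, 0.1],
}
pairs = mergeable_pairs(toy_MEs)
for a, b, d in pairs:
    print(f"{a} and {b} qualify for merging (dissimilarity {d:.3f})")
```

In this toy data only the turquoise and blue eigengenes are strongly correlated, so they are the only pair reported; lowering cutHeight makes merging stricter.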
- static moduleEigengenes(expr, colors, impute=True, nPC=1, align='along average', excludeGrey=False, grey='grey', subHubs=True, softPower=6, scaleVar=True, trapErrors=False)[source]¶
Calculates module eigengenes (1st principal component) of modules in a given single dataset.
- Parameters:
expr (pandas dataframe) – Expression data for a single set in the form of a data frame where rows are samples and columns are genes (probes).
colors (list) – A list of the same length as the number of probes in expr, giving module color for all probes (genes). Color “grey” is reserved for unassigned genes.
impute (bool) – If True, expression data will be checked for the presence of NA entries and if the latter are present, numerical data will be imputed. (default = True)
nPC (int) – Number of principal components and variance explained entries to be calculated. Note that only the first principal component is returned; the rest are used only for the calculation of proportion of variance explained. If given nPC is greater than 10, a warning is issued. (default = 1)
align (str) – Controls whether eigengenes, whose orientation is undetermined, should be aligned with average expression (align = “along average”) or left as they are (align = “”). Any other value will trigger an error. (default = “along average”)
excludeGrey (bool) – Should the improper module consisting of ‘grey’ genes be excluded from the eigengenes? (default = False)
grey (str) – Value of colors designating the improper module. Note that if colors is a factor of numbers, the default value will be incorrect. (default = “grey”)
subHubs (bool) – Controls whether hub genes should be substituted for missing eigengenes. If True, each missing eigengene (i.e., eigengene whose calculation failed and the error was trapped) will be replaced by a weighted average of the most connected hub genes in the corresponding module. If this calculation fails, or if subHubs==False, the value of trapErrors will determine whether the offending module will be removed or whether the function will issue an error and stop. (default = True)
softPower (int) – The power used in soft-thresholding the adjacency matrix. Only used when the hubgene approximation is necessary because the principal component calculation failed. It must be non-negative. The default value should only be changed if there is a clear indication that it leads to incorrect results. (default = 6)
trapErrors (bool) – Controls handling of errors that may arise when there are too many NA entries in expression data. If True, errors from calling these functions will be trapped without abnormal exit. If False, errors will cause the function to stop. Note, however, that subHubs takes precedence in the sense that if subHubs==True and trapErrors==False, an error will be issued only if both the principal component and the hubgene calculations have failed. (default = False)
scaleVar (bool) – Can be used to turn off scaling of the expression data before calculating the singular value decomposition. The scaling should only be turned off if the data has been scaled previously, in which case the function can run a bit faster. Note however that the function first imputes, then scales the expression data in each module. If the expression data contains missing values, scaling outside of the function and letting the function impute missing data may lead to slightly different results than if the data is scaled within the function. (default = True)
- Returns:
A dictionary containing:
“eigengenes”: Module eigengenes in a dataframe, with each column corresponding to one eigengene. The columns are named by the corresponding color with an “ME” prepended, e.g., MEturquoise etc. If returnValidOnly==False, module eigengenes whose calculation failed have all components set to NA.
“averageExpr”: If align == “along average”, a dataframe containing average normalized expression in each module. The columns are named by the corresponding color with an “AE” prepended, e.g., AEturquoise etc.
“varExplained”: A dataframe in which each column corresponds to a module, with the component varExplained[PC, module] giving the variance of module module explained by the principal component no. PC. The calculation is exact irrespective of the number of computed principal components. At most 10 variance explained values are recorded in this dataframe.
“nPC”: A copy of the input nPC.
“validMEs”: A boolean vector. Each component (corresponding to the columns in data) is True if the corresponding eigengene is valid, and False if it is invalid. Valid eigengenes include both principal components and their hubgene approximations. When returnValidOnly==False, by definition all returned eigengenes are valid and the entries of validMEs are all True.
“validColors”: A copy of the input colors with entries corresponding to invalid modules set to grey if given, otherwise 0 if colors is numeric and “grey” otherwise.
“allOK”: Boolean flag signalling whether all eigengenes have been calculated correctly, either as principal components or as the hubgene average approximation.
“allPC”: Boolean flag signalling whether all returned eigengenes are principal components.
“isPC”: Boolean vector. Each component (corresponding to the columns in eigengenes) is True if the corresponding eigengene is the first principal component and False if it is the hubgene approximation or is invalid.
“isHub”: Boolean vector. Each component (corresponding to the columns in eigengenes) is True if the corresponding eigengene is the hubgene approximation and False if it is the first principal component or is invalid.
“validAEs”: Boolean vector. Each component (corresponding to the columns in eigengenes) is True if the corresponding module average expression is valid.
“allAEOK”: Boolean flag signalling whether all returned module average expressions contain valid data. Note that returnValidOnly==True does not imply allAEOK==True: some invalid average expressions may be returned if their corresponding eigengenes have been calculated correctly.
- Return type:
dict
- module_trait_relationships_heatmap(metaData, alternative='two-sided', figsize=None, show=True, file_name='module-traitRelationships')[source]¶
Plot the module-trait relationship heatmap.
- Parameters:
metaData (list) – traits whose relationship with modules you would like to see (must be column names of datExpr.obs)
alternative (str) – Defines the alternative hypothesis for calculating correlation for module-trait relationship. Default is ‘two-sided’. The following options are available: ‘two-sided’: the correlation is nonzero, ‘less’: the correlation is negative (less than zero), ‘greater’: the correlation is positive (greater than zero)
figsize (tuple of float) – indicate the size of plot
show (bool) – indicate if you want to show the plot or not (default: True)
file_name (str) – name and path used to save the plot (default: module-traitRelationships)
- static multiSetMEs(exprData, colors, universalColors=None, useSets=None, useGenes=None, impute=True, nPC=1, align='along average', excludeGrey=False, subHubs=True, trapErrors=False, softPower=6, grey=None)[source]¶
Calculates module eigengenes for several sets.
- Parameters:
exprData (pandas dataframe) – Expression data in a multi-set format
colors (list) – A list of the same length as the number of probes in expr, giving module color for all probes (genes). Color “grey” is reserved for unassigned genes.
universalColors (list) – Alternative specification of module assignment
useSets (list) – If calculations are requested in (a) selected set(s) only, the set(s) can be specified here. Defaults to all sets.
useGenes (list) – Can be used to restrict calculation to a subset of genes
impute (bool) – If True, expression data will be checked for the presence of NA entries and if the latter are present, numerical data will be imputed. (default = True)
nPC (int) – Number of principal components and variance explained entries to be calculated. Note that only the first principal component is returned; the rest are used only for the calculation of proportion of variance explained. If given nPC is greater than 10, a warning is issued. (default = 1)
align (str) – Controls whether eigengenes, whose orientation is undetermined, should be aligned with average expression (align = “along average”) or left as they are (align = “”). Any other value will trigger an error. (default = “along average”)
excludeGrey (bool) – Should the improper module consisting of ‘grey’ genes be excluded from the eigengenes? (default = False)
subHubs (bool) – Controls whether hub genes should be substituted for missing eigengenes. If True, each missing eigengene (i.e., eigengene whose calculation failed and the error was trapped) will be replaced by a weighted average of the most connected hub genes in the corresponding module. If this calculation fails, or if subHubs==False, the value of trapErrors will determine whether the offending module will be removed or whether the function will issue an error and stop. (default = True)
trapErrors (bool) – Controls handling of errors that may arise when there are too many NA entries in expression data. If True, errors from calling these functions will be trapped without abnormal exit. If False, errors will cause the function to stop. Note, however, that subHubs takes precedence in the sense that if subHubs==True and trapErrors==False, an error will be issued only if both the principal component and the hubgene calculations have failed. (default = False)
softPower (int) – The power used in soft-thresholding the adjacency matrix. Only used when the hubgene approximation is necessary because the principal component calculation failed. It must be non-negative. The default value should only be changed if there is a clear indication that it leads to incorrect results. (default = 6)
grey (str) – Value of colors or universalColors (whichever applies) designating the improper module
- Returns:
A dictionary similar in spirit to the input exprData
- Return type:
dict
- static orderMEs(MEs, greyLast=True, greyName='MEgrey', orderBy=0, order=None, useSets=None)[source]¶
Reorder given (eigen-)vectors such that similar ones (as measured by correlation) are next to each other.
- Parameters:
MEs (dict) – Module eigengenes in a multi-set format.
greyLast (bool) – Normally the color grey is reserved for unassigned genes; hence the grey module is not a proper module and it is conventional to put it last. If this is not desired, set the parameter to False. (default = True)
greyName (str) – Name of the grey module eigengene. (default = “MEgrey”)
orderBy (int) – Specifies the set by which the eigengenes are to be ordered (in all other sets as well). Defaults to the first set in useSets (or the first set, if useSets is not given). (default = 0)
order (list) – Allows the user to specify a custom ordering.
useSets (list) – Allows the user to specify for which sets the eigengene ordering is to be performed.
- Returns:
A dictionary of the same type as MEs containing the re-ordered eigengenes.
- Return type:
dict
- static pickSoftThreshold(data, dataIsExpr=True, weights=None, RsquaredCut=0.9, MeanCut=100, powerVector=None, nBreaks=10, blockSize=None, corOptions=None, networkType='unsigned', moreNetworkConcepts=False, gcInterval=None)[source]¶
Analysis of scale free topology for multiple soft thresholding powers.
- Parameters:
data (pandas dataframe) – expression data in a matrix or data frame. Rows correspond to samples and columns to genes.
dataIsExpr (bool) – should the data be interpreted as expression (or other numeric) data, or as a similarity matrix of network nodes?
weights (pandas dataframe) – optional observation weights for data to be used in correlation calculation. A matrix of the same dimensions as data, containing non-negative weights. Only used with Pearson correlation.
RsquaredCut (float) – desired minimum scale free topology fitting index (R^2). (default = 0.9)
MeanCut (int) – desired maximum mean connectivity. (default = 100)
powerVector (list of int) – A list of soft thresholding powers for which the scale free topology fit indices are to be calculated.
nBreaks (int) – number of bins in connectivity histograms (default = 10)
blockSize (int) – block size into which the calculation of connectivity should be broken up. If not given, a suitable value will be calculated using function blockSize and printed if verbose>0. If the calculation runs into memory problems, decrease this value.
corOptions (list) – a list giving further options to the correlation function specified in corFnc.
networkType (str) – network type. Allowed values are (unique abbreviations of) “unsigned”, “signed”, “signed hybrid”. (default = “unsigned”)
moreNetworkConcepts (bool) – should additional network concepts be calculated? If True, the function will calculate how the network density, the network heterogeneity, and the network centralization depend on the power. For the definition of these additional network concepts, see Horvath and Dong (2008). PloS Comp Biol.
gcInterval (int) – a number specifying the interval (in terms of individual genes) at which garbage collection will be performed. The actual interval will never be less than blockSize.
- Returns:
A tuple of powerEstimate and datout. powerEstimate: estimate of an appropriate soft-thresholding power, which is the lowest power for which the scale free topology fit (R^2) exceeds RsquaredCut and connectivity is less than MeanCut. If (R^2) is below RsquaredCut for all powers, the maximum will be returned. datout: a data frame containing the fit indices for scale free topology. The columns contain the soft-thresholding power, adjusted (R^2) for the linear fit, the linear coefficient, adjusted (R^2) for a more complicated fit model, mean connectivity, median connectivity and maximum connectivity. If input moreNetworkConcepts is True, there are 3 additional columns containing network density, centralization, and heterogeneity.
- Return type:
int and pandas dataframe
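The selection rule for powerEstimate can be sketched in a few lines. This is an illustration only, not the library's code: the function `choose_power`, the dictionary keys `power`, `r2` and `mean_k`, and the toy fit values are all hypothetical. The fallback used when no power passes (returning the power with the highest R^2) is one reading of the description above and is an assumption.

```python
def choose_power(fit_table, RsquaredCut=0.9, MeanCut=100):
    """Pick the lowest power with R^2 above RsquaredCut and mean connectivity below MeanCut."""
    passing = [
        row["power"]
        for row in fit_table
        if row["r2"] > RsquaredCut and row["mean_k"] < MeanCut
    ]
    if passing:
        return min(passing)
    # Assumption: when no power passes both cuts, fall back to the power
    # with the highest R^2. The real method may behave differently.
    return max(fit_table, key=lambda row: row["r2"])["power"]


# Toy fit indices, one row per candidate power (values are made up).
fit_table = [
    {"power": 4, "r2": 0.62, "mean_k": 310.0},
    {"power": 6, "r2": 0.91, "mean_k": 140.0},
    {"power": 8, "r2": 0.93, "mean_k": 65.0},
    {"power": 10, "r2": 0.95, "mean_k": 30.0},
]
print(choose_power(fit_table))
```

Power 6 already exceeds the R^2 cut in this toy table, but its mean connectivity is still above MeanCut, so the first power satisfying both conditions is chosen instead.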
- plotModuleEigenGene(moduleName, metadata, show=True)[source]¶
plot module eigen gene figure in given module
- Parameters:
moduleName (str) – module name
metadata (list) – list of metadata you want to be plotted
show (bool) – indicate if you want to see the plots when you run your code
- preprocess(show=True)[source]¶
Preprocess the PyWGCNA object, including removing obvious outliers among genes and samples
- Parameters:
show (bool) – indicate if you want to show your plot or not (if you set this to False, the plot will be neither shown nor saved)
- static replaceMissing(x, replaceWith)[source]¶
Replacing missing (NA) value with appropriate value (for integer number replace with 0 and for string replace with “”)
- Parameters:
x (object) – value you want to replace (single item)
replaceWith (object) – define character you want to replace na value by looking at type of data
- Returns:
object without any missing (NA) value
- static request_PPI(genes, species)[source]¶
Getting all the STRING interaction partners of the protein set
- Parameters:
genes (list) – list of genes you want to find interaction for
species (int) – NCBI taxon identifiers (e.g. Human is 9606, see: https://string-db.org/cgi/input.pl?input_page_active_form=organisms).
- Returns:
dataframe containing genes that interact with each other
- Return type:
pandas dataframe
- static request_PPI_image(params, genes, file_name, request_url='https://version-11-5.string-db.org/api/image/network')[source]¶
plot PPI interactions along with a link that directs you to the STRING webpage
- Parameters:
params (dict) – parameters for requesting
genes (list) – list of genes you want to find interaction for
file_name (str) – name of the output file
request_url (str) – suitable url for using STRING API
- static request_PPI_subset(params, request_url='https://version-11-5.string-db.org/api/tsv-no-header/interaction_partners')[source]¶
request STRING to find genes that interact with our gene list
- Parameters:
request_url (str) – suitable url for using STRING API
params (dict) – parameters for requesting
- Returns:
dataframe containing genes that interact with each other
- Return type:
pandas dataframe
- static scaleFreeFitIndex(k, nBreaks=10)[source]¶
calculates several indices (fitting statistics) for evaluating scale free topology fit.
- Parameters:
k (list) – numeric list whose components contain non-negative values
nBreaks (int) – number of bins in connectivity histograms (default = 10)
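As a rough illustration of what a scale free topology fit measures, the sketch below bins the connectivities into nBreaks equal-width bins and fits a straight line to log10(frequency) against log10(mean connectivity per bin). It is a simplified stand-in, not the library's implementation: the function name `scale_free_r2` is hypothetical, and only the slope and the plain (unadjusted) R^2 of the linear fit are computed.

```python
from math import log10


def scale_free_r2(k, nBreaks=10):
    """Return (slope, R^2) of the log-log fit of binned connectivity frequencies."""
    lo, hi = min(k), max(k)
    if hi == lo:
        raise ValueError("all connectivities are identical; nothing to fit")
    width = (hi - lo) / nBreaks
    sums = [0.0] * nBreaks
    counts = [0] * nBreaks
    for value in k:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((value - lo) / width), nBreaks - 1)
        sums[idx] += value
        counts[idx] += 1
    # Use only non-empty bins with a positive mean (log10 needs values > 0).
    points = [
        (log10(s / c), log10(c / len(k)))
        for s, c in zip(sums, counts)
        if c > 0 and s > 0
    ]
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Toy connectivities whose frequencies fall off as k**-2 (made-up data).
k = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, r2 = scale_free_r2(k)
print(f"slope = {slope:.2f}, R^2 = {r2:.2f}")
```

A strongly negative slope together with an R^2 close to 1 is what is usually read as approximately scale free behaviour.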
- setMetadataColor(col, cmap)[source]¶
set color palette for each group of metadata
- Parameters:
col (str) – name of metadata
cmap (list) – color palette
- static softConnectivity(datExpr, corOptions=pandas.DataFrame(), weights=None, type='unsigned', power=6, blockSize=1500, minNSamples=None)[source]¶
Given expression data or a similarity, the function constructs the adjacency matrix and for each node calculates its connectivity, that is the sum of the adjacency to the other nodes.
- Parameters:
datExpr (pandas dataframe) – a data frame containing the expression data, with rows corresponding to samples and columns to genes.
corOptions (pandas dataframe) – character string giving further options to be passed to the correlation function.
weights (pandas dataframe) – optional observation weights for datExpr to be used in correlation calculation. A matrix of the same dimensions as datExpr, containing non-negative weights. Only used with Pearson correlation.
type (str) – network type. Allowed values are (unique abbreviations of) “unsigned”, “signed”, “signed hybrid”.
power (int) – soft thresholding power.
blockSize (int) – block size in which adjacency is to be calculated. Too low (say below 100) may make the calculation inefficient, while too high may exhaust physical memory and slow down the computer. Should be chosen such that an array of doubles of size (number of genes) * (block size) fits into available physical memory.
minNSamples (int) – minimum number of samples available for the calculation of adjacency for the adjacency to be considered valid. If not given, defaults to the greater of ..minNSamples (currently 4) and number of samples divided by 3. If the number of samples falls below this threshold, the connectivity of the corresponding gene will be returned as NA.
- Returns:
A list with one entry per gene giving the connectivity of each gene in the weighted network.
- Return type:
ndarray
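For an unsigned network, the connectivity described above amounts to summing |correlation| raised to the soft-thresholding power over all other genes. The sketch below shows this on toy data without using PyWGCNA: `pearson` and `soft_connectivity` are hypothetical helpers, and blocking, weights and the minNSamples check are left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_connectivity(columns, power=6):
    """Connectivity per gene: sum of |cor| ** power over all other genes (unsigned network)."""
    genes = list(columns)
    return {
        g: sum(
            abs(pearson(columns[g], columns[h])) ** power
            for h in genes
            if h != g
        )
        for g in genes
    }


# Toy expression values: one list per gene, one entry per sample (made up).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with geneA
    "geneC": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with geneA and geneB
}
connectivity = soft_connectivity(expr)
for gene, value in connectivity.items():
    print(gene, round(value, 3))
```

Raising the correlation to a power suppresses weak correlations, so a gene's connectivity is dominated by the genes it is strongly correlated with.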
- top_n_hub_genes(moduleName, n=10)[source]¶
find top n hub genes based on connectivity in given module
- Parameters:
moduleName (str) – name of the module whose top n hub genes you want
n (int) – number of top hub genes
- Returns:
dataframe containing the top n hub genes along with connectivity score and additional gene information you added to your expression matrix
- Return type:
pandas dataframe
- updateGeneInfo(geneInfo=None, path=None, sep=',')[source]¶
add/update genes info in datExpr and geneExpr anndata
- Parameters:
geneInfo (pandas dataframe) – gene information table you want to add to your data
path (str) – path of geneInfo
sep (str) – separation symbol to use for reading data in path properly (default: “,”)
- updateSampleInfo(sampleInfo=None, path=None, sep=',')[source]¶
add/update metadata in datExpr and geneExpr anndata
- Parameters:
sampleInfo (pandas dataframe) – Sample information table you want to add to your data
path (str) – path of metaData
sep (str) – separation symbol to use for reading data in path properly (default: “,”)
- class PyWGCNA.comparison.Comparison(geneModules=None)[source]¶
A class used to compare PyWGCNA to another PyWGCNA or any gene marker table
- Parameters:
geneModules (dict) – gene modules of networks
jaccard_similarity (pandas dataframe) – Jaccard similarity of common genes between each pair of modules
P_value (pandas dataframe) – P value of common genes between each pair of modules
fraction (pandas dataframe) – fraction of common genes between each pair of modules
- calculateFraction()[source]¶
Calculate common fraction along multiple networks
- Returns:
dataframe containing fraction between all modules in all networks
- Return type:
pandas dataframe
- calculateJaccardSimilarity()[source]¶
Calculate jaccard similarity matrix along multiple networks
- Returns:
dataframe containing jaccard similarity between all modules in all PyWGCNA objects
- Return type:
pandas dataframe
- calculatePvalue(alternative='greater')[source]¶
Calculate p-value of module overlap across multiple networks using Fisher's exact test
- Parameters:
alternative (str) – {‘two-sided’, ‘less’, ‘greater’}, alternative hypothesis, use ‘greater’ to detect overlapping modules, ‘less’ to detect mutually exclusive modules, ‘two-sided’ to detect both (default: greater)
- Returns:
dataframe containing pvalue between all modules in all networks
- Return type:
pandas dataframe
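To make the ‘greater’ alternative concrete, the sketch below computes a one-sided Fisher's exact test p-value for the overlap of two gene sets directly from the hypergeometric distribution. It is an illustration under stated assumptions: the function `overlap_pvalue_greater` and its `n_universe` argument (the size of the background gene set) are hypothetical, and how the library itself defines the background is not described here.

```python
from math import comb


def overlap_pvalue_greater(module_a, module_b, n_universe):
    """One-sided Fisher exact p-value: probability of an overlap at least as large as observed."""
    a, b = set(module_a), set(module_b)
    overlap = len(a & b)
    # Number of ways to draw len(b) genes from the background.
    total = comb(n_universe, len(b))
    # Sum hypergeometric probabilities for overlaps >= the observed overlap.
    tail = sum(
        comb(len(a), x) * comb(n_universe - len(a), len(b) - x)
        for x in range(overlap, min(len(a), len(b)) + 1)
    )
    return tail / total


# Toy modules drawn from a background of 8 genes (made-up identifiers).
module_a = ["g1", "g2", "g3", "g4"]
module_b = ["g2", "g3", "g4", "g5"]
p_value = overlap_pvalue_greater(module_a, module_b, n_universe=8)
print(round(p_value, 4))
```

A small p-value under this alternative indicates that two modules share more genes than expected by chance.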
- static jaccard(list1, list2)[source]¶
Calculate Jaccard similarity of two lists
- Parameters:
list1 (list) – first list containing the data
list2 (list) – second list containing the data
- Returns:
jaccard similarity
- Return type:
double
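Jaccard similarity is the size of the intersection divided by the size of the union. A minimal stand-alone sketch follows; it is not the library's code, and the treatment of duplicates (ignored, since lists are converted to sets) and of two empty lists (returning 0.0) are assumptions made for this example.

```python
def jaccard(list1, list2):
    """Jaccard similarity: |intersection| / |union| of the two lists' elements."""
    set1, set2 = set(list1), set(list2)
    union = set1 | set2
    if not union:
        # Assumption: two empty lists are given a similarity of 0.0.
        return 0.0
    return len(set1 & set2) / len(union)


# Two toy modules sharing two of four distinct genes.
similarity = jaccard(["geneA", "geneB", "geneC"], ["geneB", "geneC", "geneD"])
print(similarity)
```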
- plotBubbleComparison(bubble_size='jaccard_similarity', cutoff=0.01, color=None, order1=None, order2=None, figsize=None, save=True, plot_show=True, plot_format='png', file_name='bubble_comparison')[source]¶
plot comparison matrix as a bubble plot
- Parameters:
bubble_size (str) – which information you want to use for size of bubble (options: jaccard_similarity or fraction) default: jaccard_similarity
cutoff (double) – threshold you used for defining significant comparison
color (dict) – if you want to color tick labels for each networks separately
order1 (list of str) – order of modules in PyWGCNA1 you want to show in plot (each element should match the name of a module in your first PyWGCNA)
order2 (list of str) – order of modules in PyWGCNA2 you want to show in plot (each element should match the name of a module in your second PyWGCNA)
figsize (tuple of int) – indicate the size of plot (default is based on the number of modules)
save (bool) – indicate if you want to save the plot or not (default: True)
plot_show (bool) – indicate if you want to show the plot or not (default: True)
plot_format (str) – indicate the format of plot (default: png)
file_name (str) – name and path used to save the plot (default: bubble_comparison)
- plotHeatmapComparison(color='jaccard_similarity', row_cluster=True, col_cluster=True, save=True, plot_show=True, plot_format='pdf', file_name='heatmap_comparison')[source]¶
plot heatmap comparison
- Parameters:
color (str) – how to color heatmap (options: jaccard_similarity or fraction) default: jaccard_similarity
row_cluster (bool) – If True, cluster the rows. (default True)
col_cluster (bool) – If True, cluster the columns. (default True)
save (bool) – indicate if you want to save the plot or not (default: True)
plot_show (bool) – indicate if you want to show the plot or not (default: True)
plot_format (str) – indicate the format of plot (default: pdf)
file_name (str) – name and path of the plot use for save (default: heatmap_comparison)
- plotJaccardSimilarity(color=None, cutoff=0.1, figsize=None, save=True, plot_show=True, plot_format='png', file_name='jaccard_similarity')[source]¶
Plot jaccard similarity matrix as a network
- Parameters:
color (dict) – if you want to color nodes for each networks separately
cutoff (double) – threshold you used for filtering jaccard similarity
figsize (tuple of int) – indicate the size of plot (default is based on the number of nodes that pass cutoff)
save (bool) – indicate if you want to save the plot or not (default: True)
plot_show (bool) – indicate if you want to show the plot or not (default: True)
plot_format (str) – indicate the format of plot (default: png)
file_name (str) – name and path of the plot use for save (default: jaccard_similarity)
- PyWGCNA.utils.compareNetworks(PyWGCNAs)[source]¶
Compare several PyWGCNA objects
- Parameters:
PyWGCNAs (list of PyWGCNA class) – list of PyWGCNA objects
- Returns:
Comparison object
- Return type:
Comparison class
- PyWGCNA.utils.compareSingleCell(PyWGCNAs, sc)[source]¶
Compare WGCNA and gene marker from single cell experiment
- Parameters:
PyWGCNAs (PyWGCNA class) – WGCNA object
sc (pandas dataframe) – gene marker table which has ….
- Returns:
Comparison object
- Return type:
Comparison class
- PyWGCNA.utils.getGeneList(dataset='mmusculus_gene_ensembl', attributes=['ensembl_gene_id', 'external_gene_name', 'gene_biotype'], maps=['gene_id', 'gene_name', 'go_id'], server_domain='http://ensembl.org/biomart')[source]¶
get table that map gene ensembl id to gene name from biomart
- Parameters:
dataset (string) – name of the dataset we used from biomart; mouse: mmusculus_gene_ensembl and human: hsapiens_gene_ensembl you can find more information here: https://bioconductor.riken.jp/packages/3.4/bioc/vignettes/biomaRt/inst/doc/biomaRt.html#selecting-a-biomart-database-and-dataset
attributes (list) – List the types of data we want
maps (list) – mapping between attributes and column names of gene information you want to show
server_domain (string) – URL of ensembl biomart server that you want to use to pull out the information (options: [‘’, ‘uswest’, ‘asia’])
- Returns:
table extracted from biomart related to the datasets including information from attributes
- Return type:
pandas dataframe
- PyWGCNA.utils.getGeneListGOid(dataset='mmusculus_gene_ensembl', attributes=['ensembl_gene_id', 'external_gene_name', 'go_id'], Goid='GO:0003700', server_domain='http://ensembl.org/biomart')[source]¶
get a table from biomart that maps gene id and gene name to a specific GO term
- Parameters:
dataset (string) – name of the dataset we used from biomart; mouse: mmusculus_gene_ensembl and human: hsapiens_gene_ensembl you can find more information here: https://bioconductor.riken.jp/packages/3.4/bioc/vignettes/biomaRt/inst/doc/biomaRt.html#selecting-a-biomart-database-and-dataset
attributes (list) – List the types of data we want
Goid (list or str) – GO term id(s) whose genes you would like to get
server_domain (string) – URL of ensembl biomart server that you want to use to pull out the information
- Returns:
table extracted from biomart related to the datasets including information from attributes with filtering
- Return type:
pandas dataframe