Protein function prediction with Attentive Multimodal Tied Autoencoders.

A PyTorch implementation of Multimodal Tied Autoencoder.

Abstract

A recent method (deepnf) uses a multimodal autoencoder to learn the representation for each protein. The state-of-the-art method has hundreds of millions of parameters to integrate multiple networks. In this project, we propose a multimodal tied autoencoder that constrains the decoder to share parameters with encoders.

Requirements

The codebase is implemented in Python 3.6.9. package versions used for development are just below.

tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
torch             0.4.1

Datasets

The interaction networks of Yeast and Human are collected from the STRING database v9.1.
The functional labels for yeast are obtained from the Munich Information Center for Protein Sequences (MIPS). The functional categories in MIPS are organized in a three-layered hierarchy.
The functional labels for humans are obtained from the Gene Ontology database. The GO terms for humans are grouped to obtain three distinct levels of functional categories for different specificities.

Preprocessing

To obtain global structure information, we perform random walk with restart followed by Positive Pointwise Mutual Information.

Adjacency matrix A -> Random walk with restart (RWR) -> Positive Pointwise Mutual Information (PPMI) -> feature matrix X

Presenting it as python code:

X = PPMI(RWR(A))

Network Type	Yeast	Human
Co-expression	314,013	1,576,332
cooccurence	2,664	36,128
database	33,486	319,004
experimental	219,995	618,574
fusion	1,361	3760
neighborhood	45,610	104958

Dataset	category	# number of labels
	level-1	17
Yeast	level-2	74
	level-3	154

Dataset	category	Biological Process	Cellular Component	Molecular Function
	11-30	262	82	153
Human	31-100	100	46	72
	101-300	28	20	18

Options

Training the model is handled by the main.py script which provides the following command-line arguments.

Input and output options

  --data-folder         STR    The data folder        Default is `data/`.
  --dataset             STR    The name of dataset    Default is `yeast`.
  --annotations-path    STR    Functional labels.     Default is `annotations/`.
  --networks-path       STR    Network datasets.      Default is `networks/`.

Model options

  --attn-type         STR     Type of attention              Default is `softmax`.               
  --seed              INT     Random seed.                   Default is 42.
  --epochs            INT     Number of training epochs.     Default is 20.
  --early-stopping    INT     Early stopping rounds.         Default is 10.
  --training-size     INT     Training set size.             Default is 1500.
  --validation-size   INT     Validation set size.           Default is 500.
  --learning-rate     FLOAT   Adam learning rate.            Default is 0.001.
  --dropout           FLOAT   Dropout rate value.            Default is 0.5.
  --hidden-size       INT     Layer sizes (Hidden).          Default is 2000. 
  --latent-size       INT     Layer sizes (LATENT).          Default is 600.
  --use-cuda          BOOL    Flag to use cuda               Default is False.

Examples

The following commands learn a neural network and score on the test set. Training a model on the default dataset.

python main.py

Training a model for 100 epochs.

python main.py --epochs 100

Increasing the learning rate and dropout.

python main.py --learning-rate 0.1 --dropout 0.9

Training a model with latent-size 64:

python main.py --latent-size 64

Training with Cuda:

python main.py --use-cuda

Results

The comparison of the model’s performance with the SOTA method.

Model	# number of parameters
DeepNF	312 millions
TiedDeepNF	82 millions

Our model shows a similar performance with 230 million fewer parameters.

The attention map that indicates the weights given to different networks during feature learning.

AMTAE