Specific projects (2000-2015)

  • National Repository of Molecules: Molecular inventory: Organic molecules synthesized by scientists need to be analyzed for purity, authenticity and activity, and tracking samples through this process requires a unique identification system. NCL has established direct barcoding of truly computable chemical structures so that molecules can be entered directly into computational programs and inventory systems. The linear representation of a molecule, either in standard SMILES format or in the in-house developed compact ACS (Automatic Chemical Structure) format, is barcoded and attached to the chemical sample for inventory and tracking applications. The same encoding strategy can be used with emerging wireless RF-based technologies. (http://moltable.ncl.res.in)
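As a rough illustration of barcoding a linear structure string, the sketch below turns a SMILES string into barcode symbol values. The choice of Code 128 (code set B) is an assumption made for the example; the text above does not say which barcode symbology is used, and the function name `smiles_to_code128b` is hypothetical.

```python
# Sketch only: Code 128 (code set B) is an ASSUMED symbology; the source does
# not name the barcode standard that is actually used.

START_B = 104  # Code 128 start symbol for code set B
STOP = 106     # Code 128 stop symbol


def smiles_to_code128b(smiles: str) -> list[int]:
    """Return Code 128-B symbol values: start, data, checksum, stop."""
    values = []
    for ch in smiles:
        if not 32 <= ord(ch) <= 126:
            raise ValueError(f"character {ch!r} is not encodable in code set B")
        values.append(ord(ch) - 32)  # code set B maps ASCII 32..126 to 0..94
    # Checksum: start value plus position-weighted data values, modulo 103.
    checksum = (START_B + sum(i * v for i, v in enumerate(values, start=1))) % 103
    return [START_B, *values, checksum, STOP]


symbols = smiles_to_code128b("CCO")  # SMILES for ethanol
print(symbols)
```

Because the barcode carries the structure string itself rather than an arbitrary sample number, a scanner can hand the decoded SMILES straight to a structure-aware program without a database lookup.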
  • QSPR studies: We recently established a QSPR strategy for predicting the melting points of diverse classes of organic molecules from their chemical structures using an artificial neural network. The model was built from a sample of 5000+ organic molecules with experimental melting points, together with computed 2D and 3D molecular descriptors. A similar strategy is being applied to biological activity data to predict biological activity from molecular structure (QSAR): a study on the NCI-Cancer dataset of about 32,000 molecular structures and the NCI-AIDS dataset of about 40,000 molecular structures with activity profiles, and a study on ~53,000 drug-like compounds with multiple “Therapeutic category” labels.
  • Cloud computing and Chemoinformatics Initiatives. Project sponsored by an external funding agency (Department of Science and Technology): Here we proposed to design, build, manage and operate an open source toolkit for a cloud computing system for chemoinformatics that interconnects client computers in a network environment. The pooled computing power of the clients and the higher network bandwidth would be explored for scientific computing in chemistry and biology, focused on the design of new functional materials with higher efficacy. The rapidly growing scientific literature on new materials in chemistry and biology requires a computing architecture for analysis and simulation that can handle terabytes of data at a much faster rate. Grid computing architectures are already used successfully in weather forecasting and the analysis of astronomical data. Our earlier experience in designing and developing open source distributed computing applications, one for harvesting chemical data from the internet using Google[1] and one for chemical computing on 12 million molecular structures from PubChem[2], would be extended to the design of an open source cloud computing system for generic chemoinformatics applications, including the design of new materials and molecules.
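The idea of pooling client computing power can be sketched as a simple split, compute and merge pattern. This is a minimal single-machine sketch, not the project's toolkit: worker threads stand in for networked clients, and `process_chunk` is a hypothetical task whose "property" (string length) is only a placeholder for a real molecular calculation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(smiles_chunk):
    """Hypothetical per-client task; string length is a placeholder property."""
    return [(s, len(s)) for s in smiles_chunk]


def run_distributed(smiles_list, n_workers=4, chunk_size=2):
    """Farm chunks out to workers and merge the results in input order."""
    chunks = chunked(smiles_list, chunk_size)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_chunk, chunks)  # map preserves chunk order
        return [pair for chunk_result in results for pair in chunk_result]


molecules = ["CCO", "c1ccccc1", "CC(=O)O", "C1CCCCC1", "N"]
print(run_distributed(molecules))
```

Each molecule is processed independently, so the workload divides cleanly across however many clients are available; only the chunk size and the merge step need tuning.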
  • Pattern Recognition based inter-conversion of chemical images into 3D structures: Traditionally, chemical structures in the literature are identified by textual names or by images of structures. Chemical structures are the international language of chemistry and convey a great deal of information to a chemist, but not to a computer. Making these images machine readable requires a chemical image recognition capability. Two-dimensional chemical structures published in the literature are usually stored as bitmap images. Although neat chemical structures for publications are usually created with chemical drawing programs in a reusable format, this information is lost in the publication process. Since published data is available in PDF format, recovering molecular structures in a reusable vector graphics format from these bitmap images is a challenging task. Here we propose to build, based on our preliminary studies, a pattern recognition based chemoinformatics application that converts images of graphical chemical representations, as they appear in journal articles, patent documents, text books and trade magazines, into truly computable formats in both 2D and 3D conformations. Currently the program can read several image formats (GIF, JPEG, PNG, TIFF, BMP, PDF, etc.) and generate molecular structures in SMILES, SDF and MOL formats suitable for chemical structure databases. Since optical recognition is unlikely to be perfect, attempts are being made to improve the quality of the reproduced chemical structures. We envision applying this strategy to extract truly computable molecular structures in 3D format, with an optimized lowest-energy conformation, by processing PDF files downloaded from chemical journals (for example J. Org. Chem., Org. Lett., etc.).
This would save several hours of manually drawing chemical structures with traditional drawing tools, and chemical structure validation algorithms are expected to correct errors automatically. One successful strategy in drug design is the modification and improvement of already existing molecules. In the pharmaceutical industry, analog design is often driven by competitive and economic factors: the aim is to modify a molecule's structure and some of its physical and chemical properties while retaining or improving its therapeutic use. This project can become a platform for designing novel as well as analogous anticancer agents. Further, this methodology will be used for harvesting novel anti-cancer agents from full-text articles of the scientific literature (PDF) and building an inventory of them for QSAR and QSTR related studies.
  • Database Development: MIMMS: Discovering drug-like scaffolds of Medicinally Important Molecules from Marine Species. The ocean is considered a great source of potential drugs in spite of its limited accessibility. In this work, we developed the database MIMMS (Medicinally Important Molecules from Marine Species) by harvesting chemical, biological and medical data from the scientific literature, especially literature related to marine species. Chemical and biological literature covering the period 1970-2009 was text mined to find relationships between molecules of biological interest and marine species. An in-house database containing truly computable molecular structures and chemical names was linked to the literature data to find the most frequently occurring molecules. For this study, 43 marine species and the 10,378 most frequently occurring molecules were stored in the database. The in-house developed chemoinformatics tool BEST (Basic and Extended Scaffold Translator) was applied in a high performance computing environment to identify scaffolds from these molecules. MIMMS comprises the top-ranking marine species with their associated structures. It links species, molecules, scaffolds, drugs, diseases and literature to form a biological network, which can be used to explore the relationship between molecular scaffolds and marine species diversity. MIMMS can also serve as a resource for designing virtual libraries from molecular scaffolds and for designing and developing inhibitors for potentially devastating diseases using in-silico methods.
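Scaffold identification can be illustrated on a plain molecular graph. The algorithm used by BEST is not described here, so the sketch below substitutes a well-known stand-in: a Bemis-Murcko style framework obtained by repeatedly stripping terminal atoms until only ring systems and the linkers between them remain. The function name and the atom numbering are hypothetical.

```python
def murcko_style_framework(bonds):
    """Repeatedly remove terminal atoms; return the remaining atoms and bonds.

    `bonds` is a list of (atom_index, atom_index) pairs. This is a generic
    Bemis-Murcko style pruning, used as a stand-in for the BEST algorithm.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Atoms with at most one neighbor cannot be part of a ring or a linker.
    terminal = [atom for atom, nbrs in neighbors.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbors:  # already removed via another path
            continue
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            if len(neighbors[nbr]) <= 1:
                terminal.append(nbr)

    kept_bonds = {
        (min(a, b), max(a, b))
        for a, b in bonds
        if a in neighbors and b in neighbors
    }
    return set(neighbors), kept_bonds


# Ethylbenzene as a graph: ring atoms 0-5, ethyl side chain atoms 6 and 7.
ethylbenzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7)]
atoms, ring_bonds = murcko_style_framework(ethylbenzene)
print(sorted(atoms), sorted(ring_bonds))
```

Molecules that reduce to the same framework can then be grouped, which is what makes scaffold frequency counts across a database possible. An acyclic molecule reduces to an empty framework under this definition.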
  • DoMINE: A Chemoinformatics oriented Database of Indian Medicinal Plants and their Chemical Ingredients. Natural products have been the single most productive source of leads for the development of drugs. Over twenty natural product related drugs were approved by the FDA in the last five years. They cover a range of therapeutic indications (anti-cancer, anti-infective and anti-diabetic, among others) and show a great diversity of chemical structures. Over 100 natural product-derived compounds are currently undergoing clinical trials, and at least 100 similar projects are in preclinical development. The use of natural products in the drug discovery pipeline encouraged us to undertake this project of identifying and cataloging the chemicals of natural products cited in the literature. The main theme of this work is linking species, molecule and disease. Extensive data about species, chemicals and drugs was collected from medical literature covering the last forty years and filtered to extract the necessary information. Many chemoinformatics tools and techniques were employed at different steps of the procedure. A GUI named ELitE (Electronic Literature Expert) provides access to DoMINE (Database of Medicinally Important natural products from plantaE). DoMINE relates plant species, chemicals, scaffolds, drugs, diseases and uses, and allows users to access the data easily. The in-house developed chemoinformatics tool BEST (Basic and Extended Scaffold Translator) was applied in a high performance computing environment to identify scaffolds from these molecules. A large network was built showing the relations between species, chemicals, scaffolds, drugs, and diseases or therapeutic uses.
The database supports chemoinformatics oriented sub-structure, exact-structure and similar-structure queries that retrieve details of medicinal plants and their active chemical ingredients along with their therapeutic importance.
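A similar-structure query is commonly implemented by comparing molecular fingerprints with the Tanimoto coefficient. The sketch below assumes that approach; the text does not state which similarity measure DoMINE uses, and the fingerprints (sets of "on" bit positions) and entry names `mol_A` to `mol_C` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as no similarity
    return len(a & b) / union


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints; real ones would come from a chemistry toolkit.
library = {
    "mol_A": {1, 2, 3, 4},
    "mol_B": {1, 2, 3},
    "mol_C": {9},
}
print(similarity_search({1, 2, 3, 4}, library))
```

An exact-structure query is the special case where the score must equal 1.0 and is then confirmed by a full structure comparison, since distinct molecules can share a fingerprint.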
  • J-Proline: Java based open source toolkit for Analysis of Protein-Ligand Complexes. Analysis of protein-ligand complexes plays an important role in drug discovery research. Knowledge-based inhibitor design that draws on the analysis of protein-ligand complexes is likely to be an efficient approach. Here we present a Java based open source toolkit for building and analyzing protein-ligand complexes using an in-house built scaffold extraction method. We also used several conventional similarity analysis and clustering tools in a distributed computing environment (DCE) to handle the massive computational overhead. The program was used to generate the relationships between proteins, ligands and scaffolds. Similarity scores among proteins were computed by sequence similarity methods using local, global and multiple alignments. Similarities between ligands were computed using fingerprint based scores. The scores generated for proteins and ligands were used for classification and for network and tree building. The links (edges) established between the nodes (proteins and ligands) were used to identify common scaffolds and their occurrences in databases. Several open source programs for handling protein data and molecular data were integrated on the in-house developed DCE. Our study found that certain (promiscuous) scaffolds are common to multiple classes of protein targets and are likely to interfere with the activity and function of competitive proteins. We also set out to identify scaffolds that are selective for certain protein targets through this approach. The data and methods developed as part of this project will be made available through a public web resource (http://moltable.ncl.res.in/). Several ligands with large molecular complexity were identified in the protein-ligand network analysis (Figure-3).
The selected scaffolds from the ligand library were used to screen large chemical and biologically relevant drug databases for similar compounds with potential inhibitory activity as leads. The protein-ligand network analysis tool was successfully applied to the identification of efficient kinase inhibitors.
  • Development of Organic Reaction Fingerprints to evaluate the Reactivity Score and synthetic feasibility of new molecules. The purpose of this project is to determine whether a given molecule is capable of undergoing certain chemical reactions, based on known organic name reactions. This would make it possible to annotate the molecule as either a reactant or a product based on its functional groups.
  • CHEMICAL DECODING AYURVEDIC MEDICINAL PLANTS (J. Ayurveda and Integrated Medicine (2010), in press): Ayurveda, which originated in India, is an ancient science of leading a pure and healthy life and has been practiced for more than five thousand years, especially for curing diseases. Several scientific and public data sources list the medicinal plants used in Ayurveda. It is a common belief and scientific understanding that the active chemical ingredients, or secondary metabolites, of plants are usually responsible for their biological activity. In this study we selected a list of over two thousand plants known to be used in Ayurveda, identified by their unique Latin names (genus and species) along with the local names commonly used in India. The major objective of this initiative is to use in-house built chemoinformatics and text mining tools to identify the chemicals associated with these medicinal plants as cited in the scientific literature. The molecules extracted by this procedure are further analyzed for common frameworks comparable with drugs used for common diseases. This approach to decoding the chemical information of medicinal plants used in Ayurveda should help rationalize their use in modern medicine.
  • ChemRobot: Design of a chemically intelligent, digital vision based robot (hardware) to understand, interpret and guide ‘molecular informatics’ research initiatives. The ChemRobot (hardware) is equipped with dual high resolution camcorders (professional) or webcams (academic) to digitally capture and analyze hand drawn or computer generated molecular structures from plain paper. (Specifications available.) The software (open source) extracts images from streaming video, converts them into raster graphics images and then transforms them into vector graphics. The detected edges and nodes are then interpreted as bonds and atoms (default atom: carbon). An optical character recognition tool integrated with the program translates alphabetic characters into heteroatoms (N, P, S, O, Cl, Br, I, F, etc.). Coming soon: a) extraction of molecules from reactions; b) prediction of physico-chemical and biological properties. Robot capabilities: 1) Processes 20-30 images per second in a distributed computing environment. 2) Captures molecular structures from hand drawn images and interprets them as 3D molecular structures with high precision. 3) Generates the IUPAC name from a molecular structure. 4) Predicts properties (drug-likeness, toxicity score, novelty/complexity score, ease of synthesis) and finds similar known molecules in the published literature (research papers, patents, PhD theses, reports, web pages, etc.). 5) Success story: recognition of cyclic and complex molecular structures, including spiro systems.
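The step from vector graphics to atoms and bonds can be sketched as follows. This is an illustrative simplification, not the ChemRobot software: it assumes the vectorizer yields straight line segments, merges segment endpoints that fall within a tolerance `tol` (a hypothetical parameter) into one atom, and labels every atom as carbon by default. Double bonds drawn as parallel lines and OCR-detected heteroatoms are not handled.

```python
import math


def segments_to_graph(segments, tol=2.0):
    """Turn line segments into atoms (merged endpoints) and bonds (segments).

    Returns (atoms, bonds): atoms is a list of ("C", (x, y)) entries and
    bonds is a sorted list of (atom_index, atom_index) pairs.
    """
    positions = []  # one (x, y) position per detected atom

    def atom_index(point):
        # Reuse an existing atom if the endpoint lies within the tolerance.
        for i, pos in enumerate(positions):
            if math.dist(pos, point) <= tol:
                return i
        positions.append(point)
        return len(positions) - 1

    bonds = set()
    for start, end in segments:
        i, j = atom_index(start), atom_index(end)
        if i != j:  # ignore degenerate segments that collapse to one atom
            bonds.add((min(i, j), max(i, j)))

    atoms = [("C", pos) for pos in positions]  # default element: carbon
    return atoms, sorted(bonds)


# A hand-drawn triangle (cyclopropane) whose corners do not quite meet.
drawing = [
    ((0, 0), (10, 0)),
    ((10.5, 0.3), (5, 8)),
    ((5.2, 8.1), (0.4, -0.2)),
]
atoms, bonds = segments_to_graph(drawing)
print(len(atoms), bonds)
```

The tolerance is what lets imperfect hand-drawn corners close into rings; setting it too high would merge distinct atoms, so in practice it would be scaled to the typical bond length in the image.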
  • Design of Efficient Lead Molecules by Scaffold and Pharmacophore Analysis of Kinase Inhibitors using Chemoinformatics Tools. Protein kinases are excellent targets in cancer research, and kinase inhibitors are of considerable interest in several phases of drug discovery today. We present a computationally intensive scaffold extraction strategy applied to entire libraries of publicly available, published kinase inhibitors, and used the selected candidate scaffolds to design efficient lead molecules targeting kinases. The scaffold-target information generated by this method is used to build a biological network connecting protein targets, ligands and scaffolds. The scaffolds identified from kinase inhibitors are distinct from the scaffolds generated from other classes of protein inhibitors. We focused our research on the analysis of these kinase inhibitor scaffolds and used them for pharmacophore model based screening of databases of commercially available chemicals, public databases such as PubChem BioAssay, and custom designed in-house databases at NCL (http://moltable.ncl.res.in/). Based on our study we also found several new and promising lead structures for further computational and activity prediction studies. (Figure: workflow of the entire process to identify efficient kinase inhibitors through scaffold based pharmacophore modeling.) One important observation of our study is the promiscuous nature of the scaffolds and corresponding ligands extracted from kinase inhibitors. The list of unique kinase inhibitor scaffolds, QSAR studies, pharmacophore modeling and screening results, molecular clustering, and the frequency of occurrence of candidate structures from selected scaffolds in other databases were studied. The results of combined docking and QSAR studies to prioritize efficient lead compounds are under investigation.
  • ChemScreener: Distributed In silico Library Design under Drug-Likeness Constraints. In this project we highlight the progress made in developing a chemoinformatics application for distributed in silico library generation from known biologically active molecules under the additional constraints of drug-likeness and lead-likeness. The distributed and grid computing infrastructure, optimized for harvesting chemical data and for massive computation of molecular properties, has already been published [1,2]. In this work we expand the scope of the distributed computing environment, using a server/client communication infrastructure built on Java RMI technology, to design focused virtual libraries from molecules reported in the scientific literature as biologically active. Such a virtual library could be used for further in-silico modeling and simulation studies related to drug design or material science in a virtually unlimited “chemical space”, which cannot be sampled exhaustively even for typical small molecules containing up to about 30 heavy atoms. The server/client communication infrastructure is based on Java RMI and distributed as open source. It is adaptable to virtually every computing task that can be parallelized, as earlier applications to tasks as diverse as distributed computing and distributed data mining of chemical information from the Internet have demonstrated. Early high-throughput screening and combinatorial library design both suffered from the unfavorable physicochemical properties of the molecules they contained, yielding, for example, compounds too large or too poorly permeable to be suitable as leads. To circumvent this problem, the in silico generation of structures is combined with a drug-likeness filter, which currently applies the ‘rule of five’ but can be replaced by any user-definable filter and other advanced QSAR/QSPR/QSTR components.
This enables the generation of compound libraries tailored to specific targets or target classes, since both the fragments employed for virtual synthesis and the drug-likeness filter can incorporate knowledge about the drug class considered. In the current implementation, drug-likeness is calculated for each structure in a distributed manner; further evaluations such as docking can be implemented in the same fashion.
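The ‘rule of five’ filter mentioned above can be sketched as a small function. This is a minimal illustration, not the ChemScreener code: it assumes the four descriptors have already been computed elsewhere, uses the standard Lipinski thresholds, and allows at most one violation, which is the common convention. The function names and the example descriptor values are hypothetical.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """List which of Lipinski's four rule-of-five criteria are violated."""
    violations = []
    if mol_weight > 500:
        violations.append("MW > 500")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def passes_filter(descriptors, max_violations=1):
    """Accept a structure if it breaks at most `max_violations` criteria."""
    return len(rule_of_five_violations(**descriptors)) <= max_violations


# Hypothetical descriptor values for two generated structures.
candidate_ok = {"mol_weight": 350.0, "logp": 2.1, "h_donors": 2, "h_acceptors": 5}
candidate_bad = {"mol_weight": 650.0, "logp": 6.2, "h_donors": 2, "h_acceptors": 5}
print(passes_filter(candidate_ok), passes_filter(candidate_bad))
```

Keeping the filter as a plain function of precomputed descriptors makes it easy to swap in a different user-defined filter, and each structure can be checked independently on whichever client generated it.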


Photoinduced Electron Transfer (PET) promoted carbo-annulation strategy: arene radical cation in carbon-carbon bond formation reactions. Successfully completed the research methodology on carbocyclization and spirocyclization reactions. Reactive (kinetic/thermodynamic) silyl enol ethers generated from ketones were used as efficient nucleophiles for the construction of novel complex carbocyclic (5-8) and spirocyclic (spiro[4.5], spiro[5.5]) compounds. The spirocyclic diketones prepared have structural identity with cannabis spirane frameworks. For further details please refer to the corresponding publications [1, 2].


Recent work

  • BEST: Basic and Extended Scaffold Translator (Used for building 1 million distinct molecular scaffolds covering entire chemical space) (2009-10)
  • TextHydra: Chemical Textmining Toolkit for Medical literature covering over 17 million abstracts. (2008-9)
  • DCE: Distributed Chemical Computing Environment (2008-9) The Java based distributed chemical computing architecture used for building ChemStar and ChemXtreme is being optimized for medical literature mining in terms of performance and portability. The details will be published in a relevant journal.
  • ICBIS: Interactive Chemical - Biological Information System (Linking Molecules to Species) (2008-9) This initiative is to link all the known species to molecules, sequences and protein structures.
  • J-Proline: Java-Protein-Ligand Network Environment: Analysis of Protein-Ligand Complexes using similarity fingerprints (2008-9) The purpose of J-Proline is to create a network consisting of diseases, drugs and scaffolds.
  • Chemoinformatics for Drug Discovery (J. combi.chem HTS 2015, Issue 6 & 7, 10 research papers)
  • BigData Analytics in Chemistry and Biology using Cloud computing, HPC and GPU computing (2015)