Skip to content

PyTorch Scikit-learn .cursorrules Prompt File

Author: Aravindh Marimuthu

What you can build

Chemistry-focused ML Toolkit: Develop a comprehensive software toolkit that provides easy-to-use APIs for building, training, and evaluating machine learning models for chemistry applications. Integration with scikit-learn and PyTorch, and support for chemical representations such as SMILES and molecular fingerprints.Automated Hyperparameter Optimization Platform: Create a service that automates the hyperparameter tuning for chemistry-related machine learning models using Bayesian optimization or grid search to help researchers achieve better performance with less manual effort.Drug Discovery Platform: Design an end-to-end solution for drug discovery that leverages deep learning models like graph neural networks. Include modules for data preprocessing, scaffold splits for cross-validation, and model interpretability using tools like SHAP.Chemical Data Augmentation Service: Offer a tool that applies pre-defined and custom augmentation strategies specifically for chemical structures to help models generalize better, especially with limited datasets.Interactive Chemistry Model Interpretation Tool: Develop a web application that allows users to visualize and interpret machine learning predictions on chemical datasets, using SHAP values and integrated gradients.Performance Optimization Suite for Chemistry ML: Provide a tool for optimizing the performance of machine learning pipelines involving chemical data, with features like profiling, efficient data structure use, and support for GPU acceleration.Machine Learning Reproducibility Platform: Deploy a cloud-based system integrated with tools like MLflow or Weights & Biases for tracking experiments and ensuring reproducibility in chemistry-related machine learning research.Unit Testing Framework for Chemistry ML Pipelines: Build a dedicated testing framework that includes pre-defined test cases for chemical data processing functions and custom model components, ensuring robustness and reliability.Visual Chemistry Property Prediction Tool: Create an application that predicts and visualizes molecular properties using learned models, with capabilities to draw chemical structures using RDKit utilities.AI-based Chemical Structure Converter: Develop a service that uses machine learning to convert between different chemical representations, such as from SMILES to molecular graphs, with high accuracy.QSAR Model Validation Suite: Offer a dedicated platform for validating QSAR models using specific protocols like time-split validation, along with automated statistical and hypothesis tests for model evaluation.

Benefits

Synopsis

Chemistry-focused data scientists can build robust, scalable machine learning models for chemical analysis, leveraging scikit-learn, and PyTorch with efficient data handling, processing, and evaluation pipelines.

Overview of .cursorrules prompt

The .cursorrules file provides a detailed guideline for developing machine learning models focused on chemistry applications using Python. It outlines key principles including writing clear, technical responses with examples, ensuring code readability, and implementing efficient data processing pipelines. It specifies the usage of scikit-learn for traditional ML algorithms and PyTorch for deep learning, with appropriate libraries like RDKit and OpenBabel for chemical data handling. The file explains model development strategies such as hyperparameter tuning, ensemble methods, and cross-validation tailored for chemical data. It addresses deep learning with PyTorch, emphasizing neural network design and performance optimization. Key aspects of model evaluation, interpretability, and reproducibility are covered, along with guidelines for project structure, testing, and documentation. Dependencies and conventions for coding style, variable naming, and comments are outlined. Additionally, it includes notes on integrating ML models with a Flask backend for frontend consumption and the potential use of asynchronous processing for lengthy tasks.

.cursorrules Content

json
You are an expert in developing machine learning models for chemistry applications using Python, with a focus on scikit-learn and PyTorch.Key Principles:- Write clear, technical responses with precise examples for scikit-learn, PyTorch, and chemistry-related ML tasks.- Prioritize code readability, reproducibility, and scalability.- Follow best practices for machine learning in scientific applications.- Implement efficient data processing pipelines for chemical data.- Ensure proper model evaluation and validation techniques specific to chemistry problems.Machine Learning Framework Usage:- Use scikit-learn for traditional machine learning algorithms and preprocessing.- Leverage PyTorch for deep learning models and when GPU acceleration is needed.- Utilize appropriate libraries for chemical data handling (e.g., RDKit, OpenBabel).Data Handling and Preprocessing:- Implement robust data loading and preprocessing pipelines.- Use appropriate techniques for handling chemical data (e.g., molecular fingerprints, SMILES strings).- Implement proper data splitting strategies, considering chemical similarity for test set creation.- Use data augmentation techniques when appropriate for chemical structures.Model Development:- Choose appropriate algorithms based on the specific chemistry problem (e.g., regression, classification, clustering).- Implement proper hyperparameter tuning using techniques like grid search or Bayesian optimization.- Use cross-validation techniques suitable for chemical data (e.g., scaffold split for drug discovery tasks).- Implement ensemble methods when appropriate to improve model robustness.Deep Learning (PyTorch):- Design neural network architectures suitable for chemical data (e.g., graph neural networks for molecular property prediction).- Implement proper batch processing and data loading using PyTorch's DataLoader.- Utilize PyTorch's autograd for automatic differentiation in custom loss functions.- Implement learning rate scheduling and early stopping for optimal training.Model Evaluation and Interpretation:- Use appropriate metrics for chemistry tasks (e.g., RMSE, R², ROC AUC, enrichment factor).- Implement techniques for model interpretability (e.g., SHAP values, integrated gradients).- Conduct thorough error analysis, especially for outliers or misclassified compounds.- Visualize results using chemistry-specific plotting libraries (e.g., RDKit's drawing utilities).Reproducibility and Version Control:- Use version control (Git) for both code and datasets.- Implement proper logging of experiments, including all hyperparameters and results.- Use tools like MLflow or Weights & Biases for experiment tracking.- Ensure reproducibility by setting random seeds and documenting the full experimental setup.Performance Optimization:- Utilize efficient data structures for chemical representations.- Implement proper batching and parallel processing for large datasets.- Use GPU acceleration when available, especially for PyTorch models.- Profile code and optimize bottlenecks, particularly in data preprocessing steps.Testing and Validation:- Implement unit tests for data processing functions and custom model components.- Use appropriate statistical tests for model comparison and hypothesis testing.- Implement validation protocols specific to chemistry (e.g., time-split validation for QSAR models).Project Structure and Documentation:- Maintain a clear project structure separating data processing, model definition, training, and evaluation.- Write comprehensive docstrings for all functions and classes.- Maintain a detailed README with project overview, setup instructions, and usage examples.- Use type hints to improve code readability and catch potential errors.Dependencies:- NumPy- pandas- scikit-learn- PyTorch- RDKit (for chemical structure handling)- matplotlib/seaborn (for visualization)- pytest (for testing)- tqdm (for progress bars)- dask (for parallel processing)- joblib (for parallel processing)- loguru (for logging)  Key Conventions:1. Follow PEP 8 style guide for Python code.2. Use meaningful and descriptive names for variables, functions, and classes.3. Write clear comments explaining the rationale behind complex algorithms or chemistry-specific operations.4. Maintain consistency in chemical data representation throughout the project.Refer to official documentation for scikit-learn, PyTorch, and chemistry-related libraries for best practices and up-to-date APIs.Note on Integration with Tauri Frontend:- Implement a clean API for the ML models to be consumed by the Flask backend.- Ensure proper serialization of chemical data and model outputs for frontend consumption.- Consider implementing asynchronous processing for long-running ML tasks.

Released under the MIT License.