Spaces: salvinjose/HNTAI

sachinchandrankallar committed on Jun 11, 2025

Commit 2aa5eb6, 1 Parent(s): 4182313

Revert "changes"

This reverts commit 4182313503b52e3b5687b6592adb1dcdd8d6f43d.

Files changed (6)

Dockerfile +1 -4
README_SPACES.md +0 -135
ai_med_extract/api/routes.py +0 -165
ai_med_extract/app.py +9 -43
ai_med_extract/utils/file_utils.py +36 -45
requirements.txt +2 -4

Dockerfile CHANGED

@@ -5,7 +5,6 @@ RUN apt-get update && apt-get install -y \
     tesseract-ocr \
     poppler-utils \
     ffmpeg \
-    git \
     && rm -rf /var/lib/apt/lists/*
 # Set working directory
@@ -15,8 +14,7 @@ WORKDIR /app
 COPY requirements.txt .
 # Install Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt && \
-    pip install --no-cache-dir modelscope==1.9.5 qwen2==0.1.0
+RUN pip install --no-cache-dir -r requirements.txt
 # Copy application code
 COPY . .
@@ -33,7 +31,6 @@ ENV XDG_CACHE_HOME=/tmp
 ENV TORCH_HOME=/tmp/torch
 ENV WHISPER_CACHE=/tmp/whisper
 ENV PYTHONPATH=/app
-ENV MODELSCOPE_CACHE=/tmp/huggingface
 # Expose port
 EXPOSE 7860

README_SPACES.md CHANGED

@@ -44,138 +44,3 @@ This Hugging Face Space provides an AI-powered medical document processing syste
 ## Privacy
 All processing is done securely within the Hugging Face Space environment. No data is stored permanently.
-# Medical Data Extraction API - Hugging Face Spaces Deployment
-This document provides instructions for deploying the Medical Data Extraction API on Hugging Face Spaces.
-## Prerequisites
-1. A Hugging Face account
-2. Git installed on your local machine
-3. Docker installed on your local machine (for local testing)
-## Local Testing
-Before deploying to Spaces, you can test the application locally:
-1. Build the Docker image:
-```bash
-docker build -t ai-med-extract .
-```
-2. Run the container:
-```bash
-docker run -p 7860:7860 ai-med-extract
-```
-3. Test the API endpoints:
-```bash
-# Health check
-curl http://localhost:7860/health
-# Extract medical data
-curl -X POST http://localhost:7860/api/extract_medical_data \
-  -H "Content-Type: application/json" \
-  -d '{"text": "Patient presents with fever and cough..."}'
-# Transcribe audio
-curl -X POST http://localhost:7860/api/voice_to_text_extraction \
-  -F "audio=@recording.wav"
-```
-## Deploying to Hugging Face Spaces
-1. Create a new Space:
-   - Go to huggingface.co/spaces
-   - Click "Create new Space"
-   - Choose "Docker" as the SDK
-   - Name your space (e.g., "medical-data-extraction")
-   - Set visibility (Public or Private)
-2. Connect your repository:
-   - Push your code to a Git repository
-   - In the Space settings, connect to your repository
-   - The Space will automatically build using the Dockerfile
-3. Monitor the deployment:
-   - Check the "Logs" tab for build progress
-   - Use the health check endpoint to verify the deployment:
-     ```
-     https://your-space-name.hf.space/health
-     ```
-## API Endpoints
-1. Health Check:
-```
-GET /health
-```
-2. Extract Medical Data:
-```
-POST /api/extract_medical_data
-Content-Type: application/json
-{
-    "text": "Patient presents with..."
-}
-```
-3. Generate Summary:
-```
-POST /api/summary_creation
-Content-Type: application/json
-{
-    "text": "Patient presents with..."
-}
-```
-4. Transcribe Audio:
-```
-POST /api/voice_to_text_extraction
-Content-Type: multipart/form-data
-audio: [audio file]
-```
-5. Audio to Chart:
-```
-POST /api/audio_to_chart
-Content-Type: multipart/form-data
-audio: [audio file]
-```
-## Environment Variables
-The following environment variables are automatically set in the Dockerfile:
-- `PYTHONUNBUFFERED=1`
-- `HF_HOME=/tmp/huggingface`
-- `TRANSFORMERS_CACHE=/tmp/huggingface`
-- `XDG_CACHE_HOME=/tmp`
-- `TORCH_HOME=/tmp/torch`
-- `WHISPER_CACHE=/tmp/whisper`
-- `PYTHONPATH=/app`
-- `MODELSCOPE_CACHE=/tmp/huggingface`
-## Troubleshooting
-1. If models fail to load:
-   - Check the logs in the Spaces interface
-   - Verify disk space using the health check endpoint
-   - Ensure all dependencies are correctly specified in requirements.txt
-2. If audio transcription fails:
-   - Verify the audio file format (supported: wav, mp3, m4a, ogg)
-   - Check file size (max 100MB)
-   - Ensure the Whisper model is loaded (check health endpoint)
-3. If CORS errors occur:
-   - Verify the frontend URL is correctly configured
-   - Check the CORS settings in app.py
-   - Ensure proper headers are set in the frontend requests
-## Support
-For issues or questions:
-1. Check the application logs in the Spaces interface
-2. Use the health check endpoint to diagnose problems
-3. Review the error messages in the API responses



ai_med_extract/api/routes.py CHANGED

@@ -757,168 +757,3 @@ def register_routes(app, agents):
     @app.route("/")
     def home():
         return "Medical Data Extraction API is running!", 200
-    @app.route("/health", methods=["GET"])
-    def health_check():
-        try:
-            # Check if models are loaded
-            models_status = {
-                "medical_data_extractor": MedicalDataExtractorAgent.gen_model_loader._model is not None,
-                "summarizer": SummarizerAgent.summarization_model_loader._model is not None,
-                "whisper": whisper_model._model is not None
-            }
-            # Check disk space
-            import shutil
-            total, used, free = shutil.disk_usage("/")
-            disk_space = {
-                "total_gb": round(total / (1024**3), 2),
-                "used_gb": round(used / (1024**3), 2),
-                "free_gb": round(free / (1024**3), 2)
-            }
-            return jsonify({
-                "status": "healthy",
-                "models": models_status,
-                "disk_space": disk_space,
-                "version": "1.0.0"
-            })
-        except Exception as e:
-            logging.error(f"Health check failed: {str(e)}", exc_info=True)
-            return jsonify({
-                "status": "unhealthy",
-                "error": str(e)
-            }), 500
-    @app.route("/api/extract_medical_data", methods=["POST"])
-    def extract_medical_data():
-        try:
-            data = request.json
-            if not data or "text" not in data:
-                return jsonify({"error": "No text provided"}), 400
-            text = data["text"]
-            if not text.strip():
-                return jsonify({"error": "Empty text provided"}), 400
-            # Extract medical data
-            result = MedicalDataExtractorAgent.extract_medical_data(text)
-            return jsonify(result)
-        except Exception as e:
-            logging.error(f"Error in extract_medical_data: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-    @app.route("/api/summary_creation", methods=["POST"])
-    def generate_summary():
-        try:
-            data = request.json
-            if not data or "text" not in data:
-                return jsonify({"error": "No text provided"}), 400
-            text = data["text"]
-            if not text.strip():
-                return jsonify({"error": "Empty text provided"}), 400
-            # Generate summary
-            summary = SummarizerAgent.generate_summary(text)
-            return jsonify({"summary": summary})
-        except Exception as e:
-            logging.error(f"Error in generate_summary: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-    @app.route("/api/voice_to_text_extraction", methods=["POST"])
-    def transcribe_audio():
-        try:
-            if "audio" not in request.files:
-                return jsonify({"error": "No audio file provided"}), 400
-            audio_file = request.files["audio"]
-            if audio_file.filename == "":
-                return jsonify({"error": "No selected audio file"}), 400
-            # Validate file extension
-            if not allowed_file(audio_file.filename):
-                return jsonify({
-                    "error": "Unsupported audio format. Allowed formats: wav, mp3, m4a, ogg"
-                }), 400
-            # Check file size
-            valid_size, error_message = check_file_size(audio_file)
-            if not valid_size:
-                return jsonify({"error": error_message}), 400
-            # Save audio file temporarily
-            temp_dir = os.path.join(tempfile.gettempdir(), 'audio_uploads')
-            os.makedirs(temp_dir, exist_ok=True)
-            temp_filename = f"{uuid.uuid4()}_{secure_filename(audio_file.filename)}"
-            temp_path = os.path.join(temp_dir, temp_filename)
-            try:
-                audio_file.save(temp_path)
-                # Transcribe audio
-                result = whisper_model.transcribe(temp_path)
-                return jsonify({"text": result["text"]})
-            finally:
-                # Clean up temporary file
-                if os.path.exists(temp_path):
-                    os.remove(temp_path)
-        except Exception as e:
-            logging.error(f"Error in transcribe_audio: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-    @app.route("/api/audio_to_chart", methods=["POST"])
-    def audio_to_chart():
-        try:
-            if "audio" not in request.files:
-                return jsonify({"error": "No audio file provided"}), 400
-            audio_file = request.files["audio"]
-            if audio_file.filename == "":
-                return jsonify({"error": "No selected audio file"}), 400
-            # Validate file extension
-            if not allowed_file(audio_file.filename):
-                return jsonify({
-                    "error": "Unsupported audio format. Allowed formats: wav, mp3, m4a, ogg"
-                }), 400
-            # Check file size
-            valid_size, error_message = check_file_size(audio_file)
-            if not valid_size:
-                return jsonify({"error": error_message}), 400
-            # Save audio file temporarily
-            temp_dir = os.path.join(tempfile.gettempdir(), 'audio_uploads')
-            os.makedirs(temp_dir, exist_ok=True)
-            temp_filename = f"{uuid.uuid4()}_{secure_filename(audio_file.filename)}"
-            temp_path = os.path.join(temp_dir, temp_filename)
-            try:
-                audio_file.save(temp_path)
-                # Transcribe audio
-                transcription = whisper_model.transcribe(temp_path)
-                transcribed_text = transcription["text"]
-                # Extract medical data from transcription
-                medical_data = MedicalDataExtractorAgent.extract_medical_data(transcribed_text)
-                # Generate summary
-                summary = SummarizerAgent.generate_summary(transcribed_text)
-                return jsonify({
-                    "transcription": transcribed_text,
-                    "medical_data": medical_data,
-                    "summary": summary
-                })
-            finally:
-                # Clean up temporary file
-                if os.path.exists(temp_path):
-                    os.remove(temp_path)
-        except Exception as e:
-            logging.error(f"Error in audio_to_chart: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
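
The removed `/health` handler reported disk usage in gigabytes via `shutil.disk_usage`. A minimal standalone sketch of that calculation, outside Flask, might look like the following. The function name `disk_space_gb` and its `path` parameter are illustrative and do not appear in the commit.

```python
import shutil


def disk_space_gb(path="/"):
    """Return total, used and free space for `path` in GiB, rounded to 2 places."""
    total, used, free = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gb": round(total / gib, 2),
        "used_gb": round(used / gib, 2),
        "free_gb": round(free / gib, 2),
    }


if __name__ == "__main__":
    # Prints a dict with the same keys the removed handler used for "disk_space".
    print(disk_space_gb())
```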


ai_med_extract/app.py CHANGED

@@ -10,7 +10,6 @@ from .agents.phi_scrubber import MedicalTextUtils
 from .agents.summarizer import SummarizerAgent
 from .agents.medical_data_extractor import MedicalDataExtractorAgent
 from .agents.medical_data_extractor import MedicalDocDataExtractorAgent
-import torch
 # Load environment variables
@@ -27,16 +26,7 @@ logging.basicConfig(
 )
 app = Flask(__name__)
-# Configure CORS
-CORS(app, resources={
-    r"/api/*": {
-        "origins": ["*"],  # Allow all origins in development
-        "methods": ["GET", "POST", "OPTIONS"],
-        "allow_headers": ["Content-Type", "Authorization"],
-        "max_age": 3600
-    }
-})
+CORS(app)
 # Configure upload directory
 UPLOAD_DIR = '/data/uploads'
@@ -70,37 +60,13 @@ class LazyModelLoader:
         if self._model is None:
             try:
                 logging.info(f"Loading {self.model_name}...")
-                # Special handling for Qwen2 models
-                if "qwen2" in self.model_name.lower():
-                    from modelscope import AutoModelForCausalLM, AutoTokenizer
-                    tokenizer = AutoTokenizer.from_pretrained(
-                        self.model_name,
-                        trust_remote_code=True,
-                        cache_dir=os.environ.get('TRANSFORMERS_CACHE', '/tmp/huggingface')
-                    )
-                    model = AutoModelForCausalLM.from_pretrained(
-                        self.model_name,
-                        trust_remote_code=True,
-                        device_map="auto",
-                        torch_dtype=torch.float16,
-                        cache_dir=os.environ.get('TRANSFORMERS_CACHE', '/tmp/huggingface')
-                    )
-                    self._model = pipeline(
-                        task=self.model_type,
-                        model=model,
-                        tokenizer=tokenizer,
-                        device_map="auto",
-                        torch_dtype=torch.float16
-                    )
-                else:
-                    self._model = pipeline(
-                        task=self.model_type,
-                        model=self.model_name,
-                        trust_remote_code=True,
-                        device_map="auto",
-                        low_cpu_mem_usage=True
-                    )
+                self._model = pipeline(
+                    task=self.model_type,
+                    model=self.model_name,
+                    trust_remote_code=True,
+                    device_map="auto",
+                    low_cpu_mem_usage=True
+                )
                 logging.info(f"Successfully loaded {self.model_name}")
             except Exception as e:
                 if self.fallback_model:
@@ -145,7 +111,7 @@ class WhisperModelLoader:
 try:
     # Use smaller models for Hugging Face Spaces
     medalpaca_model_loader = LazyModelLoader(
-        "Qwen/Qwen2-7B-Chat",  # Using Qwen2 model
+        "medalpaca/medalpaca-7b",  # Smaller model
         "text-generation",
         fallback_model="facebook/bart-base"  # Fallback model
     )
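
The `LazyModelLoader` hunk shows a load-on-first-use pattern with a fallback model, but the diff cuts off right after `if self.fallback_model:`. The sketch below illustrates the general pattern under the assumption that the fallback branch simply retries the load with the fallback name. `LazyLoader`, `load_fn`, and `fake_load` are hypothetical stand-ins (for the real class and for `transformers.pipeline`), so the example runs without downloading any model.

```python
import logging


class LazyLoader:
    """Load a model on first use; retry once with a fallback name on failure."""

    def __init__(self, model_name, load_fn, fallback_model=None):
        self.model_name = model_name
        self.load_fn = load_fn  # hypothetical stand-in for transformers.pipeline
        self.fallback_model = fallback_model
        self._model = None

    def load(self):
        if self._model is None:
            try:
                logging.info("Loading %s...", self.model_name)
                self._model = self.load_fn(self.model_name)
            except Exception as exc:
                if not self.fallback_model:
                    raise
                # Assumption: the fallback branch retries with the fallback model.
                logging.warning("Falling back to %s: %s", self.fallback_model, exc)
                self._model = self.load_fn(self.fallback_model)
        return self._model


calls = []


def fake_load(name):
    """Hypothetical loader that fails for the primary model only."""
    calls.append(name)
    if name == "medalpaca/medalpaca-7b":
        raise RuntimeError("not enough memory")
    return f"pipeline<{name}>"


loader = LazyLoader(
    "medalpaca/medalpaca-7b", fake_load, fallback_model="facebook/bart-base"
)
first = loader.load()
second = loader.load()  # served from the cached instance, no new load attempt
```

Because the result is cached in `_model`, the second `load()` call does not invoke the loader again, which is the point of deferring the load until first use.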

ai_med_extract/utils/file_utils.py CHANGED

@@ -5,72 +5,63 @@ import logging
 from werkzeug.utils import secure_filename
 from flask import current_app
-# Configure logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-# Allowed file extensions
-ALLOWED_EXTENSIONS = {
-    'pdf': ['pdf'],
-    'image': ['png', 'jpg', 'jpeg', 'gif', 'bmp', 'tiff'],
-    'audio': ['wav', 'mp3', 'm4a', 'ogg'],
-    'document': ['doc', 'docx', 'txt', 'rtf']
-}
+ALLOWED_EXTENSIONS = {"pdf", "jpg", "jpeg", "png", "svg", "docx", "doc", "xlsx", "xls", "wav", "mp3", "m4a", "ogg"}
 MAX_SIZE_PDF_DOCS = 1 * 1024 * 1024 * 1024    # 1GB
 MAX_SIZE_IMAGES = 500 * 1024 * 1024           # 500MB
 MAX_SIZE_AUDIO = 100 * 1024 * 1024            # 100MB
-def allowed_file(filename, file_type='audio'):
-    """Check if the file extension is allowed"""
-    if not filename:
-        return False
-    return '.' in filename and \
-           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS.get(file_type, [])
-def check_file_size(file, max_size_mb=100):
-    """Check if the file size is within limits"""
+def allowed_file(filename):
+    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
+def check_file_size(file):
     try:
+        # Store current position
+        current_pos = file.tell()
+        # Check size
         file.seek(0, os.SEEK_END)
         size = file.tell()
-        file.seek(0)  # Reset file pointer
-        max_size_bytes = max_size_mb * 1024 * 1024
-        if size > max_size_bytes:
-            return False, f"File size exceeds {max_size_mb}MB limit"
+        # Return to original position
+        file.seek(current_pos)
+        extension = file.filename.rsplit('.', 1)[-1].lower()
+        if extension in {"pdf", "docx"} and size > MAX_SIZE_PDF_DOCS:
+            return False, f"File {file.filename} exceeds 1GB size limit"
+        elif extension in {"jpg", "jpeg", "png"} and size > MAX_SIZE_IMAGES:
+            return False, f"Image {file.filename} exceeds 500MB size limit"
+        elif extension in {"wav", "mp3", "m4a", "ogg"} and size > MAX_SIZE_AUDIO:
+            return False, f"Audio file {file.filename} exceeds 100MB size limit"
         return True, None
     except Exception as e:
-        logger.error(f"Error checking file size: {str(e)}")
-        return False, "Error checking file size"
+        logging.error(f"Error checking file size: {e}", exc_info=True)
+        return False, f"Error checking file size: {str(e)}"
 def save_data_to_storage(filename, data):
-    """Save extracted data to storage"""
     try:
-        storage_dir = os.path.join(os.getcwd(), 'storage')
-        os.makedirs(storage_dir, exist_ok=True)
-        file_path = os.path.join(storage_dir, f"{secure_filename(filename)}.json")
-        with open(file_path, 'w') as f:
-            json.dump(data, f)
-        return True
+        upload_folder = current_app.config.get("UPLOAD_FOLDER", "uploads")
+        if not os.path.exists(upload_folder):
+            os.makedirs(upload_folder, exist_ok=True)
+        filename = filename.rsplit(".", 1)[0]
+        filepath = os.path.join(upload_folder, f"{filename}.json")
+        with open(filepath, "w") as file:
+            json.dump(data, file)
     except Exception as e:
-        logger.error(f"Error saving data to storage: {str(e)}")
-        return False
+        logging.error(f"Exception during save: {e}")
 def get_data_from_storage(filename):
-    """Retrieve data from storage"""
     try:
-        storage_dir = os.path.join(os.getcwd(), 'storage')
-        file_path = os.path.join(storage_dir, f"{secure_filename(filename)}.json")
-        if os.path.exists(file_path):
-            with open(file_path, 'r') as f:
-                return json.load(f)
-        return None
+        upload_folder = current_app.config.get("UPLOAD_FOLDER", "uploads")
+        filepath = os.path.join(upload_folder, f"{filename}.json")
+        if not os.path.exists(filepath):
+            return None
+        with open(filepath, "r") as file:
+            data = json.load(file)
+        return data
     except Exception as e:
-        logger.error(f"Error retrieving data from storage: {str(e)}")
+        logging.error(f"Error loading data: {e}")
         return None
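
The restored `check_file_size` picks its limit from the file extension and puts the stream back where it found it. The sketch below reproduces that logic without Flask so the behaviour can be checked in isolation. `FakeUpload` is a hypothetical stand-in for an uploaded file object, and the `audio_limit` parameter is an addition for the demo (so a 10-byte file can trip the limit); neither appears in the commit, and the `try/except` wrapper is omitted for brevity.

```python
import io
import os

# Limits as defined in file_utils.py above.
MAX_SIZE_PDF_DOCS = 1 * 1024 * 1024 * 1024    # 1GB
MAX_SIZE_IMAGES = 500 * 1024 * 1024           # 500MB
MAX_SIZE_AUDIO = 100 * 1024 * 1024            # 100MB


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file: a byte stream with a filename."""

    def __init__(self, data, filename):
        super().__init__(data)
        self.filename = filename


def check_file_size(file, audio_limit=MAX_SIZE_AUDIO):
    """Return (ok, error_message) and leave the stream position unchanged."""
    current_pos = file.tell()          # remember where the caller was
    file.seek(0, os.SEEK_END)
    size = file.tell()                 # offset at end of stream == size in bytes
    file.seek(current_pos)             # restore the caller's position

    extension = file.filename.rsplit(".", 1)[-1].lower()
    if extension in {"pdf", "docx"} and size > MAX_SIZE_PDF_DOCS:
        return False, f"File {file.filename} exceeds the document size limit"
    if extension in {"jpg", "jpeg", "png"} and size > MAX_SIZE_IMAGES:
        return False, f"Image {file.filename} exceeds the image size limit"
    if extension in {"wav", "mp3", "m4a", "ogg"} and size > audio_limit:
        return False, f"Audio file {file.filename} exceeds the audio size limit"
    return True, None


upload = FakeUpload(b"x" * 10, "note.wav")
upload.read(3)                                   # move the position to 3
too_big = check_file_size(upload, audio_limit=5)
position_after = upload.tell()                   # still 3 after the check
unchecked = check_file_size(FakeUpload(b"x" * 10, "sheet.xlsx"), audio_limit=5)
```

Note that extensions outside the three groups, such as `xlsx` or `doc`, fall through to `return True, None` without any size check, even though `ALLOWED_EXTENSIONS` accepts them.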

requirements.txt CHANGED

@@ -8,7 +8,7 @@ python-dotenv==1.0.1
 torch==2.1.0
 torchaudio==2.1.0
 torchvision==0.16.0
-transformers==4.37.2
+transformers==4.36.2
 sentence-transformers==2.2.2
 scikit-learn==1.3.2
 numpy==1.24.3
@@ -16,9 +16,7 @@ pandas==2.1.4
 scipy==1.11.4
 accelerate==0.25.0
-# Qwen2 dependencies
-modelscope==1.9.5
-qwen2==0.1.0
 # NLP
 spacy==3.7.2