Commit 2aa5eb6
Parent(s): 4182313
Revert "changes"
This reverts commit 4182313503b52e3b5687b6592adb1dcdd8d6f43d.
- Dockerfile +1 -4
- README_SPACES.md +0 -135
- ai_med_extract/api/routes.py +0 -165
- ai_med_extract/app.py +9 -43
- ai_med_extract/utils/file_utils.py +36 -45
- requirements.txt +2 -4
Dockerfile
CHANGED
@@ -5,7 +5,6 @@ RUN apt-get update && apt-get install -y \
     tesseract-ocr \
     poppler-utils \
     ffmpeg \
-    git \
     && rm -rf /var/lib/apt/lists/*
 
 # Set working directory
@@ -15,8 +14,7 @@ WORKDIR /app
 COPY requirements.txt .
 
 # Install Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt
-    pip install --no-cache-dir modelscope==1.9.5 qwen2==0.1.0
+RUN pip install --no-cache-dir -r requirements.txt
 
 # Copy application code
 COPY . .
@@ -33,7 +31,6 @@ ENV XDG_CACHE_HOME=/tmp
 ENV TORCH_HOME=/tmp/torch
 ENV WHISPER_CACHE=/tmp/whisper
 ENV PYTHONPATH=/app
-ENV MODELSCOPE_CACHE=/tmp/huggingface
 
 # Expose port
 EXPOSE 7860
README_SPACES.md
CHANGED
@@ -44,138 +44,3 @@ This Hugging Face Space provides an AI-powered medical document processing syste
 ## Privacy
 
 All processing is done securely within the Hugging Face Space environment. No data is stored permanently.
-
-# Medical Data Extraction API - Hugging Face Spaces Deployment
-
-This document provides instructions for deploying the Medical Data Extraction API on Hugging Face Spaces.
-
-## Prerequisites
-
-1. A Hugging Face account
-2. Git installed on your local machine
-3. Docker installed on your local machine (for local testing)
-
-## Local Testing
-
-Before deploying to Spaces, you can test the application locally:
-
-1. Build the Docker image:
-   ```bash
-   docker build -t ai-med-extract .
-   ```
-
-2. Run the container:
-   ```bash
-   docker run -p 7860:7860 ai-med-extract
-   ```
-
-3. Test the API endpoints:
-   ```bash
-   # Health check
-   curl http://localhost:7860/health
-
-   # Extract medical data
-   curl -X POST http://localhost:7860/api/extract_medical_data \
-     -H "Content-Type: application/json" \
-     -d '{"text": "Patient presents with fever and cough..."}'
-
-   # Transcribe audio
-   curl -X POST http://localhost:7860/api/voice_to_text_extraction \
-     -F "audio=@recording.wav"
-   ```
-
-## Deploying to Hugging Face Spaces
-
-1. Create a new Space:
-   - Go to huggingface.co/spaces
-   - Click "Create new Space"
-   - Choose "Docker" as the SDK
-   - Name your space (e.g., "medical-data-extraction")
-   - Set visibility (Public or Private)
-
-2. Connect your repository:
-   - Push your code to a Git repository
-   - In the Space settings, connect to your repository
-   - The Space will automatically build using the Dockerfile
-
-3. Monitor the deployment:
-   - Check the "Logs" tab for build progress
-   - Use the health check endpoint to verify the deployment:
-   ```
-   https://your-space-name.hf.space/health
-   ```
-
-## API Endpoints
-
-1. Health Check:
-   ```
-   GET /health
-   ```
-
-2. Extract Medical Data:
-   ```
-   POST /api/extract_medical_data
-   Content-Type: application/json
-   {
-     "text": "Patient presents with..."
-   }
-   ```
-
-3. Generate Summary:
-   ```
-   POST /api/summary_creation
-   Content-Type: application/json
-   {
-     "text": "Patient presents with..."
-   }
-   ```
-
-4. Transcribe Audio:
-   ```
-   POST /api/voice_to_text_extraction
-   Content-Type: multipart/form-data
-   audio: [audio file]
-   ```
-
-5. Audio to Chart:
-   ```
-   POST /api/audio_to_chart
-   Content-Type: multipart/form-data
-   audio: [audio file]
-   ```
-
-## Environment Variables
-
-The following environment variables are automatically set in the Dockerfile:
-- `PYTHONUNBUFFERED=1`
-- `HF_HOME=/tmp/huggingface`
-- `TRANSFORMERS_CACHE=/tmp/huggingface`
-- `XDG_CACHE_HOME=/tmp`
-- `TORCH_HOME=/tmp/torch`
-- `WHISPER_CACHE=/tmp/whisper`
-- `PYTHONPATH=/app`
-- `MODELSCOPE_CACHE=/tmp/huggingface`
-
-## Troubleshooting
-
-1. If models fail to load:
-   - Check the logs in the Spaces interface
-   - Verify disk space using the health check endpoint
-   - Ensure all dependencies are correctly specified in requirements.txt
-
-2. If audio transcription fails:
-   - Verify the audio file format (supported: wav, mp3, m4a, ogg)
-   - Check file size (max 100MB)
-   - Ensure the Whisper model is loaded (check health endpoint)
-
-3. If CORS errors occur:
-   - Verify the frontend URL is correctly configured
-   - Check the CORS settings in app.py
-   - Ensure proper headers are set in the frontend requests
-
-## Support
-
-For issues or questions:
-1. Check the application logs in the Spaces interface
-2. Use the health check endpoint to diagnose problems
-3. Review the error messages in the API responses
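Review note: the README text removed above documents `POST /api/extract_medical_data` with a JSON body carrying a single `text` field. For readers who want to exercise the pre-revert state from Python rather than curl, a minimal standard-library sketch might look like the following. The helper name `build_extract_request` and the base URL are illustrative assumptions, not part of the repository, and the route itself is removed by this revert.

```python
import json
import urllib.request


def build_extract_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/extract_medical_data."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/extract_medical_data",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical local container started with `docker run -p 7860:7860 ...`.
    req = build_extract_request(
        "http://localhost:7860", "Patient presents with fever and cough..."
    )
    print(req.get_method(), req.full_url)
    # Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=30).
```

Building the request separately from sending it keeps the payload shape easy to check without a running Space.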
ai_med_extract/api/routes.py
CHANGED
@@ -757,168 +757,3 @@ def register_routes(app, agents):
     @app.route("/")
     def home():
         return "Medical Data Extraction API is running!", 200
-
-    @app.route("/health", methods=["GET"])
-    def health_check():
-        try:
-            # Check if models are loaded
-            models_status = {
-                "medical_data_extractor": MedicalDataExtractorAgent.gen_model_loader._model is not None,
-                "summarizer": SummarizerAgent.summarization_model_loader._model is not None,
-                "whisper": whisper_model._model is not None
-            }
-
-            # Check disk space
-            import shutil
-            total, used, free = shutil.disk_usage("/")
-            disk_space = {
-                "total_gb": round(total / (1024**3), 2),
-                "used_gb": round(used / (1024**3), 2),
-                "free_gb": round(free / (1024**3), 2)
-            }
-
-            return jsonify({
-                "status": "healthy",
-                "models": models_status,
-                "disk_space": disk_space,
-                "version": "1.0.0"
-            })
-        except Exception as e:
-            logging.error(f"Health check failed: {str(e)}", exc_info=True)
-            return jsonify({
-                "status": "unhealthy",
-                "error": str(e)
-            }), 500
-
-    @app.route("/api/extract_medical_data", methods=["POST"])
-    def extract_medical_data():
-        try:
-            data = request.json
-            if not data or "text" not in data:
-                return jsonify({"error": "No text provided"}), 400
-
-            text = data["text"]
-            if not text.strip():
-                return jsonify({"error": "Empty text provided"}), 400
-
-            # Extract medical data
-            result = MedicalDataExtractorAgent.extract_medical_data(text)
-            return jsonify(result)
-
-        except Exception as e:
-            logging.error(f"Error in extract_medical_data: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-
-    @app.route("/api/summary_creation", methods=["POST"])
-    def generate_summary():
-        try:
-            data = request.json
-            if not data or "text" not in data:
-                return jsonify({"error": "No text provided"}), 400
-
-            text = data["text"]
-            if not text.strip():
-                return jsonify({"error": "Empty text provided"}), 400
-
-            # Generate summary
-            summary = SummarizerAgent.generate_summary(text)
-            return jsonify({"summary": summary})
-
-        except Exception as e:
-            logging.error(f"Error in generate_summary: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-
-    @app.route("/api/voice_to_text_extraction", methods=["POST"])
-    def transcribe_audio():
-        try:
-            if "audio" not in request.files:
-                return jsonify({"error": "No audio file provided"}), 400
-
-            audio_file = request.files["audio"]
-            if audio_file.filename == "":
-                return jsonify({"error": "No selected audio file"}), 400
-
-            # Validate file extension
-            if not allowed_file(audio_file.filename):
-                return jsonify({
-                    "error": "Unsupported audio format. Allowed formats: wav, mp3, m4a, ogg"
-                }), 400
-
-            # Check file size
-            valid_size, error_message = check_file_size(audio_file)
-            if not valid_size:
-                return jsonify({"error": error_message}), 400
-
-            # Save audio file temporarily
-            temp_dir = os.path.join(tempfile.gettempdir(), 'audio_uploads')
-            os.makedirs(temp_dir, exist_ok=True)
-            temp_filename = f"{uuid.uuid4()}_{secure_filename(audio_file.filename)}"
-            temp_path = os.path.join(temp_dir, temp_filename)
-
-            try:
-                audio_file.save(temp_path)
-                # Transcribe audio
-                result = whisper_model.transcribe(temp_path)
-                return jsonify({"text": result["text"]})
-            finally:
-                # Clean up temporary file
-                if os.path.exists(temp_path):
-                    os.remove(temp_path)
-
-        except Exception as e:
-            logging.error(f"Error in transcribe_audio: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
-
-    @app.route("/api/audio_to_chart", methods=["POST"])
-    def audio_to_chart():
-        try:
-            if "audio" not in request.files:
-                return jsonify({"error": "No audio file provided"}), 400
-
-            audio_file = request.files["audio"]
-            if audio_file.filename == "":
-                return jsonify({"error": "No selected audio file"}), 400
-
-            # Validate file extension
-            if not allowed_file(audio_file.filename):
-                return jsonify({
-                    "error": "Unsupported audio format. Allowed formats: wav, mp3, m4a, ogg"
-                }), 400
-
-            # Check file size
-            valid_size, error_message = check_file_size(audio_file)
-            if not valid_size:
-                return jsonify({"error": error_message}), 400
-
-            # Save audio file temporarily
-            temp_dir = os.path.join(tempfile.gettempdir(), 'audio_uploads')
-            os.makedirs(temp_dir, exist_ok=True)
-            temp_filename = f"{uuid.uuid4()}_{secure_filename(audio_file.filename)}"
-            temp_path = os.path.join(temp_dir, temp_filename)
-
-            try:
-                audio_file.save(temp_path)
-                # Transcribe audio
-                transcription = whisper_model.transcribe(temp_path)
-                transcribed_text = transcription["text"]
-
-                # Extract medical data from transcription
-                medical_data = MedicalDataExtractorAgent.extract_medical_data(transcribed_text)
-
-                # Generate summary
-                summary = SummarizerAgent.generate_summary(transcribed_text)
-
-                return jsonify({
-                    "transcription": transcribed_text,
-                    "medical_data": medical_data,
-                    "summary": summary
-                })
-
-            finally:
-                # Clean up temporary file
-                if os.path.exists(temp_path):
-                    os.remove(temp_path)
-
-        except Exception as e:
-            logging.error(f"Error in audio_to_chart: {str(e)}", exc_info=True)
-            return jsonify({"error": str(e)}), 500
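Review note: both removed audio routes repeat the same save-to-temp, process, delete-in-`finally` sequence. If these routes are reintroduced, that sequence could be factored into one helper. The sketch below is a hypothetical illustration only: `temporary_upload` is not in the repository, it takes raw bytes instead of a Werkzeug file object, and it uses `os.path.basename` as a stand-in for `secure_filename`, which sanitizes names more thoroughly.

```python
import os
import tempfile
import uuid
from contextlib import contextmanager


@contextmanager
def temporary_upload(data: bytes, original_name: str):
    """Write uploaded bytes to a unique temp path and always delete it afterwards."""
    temp_dir = os.path.join(tempfile.gettempdir(), "audio_uploads")
    os.makedirs(temp_dir, exist_ok=True)
    # Keep only the base name so a crafted filename cannot escape temp_dir.
    safe_name = os.path.basename(original_name) or "upload"
    temp_path = os.path.join(temp_dir, f"{uuid.uuid4()}_{safe_name}")
    try:
        with open(temp_path, "wb") as handle:
            handle.write(data)
        yield temp_path
    finally:
        # Runs on success and on error, mirroring the try/finally in the routes.
        if os.path.exists(temp_path):
            os.remove(temp_path)


if __name__ == "__main__":
    with temporary_upload(b"placeholder audio bytes", "recording.wav") as path:
        print("saved", os.path.getsize(path), "bytes at", path)
    print("still exists after block:", os.path.exists(path))
```

A route body would then call the transcription step inside the `with` block, so cleanup no longer has to be repeated per endpoint.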
ai_med_extract/app.py
CHANGED
@@ -10,7 +10,6 @@ from .agents.phi_scrubber import MedicalTextUtils
 from .agents.summarizer import SummarizerAgent
 from .agents.medical_data_extractor import MedicalDataExtractorAgent
 from .agents.medical_data_extractor import MedicalDocDataExtractorAgent
-import torch
 
 
 # Load environment variables
@@ -27,16 +26,7 @@ logging.basicConfig(
 )
 
 app = Flask(__name__)
-
-# Configure CORS
-CORS(app, resources={
-    r"/api/*": {
-        "origins": ["*"], # Allow all origins in development
-        "methods": ["GET", "POST", "OPTIONS"],
-        "allow_headers": ["Content-Type", "Authorization"],
-        "max_age": 3600
-    }
-})
+CORS(app)
 
 # Configure upload directory
 UPLOAD_DIR = '/data/uploads'
@@ -70,37 +60,13 @@ class LazyModelLoader:
         if self._model is None:
             try:
                 logging.info(f"Loading {self.model_name}...")
-
-
-
-
-
-
-
-                        cache_dir=os.environ.get('TRANSFORMERS_CACHE', '/tmp/huggingface')
-                    )
-                    model = AutoModelForCausalLM.from_pretrained(
-                        self.model_name,
-                        trust_remote_code=True,
-                        device_map="auto",
-                        torch_dtype=torch.float16,
-                        cache_dir=os.environ.get('TRANSFORMERS_CACHE', '/tmp/huggingface')
-                    )
-                    self._model = pipeline(
-                        task=self.model_type,
-                        model=model,
-                        tokenizer=tokenizer,
-                        device_map="auto",
-                        torch_dtype=torch.float16
-                    )
-                else:
-                    self._model = pipeline(
-                        task=self.model_type,
-                        model=self.model_name,
-                        trust_remote_code=True,
-                        device_map="auto",
-                        low_cpu_mem_usage=True
-                    )
+                self._model = pipeline(
+                    task=self.model_type,
+                    model=self.model_name,
+                    trust_remote_code=True,
+                    device_map="auto",
+                    low_cpu_mem_usage=True
+                )
                 logging.info(f"Successfully loaded {self.model_name}")
             except Exception as e:
                 if self.fallback_model:
@@ -145,7 +111,7 @@ class WhisperModelLoader:
 try:
     # Use smaller models for Hugging Face Spaces
     medalpaca_model_loader = LazyModelLoader(
-        "
+        "medalpaca/medalpaca-7b", # Smaller model
         "text-generation",
         fallback_model="facebook/bart-base" # Fallback model
     )
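Review note: the `LazyModelLoader` hunk above shows a load-on-first-use pattern with a `fallback_model` branch whose body lies outside the diff. The sketch below illustrates that general pattern without any model libraries. `LazyLoader`, its `build` callable, and the retry-once-with-fallback behaviour are assumptions made for illustration; the repository's class calls `transformers.pipeline` directly, and its actual fallback logic is not shown in this commit.

```python
import logging
from typing import Any, Callable, Optional


class LazyLoader:
    """Hypothetical stand-in for LazyModelLoader: build the model on first use."""

    def __init__(self, name: str, build: Callable[[str], Any],
                 fallback: Optional[str] = None):
        self.name = name
        self.fallback = fallback
        self._build = build  # e.g. a function wrapping transformers.pipeline
        self._model = None

    def load(self) -> Any:
        if self._model is None:
            try:
                logging.info("Loading %s...", self.name)
                self._model = self._build(self.name)
            except Exception:
                if not self.fallback:
                    raise
                # Assumed behaviour: retry once with the fallback model name.
                logging.warning("Falling back to %s", self.fallback)
                self._model = self._build(self.fallback)
        return self._model
```

Injecting the builder keeps the caching and fallback logic testable without downloading a model.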
ai_med_extract/utils/file_utils.py
CHANGED
@@ -5,72 +5,63 @@ import logging
 from werkzeug.utils import secure_filename
 from flask import current_app
 
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-# Allowed file extensions
-ALLOWED_EXTENSIONS = {
-    'pdf': ['pdf'],
-    'image': ['png', 'jpg', 'jpeg', 'gif', 'bmp', 'tiff'],
-    'audio': ['wav', 'mp3', 'm4a', 'ogg'],
-    'document': ['doc', 'docx', 'txt', 'rtf']
-}
-
+ALLOWED_EXTENSIONS = {"pdf", "jpg", "jpeg", "png", "svg", "docx", "doc", "xlsx", "xls", "wav", "mp3", "m4a", "ogg"}
 MAX_SIZE_PDF_DOCS = 1 * 1024 * 1024 * 1024 # 1GB
 MAX_SIZE_IMAGES = 500 * 1024 * 1024 # 500MB
 MAX_SIZE_AUDIO = 100 * 1024 * 1024 # 100MB
 
 
-def allowed_file(filename
-    ""
-    if not filename:
-        return False
-    return '.' in filename and \
-           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS.get(file_type, [])
+def allowed_file(filename):
+    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
 
 
-def check_file_size(file
-    """Check if the file size is within limits"""
+def check_file_size(file):
     try:
+        # Store current position
+        current_pos = file.tell()
+
+        # Check size
         file.seek(0, os.SEEK_END)
         size = file.tell()
-        file.seek(0) # Reset file pointer
 
-
-
-
+        # Return to original position
+        file.seek(current_pos)
+
+        extension = file.filename.rsplit('.', 1)[-1].lower()
+        if extension in {"pdf", "docx"} and size > MAX_SIZE_PDF_DOCS:
+            return False, f"File {file.filename} exceeds 1GB size limit"
+        elif extension in {"jpg", "jpeg", "png"} and size > MAX_SIZE_IMAGES:
+            return False, f"Image {file.filename} exceeds 500MB size limit"
+        elif extension in {"wav", "mp3", "m4a", "ogg"} and size > MAX_SIZE_AUDIO:
+            return False, f"Audio file {file.filename} exceeds 100MB size limit"
         return True, None
     except Exception as e:
-
-        return False, "Error checking file size"
+        logging.error(f"Error checking file size: {e}", exc_info=True)
+        return False, f"Error checking file size: {str(e)}"
 
 
 def save_data_to_storage(filename, data):
-    """Save extracted data to storage"""
     try:
-
-        os.
-
-
-
-
-
+        upload_folder = current_app.config.get("UPLOAD_FOLDER", "uploads")
+        if not os.path.exists(upload_folder):
+            os.makedirs(upload_folder, exist_ok=True)
+        filename = filename.rsplit(".", 1)[0]
+        filepath = os.path.join(upload_folder, f"{filename}.json")
+        with open(filepath, "w") as file:
+            json.dump(data, file)
     except Exception as e:
-
-        return False
+        logging.error(f"Exception during save: {e}")
 
 
 def get_data_from_storage(filename):
-    """Retrieve data from storage"""
     try:
-
-
-
-
-
-
-        return
+        upload_folder = current_app.config.get("UPLOAD_FOLDER", "uploads")
+        filepath = os.path.join(upload_folder, f"{filename}.json")
+        if not os.path.exists(filepath):
+            return None
+        with open(filepath, "r") as file:
+            data = json.load(file)
+            return data
     except Exception as e:
-
+        logging.error(f"Error loading data: {e}")
         return None
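Review note: one behavioural difference in `check_file_size` is easy to miss. The reverted version reset the stream with `file.seek(0)`, while the restored version records `file.tell()` first and seeks back to that position. The sketch below isolates that measure-without-moving-the-cursor step using `io.BytesIO` as a stand-in for an uploaded file. `stream_size` and `audio_within_limit` are hypothetical names, and only the `MAX_SIZE_AUDIO` constant is taken from the diff.

```python
import io
import os

MAX_SIZE_AUDIO = 100 * 1024 * 1024  # 100MB, same value as in file_utils.py


def stream_size(stream) -> int:
    """Return the total size of a seekable stream without moving its cursor."""
    current_pos = stream.tell()   # remember where the caller was
    stream.seek(0, os.SEEK_END)   # jump to the end to measure the size
    size = stream.tell()
    stream.seek(current_pos)      # restore the caller's position
    return size


def audio_within_limit(stream, limit: int = MAX_SIZE_AUDIO) -> bool:
    """Mirror the diff's `size > MAX_SIZE_AUDIO` rejection as a boolean check."""
    return stream_size(stream) <= limit


if __name__ == "__main__":
    upload = io.BytesIO(b"0123456789")
    upload.read(4)  # simulate a partially consumed upload
    print(stream_size(upload), upload.tell())
```

Restoring the original offset matters when a caller has already read part of the stream before the size check runs.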
requirements.txt
CHANGED
@@ -8,7 +8,7 @@ python-dotenv==1.0.1
 torch==2.1.0
 torchaudio==2.1.0
 torchvision==0.16.0
-transformers==4.
+transformers==4.36.2
 sentence-transformers==2.2.2
 scikit-learn==1.3.2
 numpy==1.24.3
@@ -16,9 +16,7 @@ pandas==2.1.4
 scipy==1.11.4
 accelerate==0.25.0
 
-
-modelscope==1.9.5
-qwen2==0.1.0
+
 
 # NLP
 spacy==3.7.2