Fine-tuned VGG-16 Model for Gunshot Detection
This is a fine-tuned VGG-16 model for detecting gunshots in audio recordings. The model was trained on a dataset of audio clips labeled as either "gunshot" or "background".
Model Details
- Fine-tuned by: Ranabir Saha
- Fine-tuned on: Tropical forest gunshot classification training audio dataset from "Automated detection of gunshots in tropical forests using convolutional neural networks" (Katsis et al. 2022)
- Dataset Source: https://doi.org/10.17632/x48cwz364j.3
- Input: 4-second .wav audio files, converted to 224x224x3 mel-spectrograms during preprocessing
- Output: Binary classification (Gunshot/Background)
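The usage example in this card loads each file at its native length, so clips that are not exactly 4 seconds long may need to be padded or trimmed first. A minimal sketch of one way to do that, assuming zero-padding at the end and truncation of anything past the target length. The helper name `fix_length_seconds` and the padding strategy are illustrative assumptions, not taken from the training script:

```python
import numpy as np


def fix_length_seconds(audio, sr, seconds=4.0):
    """Zero-pad or truncate a 1-D waveform to exactly `seconds` of audio.

    Hypothetical helper: the padding strategy is an assumption, not the
    documented training-time behaviour.
    """
    target = int(round(sr * seconds))
    audio = np.asarray(audio, dtype=np.float32)
    if audio.shape[0] >= target:
        # Keep only the first `target` samples
        return audio[:target]
    # Append zeros (silence) until the clip reaches the target length
    return np.pad(audio, (0, target - audio.shape[0]))
```

The result could then be passed to the mel-spectrogram step in place of the raw waveform. librosa's `librosa.util.fix_length` covers the same pad-or-trim case.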
Training
The model was trained using the following parameters:
- Base Model: VGG-16 pre-trained on ImageNet
- Optimizer: Adam (initial learning rate=1e-4, fine-tuning learning rate=1e-5)
- Loss Function: Categorical cross-entropy
- Metrics: Accuracy, Precision, Recall
- Batch Size: 32
- Initial Training: Up to 25 epochs with early stopping (patience=5) on validation loss
- Fine-tuning: Last 8 layers unfrozen, up to 10 epochs with early stopping (patience=5)
- Class Weights: Balanced to handle class imbalance
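The card does not say how the balanced class weights were computed. One common convention, used by scikit-learn's `class_weight="balanced"` option, weights each class by n_samples / (n_classes * class_count). A minimal sketch under that assumption; the function name and the label counts below are made up for illustration and do not describe the actual dataset:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} so that rarer classes get larger weights.

    Assumes the "balanced" heuristic n_samples / (n_classes * count);
    the training script may have used a different scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels only: 0 = gunshot (rare), 1 = background (common)
labels = [0] * 20 + [1] * 80
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape can be passed to Keras `model.fit` through its `class_weight` argument, so that errors on the rarer class contribute more to the loss.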
Usage
To use this model for inference, you can load it from the Hugging Face Hub and pass preprocessed mel-spectrograms generated from .wav files as input.
Example
import numpy as np
import tensorflow as tf
import librosa
from huggingface_hub import hf_hub_download
# Download the model
model_path = hf_hub_download(repo_id="ranvir-not-found/vgg16-sda_gunshot-detection", filename="vgg16_model.keras")
model = tf.keras.models.load_model(model_path)
# Function to load and preprocess .wav file
def load_and_preprocess_wav(file_path):
    # Load the audio at its native sample rate and compute a mel-spectrogram
    audio_data, sr = librosa.load(file_path, sr=None)
    mel_spectrogram = librosa.feature.melspectrogram(
        y=audio_data, sr=sr, n_mels=224, fmax=4000
    )
    mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
    # Normalize to the 0-255 range
    spec_min = np.min(mel_spectrogram)
    spec_max = np.max(mel_spectrogram)
    if spec_max > spec_min:
        mel_spectrogram = 255 * (mel_spectrogram - spec_min) / (spec_max - spec_min)
    else:
        mel_spectrogram = np.zeros_like(mel_spectrogram)
    mel_spectrogram = mel_spectrogram.astype(np.float32)
    # Resize to 224x224
    mel = tf.image.resize(mel_spectrogram[..., np.newaxis], (224, 224))
    # Repeat to create 3 channels
    mel = tf.repeat(mel, 3, axis=-1)
    # Apply VGG-16 preprocessing
    mel = tf.keras.applications.vgg16.preprocess_input(mel)
    return mel
# Example usage
wav_path = "path/to/your/audio.wav"
input_data = load_and_preprocess_wav(wav_path)
input_data = tf.expand_dims(input_data, axis=0) # Add batch dimension
predictions = model.predict(input_data)
class_names = ['gunshot', 'background']
predicted_class = class_names[np.argmax(predictions[0])]
print(f"Predicted class: {predicted_class}, Probabilities: {predictions[0]}")
Evaluation
The model was evaluated on a validation set, and the following metrics were computed:
- Classification Report (including Accuracy, Precision, Recall)
The model was optimized for recall on the 'gunshot' class.
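Recall on the 'gunshot' class is the fraction of true gunshot clips that the model flags, while precision is the fraction of flagged clips that really are gunshots. A minimal sketch of both computations; the function name and the toy labels are illustrative assumptions, not results from this model:

```python
def precision_recall(y_true, y_pred, positive="gunshot"):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed gunshots
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for illustration only
y_true = ["gunshot", "gunshot", "background", "background", "gunshot"]
y_pred = ["gunshot", "background", "gunshot", "background", "gunshot"]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Favouring recall means accepting more false alarms (lower precision) in exchange for fewer missed gunshots.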
For more details, please refer to the training script and logs.