Methodology

FSP Finder runs a pipeline of machine learning models to analyze audio, transcribe speech, identify explicit content, and produce a clean version of the audio track. This multi-stage process is designed for accurate transcription and effective content moderation.

Key components

1. Audio source separation with Demucs

Before any transcription or analysis, the input audio file undergoes a crucial step: source separation. We use Demucs, a state-of-the-art deep learning model for music source separation. Specifically, the mdx_extra model is employed to effectively separate the vocal track from the instrumental track. This isolation of vocals is critical for accurate speech-to-text transcription and subsequent explicit content detection, as it minimizes interference from background music.
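As an illustrative sketch, this separation step can be driven from Python through Demucs's separate module; the input file and output directory below are placeholders rather than FSP Finder's actual call.

```python
# Sketch: separate vocals from accompaniment using the mdx_extra model.
# Assumes the demucs package is installed; "song.mp3" is a placeholder input.
import demucs.separate

demucs.separate.main([
    "--two-stems", "vocals",   # produce vocals.wav and no_vocals.wav
    "-n", "mdx_extra",         # the model named in this section
    "-o", "separated",         # output directory (illustrative choice)
    "song.mp3",
])
```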

2. Speech-to-text transcription with Whisper

Once the vocal track is isolated, it is fed into an adapted version of OpenAI's Whisper model for highly accurate speech-to-text transcription. Our system uses a fine-tuned Whisper model (whisper-medium.en) enhanced with LoRA (Low-Rank Adaptation) weights. The model was fine-tuned on timestamped vocal data from the DALI dataset; more information on the fine-tuning can be found on the auto-censoring GitHub page.
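A minimal sketch of loading such a model with Hugging Face Transformers and PEFT is shown below; the adapter path is a hypothetical placeholder, and FSP Finder's actual loading code may differ.

```python
# Sketch: load whisper-medium.en and apply LoRA adapter weights.
# "your-org/whisper-medium-en-dali-lora" is a hypothetical adapter path.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")
processor = WhisperProcessor.from_pretrained("openai/whisper-medium.en")
model = PeftModel.from_pretrained(base, "your-org/whisper-medium-en-dali-lora")
model.eval()  # ready for timestamped transcription of the isolated vocals
```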

We additionally include a second-pass approach to ensure that all vocals are transcribed. Any untranscribed gap in the vocal transcript that is detected to contain audio (an average level above -30 dB) is rerun through Whisper with a more generous compute budget. Any gaps that remain untranscribed after the second pass are flagged with a warning.
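As a rough sketch, the gap check can amount to measuring the average level of the untranscribed span; this example assumes pydub and millisecond gap boundaries, which may not match the internal representation.

```python
# Sketch: decide whether an untranscribed gap is worth a second Whisper pass.
# Assumes pydub; gap boundaries are given in milliseconds.
from pydub import AudioSegment

vocals = AudioSegment.from_file("vocals.wav")

def gap_needs_retranscription(start_ms: int, end_ms: int, threshold_db: float = -30.0) -> bool:
    """Return True if the gap's average level exceeds the -30 dB threshold."""
    gap = vocals[start_ms:end_ms]
    return gap.dBFS > threshold_db
```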

3. Voice activity detection (VAD) with Silero VAD

To optimize transcription and ensure that only relevant speech segments are processed, we integrate Silero VAD (Voice Activity Detection). This model accurately identifies periods of speech within the audio, allowing the Whisper model to focus its processing power only when actual speech is present. This not only speeds up the transcription process but also helps in accurately identifying and re-transcribing untranscribed gaps that might contain speech.
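A sketch of typical Silero VAD usage on the isolated vocal track, using its published torch.hub entry point (the file name is a placeholder):

```python
# Sketch: find speech segments in the isolated vocals with Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("vocals.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
# Each entry has 'start' and 'end' sample offsets that bound a speech region.
print(speech_timestamps)
```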

4. (Experimental) Explicit content detection with Gemma LLM

For the core task of identifying explicit or profane content, the transcribed text is analyzed by Gemma, Google's lightweight, state-of-the-art open large language model. We utilize a quantized version of the Gemma 9B instruction-tuned model. The LLM is prompted with specific instructions to act as an AI content moderator, identifying phrases that would be considered "indecent" or "profane" under broadcast standards, while also considering context to avoid false positives.
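For illustration only, a moderation prompt along these lines could be sent to a quantized Gemma checkpoint via Transformers; the model id, quantization settings, and prompt wording below are assumptions, not FSP Finder's production configuration.

```python
# Sketch: ask an instruction-tuned Gemma model to flag explicit phrases.
# Model id and 4-bit loading are illustrative; the production prompt differs.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

lyrics = "..."  # transcribed lines from Whisper
prompt = (
    "You are an AI content moderator for broadcast radio. "
    "List any phrases in the following lyrics that would be considered "
    "indecent or profane under broadcast standards, taking context into "
    "account to avoid false positives.\n\n" + lyrics
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```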

This portion of the tool is currently deactivated as we gather training data to better fine-tune the model for appropriate content moderation. Your use of this tool is an important step in building a more robust censoring system in the future.

5. Backup explicit content detection

Either in addition to or in lieu of the LLM-based detection, a robust backup censoring mechanism is in place. This logic specifically targets commonly known curse words and multi-word explicit phrases (see the list below). Users can add further words to be automatically detected through the advanced settings. This dual-layer approach, combining the nuanced understanding of a large language model with explicit keyword matching, ensures comprehensive coverage and reduces the chances of explicit content slipping through. A sketch of the matching logic follows the word list.

Below is the list of automatically flagged words (CONTENT WARNING: profanity, racial slurs, and other generally offensive content).

Any word that contains the following as a substring:
fuck, shit, piss, bitch, nigg, dyke, cock, faggot, cunt, tits, pussy, dick, asshole, whore, goddam, douche, chink, tranny, jizz, kike, gook, cocksucker

Or any of the following words exactly as they appear:
fag, cum, clit, wank, ho, hoes, hos

Two word phrases:
god damn, cock sucker, jerk off
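
A sketch of how this keyword logic can be expressed, assuming the transcript is available as a list of words; the word lists here are abbreviated to a few entries from the full lists above.

```python
# Sketch: flag words by substring, exact match, and two-word phrase rules.
# Word lists are abbreviated; the full lists appear above.
SUBSTRINGS = ("fuck", "shit", "bitch")          # matched anywhere inside a word
EXACT = {"fag", "cum", "ho", "hoes", "hos"}     # matched only as whole words
PHRASES = {("god", "damn"), ("cock", "sucker"), ("jerk", "off")}

def flag_indices(words: list[str]) -> set[int]:
    """Return the indices of transcript words that should be censored."""
    flagged = set()
    for i, word in enumerate(words):
        clean = word.lower().strip(".,!?'\"")
        if clean in EXACT or any(sub in clean for sub in SUBSTRINGS):
            flagged.add(i)
        if i + 1 < len(words):
            nxt = words[i + 1].lower().strip(".,!?'\"")
            if (clean, nxt) in PHRASES:
                flagged.update({i, i + 1})
    return flagged
```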

6. Audio censoring and reconstruction

Once explicit segments are identified, either by the model or the user, the system precisely silences those specific portions within the isolated vocal track. After censoring, the modified vocal track is then seamlessly recombined with the original instrumental track. The final output is an edited audio file where explicit content is removed, while preserving the integrity and quality of the original song. Metadata from the original audio is also transferred to the censored version.
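A sketch of this final step, assuming pydub, millisecond timestamps for the flagged segments, and the two-stem output names from the separation step; the actual implementation may differ.

```python
# Sketch: silence flagged spans in the vocals, then recombine with the instrumental.
# File names and the flagged segment list are placeholders.
from pydub import AudioSegment

vocals = AudioSegment.from_file("separated/vocals.wav")
instrumental = AudioSegment.from_file("separated/no_vocals.wav")
flagged_ms = [(61_200, 61_950), (95_400, 96_100)]  # (start, end) pairs to mute

for start, end in flagged_ms:
    silence = AudioSegment.silent(duration=end - start)
    vocals = vocals[:start] + silence + vocals[end:]

# Overlay the censored vocals back onto the untouched instrumental track.
censored = instrumental.overlay(vocals)
censored.export("song_censored.mp3", format="mp3")
```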
