How to Run Whisper in Node.js with Word-Level Timestamps

To run speech recognition with word-level timestamps in Node.js using OpenAI's Whisper model, you can use the nodejs-whisper package. It provides Node.js bindings for the Whisper model and supports word-level timestamps.

Key Takeaways

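- The nodejs-whisper package makes it straightforward to add word-level timestamps to a Node.js project.
- Word-level timestamps give subtitles and downstream analysis more precise timing.
- Whisper offers a range of models and output options, so you can match them to the needs of your project.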

Step-by-Step Guide

Install build tools:

Make sure the required build tools are installed on your system. On Debian-based systems, you can install them with the following commands.
```
sudo apt update
sudo apt install build-essential
```
On Windows, installing MinGW-w64 or MSYS2 is recommended. After installation, confirm that mingw32-make or make is available on your system PATH.

Install the nodejs-whisper package:

Install the package with npm.
```
npm install nodejs-whisper
```
Download a Whisper model:

After installing the package, download the Whisper model you want to use.
```
npx nodejs-whisper download
```
The available models are:
- tiny
- tiny.en
- base
- base.en
- small
- small.en
- medium
- medium.en
- large-v1
- large
- large-v3-turbo
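Choose the model that balances speed and accuracy for your requirements: smaller models such as tiny run faster, while larger models such as large are more accurate but slower. The .en variants are English-only.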
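
A typo in the model name is an easy mistake to make, so it can help to check the name against the list above before starting a transcription. The following is a minimal sketch; `MODELS`, `assertKnownModel`, and `isEnglishOnly` are hypothetical helper names for illustration and are not part of nodejs-whisper.

```javascript
// Model names copied from the list above.
const MODELS = [
  'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en',
  'medium', 'medium.en', 'large-v1', 'large', 'large-v3-turbo',
];

// Hypothetical helper (not a nodejs-whisper API): returns the name
// unchanged if it is in the list, otherwise throws a descriptive error.
function assertKnownModel(name) {
  if (!MODELS.includes(name)) {
    throw new Error(
      `Unknown model "${name}". Expected one of: ${MODELS.join(', ')}`
    );
  }
  return name;
}

// Hypothetical helper: the English-only variants end in ".en".
function isEnglishOnly(name) {
  return assertKnownModel(name).endsWith('.en');
}
```

For example, `assertKnownModel('base.en')` could be used to validate the value before passing it as `modelName`.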

Transcribe audio with word-level timestamps:

Create a JavaScript or TypeScript file (for example, transcribe.js) and add the following code.

```
const path = require('path');
const { nodewhisper } = require('nodejs-whisper');

// Provide the exact path to your audio file
const filePath = path.resolve(__dirname, 'YourAudioFileName.wav');

(async () => {
  await nodewhisper(filePath, {
    modelName: 'base.en', // Specify the downloaded model name
    autoDownloadModelName: 'base.en', // (Optional) Auto-download the model if not present
    removeWavFileAfterTranscription: false, // (Optional) Remove WAV file after transcription
    withCuda: false, // (Optional) Use CUDA for faster processing if available
    logger: console, // (Optional) Logging instance, defaults to console
    whisperOptions: {
      outputInCsv: false, // Output result in CSV file
      outputInJson: false, // Output result in JSON file
      outputInJsonFull: false, // Output result in JSON file with detailed information
      outputInLrc: false, // Output result in LRC file
      outputInSrt: true, // Output result in SRT file
      outputInText: false, // Output result in TXT file
      outputInVtt: false, // Output result in VTT file
      outputInWords: true, // Output result in WTS file for karaoke
      translateToEnglish: false, // Translate from source language to English
      wordTimestamps: true, // Enable word-level timestamps
      timestamps_length: 20, // Amount of dialogue per timestamp pair
      splitOnWord: true, // Split on word rather than on token
    },
  });
})();
```

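Replace 'YourAudioFileName.wav' with the path to your audio file. The script processes the audio and generates an SRT file containing word-level timestamps; with outputInWords enabled, it also writes a WTS file.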
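
To use the timestamps programmatically, the generated SRT text can be parsed into plain objects. The sketch below assumes standard SRT cues (an index line, a `start --> end` line, then text) and takes the SRT content as a string, since where the output file is written is not covered here. `srtTimeToSeconds` and `parseSrt` are hypothetical helper names, not part of nodejs-whisper.

```javascript
// Convert an SRT timestamp such as "00:00:01,250" to seconds.
function srtTimeToSeconds(stamp) {
  const match = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim());
  if (!match) {
    throw new Error(`Bad SRT timestamp: ${stamp}`);
  }
  const [, h, m, s, ms] = match;
  return Number(h) * 3600 + Number(m) * 60 + Number(s) + Number(ms) / 1000;
}

// Parse SRT text into an array of { index, start, end, text } objects.
// Blocks that do not look like a cue (no "-->" line) are skipped.
function parseSrt(srt) {
  return srt
    .replace(/\r\n/g, '\n')   // normalize Windows line endings
    .trim()
    .split(/\n{2,}/)          // cues are separated by blank lines
    .map((block) => block.split('\n'))
    .filter((lines) => lines.length >= 2 && lines[1].includes('-->'))
    .map((lines) => {
      const [startStamp, endStamp] = lines[1].split('-->');
      return {
        index: Number(lines[0]),
        start: srtTimeToSeconds(startStamp),
        end: srtTimeToSeconds(endStamp),
        text: lines.slice(2).join(' ').trim(),
      };
    });
}
```

With word-level timestamps enabled, each parsed cue would typically cover a word or a short run of words, so the `start` and `end` values can be used directly for highlighting or alignment. The file contents could be loaded with `fs.readFileSync(srtPath, 'utf8')`, where `srtPath` is whatever path the output was written to.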

Additional Notes

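- Audio format: The nodejs-whisper package automatically converts audio files to WAV at a 16000 Hz sample rate, the format the Whisper model expects.
- CUDA support: If you have a compatible NVIDIA GPU with CUDA installed, setting withCuda: true speeds up processing.
- Output formats: whisperOptions lets you choose among output formats including CSV, JSON, LRC, SRT, TXT, VTT, and WTS. Adjust these options to your needs.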
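
Because each output format is a separate boolean flag, a small helper can make the selection easier to read. The following is a minimal sketch that maps short format labels to the whisperOptions flag names used in the script above; the labels, `FORMAT_FLAGS`, and `buildOutputOptions` are hypothetical and not part of nodejs-whisper.

```javascript
// Short labels (hypothetical) mapped to the whisperOptions flag names
// that appear in the transcription script above.
const FORMAT_FLAGS = new Map([
  ['csv', 'outputInCsv'],
  ['json', 'outputInJson'],
  ['jsonFull', 'outputInJsonFull'],
  ['lrc', 'outputInLrc'],
  ['srt', 'outputInSrt'],
  ['txt', 'outputInText'],
  ['vtt', 'outputInVtt'],
  ['wts', 'outputInWords'],
]);

// Hypothetical helper: returns an object with every output flag set to
// false except the requested formats, which are set to true.
function buildOutputOptions(formats) {
  const options = {};
  for (const flag of FORMAT_FLAGS.values()) {
    options[flag] = false;
  }
  for (const format of formats) {
    const flag = FORMAT_FLAGS.get(format);
    if (!flag) {
      throw new Error(`Unsupported output format: ${format}`);
    }
    options[flag] = true;
  }
  return options;
}
```

The result could then be spread into the options object, for example `whisperOptions: { ...buildOutputOptions(['srt', 'wts']), wordTimestamps: true, splitOnWord: true }`.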