Spaces:

kyboface
/

MyInfiniteTal


kyboface committed on Oct 30, 2025

Commit

0f32f16

verified ·

1 Parent(s): 7221420

Upload 13 files

Files changed (13)

LICENSE.txt +201 -0
README.md +387 -14
app.py +819 -0
generate_infinitetalk.py +663 -0
kokoro/__init__.py +23 -0
kokoro/__main__.py +148 -0
kokoro/custom_stft.py +197 -0
kokoro/istftnet.py +421 -0
kokoro/model.py +155 -0
kokoro/modules.py +183 -0
kokoro/pipeline.py +445 -0
requirements.txt +21 -0
setup.sh +28 -0

LICENSE.txt ADDED

	@@ -0,0 +1,201 @@

+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+   1. Definitions.
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+   END OF TERMS AND CONDITIONS
+   APPENDIX: How to apply the Apache License to your work.
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+   Copyright [yyyy] [name of copyright owner]
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+       http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

README.md CHANGED

@@ -1,14 +1,387 @@
----
-title: MyInfiniteTal
-emoji: 👀
-colorFrom: gray
-colorTo: yellow
-sdk: gradio
-sdk_version: 5.49.1
-app_file: app.py
-pinned: false
-license: apache-2.0
-short_description: check infiniteTak
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+<div align="center">
+<p align="center">
+  <img src="assets/logo2.jpg" alt="InfiniteTalk" width="440"/>
+</p>
+<h1>InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing</h1>
+[Shaoshu Yang*](https://scholar.google.com/citations?user=JrdZbTsAAAAJ&hl=en) · [Zhe Kong*](https://scholar.google.com/citations?user=4X3yLwsAAAAJ&hl=zh-CN) · [Feng Gao*](https://scholar.google.com/citations?user=lFkCeoYAAAAJ) · [Meng Cheng*]() · [Xiangyu Liu*]() · [Yong Zhang](https://yzhang2016.github.io/)<sup>&#9993;</sup> · [Zhuoliang Kang](https://scholar.google.com/citations?user=W1ZXjMkAAAAJ&hl=en)
+[Wenhan Luo](https://whluo.github.io/) · [Xunliang Cai](https://openreview.net/profile?id=~Xunliang_Cai1) · [Ran He](https://scholar.google.com/citations?user=ayrg9AUAAAAJ&hl=en)· [Xiaoming Wei](https://scholar.google.com/citations?user=JXV5yrZxj5MC&hl=zh-CN)
+<sup>*</sup>Equal Contribution
+<sup>&#9993;</sup>Corresponding Authors
+<a href='https://meigen-ai.github.io/InfiniteTalk/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
+<a href='https://arxiv.org/abs/2508.14033'><img src='https://img.shields.io/badge/Technique-Report-red'></a>
+<a href='https://huggingface.co/MeiGen-AI/InfiniteTalk'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
+</div>
+> **TL; DR:**  InfiniteTalk is an unlimited-length talking video generation model that supports both audio-driven video-to-video and image-to-video generation
+<p align="center">
+  <img src="assets/pipeline.png">
+</p>
+## 🔥 Latest News
+* August 19, 2025: We release the [Technique-Report](https://arxiv.org/abs/2508.14033), weights, and code of **InfiniteTalk**. The Gradio demo and the [ComfyUI](https://github.com/MeiGen-AI/InfiniteTalk/tree/comfyui) branch have also been released.
+* August 19, 2025: We release the [project page](https://meigen-ai.github.io/InfiniteTalk/) of **InfiniteTalk**.
+## ✨ Key Features
+We propose **InfiniteTalk**, a novel sparse-frame video dubbing framework. Given an input video and audio track, InfiniteTalk synthesizes a new video with accurate lip synchronization while simultaneously aligning head movements, body posture, and facial expressions with the audio. Unlike traditional dubbing methods that focus solely on lips, InfiniteTalk enables infinite-length video generation with accurate lip synchronization and consistent identity preservation. Besides, InfiniteTalk can also be used as an image-audio-to-video model that takes an image and an audio clip as input.
+- 💬 Sparse-frame Video Dubbing – Synchronizes not only lips, but also head, body, and expressions
+- ⏱️ Infinite-Length Generation – Supports unlimited video duration
+- ✨ Stability – Reduces hand/body distortions compared to MultiTalk
+- 🚀 Lip Accuracy – Achieves superior lip synchronization to MultiTalk
+## 🌐 Community Works
+- [Wan2GP](https://github.com/deepbeepmeep/Wan2GP/): Thanks to [deepbeepmeep](https://github.com/deepbeepmeep) for integrating InfiniteTalk into Wan2GP, which is optimized for low VRAM and offers many video editing options and other models (MMaudio support, Qwen Image Edit, ...).
+- [ComfyUI](https://github.com/kijai/ComfyUI-WanVideoWrapper): Thanks to [kijai](https://github.com/kijai) for the ComfyUI support.
+## 📑 Todo List
+- [x] Release the technical report
+- [x] Inference
+- [x] Checkpoints
+- [x] Multi-GPU Inference
+- [ ] Inference acceleration
+  - [x] TeaCache
+  - [x] int8 quantization
+  - [ ] LCM distillation
+  - [ ] Sparse Attention
+- [x] Run with very low VRAM
+- [x] Gradio demo
+- [x] ComfyUI
+## Video Demos
+### Video-to-video (HQ videos can be found on [Google Drive](https://drive.google.com/drive/folders/1BNrH6GJZ2Wt5gBuNLmfXZ6kpqb9xFPjU?usp=sharing) )
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/04f15986-8de7-4bb4-8cde-7f7f38244f9f" width="320" controls loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/1500f72e-a096-42e5-8b44-f887fa8ae7cb" width="320" controls loop></video>
+     </td>
+     <td>
+          <video src="https://github.com/user-attachments/assets/28f484c2-87dc-4828-a9e7-cb963da92d14" width="320" controls loop></video>
+     </td>
+     <td>
+          <video src="https://github.com/user-attachments/assets/665fabe4-3e24-4008-a0a2-a66e2e57c38b" width="320" controls loop></video>
+     </td>
+  </tr>
+</table>
+### Image-to-video
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/7e4a4dad-9666-4896-8684-2acb36aead59" width="320" controls loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/bd6da665-f34d-4634-ae94-b4978f92ad3a" width="320" controls loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/510e2648-82db-4648-aaf3-6542303dbe22" width="320" controls loop></video>
+     </td>
+     <td>
+          <video src="https://github.com/user-attachments/assets/27bb087b-866a-4300-8a03-3bbb4ce3ddf9" width="320" controls loop></video>
+     </td>
+  </tr>
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/3263c5e1-9f98-4b9b-8688-b3e497460a76" width="320" controls loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/5ff3607f-90ec-4eee-b964-9d5ee3028005" width="320" controls loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/e504417b-c8c7-4cf0-9afa-da0f3cbf3726" width="320" controls loop></video>
+     </td>
+     <td>
+          <video src="https://github.com/user-attachments/assets/56aac91e-c51f-4d44-b80d-7d115e94ead7" width="320" controls loop></video>
+     </td>
+  </tr>
+</table>
+## Quick Start
+### 🛠️Installation
+#### 1. Create a conda environment and install pytorch, xformers
+```
+conda create -n multitalk python=3.10
+conda activate multitalk
+pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
+pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
+```
+#### 2. Flash-attn installation:
+```
+pip install misaki[en]
+pip install ninja
+pip install psutil
+pip install packaging
+pip install wheel
+pip install flash_attn==2.7.4.post1
+```
+#### 3. Other dependencies
+```
+pip install -r requirements.txt
+conda install -c conda-forge librosa
+```
+#### 4. FFmpeg installation
+```
+conda install -c conda-forge ffmpeg
+```
+or
+```
+sudo yum install ffmpeg ffmpeg-devel
+```
+### 🧱Model Preparation
+#### 1. Model Download
+| Models | Download Link | Notes |
+|--------|---------------|-------|
+| Wan2.1-I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) | Base model |
+| chinese-wav2vec2-base | 🤗 [Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base) | Audio encoder |
+| MeiGen-InfiniteTalk | 🤗 [Huggingface](https://huggingface.co/MeiGen-AI/InfiniteTalk) | Our audio condition weights |
+Download models using huggingface-cli:
+``` sh
+huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
+huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
+huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
+huggingface-cli download MeiGen-AI/InfiniteTalk --local-dir ./weights/InfiniteTalk
+```
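Editor's note: the same downloads can be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` and `hf_hub_download` functions. The repository IDs, the `refs/pr/1` revision, and the target directories are copied from the commands above; the names `FULL_REPOS`, `SINGLE_FILES`, and `download_all` are hypothetical and not part of this repository.

```python
from pathlib import Path

# Root directory for all weights, matching the --local-dir values above.
WEIGHTS_ROOT = Path("./weights")

# (repo_id, local subdirectory) pairs for full-repository downloads.
FULL_REPOS = [
    ("Wan-AI/Wan2.1-I2V-14B-480P", "Wan2.1-I2V-14B-480P"),
    ("TencentGameMate/chinese-wav2vec2-base", "chinese-wav2vec2-base"),
    ("MeiGen-AI/InfiniteTalk", "InfiniteTalk"),
]

# (repo_id, filename, revision, local subdirectory) for single-file downloads.
SINGLE_FILES = [
    ("TencentGameMate/chinese-wav2vec2-base", "model.safetensors",
     "refs/pr/1", "chinese-wav2vec2-base"),
]


def download_all(root: Path = WEIGHTS_ROOT) -> None:
    """Fetch every checkpoint listed above (needs network access)."""
    # Imported lazily so the lists above can be inspected without the package.
    from huggingface_hub import hf_hub_download, snapshot_download

    for repo_id, subdir in FULL_REPOS:
        snapshot_download(repo_id=repo_id, local_dir=str(root / subdir))
    for repo_id, filename, revision, subdir in SINGLE_FILES:
        hf_hub_download(repo_id=repo_id, filename=filename,
                        revision=revision, local_dir=str(root / subdir))


if __name__ == "__main__":
    download_all()
```

Running the script downloads the full repositories first and then the `model.safetensors` file from the pull-request revision, so the single file lands in the already populated `chinese-wav2vec2-base` directory.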
+### 🔑 Quick Inference
+Our model is compatible with both 480P and 720P resolutions.
+> Some tips
+> - Lip synchronization accuracy: Audio CFG works optimally between 3–5. Increase the audio CFG value for better synchronization.
+> - FusionX: While it enables faster inference and higher quality, FusionX LoRA exacerbates color shift in videos longer than 1 minute and reduces ID preservation.
+> - V2V generation: Enables unlimited length generation. The model mimics the original video's camera movement, though not identically. Using SDEdit improves camera movement accuracy significantly but introduces color shift and is best suited for short clips. Improvements for long video camera control are planned.
+> - I2V generation: Generates good results from a single image for up to 1 minute. Beyond 1 minute, color shifts become more pronounced. One trick for high-quality generation beyond 1 minute is to turn the image into a video by translating or zooming the image. Here is a script to [convert image to video](https://github.com/MeiGen-AI/InfiniteTalk/blob/main/tools/convert_img_to_video.py).
+> - Quantization model: If your inference process is killed due to insufficient memory, we suggest using the quantization model, which can help **reduce memory usage**.
+#### Usage of InfiniteTalk
+```
+--mode streaming: long video generation.
+--mode clip: generate short video with one chunk.
+--use_teacache: run with TeaCache.
+--size infinitetalk-480: generate 480P video.
+--size infinitetalk-720: generate 720P video.
+--use_apg: run with APG.
+--teacache_thresh: A coefficient used for TeaCache acceleration
+--sample_text_guide_scale: When not using LoRA, the optimal value is 5. After applying LoRA, the recommended value is 1.
+--sample_audio_guide_scale: When not using LoRA, the optimal value is 4. After applying LoRA, the recommended value is 2.
+--max_frame_num: The maximum number of frames in the generated video; the default is 1000 frames (40 seconds).
+```
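Editor's note: the guidance-scale recommendations and the frame limit in the option list above can be captured in a few lines of Python. This is a sketch only. The 25 fps rate is an assumption inferred from "1000 frames = 40 seconds" and is not stated explicitly in the README, and the helper names are hypothetical.

```python
# Assumption: 25 frames per second, inferred from 1000 frames == 40 seconds.
ASSUMED_FPS = 25


def frames_to_seconds(frame_count: int, fps: int = ASSUMED_FPS) -> float:
    """Duration in seconds of a clip with the given number of frames."""
    return frame_count / fps


def seconds_to_frames(seconds: float, fps: int = ASSUMED_FPS) -> int:
    """Value to pass as --max_frame_num for a target duration."""
    return int(round(seconds * fps))


def recommended_guide_scales(use_lora: bool) -> dict:
    """Text/audio guidance scales recommended in the option list above."""
    if use_lora:
        return {"sample_text_guide_scale": 1.0,
                "sample_audio_guide_scale": 2.0}
    return {"sample_text_guide_scale": 5.0,
            "sample_audio_guide_scale": 4.0}


if __name__ == "__main__":
    print(frames_to_seconds(1000))
    print(seconds_to_frames(60))
    print(recommended_guide_scales(use_lora=True))
```

Under the 25 fps assumption, a one-minute target corresponds to `--max_frame_num 1500`.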
+#### 1. Inference
+##### 1) Run with single GPU
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --input_json examples/single_example_image.json \
+    --size infinitetalk-480 \
+    --sample_steps 40 \
+    --mode streaming \
+    --motion_frame 9 \
+    --save_file infinitetalk_res
+```
+##### 2) Run with 720P
+If you want to run at 720P, set `--size infinitetalk-720`:
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --input_json examples/single_example_image.json \
+    --size infinitetalk-720 \
+    --sample_steps 40 \
+    --mode streaming \
+    --motion_frame 9 \
+    --save_file infinitetalk_res_720p
+```
+##### 3) Run with very low VRAM
+If you want to run with very low VRAM, set `--num_persistent_param_in_dit 0`:
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --input_json examples/single_example_image.json \
+    --size infinitetalk-480 \
+    --sample_steps 40 \
+    --num_persistent_param_in_dit 0 \
+    --mode streaming \
+    --motion_frame 9 \
+    --save_file infinitetalk_res_lowvram
+```
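Editor's note: the three single-GPU commands above differ only in `--size`, `--save_file`, and the optional `--num_persistent_param_in_dit 0`. A small Python wrapper can build the argument list from those choices. This is a hedged sketch: `build_command` is a hypothetical helper, not part of this repository, and every flag and path in it is copied from the commands above.

```python
import shlex

# Sizes named in the README.
VALID_SIZES = ("infinitetalk-480", "infinitetalk-720")


def build_command(size: str = "infinitetalk-480",
                  low_vram: bool = False,
                  save_file: str = "infinitetalk_res") -> list:
    """Return the argv list for a single-GPU generate_infinitetalk.py run."""
    if size not in VALID_SIZES:
        raise ValueError(f"Unsupported size: {size}")
    argv = [
        "python", "generate_infinitetalk.py",
        "--ckpt_dir", "weights/Wan2.1-I2V-14B-480P",
        "--wav2vec_dir", "weights/chinese-wav2vec2-base",
        "--infinitetalk_dir",
        "weights/InfiniteTalk/single/infinitetalk.safetensors",
        "--input_json", "examples/single_example_image.json",
        "--size", size,
        "--sample_steps", "40",
        "--mode", "streaming",
        "--motion_frame", "9",
        "--save_file", save_file,
    ]
    if low_vram:
        # Keep no DiT parameters resident on the GPU, as in variant 3).
        argv += ["--num_persistent_param_in_dit", "0"]
    return argv


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(build_command(size="infinitetalk-720",
                                   save_file="infinitetalk_res_720p")))
```

Building a list rather than a shell string avoids quoting problems when paths contain spaces.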
+##### 4) Multi-GPU inference
+```
+GPU_NUM=8
+torchrun --nproc_per_node=$GPU_NUM --standalone generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --dit_fsdp --t5_fsdp \
+    --ulysses_size=$GPU_NUM \
+    --input_json examples/single_example_image.json \
+    --size infinitetalk-480 \
+    --sample_steps 40 \
+    --mode streaming \
+    --motion_frame 9 \
+    --save_file infinitetalk_res_multigpu
+```
+##### 5) Multi-Person animation
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
+    --input_json examples/multi_example_image.json \
+    --size infinitetalk-480 \
+    --sample_steps 40 \
+    --num_persistent_param_in_dit 0 \
+    --mode streaming \
+    --motion_frame 9 \
+    --save_file infinitetalk_res_multiperson
+```
+#### 2. Run with FusioniX or Lightx2v (requires only 4~8 steps)
+[FusioniX](https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/FusionX_LoRa/Wan2.1_I2V_14B_FusionX_LoRA.safetensors) requires 8 steps and [lightx2v](https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors) requires only 4 steps.
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
+    --input_json examples/single_example_image.json \
+    --lora_scale 1.0 \
+    --size infinitetalk-480 \
+    --sample_text_guide_scale 1.0 \
+    --sample_audio_guide_scale 2.0 \
+    --sample_steps 8 \
+    --mode streaming \
+    --motion_frame 9 \
+    --sample_shift 2 \
+    --num_persistent_param_in_dit 0 \
+    --save_file infinitetalk_res_lora
+```
+#### 3. Run with the quantization model (only supports running on a single GPU)
+```
+python generate_infinitetalk.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --input_json examples/single_example_image.json \
+    --size infinitetalk-480 \
+    --sample_steps 40 \
+    --mode streaming \
+    --quant fp8 \
+    --quant_dir weights/InfiniteTalk/quant_models/infinitetalk_single_fp8.safetensors \
+    --motion_frame 9 \
+    --num_persistent_param_in_dit 0 \
+    --save_file infinitetalk_res_quant
+```
+#### 4. Run with Gradio
+```
+python app.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
+    --num_persistent_param_in_dit 0 \
+    --motion_frame 9
+```
+or
+```
+python app.py \
+    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
+    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
+    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
+    --num_persistent_param_in_dit 0 \
+    --motion_frame 9
+```
+## 📚 Citation
+If you find our work useful in your research, please consider citing:
+```
+@misc{yang2025infinitetalkaudiodrivenvideogeneration,
+      title={InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing},
+      author={Shaoshu Yang and Zhe Kong and Feng Gao and Meng Cheng and Xiangyu Liu and Yong Zhang and Zhuoliang Kang and Wenhan Luo and Xunliang Cai and Ran He and Xiaoming Wei},
+      year={2025},
+      eprint={2508.14033},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2508.14033},
+}
+```
+## 📜 License
+The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content,
+granting you the freedom to use them while ensuring that your usage complies with the provisions of this license.
+You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws,
+causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.

app.py ADDED

	@@ -0,0 +1,819 @@

+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+import argparse
+import logging
+import os
+os.environ["no_proxy"] = "localhost,127.0.0.1,::1"
+import sys
+import json
+import warnings
+from datetime import datetime
+import gradio as gr
+warnings.filterwarnings('ignore')
+import random
+import torch
+import torch.distributed as dist
+from PIL import Image
+import subprocess
+import wan
+from wan.configs import SIZE_CONFIGS, SUPPORTED_SIZES, WAN_CONFIGS
+from wan.utils.utils import cache_image, cache_video, str2bool
+from wan.utils.multitalk_utils import save_video_ffmpeg
+from kokoro import KPipeline
+from transformers import Wav2Vec2FeatureExtractor
+from src.audio_analysis.wav2vec2 import Wav2Vec2Model
+import librosa
+import pyloudnorm as pyln
+import numpy as np
+from einops import rearrange
+import soundfile as sf
+import re
+def _validate_args(args):
+    # Basic check
+    assert args.ckpt_dir is not None, "Please specify the checkpoint directory."
+    assert args.task in WAN_CONFIGS, f"Unsupported task: {args.task}"
+    # Default to 40 sampling steps when none are specified.
+    if args.sample_steps is None:
+        args.sample_steps = 40
+    if args.sample_shift is None:
+        if args.size == 'infinitetalk-480':
+            args.sample_shift = 7
+        elif args.size == 'infinitetalk-720':
+            args.sample_shift = 11
+        else:
+            raise NotImplementedError(f'Unsupported size: {args.size}')
+    args.base_seed = args.base_seed if args.base_seed >= 0 else random.randint(
+        0, 99999999)
+    # Size check
+    assert args.size in SUPPORTED_SIZES[args.task], (
+        f"Unsupported size {args.size} for task {args.task}, "
+        f"supported sizes are: {', '.join(SUPPORTED_SIZES[args.task])}")
+def _parse_args():
+    parser = argparse.ArgumentParser(
+        description="Generate an image or video from a text prompt or image using Wan"
+    )
+    parser.add_argument(
+        "--task",
+        type=str,
+        default="infinitetalk-14B",
+        choices=list(WAN_CONFIGS.keys()),
+        help="The task to run.")
+    parser.add_argument(
+        "--size",
+        type=str,
+        default="infinitetalk-480",
+        choices=list(SIZE_CONFIGS.keys()),
+        help="The bucket size of the generated video. The aspect ratio of the output video will follow that of the input image."
+    )
+    parser.add_argument(
+        "--frame_num",
+        type=int,
+        default=81,
+        help="How many frames to be generated in one clip. The number should be 4n+1"
+    )
+    parser.add_argument(
+        "--ckpt_dir",
+        type=str,
+        default='./weights/Wan2.1-I2V-14B-480P',
+        help="The path to the Wan checkpoint directory.")
+    parser.add_argument(
+        "--quant_dir",
+        type=str,
+        default=None,
+        help="The path to the Wan quant checkpoint directory.")
+    parser.add_argument(
+        "--infinitetalk_dir",
+        type=str,
+        default='weights/InfiniteTalk/single/infinitetalk.safetensors',
+        help="The path to the InfiniteTalk checkpoint file.")
+    parser.add_argument(
+        "--wav2vec_dir",
+        type=str,
+        default='./weights/chinese-wav2vec2-base',
+        help="The path to the wav2vec checkpoint directory.")
+    parser.add_argument(
+        "--dit_path",
+        type=str,
+        default=None,
+        help="The path to the Wan checkpoint directory.")
+    parser.add_argument(
+        "--lora_dir",
+        type=str,
+        nargs='+',
+        default=None,
+        help="The path to the LoRA checkpoint directory.")
+    parser.add_argument(
+        "--lora_scale",
+        type=float,
+        nargs='+',
+        default=[1.2],
+        help="Controls how much to influence the outputs with the LoRA parameters. Accepts multiple float values."
+    )
+    parser.add_argument(
+        "--offload_model",
+        type=str2bool,
+        default=None,
+        help="Whether to offload the model to CPU after each model forward, reducing GPU memory usage."
+    )
+    parser.add_argument(
+        "--ulysses_size",
+        type=int,
+        default=1,
+        help="The size of the ulysses parallelism in DiT.")
+    parser.add_argument(
+        "--ring_size",
+        type=int,
+        default=1,
+        help="The size of the ring attention parallelism in DiT.")
+    parser.add_argument(
+        "--t5_fsdp",
+        action="store_true",
+        default=False,
+        help="Whether to use FSDP for T5.")
+    parser.add_argument(
+        "--t5_cpu",
+        action="store_true",
+        default=False,
+        help="Whether to place T5 model on CPU.")
+    parser.add_argument(
+        "--dit_fsdp",
+        action="store_true",
+        default=False,
+        help="Whether to use FSDP for DiT.")
+    parser.add_argument(
+        "--save_file",
+        type=str,
+        default=None,
+        help="The file to save the generated image or video to.")
+    parser.add_argument(
+        "--audio_save_dir",
+        type=str,
+        default='save_audio/gradio',
+        help="The path to save the audio embedding.")
+    parser.add_argument(
+        "--base_seed",
+        type=int,
+        default=42,
+        help="The seed to use for generating the image or video.")
+    parser.add_argument(
+        "--input_json",
+        type=str,
+        default='examples.json',
+        help="[meta file] Path to the JSON file that specifies the conditions for generating the video.")
+    parser.add_argument(
+        "--motion_frame",
+        type=int,
+        default=9,
+        help="Driven frame length used in long video generation mode.")
+    parser.add_argument(
+        "--mode",
+        type=str,
+        default="streaming",
+        choices=['clip', 'streaming'],
+        help="clip: generate one video chunk, streaming: long video generation")
+    parser.add_argument(
+        "--sample_steps", type=int, default=None, help="The sampling steps.")
+    parser.add_argument(
+        "--sample_shift",
+        type=float,
+        default=None,
+        help="Sampling shift factor for flow matching schedulers.")
+    parser.add_argument(
+        "--sample_text_guide_scale",
+        type=float,
+        default=5.0,
+        help="Classifier free guidance scale for text control.")
+    parser.add_argument(
+        "--sample_audio_guide_scale",
+        type=float,
+        default=4.0,
+        help="Classifier free guidance scale for audio control.")
+    parser.add_argument(
+        "--num_persistent_param_in_dit",
+        type=int,
+        default=None,
+        required=False,
+        help="Maximum number of parameters kept in video memory; use a smaller number to reduce the VRAM required.",
+    )
+    parser.add_argument(
+        "--use_teacache",
+        action="store_true",
+        default=False,
+        help="Enable teacache for video generation."
+    )
+    parser.add_argument(
+        "--teacache_thresh",
+        type=float,
+        default=0.2,
+        help="Threshold for teacache."
+    )
+    parser.add_argument(
+        "--use_apg",
+        action="store_true",
+        default=False,
+        help="Enable adaptive projected guidance for video generation (APG)."
+    )
+    parser.add_argument(
+        "--apg_momentum",
+        type=float,
+        default=-0.75,
+        help="Momentum used in adaptive projected guidance (APG)."
+    )
+    parser.add_argument(
+        "--apg_norm_threshold",
+        type=float,
+        default=55,
+        help="Norm threshold used in adaptive projected guidance (APG)."
+    )
+    parser.add_argument(
+        "--color_correction_strength",
+        type=float,
+        default=1.0,
+        help="Strength of color correction, in the range [0.0, 1.0]."
+    )
+    parser.add_argument(
+        "--quant",
+        type=str,
+        default=None,
+        help="Quantization type, must be 'int8' or 'fp8'."
+    )
+    args = parser.parse_args()
+    _validate_args(args)
+    return args
+def custom_init(device, wav2vec):
+    audio_encoder = Wav2Vec2Model.from_pretrained(wav2vec, local_files_only=True).to(device)
+    audio_encoder.feature_extractor._freeze_parameters()
+    wav2vec_feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(wav2vec, local_files_only=True)
+    return wav2vec_feature_extractor, audio_encoder
+def loudness_norm(audio_array, sr=16000, lufs=-23):
+    meter = pyln.Meter(sr)
+    loudness = meter.integrated_loudness(audio_array)
+    if abs(loudness) > 100:
+        return audio_array
+    normalized_audio = pyln.normalize.loudness(audio_array, loudness, lufs)
+    return normalized_audio
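+def audio_prepare_multi(left_path, right_path, audio_type, sample_rate=16000):
+    # A path of 'None' marks a silent speaker; at least one path must be a real file.
+    if left_path == 'None' and right_path == 'None':
+        raise ValueError("At least one of left_path and right_path must be an audio file.")
+    if left_path == 'None':
+        human_speech_array2 = audio_prepare_single(right_path, sample_rate)
+        human_speech_array1 = np.zeros(human_speech_array2.shape[0])
+    elif right_path == 'None':
+        human_speech_array1 = audio_prepare_single(left_path, sample_rate)
+        human_speech_array2 = np.zeros(human_speech_array1.shape[0])
+    else:
+        human_speech_array1 = audio_prepare_single(left_path, sample_rate)
+        human_speech_array2 = audio_prepare_single(right_path, sample_rate)
+    if audio_type == 'para':
+        # Parallel: both speakers start at t=0; zero-pad the shorter track so the sum is defined.
+        max_len = max(human_speech_array1.shape[0], human_speech_array2.shape[0])
+        new_human_speech1 = np.pad(human_speech_array1, (0, max_len - human_speech_array1.shape[0]))
+        new_human_speech2 = np.pad(human_speech_array2, (0, max_len - human_speech_array2.shape[0]))
+    elif audio_type == 'add':
+        # Sequential: speaker 1 talks first, then speaker 2; each track is silent during the other's turn.
+        new_human_speech1 = np.concatenate([human_speech_array1, np.zeros(human_speech_array2.shape[0])])
+        new_human_speech2 = np.concatenate([np.zeros(human_speech_array1.shape[0]), human_speech_array2])
+    else:
+        raise ValueError(f"Unknown audio_type {audio_type!r}; expected 'para' or 'add'.")
+    sum_human_speechs = new_human_speech1 + new_human_speech2
+    return new_human_speech1, new_human_speech2, sum_human_speechs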
+def _init_logging(rank):
+    # logging
+    if rank == 0:
+        # set format
+        logging.basicConfig(
+            level=logging.INFO,
+            format="[%(asctime)s] %(levelname)s: %(message)s",
+            handlers=[logging.StreamHandler(stream=sys.stdout)])
+    else:
+        logging.basicConfig(level=logging.ERROR)
+def get_embedding(speech_array, wav2vec_feature_extractor, audio_encoder, sr=16000, device='cpu'):
+    audio_duration = len(speech_array) / sr
+    video_length = audio_duration * 25 # Assume the video fps is 25
+    # wav2vec_feature_extractor
+    audio_feature = np.squeeze(
+        wav2vec_feature_extractor(speech_array, sampling_rate=sr).input_values
+    )
+    audio_feature = torch.from_numpy(audio_feature).float().to(device=device)
+    audio_feature = audio_feature.unsqueeze(0)
+    # audio encoder
+    with torch.no_grad():
+        embeddings = audio_encoder(audio_feature, seq_len=int(video_length), output_hidden_states=True)
+    if len(embeddings) == 0:
+        print("Failed to extract audio embedding")
+        return None
+    audio_emb = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
+    audio_emb = rearrange(audio_emb, "b s d -> s b d")
+    audio_emb = audio_emb.cpu().detach()
+    return audio_emb
+def extract_audio_from_video(filename, sample_rate):
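+    # Derive the temporary WAV name from the file's base name (portable across path separators).
+    raw_audio_path = os.path.splitext(os.path.basename(filename))[0] + '.wav'
+    ffmpeg_command = [
+        "ffmpeg",
+        "-y",
+        "-i",
+        str(filename),
+        "-vn",
+        "-acodec",
+        "pcm_s16le",
+        "-ar",
+        str(sample_rate),
+        "-ac",
+        "2",
+        str(raw_audio_path),
+    ]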
+    subprocess.run(ffmpeg_command, check=True)
+    human_speech_array, sr = librosa.load(raw_audio_path, sr=sample_rate)
+    human_speech_array = loudness_norm(human_speech_array, sr)
+    os.remove(raw_audio_path)
+    return human_speech_array
+def audio_prepare_single(audio_path, sample_rate=16000):
+    ext = os.path.splitext(audio_path)[1].lower()
+    if ext in ['.mp4', '.mov', '.avi', '.mkv']:
+        human_speech_array = extract_audio_from_video(audio_path, sample_rate)
+        return human_speech_array
+    else:
+        human_speech_array, sr = librosa.load(audio_path, sr=sample_rate)
+        human_speech_array = loudness_norm(human_speech_array, sr)
+        return human_speech_array
+def process_tts_single(text, save_dir, voice1):
+    s1_sentences = []
+    pipeline = KPipeline(lang_code='a', repo_id='weights/Kokoro-82M')
+    voice_tensor = torch.load(voice1, weights_only=True)
+    generator = pipeline(
+        text, voice=voice_tensor, # <= change voice here
+        speed=1, split_pattern=r'\n+'
+    )
+    audios = []
+    for i, (gs, ps, audio) in enumerate(generator):
+        audios.append(audio)
+    audios = torch.concat(audios, dim=0)
+    s1_sentences.append(audios)
+    s1_sentences = torch.concat(s1_sentences, dim=0)
+    save_path1 =f'{save_dir}/s1.wav'
+    sf.write(save_path1, s1_sentences, 24000) # save each audio file
+    s1, _ = librosa.load(save_path1, sr=16000)
+    return s1, save_path1
+def process_tts_multi(text, save_dir, voice1, voice2):
+    pattern = r'\(s(\d+)\)\s*(.*?)(?=\s*\(s\d+\)|$)'
+    matches = re.findall(pattern, text, re.DOTALL)
+    s1_sentences = []
+    s2_sentences = []
+    pipeline = KPipeline(lang_code='a', repo_id='weights/Kokoro-82M')
+    for idx, (speaker, content) in enumerate(matches):
+        if speaker == '1':
+            voice_tensor = torch.load(voice1, weights_only=True)
+            generator = pipeline(
+                content, voice=voice_tensor, # <= change voice here
+                speed=1, split_pattern=r'\n+'
+            )
+            audios = []
+            for i, (gs, ps, audio) in enumerate(generator):
+                audios.append(audio)
+            audios = torch.concat(audios, dim=0)
+            s1_sentences.append(audios)
+            s2_sentences.append(torch.zeros_like(audios))
+        elif speaker == '2':
+            voice_tensor = torch.load(voice2, weights_only=True)
+            generator = pipeline(
+                content, voice=voice_tensor, # <= change voice here
+                speed=1, split_pattern=r'\n+'
+            )
+            audios = []
+            for i, (gs, ps, audio) in enumerate(generator):
+                audios.append(audio)
+            audios = torch.concat(audios, dim=0)
+            s2_sentences.append(audios)
+            s1_sentences.append(torch.zeros_like(audios))
+    s1_sentences = torch.concat(s1_sentences, dim=0)
+    s2_sentences = torch.concat(s2_sentences, dim=0)
+    sum_sentences = s1_sentences + s2_sentences
+    save_path1 =f'{save_dir}/s1.wav'
+    save_path2 =f'{save_dir}/s2.wav'
+    save_path_sum = f'{save_dir}/sum.wav'
+    sf.write(save_path1, s1_sentences, 24000) # save each audio file
+    sf.write(save_path2, s2_sentences, 24000)
+    sf.write(save_path_sum, sum_sentences, 24000)
+    s1, _ = librosa.load(save_path1, sr=16000)
+    s2, _ = librosa.load(save_path2, sr=16000)
+    # sum, _ = librosa.load(save_path_sum, sr=16000)
+    return s1, s2, save_path_sum
+def run_graio_demo(args):
+    rank = int(os.getenv("RANK", 0))
+    world_size = int(os.getenv("WORLD_SIZE", 1))
+    local_rank = int(os.getenv("LOCAL_RANK", 0))
+    device = local_rank
+    _init_logging(rank)
+    if args.offload_model is None:
+        args.offload_model = False if world_size > 1 else True
+        logging.info(
+            f"offload_model is not specified, set to {args.offload_model}.")
+    if world_size > 1:
+        torch.cuda.set_device(local_rank)
+        dist.init_process_group(
+            backend="nccl",
+            init_method="env://",
+            rank=rank,
+            world_size=world_size)
+    else:
+        assert not (
+            args.t5_fsdp or args.dit_fsdp
+        ), f"t5_fsdp and dit_fsdp are not supported in non-distributed environments."
+        assert not (
+            args.ulysses_size > 1 or args.ring_size > 1
+        ), "Context parallelism is not supported in non-distributed environments."
+    if args.ulysses_size > 1 or args.ring_size > 1:
+        assert args.ulysses_size * args.ring_size == world_size, "The product of ulysses_size and ring_size should be equal to the world size."
+        from xfuser.core.distributed import (
+            init_distributed_environment,
+            initialize_model_parallel,
+        )
+        init_distributed_environment(
+            rank=dist.get_rank(), world_size=dist.get_world_size())
+        initialize_model_parallel(
+            sequence_parallel_degree=dist.get_world_size(),
+            ring_degree=args.ring_size,
+            ulysses_degree=args.ulysses_size,
+        )
+    cfg = WAN_CONFIGS[args.task]
+    if args.ulysses_size > 1:
+        assert cfg.num_heads % args.ulysses_size == 0, f"`{cfg.num_heads=}` cannot be divided evenly by `{args.ulysses_size=}`."
+    logging.info(f"Generation job args: {args}")
+    logging.info(f"Generation model config: {cfg}")
+    if dist.is_initialized():
+        base_seed = [args.base_seed] if rank == 0 else [None]
+        dist.broadcast_object_list(base_seed, src=0)
+        args.base_seed = base_seed[0]
+    assert args.task == "infinitetalk-14B", 'You should choose infinitetalk-14B in args.task.'
+    wav2vec_feature_extractor, audio_encoder= custom_init('cpu', args.wav2vec_dir)
+    os.makedirs(args.audio_save_dir,exist_ok=True)
+    logging.info("Creating InfiniteTalk pipeline.")
+    wan_i2v = wan.InfiniteTalkPipeline(
+        config=cfg,
+        checkpoint_dir=args.ckpt_dir,
+        quant_dir=args.quant_dir,
+        device_id=device,
+        rank=rank,
+        t5_fsdp=args.t5_fsdp,
+        dit_fsdp=args.dit_fsdp,
+        use_usp=(args.ulysses_size > 1 or args.ring_size > 1),
+        t5_cpu=args.t5_cpu,
+        lora_dir=args.lora_dir,
+        lora_scales=args.lora_scale,
+        quant=args.quant,
+        dit_path=args.dit_path,
+        infinitetalk_dir=args.infinitetalk_dir
+    )
+    if args.num_persistent_param_in_dit is not None:
+        wan_i2v.vram_management = True
+        wan_i2v.enable_vram_management(
+            num_persistent_param_in_dit=args.num_persistent_param_in_dit
+        )
+    def generate_video(img2vid_image, vid2vid_vid, task_mode, img2vid_prompt, n_prompt, img2vid_audio_1, img2vid_audio_2,
+                    sd_steps, seed, text_guide_scale, audio_guide_scale, mode_selector, tts_text, resolution_select, human1_voice, human2_voice):
+        input_data = {}
+        input_data["prompt"] = img2vid_prompt
+        if task_mode=='VideoDubbing':
+            input_data["cond_video"] = vid2vid_vid
+        else:
+            input_data["cond_video"] = img2vid_image
+        person = {}
+        if mode_selector == "Single Person(Local File)":
+            person['person1'] = img2vid_audio_1
+        elif mode_selector == "Single Person(TTS)":
+            tts_audio = {}
+            tts_audio['text'] = tts_text
+            tts_audio['human1_voice'] = human1_voice
+            input_data["tts_audio"] = tts_audio
+        elif mode_selector == "Multi Person(Local File, audio add)":
+            person['person1'] = img2vid_audio_1
+            person['person2'] = img2vid_audio_2
+            input_data["audio_type"] = 'add'
+        elif mode_selector == "Multi Person(Local File, audio parallel)":
+            person['person1'] = img2vid_audio_1
+            person['person2'] = img2vid_audio_2
+            input_data["audio_type"] = 'para'
+        else:
+            tts_audio = {}
+            tts_audio['text'] = tts_text
+            tts_audio['human1_voice'] = human1_voice
+            tts_audio['human2_voice'] = human2_voice
+            input_data["tts_audio"] = tts_audio
+        input_data["cond_audio"] = person
+        if 'Local File' in mode_selector:
+            if len(input_data['cond_audio'])==2:
+                new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(input_data['cond_audio']['person1'], input_data['cond_audio']['person2'], input_data['audio_type'])
+                audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder)
+                audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder)
+                emb1_path = os.path.join(args.audio_save_dir, '1.pt')
+                emb2_path = os.path.join(args.audio_save_dir, '2.pt')
+                sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+                sf.write(sum_audio, sum_human_speechs, 16000)
+                torch.save(audio_embedding_1, emb1_path)
+                torch.save(audio_embedding_2, emb2_path)
+                input_data['cond_audio']['person1'] = emb1_path
+                input_data['cond_audio']['person2'] = emb2_path
+                input_data['video_audio'] = sum_audio
+            elif len(input_data['cond_audio'])==1:
+                human_speech = audio_prepare_single(input_data['cond_audio']['person1'])
+                audio_embedding = get_embedding(human_speech, wav2vec_feature_extractor, audio_encoder)
+                emb_path = os.path.join(args.audio_save_dir, '1.pt')
+                sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+                sf.write(sum_audio, human_speech, 16000)
+                torch.save(audio_embedding, emb_path)
+                input_data['cond_audio']['person1'] = emb_path
+                input_data['video_audio'] = sum_audio
+        elif 'TTS' in mode_selector:
+            if 'human2_voice' not in input_data['tts_audio'].keys():
+                new_human_speech1, sum_audio = process_tts_single(input_data['tts_audio']['text'], args.audio_save_dir, input_data['tts_audio']['human1_voice'])
+                audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder)
+                emb1_path = os.path.join(args.audio_save_dir, '1.pt')
+                torch.save(audio_embedding_1, emb1_path)
+                input_data['cond_audio']['person1'] = emb1_path
+                input_data['video_audio'] = sum_audio
+            else:
+                new_human_speech1, new_human_speech2, sum_audio = process_tts_multi(input_data['tts_audio']['text'], args.audio_save_dir, input_data['tts_audio']['human1_voice'], input_data['tts_audio']['human2_voice'])
+                audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder)
+                audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder)
+                emb1_path = os.path.join(args.audio_save_dir, '1.pt')
+                emb2_path = os.path.join(args.audio_save_dir, '2.pt')
+                torch.save(audio_embedding_1, emb1_path)
+                torch.save(audio_embedding_2, emb2_path)
+                input_data['cond_audio']['person1'] = emb1_path
+                input_data['cond_audio']['person2'] = emb2_path
+                input_data['video_audio'] = sum_audio
+        # if len(input_data['cond_audio'])==2:
+        #     new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(input_data['cond_audio']['person1'], input_data['cond_audio']['person2'], input_data['audio_type'])
+        #     audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder)
+        #     audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder)
+        #     emb1_path = os.path.join(args.audio_save_dir, '1.pt')
+        #     emb2_path = os.path.join(args.audio_save_dir, '2.pt')
+        #     sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+        #     sf.write(sum_audio, sum_human_speechs, 16000)
+        #     torch.save(audio_embedding_1, emb1_path)
+        #     torch.save(audio_embedding_2, emb2_path)
+        #     input_data['cond_audio']['person1'] = emb1_path
+        #     input_data['cond_audio']['person2'] = emb2_path
+        #     input_data['video_audio'] = sum_audio
+        # elif len(input_data['cond_audio'])==1:
+        #     human_speech = audio_prepare_single(input_data['cond_audio']['person1'])
+        #     audio_embedding = get_embedding(human_speech, wav2vec_feature_extractor, audio_encoder)
+        #     emb_path = os.path.join(args.audio_save_dir, '1.pt')
+        #     sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+        #     sf.write(sum_audio, human_speech, 16000)
+        #     torch.save(audio_embedding, emb_path)
+        #     input_data['cond_audio']['person1'] = emb_path
+        #     input_data['video_audio'] = sum_audio
+        logging.info("Generating video ...")
+        video = wan_i2v.generate_infinitetalk(
+            input_data,
+            size_buckget=resolution_select,
+            motion_frame=args.motion_frame,
+            frame_num=args.frame_num,
+            shift=args.sample_shift,
+            sampling_steps=sd_steps,
+            text_guide_scale=text_guide_scale,
+            audio_guide_scale=audio_guide_scale,
+            seed=seed,
+            n_prompt=n_prompt,
+            offload_model=args.offload_model,
+            max_frames_num=args.frame_num if args.mode == 'clip' else 1000,
+            color_correction_strength = args.color_correction_strength,
+            extra_args=args,
+            )
+        # Use a local name so every request gets a fresh timestamped file instead of
+        # reusing (and overwriting) the name cached on args by the first request.
+        save_file = args.save_file
+        if save_file is None:
+            formatted_time = datetime.now().strftime("%Y%m%d_%H%M%S")
+            formatted_prompt = input_data['prompt'].replace(" ", "_").replace("/",
+                                                                        "_")[:50]
+            save_file = f"{args.task}_{args.size.replace('*','x') if sys.platform=='win32' else args.size}_{args.ulysses_size}_{args.ring_size}_{formatted_prompt}_{formatted_time}"
+        logging.info(f"Saving generated video to {save_file}.mp4")
+        save_video_ffmpeg(video, save_file, [input_data['video_audio']], high_quality_save=False)
+        logging.info("Finished.")
+        return save_file + '.mp4'
+    def toggle_audio_mode(mode):
+        if 'TTS' in mode:
+            return [
+                gr.Audio(visible=False, interactive=False),
+                gr.Audio(visible=False, interactive=False),
+                gr.Textbox(visible=True, interactive=True)
+            ]
+        elif 'Single' in mode:
+            return [
+                gr.Audio(visible=True, interactive=True),
+                gr.Audio(visible=False, interactive=False),
+                gr.Textbox(visible=False, interactive=False)
+            ]
+        else:
+            return [
+                gr.Audio(visible=True, interactive=True),
+                gr.Audio(visible=True, interactive=True),
+                gr.Textbox(visible=False, interactive=False)
+            ]
+    def show_upload(mode):
+        if mode == "SingleImageDriven":
+            return gr.update(visible=True), gr.update(visible=False)
+        else:
+            return gr.update(visible=False), gr.update(visible=True)
+    with gr.Blocks() as demo:
+        gr.Markdown("""
+                    <div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
+                        MeiGen-InfiniteTalk
+                    </div>
+                    <div style="text-align: center; font-size: 16px; font-weight: normal; margin-bottom: 20px;">
+                        InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing.
+                    </div>
+                    <div style="display: flex; justify-content: center; gap: 10px; flex-wrap: wrap;">
+                        <a href=''><img src='https://img.shields.io/badge/Project-Page-blue'></a>
+                        <a href=''><img src='https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow'></a>
+                        <a href=''><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
+                    </div>
+                    """)
+        with gr.Row():
+            with gr.Column(scale=1):
+                task_mode = gr.Radio(
+                    choices=["SingleImageDriven", "VideoDubbing"],
+                    label="Choose SingleImageDriven task or VideoDubbing task",
+                    value="VideoDubbing"
+                )
+                vid2vid_vid = gr.Video(
+                    label="Upload Input Video",
+                    visible=True)
+                img2vid_image = gr.Image(
+                    type="filepath",
+                    label="Upload Input Image",
+                    elem_id="image_upload",
+                    visible=False
+                )
+                img2vid_prompt = gr.Textbox(
+                    label="Prompt",
+                    placeholder="Describe the video you want to generate",
+                )
+                task_mode.change(
+                    fn=show_upload,
+                    inputs=task_mode,
+                    outputs=[img2vid_image, vid2vid_vid]
+                )
+                with gr.Accordion("Audio Options", open=True):
+                    mode_selector = gr.Radio(
+                        choices=["Single Person(Local File)", "Single Person(TTS)", "Multi Person(Local File, audio add)", "Multi Person(Local File, audio parallel)", "Multi Person(TTS)"],
+                        label="Select person and audio mode.",
+                        value="Single Person(Local File)"
+                    )
+                    resolution_select = gr.Radio(
+                        choices=["infinitetalk-480", "infinitetalk-720"],
+                        label="Select resolution.",
+                        value="infinitetalk-480"
+                    )
+                    img2vid_audio_1 = gr.Audio(label="Conditioning Audio for speaker 1", type="filepath", visible=True)
+                    img2vid_audio_2 = gr.Audio(label="Conditioning Audio for speaker 2", type="filepath", visible=False)
+                    tts_text = gr.Textbox(
+                        label="Text for TTS",
+                        placeholder="Refer to the format in the examples",
+                        visible=False,
+                        interactive=False
+                    )
+                    mode_selector.change(
+                        fn=toggle_audio_mode,
+                        inputs=mode_selector,
+                        outputs=[img2vid_audio_1, img2vid_audio_2, tts_text]
+                    )
+                with gr.Accordion("Advanced Options", open=False):
+                    with gr.Row():
+                        sd_steps = gr.Slider(
+                            label="Diffusion steps",
+                            minimum=1,
+                            maximum=1000,
+                            value=8,
+                            step=1)
+                        seed = gr.Slider(
+                            label="Seed",
+                            minimum=-1,
+                            maximum=2147483647,
+                            step=1,
+                            value=42)
+                    with gr.Row():
+                        text_guide_scale = gr.Slider(
+                            label="Text Guide scale",
+                            minimum=0,
+                            maximum=20,
+                            value=1.0,
+                            step=1)
+                        audio_guide_scale = gr.Slider(
+                            label="Audio Guide scale",
+                            minimum=0,
+                            maximum=20,
+                            value=2.0,
+                            step=1)
+                    with gr.Row():
+                        human1_voice = gr.Textbox(
+                            label="Voice for the left person",
+                            value="weights/Kokoro-82M/voices/am_adam.pt",
+                        )
+                        human2_voice = gr.Textbox(
+                            label="Voice for the right person",
+                            value="weights/Kokoro-82M/voices/af_heart.pt"
+                        )
+                    # with gr.Row():
+                    n_prompt = gr.Textbox(
+                        label="Negative Prompt",
+                        placeholder="Describe the negative prompt you want to add",
+                        value="bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
+                    )
+                run_i2v_button = gr.Button("Generate Video")
+            with gr.Column(scale=2):
+                result_gallery = gr.Video(
+                    label='Generated Video', interactive=False, height=600, )
+                gr.Examples(
+                    examples = [
+                        ['SingleImageDriven', 'examples/single/ref_image.png', None, "A woman is passionately singing into a professional microphone in a recording studio. She wears large black headphones and a dark cardigan over a gray top. Her long, wavy brown hair frames her face as she looks slightly upwards, her mouth open mid-song. The studio is equipped with various audio equipment, including a mixing console and a keyboard, with soundproofing panels on the walls. The lighting is warm and focused on her, creating a professional and intimate atmosphere. A close-up shot captures her expressive performance.", "Single Person(Local File)", "examples/single/1.wav", None, None],
+                        ['VideoDubbing', None, 'examples/single/ref_video.mp4', "A man is talking", "Single Person(Local File)", "examples/single/1.wav", None, None],
+                    ],
+                    inputs = [task_mode, img2vid_image, vid2vid_vid, img2vid_prompt, mode_selector, img2vid_audio_1, img2vid_audio_2, tts_text],
+                )
+        run_i2v_button.click(
+            fn=generate_video,
+            inputs=[img2vid_image, vid2vid_vid, task_mode, img2vid_prompt, n_prompt, img2vid_audio_1, img2vid_audio_2,sd_steps, seed, text_guide_scale, audio_guide_scale, mode_selector, tts_text, resolution_select, human1_voice, human2_voice],
+            outputs=[result_gallery],
+        )
+    demo.launch(server_name="0.0.0.0", debug=True, server_port=8418)
+if __name__ == "__main__":
+    args = _parse_args()
+    run_graio_demo(args)
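
Note on the multi-speaker TTS path in app.py: `process_tts_multi` expects the script to be tagged with `(s1)` / `(s2)` markers and gives each speaker a track that is silent during the other speaker's turn. The sketch below illustrates that layout under stated assumptions. It reuses the regular expression from the code above, but `split_turns` and `layout_tracks` are hypothetical names, and plain lists of made-up samples stand in for the audio tensors that Kokoro would return.

```python
import re

# Same pattern as in process_tts_multi: a "(sN)" tag followed by that speaker's text.
PATTERN = r'\(s(\d+)\)\s*(.*?)(?=\s*\(s\d+\)|$)'


def split_turns(text):
    """Return (speaker, content) pairs in the order they appear in the script."""
    return re.findall(PATTERN, text, re.DOTALL)


def layout_tracks(turns, samples_per_char=2):
    """Build two equally long tracks; each is zero while the other speaker talks.

    Fake audio (an assumption for illustration): every character of a turn
    becomes `samples_per_char` samples of value 1.0.
    """
    track1, track2 = [], []
    for speaker, content in turns:
        audio = [1.0] * (len(content) * samples_per_char)
        silence = [0.0] * len(audio)
        if speaker == '1':
            track1 += audio
            track2 += silence
        elif speaker == '2':
            track2 += audio
            track1 += silence
    return track1, track2


script = "(s1) Hello there. (s2) Hi! (s1) Bye."
turns = split_turns(script)
track1, track2 = layout_tracks(turns)
```

Because the two tracks never overlap, summing them sample by sample (as the real code does for `sum.wav`) yields the full conversation in order.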

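Note on frame counts in app.py: `get_embedding` converts the audio duration to a video length by assuming 25 frames per second, and the `--frame_num` help text asks for a value of the form 4n+1, although `_validate_args` does not check it. A minimal sketch of both calculations follows. The helper names `video_frames_for_audio` and `is_valid_frame_num` are hypothetical and are not part of the repository.

```python
def video_frames_for_audio(num_samples, sr=16000, fps=25):
    """Frames covered by the audio, mirroring int(len(speech_array) / sr * 25)."""
    return int(num_samples / sr * fps)


def is_valid_frame_num(frame_num):
    """True when frame_num has the form 4n+1 (n >= 0), as --frame_num asks for."""
    return frame_num >= 1 and (frame_num - 1) % 4 == 0


# Four seconds of 16 kHz audio, and the default clip length of 81 frames.
frames = video_frames_for_audio(16000 * 4)
default_ok = is_valid_frame_num(81)
```

A check like `is_valid_frame_num` could be called from `_validate_args` to reject unsupported clip lengths early, if that constraint is in fact required by the pipeline.
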
generate_infinitetalk.py ADDED Viewed

	@@ -0,0 +1,663 @@

+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+import argparse
+import logging
+import os
+import sys
+import json
+import warnings
+from datetime import datetime
+warnings.filterwarnings('ignore')
+import random
+import torch
+import torch.distributed as dist
+from PIL import Image
+import subprocess
+import wan
+from wan.configs import SIZE_CONFIGS, SUPPORTED_SIZES, WAN_CONFIGS
+from wan.utils.utils import str2bool, is_video, split_wav_librosa
+from wan.utils.multitalk_utils import save_video_ffmpeg
+from kokoro import KPipeline
+from transformers import Wav2Vec2FeatureExtractor
+from src.audio_analysis.wav2vec2 import Wav2Vec2Model
+from wan.utils.segvideo import shot_detect
+import librosa
+import pyloudnorm as pyln
+import numpy as np
+from einops import rearrange
+import soundfile as sf
+import re
+def _validate_args(args):
+    # Basic check
+    assert args.ckpt_dir is not None, "Please specify the checkpoint directory."
+    assert args.task in WAN_CONFIGS, f"Unsupported task: {args.task}"
+    # Default to 40 sampling steps when --sample_steps is not given.
+    if args.sample_steps is None:
+        args.sample_steps = 40
+    if args.sample_shift is None:
+        if args.size == 'infinitetalk-480':
+            args.sample_shift = 7
+        elif args.size == 'infinitetalk-720':
+            args.sample_shift = 11
+        else:
+            raise NotImplementedError(f'Unsupported size: {args.size}')
+    args.base_seed = args.base_seed if args.base_seed >= 0 else random.randint(
+        0, 99999999)
+    # Size check
+    assert args.size in SUPPORTED_SIZES[args.task], (
+        f"Unsupported size {args.size} for task {args.task}, "
+        f"supported sizes are: {', '.join(SUPPORTED_SIZES[args.task])}")
+def _parse_args():
+    parser = argparse.ArgumentParser(
+        description="Generate an image or video from a text prompt or image using Wan"
+    )
+    parser.add_argument(
+        "--task",
+        type=str,
+        default="infinitetalk-14B",
+        choices=list(WAN_CONFIGS.keys()),
+        help="The task to run.")
+    parser.add_argument(
+        "--size",
+        type=str,
+        default="infinitetalk-480",
+        choices=list(SIZE_CONFIGS.keys()),
+        help="The bucket size of the generated video. The aspect ratio of the output video will follow that of the input image."
+    )
+    parser.add_argument(
+        "--frame_num",
+        type=int,
+        default=81,
+        help="Number of frames to generate in one clip. The value should have the form 4n+1."
+    )
+    parser.add_argument(
+        "--max_frame_num",
+        type=int,
+        default=1000,
+        help="The maximum frame length of the generated video."
+    )
+    parser.add_argument(
+        "--ckpt_dir",
+        type=str,
+        default=None,
+        help="The path to the Wan checkpoint directory.")
+    parser.add_argument(
+        "--infinitetalk_dir",
+        type=str,
+        default=None,
+        help="The path to the InfiniteTalk checkpoint directory.")
+    parser.add_argument(
+        "--quant_dir",
+        type=str,
+        default=None,
+        help="The path to the Wan quant checkpoint directory.")
+    parser.add_argument(
+        "--wav2vec_dir",
+        type=str,
+        default=None,
+        help="The path to the wav2vec checkpoint directory.")
+    parser.add_argument(
+        "--dit_path",
+        type=str,
+        default=None,
+        help="The path to the Wan checkpoint directory.")
+    parser.add_argument(
+        "--lora_dir",
+        type=str,
+        nargs='+',
+        default=None,
+        help="The paths to the LoRA checkpoint files."
+    )
+    parser.add_argument(
+        "--lora_scale",
+        type=float,
+        nargs='+',
+        default=[1.2],
+        help="Controls how much to influence the outputs with the LoRA parameters. Accepts multiple float values."
+    )
+    parser.add_argument(
+        "--offload_model",
+        type=str2bool,
+        default=None,
+        help="Whether to offload the model to CPU after each model forward, reducing GPU memory usage."
+    )
+    parser.add_argument(
+        "--ulysses_size",
+        type=int,
+        default=1,
+        help="The size of the ulysses parallelism in DiT.")
+    parser.add_argument(
+        "--ring_size",
+        type=int,
+        default=1,
+        help="The size of the ring attention parallelism in DiT.")
+    parser.add_argument(
+        "--t5_fsdp",
+        action="store_true",
+        default=False,
+        help="Whether to use FSDP for T5.")
+    parser.add_argument(
+        "--t5_cpu",
+        action="store_true",
+        default=False,
+        help="Whether to place T5 model on CPU.")
+    parser.add_argument(
+        "--dit_fsdp",
+        action="store_true",
+        default=False,
+        help="Whether to use FSDP for DiT.")
+    parser.add_argument(
+        "--save_file",
+        type=str,
+        default=None,
+        help="The file to save the generated image or video to.")
+    parser.add_argument(
+        "--audio_save_dir",
+        type=str,
+        default='save_audio',
+        help="The path to save the audio embedding.")
+    parser.add_argument(
+        "--base_seed",
+        type=int,
+        default=42,
+        help="The seed to use for generating the image or video.")
+    parser.add_argument(
+        "--input_json",
+        type=str,
+        default='examples.json',
+        help="[meta file] The condition path to generate the video.")
+    parser.add_argument(
+        "--motion_frame",
+        type=int,
+        default=9,
+        help="Driven frame length used in long video generation mode.")
+    parser.add_argument(
+        "--mode",
+        type=str,
+        default="clip",
+        choices=['clip', 'streaming'],
+        help="clip: generate one video chunk, streaming: long video generation")
+    parser.add_argument(
+        "--sample_steps", type=int, default=None, help="The sampling steps.")
+    parser.add_argument(
+        "--sample_shift",
+        type=float,
+        default=None,
+        help="Sampling shift factor for flow matching schedulers.")
+    parser.add_argument(
+        "--sample_text_guide_scale",
+        type=float,
+        default=5.0,
+        help="Classifier free guidance scale for text control.")
+    parser.add_argument(
+        "--sample_audio_guide_scale",
+        type=float,
+        default=4.0,
+        help="Classifier free guidance scale for audio control.")
+    parser.add_argument(
+        "--num_persistent_param_in_dit",
+        type=int,
+        default=None,
+        required=False,
+        help="Maximum number of parameters kept in video memory; use a smaller number to reduce the VRAM required.",
+    )
+    parser.add_argument(
+        "--audio_mode",
+        type=str,
+        default="localfile",
+        choices=['localfile', 'tts'],
+        help="localfile: audio from local wav file, tts: audio from TTS")
+    parser.add_argument(
+        "--use_teacache",
+        action="store_true",
+        default=False,
+        help="Enable teacache for video generation."
+    )
+    parser.add_argument(
+        "--teacache_thresh",
+        type=float,
+        default=0.2,
+        help="Threshold for teacache."
+    )
+    parser.add_argument(
+        "--use_apg",
+        action="store_true",
+        default=False,
+        help="Enable adaptive projected guidance for video generation (APG)."
+    )
+    parser.add_argument(
+        "--apg_momentum",
+        type=float,
+        default=-0.75,
+        help="Momentum used in adaptive projected guidance (APG)."
+    )
+    parser.add_argument(
+        "--apg_norm_threshold",
+        type=float,
+        default=55,
+        help="Norm threshold used in adaptive projected guidance (APG)."
+    )
+    parser.add_argument(
+        "--color_correction_strength",
+        type=float,
+        default=1.0,
+        help="Strength of the color correction, in the range [0.0, 1.0]."
+    )
+    parser.add_argument(
+        "--scene_seg",
+        action="store_true",
+        default=False,
+        help="Enable scene segmentation for input video."
+    )
+    parser.add_argument(
+        "--quant",
+        type=str,
+        default=None,
+        help="Quantization type, must be 'int8' or 'fp8'."
+    )
+    args = parser.parse_args()
+    _validate_args(args)
+    return args
+def custom_init(device, wav2vec):
+    audio_encoder = Wav2Vec2Model.from_pretrained(wav2vec, local_files_only=True).to(device)
+    audio_encoder.feature_extractor._freeze_parameters()
+    wav2vec_feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(wav2vec, local_files_only=True)
+    return wav2vec_feature_extractor, audio_encoder
+def loudness_norm(audio_array, sr=16000, lufs=-23):
+    meter = pyln.Meter(sr)
+    loudness = meter.integrated_loudness(audio_array)
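+    # Silent or near-silent input gives an extreme (or -inf) loudness; skip normalization in that case.
+    if abs(loudness) > 100:
+        return audio_array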
+    normalized_audio = pyln.normalize.loudness(audio_array, loudness, lufs)
+    return normalized_audio
+def audio_prepare_multi(left_path, right_path, audio_type, sample_rate=16000):
+    if not (left_path=='None' or right_path=='None'):
+        human_speech_array1 = audio_prepare_single(left_path)
+        human_speech_array2 = audio_prepare_single(right_path)
+    elif left_path=='None':
+        human_speech_array2 = audio_prepare_single(right_path)
+        human_speech_array1 = np.zeros(human_speech_array2.shape[0])
+    elif right_path=='None':
+        human_speech_array1 = audio_prepare_single(left_path)
+        human_speech_array2 = np.zeros(human_speech_array1.shape[0])
+    if audio_type=='para':
+        # 'para': both tracks are used as-is and summed sample by sample.
+        new_human_speech1 = human_speech_array1
+        new_human_speech2 = human_speech_array2
+    elif audio_type=='add':
+        # 'add': the tracks play one after the other, each padded with silence for the other speaker's turn.
+        new_human_speech1 = np.concatenate([human_speech_array1, np.zeros(human_speech_array2.shape[0])])
+        new_human_speech2 = np.concatenate([np.zeros(human_speech_array1.shape[0]), human_speech_array2])
+    else:
+        raise ValueError(f"Unsupported audio_type: {audio_type!r}")
+    sum_human_speechs = new_human_speech1 + new_human_speech2
+    return new_human_speech1, new_human_speech2, sum_human_speechs
+def _init_logging(rank):
+    # logging
+    if rank == 0:
+        # set format
+        logging.basicConfig(
+            level=logging.INFO,
+            format="[%(asctime)s] %(levelname)s: %(message)s",
+            handlers=[logging.StreamHandler(stream=sys.stdout)])
+    else:
+        logging.basicConfig(level=logging.ERROR)
+def get_embedding(speech_array, wav2vec_feature_extractor, audio_encoder, sr=16000, device='cpu'):
+    audio_duration = len(speech_array) / sr
+    video_length = audio_duration * 25 # Assume the video fps is 25
+    # wav2vec_feature_extractor
+    audio_feature = np.squeeze(
+        wav2vec_feature_extractor(speech_array, sampling_rate=sr).input_values
+    )
+    audio_feature = torch.from_numpy(audio_feature).float().to(device=device)
+    audio_feature = audio_feature.unsqueeze(0)
+    # audio encoder
+    with torch.no_grad():
+        embeddings = audio_encoder(audio_feature, seq_len=int(video_length), output_hidden_states=True)
+    if len(embeddings) == 0:
+        print("Fail to extract audio embedding")
+        return None
+    audio_emb = torch.stack(embeddings.hidden_states[1:], dim=1).squeeze(0)
+    audio_emb = rearrange(audio_emb, "b s d -> s b d")
+    audio_emb = audio_emb.cpu().detach()
+    return audio_emb
+def extract_audio_from_video(filename, sample_rate):
+    raw_audio_path = filename.split('/')[-1].split('.')[0]+'.wav'
+    ffmpeg_command = [
+        "ffmpeg",
+        "-y",
+        "-i",
+        str(filename),
+        "-vn",
+        "-acodec",
+        "pcm_s16le",
+        "-ar",
+        str(sample_rate),
+        "-ac",
+        "2",
+        str(raw_audio_path),
+    ]
+    subprocess.run(ffmpeg_command, check=True)
+    human_speech_array, sr = librosa.load(raw_audio_path, sr=sample_rate)
+    human_speech_array = loudness_norm(human_speech_array, sr)
+    os.remove(raw_audio_path)
+    return human_speech_array
+def audio_prepare_single(audio_path, sample_rate=16000):
+    ext = os.path.splitext(audio_path)[1].lower()
+    if ext in ['.mp4', '.mov', '.avi', '.mkv']:
+        human_speech_array = extract_audio_from_video(audio_path, sample_rate)
+        return human_speech_array
+    else:
+        human_speech_array, sr = librosa.load(audio_path, sr=sample_rate)
+        human_speech_array = loudness_norm(human_speech_array, sr)
+        return human_speech_array
+def process_tts_single(text, save_dir, voice1):
+    s1_sentences = []
+    pipeline = KPipeline(lang_code='a', repo_id='weights/Kokoro-82M')
+    voice_tensor = torch.load(voice1, weights_only=True)
+    generator = pipeline(
+        text, voice=voice_tensor, # <= change voice here
+        speed=1, split_pattern=r'\n+'
+    )
+    audios = []
+    for i, (gs, ps, audio) in enumerate(generator):
+        audios.append(audio)
+    audios = torch.concat(audios, dim=0)
+    s1_sentences.append(audios)
+    s1_sentences = torch.concat(s1_sentences, dim=0)
+    save_path1 = f'{save_dir}/s1.wav'
+    sf.write(save_path1, s1_sentences, 24000) # save each audio file
+    s1, _ = librosa.load(save_path1, sr=16000)
+    return s1, save_path1
+def process_tts_multi(text, save_dir, voice1, voice2):
+    pattern = r'\(s(\d+)\)\s*(.*?)(?=\s*\(s\d+\)|$)'
+    matches = re.findall(pattern, text, re.DOTALL)
+    s1_sentences = []
+    s2_sentences = []
+    pipeline = KPipeline(lang_code='a', repo_id='weights/Kokoro-82M')
+    for idx, (speaker, content) in enumerate(matches):
+        if speaker == '1':
+            voice_tensor = torch.load(voice1, weights_only=True)
+            generator = pipeline(
+                content, voice=voice_tensor, # <= change voice here
+                speed=1, split_pattern=r'\n+'
+            )
+            audios = []
+            for i, (gs, ps, audio) in enumerate(generator):
+                audios.append(audio)
+            audios = torch.concat(audios, dim=0)
+            s1_sentences.append(audios)
+            s2_sentences.append(torch.zeros_like(audios))
+        elif speaker == '2':
+            voice_tensor = torch.load(voice2, weights_only=True)
+            generator = pipeline(
+                content, voice=voice_tensor, # <= change voice here
+                speed=1, split_pattern=r'\n+'
+            )
+            audios = []
+            for i, (gs, ps, audio) in enumerate(generator):
+                audios.append(audio)
+            audios = torch.concat(audios, dim=0)
+            s2_sentences.append(audios)
+            s1_sentences.append(torch.zeros_like(audios))
+    s1_sentences = torch.concat(s1_sentences, dim=0)
+    s2_sentences = torch.concat(s2_sentences, dim=0)
+    sum_sentences = s1_sentences + s2_sentences
+    save_path1 = f'{save_dir}/s1.wav'
+    save_path2 = f'{save_dir}/s2.wav'
+    save_path_sum = f'{save_dir}/sum.wav'
+    sf.write(save_path1, s1_sentences, 24000) # save each audio file
+    sf.write(save_path2, s2_sentences, 24000)
+    sf.write(save_path_sum, sum_sentences, 24000)
+    s1, _ = librosa.load(save_path1, sr=16000)
+    s2, _ = librosa.load(save_path2, sr=16000)
+    # sum, _ = librosa.load(save_path_sum, sr=16000)
+    return s1, s2, save_path_sum
+def generate(args):
+    rank = int(os.getenv("RANK", 0))
+    world_size = int(os.getenv("WORLD_SIZE", 1))
+    local_rank = int(os.getenv("LOCAL_RANK", 0))
+    device = local_rank
+    _init_logging(rank)
+    if args.offload_model is None:
+        args.offload_model = world_size <= 1
+        logging.info(
+            f"offload_model is not specified, set to {args.offload_model}.")
+    if world_size > 1:
+        torch.cuda.set_device(local_rank)
+        dist.init_process_group(
+            backend="nccl",
+            init_method="env://",
+            rank=rank,
+            world_size=world_size)
+    else:
+        assert not (
+            args.t5_fsdp or args.dit_fsdp
+        ), "t5_fsdp and dit_fsdp are not supported in non-distributed environments."
+        assert not (
+            args.ulysses_size > 1 or args.ring_size > 1
+        ), "Context parallelism is not supported in non-distributed environments."
+    if args.ulysses_size > 1 or args.ring_size > 1:
+        assert args.ulysses_size * args.ring_size == world_size, "The product of ulysses_size and ring_size should equal the world size."
+        from xfuser.core.distributed import (
+            init_distributed_environment,
+            initialize_model_parallel,
+        )
+        init_distributed_environment(
+            rank=dist.get_rank(), world_size=dist.get_world_size())
+        initialize_model_parallel(
+            sequence_parallel_degree=dist.get_world_size(),
+            ring_degree=args.ring_size,
+            ulysses_degree=args.ulysses_size,
+        )
+    # TODO: use prompt refine
+    # if args.use_prompt_extend:
+    #     if args.prompt_extend_method == "dashscope":
+    #         prompt_expander = DashScopePromptExpander(
+    #             model_name=args.prompt_extend_model,
+    #             is_vl="i2v" in args.task or "flf2v" in args.task)
+    #     elif args.prompt_extend_method == "local_qwen":
+    #         prompt_expander = QwenPromptExpander(
+    #             model_name=args.prompt_extend_model,
+    #             is_vl="i2v" in args.task,
+    #             device=rank)
+    #     else:
+    #         raise NotImplementedError(
+    #             f"Unsupported prompt_extend_method: {args.prompt_extend_method}")
+    cfg = WAN_CONFIGS[args.task]
+    if args.ulysses_size > 1:
+        assert cfg.num_heads % args.ulysses_size == 0, f"`{cfg.num_heads=}` cannot be divided evenly by `{args.ulysses_size=}`."
+    logging.info(f"Generation job args: {args}")
+    logging.info(f"Generation model config: {cfg}")
+    if dist.is_initialized():
+        base_seed = [args.base_seed] if rank == 0 else [None]
+        dist.broadcast_object_list(base_seed, src=0)
+        args.base_seed = base_seed[0]
+    assert args.task == "infinitetalk-14B", "args.task must be 'infinitetalk-14B'."
+    logging.info("Creating infinitetalk pipeline.")
+    wan_i2v = wan.InfiniteTalkPipeline(
+        config=cfg,
+        checkpoint_dir=args.ckpt_dir,
+        quant_dir=args.quant_dir,
+        device_id=device,
+        rank=rank,
+        t5_fsdp=args.t5_fsdp,
+        dit_fsdp=args.dit_fsdp,
+        use_usp=(args.ulysses_size > 1 or args.ring_size > 1),
+        t5_cpu=args.t5_cpu,
+        lora_dir=args.lora_dir,
+        lora_scales=args.lora_scale,
+        quant=args.quant,
+        dit_path=args.dit_path,
+        infinitetalk_dir=args.infinitetalk_dir
+    )
+    if args.num_persistent_param_in_dit is not None:
+        wan_i2v.vram_management = True
+        wan_i2v.enable_vram_management(
+            num_persistent_param_in_dit=args.num_persistent_param_in_dit
+        )
+    generated_list = []
+    with open(args.input_json, 'r', encoding='utf-8') as f:
+        input_data = json.load(f)
+    wav2vec_feature_extractor, audio_encoder = custom_init('cpu', args.wav2vec_dir)
+    args.audio_save_dir = os.path.join(args.audio_save_dir, input_data['cond_video'].split('/')[-1].split('.')[0])
+    os.makedirs(args.audio_save_dir,exist_ok=True)
+    conds_list = []
+    if args.scene_seg and is_video(input_data['cond_video']):
+        time_list, cond_list = shot_detect(input_data['cond_video'], args.audio_save_dir)
+        if len(time_list)==0:
+            conds_list.append([input_data['cond_video']])
+            conds_list.append([input_data['cond_audio']['person1']])
+            if len(input_data['cond_audio'])==2:
+                conds_list.append([input_data['cond_audio']['person2']])
+        else:
+            audio1_list = split_wav_librosa(input_data['cond_audio']['person1'], time_list, args.audio_save_dir)
+            conds_list.append(cond_list)
+            conds_list.append(audio1_list)
+            if len(input_data['cond_audio'])==2:
+                audio2_list = split_wav_librosa(input_data['cond_audio']['person2'], time_list, args.audio_save_dir)
+                conds_list.append(audio2_list)
+    else:
+        conds_list.append([input_data['cond_video']])
+        conds_list.append([input_data['cond_audio']['person1']])
+        if len(input_data['cond_audio'])==2:
+            conds_list.append([input_data['cond_audio']['person2']])
+    if len(input_data['cond_audio'])==2:
+        new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(input_data['cond_audio']['person1'], input_data['cond_audio']['person2'], input_data['audio_type'])
+        sum_audio = os.path.join(args.audio_save_dir, 'sum_all.wav')
+        sf.write(sum_audio, sum_human_speechs, 16000)
+        input_data['video_audio'] = sum_audio
+    else:
+        human_speech = audio_prepare_single(input_data['cond_audio']['person1'])
+        sum_audio = os.path.join(args.audio_save_dir, 'sum_all.wav')
+        sf.write(sum_audio, human_speech, 16000)
+        input_data['video_audio'] = sum_audio
+    logging.info("Generating video ...")
+    for idx, items in enumerate(zip(*conds_list)):
+        print(items)
+        input_clip = {}
+        input_clip['prompt'] = input_data['prompt']
+        input_clip['cond_video'] = items[0]
+        if 'audio_type' in input_data:
+            input_clip['audio_type'] = input_data['audio_type']
+        if 'bbox' in input_data:
+            input_clip['bbox'] = input_data['bbox']
+        cond_audio = {}
+        if args.audio_mode=='localfile':
+            if len(input_data['cond_audio'])==2:
+                new_human_speech1, new_human_speech2, sum_human_speechs = audio_prepare_multi(items[1], items[2], input_data['audio_type'])
+                audio_embedding_1 = get_embedding(new_human_speech1, wav2vec_feature_extractor, audio_encoder)
+                audio_embedding_2 = get_embedding(new_human_speech2, wav2vec_feature_extractor, audio_encoder)
+                emb1_path = os.path.join(args.audio_save_dir, '1.pt')
+                emb2_path = os.path.join(args.audio_save_dir, '2.pt')
+                sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+                sf.write(sum_audio, sum_human_speechs, 16000)
+                torch.save(audio_embedding_1, emb1_path)
+                torch.save(audio_embedding_2, emb2_path)
+                cond_audio['person1'] = emb1_path
+                cond_audio['person2'] = emb2_path
+                input_clip['video_audio'] = sum_audio
+                v_length = audio_embedding_1.shape[0]
+            elif len(input_data['cond_audio'])==1:
+                human_speech = audio_prepare_single(items[1])
+                audio_embedding = get_embedding(human_speech, wav2vec_feature_extractor, audio_encoder)
+                emb_path = os.path.join(args.audio_save_dir, '1.pt')
+                sum_audio = os.path.join(args.audio_save_dir, 'sum.wav')
+                sf.write(sum_audio, human_speech, 16000)
+                torch.save(audio_embedding, emb_path)
+                cond_audio['person1'] = emb_path
+                input_clip['video_audio'] = sum_audio
+                v_length = audio_embedding.shape[0]
+        input_clip['cond_audio'] = cond_audio
+        video = wan_i2v.generate_infinitetalk(
+            input_clip,
+            size_buckget=args.size,
+            motion_frame=args.motion_frame,
+            frame_num=args.frame_num,
+            shift=args.sample_shift,
+            sampling_steps=args.sample_steps,
+            text_guide_scale=args.sample_text_guide_scale,
+            audio_guide_scale=args.sample_audio_guide_scale,
+            seed=args.base_seed,
+            offload_model=args.offload_model,
+            max_frames_num=args.frame_num if args.mode == 'clip' else args.max_frame_num,
+            color_correction_strength = args.color_correction_strength,
+            extra_args=args,
+            )
+        generated_list.append(video)
+    if rank == 0:
+        if args.save_file is None:
+            formatted_time = datetime.now().strftime("%Y%m%d_%H%M%S")
+            formatted_prompt = input_clip['prompt'].replace(" ", "_").replace("/", "_")[:50]
+            args.save_file = f"{args.task}_{args.size.replace('*','x') if sys.platform=='win32' else args.size}_{args.ulysses_size}_{args.ring_size}_{formatted_prompt}_{formatted_time}"
+        sum_video = torch.cat(generated_list, dim=1)
+        save_video_ffmpeg(sum_video, args.save_file, [input_data['video_audio']], high_quality_save=False)
+    logging.info(f"Saving generated video to {args.save_file}.mp4")
+    logging.info("Finished.")
+if __name__ == "__main__":
+    args = _parse_args()
+    generate(args)
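
Note on `process_tts_multi`: the `(s1)`/`(s2)` script handling can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `fake_tts` stand-in (one sample per character) in place of the Kokoro pipeline. `split_script` and `build_tracks` are illustrative names, not functions from this commit. It reuses the regular expression from the diff and shows how each turn adds speech to one track and an equal stretch of silence to the other, so both tracks keep the same length.

```python
import re

# Same pattern as in process_tts_multi: "(sN)" followed by that speaker's text.
SPEAKER_PATTERN = r'\(s(\d+)\)\s*(.*?)(?=\s*\(s\d+\)|$)'


def split_script(text):
    """Return (speaker_id, utterance) pairs in the order they appear."""
    return re.findall(SPEAKER_PATTERN, text, re.DOTALL)


def fake_tts(content):
    """Hypothetical stand-in for the TTS pipeline: one sample per character."""
    return [1.0] * len(content)


def build_tracks(text):
    """Build two equally long tracks; the silent speaker gets zeros for each turn."""
    track1, track2 = [], []
    for speaker, content in split_script(text):
        audio = fake_tts(content)
        silence = [0.0] * len(audio)
        if speaker == '1':
            track1 += audio
            track2 += silence
        elif speaker == '2':
            track2 += audio
            track1 += silence
    return track1, track2


script = "(s1) Hi there. (s2) Hello!"
turns = split_script(script)
track1, track2 = build_tracks(script)
mixed = [a + b for a, b in zip(track1, track2)]
print(turns)
print(len(track1), len(track2))
```

Because the tracks are padded turn by turn, summing them sample by sample gives the combined audio, which is what the commit writes to `sum.wav`.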

kokoro/__init__.py ADDED

	@@ -0,0 +1,23 @@

+__version__ = '0.9.4'
+from loguru import logger
+import sys
+# Remove default handler
+logger.remove()
+# Add custom handler with clean format including module and line number
+logger.add(
+    sys.stderr,
+    format="<green>{time:HH:mm:ss}</green> | <cyan>{module:>16}:{line}</cyan> | <level>{level: >8}</level> | <level>{message}</level>",
+    colorize=True,
+    level="INFO" # "DEBUG" to enable logger.debug("message") and up prints
+                 # "ERROR" to enable only logger.error("message") prints
+                 # etc
+)
+# Disable before release or as needed
+logger.disable("kokoro")
+from .model import KModel
+from .pipeline import KPipeline

kokoro/__main__.py ADDED

	@@ -0,0 +1,148 @@

+"""Kokoro TTS CLI
+Example usage:
+python3 -m kokoro --text "The sky above the port was the color of television, tuned to a dead channel." -o file.wav --debug
+echo "Bom dia mundo, como vão vocês" > text.txt
+python3 -m kokoro -i text.txt -l p --voice pm_alex -o audio.wav
+Common issues:
+pip not installed: `uv pip install pip`
+(Temporary workaround while https://github.com/explosion/spaCy/issues/13747 is not fixed)
+espeak not installed: `apt-get install espeak-ng`
+"""
+import argparse
+import wave
+from pathlib import Path
+from typing import Generator, TYPE_CHECKING
+import numpy as np
+from loguru import logger
+languages = [
+    "a",  # American English
+    "b",  # British English
+    "h",  # Hindi
+    "e",  # Spanish
+    "f",  # French
+    "i",  # Italian
+    "p",  # Brazilian Portuguese
+    "j",  # Japanese
+    "z",  # Mandarin Chinese
+]
+if TYPE_CHECKING:
+    from kokoro import KPipeline
+def generate_audio(
+    text: str, kokoro_language: str, voice: str, speed=1
+) -> Generator["KPipeline.Result", None, None]:
+    from kokoro import KPipeline
+    if not voice.startswith(kokoro_language):
+        logger.warning(f"Voice {voice} is not made for language {kokoro_language}")
+    pipeline = KPipeline(lang_code=kokoro_language)
+    yield from pipeline(text, voice=voice, speed=speed, split_pattern=r"\n+")
+def generate_and_save_audio(
+    output_file: Path, text: str, kokoro_language: str, voice: str, speed=1
+) -> None:
+    with wave.open(str(output_file.resolve()), "wb") as wav_file:
+        wav_file.setnchannels(1)  # Mono audio
+        wav_file.setsampwidth(2)  # 2 bytes per sample (16-bit audio)
+        wav_file.setframerate(24000)  # Sample rate
+        for result in generate_audio(
+            text, kokoro_language=kokoro_language, voice=voice, speed=speed
+        ):
+            logger.debug(result.phonemes)
+            if result.audio is None:
+                continue
+            audio_bytes = (result.audio.numpy() * 32767).astype(np.int16).tobytes()
+            wav_file.writeframes(audio_bytes)
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-m",
+        "--voice",
+        default="af_heart",
+        help="Voice to use",
+    )
+    parser.add_argument(
+        "-l",
+        "--language",
+        help="Language to use (defaults to the one corresponding to the voice)",
+        choices=languages,
+    )
+    parser.add_argument(
+        "-o",
+        "--output-file",
+        "--output_file",
+        type=Path,
+        help="Path to output WAV file",
+        required=True,
+    )
+    parser.add_argument(
+        "-i",
+        "--input-file",
+        "--input_file",
+        type=Path,
+        help="Path to input text file (default: stdin)",
+    )
+    parser.add_argument(
+        "-t",
+        "--text",
+        help="Text to use instead of reading from stdin",
+    )
+    parser.add_argument(
+        "-s",
+        "--speed",
+        type=float,
+        default=1.0,
+        help="Speech speed",
+    )
+    parser.add_argument(
+        "--debug",
+        action="store_true",
+        help="Print DEBUG messages to console",
+    )
+    args = parser.parse_args()
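+    if args.debug:
+        # logger.level("DEBUG") only looks up the level; re-add the sink to actually show DEBUG output.
+        import sys
+        logger.remove()
+        logger.add(sys.stderr, level="DEBUG")
+        logger.enable("kokoro")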
+    logger.debug(args)
+    lang = args.language or args.voice[0]
+    if args.text is not None and args.input_file is not None:
+        raise Exception("You cannot specify both 'text' and 'input_file'")
+    elif args.text:
+        text = args.text
+    elif args.input_file:
+        file: Path = args.input_file
+        text = file.read_text()
+    else:
+        import sys
+        print("Press Ctrl+D to stop reading input and start generating", flush=True)
+        text = '\n'.join(sys.stdin)
+    logger.debug(f"Input text: {text!r}")
+    out_file: Path = args.output_file
+    if out_file.suffix != ".wav":
+        logger.warning("The output file name should end with .wav")
+    generate_and_save_audio(
+        output_file=out_file,
+        text=text,
+        kokoro_language=lang,
+        voice=args.voice,
+        speed=args.speed,
+    )
+if __name__ == "__main__":
+    main()
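
Note on `generate_and_save_audio`: the float-to-PCM conversion can be illustrated without Kokoro or NumPy. The following is a minimal sketch, assuming a synthetic sine tone as input rather than real TTS output. `float_to_pcm16` and `write_wav` are hypothetical helper names, and the clipping step is an addition that the code in this commit does not perform.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24000  # matches setframerate(24000) in generate_and_save_audio


def float_to_pcm16(samples):
    """Scale floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    # Clipping is an assumption added here to keep values inside the int16 range.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)


def write_wav(path, samples):
    """Write mono 16-bit audio at SAMPLE_RATE, mirroring the WAV settings in the diff."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(float_to_pcm16(samples))


# 0.1 s of a 440 Hz tone at half amplitude (illustrative input only).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2400)]
out_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
write_wav(out_path, tone)

# Read the header back to confirm what was written.
with wave.open(out_path, "rb") as wav_file:
    info = (
        wav_file.getnchannels(),
        wav_file.getsampwidth(),
        wav_file.getframerate(),
        wav_file.getnframes(),
    )
print(info)
```

Writing each chunk with `writeframes` as it arrives, as the commit does, keeps memory use flat for long texts because the `wave` module updates the header when the file is closed.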

kokoro/custom_stft.py ADDED

	@@ -0,0 +1,197 @@

+from attr import attr
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class CustomSTFT(nn.Module):
+    """
+    STFT/iSTFT without unfold/complex ops, using conv1d and conv_transpose1d.
+    - forward STFT => Real-part conv1d + Imag-part conv1d
+    - inverse STFT => Real-part conv_transpose1d + Imag-part conv_transpose1d + sum
+    - avoids F.unfold, so easier to export to ONNX
+    - uses replicate or constant padding for 'center=True' to approximate 'reflect'
+      (reflect is not supported for dynamic shapes in ONNX)
+    """
+    def __init__(
+        self,
+        filter_length=800,
+        hop_length=200,
+        win_length=800,
+        window="hann",
+        center=True,
+        pad_mode="replicate",  # or 'constant'
+    ):
+        super().__init__()
+        self.filter_length = filter_length
+        self.hop_length = hop_length
+        self.win_length = win_length
+        self.n_fft = filter_length
+        self.center = center
+        self.pad_mode = pad_mode
+        # Number of frequency bins for real-valued STFT with onesided=True
+        self.freq_bins = self.n_fft // 2 + 1
+        # Build window
+        assert window == 'hann', window
+        window_tensor = torch.hann_window(win_length, periodic=True, dtype=torch.float32)
+        if self.win_length < self.n_fft:
+            # Zero-pad up to n_fft
+            extra = self.n_fft - self.win_length
+            window_tensor = F.pad(window_tensor, (0, extra))
+        elif self.win_length > self.n_fft:
+            window_tensor = window_tensor[: self.n_fft]
+        self.register_buffer("window", window_tensor)
+        # Precompute forward DFT (real, imag)
+        # PyTorch stft uses e^{-j 2 pi k n / N} => real=cos(...), imag=-sin(...)
+        n = np.arange(self.n_fft)
+        k = np.arange(self.freq_bins)
+        angle = 2 * np.pi * np.outer(k, n) / self.n_fft  # shape (freq_bins, n_fft)
+        dft_real = np.cos(angle)
+        dft_imag = -np.sin(angle)  # note negative sign
+        # Combine window and dft => shape (freq_bins, filter_length)
+        # We'll make 2 conv weight tensors of shape (freq_bins, 1, filter_length).
+        forward_window = window_tensor.numpy()  # shape (n_fft,)
+        forward_real = dft_real * forward_window  # (freq_bins, n_fft)
+        forward_imag = dft_imag * forward_window
+        # Convert to PyTorch
+        forward_real_torch = torch.from_numpy(forward_real).float()
+        forward_imag_torch = torch.from_numpy(forward_imag).float()
+        # Register as Conv1d weight => (out_channels, in_channels, kernel_size)
+        # out_channels = freq_bins, in_channels=1, kernel_size=n_fft
+        self.register_buffer(
+            "weight_forward_real", forward_real_torch.unsqueeze(1)
+        )
+        self.register_buffer(
+            "weight_forward_imag", forward_imag_torch.unsqueeze(1)
+        )
+        # Precompute inverse DFT
+        # Real iFFT formula => scale = 1/n_fft, doubling for bins 1..freq_bins-2 if n_fft even, etc.
+        # For simplicity, we won't do the "DC/nyquist not doubled" approach here.
+        # If you want perfect real iSTFT, you can add that logic.
+        # This version just yields good approximate reconstruction with Hann + typical overlap.
+        inv_scale = 1.0 / self.n_fft
+        n = np.arange(self.n_fft)
+        angle_t = 2 * np.pi * np.outer(n, k) / self.n_fft  # shape (n_fft, freq_bins)
+        idft_cos = np.cos(angle_t).T  # => (freq_bins, n_fft)
+        idft_sin = np.sin(angle_t).T  # => (freq_bins, n_fft)
+        # Multiply by window again for typical overlap-add
+        # We also incorporate the scale factor 1/n_fft
+        inv_window = window_tensor.numpy() * inv_scale
+        backward_real = idft_cos * inv_window  # (freq_bins, n_fft)
+        backward_imag = idft_sin * inv_window
+        # We'll implement iSTFT as real+imag conv_transpose with stride=hop.
+        self.register_buffer(
+            "weight_backward_real", torch.from_numpy(backward_real).float().unsqueeze(1)
+        )
+        self.register_buffer(
+            "weight_backward_imag", torch.from_numpy(backward_imag).float().unsqueeze(1)
+        )
+    def transform(self, waveform: torch.Tensor):
+        """
+        Forward STFT => returns magnitude, phase
+        Output shape => (batch, freq_bins, frames)
+        """
+        # waveform shape => (B, T).  conv1d expects (B, 1, T).
+        # Optional center pad
+        if self.center:
+            pad_len = self.n_fft // 2
+            waveform = F.pad(waveform, (pad_len, pad_len), mode=self.pad_mode)
+        x = waveform.unsqueeze(1)  # => (B, 1, T)
+        # Convolution to get real part => shape (B, freq_bins, frames)
+        real_out = F.conv1d(
+            x,
+            self.weight_forward_real,
+            bias=None,
+            stride=self.hop_length,
+            padding=0,
+        )
+        # Imag part
+        imag_out = F.conv1d(
+            x,
+            self.weight_forward_imag,
+            bias=None,
+            stride=self.hop_length,
+            padding=0,
+        )
+        # magnitude, phase
+        magnitude = torch.sqrt(real_out**2 + imag_out**2 + 1e-14)
+        phase = torch.atan2(imag_out, real_out)
+        # Handle the case where imag_out is 0 and real_out is negative to correct ONNX atan2 to match PyTorch
+        # In this case, PyTorch returns pi, ONNX returns -pi
+        correction_mask = (imag_out == 0) & (real_out < 0)
+        phase[correction_mask] = torch.pi
+        return magnitude, phase
+    def inverse(self, magnitude: torch.Tensor, phase: torch.Tensor, length=None):
+        """
+        Inverse STFT => returns waveform shape (B, 1, T).
+        """
+        # magnitude, phase => (B, freq_bins, frames)
+        # Re-create real/imag => shape (B, freq_bins, frames)
+        real_part = magnitude * torch.cos(phase)
+        imag_part = magnitude * torch.sin(phase)
+        # conv_transpose1d expects (B, in_channels, input_length).
+        # Here in_channels = freq_bins and input_length = frames, so the
+        # (B, freq_bins, frames) tensors can be passed through unchanged.
+        real_part = real_part  # (B, freq_bins, frames)
+        imag_part = imag_part
+        # real iSTFT => convolve with "backward_real", "backward_imag", and sum
+        # We'll do 2 conv_transpose calls, each giving (B, 1, time),
+        # then add them => (B, 1, time).
+        real_rec = F.conv_transpose1d(
+            real_part,
+            self.weight_backward_real,  # shape (freq_bins, 1, filter_length)
+            bias=None,
+            stride=self.hop_length,
+            padding=0,
+        )
+        imag_rec = F.conv_transpose1d(
+            imag_part,
+            self.weight_backward_imag,
+            bias=None,
+            stride=self.hop_length,
+            padding=0,
+        )
+        # sum => (B, 1, time)
+        waveform = real_rec - imag_rec  # typical real iFFT has minus for imaginary part
+        # If we used "center=True" in forward, we should remove pad
+        if self.center:
+            pad_len = self.n_fft // 2
+            # Because of transposed convolution, total length might have extra samples
+            # We remove `pad_len` from start & end if possible
+            waveform = waveform[..., pad_len:-pad_len]
+        # If a specific length is desired, clamp
+        if length is not None:
+            waveform = waveform[..., :length]
+        # shape => (B, 1, T), matching the conv_transpose1d output
+        return waveform
+    def forward(self, x: torch.Tensor):
+        """
+        Full STFT -> iSTFT pass: returns time-domain reconstruction.
+        Same interface as your original code.
+        """
+        mag, phase = self.transform(x)
+        return self.inverse(mag, phase, length=x.shape[-1])
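
A minimal sketch of the idea behind `CustomSTFT.transform`, assuming a single unwindowed frame and plain Python loops in place of `F.conv1d`: each one-sided frequency bin is a dot product of the frame with a cosine row (real part) and a negated sine row (imaginary part). The helper names `real_dft_frame` and `complex_dft_frame` are hypothetical and are not part of the repository.

```python
import cmath
import math


def real_dft_frame(frame):
    """One-sided DFT of a real frame using only cos / -sin weights."""
    n_fft = len(frame)
    freq_bins = n_fft // 2 + 1  # same bin count as onesided=True
    real, imag = [], []
    for k in range(freq_bins):
        angles = [2 * math.pi * k * n / n_fft for n in range(n_fft)]
        real.append(sum(x * math.cos(a) for x, a in zip(frame, angles)))
        imag.append(sum(-x * math.sin(a) for x, a in zip(frame, angles)))  # note negative sign
    return real, imag


def complex_dft_frame(frame):
    """Reference: the same bins computed with e^{-j 2 pi k n / N}."""
    n_fft = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_fft) for n, x in enumerate(frame))
        for k in range(n_fft // 2 + 1)
    ]


frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.2, -0.6]
real, imag = real_dft_frame(frame)
reference = complex_dft_frame(frame)
# Largest deviation between the real-valued and complex formulations
max_err = max(abs(complex(r, i) - c) for r, i, c in zip(real, imag, reference))
```

Here `max_err` should come out near floating-point rounding error. In the module above the same rows are additionally multiplied by the Hann window and applied with `stride=hop_length`, which this sketch leaves out.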

kokoro/istftnet.py ADDED Viewed

	@@ -0,0 +1,421 @@

+# ADAPTED from https://github.com/yl4579/StyleTTS2/blob/main/Modules/istftnet.py
+from kokoro.custom_stft import CustomSTFT
+from torch.nn.utils import weight_norm
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+# https://github.com/yl4579/StyleTTS2/blob/main/Modules/utils.py
+def init_weights(m, mean=0.0, std=0.01):
+    classname = m.__class__.__name__
+    if classname.find("Conv") != -1:
+        m.weight.data.normal_(mean, std)
+def get_padding(kernel_size, dilation=1):
+    return int((kernel_size*dilation - dilation)/2)
+class AdaIN1d(nn.Module):
+    def __init__(self, style_dim, num_features):
+        super().__init__()
+        # affine should be False, but a bug in the old torch.onnx.export (not the newer dynamo exporter) drops the channel dimension when affine=False. affine=True adds learnable parameters, which should not matter here because the model only runs in inference mode.
+        self.norm = nn.InstanceNorm1d(num_features, affine=True)
+        self.fc = nn.Linear(style_dim, num_features*2)
+    def forward(self, x, s):
+        h = self.fc(s)
+        h = h.view(h.size(0), h.size(1), 1)
+        gamma, beta = torch.chunk(h, chunks=2, dim=1)
+        return (1 + gamma) * self.norm(x) + beta
+class AdaINResBlock1(nn.Module):
+    def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), style_dim=64):
+        super(AdaINResBlock1, self).__init__()
+        self.convs1 = nn.ModuleList([
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
+                                  padding=get_padding(kernel_size, dilation[0]))),
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
+                                  padding=get_padding(kernel_size, dilation[1]))),
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
+                                  padding=get_padding(kernel_size, dilation[2])))
+        ])
+        self.convs1.apply(init_weights)
+        self.convs2 = nn.ModuleList([
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                  padding=get_padding(kernel_size, 1))),
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                  padding=get_padding(kernel_size, 1))),
+            weight_norm(nn.Conv1d(channels, channels, kernel_size, 1, dilation=1,
+                                  padding=get_padding(kernel_size, 1)))
+        ])
+        self.convs2.apply(init_weights)
+        self.adain1 = nn.ModuleList([
+            AdaIN1d(style_dim, channels),
+            AdaIN1d(style_dim, channels),
+            AdaIN1d(style_dim, channels),
+        ])
+        self.adain2 = nn.ModuleList([
+            AdaIN1d(style_dim, channels),
+            AdaIN1d(style_dim, channels),
+            AdaIN1d(style_dim, channels),
+        ])
+        self.alpha1 = nn.ParameterList([nn.Parameter(torch.ones(1, channels, 1)) for i in range(len(self.convs1))])
+        self.alpha2 = nn.ParameterList([nn.Parameter(torch.ones(1, channels, 1)) for i in range(len(self.convs2))])
+    def forward(self, x, s):
+        for c1, c2, n1, n2, a1, a2 in zip(self.convs1, self.convs2, self.adain1, self.adain2, self.alpha1, self.alpha2):
+            xt = n1(x, s)
+            xt = xt + (1 / a1) * (torch.sin(a1 * xt) ** 2)  # Snake1D
+            xt = c1(xt)
+            xt = n2(xt, s)
+            xt = xt + (1 / a2) * (torch.sin(a2 * xt) ** 2)  # Snake1D
+            xt = c2(xt)
+            x = xt + x
+        return x
+class TorchSTFT(nn.Module):
+    def __init__(self, filter_length=800, hop_length=200, win_length=800, window='hann'):
+        super().__init__()
+        self.filter_length = filter_length
+        self.hop_length = hop_length
+        self.win_length = win_length
+        assert window == 'hann', window
+        self.window = torch.hann_window(win_length, periodic=True, dtype=torch.float32)
+    def transform(self, input_data):
+        forward_transform = torch.stft(
+            input_data,
+            self.filter_length, self.hop_length, self.win_length, window=self.window.to(input_data.device),
+            return_complex=True)
+        return torch.abs(forward_transform), torch.angle(forward_transform)
+    def inverse(self, magnitude, phase):
+        inverse_transform = torch.istft(
+            magnitude * torch.exp(phase * 1j),
+            self.filter_length, self.hop_length, self.win_length, window=self.window.to(magnitude.device))
+        return inverse_transform.unsqueeze(-2)  # unsqueeze to stay consistent with conv_transpose1d implementation
+    def forward(self, input_data):
+        self.magnitude, self.phase = self.transform(input_data)
+        reconstruction = self.inverse(self.magnitude, self.phase)
+        return reconstruction
+class SineGen(nn.Module):
+    """ Definition of sine generator
+    SineGen(samp_rate, harmonic_num = 0,
+            sine_amp = 0.1, noise_std = 0.003,
+            voiced_threshold = 0,
+            flag_for_pulse=False)
+    samp_rate: sampling rate in Hz
+    harmonic_num: number of harmonic overtones (default 0)
+    sine_amp: amplitude of sine waveform (default 0.1)
+    noise_std: std of Gaussian noise (default 0.003)
+    voiced_threshold: F0 threshold for U/V classification (default 0)
+    flag_for_pulse: this SineGen is used inside PulseGen (default False)
+    Note: when flag_for_pulse is True, the first time step of a voiced
+        segment is always sin(torch.pi) or cos(0)
+    """
+    def __init__(self, samp_rate, upsample_scale, harmonic_num=0,
+                 sine_amp=0.1, noise_std=0.003,
+                 voiced_threshold=0,
+                 flag_for_pulse=False):
+        super(SineGen, self).__init__()
+        self.sine_amp = sine_amp
+        self.noise_std = noise_std
+        self.harmonic_num = harmonic_num
+        self.dim = self.harmonic_num + 1
+        self.sampling_rate = samp_rate
+        self.voiced_threshold = voiced_threshold
+        self.flag_for_pulse = flag_for_pulse
+        self.upsample_scale = upsample_scale
+    def _f02uv(self, f0):
+        # generate uv signal
+        uv = (f0 > self.voiced_threshold).type(torch.float32)
+        return uv
+    def _f02sine(self, f0_values):
+        """ f0_values: (batchsize, length, dim)
+            where dim indicates fundamental tone and overtones
+        """
+        # convert F0 to a per-sample phase increment in cycles. The integer part n
+        # can be ignored because 2 * torch.pi * n doesn't affect phase
+        rad_values = (f0_values / self.sampling_rate) % 1
+        # initial phase noise (no noise for fundamental component)
+        rand_ini = torch.rand(f0_values.shape[0], f0_values.shape[2], device=f0_values.device)
+        rand_ini[:, 0] = 0
+        rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
+        # instantaneous phase: sine[t] = sin(2*pi * \sum_{i=1}^{t} rad[i])
+        if not self.flag_for_pulse:
+            rad_values = F.interpolate(rad_values.transpose(1, 2), scale_factor=1/self.upsample_scale, mode="linear").transpose(1, 2)
+            phase = torch.cumsum(rad_values, dim=1) * 2 * torch.pi
+            phase = F.interpolate(phase.transpose(1, 2) * self.upsample_scale, scale_factor=self.upsample_scale, mode="linear").transpose(1, 2)
+            sines = torch.sin(phase)
+        else:
+            # If necessary, make sure that the first time step of every
+            # voiced segment is sin(pi) or cos(0).
+            # This is used for pulse-train generation.
+            # Identify the last time step in unvoiced segments.
+            uv = self._f02uv(f0_values)
+            uv_1 = torch.roll(uv, shifts=-1, dims=1)
+            uv_1[:, -1, :] = 1
+            u_loc = (uv < 1) * (uv_1 > 0)
+            # get the instantaneous phase
+            tmp_cumsum = torch.cumsum(rad_values, dim=1)
+            # different batch needs to be processed differently
+            for idx in range(f0_values.shape[0]):
+                temp_sum = tmp_cumsum[idx, u_loc[idx, :, 0], :]
+                temp_sum[1:, :] = temp_sum[1:, :] - temp_sum[0:-1, :]
+                # stores the accumulation of instantaneous phase within
+                # each voiced segment
+                tmp_cumsum[idx, :, :] = 0
+                tmp_cumsum[idx, u_loc[idx, :, 0], :] = temp_sum
+            # rad_values - tmp_cumsum: remove the accumulation of i.phase
+            # within the previous voiced segment.
+            i_phase = torch.cumsum(rad_values - tmp_cumsum, dim=1)
+            # get the sines
+            sines = torch.cos(i_phase * 2 * torch.pi)
+        return sines
+    def forward(self, f0):
+        """ sine_tensor, uv = forward(f0)
+        input F0: tensor(batchsize=1, length, dim=1)
+                  f0 for unvoiced steps should be 0
+        output sine_tensor: tensor(batchsize=1, length, dim)
+        output uv: tensor(batchsize=1, length, 1)
+        """
+        f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim, device=f0.device)
+        # fundamental component
+        fn = torch.multiply(f0, torch.FloatTensor([[range(1, self.harmonic_num + 2)]]).to(f0.device))
+        # generate sine waveforms
+        sine_waves = self._f02sine(fn) * self.sine_amp
+        # generate uv signal
+        # uv = torch.ones(f0.shape)
+        # uv = uv * (f0 > self.voiced_threshold)
+        uv = self._f02uv(f0)
+        # noise: for unvoiced should be similar to sine_amp
+        #        std = self.sine_amp/3 -> max value ~ self.sine_amp
+        #        for voiced regions is self.noise_std
+        noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
+        noise = noise_amp * torch.randn_like(sine_waves)
+        # first: set the unvoiced part to 0 by uv
+        # then: additive noise
+        sine_waves = sine_waves * uv + noise
+        return sine_waves, uv, noise
+class SourceModuleHnNSF(nn.Module):
+    """ SourceModule for hn-nsf
+    SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
+                 add_noise_std=0.003, voiced_threshod=0)
+    sampling_rate: sampling_rate in Hz
+    harmonic_num: number of harmonics above F0 (default: 0)
+    sine_amp: amplitude of sine source signal (default: 0.1)
+    add_noise_std: std of additive Gaussian noise (default: 0.003)
+        note that amplitude of noise in unvoiced is decided
+        by sine_amp
+    voiced_threshod: threshold to set U/V given F0 (default: 0)
+        (parameter name as spelled in the signature)
+    Sine_source, noise_source, uv = SourceModuleHnNSF(F0_sampled)
+    F0_sampled (batchsize, length, 1)
+    Sine_source (batchsize, length, 1)
+    noise_source (batchsize, length, 1)
+    uv (batchsize, length, 1)
+    """
+    def __init__(self, sampling_rate, upsample_scale, harmonic_num=0, sine_amp=0.1,
+                 add_noise_std=0.003, voiced_threshod=0):
+        super(SourceModuleHnNSF, self).__init__()
+        self.sine_amp = sine_amp
+        self.noise_std = add_noise_std
+        # to produce sine waveforms
+        self.l_sin_gen = SineGen(sampling_rate, upsample_scale, harmonic_num,
+                                 sine_amp, add_noise_std, voiced_threshod)
+        # to merge source harmonics into a single excitation
+        self.l_linear = nn.Linear(harmonic_num + 1, 1)
+        self.l_tanh = nn.Tanh()
+    def forward(self, x):
+        """
+        Sine_source, noise_source, uv = SourceModuleHnNSF(F0_sampled)
+        F0_sampled (batchsize, length, 1)
+        Sine_source (batchsize, length, 1)
+        noise_source (batchsize, length, 1)
+        uv (batchsize, length, 1)
+        """
+        # source for harmonic branch
+        with torch.no_grad():
+            sine_wavs, uv, _ = self.l_sin_gen(x)
+        sine_merge = self.l_tanh(self.l_linear(sine_wavs))
+        # source for noise branch, in the same shape as uv
+        noise = torch.randn_like(uv) * self.sine_amp / 3
+        return sine_merge, noise, uv
+class Generator(nn.Module):
+    def __init__(self, style_dim, resblock_kernel_sizes, upsample_rates, upsample_initial_channel, resblock_dilation_sizes, upsample_kernel_sizes, gen_istft_n_fft, gen_istft_hop_size, disable_complex=False):
+        super(Generator, self).__init__()
+        self.num_kernels = len(resblock_kernel_sizes)
+        self.num_upsamples = len(upsample_rates)
+        self.m_source = SourceModuleHnNSF(
+                    sampling_rate=24000,
+                    upsample_scale=math.prod(upsample_rates) * gen_istft_hop_size,
+                    harmonic_num=8, voiced_threshod=10)
+        self.f0_upsamp = nn.Upsample(scale_factor=math.prod(upsample_rates) * gen_istft_hop_size)
+        self.noise_convs = nn.ModuleList()
+        self.noise_res = nn.ModuleList()
+        self.ups = nn.ModuleList()
+        for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
+            self.ups.append(weight_norm(
+                nn.ConvTranspose1d(upsample_initial_channel//(2**i), upsample_initial_channel//(2**(i+1)),
+                                   k, u, padding=(k-u)//2)))
+        self.resblocks = nn.ModuleList()
+        for i in range(len(self.ups)):
+            ch = upsample_initial_channel//(2**(i+1))
+            for j, (k, d) in enumerate(zip(resblock_kernel_sizes,resblock_dilation_sizes)):
+                self.resblocks.append(AdaINResBlock1(ch, k, d, style_dim))
+            c_cur = upsample_initial_channel // (2 ** (i + 1))
+            if i + 1 < len(upsample_rates):
+                stride_f0 = math.prod(upsample_rates[i + 1:])
+                self.noise_convs.append(nn.Conv1d(
+                    gen_istft_n_fft + 2, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=(stride_f0+1) // 2))
+                self.noise_res.append(AdaINResBlock1(c_cur, 7, [1,3,5], style_dim))
+            else:
+                self.noise_convs.append(nn.Conv1d(gen_istft_n_fft + 2, c_cur, kernel_size=1))
+                self.noise_res.append(AdaINResBlock1(c_cur, 11, [1,3,5], style_dim))
+        self.post_n_fft = gen_istft_n_fft
+        self.conv_post = weight_norm(nn.Conv1d(ch, self.post_n_fft + 2, 7, 1, padding=3))
+        self.ups.apply(init_weights)
+        self.conv_post.apply(init_weights)
+        self.reflection_pad = nn.ReflectionPad1d((1, 0))
+        self.stft = (
+            CustomSTFT(filter_length=gen_istft_n_fft, hop_length=gen_istft_hop_size, win_length=gen_istft_n_fft)
+            if disable_complex
+            else TorchSTFT(filter_length=gen_istft_n_fft, hop_length=gen_istft_hop_size, win_length=gen_istft_n_fft)
+        )
+    def forward(self, x, s, f0):
+        with torch.no_grad():
+            f0 = self.f0_upsamp(f0[:, None]).transpose(1, 2)  # bs,n,t
+            har_source, noi_source, uv = self.m_source(f0)
+            har_source = har_source.transpose(1, 2).squeeze(1)
+            har_spec, har_phase = self.stft.transform(har_source)
+            har = torch.cat([har_spec, har_phase], dim=1)
+        for i in range(self.num_upsamples):
+            x = F.leaky_relu(x, negative_slope=0.1)
+            x_source = self.noise_convs[i](har)
+            x_source = self.noise_res[i](x_source, s)
+            x = self.ups[i](x)
+            if i == self.num_upsamples - 1:
+                x = self.reflection_pad(x)
+            x = x + x_source
+            xs = None
+            for j in range(self.num_kernels):
+                if xs is None:
+                    xs = self.resblocks[i*self.num_kernels+j](x, s)
+                else:
+                    xs += self.resblocks[i*self.num_kernels+j](x, s)
+            x = xs / self.num_kernels
+        x = F.leaky_relu(x)
+        x = self.conv_post(x)
+        spec = torch.exp(x[:,:self.post_n_fft // 2 + 1, :])
+        phase = torch.sin(x[:, self.post_n_fft // 2 + 1:, :])
+        return self.stft.inverse(spec, phase)
+class UpSample1d(nn.Module):
+    def __init__(self, layer_type):
+        super().__init__()
+        self.layer_type = layer_type
+    def forward(self, x):
+        if self.layer_type == 'none':
+            return x
+        else:
+            return F.interpolate(x, scale_factor=2, mode='nearest')
+class AdainResBlk1d(nn.Module):
+    def __init__(self, dim_in, dim_out, style_dim=64, actv=nn.LeakyReLU(0.2), upsample='none', dropout_p=0.0):
+        super().__init__()
+        self.actv = actv
+        self.upsample_type = upsample
+        self.upsample = UpSample1d(upsample)
+        self.learned_sc = dim_in != dim_out
+        self._build_weights(dim_in, dim_out, style_dim)
+        self.dropout = nn.Dropout(dropout_p)
+        if upsample == 'none':
+            self.pool = nn.Identity()
+        else:
+            self.pool = weight_norm(nn.ConvTranspose1d(dim_in, dim_in, kernel_size=3, stride=2, groups=dim_in, padding=1, output_padding=1))
+    def _build_weights(self, dim_in, dim_out, style_dim):
+        self.conv1 = weight_norm(nn.Conv1d(dim_in, dim_out, 3, 1, 1))
+        self.conv2 = weight_norm(nn.Conv1d(dim_out, dim_out, 3, 1, 1))
+        self.norm1 = AdaIN1d(style_dim, dim_in)
+        self.norm2 = AdaIN1d(style_dim, dim_out)
+        if self.learned_sc:
+            self.conv1x1 = weight_norm(nn.Conv1d(dim_in, dim_out, 1, 1, 0, bias=False))
+    def _shortcut(self, x):
+        x = self.upsample(x)
+        if self.learned_sc:
+            x = self.conv1x1(x)
+        return x
+    def _residual(self, x, s):
+        x = self.norm1(x, s)
+        x = self.actv(x)
+        x = self.pool(x)
+        x = self.conv1(self.dropout(x))
+        x = self.norm2(x, s)
+        x = self.actv(x)
+        x = self.conv2(self.dropout(x))
+        return x
+    def forward(self, x, s):
+        out = self._residual(x, s)
+        out = (out + self._shortcut(x)) * torch.rsqrt(torch.tensor(2))
+        return out
+class Decoder(nn.Module):
+    def __init__(self, dim_in, style_dim, dim_out,
+                 resblock_kernel_sizes,
+                 upsample_rates,
+                 upsample_initial_channel,
+                 resblock_dilation_sizes,
+                 upsample_kernel_sizes,
+                 gen_istft_n_fft, gen_istft_hop_size,
+                 disable_complex=False):
+        super().__init__()
+        self.encode = AdainResBlk1d(dim_in + 2, 1024, style_dim)
+        self.decode = nn.ModuleList()
+        self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
+        self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
+        self.decode.append(AdainResBlk1d(1024 + 2 + 64, 1024, style_dim))
+        self.decode.append(AdainResBlk1d(1024 + 2 + 64, 512, style_dim, upsample=True))
+        self.F0_conv = weight_norm(nn.Conv1d(1, 1, kernel_size=3, stride=2, groups=1, padding=1))
+        self.N_conv = weight_norm(nn.Conv1d(1, 1, kernel_size=3, stride=2, groups=1, padding=1))
+        self.asr_res = nn.Sequential(weight_norm(nn.Conv1d(512, 64, kernel_size=1)))
+        self.generator = Generator(style_dim, resblock_kernel_sizes, upsample_rates,
+                                   upsample_initial_channel, resblock_dilation_sizes,
+                                   upsample_kernel_sizes, gen_istft_n_fft, gen_istft_hop_size, disable_complex=disable_complex)
+    def forward(self, asr, F0_curve, N, s):
+        F0 = self.F0_conv(F0_curve.unsqueeze(1))
+        N = self.N_conv(N.unsqueeze(1))
+        x = torch.cat([asr, F0, N], axis=1)
+        x = self.encode(x, s)
+        asr_res = self.asr_res(asr)
+        res = True
+        for block in self.decode:
+            if res:
+                x = torch.cat([x, asr_res, F0, N], axis=1)
+            x = block(x, s)
+            if block.upsample_type != "none":
+                res = False
+        x = self.generator(x, s, F0_curve)
+        return x
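
A simplified sketch of what `SineGen._f02sine` and `_f02uv` compute in the non-pulse branch, assuming a single fundamental with no overtones, no additive noise, no random initial phase, and none of the down/up interpolation used above. The function name `sine_from_f0` is hypothetical. F0 is turned into a per-sample phase increment in cycles, accumulated, and passed through a sine; unvoiced samples are masked to zero.

```python
import math
from itertools import accumulate


def sine_from_f0(f0_values, sampling_rate, voiced_threshold=0.0, sine_amp=0.1):
    """Turn a per-sample F0 track (Hz) into a masked sine excitation."""
    # Phase increment per sample in cycles; the integer part does not affect the sine.
    rad_values = [(f0 / sampling_rate) % 1 for f0 in f0_values]
    # Running sum of increments gives the instantaneous phase in cycles.
    phase_cycles = list(accumulate(rad_values))
    # Voiced/unvoiced flag: 1.0 where F0 exceeds the threshold, else 0.0.
    uv = [1.0 if f0 > voiced_threshold else 0.0 for f0 in f0_values]
    sines = [
        sine_amp * math.sin(2 * math.pi * p) * v
        for p, v in zip(phase_cycles, uv)
    ]
    return sines, uv


# 100 Hz at an 800 Hz sampling rate: one period every 8 samples,
# followed by 4 unvoiced samples (F0 = 0).
f0 = [100.0] * 8 + [0.0] * 4
sines, uv = sine_from_f0(f0, sampling_rate=800)
```

With these inputs the sine peaks at the second sample (a quarter of a period after the first accumulation step) and the last four samples are zero because `uv` masks them.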

kokoro/model.py ADDED Viewed

	@@ -0,0 +1,155 @@

+from .istftnet import Decoder
+from .modules import CustomAlbert, ProsodyPredictor, TextEncoder
+from dataclasses import dataclass
+from huggingface_hub import hf_hub_download
+from loguru import logger
+from transformers import AlbertConfig
+from typing import Dict, Optional, Union
+import json
+import torch
+import os
+class KModel(torch.nn.Module):
+    '''
+    KModel is a torch.nn.Module with 2 main responsibilities:
+    1. Init weights, downloading config.json + model.pth from HF if needed
+    2. forward(phonemes: str, ref_s: FloatTensor) -> (audio: FloatTensor)
+    You likely only need one KModel instance, and it can be reused across
+    multiple KPipelines to avoid redundant memory allocation.
+    Unlike KPipeline, KModel is language-blind.
+    KModel stores self.vocab and thus knows how to map phonemes -> input_ids,
+    so there is no need to repeatedly download config.json outside of KModel.
+    '''
+    MODEL_NAMES = {
+        'hexgrad/Kokoro-82M': 'kokoro-v1_0.pth',
+        'hexgrad/Kokoro-82M-v1.1-zh': 'kokoro-v1_1-zh.pth',
+    }
+    def __init__(
+        self,
+        repo_id: Optional[str] = None,
+        config: Union[Dict, str, None] = None,
+        model: Optional[str] = None,
+        disable_complex: bool = False
+    ):
+        super().__init__()
+        if repo_id is None:
+            repo_id = 'hexgrad/Kokoro-82M'
+            print(f"WARNING: Defaulting repo_id to {repo_id}. Pass repo_id='{repo_id}' to suppress this warning.")
+        self.repo_id = repo_id
+        if not isinstance(config, dict):
+            if not config:
+                logger.debug("No config provided, downloading from HF")
+                config = hf_hub_download(repo_id=repo_id, filename='config.json')
+            with open(config, 'r', encoding='utf-8') as r:
+                config = json.load(r)
+                logger.debug(f"Loaded config: {config}")
+        self.vocab = config['vocab']
+        self.bert = CustomAlbert(AlbertConfig(vocab_size=config['n_token'], **config['plbert']))
+        self.bert_encoder = torch.nn.Linear(self.bert.config.hidden_size, config['hidden_dim'])
+        self.context_length = self.bert.config.max_position_embeddings
+        self.predictor = ProsodyPredictor(
+            style_dim=config['style_dim'], d_hid=config['hidden_dim'],
+            nlayers=config['n_layer'], max_dur=config['max_dur'], dropout=config['dropout']
+        )
+        self.text_encoder = TextEncoder(
+            channels=config['hidden_dim'], kernel_size=config['text_encoder_kernel_size'],
+            depth=config['n_layer'], n_symbols=config['n_token']
+        )
+        self.decoder = Decoder(
+            dim_in=config['hidden_dim'], style_dim=config['style_dim'],
+            dim_out=config['n_mels'], disable_complex=disable_complex, **config['istftnet']
+        )
+        if not model:
+            try:
+                model = hf_hub_download(repo_id=repo_id, filename=KModel.MODEL_NAMES[repo_id])
+            except Exception:
+                model = os.path.join(repo_id, 'kokoro-v1_0.pth')
+        for key, state_dict in torch.load(model, map_location='cpu', weights_only=True).items():
+            assert hasattr(self, key), key
+            try:
+                getattr(self, key).load_state_dict(state_dict)
+            except Exception:
+                logger.debug(f"Did not load {key} from state_dict")
+                state_dict = {k[7:]: v for k, v in state_dict.items()}
+                getattr(self, key).load_state_dict(state_dict, strict=False)
+    @property
+    def device(self):
+        return self.bert.device
+    @dataclass
+    class Output:
+        audio: torch.FloatTensor
+        pred_dur: Optional[torch.LongTensor] = None
+    @torch.no_grad()
+    def forward_with_tokens(
+        self,
+        input_ids: torch.LongTensor,
+        ref_s: torch.FloatTensor,
+        speed: float = 1
+    ) -> tuple[torch.FloatTensor, torch.LongTensor]:
+        input_lengths = torch.full(
+            (input_ids.shape[0],),
+            input_ids.shape[-1],
+            device=input_ids.device,
+            dtype=torch.long
+        )
+        text_mask = torch.arange(input_lengths.max()).unsqueeze(0).expand(input_lengths.shape[0], -1).type_as(input_lengths)
+        text_mask = torch.gt(text_mask+1, input_lengths.unsqueeze(1)).to(self.device)
+        bert_dur = self.bert(input_ids, attention_mask=(~text_mask).int())
+        d_en = self.bert_encoder(bert_dur).transpose(-1, -2)
+        s = ref_s[:, 128:]
+        d = self.predictor.text_encoder(d_en, s, input_lengths, text_mask)
+        x, _ = self.predictor.lstm(d)
+        duration = self.predictor.duration_proj(x)
+        duration = torch.sigmoid(duration).sum(axis=-1) / speed
+        pred_dur = torch.round(duration).clamp(min=1).long().squeeze()
+        indices = torch.repeat_interleave(torch.arange(input_ids.shape[1], device=self.device), pred_dur)
+        pred_aln_trg = torch.zeros((input_ids.shape[1], indices.shape[0]), device=self.device)
+        pred_aln_trg[indices, torch.arange(indices.shape[0])] = 1
+        pred_aln_trg = pred_aln_trg.unsqueeze(0).to(self.device)
+        en = d.transpose(-1, -2) @ pred_aln_trg
+        F0_pred, N_pred = self.predictor.F0Ntrain(en, s)
+        t_en = self.text_encoder(input_ids, input_lengths, text_mask)
+        asr = t_en @ pred_aln_trg
+        audio = self.decoder(asr, F0_pred, N_pred, ref_s[:, :128]).squeeze()
+        return audio, pred_dur
+    def forward(
+        self,
+        phonemes: str,
+        ref_s: torch.FloatTensor,
+        speed: float = 1,
+        return_output: bool = False
+    ) -> Union['KModel.Output', torch.FloatTensor]:
+        input_ids = list(filter(lambda i: i is not None, map(lambda p: self.vocab.get(p), phonemes)))
+        logger.debug(f"phonemes: {phonemes} -> input_ids: {input_ids}")
+        assert len(input_ids)+2 <= self.context_length, (len(input_ids)+2, self.context_length)
+        input_ids = torch.LongTensor([[0, *input_ids, 0]]).to(self.device)
+        ref_s = ref_s.to(self.device)
+        audio, pred_dur = self.forward_with_tokens(input_ids, ref_s, speed)
+        audio = audio.squeeze().cpu()
+        pred_dur = pred_dur.cpu() if pred_dur is not None else None
+        logger.debug(f"pred_dur: {pred_dur}")
+        return self.Output(audio=audio, pred_dur=pred_dur) if return_output else audio
+class KModelForONNX(torch.nn.Module):
+    def __init__(self, kmodel: KModel):
+        super().__init__()
+        self.kmodel = kmodel
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        ref_s: torch.FloatTensor,
+        speed: float = 1
+    ) -> tuple[torch.FloatTensor, torch.LongTensor]:
+        waveform, duration = self.kmodel.forward_with_tokens(input_ids, ref_s, speed)
+        return waveform, duration
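
A minimal sketch of the duration-to-alignment step in `forward_with_tokens`, assuming plain Python lists in place of tensors and no batch dimension. The helpers `build_alignment` and `matmul` are hypothetical names for illustration. Each token index is repeated `pred_dur[token]` times, the repeats are turned into a one-hot (tokens x frames) matrix, and multiplying (channels x tokens) features by that matrix stretches them to frame resolution.

```python
def build_alignment(pred_dur):
    """Return a (tokens x frames) one-hot matrix from integer durations."""
    # Equivalent in spirit to repeat_interleave(arange(tokens), pred_dur)
    indices = [tok for tok, dur in enumerate(pred_dur) for _ in range(dur)]
    aln = [[0.0] * len(indices) for _ in pred_dur]
    for frame, tok in enumerate(indices):
        aln[tok][frame] = 1.0
    return aln


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


pred_dur = [2, 1, 3]                      # frames per token
aln = build_alignment(pred_dur)           # shape (3 tokens, 6 frames)
features = [[10.0, 20.0, 30.0],           # shape (2 channels, 3 tokens)
            [1.0, 2.0, 3.0]]
expanded = matmul(features, aln)          # shape (2 channels, 6 frames)
```

Each column of `expanded` is a copy of the feature column of the token that owns that frame, which is how the model reuses one alignment matrix for both the prosody features and the text-encoder features.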

kokoro/modules.py ADDED Viewed

	@@ -0,0 +1,183 @@

+# https://github.com/yl4579/StyleTTS2/blob/main/models.py
+from .istftnet import AdainResBlk1d
+from torch.nn.utils import weight_norm
+from transformers import AlbertModel
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class LinearNorm(nn.Module):
+    def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
+        super(LinearNorm, self).__init__()
+        self.linear_layer = nn.Linear(in_dim, out_dim, bias=bias)
+        nn.init.xavier_uniform_(self.linear_layer.weight, gain=nn.init.calculate_gain(w_init_gain))
+    def forward(self, x):
+        return self.linear_layer(x)
+class LayerNorm(nn.Module):
+    def __init__(self, channels, eps=1e-5):
+        super().__init__()
+        self.channels = channels
+        self.eps = eps
+        self.gamma = nn.Parameter(torch.ones(channels))
+        self.beta = nn.Parameter(torch.zeros(channels))
+    def forward(self, x):
+        x = x.transpose(1, -1)
+        x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
+        return x.transpose(1, -1)
+class TextEncoder(nn.Module):
+    def __init__(self, channels, kernel_size, depth, n_symbols, actv=nn.LeakyReLU(0.2)):
+        super().__init__()
+        self.embedding = nn.Embedding(n_symbols, channels)
+        padding = (kernel_size - 1) // 2
+        self.cnn = nn.ModuleList()
+        for _ in range(depth):
+            self.cnn.append(nn.Sequential(
+                weight_norm(nn.Conv1d(channels, channels, kernel_size=kernel_size, padding=padding)),
+                LayerNorm(channels),
+                actv,
+                nn.Dropout(0.2),
+            ))
+        self.lstm = nn.LSTM(channels, channels//2, 1, batch_first=True, bidirectional=True)
+    def forward(self, x, input_lengths, m):
+        x = self.embedding(x)  # [B, T, emb]
+        x = x.transpose(1, 2)  # [B, emb, T]
+        m = m.unsqueeze(1)
+        x.masked_fill_(m, 0.0)
+        for c in self.cnn:
+            x = c(x)
+            x.masked_fill_(m, 0.0)
+        x = x.transpose(1, 2)  # [B, T, chn]
+        lengths = input_lengths if input_lengths.device == torch.device('cpu') else input_lengths.to('cpu')
+        x = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
+        self.lstm.flatten_parameters()
+        x, _ = self.lstm(x)
+        x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
+        x = x.transpose(-1, -2)
+        x_pad = torch.zeros([x.shape[0], x.shape[1], m.shape[-1]], device=x.device)
+        x_pad[:, :, :x.shape[-1]] = x
+        x = x_pad
+        x.masked_fill_(m, 0.0)
+        return x
+class AdaLayerNorm(nn.Module):
+    def __init__(self, style_dim, channels, eps=1e-5):
+        super().__init__()
+        self.channels = channels
+        self.eps = eps
+        self.fc = nn.Linear(style_dim, channels*2)
+    def forward(self, x, s):
+        x = x.transpose(-1, -2)
+        x = x.transpose(1, -1)
+        h = self.fc(s)
+        h = h.view(h.size(0), h.size(1), 1)
+        gamma, beta = torch.chunk(h, chunks=2, dim=1)
+        gamma, beta = gamma.transpose(1, -1), beta.transpose(1, -1)
+        x = F.layer_norm(x, (self.channels,), eps=self.eps)
+        x = (1 + gamma) * x + beta
+        return x.transpose(1, -1).transpose(-1, -2)
+class ProsodyPredictor(nn.Module):
+    def __init__(self, style_dim, d_hid, nlayers, max_dur=50, dropout=0.1):
+        super().__init__()
+        self.text_encoder = DurationEncoder(sty_dim=style_dim, d_model=d_hid, nlayers=nlayers, dropout=dropout)
+        self.lstm = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
+        self.duration_proj = LinearNorm(d_hid, max_dur)
+        self.shared = nn.LSTM(d_hid + style_dim, d_hid // 2, 1, batch_first=True, bidirectional=True)
+        self.F0 = nn.ModuleList()
+        self.F0.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
+        self.F0.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
+        self.F0.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
+        self.N = nn.ModuleList()
+        self.N.append(AdainResBlk1d(d_hid, d_hid, style_dim, dropout_p=dropout))
+        self.N.append(AdainResBlk1d(d_hid, d_hid // 2, style_dim, upsample=True, dropout_p=dropout))
+        self.N.append(AdainResBlk1d(d_hid // 2, d_hid // 2, style_dim, dropout_p=dropout))
+        self.F0_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
+        self.N_proj = nn.Conv1d(d_hid // 2, 1, 1, 1, 0)
+    def forward(self, texts, style, text_lengths, alignment, m):
+        d = self.text_encoder(texts, style, text_lengths, m)
+        m = m.unsqueeze(1)
+        lengths = text_lengths if text_lengths.device == torch.device('cpu') else text_lengths.to('cpu')
+        x = nn.utils.rnn.pack_padded_sequence(d, lengths, batch_first=True, enforce_sorted=False)
+        self.lstm.flatten_parameters()
+        x, _ = self.lstm(x)
+        x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True)
+        x_pad = torch.zeros([x.shape[0], m.shape[-1], x.shape[-1]], device=x.device)
+        x_pad[:, :x.shape[1], :] = x
+        x = x_pad
+        duration = self.duration_proj(nn.functional.dropout(x, 0.5, training=False))
+        en = (d.transpose(-1, -2) @ alignment)
+        return duration.squeeze(-1), en
+    def F0Ntrain(self, x, s):
+        x, _ = self.shared(x.transpose(-1, -2))
+        F0 = x.transpose(-1, -2)
+        for block in self.F0:
+            F0 = block(F0, s)
+        F0 = self.F0_proj(F0)
+        N = x.transpose(-1, -2)
+        for block in self.N:
+            N = block(N, s)
+        N = self.N_proj(N)
+        return F0.squeeze(1), N.squeeze(1)
+class DurationEncoder(nn.Module):
+    def __init__(self, sty_dim, d_model, nlayers, dropout=0.1):
+        super().__init__()
+        self.lstms = nn.ModuleList()
+        for _ in range(nlayers):
+            self.lstms.append(nn.LSTM(d_model + sty_dim, d_model // 2, num_layers=1, batch_first=True, bidirectional=True, dropout=dropout))
+            self.lstms.append(AdaLayerNorm(sty_dim, d_model))
+        self.dropout = dropout
+        self.d_model = d_model
+        self.sty_dim = sty_dim
+    def forward(self, x, style, text_lengths, m):
+        masks = m
+        x = x.permute(2, 0, 1)
+        s = style.expand(x.shape[0], x.shape[1], -1)
+        x = torch.cat([x, s], axis=-1)
+        x.masked_fill_(masks.unsqueeze(-1).transpose(0, 1), 0.0)
+        x = x.transpose(0, 1)
+        x = x.transpose(-1, -2)
+        for block in self.lstms:
+            if isinstance(block, AdaLayerNorm):
+                x = block(x.transpose(-1, -2), style).transpose(-1, -2)
+                x = torch.cat([x, s.permute(1, 2, 0)], axis=1)
+                x.masked_fill_(masks.unsqueeze(-1).transpose(-1, -2), 0.0)
+            else:
+                lengths = text_lengths if text_lengths.device == torch.device('cpu') else text_lengths.to('cpu')
+                x = x.transpose(-1, -2)
+                x = nn.utils.rnn.pack_padded_sequence(
+                    x, lengths, batch_first=True, enforce_sorted=False)
+                block.flatten_parameters()
+                x, _ = block(x)
+                x, _ = nn.utils.rnn.pad_packed_sequence(
+                    x, batch_first=True)
+                x = F.dropout(x, p=self.dropout, training=False)
+                x = x.transpose(-1, -2)
+                x_pad = torch.zeros([x.shape[0], x.shape[1], m.shape[-1]], device=x.device)
+                x_pad[:, :, :x.shape[-1]] = x
+                x = x_pad
+        return x.transpose(-1, -2)
+# https://github.com/yl4579/StyleTTS2/blob/main/Utils/PLBERT/util.py
+class CustomAlbert(AlbertModel):
+    def forward(self, *args, **kwargs):
+        outputs = super().forward(*args, **kwargs)
+        return outputs.last_hidden_state
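
Review note on `AdaLayerNorm` above: the module normalizes each feature vector without learned affine parameters, then applies a scale and shift derived from the style vector, `(1 + gamma) * x + beta`. Below is a minimal pure-Python sketch of that arithmetic for a single feature vector. Here `gamma` and `beta` are passed in directly (in the module they come from `self.fc(s)`), and `ada_layer_norm` is a hypothetical name used only for illustration.

```python
import math
from typing import List, Sequence


def ada_layer_norm(
    x: Sequence[float],
    gamma: Sequence[float],
    beta: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    # Plain layer norm over the feature axis (biased variance, no affine terms).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    # Style-conditioned scale and shift: gamma = 0, beta = 0 leaves the
    # normalized vector unchanged.
    return [(1 + g) * n + b for g, n, b in zip(gamma, normed, beta)]


plain = ada_layer_norm([1.0, 2.0, 3.0], gamma=[0.0] * 3, beta=[0.0] * 3)
styled = ada_layer_norm([1.0, 2.0, 3.0], gamma=[1.0] * 3, beta=[0.5] * 3)
print(plain)
print(styled)
```

Writing the scale as `1 + gamma` means a zero output from the style projection reduces to ordinary layer normalization.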

kokoro/pipeline.py ADDED

	@@ -0,0 +1,445 @@

+from .model import KModel
+from dataclasses import dataclass
+from huggingface_hub import hf_hub_download
+from loguru import logger
+from misaki import en, espeak
+from typing import Callable, Generator, List, Optional, Tuple, Union
+import re
+import torch
+import os
+ALIASES = {
+    'en-us': 'a',
+    'en-gb': 'b',
+    'es': 'e',
+    'fr-fr': 'f',
+    'hi': 'h',
+    'it': 'i',
+    'pt-br': 'p',
+    'ja': 'j',
+    'zh': 'z',
+}
+LANG_CODES = dict(
+    # pip install misaki[en]
+    a='American English',
+    b='British English',
+    # espeak-ng
+    e='es',
+    f='fr-fr',
+    h='hi',
+    i='it',
+    p='pt-br',
+    # pip install misaki[ja]
+    j='Japanese',
+    # pip install misaki[zh]
+    z='Mandarin Chinese',
+)
+class KPipeline:
+    '''
+    KPipeline is a language-aware support class with 2 main responsibilities:
+    1. Perform language-specific G2P, mapping (and chunking) text -> phonemes
+    2. Manage and store voices, lazily downloaded from HF if needed
+    You are expected to have one KPipeline per language. If you have multiple
+    KPipelines, you should reuse one KModel instance across all of them.
+    KPipeline is designed to work with a KModel, but this is not required.
+    There are 2 ways to pass an existing model into a pipeline:
+    1. On init: us_pipeline = KPipeline(lang_code='a', model=model)
+    2. On call: us_pipeline(text, voice, model=model)
+    By default, KPipeline will automatically initialize its own KModel. To
+    suppress this, construct a "quiet" KPipeline with model=False.
+    A "quiet" KPipeline yields (graphemes, phonemes, None) without generating
+    any audio. You can use this to phonemize and chunk your text in advance.
+    A "loud" KPipeline _with_ a model yields (graphemes, phonemes, audio).
+    '''
+    def __init__(
+        self,
+        lang_code: str,
+        repo_id: Optional[str] = None,
+        model: Union[KModel, bool] = True,
+        trf: bool = False,
+        en_callable: Optional[Callable[[str], str]] = None,
+        device: Optional[str] = None
+    ):
+        """Initialize a KPipeline.
+        Args:
+            lang_code: Language code for G2P processing
+            model: KModel instance, True to create new model, False for no model
+            trf: Whether to use transformer-based G2P
+            device: Override default device selection ('cuda', 'mps' or 'cpu', or None for auto)
+                   If None, will auto-select cuda if available, then mps (only when PYTORCH_ENABLE_MPS_FALLBACK=1), else cpu
+                   If 'cuda' or 'mps' is requested and not available, will explicitly raise an error
+        """
+        if repo_id is None:
+            repo_id = 'hexgrad/Kokoro-82M'
+            print(f"WARNING: Defaulting repo_id to {repo_id}. Pass repo_id='{repo_id}' to suppress this warning.")
+            config = None
+        else:
+            config = os.path.join(repo_id, 'config.json')
+        self.repo_id = repo_id
+        lang_code = lang_code.lower()
+        lang_code = ALIASES.get(lang_code, lang_code)
+        assert lang_code in LANG_CODES, (lang_code, LANG_CODES)
+        self.lang_code = lang_code
+        self.model = None
+        if isinstance(model, KModel):
+            self.model = model
+        elif model:
+            if device == 'cuda' and not torch.cuda.is_available():
+                raise RuntimeError("CUDA requested but not available")
+            if device == 'mps' and not torch.backends.mps.is_available():
+                raise RuntimeError("MPS requested but not available")
+            if device == 'mps' and os.environ.get('PYTORCH_ENABLE_MPS_FALLBACK') != '1':
+                raise RuntimeError("MPS requested but fallback not enabled")
+            if device is None:
+                if torch.cuda.is_available():
+                    device = 'cuda'
+                elif os.environ.get('PYTORCH_ENABLE_MPS_FALLBACK') == '1' and torch.backends.mps.is_available():
+                    device = 'mps'
+                else:
+                    device = 'cpu'
+            try:
+                self.model = KModel(repo_id=repo_id, config=config).to(device).eval()
+            except RuntimeError as e:
+                if device == 'cuda':
+                    raise RuntimeError(f"Failed to initialize model on CUDA: {e}. "
+                                       "Try setting device='cpu' or check CUDA installation.") from e
+                raise
+        self.voices = {}
+        if lang_code in 'ab':
+            try:
+                fallback = espeak.EspeakFallback(british=lang_code=='b')
+            except Exception as e:
+                logger.warning("EspeakFallback not Enabled: OOD words will be skipped")
+                logger.warning(str(e))
+                fallback = None
+            self.g2p = en.G2P(trf=trf, british=lang_code=='b', fallback=fallback, unk='')
+        elif lang_code == 'j':
+            try:
+                from misaki import ja
+                self.g2p = ja.JAG2P()
+            except ImportError:
+                logger.error("You need to `pip install misaki[ja]` to use lang_code='j'")
+                raise
+        elif lang_code == 'z':
+            try:
+                from misaki import zh
+                self.g2p = zh.ZHG2P(
+                    version=None if repo_id.endswith('/Kokoro-82M') else '1.1',
+                    en_callable=en_callable
+                )
+            except ImportError:
+                logger.error("You need to `pip install misaki[zh]` to use lang_code='z'")
+                raise
+        else:
+            language = LANG_CODES[lang_code]
+            logger.warning(f"Using EspeakG2P(language='{language}'). Text is chunked at sentence boundaries (roughly 400 characters); a single longer sentence may still be truncated to 510 phonemes.")
+            self.g2p = espeak.EspeakG2P(language=language)
+    def load_single_voice(self, voice: str):
+        if voice in self.voices:
+            return self.voices[voice]
+        if voice.endswith('.pt'):
+            f = voice
+        else:
+            f = hf_hub_download(repo_id=self.repo_id, filename=f'voices/{voice}.pt')
+            if not voice.startswith(self.lang_code):
+                v = LANG_CODES.get(voice, voice)
+                p = LANG_CODES.get(self.lang_code, self.lang_code)
+                logger.warning(f'Language mismatch, loading {v} voice into {p} pipeline.')
+        pack = torch.load(f, weights_only=True)
+        self.voices[voice] = pack
+        return pack
+    def load_voice(self, voice: Union[str, torch.FloatTensor], delimiter: str = ",") -> torch.FloatTensor:
+        """Lazily download and load a voice.
+        A single voice can be requested (e.g. 'af_bella') or multiple voices (e.g. 'af_bella,af_jessica').
+        If multiple voices are requested, they are averaged.
+        The delimiter is optional and defaults to ','.
+        """
+        if isinstance(voice, torch.FloatTensor):
+            return voice
+        if voice in self.voices:
+            return self.voices[voice]
+        logger.debug(f"Loading voice: {voice}")
+        packs = [self.load_single_voice(v) for v in voice.split(delimiter)]
+        if len(packs) == 1:
+            return packs[0]
+        self.voices[voice] = torch.mean(torch.stack(packs), dim=0)
+        return self.voices[voice]
+    @staticmethod
+    def tokens_to_ps(tokens: List[en.MToken]) -> str:
+        return ''.join(t.phonemes + (' ' if t.whitespace else '') for t in tokens).strip()
+    @staticmethod
+    def waterfall_last(
+        tokens: List[en.MToken],
+        next_count: int,
+        waterfall: List[str] = ['!.?…', ':;', ',—'],
+        bumps: List[str] = [')', '”']
+    ) -> int:
+        for w in waterfall:
+            z = next((i for i, t in reversed(list(enumerate(tokens))) if t.phonemes in set(w)), None)
+            if z is None:
+                continue
+            z += 1
+            if z < len(tokens) and tokens[z].phonemes in bumps:
+                z += 1
+            if next_count - len(KPipeline.tokens_to_ps(tokens[:z])) <= 510:
+                return z
+        return len(tokens)
+    @staticmethod
+    def tokens_to_text(tokens: List[en.MToken]) -> str:
+        return ''.join(t.text + t.whitespace for t in tokens).strip()
+    def en_tokenize(
+        self,
+        tokens: List[en.MToken]
+    ) -> Generator[Tuple[str, str, List[en.MToken]], None, None]:
+        tks = []
+        pcount = 0
+        for t in tokens:
+            # American English: ɾ => T
+            t.phonemes = '' if t.phonemes is None else t.phonemes#.replace('ɾ', 'T')
+            next_ps = t.phonemes + (' ' if t.whitespace else '')
+            next_pcount = pcount + len(next_ps.rstrip())
+            if next_pcount > 510:
+                z = KPipeline.waterfall_last(tks, next_pcount)
+                text = KPipeline.tokens_to_text(tks[:z])
+                logger.debug(f"Chunking text at {z}: '{text[:30]}{'...' if len(text) > 30 else ''}'")
+                ps = KPipeline.tokens_to_ps(tks[:z])
+                yield text, ps, tks[:z]
+                tks = tks[z:]
+                pcount = len(KPipeline.tokens_to_ps(tks))
+                if not tks:
+                    next_ps = next_ps.lstrip()
+            tks.append(t)
+            pcount += len(next_ps)
+        if tks:
+            text = KPipeline.tokens_to_text(tks)
+            ps = KPipeline.tokens_to_ps(tks)
+            yield text, ps, tks
+    @staticmethod
+    def infer(
+        model: KModel,
+        ps: str,
+        pack: torch.FloatTensor,
+        speed: Union[float, Callable[[int], float]] = 1
+    ) -> KModel.Output:
+        if callable(speed):
+            speed = speed(len(ps))
+        return model(ps, pack[len(ps)-1], speed, return_output=True)
+    def generate_from_tokens(
+        self,
+        tokens: Union[str, List[en.MToken]],
+        voice: str,
+        speed: float = 1,
+        model: Optional[KModel] = None
+    ) -> Generator['KPipeline.Result', None, None]:
+        """Generate audio from either raw phonemes or pre-processed tokens.
+        Args:
+            tokens: Either a phoneme string or list of pre-processed MTokens
+            voice: The voice to use for synthesis
+            speed: Speech speed modifier (default: 1)
+            model: Optional KModel instance (uses pipeline's model if not provided)
+        Yields:
+            KPipeline.Result containing the input tokens and generated audio
+        Raises:
+            ValueError: If no voice is provided or token sequence exceeds model limits
+        """
+        model = model or self.model
+        if model and voice is None:
+            raise ValueError('Specify a voice: pipeline.generate_from_tokens(..., voice="af_heart")')
+        pack = self.load_voice(voice).to(model.device) if model else None
+        # Handle raw phoneme string
+        if isinstance(tokens, str):
+            logger.debug("Processing phonemes from raw string")
+            if len(tokens) > 510:
+                raise ValueError(f'Phoneme string too long: {len(tokens)} > 510')
+            output = KPipeline.infer(model, tokens, pack, speed) if model else None
+            yield self.Result(graphemes='', phonemes=tokens, output=output)
+            return
+        logger.debug("Processing MTokens")
+        # Handle pre-processed tokens
+        for gs, ps, tks in self.en_tokenize(tokens):
+            if not ps:
+                continue
+            elif len(ps) > 510:
+                logger.warning(f"Unexpected len(ps) == {len(ps)} > 510 and ps == '{ps}'")
+                logger.warning("Truncating to 510 characters")
+                ps = ps[:510]
+            output = KPipeline.infer(model, ps, pack, speed) if model else None
+            if output is not None and output.pred_dur is not None:
+                KPipeline.join_timestamps(tks, output.pred_dur)
+            yield self.Result(graphemes=gs, phonemes=ps, tokens=tks, output=output)
+    @staticmethod
+    def join_timestamps(tokens: List[en.MToken], pred_dur: torch.LongTensor):
+        # Multiply by 600 to go from pred_dur frames to sample_rate 24000
+        # Equivalent to dividing pred_dur frames by 40 to get timestamp in seconds
+        # We will count nice round half-frames, so the divisor is 80
+        MAGIC_DIVISOR = 80
+        if not tokens or len(pred_dur) < 3:
+            # We expect at least 3: <bos>, token, <eos>
+            return
+        # We track 2 counts, measured in half-frames: (left, right)
+        # This way we can cut space characters in half
+        # TODO: Is -3 an appropriate offset?
+        left = right = 2 * max(0, pred_dur[0].item() - 3)
+        # Updates:
+        # left = right + (2 * token_dur) + space_dur
+        # right = left + space_dur
+        i = 1
+        for t in tokens:
+            if i >= len(pred_dur)-1:
+                break
+            if not t.phonemes:
+                if t.whitespace:
+                    i += 1
+                    left = right + pred_dur[i].item()
+                    right = left + pred_dur[i].item()
+                    i += 1
+                continue
+            j = i + len(t.phonemes)
+            if j >= len(pred_dur):
+                break
+            t.start_ts = left / MAGIC_DIVISOR
+            token_dur = pred_dur[i: j].sum().item()
+            space_dur = pred_dur[j].item() if t.whitespace else 0
+            left = right + (2 * token_dur) + space_dur
+            t.end_ts = left / MAGIC_DIVISOR
+            right = left + space_dur
+            i = j + (1 if t.whitespace else 0)
+    @dataclass
+    class Result:
+        graphemes: str
+        phonemes: str
+        tokens: Optional[List[en.MToken]] = None
+        output: Optional[KModel.Output] = None
+        text_index: Optional[int] = None
+        @property
+        def audio(self) -> Optional[torch.FloatTensor]:
+            return None if self.output is None else self.output.audio
+        @property
+        def pred_dur(self) -> Optional[torch.LongTensor]:
+            return None if self.output is None else self.output.pred_dur
+        ### MARK: BEGIN BACKWARD COMPAT ###
+        def __iter__(self):
+            yield self.graphemes
+            yield self.phonemes
+            yield self.audio
+        def __getitem__(self, index):
+            return [self.graphemes, self.phonemes, self.audio][index]
+        def __len__(self):
+            return 3
+        #### MARK: END BACKWARD COMPAT ####
+    def __call__(
+        self,
+        text: Union[str, List[str]],
+        voice: Optional[str] = None,
+        speed: Union[float, Callable[[int], float]] = 1,
+        split_pattern: Optional[str] = r'\n+',
+        model: Optional[KModel] = None
+    ) -> Generator['KPipeline.Result', None, None]:
+        model = model or self.model
+        if model and voice is None:
+            raise ValueError('Specify a voice: en_us_pipeline(text="Hello world!", voice="af_heart")')
+        pack = self.load_voice(voice).to(model.device) if model else None
+        # Convert input to list of segments
+        if isinstance(text, str):
+            text = re.split(split_pattern, text.strip()) if split_pattern else [text]
+        # Process each segment
+        for graphemes_index, graphemes in enumerate(text):
+            if not graphemes.strip():  # Skip empty segments
+                continue
+            # English processing
+            if self.lang_code in 'ab':
+                logger.debug(f"Processing English text: {graphemes[:50]}{'...' if len(graphemes) > 50 else ''}")
+                _, tokens = self.g2p(graphemes)
+                for gs, ps, tks in self.en_tokenize(tokens):
+                    if not ps:
+                        continue
+                    elif len(ps) > 510:
+                        logger.warning(f"Unexpected len(ps) == {len(ps)} > 510 and ps == '{ps}'")
+                        ps = ps[:510]
+                    output = KPipeline.infer(model, ps, pack, speed) if model else None
+                    if output is not None and output.pred_dur is not None:
+                        KPipeline.join_timestamps(tks, output.pred_dur)
+                    yield self.Result(graphemes=gs, phonemes=ps, tokens=tks, output=output, text_index=graphemes_index)
+            # Non-English processing with chunking
+            else:
+                # Split long text into smaller chunks (roughly 400 characters each)
+                # Using sentence boundaries when possible
+                chunk_size = 400
+                chunks = []
+                # Try to split on sentence boundaries first
+                sentences = re.split(r'([.!?]+)', graphemes)
+                current_chunk = ""
+                for i in range(0, len(sentences), 2):
+                    sentence = sentences[i]
+                    # Add the punctuation back if it exists
+                    if i + 1 < len(sentences):
+                        sentence += sentences[i + 1]
+                    if len(current_chunk) + len(sentence) <= chunk_size:
+                        current_chunk += sentence
+                    else:
+                        if current_chunk:
+                            chunks.append(current_chunk.strip())
+                        current_chunk = sentence
+                if current_chunk:
+                    chunks.append(current_chunk.strip())
+                # If no chunks were created (no sentence boundaries), fall back to character-based chunking
+                if not chunks:
+                    chunks = [graphemes[i:i+chunk_size] for i in range(0, len(graphemes), chunk_size)]
+                # Process each chunk
+                for chunk in chunks:
+                    if not chunk.strip():
+                        continue
+                    ps, _ = self.g2p(chunk)
+                    if not ps:
+                        continue
+                    elif len(ps) > 510:
+                        logger.warning(f'Truncating len(ps) == {len(ps)} > 510')
+                        ps = ps[:510]
+                    output = KPipeline.infer(model, ps, pack, speed) if model else None
+                    yield self.Result(graphemes=chunk, phonemes=ps, output=output, text_index=graphemes_index)
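
Review note on the non-English branch of `KPipeline.__call__` above: the chunking loop greedily packs sentences into chunks of at most `chunk_size` characters. The standalone sketch below reproduces that loop so its edge cases can be checked without a model. `chunk_by_sentence` is a hypothetical name, and the function is an extraction for illustration, not part of the commit.

```python
import re
from typing import List


def chunk_by_sentence(text: str, chunk_size: int = 400) -> List[str]:
    # Splitting with a capture group keeps the punctuation runs at odd indices.
    parts = re.split(r'([.!?]+)', text)
    chunks: List[str] = []
    current = ""
    for i in range(0, len(parts), 2):
        sentence = parts[i]
        if i + 1 < len(parts):
            sentence += parts[i + 1]  # re-attach the trailing punctuation
        if len(current) + len(sentence) <= chunk_size:
            current += sentence
        else:
            if current:
                chunks.append(current.strip())
            current = sentence
    if current:
        chunks.append(current.strip())
    # Character-based fallback, kept as in the diff.
    if not chunks:
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks


print(chunk_by_sentence("A. B. C.", chunk_size=5))
print(len(chunk_by_sentence("x" * 1000)))
```

In this sketch an unpunctuated 1000-character string comes back as a single chunk, so the character-based fallback does not trigger for it. Such a chunk would then rely on the later truncation to 510 phonemes.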

requirements.txt ADDED

	@@ -0,0 +1,21 @@

+opencv-python>=4.9.0.80
+diffusers>=0.31.0
+transformers>=4.49.0
+tokenizers>=0.20.3
+accelerate>=1.1.1
+tqdm
+imageio
+easydict
+ftfy
+dashscope
+imageio-ffmpeg
+scikit-image
+loguru
+gradio>=5.0.0
+numpy>=1.23.5,<2
+xfuser>=0.4.1
+pyloudnorm
+optimum-quanto==0.2.6
+scenedetect
+moviepy==1.0.3
+decord

setup.sh ADDED

	@@ -0,0 +1,28 @@

+#!/bin/bash
+set -e
+echo "=== Setting up InfiniteTalk ==="
+echo "Repository structure:"
+ls -la InfiniteTalk/
+# Add to Python path
+export PYTHONPATH="/home/user/app/InfiniteTalk:$PYTHONPATH"
+# Install requirements using pip3
+echo "Installing Python requirements..."
+pip3 install -r requirements.txt
+# Test imports
+echo "Testing imports..."
+python3 -c "
+import sys
+sys.path.append('./InfiniteTalk')
+try:
+    from wan.configs import SIZE_CONFIGS, SUPPORTED_SIZES, WAN_CONFIGS
+    print('✓ SUCCESS: All wan imports work!')
+except ImportError as e:
+    print(f'✗ FAILED: {e}')
+    sys.exit(1)  # non-zero exit so that set -e stops the script
+"
+echo "=== Setup completed ==="
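
Review note on the import test above: `kokoro/pipeline.py` imports `torch`, `misaki` and `huggingface_hub`, none of which are listed explicitly in `requirements.txt`. They may be preinstalled or arrive as transitive dependencies. A broader pre-flight check could look like the sketch below. `missing_modules` is a hypothetical helper, and the names passed to it are only examples.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be located on the current path."""
    missing = []
    for name in names:
        try:
            # For a top-level name, find_spec locates the module without importing it.
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print(missing_modules(["json", "surely_not_installed_xyz"]))
```

A setup script could call this with the modules the app needs and exit non-zero when the returned list is not empty.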