No It Wasn't A Waste Entirely

Published: 1 month ago (April 3, 2026 at 07:21 AM EDT)

5 min read

Source: Dev.to

Introduction

In this article I present how I successfully uploaded the 24 GB PolyGlotFake multimodal deep‑fake dataset to Kaggle for easier, non‑interactive experimentation.

The original GitHub repository of the PolyGlotFake dataset is:

https://github.com/PolyGlotFake/PolyGlotFake

PolyGlotFake Dataset Overview

PolyGlotFake is a multilingual, multimodal deep‑fake dataset designed to address the challenges of deep‑fake detection. It contains videos with manipulated audio and visual components across seven languages, using advanced Text‑to‑Speech, voice‑cloning, and lip‑sync technologies.

Download link (Google Drive):
https://drive.google.com/file/d/1aBWLii-TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link

Quantitative Comparison

Dataset	Release	Manipulated Modality	Multilingual	Real videos	Fake videos	Total videos	Manipulation Methods	Techniques Labeling	Attribute Labeling
UADFV	2018	V	No	49	49	98	1	No	No
TIMI	2018	V	No	320	640	960	2	No	No
FF++	2019	V	No	1 000	4 000	5 000	4	No	No
DFD	2019	V	No	363	3 068	3 431	5	No	No
DFDC	2020	A/V	No	23 654	104 500	128 154	8	No	No
DeeperForensics	2020	V	No	50 000	10 000	60 000	1	No	No
Celeb‑DF	2020	V	No	590	5 639	6 229	1	No	No
FFIW	2020	V	No	10 000	10 000	20 000	1	—	—
…	…	…	…	…	…	…	…	…	…

View the full table on GitHub.

The README in the repository contains the same Drive link for downloading the dataset.

Lessons Learned from a Failed Attempt

When I first tried to upload the “wild” deep‑fake dataset to Kaggle, I extracted the tar files into nested image folders and attempted to upload them via:

Google Drive → GCS bucket → Kaggle

That approach failed because:

My local machine lacked enough storage to keep the four 4‑TB tar files (train‑real, train‑fake, test‑real, test‑fake).
Uploading thousands of individual files is slow and error‑prone.
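The second point can be addressed before any transfer starts by packing the extracted folders back into one archive. The following is a minimal sketch using Python's standard `tarfile` module; it is not part of the original pipeline, and the function name and folder layout are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir: str, archive_path: str) -> int:
    """Pack src_dir into one uncompressed tar archive; return the number of files added."""
    src = Path(src_dir)
    count = 0
    # Mode "w" writes a plain tar. Images and videos are already compressed,
    # so gzip would mostly add time without saving much space.
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent of src_dir so the archive
                # unpacks into a single top-level folder.
                tar.add(path, arcname=str(path.relative_to(src.parent)))
                count += 1
    return count
```

The archive should be written outside `src_dir`; otherwise the walk would pick up the archive itself while it is being written.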

Successful Workflow for PolyGlotFake

1. Store the RAR archive in Google Drive

The dataset is provided as a single RAR archive, which is ideal for large‑file transfers.

2. Set up a Colab notebook

# Authenticate Google Cloud
from google.colab import auth
auth.authenticate_user()

project_id = 'polyglotfake'
!gcloud config set project {project_id}
!gsutil ls

3. Download the 24 GB RAR from Drive to Colab

# Recent gdown releases accept the file ID directly; the --id flag is deprecated
!gdown 1cUlwVi8Wu6MmDu8Mh2lXTIPJFz63KOtd
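A download of this size can be interrupted without an obvious error, so it may be worth checking the file before spending hours copying it. The original post does not describe a verification step; the helper below is a hypothetical sketch that reports the file size and a SHA-256 digest, reading in chunks so the 24 GB archive never has to fit in memory.

```python
import hashlib
import os


def summarize_download(path: str, chunk_size: int = 8 * 1024 * 1024) -> tuple[int, str]:
    """Return (size_in_bytes, sha256_hex) for the file at path, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

A size far below 24 GB suggests a truncated download, and the digest can be compared again after the file lands in the bucket.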

4. Mount a GCS bucket with `gcsfuse`

# Install gcsfuse (shell commands: in Colab, run them inside a %%bash cell)
echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" \
    > /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
apt -qq update
apt -qq install gcsfuse

# Create a mount point
mkdir my_gcs_mount
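The listing above installs `gcsfuse` and creates the mount point, but the mount call itself is not shown. A minimal sketch of that call from Python follows, assuming the usual `gcsfuse [flags] BUCKET MOUNT_POINT` form. The helper names are hypothetical, and using `pgfake` (the bucket name from step 6) as the mounted bucket is an assumption.

```python
import shlex
import subprocess


def gcsfuse_mount_command(bucket: str, mount_point: str) -> list[str]:
    """Build the argument list for mounting bucket at mount_point."""
    # --implicit-dirs makes objects stored under "folder/" prefixes show up as directories.
    return ["gcsfuse", "--implicit-dirs", bucket, mount_point]


def mount_bucket(bucket: str, mount_point: str) -> None:
    """Run gcsfuse; raises CalledProcessError if the mount fails."""
    subprocess.run(gcsfuse_mount_command(bucket, mount_point), check=True)


# "pgfake" appears in step 6; whether it is the bucket mounted here is an assumption.
print(shlex.join(gcsfuse_mount_command("pgfake", "my_gcs_mount")))
```

Building the command separately from running it makes it easy to print and inspect before mounting anything.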

5. Copy the RAR to the bucket

%cp /content/goblin/PolyGlotFake.rar /content/my_gcs_mount/polyglotfake/

This step took more than 3 hours.
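To put that duration in perspective, the average transfer rate can be worked out from the figures in this article. The sketch below assumes decimal units (1 GB = 1000 MB) and treats the reported times as round numbers, so the results are rough estimates rather than measurements.

```python
def average_throughput_mb_s(size_gb: float, hours: float) -> float:
    """Average transfer rate in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)


# Step 5: 24 GB in a little over 3 hours, so this is an upper bound on the rate.
print(round(average_throughput_mb_s(24, 3), 2))  # 2.22
# Step 7: about 2 hours for the Kaggle import.
print(round(average_throughput_mb_s(24, 2), 2))  # 3.33
```

A couple of megabytes per second is slow for a cloud-to-cloud copy, which is consistent with writing through a FUSE mount rather than a direct bucket upload.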

6. Make the bucket public

gcloud storage buckets add-iam-policy-binding gs://pgfake \
    --member=allUsers \
    --role=roles/storage.objectViewer

The public URL can be copied from the Google Cloud Console.
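Publicly readable objects are also reachable at `https://storage.googleapis.com/BUCKET/OBJECT`, so the URL can be constructed instead of copied. In the sketch below the helper name is hypothetical, and the object path is an assumption pieced together from the destination in step 5 and the bucket name in step 6.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, object_name: str) -> str:
    """Return the HTTPS URL for a publicly readable GCS object."""
    # Slashes in the object name are kept; other reserved characters are percent-encoded.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


# Assumed object path; adjust to wherever the archive actually sits in the bucket.
print(public_gcs_url("pgfake", "polyglotfake/PolyGlotFake.rar"))
```

Opening the resulting URL in a private browser window is a quick way to confirm that the IAM binding took effect.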

7. Upload to Kaggle

Go to https://www.kaggle.com/datasets/?new=true.
Choose “Link” as the source.
Paste the public GCS URL.

The upload took about 2 hours.

Result: The dataset is now publicly available at
https://www.kaggle.com/datasets/debajyatidey/polyglotfake.
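Once the dataset is attached to a Kaggle notebook, a quick inventory shows what actually arrived. The sketch below is hypothetical and not from the original post: it counts files by extension under a root folder. The path in the usage comment assumes the dataset is mounted under `/kaggle/input/` using its slug.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root: str) -> Counter:
    """Count files under root, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "<none>".
            counts[path.suffix.lower() or "<none>"] += 1
    return counts


# In a Kaggle notebook with the dataset attached (path is an assumption):
# print(count_by_extension("/kaggle/input/polyglotfake").most_common())
```

If the listing shows a single `.rar` file rather than videos, the archive still has to be extracted inside the notebook before use.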

Supporting Files

Real‑video metadata (CSV) – available in the Kaggle dataset.
Fake‑video metadata – too large to render here.

Visualizations (Google Looker Studio)

Chart	Description
![Age distribution by language]	Distribution of age by language (subjects speaking in each language).
![Age by gender]	Distribution of age of subjects by gender.
![Sex ratio]	Sex ratio across all real videos.
![Deep‑fake distribution]	Various charts showing how deep‑fake videos are organized and distributed.

All visualizations were created with Google Looker Studio.

Closing Thoughts

After many failed attempts, broken pipelines, and storage limitations, this approach finally succeeded. The key was not to explode the dataset into thousands of files, but to keep it as a single archive and let the cloud storage services handle the transfer.

Lesson: Data engineering is not a side‑quest in machine learning; it’s the game itself.
Datasets like PolyGlotFake are intentionally complex (multilingual, multimodal) because they reflect real‑world deepfake challenges. Treating them with simple, robust pipelines pays off.

Making them accessible is not just a convenience: it directly affects how fast someone can experiment, iterate, and actually do research.

And that’s really the point.

If one person can now spin up a Kaggle notebook, plug in the dataset, and start experimenting in minutes instead of wasting days setting things up — then this entire ordeal was worth it.

Would I do it again?
At least now I know this much: I was just doing it the hard way.

So, yes, … that’s a wrap!

Feel free to connect with me. :)

Thanks for reading! 🙏🏻
Written with 💚 by Debajyati Dey

Follow me

Debajyati Dey – Web Developer, Freelance Technical Writer, Casual Deep Learning Enthusiast, always eager to work with new technologies and document them.

📧 Email me for collaboration.

Happy coding 🧑🏽‍💻👩🏽‍💻! Have a nice day ahead! 🚀
