No, It Wasn't a Waste Entirely

Published: (April 3, 2026 at 07:21 AM EDT)
Source: Dev.to

Introduction

In this article I present how I successfully uploaded the 24 GB PolyGlotFake multimodal deep‑fake dataset to Kaggle for easier, non‑interactive experimentation.

The original GitHub repository of the PolyGlotFake dataset is:

https://github.com/PolyGlotFake/PolyGlotFake

PolyGlotFake Dataset Overview

PolyGlotFake is a multilingual, multimodal deep‑fake dataset designed to address the challenges of deep‑fake detection. It contains videos with manipulated audio and visual components across seven languages, using advanced Text‑to‑Speech, voice‑cloning, and lip‑sync technologies.

Download link (Google Drive):
https://drive.google.com/file/d/1aBWLii‑TbrpKNLSTwpmjqu98eKovWLxF/view?usp=drive_link

Quantitative Comparison

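| Dataset | Release | Manipulated Modality | Multilingual | Real videos | Fake videos | Total videos | Manipulation Methods | Techniques Labeling | Attribute Labeling |
|---|---|---|---|---|---|---|---|---|---|
| UADFV | 2018 | V | No | 49 | 49 | 98 | 1 | No | No |
| TIMI | 2018 | V | No | 320 | 640 | 960 | 2 | No | No |
| FF++ | 2019 | V | No | 1 000 | 4 000 | 5 000 | 4 | No | No |
| DFD | 2019 | V | No | 360 | 3 068 | 3 431 | 5 | No | No |
| DFDC | 2020 | A/V | No | 23 654 | 104 500 | 128 154 | 8 | No | No |
| DeeperForensics | 2020 | V | No | 50 000 | 10 000 | 60 000 | 1 | No | No |
| Celeb‑DF | 2020 | V | No | 590 | 5 639 | 6 229 | 1 | No | No |
| FFIW | 2020 | V | No | 10 000 | 10 000 | 20 000 | 1 | | |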

View the full table on GitHub.

The README in the repository contains the same Drive link for downloading the dataset.

Lessons Learned from a Failed Attempt

When I first tried to upload the “wild” deep‑fake dataset to Kaggle, I extracted the tar files into nested image folders and attempted to upload them via:

Google Drive → GCS bucket → Kaggle

That approach failed because:

  • My local machine lacked enough storage to keep the four 4‑TB tar files (train‑real, train‑fake, test‑real, test‑fake).
  • Uploading thousands of individual files is slow and error‑prone.
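
The second point is what the rest of this post works around: one big archive moves far more reliably than thousands of small files. As a minimal sketch of that idea (not the commands used for either dataset), the helper below packs a nested folder into a single uncompressed tar file. The function name `pack_directory` and both paths are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir: str, archive_path: str) -> int:
    """Pack every regular file under src_dir into one uncompressed tar.

    archive_path should live outside src_dir so the archive does not
    try to include itself. Returns the number of files added.
    """
    src = Path(src_dir)
    count = 0
    # Mode "w" writes a plain tar: images and videos are already
    # compressed, so gzip would mostly cost time.
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive can be
                # unpacked anywhere.
                tar.add(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

A call such as `pack_directory("frames", "frames.tar")` would leave one file to upload instead of a whole tree.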

Successful Workflow for PolyGlotFake

1. Store the RAR archive in Google Drive

The dataset is provided as a single RAR archive, which is ideal for large‑file transfers.

2. Set up a Colab notebook

# Authenticate Google Cloud
from google.colab import auth
auth.authenticate_user()

project_id = 'polyglotfake'
!gcloud config set project {project_id}
!gsutil ls

3. Download the 24 GB RAR from Drive to Colab

!gdown --id 1cUlwVi8Wu6MmDu8Mh2lXTIPJFz63KOtd
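
Large Drive downloads can end up truncated, or saved as a small HTML warning page instead of the real file, so a quick sanity check before the long copy may save hours. The sketch below reports the file size and a SHA‑256 digest, reading in chunks so the archive never has to fit in memory. The helper name `size_and_sha256` is hypothetical, and no official checksum is published in this post, so the digest is only useful for comparing your own copies.

```python
import hashlib
import os


def size_and_sha256(path: str, chunk_size: int = 8 * 1024 * 1024) -> tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest) of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a 24 GB file never has to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

For example, `size_and_sha256("/content/goblin/PolyGlotFake.rar")` (the path used in step 5) should report a size close to 24 GB; a result of a few kilobytes would suggest the download did not fetch the archive.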

4. Mount a GCS bucket with gcsfuse

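%%bash
# Install gcsfuse (the "bionic" suffix must match the VM's Ubuntu release)
echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" \
    > /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
apt -qq update
apt -qq install -y gcsfuse

# Create a mount point
mkdir -p my_gcs_mount

# Mount the bucket on it (assumed here to be the pgfake bucket from step 6)
gcsfuse --implicit-dirs pgfake my_gcs_mount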
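
Before copying 24 GB into the mount point, it can be worth confirming that the directory is really backed by the bucket; if it is not, the copy quietly lands on the Colab VM's local disk. A minimal sketch of such a guard, with the hypothetical helper name `require_mount`:

```python
import os


def require_mount(path: str) -> str:
    """Return the absolute path if it is a mount point, else raise."""
    abs_path = os.path.abspath(path)
    # os.path.ismount is True when the directory sits on a different
    # device than its parent, which is the case for a FUSE mount.
    if not os.path.ismount(abs_path):
        raise RuntimeError(f"{abs_path} is not a mount point; mount the bucket first")
    return abs_path
```

Calling `require_mount("/content/my_gcs_mount")` right before the copy in the next step would fail fast instead of after hours of transfer.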

5. Copy the RAR to the bucket

%cp /content/goblin/PolyGlotFake.rar /content/my_gcs_mount/polyglotfake/

This step took > 3 hours.
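
To put that number in perspective, here is a rough back-of-the-envelope calculation. It assumes the "24 GB" is close to 24 GiB and that the transfer rate was steady; since the copy took more than 3 hours, the result is an upper bound on the average rate. Both function names are hypothetical.

```python
def throughput_mib_per_s(size_gib: float, hours: float) -> float:
    """Average transfer rate in MiB/s for size_gib moved in the given hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return size_gib * 1024 / (hours * 3600)


def hours_needed(size_gib: float, mib_per_s: float) -> float:
    """Hours required to move size_gib at a steady rate of mib_per_s."""
    if mib_per_s <= 0:
        raise ValueError("mib_per_s must be positive")
    return size_gib * 1024 / mib_per_s / 3600


# 24 GiB in 3 hours works out to roughly 2.28 MiB/s.
rate = throughput_mib_per_s(24, 3)
```

At that rate, writing through a gcsfuse mount is the slow part of the pipeline, which is one more reason to move a single archive once rather than retry many small uploads.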

6. Make the bucket public

gcloud storage buckets add-iam-policy-binding gs://pgfake \
    --member=allUsers \
    --role=roles/storage.objectViewer

The public URL can be copied from the Google Cloud Console.
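
Publicly readable objects are also reachable at `https://storage.googleapis.com/<bucket>/<object>`, so the link can be built by hand. The sketch below does that; the helper name `public_gcs_url` is hypothetical, and the object name `polyglotfake/PolyGlotFake.rar` is an assumption inferred from the copy destination in step 5.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, object_name: str) -> str:
    """Build the public HTTPS URL of an object in a GCS bucket."""
    # Keep "/" so pseudo-folders stay readable; percent-encode the rest.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


# Assumed object path; check the real one in the Cloud Console.
url = public_gcs_url("pgfake", "polyglotfake/PolyGlotFake.rar")
```

Opening the resulting URL in a private browser window is a quick way to confirm that the `allUsers` binding took effect.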

7. Upload to Kaggle

  1. Go to https://www.kaggle.com/datasets/?new=true.
  2. Choose “Link” as the source.
  3. Paste the public GCS URL.

The upload took ~ 2 hours.

Result: The dataset is now publicly available at
https://www.kaggle.com/datasets/debajyatidey/polyglotfake.
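
Once the dataset is attached to a Kaggle notebook, a first step is usually to see what actually got mounted. The sketch below counts files by extension under a directory. The helper name `summarize_tree` is hypothetical, and the path `/kaggle/input/polyglotfake` in the usage note is an assumption based on the dataset slug.

```python
import os
from collections import Counter


def summarize_tree(root: str) -> Counter:
    """Count the files under root, grouped by lowercase extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts
```

In a notebook, `summarize_tree("/kaggle/input/polyglotfake")` would show at a glance whether the input is still a single archive or a tree of videos and CSV files.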

Supporting Files

  • Real‑video metadata (CSV) – available in the Kaggle dataset.
  • Fake‑video metadata – too large to render here.

Visualizations (Google Looker Studio)

Charts (images not reproduced here):

  • Age distribution by language: distribution of age by language (subjects speaking in each language).
  • Age by gender: distribution of age of subjects by gender.
  • Sex ratio: sex ratio across all real videos.
  • Deep‑fake distribution: various charts showing how deep‑fake videos are organized and distributed.

All visualizations were created with Google Looker Studio.
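
The same kind of summary can be computed directly from the real‑video metadata CSV. The sketch below is only an illustration: the column names `language`, `age`, and `sex` are hypothetical and must be replaced with the headers actually present in the file, and `summarize_metadata` is a made-up helper name.

```python
import csv
from collections import Counter, defaultdict
from statistics import mean


def summarize_metadata(csv_path, language_col="language", age_col="age", sex_col="sex"):
    """Return (mean age per language, row count per sex) for a metadata CSV.

    The default column names are placeholders, not the real headers.
    """
    ages_by_language = defaultdict(list)
    sex_counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ages_by_language[row[language_col]].append(float(row[age_col]))
            sex_counts[row[sex_col]] += 1
    mean_age = {lang: mean(ages) for lang, ages in ages_by_language.items()}
    return mean_age, dict(sex_counts)
```

The two returned dictionaries correspond roughly to the age-by-language and sex-ratio charts described above.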

Closing Thoughts

After many failed attempts, broken pipelines, and storage limitations, this approach finally succeeded. The key was not to explode the dataset into thousands of files, but to keep it as a single archive and let the cloud storage services handle the transfer.

Lesson: Data engineering is not a side‑quest in machine learning—it’s the game itself.
Datasets like PolyGlotFake are intentionally complex (multilingual, multimodal) because they reflect real‑world deepfake challenges. Treating them with simple, robust pipelines pays off.

Making them accessible is not just a convenience: it directly affects how fast someone can experiment, iterate, and actually do research.

And that’s really the point.

If one person can now spin up a Kaggle notebook, plug in the dataset, and start experimenting in minutes instead of wasting days setting things up — then this entire ordeal was worth it.

Would I do it again?
At least now I know this much: I was simply doing it the hard way.

So, yes, … that’s a wrap!

Feel free to connect with me. :)

Thanks for reading! 🙏🏻
Written with 💚 by Debajyati Dey

Follow me

Debajyati Dey – Web Developer, Freelance Technical Writer, Casual Deep Learning Enthusiast, always eager to work with new technologies & documenting them.

📧 Email me for collaboration.

Happy coding 🧑🏽‍💻👩🏽‍💻! Have a nice day ahead! 🚀
