Introducing SEA-VL: A Multicultural Vision-Language Dataset for Southeast Asia

SEACrowd · March 14, 2025

We’re excited to announce a major milestone from the SEACrowd team: the launch of SEA-VL, the largest open-source vision-language (VL) dataset specifically designed to represent the cultural diversity of Southeast Asia 🇧🇳🇰🇭🇹🇱🇮🇩🇱🇦🇲🇾🇲🇲🇵🇭🇸🇬🇹🇭🇻🇳.

📄 Read the Paper

We’ve published our full methodology and findings on arXiv:
📜 Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

🔗 SEA-VL Dataset on Hugging Face: Explore SEA-VL

🔍 Why SEA-VL?

Most vision-language datasets today reflect Western-centric imagery and language, leaving Southeast Asian cultures underrepresented and misinterpreted.

SEA-VL is our open-source initiative to change that. It is designed to better represent the languages, traditions, and everyday realities of Southeast Asian communities.

📊 Highlights

  • 📸 1.3 million culturally relevant image-text pairs
  • 🌏 Covers all 11 Southeast Asian countries
  • 🗃️ 50× larger than any previous SEA-focused VL dataset
  • 🔗 Hosted on Hugging Face: Explore SEA-VL

🛠️ How We Built SEA-VL

We combined several approaches to balance scale with cultural fidelity:

  • Crowdsourcing: High cultural accuracy, but slow and resource-intensive
  • Image Crawling: ~85% cultural relevance and highly scalable
  • Image Generation: Still fails to reflect SEA cultures authentically and poses licensing challenges

💡 Why This Matters

  • ✅ AI trained on SEA-VL understands local contexts, languages, and traditions
  • ✅ Community contributions prevent cultural misrepresentation or erasure
  • ✅ We empower Southeast Asian communities to shape how AI sees the region

📣 Help Us Spread the Word

We’ve announced SEA-VL on our social channels. Please reshare and help us grow!

🐦 Twitter/X 💼 LinkedIn 📘 Facebook 🦋 Bluesky

👏 What’s Next?

We extend our deepest thanks to the contributors across Southeast Asia who made this possible.

This is only the beginning: Phase 2 is on the horizon, and we invite researchers, practitioners, and community members to collaborate with us. Stay tuned on our Discord!

Together, let’s build AI that reflects the full cultural diversity of Southeast Asia.
