SEACrowd · November 1, 2023

🚨 Call for contributors! Become a co-author of the newly launched SEACrowd: a community-driven project that gathers and standardizes NLP resources for Southeast Asian (SEA) languages.

So what can I do for SEACrowd?

1️⃣ Submitting Metadata for Existing Public Datasets

You can submit detailed metadata for existing public datasets. This covers key information such as the data license, size, language and dialect, annotation method, and so on.
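To make the kind of information concrete, here is a minimal sketch of a metadata record in Python. The class, the field names, and the example values are all hypothetical and chosen only for illustration; the actual SEACrowd submission form may ask for different or additional fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetMetadata:
    # Hypothetical field names; not the official SEACrowd schema.
    name: str
    license: str
    size: int                 # number of examples in the dataset
    languages: List[str]      # language codes, e.g. ["ind"]
    dialect: Optional[str]    # None when no specific dialect applies
    annotation_method: str    # e.g. "crowdsourced", "expert", "automatic"


def missing_fields(meta: DatasetMetadata) -> List[str]:
    """Return the required fields that were left empty."""
    required = ("name", "license", "languages", "annotation_method")
    return [field for field in required if not getattr(meta, field)]


# A made-up entry with the license left blank.
example = DatasetMetadata(
    name="example-corpus",
    license="",
    size=1000,
    languages=["ind"],
    dialect=None,
    annotation_method="crowdsourced",
)
print(missing_fields(example))
```

A check like this is handy before submitting: it flags which required details you still need to look up in the dataset's paper or repository.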

2️⃣ Building a DataLoader

For datasets whose metadata was approved in the previous task, you can also help build a HuggingFace dataloader. This ensures that all datasets in SEACrowd follow a standardized format.
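The idea behind standardization can be sketched in a few lines. Note that this is not a HuggingFace dataloader: it deliberately uses plain Python, and the unified schema (`id`, `text`, `label`), the function name, and the sample records are all assumptions made for illustration. It only shows how differently shaped sources can be mapped onto one common format.

```python
from typing import Dict, Iterable, Iterator


def standardize(
    records: Iterable[Dict[str, str]], text_key: str, label_key: str
) -> Iterator[Dict[str, str]]:
    """Map source-specific records onto a hypothetical unified schema."""
    for index, record in enumerate(records):
        yield {
            "id": str(index),
            "text": record[text_key],
            "label": record[label_key],
        }


# Two made-up sources that store the same kind of data under different keys.
source_a = [{"sentence": "halo", "sentiment": "pos"}]
source_b = [{"content": "kumusta", "class": "neu"}]

unified = list(standardize(source_a, "sentence", "sentiment"))
unified += list(standardize(source_b, "content", "class"))
print(unified)
```

Once every dataset exposes the same keys, downstream code can load any of them without dataset-specific handling.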

3️⃣ Listing Private NLP Datasets for SEA Languages

Sadly, some existing NLP resources for SEA languages remain behind closed doors. Surprisingly (according to a prior initiative), this is often because the authors never considered releasing the data as an option. Here, you can identify and list potential private datasets by reviewing the literature.

4️⃣ Opening Your Private NLP Dataset for SEA Languages

If you have previously worked with closed data (or if we’ve reached out to you due to Task 3 😉), consider releasing your dataset. You still retain ownership of the data; we’re only cataloging it.

Contributions are rewarded with points. Generally, more complex tasks earn more points. In particular, releasing high-quality NLP datasets for rare languages will earn you a good amount.

Reach certain point thresholds to earn merch 👕 and co-authorship 📝!

Interested? Then check out our GitHub page, and join our Discord server. Feel free to ask any questions😀

SEACrowd Poster
