  • Joe Raedle/Getty Images

    Cookbooks, Wikipedia, and auto-generated Spanglish: The quirky ways AI researchers gather data

    Here are four of the most creative data collection methods used by experts at the leading annual conference on natural-language processing.

    Data is the oil that fuels AI development, and it gives us many of the advances we take for granted: YouTube captions, Spotify music recommendations, those creepy ads that follow you around the Internet.

    But when it comes to collecting useful data, AI experts often have to get creative. Take natural-language processing (NLP), a subfield of AI that focuses on teaching computers how to parse human language. At the annual Conference on Empirical Methods in NLP, experts presented a broad range of research that drew on information gathered in some ingenious ways. We’ve summarized four of our favorite projects below.

    SPANGLISH

    Among the papers on multilingual NLP this year, Microsoft presented one that focused on processing “code-mixed language”—text or speech that switches fluidly between two languages. Considering that more than half of the world’s population is multilingual, this understudied area is important.

    The researchers started with Spanglish (Spanish and English), but they lacked enough Spanglish text to train a model. As common as code-mixing is in multilingual conversation, it’s rarely found in text. To overcome that challenge, the researchers wrote a program that ran English text through the Microsoft Bing translator and wove some phrases from the Spanish translation back into the original. The program made sure that the swapped words and phrases had the same meaning. Just like that, they were able to create as much Spanglish as they needed.
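
    To make the idea concrete, here is a minimal sketch of the substitution step, not the researchers' actual program. It assumes the translation and the phrase alignment have already been done elsewhere; the function name `code_mix`, the `phrase_pairs` format, and the `swap_prob` parameter are all hypothetical and chosen only for illustration.

```python
import random


def code_mix(english_tokens, phrase_pairs, swap_prob=0.5, seed=0):
    """Swap some English phrases for their aligned Spanish counterparts.

    english_tokens: list of words in the original English sentence.
    phrase_pairs:   list of ((start, end), spanish_phrase) entries, where
                    (start, end) is a half-open token span in english_tokens.
                    Spans are assumed not to overlap (hypothetical format).
    swap_prob:      chance that any given aligned phrase is swapped.
    """
    rng = random.Random(seed)  # seeded so results are repeatable
    out = []
    pos = 0
    for (start, end), spanish in sorted(phrase_pairs):
        # Copy the untouched English words before this span.
        out.extend(english_tokens[pos:start])
        if rng.random() < swap_prob:
            out.extend(spanish.split())            # use the Spanish phrase
        else:
            out.extend(english_tokens[start:end])  # keep the English phrase
        pos = end
    # Copy whatever English words remain after the last span.
    out.extend(english_tokens[pos:])
    return " ".join(out)
```

    Called on the tokens of "I will see my friends at the library" with the spans for "my friends" and "the library" aligned to "mis amigos" and "la biblioteca", a `swap_prob` of 1.0 swaps both phrases and a `swap_prob` of 0.0 returns the English sentence unchanged. Values in between produce different mixes, which is how a single sentence pair could yield many synthetic examples.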

    The resulting NLP model outperformed previous models that were trained on just Spanish and English separately. The researchers hope that their work will eventually help develop multilingual chatbots that can speak naturally in code-mixed language.

    COOKBOOKS

    Recipes are great for making food, but they can also provide nourishment for machines. They all follow a similar step-by-step pattern, and they often include pictures that correspond with the text—an excellent source of structured data for teaching machines to comprehend text and images at the same time. That’s why researchers at Hacettepe University in Turkey compiled a giant data set of around 20,000 illustrated cooking recipes. They hope it will be a new resource for benchmarking the performance of joint image-text comprehension.

    What they call “RecipeQA” will build on previous research that has focused on machine reading comprehension and visual comprehension separately. In the former, the machine must understand a question and a related passage to find the answer; in the latter, it searches for the answer in a related photo instead. Having text and photos side by side increases the complexity of the task because the photos and text may share complementary or redundant information.

    SHORTER SENTENCES

    Google wants AI to spruce up your prose. To this end, researchers there created the largest-ever data set for breaking up long sentences into shorter ones with equivalent meaning. Where would you find massive amounts of editing data? Wikipedia, of course.

    From Wikipedia’s rich edit history, the research team extracted instances in which people split long sentences. The result: 60 times more distinct sentence-split examples and 90 times more vocabulary words than were found in the previous benchmark data set for this task. The data set also spans multiple languages.
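
    A rough sketch of how such instances might be spotted is shown below. It is a simplified heuristic, not the research team's extraction method: it compares two revisions of a passage and flags an old sentence that disappears while two adjacent new sentences cover nearly the same words. The function names, the naive sentence splitter, and the `min_overlap` threshold are assumptions made for illustration.

```python
import re


def sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased set of word tokens, ignoring punctuation.
    return set(re.findall(r"\w+", sentence.lower()))


def find_splits(old_text, new_text, min_overlap=0.8):
    """Return (old_sentence, (new_a, new_b)) candidates for sentence splits.

    An old sentence counts as split when it is missing from the new revision
    and two adjacent, newly written sentences share most of its vocabulary
    (Jaccard overlap of word sets >= min_overlap, a hypothetical threshold).
    """
    old_sents = sentences(old_text)
    new_sents = sentences(new_text)
    old_set, new_set = set(old_sents), set(new_sents)
    results = []
    for old in old_sents:
        if old in new_set:
            continue  # sentence survived unchanged, so it was not split
        old_words = words(old)
        for a, b in zip(new_sents, new_sents[1:]):
            if a in old_set or b in old_set:
                continue  # only consider sentences introduced by the edit
            new_words = words(a) | words(b)
            union = old_words | new_words
            if not union:
                continue
            if len(old_words & new_words) / len(union) >= min_overlap:
                results.append((old, (a, b)))
    return results
```

    Word overlap is a crude stand-in for "same meaning", so a real pipeline would need a stricter similarity check and a proper sentence segmenter.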

    When they trained a machine-learning model on their new data, it achieved 91% accuracy. (Here, the percentage reflects the proportion of sentences that retained their meaning and grammatical correctness after being rewritten.) In comparison, a model trained on previous data reached only 32% accuracy. When they combined both data sets and trained another model, it achieved 95% accuracy. The researchers concluded that future improvements could be made by finding even more sources of data.

    SOCIAL-MEDIA BIAS

    Studies have shown that the language we generate can be a great predictor of our race, gender, and age, even if that information is never explicitly stated. With that in mind, researchers at Bar-Ilan University in Israel and the Allen Institute for Artificial Intelligence tried using AI to de-bias text by removing those embedded indicators.

    To acquire enough data to represent the language patterns across different demographics, they turned to Twitter. They gathered tweets from users who were evenly distributed between non-Hispanic whites and non-Hispanic blacks; between men and women; and between people in the 18-34 and above-35 age groups.
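
    One common way to get an even distribution like this is to downsample every group to the size of the smallest one. The sketch below illustrates that idea only; the article does not say how the researchers balanced their sample, and the function name `balance_by_group` and the record format are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_group(records, key, seed=0):
    """Downsample records so every value of `key` appears equally often.

    records: list of dicts, e.g. {"text": "...", "gender": "f"} (assumed format).
    key:     the field whose values should be evenly represented.
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    if not groups:
        return []
    # Every group is cut down to the size of the smallest group.
    n = min(len(members) for members in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n))
    return balanced
```

    Balancing on one attribute at a time, as here, does not guarantee balance across combinations of attributes; that would require grouping on several fields at once.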

    They then used an adversarial approach—pitting two neural networks against one another—to see if they could automatically remove the inherent demographic indicators within the tweets. One neural net tried to predict the demographics, while the other tried to tweak the text to be completely neutral, with the goal of driving down the first model’s prediction accuracy to 50% (or chance). The approach ultimately mitigated race, gender, and age indicators significantly but not entirely.
