Babel Machine Changelog

Last Updated: March 01, 2024

Version 2.1

Current Babel Machine version as of March 01, 2024.

Predictions and Modeling

Implemented pooled models for Emotions Babel and Sentiment Babel.
Initial release for Named Entity Recognition (NER) Babel: https://nerbabel.poltextlab.com

Pipeline and Backend

Implemented a dropdown option for choosing the processing unit for prediction: GPU or CPU. We recommend choosing the GPU for most scenarios unless your file contains 1,000 or fewer rows (for example, small samples or when trying out the service). Please note that NER Babel is CPU-only by design.
Fixed an issue that resulted in an error message or NA value in the prediction column beyond the 30000th row.
Fixed an issue that caused CUDA error messages and resulted in processing failures.
Technical improvements on VM memory usage.
Implemented an interim solution for VMs not properly shutting down after a job is complete.
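The processing unit recommendation above can be expressed as a small rule. The following is only an illustrative sketch, not the service's actual code: the function name, its parameters, and the module name strings are assumptions made for this example.

```python
def recommend_processing_unit(n_rows: int, module: str) -> str:
    """Return "CPU" or "GPU" following the recommendation in this changelog.

    Hypothetical helper: the real dropdown leaves the choice to the user.
    """
    # NER Babel is CPU-only by design.
    if module == "NER Babel":
        return "CPU"
    # Small files (1,000 rows or fewer) are fine on the CPU.
    if n_rows <= 1000:
        return "CPU"
    # For everything else the GPU is the recommended option.
    return "GPU"
```

For instance, a 500-row sample would be routed to the CPU, while a 50,000-row file for any module other than NER Babel would be routed to the GPU.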

Improvements and Adjustments

Announcements are now displayed on top of the upload forms with their own formatting.
Added clarifications regarding supported languages in the upload form text.
Improved the internal messages we use to track and debug submissions.
Emotion Babel has been renamed to Emotions Babel.
Adjusted page subtitles and upload form texts for clarification.
Adjusted the messages of the successful prediction emails to properly reflect the module used.
Dataset download emails provide the cost of the prediction computation of the submitted file.

Version 2.0

Babel Machine version from January 24, 2024 to March 01, 2024.

This version introduces the division of the Babel Machine into various modules. CAP Babel Machine is now found under https://capbabel.poltextlab.com, and https://babel.poltextlab.com is a landing page for the module selection.

Predictions and Modeling

Initial release for Manifesto Babel.
Initial release for Sentiment Babel.
Initial release for Emotion Babel.
The baseline model for CAP Babel Machine has been retrained.

Pipeline and Backend

Support for the Babel Machine models has been implemented.

Improvements and Adjustments

Module pages now have a menu on top of the page that can be used to jump to the other Babel modules.

Version 1.2

This version introduced the 10-domain setup (see table and note below).

Predictions and Modeling

Version 1.1 Domains	Version 1.2 Domains
Budget	Budget
-	Executive Orders
Judicial Decision	Judiciary
Legal	Legislative
Manifesto	Party Manifestos
Media	Media
-	Public Opinion
Social Media	Social Media
Speech	Executive Speech, Parliamentary Speech
Other	-

We have developed language-domain models that cover 9 languages (Hungarian, English, Italian, Dutch, Spanish, French, and Danish) and 10 domains (Media, Social Media, Parliamentary Speech, Legislative, Executive Speech, Executive Order, Party Manifesto, Judiciary, Budget, Public Opinion). Babel DOES WORK for other domains and languages, but we cannot provide validity scores due to a lack of hand-coded test data.
A model selection step chooses between the language model and the language-domain model for supported datasets, based on which one has the higher F1 score.
We implemented softmax scores (a feature request): the results email now includes the three highest-probability category predictions of the Babel Machine model, together with the probability (softmax) score assigned to each label. Take them with a grain of salt.
Support for the "None" category (label 999): "Most of the language models that the CAP Babel Machine uses were fine-tuned on training data containing the label 'None' in addition to the 21 CAP major policy topics, indicating that the given text contains no relevant policy content. We use the label 999 for these cases. Note that some of the models (e.g., Danish legislative, Dutch media) do not recognize this category and thus cannot predict if the row has no policy content." For models that recognize it, the label thus also serves as a binary classifier of policy relevance.
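The model selection step, the top-three softmax output, and the use of label 999 as a relevance flag can be illustrated with a short sketch. This is not the Babel Machine's own code: all function names are hypothetical, and the F1 scores and raw model scores (logits) below are made up for illustration.

```python
import math

NONE_LABEL = 999  # label the changelog uses for "no relevant policy content"


def select_model(f1_by_model):
    """Pick the model name with the highest F1 score."""
    return max(f1_by_model, key=f1_by_model.get)


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_predictions(labels, logits, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def has_policy_content(label):
    """Treat the "None" label as the negative class of a binary relevance check."""
    return label != NONE_LABEL


# Made-up F1 scores and logits, for illustration only.
chosen = select_model({"language": 0.71, "language-domain": 0.78})
labels = [1, 2, 3, NONE_LABEL]
logits = [2.0, 0.5, 1.0, 3.0]
top3 = top_predictions(labels, logits)
```

Here `top3` holds the three most probable labels with their softmax scores, and `has_policy_content(top3[0][0])` reports whether the top prediction is a policy topic or the "None" category.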

Pipeline and Backend

Support for uploading large datasets (up to ~800 MB); there is no need to split files manually unless they are considerably larger than this limit.
Improved prediction speed.
Added a cache for loading in models.
Added CSV validation that checks the dataset for typical errors (such as improper delimiter usage that breaks file processing).
Security improvements (important for us, as providers):
Character limits on the upload form
Upload form input and CSV files are sanitized (special characters are not accepted) to prevent exploits such as SQL injection
Internal reporting adjustments so we can identify submission information and issues faster:
Detailed metadata description on Slack that corresponds to the metadata on the upload form
Report number of rows and number of coded rows so we can verify that all rows got properly coded
Runtime is now reported as timestamps, and an estimated runtime cost has been added so we can see the price of coding each dataset (especially for bigger files)
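To give a sense of what a delimiter check in the CSV validation step might look like, here is a minimal sketch using Python's standard `csv` module. It is an assumption about one possible check, not the validation the Babel Machine actually runs, and the function name is hypothetical.

```python
import csv
import io


def find_inconsistent_rows(text, delimiter=","):
    """Return the numbers of rows whose field count differs from the header.

    Rows are counted from 1, with the header as row 1. A mismatch often
    points to an unquoted delimiter inside a text field.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to compare against
    expected = len(header)
    bad_rows = []
    for row_number, row in enumerate(reader, start=2):
        if len(row) != expected:
            bad_rows.append(row_number)
    return bad_rows


# Row 2 quotes its comma correctly; row 3 does not and gains an extra field.
sample = 'id,text\n1,"Taxes, budgets"\n2,Taxes, budgets\n'
problems = find_inconsistent_rows(sample)
```

A check like this catches only field-count mismatches; a file that uses the wrong delimiter consistently would pass it, which is one reason such errors are hard to detect automatically.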

Improvements and Adjustments

Added a Contact Us form to the page so users can reach out to us with inquiries and questions.
There is a menu at the top of the page that includes a link back to poltextlab.com.
The upload form shows the characters remaining for each field (due to the implementation of character limitation).
Updated the upload instructions to provide more technical details about the dataset validation steps so common issues that cause processing failures can be addressed by the uploader. Then, the uploader can resubmit the file after correcting the error that caused the failure.
Errors on the upload form have been clarified, in particular the UTF-8 error, to provide more context for those less familiar with character encoding.
Users will receive an email if the CSV validation step fails, with an additional explanation of what caused the error. As some errors are inherently difficult to catch (such as improper delimiter usage), we recommend following the upload instructions for troubleshooting.
The email with the coded dataset + softmax scores provides suggestions on opening CSV files properly (from our experience Excel and LibreOffice weren't the best at handling them).

Version 1.1

This version added the initial language-domain setup to the pipeline.

Predictions and Modeling

We have developed language-domain models that cover 9 languages (Hungarian, English, Italian, Dutch, Spanish, French, and Danish) and 7 domains (Legal, Speech, Budget, Manifesto, Media, Social Media, Other). Babel DOES WORK for other domains and languages, but we cannot provide validity scores due to a lack of hand-coded test data.

Pipeline and Backend

The number of standby VMs for prediction has been increased from 2 to 4.

Improvements and Adjustments

Polished the text of the upload form.
Updated the upload instructions.
Internal submission notifications have been improved.
Prediction completion email text has been adjusted.
Institutional affiliation field has been added on the upload form.
Dataset language field has been added on the upload form.

Version 1.0

Initial release of the CAP Babel Machine. Predictions were handled by an XLM-RoBERTa model fine-tuned on the training data of 6 languages (English, Spanish, Hungarian, Polish, Danish, Dutch).

The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 program under grant agreement No. 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).

HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:

Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434

GDPR Compliance Statement

Nature of the Uploaded Data: The files uploaded by users to the tool do not contain personal data as defined in Article 4(1) of the GDPR, which specifies personal data as "any information relating to an identified or identifiable natural person ('data subject')".
Data Process: The files submitted to our tool are stored in a secure cloud environment to allow processing and generation of the output (the coded CSV file). Personal data provided in connection with the file upload—such as the submitter's name, email address, and similar details—are used exclusively for the purpose of sending the coded files back to the user and identifying the organisation of our users. This processing is conducted in compliance with the purpose limitation principle (Article 5(1)(b)) and the data minimisation principle (Article 5(1)(c)) of the GDPR. By submitting the files, the user consents to this data processing, which is strictly limited to returning the results and identifying the file owner. The personal data is stored securely and retained solely for these purposes. In accordance with Article 17 of the GDPR (Right to Erasure, or "Right to be Forgotten"), users may request the deletion of their personal data at any time. Such requests will be processed promptly, and all related personal data will be permanently deleted from our systems.
Training Purposes: We do not use personal data to train machine learning models or perform any other type of analysis. When submitting files, the submitter must declare that the uploaded CSV files do not contain any personal data, as stated in the consent agreement. This approach aligns with the purpose limitation principle (Article 5(1)(b)) of the GDPR, which requires data to be collected for "specified, explicit, and legitimate purposes" and not further processed in a manner incompatible with those purposes.
Google Cloud Platform Compliance: The files submitted to our tool are stored in a secure cloud environment provided by Google Cloud Platform, with configurations ensuring that all processing occurs on servers located within the European Union (EU). This guarantees compliance with GDPR requirements related to data residency and cross-border data transfers. The use of Google Cloud Platform as our processing environment ensures high levels of data security and compliance with GDPR, including the application of the Standard Contractual Clauses (SCCs) for any necessary data transfers. Google Cloud's infrastructure is certified under internationally recognised standards, such as ISO 27001, ISO 27017, and ISO 27018, further ensuring the security and confidentiality of uploaded data.