A web resource making the annotated corpus available and searchable by scholars and people in the field.
The corpus will be made available to the scientific community through a web resource that aligns the transcription of each speech with its audio/video recording. The resource will allow the whole corpus to be searched by the relevant categories of implicit content and by speech metadata. It will be possible to retrieve, for example, all cases where a left-wing politician speaking during an electoral campaign (or in Parliament, etc.) used a restrictive relative clause with a definite noun head to presuppose the truth of some undemonstrated self-praising content, or a conversational implicature to convey some doubtful accusation against a political opponent.
The modules of the web resource
Building the corpus and exploiting it for research require an ad hoc platform that supports both annotation and queries over the whole resource. The platform will consist of three modules:
- ANNOTATION MODULE: computational tools for manual annotation of pragmatic phenomena by independent annotators on a shared platform, measurement of inter-annotator agreement, and creation of a gold standard. Annotation will be performed through a web platform that lets annotators work independently on the same texts and that measures inter-annotator agreement. Anafora, WebAnno, and FLAT are recently developed web-based collaborative annotation tools that can support this workflow.
- AUTOMATIC PROCESSING MODULE: backend tools for text and media processing:
- Transcription and alignment of video and audio sources to create a textual corpus: both steps can be performed automatically with Automatic Speech Recognition (ASR) tools. Using state-of-the-art algorithms with trained models is expected to keep the error rate of the automatically generated text low.
- Basic NLP tasks for token-level text annotation to enhance corpus retrieval capabilities: the whole corpus will be processed with standard automatic tools for tokenization, lemmatization and PoS-tagging. Pre-trained resources for Italian are freely available for this purpose (e.g. TreeTagger and TextPro). This will allow complex queries over parts of speech and lemmas rather than just word forms.
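The inter-annotator agreement measurement planned for the annotation module can be illustrated with Cohen's kappa, one common agreement statistic for two annotators. The proposal does not name a specific metric, so the choice of kappa, the function name `cohen_kappa` and the label values below are assumptions made for this sketch only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels assigned to six text spans by two annotators.
annotator_1 = ["presupposition", "implicature", "none", "none", "presupposition", "none"]
annotator_2 = ["presupposition", "none", "none", "none", "presupposition", "implicature"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Spans on which the annotators disagree would then be candidates for adjudication when the gold standard is built.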
In sum, the resource will contain different levels of annotation:
- automatic annotations of tokens, lemmas and parts-of-speech;
- manual annotation of pragmatic phenomena;
- text-to-audio/video alignment of the multimedia parts;
- metadata on the communication context and the speaker.
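The annotation levels listed above could be held in a single record per speech. The following is a minimal sketch of such a structure and of a metadata-plus-pragmatics filter like the one described earlier; all class names, field names and label values (`Speech`, `orientation`, `"restrictive_relative_clause"`, and so on) are hypothetical and are not taken from the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str
    lemma: str
    pos: str
    start_s: float  # start time in the recording, in seconds
    end_s: float    # end time in the recording, in seconds


@dataclass
class PragmaticAnnotation:
    category: str     # e.g. "presupposition" or "implicature"
    trigger: str      # e.g. "restrictive_relative_clause"
    first_token: int  # index of the first token in the span
    last_token: int   # index of the last token in the span (inclusive)


@dataclass
class Speech:
    speaker: str
    orientation: str  # speaker metadata (hypothetical values)
    context: str      # communication context (hypothetical values)
    tokens: list = field(default_factory=list)
    annotations: list = field(default_factory=list)


def find_cases(speeches, orientation, context, category, trigger):
    """Return (speaker, text span) pairs matching metadata and annotation filters."""
    hits = []
    for speech in speeches:
        if speech.orientation != orientation or speech.context != context:
            continue
        for ann in speech.annotations:
            if ann.category == category and ann.trigger == trigger:
                span = speech.tokens[ann.first_token:ann.last_token + 1]
                hits.append((speech.speaker, " ".join(t.form for t in span)))
    return hits


# Invented example data, for illustration only.
tokens = [
    Token("the", "the", "DET", 0.00, 0.20),
    Token("reforms", "reform", "NOUN", 0.21, 0.78),
    Token("that", "that", "PRON", 0.80, 0.94),
    Token("we", "we", "PRON", 0.95, 1.08),
    Token("delivered", "deliver", "VERB", 1.10, 1.70),
]
span = PragmaticAnnotation("presupposition", "restrictive_relative_clause", 0, 4)
speeches = [
    Speech("Speaker A", "left", "electoral_campaign", tokens, [span]),
    Speech("Speaker B", "right", "parliament", tokens, [span]),
]
hits = find_cases(speeches, "left", "electoral_campaign",
                  "presupposition", "restrictive_relative_clause")
print(hits)
```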
- QUERY MODULE: tools that allow the end user to run complex queries over the multi-level annotation structure and to display the results:
- Corpus Query Engine, which indexes the annotated data and retrieves textual patterns through CQL (Corpus Query Language), a de facto standard for corpus querying; it also computes frequency lists, collocations and other statistics.
- Web-based CQL Frontend, which provides a user-friendly online interface for searching the corpus; aligned video and audio sources, if present, will be searchable and linked to query results. Sketch Engine is one of the best-known tools for corpus querying.
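To give an idea of what a token-level pattern query does, the sketch below matches a sequence of attribute constraints against consecutive tokens. It is a simplified stand-in written for illustration, not the behaviour of any particular query engine; the attribute names (`word`, `lemma`, `pos`) and the tag values are assumptions. The first pattern corresponds roughly to a CQL expression such as `[pos="DET"] [pos="NOUN"] [lemma="that"]`, where the attribute names depend on how the corpus is encoded.

```python
import re


def match_pattern(tokens, pattern):
    """Return the start indexes where consecutive tokens satisfy the pattern.

    tokens  -- list of dicts mapping attribute names to string values
    pattern -- list of dicts mapping attribute names to regular expressions,
               one dict per token position
    """
    if not pattern:
        return []
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, tok[attr])
            for tok, constraints in zip(window, pattern)
            for attr, regex in constraints.items()
        ):
            hits.append(start)
    return hits


# Invented example sentence with hypothetical lemma and PoS values.
sentence = [
    {"word": "the", "lemma": "the", "pos": "DET"},
    {"word": "reforms", "lemma": "reform", "pos": "NOUN"},
    {"word": "that", "lemma": "that", "pos": "PRON"},
    {"word": "we", "lemma": "we", "pos": "PRON"},
    {"word": "delivered", "lemma": "deliver", "pos": "VERB"},
    {"word": "helped", "lemma": "help", "pos": "VERB"},
    {"word": "every", "lemma": "every", "pos": "DET"},
    {"word": "family", "lemma": "family", "pos": "NOUN"},
]

relative_clause_heads = match_pattern(
    sentence, [{"pos": "DET"}, {"pos": "NOUN"}, {"lemma": "that"}]
)
noun_phrases = match_pattern(sentence, [{"pos": "DET"}, {"pos": "NOUN"}])
print(relative_clause_heads, noun_phrases)
```

A real engine works on an index rather than scanning every position, but the query semantics over lemmas and parts of speech are of this kind.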
Web frontends will be modified to provide direct access to aligned audio and video sources. A dedicated web application will be created to stream video and audio with synced transcriptions and annotations (similarly to VideoAnt or FrameTrail).
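Keeping a transcript in sync with playback amounts to finding, for a given playback time, the token whose time span contains it. A minimal sketch of that lookup is given below, assuming the alignment step yields one start time per token; the function name `token_at` and the timing values are hypothetical.

```python
from bisect import bisect_right


def token_at(start_times, playback_s):
    """Index of the last token that has started by playback_s, or None.

    start_times must be sorted in ascending order (one entry per token).
    """
    index = bisect_right(start_times, playback_s) - 1
    return index if index >= 0 else None


# Invented alignment output: token forms and their start times in seconds.
words = ["the", "reforms", "that", "we", "delivered"]
starts = [0.00, 0.21, 0.80, 0.95, 1.10]

current = token_at(starts, 0.85)
print(words[current])
```

Because only start times are used, a pause between two tokens is attributed to the preceding token; storing end times as well would allow pauses to be left unhighlighted.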