Large-Scale TV Dataset
Fact-Checking
(STVD-FC)

The STVD-FC dataset addresses the fact-checking problem. It focuses on French political discourse and covers the 2022 French presidential election, over the period from the 1st of February to the 1st of May 2022. The process used to build the dataset is detailed in [1]. The dataset contains about 1,330 fact-checked claims scraped from the fact-checking service Factoscope. For the video counterpart, nearly 6,730 TV programs, representing a total duration of 6,540 hours, have been captured with a TV workstation along with their metadata.

The dataset is composed of different parts provided as:

a root directory for each part,
subdirectories grouping all the daily / weekly instances of TV programs,
subdirectories containing the files (audio/video and metadata) for each TV program,
audio/video files encoded with the MPEG-4 format,
TV metadata files respecting the XMLTV format,
a global XML file containing the fact-checked claims and having an XML schema.
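
As a minimal sketch of how the TV metadata could be read, the snippet below parses an XMLTV document with Python's standard library. The inline sample and its channel id are invented for illustration and are not taken from the dataset. Only the element and attribute names (tv, programme, start, stop, channel, title) come from the XMLTV format, and the actual metadata files may carry more fields.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical XMLTV fragment, written for this example only.
SAMPLE_XMLTV = """<tv>
  <programme start="20220211105000 +0100" stop="20220211110000 +0100" channel="franceinfo.example">
    <title lang="fr">Le fil info</title>
  </programme>
</tv>"""


def parse_xmltv_time(value):
    """Parse an XMLTV timestamp such as '20220211105000 +0100'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S %z")


def read_programmes(xml_text):
    """Return one dict per <programme> element found in the document."""
    root = ET.fromstring(xml_text)
    programmes = []
    for prog in root.iter("programme"):
        start = parse_xmltv_time(prog.get("start"))
        stop_raw = prog.get("stop")  # the stop attribute is optional in XMLTV
        minutes = None
        if stop_raw is not None:
            minutes = (parse_xmltv_time(stop_raw) - start).total_seconds() / 60
        programmes.append({
            "channel": prog.get("channel"),
            "title": prog.findtext("title"),
            "start": start,
            "minutes": minutes,
        })
    return programmes


programmes = read_programmes(SAMPLE_XMLTV)
```

The same function could be applied to the content of a metadata file once its exact name and encoding are known.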

The following naming convention is applied within the dataset:

PartX is the directory name of a part of the dataset, where X is a label 1, ..., 8,
Hashcode is the name of a subdirectory containing all the daily / weekly instances of a TV program; it has a hexadecimal form 0a83...6b1e with a 160-bit (40-character) coding,
t_s is a timestamp naming a subdirectory that contains the audio/video and metadata files of a TV program, with the template
YEAR | MONTH | DAY |_| HOURS | MINUTES | SECONDS e.g. 20220211_105000,
t_s is set to the starting time of the TV program,
the TV metadata, audio and video files follow the templates
t_s, t_s_audio, and t_s_video respectively.
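
The naming convention above can be sketched in a few lines of Python. The helper names parse_t_s and program_stems are hypothetical, the hashcode is kept in its elided form, and file extensions are deliberately left out because the description does not state them.

```python
from datetime import datetime
from pathlib import PurePosixPath

# YEAR MONTH DAY _ HOURS MINUTES SECONDS, e.g. 20220211_105000
T_S_FORMAT = "%Y%m%d_%H%M%S"


def parse_t_s(t_s):
    """Convert a t_s directory name into the starting time of the program."""
    return datetime.strptime(t_s, T_S_FORMAT)


def program_stems(part, hashcode, t_s):
    """Build the extension-less paths of the metadata, audio and video files."""
    parse_t_s(t_s)  # raises ValueError if t_s does not follow the template
    base = PurePosixPath(f"Part{part}") / hashcode / t_s
    return {
        "metadata": base / t_s,
        "audio": base / f"{t_s}_audio",
        "video": base / f"{t_s}_video",
    }


start = parse_t_s("20220211_105000")
stems = program_stems(1, "0a83...6b1e", "20220211_105000")
```

Parsing t_s first gives a cheap validity check on directory names before any file is opened.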

For visualization and testing purposes, some samples (audio/video files with related fact-checked claims) are given in the following table.

	Audio	Video	Claim	Topics
sample 1	a_1	v_1	f_1	New Caledonia, Guadeloupe
sample 2	a_2	v_2	f_2	François-Xavier Bellamy
sample 3	a_3	v_3	f_3	War in Ukraine
sample 4	a_4	v_4	f_4	Abstention
sample 5	a_5	v_5	f_5	Political battle Macron-Le Pen

The files constituting the dataset are given below and are protected with a password. The dataset is available for non-commercial research purposes. Before downloading the dataset, get the agreement (in English or French version) and sign it. Then, send the scanned version to Mathieu Delalandre. After verifying your request, we will contact you with the password to unzip the dataset.

The different files constituting the dataset are given here. We first provide the global file containing the fact-checked claims with its XML schema. Parts 1 to 8 are given in the next table.
For better accessibility, CSV index files are provided for every part, with the format
Hashcode; Channel; Program where

Hashcode is a long (64 bits) given with a hexadecimal coding,
Channel is a string corresponding to the channel name,
Program is a string corresponding to the program name.

e.g. e13d...875b; Franceinfo; Le fil info
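
A minimal sketch of reading such an index with Python's csv module is given below. It assumes that the files have no header row and no quoted fields, which the description does not specify. The sample line is the example above with its elided hashcode left as-is, and read_index is a hypothetical helper name.

```python
import csv
import io

# The example line from the description; the hashcode stays elided.
SAMPLE_INDEX = "e13d...875b; Franceinfo; Le fil info\n"


def read_index(handle):
    """Read 'Hashcode; Channel; Program' rows from an open text handle."""
    rows = []
    for row in csv.reader(handle, delimiter=";"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        hashcode, channel, program = (field.strip() for field in row)
        rows.append({"hashcode": hashcode, "channel": channel, "program": program})
    return rows


rows = read_index(io.StringIO(SAMPLE_INDEX))

# Group the hashcodes by channel to find the subdirectories of interest.
by_channel = {}
for entry in rows:
    by_channel.setdefault(entry["channel"], []).append(entry["hashcode"])
```

With a real index file, the io.StringIO object would be replaced by a file opened in text mode with newline="" as recommended for the csv module.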

Part	Duration (h)	Hashcodes	Index	Files	Size (GB)	Link
1	815.6 h	28	download	16	245.8 GB	download
2	815.9 h	19	download	16	246.0 GB	download
3	805.5 h	27	download	16	242.6 GB	download
4	814.7 h	21	download	16	244.7 GB	download
5	812.2 h	14	download	16	242.4 GB	download
6	828.4 h	17	download	16	249.5 GB	download
7	808.4 h	12	download	16	241.0 GB	download
8	806.5 h	13	download	16	243.8 GB	download
Total	6507.2 h	151		128	1,956 GB

NB. Our storage service at the UT delivers 3-16 MB/s for downloading (from a low- / high-speed connection, respectively) under concurrent access.
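
For planning purposes, the figures above allow a rough estimate of the full download time. The sketch below assumes a sustained rate, decimal units (1 GB = 1000 MB) and sequential downloads of the eight parts. Under these assumptions, the whole dataset takes roughly 34 hours at 16 MB/s and roughly 181 hours at 3 MB/s.

```python
# Sizes of parts 1 to 8 in GB, as listed in the table.
PART_SIZES_GB = [245.8, 246.0, 242.6, 244.7, 242.4, 249.5, 241.0, 243.8]


def download_hours(size_gb, rate_mb_per_s):
    """Estimate the download time in hours, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / rate_mb_per_s / 3600


total_gb = sum(PART_SIZES_GB)
fastest_hours = download_hours(total_gb, 16)  # high-speed connection
slowest_hours = download_hours(total_gb, 3)   # low-speed connection
```

Actual times depend on the connection and on the load of the storage service.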

To help users get started, the STVD-FC dataset is provided with a "hello world" index. This index gives baseline results of NLP and CV methods for a first analysis of the dataset. It follows the same naming convention as the root dataset for the directories
i.e. \\PartX\Hashcode\t_s\
where every directory contains the following index files:

t_s_transcript is an audio transcription file formatted as JSON. The transcriptions have been computed with the Whisper system using the multilingual whisper-tiny model as a tradeoff between speed and accuracy. The DELL 5820 computer of the TV Workstation was used for processing (for ≃ a week) with an Intel Xeon W-2295 CPU / 36 threads. Our evaluation of transcript quality on a sample of the dataset reports ≃ 85% accuracy. Every transcript file is delivered with lemmas and named entities detected with the spaCy library. The final organization of a t_s_transcript file is:
{ "file": "t_s_audio", "text": "....", "lemmas": ["...","..."], "named_entities": ["...","..."] } .
t_s_spot is a keyword spotting file formatted as JSON. It is computed with the FuzzyWuzzy and the ProjetSI-StationTV libraries. Every spotting result is given with a reference keyword (id and text description kw_ref), the position pos of the first character of the keyword in the transcription, the matching distance d and the text description of the spotted keyword kw_spot. A list of ≃ 415 reference keywords was used for spotting. The matching distance computed with the FuzzyWuzzy library is given as a % (a higher value means a closer match), with a detection threshold fixed at 90%. The final organization of a t_s_spot file is:
{ "file": "t_s_audio", "keywords": [ [ id, "kw_ref", pos, d, "kw_spot" ], [ id, "kw_ref", pos, d, "kw_spot" ] ] } .
The index files with the CV methods are still in the queue ...
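
The two JSON layouts above can be read with Python's standard library, as in the following minimal sketch. The sample contents, the keyword id and the helper names are invented for illustration. The code also assumes that pos is a zero-based character offset into the text field and that d is stored as a number between 0 and 100, neither of which is stated in the description.

```python
import json
from typing import NamedTuple


class Spot(NamedTuple):
    """One entry of the "keywords" list: [id, kw_ref, pos, d, kw_spot]."""
    kw_id: int
    kw_ref: str
    pos: int
    d: float
    kw_spot: str


def load_transcript(json_text):
    """Return (file, text, lemmas, named_entities) from a t_s_transcript file."""
    data = json.loads(json_text)
    return data["file"], data["text"], data["lemmas"], data["named_entities"]


def load_spots(json_text):
    """Return (file, list of Spot) from a t_s_spot file."""
    data = json.loads(json_text)
    return data["file"], [Spot(*entry) for entry in data["keywords"]]


def spotted_text(text, spot):
    """Slice the transcript at the spotted keyword (assumes zero-based pos)."""
    return text[spot.pos:spot.pos + len(spot.kw_spot)]


# Hypothetical file contents, written for this example only.
transcript_json = json.dumps({
    "file": "20220211_105000_audio",
    "text": "Le taux d'abstention reste élevé.",
    "lemmas": ["le", "taux", "de", "abstention", "rester", "élevé"],
    "named_entities": [],
})
spot_json = json.dumps({
    "file": "20220211_105000_audio",
    "keywords": [[3, "abstention", 10, 100, "abstention"]],
})

_, text, lemmas, entities = load_transcript(transcript_json)
_, spots = load_spots(spot_json)
matches = [spotted_text(text, spot) for spot in spots]
```

Comparing each slice with kw_spot on a few real files is a quick way to confirm or correct the assumption about pos.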

The archive of the index (≃ 216 MB) and the list of reference keywords can be accessed at the following links: index, keywords.

Please cite the following paper [1] if you use this dataset.

[1] F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.
