Large-Scale TV Dataset Fact-Checking (STVD-FC)
The STVD-FC dataset addresses the fact-checking problem. It focuses on French political discourse and covers the 2022 French presidential election, over the period from the 1st of February to the 1st of May 2022. The process used to build the dataset is detailed in [1]. The dataset contains about 1,330 fact-checked claims scraped from the fact-checking service Factoscope. For the video counterpart, nearly 6,730 TV programs, representing a total duration of 6,540 hours, have been captured with a TV workstation, along with their metadata.
The dataset is composed of different parts provided as:
- a root directory for each part,
- subdirectories grouping all the daily / weekly instances of TV programs,
- subdirectories containing the files (audio/video and metadata) for each TV program,
- audio/video files encoded with the MPEG-4 format,
- TV metadata files respecting the XMLTV format,
- a global XML file containing the fact-checked claims, provided with its XML schema.
The following naming convention is applied within the dataset:
- PartX is the directory name of a part of the dataset, where X is a label 1, ..., 8,
- Hashcode is the name of a subdirectory containing all the daily / weekly instances of a TV program; it has a hexadecimal form 0a83...6b1e with a 160-bit (40-character) coding,
- ts is a timestamp naming a subdirectory that contains the audio/video and metadata files of a TV program, with the template YEAR | MONTH | DAY |_| HOURS | MINUTES | SECONDS, e.g. 20220211_105000; ts is set to the starting time of the TV program,
- the TV metadata, audio and video files follow the templates ts, ts_audio, and ts_video respectively.
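The naming convention above can be checked and used programmatically. The following is a minimal Python sketch, not part of the dataset tooling: the function names are hypothetical, the directory name is assumed to be written literally as Part1, ..., Part8, and file extensions are left out because they are not specified here.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

HASHCODE_RE = re.compile(r"[0-9a-f]{40}")  # 160 bits = 40 hexadecimal characters
TS_FORMAT = "%Y%m%d_%H%M%S"                # YEAR MONTH DAY _ HOURS MINUTES SECONDS


def parse_ts(ts: str) -> datetime:
    """Convert a ts directory name into the starting time of the TV program."""
    return datetime.strptime(ts, TS_FORMAT)


def is_hashcode(name: str) -> bool:
    """Check that a name looks like a 40-character hexadecimal hashcode."""
    return HASHCODE_RE.fullmatch(name.lower()) is not None


def program_stems(part: int, hashcode: str, ts: str) -> dict:
    """Build the extension-less paths of the files of one TV program instance."""
    if not 1 <= part <= 8:
        raise ValueError("part label must be in 1..8")
    if not is_hashcode(hashcode):
        raise ValueError("hashcode must be 40 hexadecimal characters")
    parse_ts(ts)  # raises ValueError if ts does not follow the template
    # Assumption: the part directory is named "Part" followed by its label.
    base = PurePosixPath(f"Part{part}") / hashcode / ts
    return {
        "metadata": base / ts,
        "audio": base / f"{ts}_audio",
        "video": base / f"{ts}_video",
    }
```

For instance, parse_ts("20220211_105000") yields the 11th of February 2022 at 10:50:00, which is the starting time of the corresponding TV program.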
For visualization and testing, some samples (audio/video files with related fact-checked claims) are given in the next table.
Sample | Audio | Video | Claim | Topics |
sample 1 | a_1 | v_1 | f_1 | New Caledonian, Guadeloupe |
sample 2 | a_2 | v_2 | f_2 | François-Xavier Bellamy |
sample 3 | a_3 | v_3 | f_3 | War in Ukraine |
sample 4 | a_4 | v_4 | f_4 | Abstention |
sample 5 | a_5 | v_5 | f_5 | Political battle Macron-Le Pen |
The files constituting the dataset are given below, protected with a password. The dataset is available for non-commercial research purposes. Before downloading the dataset, get the agreement (English or French version) and sign it. Then, send the scanned version to Mathieu Delalandre. After verifying your request, we will contact you with the password to unzip the dataset.
The different files constituting the dataset are given here.
We first provide the global file containing the fact-checked claims with its XML schema.
The parts 1 to 8 are given in the next table.
For better accessibility, CSV index files are provided for every part, with the format
Hashcode; Channel; Program
where Hashcode is a long (64 bits) given with a hexadecimal coding, Channel is a string corresponding to the channel name, and Program is a string corresponding to the program name,
e.g. e13d...875b; Franceinfo; Le fil info
Part | Duration (h) | Hashcodes | Index | Files | Size (GB) | Link |
1 | 815.6 h | 28 | download | 16 | 245.8 GB | download |
2 | 815.9 h | 19 | download | 16 | 246.0 GB | download |
3 | 805.5 h | 27 | download | 16 | 242.6 GB | download |
4 | 814.7 h | 21 | download | 16 | 244.7 GB | download |
5 | 812.2 h | 14 | download | 16 | 242.4 GB | download |
6 | 828.4 h | 17 | download | 16 | 249.5 GB | download |
7 | 808.4 h | 12 | download | 16 | 241.0 GB | download |
8 | 806.5 h | 13 | download | 16 | 243.8 GB | download |
Total | 6507.2 h | 151 | | 128 | 1,956 GB | |
NB. Our storage service at the UT delivers 3-16 MB/s for downloading (from a low- / high-speed connection, respectively) under concurrent access.
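The CSV index files described above can be loaded with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, fields are assumed to be separated by a semicolon with optional surrounding spaces, and a header line repeating the column names is skipped if present (whether the files contain one is not stated here).

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_index(stream: TextIO) -> Dict[str, Tuple[str, str]]:
    """Map each hashcode to its (channel, program) pair."""
    index = {}
    for row in csv.reader(stream, delimiter=";"):
        fields = [field.strip() for field in row]
        if len(fields) != 3 or not fields[0]:
            continue  # skip blank or malformed lines
        if fields[0].lower() == "hashcode":
            continue  # assumption: an optional header line may be present
        hashcode, channel, program = fields
        index[hashcode] = (channel, program)
    return index


def hashcodes_for_channel(index: Dict[str, Tuple[str, str]], channel: str) -> List[str]:
    """List the hashcodes of all programs broadcast on a given channel."""
    return sorted(h for h, (c, _) in index.items() if c == channel)


# Usage with the example line quoted above (in-memory stand-in for an index file).
demo = read_index(io.StringIO("e13d...875b; Franceinfo; Le fil info\n"))
print(hashcodes_for_channel(demo, "Franceinfo"))
```

With a real index file, the StringIO object would be replaced by an opened file; the text encoding of the files is not specified here and would need to be checked.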
To get started, the STVD-FC dataset is provided with a "hello world" index. This index gives baseline results of NLP and CV methods for a first analysis of the dataset. It follows the same naming convention as the root dataset for the directories,
i.e. \\PartX\Hashcode\ts\
where every directory contains the following index files:
- ts_transcript is an audio transcription file formatted as JSON. The transcriptions have been computed with the Whisper system using the multilingual whisper-tiny model, as a tradeoff between speed and accuracy. The DELL 5820 computer of the TV Workstation was used for processing (for ≃ a week) with an Intel Xeon W-2295 CPU / 36 threads. Our evaluation of transcript quality on a dataset sample reports ≃ 85% accuracy. Every transcript file is delivered with the lemmas and named entities detected with the spaCy library. The final organization of a ts_transcript file is:
{ "file": "ts_audio", "text": "....", "lemmas": ["...","..."], "named_entities": ["...","..."] }
- ts_spot is a keyword spotting file formatted as JSON. It is computed with the FuzzyWuzzy and the ProjetSI-StationTV libraries. Every spotting result is given with a reference keyword (id and text description kwref), the position pos of the first character of the keyword in the transcription, the matching distance d, and the text description of the spotted keyword kwspot. A list of ≃ 415 reference keywords was used for spotting. The matching distance of the FuzzyWuzzy library is given as a percentage, with a detection threshold fixed at 90%. The final organization of a ts_spot file is:
{ "file": "ts_audio", "keywords": [ [ id, "kwref", pos, d, "kwspot" ], [ id, "kwref", pos, d, "kwspot" ] ] }
- The index files for the CV methods are still in the queue ...
The archive of the index (≃ 216 MB) and the list of the reference keywords can be accessed at the following links: index, keywords.
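The two JSON layouts of the index can be read with the standard json module. The sketch below is illustrative only: the class and function names are hypothetical, id is assumed to be an integer, d is assumed to be a number expressed as a percentage, and pos is assumed to be a 0-based character offset into the "text" field of the matching transcript (the indexing origin is not stated here).

```python
import json
from typing import List, NamedTuple


class Spot(NamedTuple):
    """One keyword spotting result: [ id, "kwref", pos, d, "kwspot" ]."""
    id: int
    kwref: str
    pos: int
    d: float
    kwspot: str


def load_spots(text: str) -> List[Spot]:
    """Parse the content of a ts_spot file into a list of Spot records."""
    data = json.loads(text)
    return [Spot(*entry) for entry in data["keywords"]]


def spots_at_least(spots: List[Spot], min_d: float) -> List[Spot]:
    """Keep only the results whose matching distance d reaches min_d (in %)."""
    return [s for s in spots if s.d >= min_d]


def context(transcript: dict, spot: Spot, width: int = 40) -> str:
    """Return the transcript text around a spotted keyword.

    Assumption: spot.pos is a 0-based offset into transcript["text"].
    """
    start = max(0, spot.pos - width)
    end = spot.pos + len(spot.kwspot) + width
    return transcript["text"][start:end]
```

A typical use would be to load a ts_spot file, keep the matches above a stricter distance than the 90% detection threshold, and display each of them with its surrounding transcript text.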
Please cite the following paper [1] if you use this dataset.
[1] F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.