The homeService Corpus

The homeService corpus is a new English speech database gathered as part of the homeService project. The homeService project is the impact showcase for the UK EPSRC Programme Grant project Natural Speech Technology (NST), a collaboration between the Universities of Edinburgh, Cambridge and Sheffield. It is concerned with how speech technology can be of use to people with speech disorders and restricted upper-limb mobility.

The majority of the homeService corpus is recorded in real home environments, where voice control is often the normal means by which users interact with their devices. The audio recorded during such interactions consists of realistic speech data from speakers with severe dysarthria.

The homeService corpus v1.1

The homeService corpus v1.1 is the second release of the audio recorded within the homeService project and it consists of audio recordings of dysarthric speech from 5 different subjects (three male, two female).

Speaker	Type of data	Vocabulary	Number of interactions	Duration	Annotated
F01	ER01train	32	97	2'19"	yes
F02	ER01train	31	314	11'58"	yes
F02	ID01train	32	364	30'02"	yes
F02	ID01test	20	143	9'58"	yes
M01	ER01train	31	230	6'34"	yes
M02	ER01train	31	130	3'16"	yes
M02	ID01test	40	1571	1h44'44"	yes
M02	ID01train	47	5807	6h29'40"	yes
M03	ER01train	12	114	2'47"	yes
M03	ID01train	25	472	36'41"	yes
M03	ID01test	14	133	11'05"	yes
TOTAL		131	9360	10h07'32"

The homeService corpus v1.0

The homeService corpus v1.0 is the first release of the audio recorded within the homeService project and it consists of audio recordings of dysarthric speech from 5 different subjects (three male, two female).

Speaker	Type of data	Vocabulary	Number of interactions	Duration	Annotated
F01	ER01train	32	97	2'19"	yes
F02	ER01train	31	314	11'58"	yes
F02	ID01train	30	314	25'52"	yes
F02	ID01test	16	85	5'40"	yes
M01	ER01train	31	230	6'34"	yes
M02	ER01train	31	130	3'16"	yes
M02	ID01test	40	1571	1h44'44"	yes
M02	ID01train	47	5807	6h29'40"	yes
M03	ER01train	12	114	2'47"	yes
M03	ID01train	18	169	11'26"	yes
M03	ID01test	11	36	3'00"	yes
TOTAL		131	8867	9h27'20"	yes

Each subject’s set is composed of two subsets: enrolment data (ER) and interaction data (ID).

ER is obtained by the user reading lists of the words that they have chosen as commands in their system. To match the acoustic conditions in the user’s home, the recording takes place in the same environment in which the system is intended to operate. Because the user is reading from a list, the resulting speech is less natural, but it is still effective for initial training.
ID is recorded as the user operates the electronic devices in their home through the homeService speech-enabled interface. Recording starts when the user presses a switch, and the microphone stays open for a predefined number of seconds. In contrast with the ER data, each word is chosen autonomously by the user.

Project team

Mauro Nicolao, Heidi Christensen, Stuart Cunningham, Phil Green, Thomas Hain

Data example

Annotation

Annotation is provided in HTK STM format, one segment per line:

Filename Mic SpeakerID startTime endTime <Mic,SesId,Lang,Impair,level,intel,purpose> Transcription

hom-F01ER01MCW0000003000003 MC F01 0.00 2.55 <MC,ER01,GBEng,CP,SE,LL,a55,ER01train> delete
hom-M02ID01MC20150309104753 MC M02 0.00 3.00 <MC,ID01,GBEng,MND,MO,MM,a75,ID01train> skysportone
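As a rough illustration of how such lines might be read, the following Python sketch splits an annotation line into its fields. It assumes whitespace-separated fields and a single angle-bracketed, comma-separated label field with no internal spaces, as in the two examples above. The names `Segment` and `parse_stm_line` are hypothetical and not part of the corpus distribution. The labels are kept as a plain list rather than mapped to the names in the header, because the example lines carry eight values while the header names seven.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One annotation line (hypothetical container, not a corpus API)."""
    filename: str
    mic: str
    speaker: str
    start: float          # segment start time, as written in the file
    end: float            # segment end time, as written in the file
    labels: List[str]     # raw values from the <...> field, in file order
    transcription: str


def parse_stm_line(line: str) -> Segment:
    # Split into at most 7 parts so a multi-word transcription stays intact.
    parts = line.strip().split(None, 6)
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}")
    filename, mic, speaker, start, end, label_field = parts[:6]
    if not (label_field.startswith("<") and label_field.endswith(">")):
        raise ValueError(f"label field not enclosed in <>: {label_field!r}")
    labels = label_field[1:-1].split(",")
    # Assumption: a line without a transcription is allowed and maps to "".
    transcription = parts[6] if len(parts) == 7 else ""
    return Segment(filename, mic, speaker, float(start), float(end),
                   labels, transcription)


if __name__ == "__main__":
    example = ("hom-F01ER01MCW0000003000003 MC F01 0.00 2.55 "
               "<MC,ER01,GBEng,CP,SE,LL,a55,ER01train> delete")
    seg = parse_stm_line(example)
    print(seg.speaker, seg.labels[-1], seg.transcription)
```

In both examples the last label matches the "Type of data" column of the tables above, so it may be a convenient key for separating training from test material.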

Audio

Audio data is provided in the standard MS-WAVE format: mono, 16 kHz sampling rate, 16-bit samples. It was recorded with a 6-channel Microcone microphone array at a 48 kHz sampling rate and 32-bit resolution (these streams exist but are not distributed in the current release). The 16 kHz signal is the beamformed combination of the six channels, produced by the beamforming embedded in the Microcone hardware.

All audio (ER and ID) was recorded in a real home environment.
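A quick sanity check of a downloaded file against the stated release format (mono, 16 kHz, 16 bit) could look like the sketch below, which uses only Python's standard `wave` module. The function names `describe_wav` and `matches_release_format` are hypothetical helpers, and the file name in the usage example is a placeholder rather than an actual corpus path.

```python
import wave

# Release format as stated above: mono, 16 kHz, 16-bit (2-byte) samples.
EXPECTED_CHANNELS = 1
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2


def describe_wav(path):
    """Read the header of a WAV file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate_hz = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
    return {
        "channels": channels,
        "rate_hz": rate_hz,
        "sample_width_bytes": width,
        "duration_s": frames / rate_hz,
    }


def matches_release_format(info):
    """True if the properties agree with the format described above."""
    return (info["channels"] == EXPECTED_CHANNELS
            and info["rate_hz"] == EXPECTED_RATE_HZ
            and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES)


if __name__ == "__main__":
    import sys
    # Usage: python check_wav.py some_file.wav  (placeholder file name)
    info = describe_wav(sys.argv[1])
    print(info, "OK" if matches_release_format(info) else "unexpected format")
```

The reported duration can also be compared with the start and end times given in the corresponding annotation line.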

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

An agreement with the University of Sheffield must be signed before the data can be used.

Due to the sensitive nature of the data and the obligation to protect participant confidentiality, the audio of the homeService corpus cannot be redistributed under any circumstances.

Download

To download the homeService corpus please send a request to homeservice-group@sheffield.ac.uk

Figshare link

Download page

Citation

M. Nicolao, H. Christensen, S. Cunningham, P. Green, and T. Hain, The homeService corpus v. 1.0, University of Sheffield at http://mini.dcs.shef.ac.uk/resources/homeservice-corpus, 2016, doi: 10.15131/shef.data.3116833
