ShefCE: A Cantonese-English Bilingual Speech Corpus

ShefCE is a Cantonese-English bilingual parallel speech corpus recorded by L2 English learners in Hong Kong. Thirty-one undergraduate and postgraduate students in Hong Kong, aged 20-30, were recruited to record the 25-hour corpus (12 hours in Cantonese and 13 hours in English). Details can be found in [1].

An online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments. The repository can be accessed with DOI:10.15131/


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


The corpus is available free of charge for academic research, teaching and non-commercial use. All ShefCE data are derived from the speech of the recording participants. To protect their privacy, access to the data is strictly limited to research purposes. The licensed data cannot be re-distributed in any format.

The data must be requested by a member of academic faculty. Prospective users should sign a Data Request Form and give due credit to the published work [1] when using the data.

[1] Raymond W. M. Ng, Alvin C. M. Kwan, Tan Lee and Thomas Hain, “ShefCE: A Cantonese-English bilingual speech corpus for pronunciation assessment”, in Proc. the 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

Upon receiving and validating the completed Data Request Form, we will issue a time-limited password by email, with which the user can download the corpus from this link.


