Grouping Spanish-speaking countries by dialect: A corpus dialectometric approach
Abstract
The present study attempts to cluster Spanish-speaking countries into dialect regions by computational means. The frequencies of 592 lexical and grammatical features for 21 countries were obtained from the Corpus del Español-Web Dialects. Principal components analysis and hierarchical clustering were then applied to the resulting data to group countries into dialect regions. A number of algorithms were used to rank features by how much they aided dialect classification, which allowed grouping based on a smaller set of features.
Six dialect zones were identified: European (Spain), Southern Cone (Uruguay, Argentina), Southern Central America (Costa Rica, Panama), Caribbean (Puerto Rico, Dominican Republic), Northern Central America (Nicaragua, El Salvador, Guatemala, Honduras), and Andean South America (Bolivia, Paraguay, Chile, Peru). However, different subsets of features and different clustering algorithms produced groupings that varied somewhat. The bulk of the variation concerned where Cuba, Ecuador, Mexico, Venezuela, Colombia, and the US fit into the dialect regions.
The difficulties of the computational approach to dialect classification are discussed. Allowing computer algorithms to determine dialect boundaries appears objective. However, interpreting a principal components analysis entails a degree of subjectivity. Furthermore, the plethora of different classification algorithms allows the researcher to choose the one that produces the desired outcome.
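The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the study's actual pipeline: the region labels and feature frequencies below are invented placeholders, the choice of z-scored features, Euclidean distance, and average linkage is an assumption (the abstract does not name a distance or linkage), and the principal components and feature-ranking steps are left out. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    # Guard against constant features, whose standard deviation is 0.
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(labels, rows, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    points = dict(zip(labels, zscore_columns(rows)))
    clusters = [[label] for label in labels]

    def dist(c1, c2):
        # Average linkage: mean pairwise distance between cluster members.
        return mean(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return sorted(sorted(c) for c in clusters)


# Invented per-region frequencies for three hypothetical features.
toy = {
    "A1": [9.0, 1.0, 5.0],
    "A2": [8.5, 1.2, 5.1],
    "B1": [2.0, 7.0, 4.9],
    "B2": [2.2, 7.5, 5.2],
    "C1": [5.0, 4.0, 0.5],
}
groups = average_linkage(list(toy), list(toy.values()), n_clusters=3)
print(groups)
```

Swapping the linkage rule, the distance measure, or the requested number of clusters can change the resulting groups, which is one concrete form of the researcher's freedom of choice that the abstract points to.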
Keywords
dialectometry, Spanish dialects, corpus approach, statistical analysis
Copyright (c) 2022 David Ellingson Eddington
This work is licensed under a Creative Commons Attribution 4.0 International License.