Models of social interaction and data analysis of online user behavior during polarization processes

Thesis:

Models of social interaction and data analysis of online user behavior during polarization processes

Author: MARTÍN GUTIÉRREZ, Samuel

Title: Models of social interaction and data analysis of online user behavior during polarization processes

Date: 2020

Subject: No subject defined

School: E.T.S. DE INGENIERÍA AGRONÓMICA, ALIMENTARIA Y DE BIOSISTEMAS

Departments: INGENIERIA FORESTAL

Electronic access: http://oa.upm.es/66135/

Supervisor 1: BENITO ZAFRILLA, Rosa María

Abstract: The main goal of this thesis is to improve our understanding of collective human behavior, both at the level of elementary social interactions and at the level of the emergent properties that arise from these interactions, placing special emphasis on the study of social polarization, which seems to be pervasive in the current zeitgeist. To this end, we analyze large quantities of data with both existing and new techniques and develop mathematical models and simulations. We have developed three models of social interaction that relate individual activity (the number of actions of an individual, A) to the collective response of social systems (the number of reactions triggered in the other members of the system, R). That relationship is characterized by means of the probability distribution of the efficiency metric, defined as η = R/A. Generalizing previous results, we show that the efficiency distribution presents a universal structure in three systems of different nature: Twitter, Wikipedia and the scientific citation network. The models explain that universality and provide a description of the underlying social dynamics of each system. Once the elementary social dynamics have been characterized, we address the study of polarization, a process of social division that results in the majority of the population holding extreme and opposing opinions, with few individuals remaining neutral. Empirical data is at the core of this work and Twitter is our main data source, so we begin with a study of user behavior in electoral contexts, since these are some of the most common scenarios where polarization may emerge. The objective is to find regularities in the behavioral and communication patterns among the users by analyzing the evolution of their activity and interaction networks.
This study helps verify the stability of interaction dynamics and establish baselines that may allow the detection of fluctuations and irregularities, which could indicate changes in the social landscape. In order to estimate the opinions of the users with Sentiment Analysis techniques, we develop a methodology to semi-automatically build a training set of tweets in a polarized context. First, we compile lists of Twitter users with known affiliation to one of the poles of a given scenario (a political party in an electoral context, one of the teams in a football match, a certain brand in a commercial competition, etc.). Then, every tweet sent by a user associated with a given pole is labeled as positive if it mentions users of the same pole and as negative if it mentions users from different poles. The methodology is tested by comparing the automatically built training sets with manually labeled reference datasets. Separately, we have explored an alternative paradigm of opinion inference that is based not on the content of the messages but on the network of social links woven by the interactions. We have also taken into consideration that the behavior of many social contexts cannot be fully understood by simplifying them as bipolar systems (for example, multi-party elections), and that they must be treated as multipolar systems. Therefore, we have developed a general opinion inference technique that is valid not only for bipolar systems but also for systems with an arbitrary number of opinion poles. Additionally, we have developed methodological tools to measure and characterize the polarization patterns embedded in the multipolar opinion distributions. This methodology has been applied to five real-world scenarios with two, three, four and five poles, finding clear connections between the opinion distributions and the underlying sociological contexts.
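The labeling rule described above for building the training set can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_tweet`, the `pole_of` mapping and the user names are hypothetical, and leaving tweets unlabeled when they mention both the author's pole and another pole is a choice made for this sketch, not something the abstract specifies.

```python
def label_tweet(author, mentions, pole_of):
    """Label one tweet from the author's pole and the poles of mentioned users.

    Returns "positive", "negative", or None when the rule gives no label.
    """
    author_pole = pole_of.get(author)
    if author_pole is None:
        return None  # author has no known affiliation
    # Poles of the mentioned users that appear in the curated lists.
    mentioned_poles = {pole_of[m] for m in mentions if m in pole_of}
    if not mentioned_poles:
        return None  # no mention of a user with known affiliation
    if mentioned_poles == {author_pole}:
        return "positive"  # only users of the author's own pole
    if author_pole not in mentioned_poles:
        return "negative"  # only users of other poles
    return None  # mixed mentions: left unlabeled (assumption of this sketch)


# Hypothetical curated lists: user -> pole.
pole_of = {"alice": "A", "bob": "A", "carol": "B"}

labels = [
    label_tweet("alice", ["bob"], pole_of),           # same pole
    label_tweet("alice", ["carol"], pole_of),         # different pole
    label_tweet("alice", ["bob", "carol"], pole_of),  # mixed
]
```

The resulting labeled tweets could then serve as training data for a sentiment classifier, which is the role the abstract assigns to the semi-automatically built training set.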
The multidimensional nature of opinion is also manifested in the fact that attitudes are often interlinked with other social dimensions. For example, in territorial conflicts language tends to play a critical role and is used in different ways by the different poles. We have studied this phenomenon in a Twitter conversation around the independence of the Spanish region of Catalonia. In particular, we have analyzed the relationship between ideology and language by combining an opinion index computed using the methodology described above with a language index calculated by analyzing the relative use of Catalan and Spanish in the tweets. Finally, we present two network metrics that quantify the polarization of the topology of a social network, although they have potential applications in many other fields: the Network Variance and the Network Covariance. These measures generalize the usual notions of variance and covariance, elementary statistical tools applied in Euclidean spaces, to arbitrary metric spaces and, in particular, to networks, because a metric space can be built from a set of nodes using an appropriate distance measure between them. We illustrate the usefulness of the Network (Co)variance by characterizing the relationship between two networks of mathematical knowledge: the functional network (which encodes how mathematical ideas are used in scientific papers) and the structural network (which encodes how ideas are conceptually related).
----------SUMMARY----------
The ultimate goal of this thesis is to broaden our knowledge of collective human behavior, both at the level of elementary social interactions and at the level of the emergent properties that arise from these interactions, with special emphasis on social polarization, which seems omnipresent in the current context. To this end, we analyze large amounts of data using both existing and original techniques, and we develop mathematical models and simulations.
We have developed three models of social interaction that relate individual activity (the number of actions of an individual, A) to the collective response of a social system (the number of reactions triggered in other members of the system, R). This relationship is characterized by the probability distribution of the efficiency, a metric defined as η = R/A. Generalizing previous results, we show that the efficiency distribution has a universal structure in three systems of different nature: Twitter, Wikipedia and the scientific citation network. The models allow us to explain this universality and, in addition, provide a description of the underlying social dynamics of each system. Once the elementary social dynamics have been characterized, we address the study of polarization, a process of social division that leads the majority of the population to adopt extreme and opposing opinions, with few individuals in a neutral position. We begin with a study of the behavior of Twitter users in electoral contexts, since polarization frequently arises in these scenarios. The objective is to find regularities in the behavioral and communication patterns of the users by analyzing the temporal evolution of their activity and of their interaction networks. This study helps verify the stability of interaction dynamics and establish baselines that could allow the detection of behavioral fluctuations and irregularities and, in turn, of variations in the social landscape. In order to estimate the opinions of the users with Sentiment Analysis techniques, we develop a methodology for building training datasets semi-automatically in polarized contexts.
First, we compile lists of Twitter users whose affiliation to one of the poles is known (the poles would be political parties in an electoral context, or the teams in a football match). Tweets sent by a user associated with a pole are labeled as positive if they mention users of the same pole and as negative if they mention users of other poles. We have evaluated this methodology by comparing the automatically built training sets with manually labeled reference corpora. In addition, we have explored an alternative paradigm of opinion inference that is based not on the content of the messages exchanged between users but on the network of social contacts woven through these interactions. We have also taken into account that the behavior of many social contexts cannot be understood by reducing them to bipolar systems (for example, elections with more than two parties), and that they must be treated as multipolar systems. We have therefore developed a general opinion inference technique that is valid not only for bipolar systems but also for systems with an arbitrary number of poles. This methodology has been applied to five real-world scenarios with two, three, four and five poles, finding clear connections between the opinion distributions and the underlying sociological context. The multidimensional character of opinion is also manifested in the fact that opinions are in many cases strongly connected with other social dimensions. For example, in territorial conflicts language tends to play a critical role, and each pole usually shows a clearly distinct use of language. We have studied this phenomenon in a Twitter conversation about the independence of Catalonia.
To do so, we have analyzed the relationship between ideology and language use by combining an opinion index computed with the methodology described above with a language index that captures the relative use of Catalan and Spanish in the tweets. Finally, we present two metrics that quantify the polarization of the topology of a social network, although their applicability extends to many other areas of Network Science: the Network Variance and the Network Covariance. These measures generalize the notions of variance and covariance, elementary statistical tools in Euclidean spaces, to arbitrary metric spaces and, in particular, to networks, since a metric space can be defined on the nodes of a network using an appropriate measure of distance between the nodes. We illustrate the usefulness of the Network (Co)variance by characterizing the relationship between two networks that structure mathematical knowledge: the functional network (which encodes the use of mathematical concepts in scientific papers) and the structural network (which encodes the conceptual relationships).
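
The idea of extending variance from Euclidean spaces to networks can be illustrated with a small Python sketch. One standard way to define variance in a metric space is half the expected squared distance between two independent draws from a distribution, which reduces to the ordinary variance on the real line. The sketch below applies that form with shortest-path distances on an unweighted, connected graph. The choice of distance, the function names and the example graph are assumptions made for illustration; the abstract does not state the exact definition used in the thesis.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def network_variance(adj, p):
    """Return 0.5 * sum_ij p[i] * p[j] * d(i, j)**2 for a distribution p on nodes.

    Assumes a connected graph so that every distance d(i, j) is defined.
    """
    total = 0.0
    for i, p_i in p.items():
        if p_i == 0:
            continue
        dist = shortest_path_lengths(adj, i)
        for j, p_j in p.items():
            total += p_i * p_j * dist[j] ** 2
    return 0.5 * total


# Hypothetical path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

consensus = network_variance(adj, {"b": 1.0})            # all mass on one node
polarized = network_variance(adj, {"a": 0.5, "d": 0.5})  # mass split between the ends
uniform = network_variance(adj, {n: 0.25 for n in adj})  # mass spread evenly
```

On this path graph the values coincide with the ordinary variance of points placed at positions 0, 1, 2 and 3 on a line, and the distribution split between the two ends has the largest value, which matches the intuition of a polarization measure.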