II.2.2 Converting Raw Communication Data to Discrete Mathematical Values
The communication archives need to be converted into variables suitable as input for statistics and machine learning. These variables can be social network analysis (SNA) metrics such as degree or betweenness centrality (see section II.3), or other signals extracted from content and body language. They can be single values attached to an individual that describe, for example, the individual’s network position through betweenness centrality. They can also be values computed from a time series, for example the total number of oscillations in betweenness centrality in a time interval. Finally, they can be time series attached to an individual, for example the betweenness value of the individual calculated for each day’s e-mail network.
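The three kinds of variables can be illustrated with a minimal sketch. It makes several assumptions that are not part of the text above: messages are given as hypothetical (day, sender, recipient) tuples, degree centrality stands in for betweenness centrality because it needs no shortest-path computation, and an "oscillation" is counted as a change of direction in the series. The function names are illustrative only.

```python
from collections import defaultdict


def daily_degree_centrality(messages, person):
    """Return one normalized degree value per day for `person`.

    `messages` is an iterable of (day, sender, recipient) tuples.
    Degree centrality is used as a simple stand-in for betweenness.
    """
    neighbours = defaultdict(lambda: defaultdict(set))
    for day, sender, recipient in messages:
        if sender == recipient:
            continue  # ignore messages to oneself
        neighbours[day][sender].add(recipient)
        neighbours[day][recipient].add(sender)

    series = []
    for day in sorted(neighbours):
        graph = neighbours[day]
        n = len(graph)  # people active in that day's network
        degree = len(graph.get(person, ()))
        series.append(degree / (n - 1) if n > 1 else 0.0)
    return series


def count_oscillations(series):
    """Count direction changes (rise to fall or fall to rise) in a series."""
    changes = 0
    previous = 0
    for a, b in zip(series, series[1:]):
        direction = (b > a) - (b < a)
        if direction == 0:
            continue  # flat steps do not change direction
        if previous != 0 and direction != previous:
            changes += 1
        previous = direction
    return changes


# Hypothetical three-day e-mail log
messages = [
    (1, "ann", "bob"), (1, "ann", "cat"), (1, "bob", "cat"),
    (2, "ann", "bob"), (2, "cat", "dan"),
    (3, "ann", "bob"), (3, "ann", "cat"), (3, "ann", "dan"),
]
series = daily_degree_centrality(messages, "ann")  # time series per individual
single_value = sum(series) / len(series)           # single value per individual
oscillations = count_oscillations(series)          # value computed from the series
```

Here `series` is the time series attached to the individual, `single_value` collapses it into one number, and `oscillations` is a variable derived from the series.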
The same is true for analyzing content, where word vectors can be calculated, for instance, through tf/idf (term frequency/inverse document frequency). Tf/idf weights the frequency of a word within a document (e.g., an e-mail message) against the number of documents in the entire collection (e.g., the entire e-mail archive) that contain the word, so that words that are frequent in one message but rare in the archive receive the highest scores. For more sophisticated analyses, word embeddings can be calculated, for example using word2vec, which are learned from the probability distributions of words and n-grams in large document collections. N-grams are sequences of words: unigrams are single words, bigrams are two words in sequence, trigrams are three words in sequence, etc. This approach is used, for example, to calculate word embeddings for the tribes of tribefinder (see section II.6).
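As a rough illustration of tf/idf and n-grams, the following sketch uses one common variant of the weighting (relative term frequency multiplied by the natural logarithm of the number of documents divided by the number of documents containing the term). The whitespace tokenization, the function names, and the sample messages are assumptions made for the example. Other variants of the formula exist.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical miniature e-mail archive
archive = [
    "the budget meeting is today",
    "the project deadline is friday",
    "the budget is approved",
]
vectors = tf_idf(archive)
bigrams = ngrams(archive[0].split(), 2)
```

Words that occur in every message, such as "the", receive a weight of zero, while a word that occurs in only one message, such as "meeting", is weighted more heavily than "budget", which occurs in two.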
To convert electrical signals into time series, for instance from sound files, from brainwave scans, or from measuring the action potential of plants with the plant spikerbox, various approaches can be used. The simplest method is to calculate average values per time interval, for example per second. Once two signals are available as time series, the Euclidean distance between them can be calculated to measure their similarity. A more differentiated approach is to compute MFCCs (mel-frequency cepstral coefficients) by doing a Fourier transformation, mapping the resulting spectrum to the mel scale of perceptually evenly spaced pitches, taking the logarithm of the energy in each mel band, and then doing a discrete cosine transformation, which gives a set of discrete coefficients for each time frame. This means that the sound wave or electrical signal is transformed into a series of discrete values per time unit.
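The three approaches can be sketched in plain Python as follows. This is a simplified illustration and not a production implementation: the MFCC function processes a single frame with a naive Fourier transformation, uses triangular mel filters and a logarithm step as is customary, and omits windowing and pre-emphasis. The function names, the number of filters, and the number of coefficients are assumptions chosen for the example.

```python
import cmath
import math


def mean_per_interval(samples, samples_per_interval):
    """Average consecutive chunks of a signal, e.g. one chunk per second."""
    means = []
    for i in range(0, len(samples), samples_per_interval):
        chunk = samples[i:i + samples_per_interval]
        means.append(sum(chunk) / len(chunk))
    return means


def euclidean_distance(a, b):
    """Euclidean distance between two time series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate, n_filters=10, n_coeffs=5):
    """Compute MFCCs for one frame of samples (simplified sketch)."""
    n = len(frame)
    # 1. Fourier transformation: power spectrum for bins 0 .. n/2
    power = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        power.append(abs(x_k) ** 2)

    # 2. Triangular filters whose edges are evenly spaced on the mel scale
    mel_max = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    log_energies = []
    for m in range(1, n_filters + 1):
        lo, centre, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * sample_rate / n  # frequency of bin k in Hz
            if lo < f <= centre:
                energy += p * (f - lo) / (centre - lo)
            elif centre < f < hi:
                energy += p * (hi - f) / (hi - centre)
        # 3. Logarithm of the band energy (small offset avoids log(0))
        log_energies.append(math.log(energy + 1e-10))

    # 4. Discrete cosine transformation (DCT-II) of the log energies
    return [
        sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
            for m, e in enumerate(log_energies))
        for c in range(n_coeffs)
    ]


def tone(hz, sample_rate=8000, n=64):
    """Hypothetical test signal: a pure sine tone."""
    return [math.sin(2 * math.pi * hz * t / sample_rate) for t in range(n)]


low = mfcc_frame(tone(1000), 8000)
high = mfcc_frame(tone(3000), 8000)
difference = euclidean_distance(low, high)
```

Applying `mfcc_frame` to successive frames of a recording would yield one short vector of coefficients per frame, that is, a series of discrete values per time unit. In practice an audio library would normally be used for this step.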