Shades of #MAGA Social Network Analysis Textual Analysis Summary Data Provenance
Task | Tool | Impact on Data Integrity
Harvesting data | Python script; Twitter Search API | Moderate
Shortly before the dataset was harvested, Twitter raised the maximum length of a tweet from 140 to 280 characters. Depending on the API endpoint used, the full tweet was returned in some cases and a truncated version in others (i.e. compatibility mode, the default for REST APIs). The Python script I used returned tweets in compatibility mode, truncating everything after the 140-character limit.

After discovering the error during the data pre-processing stage, I decided to proceed with the existing dataset, given the effort that would have been required to troubleshoot an updated script and re-collect the data. I felt that the dataset collected by the script was still preferable to using the Hawksey TAGS spreadsheet, which, in spite of returning the full text of each tweet, captured only a small percentage (~10%) of the actual dataset.
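The difference between the two modes can be illustrated with a minimal sketch. It assumes tweets already parsed from JSON into dictionaries and relies on the field names used by the v1.1 API (`text`, `truncated`, `full_text`, `extended_tweet`); the helper names `get_tweet_text` and `is_truncated` are hypothetical and are not taken from the script described above.

```python
def get_tweet_text(tweet):
    """Return the longest text available in a parsed tweet dictionary."""
    # Extended mode on the REST endpoints stores the whole tweet in "full_text".
    if "full_text" in tweet:
        return tweet["full_text"]
    # Streaming payloads nest the whole tweet under "extended_tweet".
    extended = tweet.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    # Compatibility mode: "text" may be cut short and end in an ellipsis.
    return tweet.get("text", "")


def is_truncated(tweet):
    """True when only the shortened compatibility-mode text is available."""
    has_full_text = "full_text" in tweet or "extended_tweet" in tweet
    return bool(tweet.get("truncated", False)) and not has_full_text
```

Counting the tweets for which `is_truncated` returns true would give a rough measure of how much of a dataset was affected by compatibility mode.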
Convert JSON to CSV | Python script | Low
The script selects specific fields from the JSON data and writes them out encoded as UTF-8. There should be no actual change to the data itself.
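A conversion of this kind might look like the following sketch. The selected fields (`id_str`, `created_at`, the user's `screen_name`, `text`) and the one-tweet-per-line input layout are assumptions made for illustration, as are the function names; the fields chosen by the original script are not listed here.

```python
import csv
import json

# Hypothetical field selection; the original script's choice is not documented.
FIELDS = ["id_str", "created_at", "screen_name", "text"]


def tweet_to_row(tweet):
    """Pick the selected fields out of one parsed tweet."""
    return {
        "id_str": tweet.get("id_str", ""),
        "created_at": tweet.get("created_at", ""),
        "screen_name": tweet.get("user", {}).get("screen_name", ""),
        "text": tweet.get("text", ""),
    }


def json_lines_to_csv(json_path, csv_path):
    """Read one JSON tweet per line and write the selected fields as UTF-8 CSV."""
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            if line.strip():  # skip blank lines between records
                writer.writerow(tweet_to_row(json.loads(line)))
```

Because the values are copied across unchanged, a step like this only changes the container format, not the text of the tweets.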
Remove URLs | Python script | Low
I used Python's regular expression module to replace URLs with null values. I tested the expression on a small subset of tweets; the only case in which data may have been destroyed is where there was no space between a URL and the word following it, a risk I judged to be minimal.
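The edge case described here is easy to reproduce. The pattern below is an assumption (the original expression is not given): it treats anything from `http://` or `https://` up to the next whitespace as a URL and replaces it with an empty string.

```python
import re

# Assumed pattern: a URL runs from the scheme to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls(text):
    """Replace every URL in the text with an empty string."""
    return URL_PATTERN.sub("", text)
```

With this pattern, a tweet such as "Read https://t.co/abc123Join us" loses the word "Join" as well, because nothing separates it from the URL.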
Unescape HTML entities | Python script | Low
HTML entities like &amp; were converted back to human-readable punctuation so that text analytics tools would not unintentionally read them as words (the tools ignore punctuation marks).
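In Python this step can be sketched with the standard library's `html.unescape`; whether the original script used this function or a different approach is not stated, and the wrapper name `unescape_entities` is hypothetical.

```python
import html


def unescape_entities(text):
    """Convert HTML entities such as &amp; back to the characters they stand for."""
    return html.unescape(text)
```

Left in place, an entity like `&amp;` would typically be tokenized as the word "amp" once the surrounding punctuation is stripped.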
Remove duplicates | Microsoft Excel | Low
Because some senders used both hashtags in their tweets, it was necessary to delete overlapping tweets (for Gephi data only).
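This step was done in Microsoft Excel, so the following is only an equivalent sketch in Python, not the procedure actually used. It assumes each row is a dictionary and that overlapping tweets can be recognized by a shared tweet ID stored under a hypothetical `id_str` key.

```python
def drop_duplicate_tweets(rows, key="id_str"):
    """Keep the first occurrence of each tweet ID and preserve the input order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # the same tweet was already collected under the other hashtag
        seen.add(row[key])
        unique.append(row)
    return unique
```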
Remove truncated endings | Notepad++ | Medium
The Twitter API's compatibility mode returns the truncated tweet with the Unicode character "\u2026" (…) at the end. A regular expression was used to remove the character and any non-whitespace characters preceding it (so as to avoid partial words). Complete words directly followed by the character would also have been removed.
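The replacement was made in Notepad++, and the expression itself is not recorded. A roughly equivalent sketch in Python, under the assumption that the pattern was "any run of non-whitespace characters followed by the ellipsis", shows both the intended effect and the side effect on complete words.

```python
import re

# Assumed pattern: zero or more non-whitespace characters, then U+2026.
TRUNCATED_TAIL = re.compile(r"\S*\u2026")


def strip_truncated_ending(text):
    """Remove the ellipsis and the (possibly partial) word attached to it."""
    return TRUNCATED_TAIL.sub("", text).rstrip()
```

Applied to "Make America gre…" this leaves "Make America", which is the intended result; applied to "Great rally today…" it leaves "Great rally", dropping the complete word "today".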
Replace Unicode code point values with NULL or whitespace | Notepad++ | Moderate
I researched the issue of escaping or converting code point values exhaustively but was unable to find an optimal solution given my level of programming knowledge. The inelegant (and time-consuming) solution of using find and replace in Notepad++ led to some Latin characters being undesirably altered (e.g. naïve becomes na ve).
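One alternative to find and replace, offered only as a sketch, is to decode the escaped values back into characters so that words like naïve survive intact. It assumes the code points appear in the text as literal `\uXXXX` sequences; the function name `decode_code_points` is hypothetical.

```python
import re

# Matches a literal backslash, the letter u, and four hexadecimal digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_code_points(text):
    """Turn literal \\uXXXX sequences into the characters they encode."""
    # Each escape becomes one UTF-16 code unit, possibly half of a surrogate pair.
    decoded = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # A round trip through UTF-16 joins surrogate pairs (e.g. emoji) into single
    # characters; an unpaired surrogate is replaced with U+FFFD.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
```

Characters that are still unwanted after decoding (emoji, for instance) could then be removed selectively rather than by blanket replacement.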

 

Data Provenance

The data don't speak for themselves: they must be cleaned, shaped and interpreted, which leaves open the possibility that some dimensions get lost in translation. The accompanying table describes the pre-processing tasks performed on the dataset, the tools used and each task's potential impact on the data's integrity.