| Task | Tool | Impact on Data Integrity |
|---|---|---|
| Harvesting data | Python script; Twitter Search API | Moderate. Shortly before the dataset was harvested, Twitter expanded the number of characters permitted in a tweet from 140 to 280. Depending on the API endpoint used, the full tweet was returned in some cases and a truncated version in others (i.e. compatibility mode, the default for the REST APIs). The Python script I used returned tweets in compatibility mode, truncating any characters beyond the 140-character limit. After discovering the error during the data pre-processing stage, I decided to proceed with the existing dataset rather than undertake the work of troubleshooting the updated script and re-collecting the data. I felt that the dataset collected by the script was still preferable to the Hawksey TAGS spreadsheet, which, in spite of returning the full text of each tweet, captured only a small percentage (~10%) of the actual dataset. |
| Convert JSON to CSV | Python script | Low. The script selects specific fields from the JSON data and writes them out encoded as UTF-8. There should be no change to the data itself. |
| Remove URLs | Python script | Low. I used Python's regular expression module to substitute URLs with null values. I tested the expression on a small subset of tweets; the only case in which data destruction may have occurred is where there was no space between the URL and the following word, a risk judged to be minimal. |
| Unescape HTML entities | Python script | Low. HTML entities such as &amp;amp; were converted back to human-readable punctuation so that text analytics tools would not unintentionally read them as words (the tools ignore punctuation marks). |
| Remove duplicates | Microsoft Excel | Low. Because some senders used both hashtags in their tweets, it was necessary to delete overlapping tweets (for the Gephi data only). |
| Remove truncated endings | Notepad++ | Medium. The Twitter API's compatibility mode returns a truncated tweet with the Unicode code point "\u2026" (…) appended. A regular expression was used to remove this character and any non-whitespace characters preceding it (so as to avoid leaving partial words). Complete words directly followed by the character would also have been removed. |
| Replace Unicode code point values with NULL or whitespace | Notepad++ | Moderate. I researched the issue of escaping or converting code point values exhaustively but was unable to find an optimal solution given my level of programming knowledge. The inelegant (and time-consuming) solution of using find-and-replace in Notepad++ led to some Latin characters being undesirably altered (e.g. "naïve" becomes "na ve"). |
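The truncation problem described in the first row can be mitigated at harvest time by preferring an untruncated text field whenever the payload includes one. A minimal sketch, assuming Twitter v1.1-style tweet objects; the field names `extended_tweet`, `full_text`, and `text` follow Twitter's compatibility-mode payloads, but their presence in any given response is an assumption:

```python
def get_full_text(tweet: dict) -> str:
    """Return the longest text available for a v1.1-style tweet object."""
    # Streaming-style payloads nest the untruncated text here.
    extended = tweet.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    # REST endpoints called with tweet_mode=extended use a top-level field.
    if "full_text" in tweet:
        return tweet["full_text"]
    # Fall back to the (possibly truncated) compatibility-mode text.
    return tweet["text"]
```

Had the harvesting script used a helper like this (and requested `tweet_mode=extended` on REST calls), the truncated-ending cleanup step later in the table would have been unnecessary.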
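The JSON-to-CSV conversion step can be sketched as below. The field names in `FIELDS` are illustrative assumptions, not the ones the original script used; the point is that only selected fields are copied and the output is written as UTF-8, so the data values themselves pass through unchanged:

```python
import csv
import json

# Illustrative field selection; the original script's fields may differ.
FIELDS = ["id_str", "created_at", "text"]

def tweets_to_csv(json_path: str, csv_path: str) -> None:
    """Copy selected fields from a JSON array of tweet objects into a UTF-8 CSV."""
    with open(json_path, encoding="utf-8") as src:
        tweets = json.load(src)  # assumes one JSON array of tweet objects
    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for tweet in tweets:
            writer.writerow({f: tweet.get(f, "") for f in FIELDS})
```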
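Three of the text-cleaning rows (removing URLs, unescaping HTML entities, and stripping the "\u2026" truncation marker) can be combined into one pass. A sketch using only the standard library; the exact regular expressions are assumptions, not the ones used in the original workflow:

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")   # assumes URLs begin with http(s)://
TRUNC_RE = re.compile(r"\S*\u2026")    # word fragment + horizontal ellipsis

def clean_tweet(text: str) -> str:
    """Strip URLs, unescape HTML entities, and drop truncated word endings."""
    text = URL_RE.sub("", text)        # remove URLs
    text = html.unescape(text)         # e.g. "&amp;" -> "&"
    text = TRUNC_RE.sub("", text)      # remove the ellipsis and the partial word before it
    return text.strip()
```

As the table notes, the ellipsis rule also removes a complete word that happens to directly precede the marker; that trade-off is accepted here to avoid leaving word fragments that would pollute token counts.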
Data Provenance
The data don't speak for themselves: they must be cleaned, shaped, and interpreted, leaving open the possibility that some dimensions get lost in translation. The table at left describes the pre-processing tasks performed on the dataset, the tools used, and the potential impact on the data's integrity.
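The de-duplication step performed in Excel has a straightforward scripted equivalent, which would make the provenance of that step reproducible. A sketch assuming each tweet object carries an `id_str` field (the standard Twitter identifier); keeping the first occurrence of each ID removes the overlap caused by tweets containing both hashtags:

```python
def dedupe(tweets: list[dict]) -> list[dict]:
    """Keep the first occurrence of each tweet ID, preserving order."""
    seen: set[str] = set()
    unique = []
    for tweet in tweets:
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            unique.append(tweet)
    return unique
```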
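The manual find-and-replace that mangled accented characters ("naïve" becoming "na ve") could be avoided with a category-based filter: normalize to NFC so accented Latin letters stay composed, then drop only characters in Unicode categories that cover emoji and other symbols. This is a sketch of an alternative approach, not the method used in the original workflow, and the choice of categories to drop is an assumption:

```python
import unicodedata

# Categories dropped: So (symbol, other: most emoji), plus unassigned,
# private-use, and surrogate code points.
DROP_CATEGORIES = {"So", "Cn", "Co", "Cs"}

def strip_symbols(text: str) -> str:
    """Remove symbol-like code points while preserving accented letters."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in DROP_CATEGORIES
    )
```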