In v0.2 of the package, we include functionality to convert JSON files to various data frame formats. In order to use these features, we recommend the following workflow.
First, build your query using the build_query function.
require(academictwitteR)
require(tibble)
#> Loading required package: tibble
my_query <- build_query(c("#ichbinhanna", "#ichwarhanna"), place = "Berlin")
my_query
#> [1] "(#ichbinhanna OR #ichwarhanna) place:Berlin"
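To make the structure of the resulting string explicit, the same query can be composed by hand in base R. This is only a sketch of the string format shown above, not the package's implementation; compose_query is a hypothetical helper introduced for illustration.

```r
# Hypothetical helper (not part of academictwitteR): join the search terms
# with OR, wrap them in parentheses, and append the place operator.
compose_query <- function(terms, place) {
  paste0("(", paste(terms, collapse = " OR "), ") place:", place)
}

manual_query <- compose_query(c("#ichbinhanna", "#ichwarhanna"), place = "Berlin")
manual_query
#> [1] "(#ichbinhanna OR #ichwarhanna) place:Berlin"
```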
Then, use get_all_tweets to collect data. Make sure to specify data_path and set bind_tweets to FALSE.
get_all_tweets(
  query = my_query,
  start_tweets = "2021-06-01T00:00:00Z",
  end_tweets = "2021-06-20T00:00:00Z",
  n = Inf,
  data_path = "tweetdata",
  bind_tweets = FALSE
)
The first format is the so-called “vanilla” format. It is the direct output of jsonlite::read_json. Columns such as text display just fine, but some columns such as retweet_count are nested in list-columns. To extract user information, it is additionally necessary to set user = TRUE when calling the bind_tweets function. Please also note that the data frame returned in this format is not a tibble, so we first need to convert it to a tibble.
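The call that binds the stored JSON files is not shown here. A minimal sketch, assuming the files were written to tweetdata as above and that bind_tweets is used with its default output format, might look like this. The guard is an addition for illustration: it skips the call when the packages or the directory are missing.

```r
data_path <- "tweetdata"  # directory written by get_all_tweets() above

# Only run when the packages and the collected data are actually available.
can_bind <- requireNamespace("academictwitteR", quietly = TRUE) &&
  requireNamespace("tibble", quietly = TRUE) &&
  dir.exists(data_path)

vanilla <- if (can_bind) {
  # Default output: a plain data frame of tweets, converted to a tibble.
  # Passing user = TRUE instead would bind the user information.
  tibble::as_tibble(academictwitteR::bind_tweets(data_path = data_path))
} else {
  NULL
}
vanilla
```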
#> ================================================================================
#> # A tibble: 25 × 14
#> public_metrics$retweet_co…¹ conversation_id author_id entities$mentions text
#> <int> <chr> <chr> <list> <chr>
#> 1 48 14060074051803… 58755490 <df [1 × 3]> "RT …
#> 2 4 14056173864058… 97759337… <df [1 × 3]> "RT …
#> 3 4 14056160479909… 13065071… <df [1 × 3]> "RT …
#> 4 4 14056150555557… 97897581… <df [1 × 3]> "RT …
#> 5 4 14056130649684… 114774406 <df [1 × 3]> "RT …
#> 6 4 14056107240266… 47919307 <named list [0]> "Ihr…
#> 7 0 14053930335589… 94052353… <df [2 × 3]> "👩💻👩💻👩💻…
#> 8 0 14048087518576… 47919307 <df [1 × 3]> ".@j…
#> 9 20 14044409298812… 11508518… <named list [0]> "Oka…
#> 10 0 14043934574273… 30635588… <named list [0]> "#Ic…
#> # ℹ 15 more rows
#> # ℹ abbreviated name: ¹public_metrics$retweet_count
#> # ℹ 14 more variables: public_metrics$reply_count <int>, $like_count <int>,
#> # $quote_count <int>, entities$hashtags <list>, $urls <list>, lang <chr>,
#> # created_at <chr>, id <chr>, possibly_sensitive <lgl>,
#> # referenced_tweets <list>, source <chr>, geo <df[,1]>,
#> # context_annotations <list>, in_reply_to_user_id <chr>
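The nesting can be handled with ordinary data frame operations. The sketch below uses a small hand-made data frame (toy values, not real API output) that mimics a nested public_metrics column, then pulls retweet_count out and flattens the nested columns.

```r
# Toy stand-in for the vanilla output: two tweets with a nested
# data-frame column, similar in shape to public_metrics above.
toy <- data.frame(id = c("1", "2"), text = c("RT ...", "Hello"),
                  stringsAsFactors = FALSE)
toy$public_metrics <- data.frame(retweet_count = c(48L, 4L),
                                 like_count = c(0L, 2L))

# Reach into the nested column.
toy$public_metrics$retweet_count
#> [1] 48  4

# Flatten the nested columns into ordinary top-level columns.
flat <- cbind(toy[c("id", "text")], toy$public_metrics)
names(flat)
```

After flattening, retweet_count and like_count sit next to id and text and can be used like any other column.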
The second format is the “raw” format. It is a list of data frames containing all of the data extracted in the API call. Please note that not all data frames are in Boyce-Codd normal form, i.e. some columns are still list-columns.
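A sketch of a call that would list the names of these data frames, assuming the same tweetdata directory as above and output_format = "raw". The guard is again an illustrative addition.

```r
data_path <- "tweetdata"

# Skip the call when the package or the collected data are missing.
can_bind <- requireNamespace("academictwitteR", quietly = TRUE) &&
  dir.exists(data_path)

raw <- if (can_bind) {
  academictwitteR::bind_tweets(data_path = data_path, output_format = "raw")
} else {
  NULL
}
names(raw)  # names of the data frames in the list (NULL when skipped)
```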
#> [1] "tweet.public_metrics.retweet_count" "tweet.public_metrics.reply_count"
#> [3] "tweet.public_metrics.like_count" "tweet.public_metrics.quote_count"
#> [5] "tweet.entities.mentions" "tweet.entities.hashtags"
#> [7] "tweet.entities.urls" "tweet.geo.place_id"
#> [9] "tweet.referenced_tweets" "tweet.context_annotations"
#> [11] "tweet.main" "user.public_metrics.followers_count"
#> [13] "user.public_metrics.following_count" "user.public_metrics.tweet_count"
#> [15] "user.public_metrics.listed_count" "user.entities.url"
#> [17] "user.entities.description" "user.main"
#> [19] "sourcetweet.main"
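To see which data frames still carry list-columns, each element of the list can be inspected column by column. The sketch below runs on a hand-made list: the element names are borrowed from the output above, but the columns and values are invented for illustration, and list_columns is a hypothetical helper. The same helper could be applied to the real raw object.

```r
# Toy stand-in for the "raw" list; columns and values are invented.
toy_raw <- list(
  tweet.main = data.frame(tweet_id = c("1", "2"), text = c("a", "b"),
                          stringsAsFactors = FALSE),
  tweet.referenced_tweets = data.frame(tweet_id = c("1", "2"),
                                       stringsAsFactors = FALSE)
)
# Add a list-column: each cell holds a vector of varying length.
toy_raw$tweet.referenced_tweets$refs <- list(c("x", "y"), character(0))

# For every data frame, return the names of its list-columns
# (nested data-frame columns are reported as well, since they are lists).
list_columns <- function(dfs) {
  lapply(dfs, function(df) names(df)[vapply(df, is.list, logical(1))])
}
list_columns(toy_raw)
```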
The third format is the “tidy” format. It is an “opinionated” format, which we believe to contain all essential columns for social media research. By default, it is a tibble.
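A sketch of the corresponding call, assuming the same tweetdata directory and output_format = "tidy". As before, the guard is an illustrative addition.

```r
data_path <- "tweetdata"

# Skip the call when the package or the collected data are missing.
can_bind <- requireNamespace("academictwitteR", quietly = TRUE) &&
  dir.exists(data_path)

tidy <- if (can_bind) {
  academictwitteR::bind_tweets(data_path = data_path, output_format = "tidy")
} else {
  NULL
}
tidy
```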
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:testthat':
#>
#> is_null
#> # A tibble: 25 × 31
#> tweet_id user_username text conversation_id author_id lang created_at
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1406007405180… Phardiga "RT … 14060074051803… 58755490 de 2021-06-1…
#> 2 1405617386405… dorothee_goe… "RT … 14056173864058… 97759337… de 2021-06-1…
#> 3 1405616047990… dejools "RT … 14056160479909… 13065071… de 2021-06-1…
#> 4 1405615055555… LenaOetzel "RT … 14056150555557… 97897581… de 2021-06-1…
#> 5 1405613064968… jenniferhenk… "RT … 14056130649684… 114774406 de 2021-06-1…
#> 6 1405610724026… Tobias_Schul… "Ihr… 14056107240266… 47919307 de 2021-06-1…
#> 7 1405393033558… HTMIBerlin "👩💻👩💻👩💻… 14053930335589… 94052353… und 2021-06-1…
#> 8 1404808751857… Tobias_Schul… ".@j… 14048087518576… 47919307 de 2021-06-1…
#> 9 1404440929881… ASattelmacher "Oka… 14044409298812… 11508518… de 2021-06-1…
#> 10 1404393457427… dr_john_aus_b "#Ic… 14043934574273… 30635588… und 2021-06-1…
#> # ℹ 15 more rows
#> # ℹ 24 more variables: possibly_sensitive <lgl>, source <chr>,
#> # in_reply_to_user_id <chr>, user_url <chr>, user_verified <lgl>,
#> # user_name <chr>, user_protected <lgl>, user_profile_image_url <chr>,
#> # user_description <chr>, user_created_at <chr>, user_pinned_tweet_id <chr>,
#> # user_location <chr>, retweet_count <int>, like_count <int>,
#> # quote_count <int>, user_tweet_count <int>, user_list_count <int>, …
It has the following features / caveats:
1. The tweet ID, author ID and source tweet ID columns are named tweet_id, author_id and sourcetweet_id respectively.
2. The text field of a retweet is truncated. However, the full-text original tweet is located in sourcetweet_text.
3. [...] sourcetweet_text. If you need that data, please follow the clue using the conversation_id.
4. Columns derived from text by Twitter are not available in the tidy format, e.g. lists of hashtags, cashtags, URLs, entities, context annotations etc. If you need those columns, please consider using the “raw” format above.
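The truncation of retweet text can be worked around by preferring sourcetweet_text whenever it is present. A minimal sketch on toy data, assuming that sourcetweet_text is NA for tweets that are not retweets (an assumption, not something stated above); full_text is a hypothetical new column.

```r
# Toy stand-in for a tidy-format result (values invented for illustration).
toy_tidy <- data.frame(
  tweet_id = c("1", "2"),
  text = c("RT @a: a long tweet that gets cut of...", "An original tweet"),
  sourcetweet_text = c("a long tweet that gets cut off here", NA),
  stringsAsFactors = FALSE
)

# Prefer the untruncated source text when it exists, else keep text.
toy_tidy$full_text <- ifelse(is.na(toy_tidy$sourcetweet_text),
                             toy_tidy$text,
                             toy_tidy$sourcetweet_text)
toy_tidy$full_text
```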