Data manipulation, kind of downsampling

2024/10/10 0:26:51

I have a large CSV file; an example of the data is below. I will use an example of eight teams to illustrate.

home_team    away_team      home_score       away_score         year
belgium      france         2                2                  1990
brazil       uruguay        3                1                  1990
italy        belgium        1                2                  1990
sweden       mexico         3                1                  1990
france       chile          3                1                  1991
brazil       england        2                1                  1991
italy        belgium        1                2                  1991
chile        switzerland    2                2                  1991

My data covers many years. I would like the total score of each team for every year; see the example below.

team            total_scores          year
belgium         4                     1990
france          2                     1990
brazil          3                     1990
uruguay         1                     1990
italy           1                     1990
sweden          3                     1990
mexico          1                     1990
france          3                     1991
chile           3                     1991
brazil          2                     1991
england         1                     1991
italy           1                     1991
belgium         2                     1991
switzerland     2                     1991



Here is a solution using the tidyverse (dplyr and tidyr), in particular the pivot functions from tidyr...

library(tidyverse)

df %>%
  pivot_longer(cols = -year,           # splits non-year columns into home/away and type columns
               names_to = c("homeaway", "type"),
               names_sep = "_",
               values_to = "value",
               values_ptypes = list(value = character())) %>%
  select(-homeaway) %>%                # remove home/away
  pivot_wider(names_from = "type",     # restore team and score columns (as list columns)
              values_from = "value") %>%
  unnest(cols = c(team, score)) %>%    # unnest the list columns to year, team, score
  group_by(year, team) %>%
  summarise(total_goals = sum(as.numeric(score)))

# A tibble: 14 x 3
# Groups:   year [2]
    year team        total_goals
   <int> <chr>             <dbl>
 1  1990 belgium               4
 2  1990 brazil                3
 3  1990 france                2
 4  1990 italy                 1
 5  1990 mexico                1
 6  1990 sweden                3
 7  1990 uruguay               1
 8  1991 belgium               2
 9  1991 brazil                2
10  1991 chile                 3
11  1991 england               1
12  1991 france                3
13  1991 italy                 1
14  1991 switzerland           2
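
For comparison, here is a shorter sketch of the same reshaping. It is not the code from the answer above. It uses the ".value" placeholder in names_to, which lets pivot_longer() produce the team and score columns directly, so the scores stay numeric and no pivot_wider()/unnest() round trip is needed. The data frame below is the sample from the question typed in by hand, and the names df and totals are only illustrative. It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)
library(tidyr)

# Sample data from the question, entered by hand (illustrative only).
df <- data.frame(
  home_team  = c("belgium", "brazil", "italy", "sweden",
                 "france", "brazil", "italy", "chile"),
  away_team  = c("france", "uruguay", "belgium", "mexico",
                 "chile", "england", "belgium", "switzerland"),
  home_score = c(2, 3, 1, 3, 3, 2, 1, 2),
  away_score = c(2, 1, 2, 1, 1, 1, 2, 2),
  year       = c(1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991),
  stringsAsFactors = FALSE
)

totals <- df %>%
  # "home_team" is split at "_" into homeaway = "home" and a column named "team";
  # ".value" means the second name part becomes an output column name.
  pivot_longer(cols = -year,
               names_to = c("homeaway", ".value"),
               names_sep = "_") %>%
  # One row per team per match now; sum the scores per team and year.
  group_by(year, team) %>%
  summarise(total_goals = sum(score), .groups = "drop")

print(totals)
```

After the pivot_longer() step each match contributes two rows (one for the home side, one for the away side), so a team that played both at home and away in the same year has its scores added together by the summarise() call.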

