= import("https://esm.run/@observablehq/plot@0.6.11") Plot
Tidying Unkempt Data
Introduction
Transport has always been a passion of mine. I have always adored Trains, Planes, Cars, Boats, and any other type of vehicle. As I matured I began to see transit through a new lens. The sights and sounds of large transport-oriented machinary no longer held my attention like they once did. A more humanistic view of transport caught my fancy and I began to take interest in transport equity for people from all socio-economic backgrounds.
It is well known that the automobile is a focal point in American society. That said, I was well into my teens before I noticed the consequences our car-centric culture causes for people who either can’t afford to own a car or choose not to own a car. This realization invigorated my interest in public transport and walkable infrastructure. In this data exploration, I hope to shed some light on the state of urban rail transport systems in America and compare them to that of other first world countries.
The Data
I found a fascinating data table on this Wikipedia page so I decided to pull it and analize it. Here is the code I used to pull it:
<- "https://en.wikipedia.org/wiki/List_of_metro_systems"
web_page
<-
Tables %>%
web_page read_html() %>%
html_elements("table")
<-
MetroList %>%
Tables ::pluck(1) %>%
purrrhtml_table() %>%
select(1:9)
Much to my dismay the data was riddled with miscellaneous junk. Here is the data inside MetroList:
Here are the problems I see in MetroList:
- The name of the column that holds the number of stations on each system has a long and irrelevent name.
- Several of the columns have units and footnote links in them.
I decided to import it to Rstudio and take a crack at cleaning it.
Tidying
Optimistically, I decided to call this new table TidyMetroList.
Renaming Columns
The following code renames the columns with a unified naming system:
<-
TidyMetroList %>%
MetroList set_names(c("City",
"Country",
"Name",
"ServiceOpened",
"LastExpanded",
"StationCount",
"Lines",
"SystemLength",
"RidershipMillions"))
Much better! There’s still more to go though. Next we can tackle the extra gunk in the columns.
Isolating the Data
We’ll do this part with regular expressions and the stringr
package. These are the main regular expressions we’ll use:
[]
: Grouping
- This element matches one of any character inside the brackets.
|
: Alternation
- This element sperates two expressions. The computer will match one or the other.
w
: Word
- This matches “word characters.” It is very inclusive with what it defines as a word character so it needs to be used with care.
s
: Whitespace
- This matches all forms of whitespace.
d
: Digit
- This matches any natural number in the closed interval [0,9].
.
: Dot
- This matches everything except linebreaks.
+
: Plus
- This matches one or more of the expression it follows.
?
: Optional
- This matches 0 or 1 of the expression it follows.
{}
: Quantifier
- This matches the quantity in the curly-brackets.
(?<=ExpressionHere)
: Positive Lookbehind
- This matches an expression before the main expression without including it in the result.
(?=ExpressionHere)
: Positive Lookahead
- This matches an expression after the main expression without including it in the result.
– Adapted from the reference book on regexr.com
If you want some practice with regular expressions, checkout regexr.com. They have a development environment that tests expressions in real-time and a reference book containing descriptions for every command.
This code uses regular expresssions to remove all of the gunk in the columns:
<-
TidyMetroList %>%
TidyMetroList mutate(ServiceOpened = as.numeric(str_replace_all(ServiceOpened, "\\[[\\w+\\s+]+\\]", "")),
LastExpanded = as.numeric(str_replace_all(LastExpanded, "\\[[\\w+\\s+]+\\]", "")),
Country = str_replace_all(Country, "\\[[\\w+\\s+]+\\]", ""),
Name = str_replace_all(Name, "\\[[\\w\\s]+\\]|\\([\\w\\s]+\\)", ""),
StationCount = as.numeric(str_replace_all(StationCount, "\\[[\\w+\\s+]+\\]", "")),
SystemLength = str_replace_all(SystemLength, "\\skm\\s|\\d+(\\.\\d+)?.mi|\\[[\\w+\\s+]+\\]|[\\(\\)]+", ""))
Lovely! Now there’s only one problem. Notice that the RidershipMillions
column still was not dealt with in the last command. That’s because the parenthetical statements in that column show the timeframe in which the ridership data was collected. We’d rather not get rid of this data so hastily.
A Special Case
The following code will snatch the data from the parenthesis and drop it in a new column called RidershipMillionsDataFrom
. Then it will remove the excess information from RidershipMillions
like we did above.
<-
TidyMetroList %>%
TidyMetroList mutate(RidershipMillionsDataFrom = str_extract(RidershipMillions, "(?<=\\().+(?=\\))"),
RidershipMillions = as.numeric(str_replace_all(RidershipMillions, "\\[[\\w+\\s+]+\\]|(\\().+(\\))", "")))
Ridership Data Analysis
Analysis
I would like to know what the average ridership is like for the five largest economies. So I start by removing any NA values with the command in the third line. Next, I group the dataset so the mutate command takes the mean of RidershipMillions
for each country. Now I filter out the five countries of interest. The select command picks out the columns I choose and the distinct command removes superfluous rows.
Caveat
This data set is not a great indicator of how widely used urban rail transport is in these countries because we have no data in this dataset for the population that each metro system serves. If we did it would be very interesting to compare the average ridership divided by the average service zone population.
<-
MetroRidershipGraphInput %>%
TidyMetroList filter(Country %in% c("United States", "Germany", "India", "China", "Japan")) %>%
na.omit() %>%
group_by(Country) %>%
mutate(RidershipMillionsAvg = mean(RidershipMillions)) %>%
select(Country, RidershipMillionsAvg) %>%
distinct()
Visual
For the visuals I used Observables. It is an excellent graphics generator that supplies a clear interface for creating a notebooks. If you would like to see the work I did to generate this code checkout my public notebook for this project.
.plot({
Plotcolor: {
type: "categorical",
domain: d3.range(5).map[`China`, `Germany`, `India`, `Japan`, `United States`],
range: ["#593196", "#009CDC", "#13B955", "#EFA31D", "#FC3939"]
,
}title: "Average Ridership in 5 Major Countries",
y: { label: "Millions of Riders" },
marks: [
.barY(await FileAttachment("MetroRidership.csv").csv({ typed: "auto" }), {
Plotx: "Country",
y: "Average Ridership (Millions)",
fill: "Country",
tip: true
,
}).ruleY([0])
Plot
] })
System Size Data Analysis
Analysis
First, we take the dataset and filter so we only have data on the five countries we are interested in. Then we remove any NA values. Now we move to mutations. We would like to see some metric of how many stations there are on each line. We will mutate StationCount
divided by SystemLength
and call this ratio StationDensity
. At this point we still have the data for each metro system in the dataset individually. We will now use group_by()
to group the data into 5 groups corresponding to the 5 countries. Now we can run a mutate to find the averages of StationCount
, Lines
, SystemLength
, and StationDensity
for each of the 5 countries. We will select the five columns of interest and remove any duplicated rows. Now we have all the data we need to make some really intriguing graphics.
Caveat
Upon looking at the data again, I realize that I could have built a variable more intuitive than StationDensity
. Station density is currently measured in \(\frac{Stations}{Kilometer}\). I think it would be easier to communicate this data if I flipped the ratio to measure in \(\frac{Kilometers}{Station}\). In this format, the data would show that average distance between 2 stations. If one were to divide this average in half we could see the average max distance a rider would have to walk in order to get to their destination from a rail station.
<-
MetroSizeGraphInput %>%
TidyMetroList filter(Country %in% c("United States", "Germany", "India", "China", "Japan")) %>%
na.omit() %>%
mutate(StationDensity = StationCount/SystemLength) %>%
group_by(Country) %>%
mutate(StationCountAvg = mean(StationCount),
LinesAvg = mean(Lines),
SystemLengthAvg = mean(SystemLength),
StationDensityAvg = mean(StationDensity)) %>%
select(Country, StationCountAvg, LinesAvg, SystemLengthAvg, StationDensityAvg) %>%
distinct()
Visuals
.plot({
Plotcolor: {
type: "categorical",
domain: d3.range(5).map[`China`, `Germany`, `India`, `Japan`, `United States`],
range: ["#593196", "#009CDC", "#13B955", "#EFA31D", "#FC3939"]
,
}title: "Average Distance Between Stations",
y: { label: "Kilometers per Station" },
marks: [
.barY(await FileAttachment("MetroSize.csv").csv({ typed: "auto" }), {
Plotx: "Country",
y: "Average Station Density",
fill: "Country",
tip: true
,
}).ruleY([0])
Plot
] })
.plot({
Plotcolor: {
type: "categorical",
domain: d3.range(5).map[`China`, `Germany`, `India`, `Japan`, `United States`],
range: ["#593196", "#009CDC", "#13B955", "#EFA31D", "#FC3939"]
,
}title: "Average System Length",
y: { label: "Kilometers of Rail Service" },
marks: [
.barY(await FileAttachment("MetroSize.csv").csv({ typed: "auto" }), {
Plotx: "Country",
y: "Average System Length",
fill: "Country",
tip: true
,
}).ruleY([0])
Plot
] })
.plot({
Plotcolor: {
type: "categorical",
domain: d3.range(5).map[`China`, `Germany`, `India`, `Japan`, `United States`],
range: ["#593196", "#009CDC", "#13B955", "#EFA31D", "#FC3939"]
,
}title: "Average Number of Stations per System",
y: { label: "Stations" },
marks: [
.barY(await FileAttachment("MetroSize.csv").csv({ typed: "auto" }), {
Plotx: "Country",
y: "Average Number of Stations",
fill: "Country",
tip: true
,
}).ruleY([0])
Plot
] })
System Maturity Data Analysis
Analysis
I would like to compare the age of these five metro systems. First I’ll filter out all the countries that are irelevant to our comparison. Next we need to find both the age of each system and the amount of time since each was expanded. For that we need to know the current date. Luckily, R has a function called Sys.Date()
that returns the current time and date. So we need to extract the year from Sys.date()
. We’ll use both our venerable regular expressions and the stringr
package to accomplish this. Then we can ensure that this string of 4 numbers is converted to a number with the as.numeric()
function. Now we mutate the difference between the column ServiceOpened
and our current year to get a new column called SystemAge
. We also mutate the difference between the column LastExpanded
and our current year to get a new column called YearsSinceExpansion
. Now we select only the columns we need and remove any NA vlaues from those columns. Next we can group by each country and find the average for each as explained above. Finally we select the columns Country
, AverageYearsSinceExpansion
, and AverageSystemAge
. If graphing this data sounds like a problem, you would be right. We need to create a categorical variable with the levels: AverageYearsSinceExpansion
and AverageSystemAge
. Thank goodness for the R function pivot_longer()
! It takes three arguments: - cols
- A column or columns to consider in the pivoting process. - names_to
- The name of a new column to hold the values of the column titles. (AverageYearsSinceExpansion
and AverageSystemAge
in our case). - values_to
- The name of a new column to hold the values of the columns being pivoted. The resulting dataframe is just below the code chunk.
<-
MetroMaturityGraphInput %>%
TidyMetroList filter(Country %in% c("United States", "Germany", "India", "China", "Japan")) %>%
mutate(SystemAge = (as.numeric(str_extract(Sys.Date(), "\\d{4}")) - ServiceOpened),
YearsSinceExpansion = (as.numeric(str_extract(Sys.Date(), "\\d{4}")) - LastExpanded)) %>%
select(Country, SystemAge, YearsSinceExpansion) %>%
na.omit() %>%
group_by(Country) %>%
mutate(AverageSystemAge = mean(SystemAge),
AverageYearsSinceExpansion = mean(YearsSinceExpansion)) %>%
select(Country, AverageYearsSinceExpansion, AverageSystemAge) %>%
distinct() %>%
pivot_longer(cols = !Country, names_to = "Types", values_to = "Values")
Visual
.plot({
Plotcolor: {
type: "categorical",
domain: d3.range(5).map[`China`, `Germany`, `India`, `Japan`, `United States`],
range: ["#593196", "#009CDC", "#13B955", "#EFA31D", "#FC3939"]
,
}title: "Transit System Maturity",
subtitle: "Urban Rail System Maturity in 5 Major Countries",
y: { label: "Years" },
marks: [
.frame({ strokeOpacity: 0.1 }),
Plot.barY(await FileAttachment("MetroMaturity.csv").csv({ typed: "auto" }), {
Plotfx: " ",
x: "Country",
y: "Values",
fill: "Country",
tip: true
,
}).ruleY([0])
Plot
] })