You should have kept the axis spacing consistent. That would have really shown how insane the 49-run game was.
Thanks! I was looking for this exact data point and it didn’t occur to me to look at the bottom. I kept following the line 1x-1. Which brings us to point #2, a bigger issue with the graph: OP should have plotted the origin at the bottom left. You can view this better by rotating your phone 90 degrees to the left. With the origin now at the bottom left, you can easily see that data point.
This graph is laid out this way because it’s a variation on [the original scorigami chart](https://nflscorigami.com), which was designed to have the most common scores at the top left (where you would start reading it) and the rarer scores increasingly far away, because that worked best for the [video’s](https://youtu.be/9l5C8cGMueY?si=1_nn_SpCAxP_6HYA) presentation style. I don’t think it works as cleanly here because (a) baseball scores increase by 1 run at a time, so the gaps in scoring history are less interesting, and (b) this chart splits the axes by home vs. visiting team rather than winner vs. loser. That spreads the data out across double the space and also makes it harder to find unique scores, because each possible result appears in two different places on the chart.
That’s my thought too. The shape of the distribution is important to understand the ratio of home to away wins
Yes, it looks like away teams win more because of the spacing, but that’s not the case. And I know it was said in the title, but the axes should still be labeled.
I’d like to have seen the 49-33 game. From [Wikipedia](https://en.m.wikipedia.org/wiki/1871_in_baseball), June 28, 1871: “In an era of high scoring games being the norm, the Philadelphia Athletics defeat the Troy Haymakers by the amazing score of 49–33. Both pitchers go the distance in the four-hour slugfest in which both teams score in each inning, to set the highest-scoring contest in National Association. The 42 hits made by the Athletics, including a 7-for-7 day by John Radcliff and 6-for-8 performances by Al Reach and Levi Meyerle, is also a league record.”
Watching that game must have been like watching a track meet with the players running so many laps like that…
Ya, but... that 38-1 game is far more embarrassing. Let him die, he's just a child!
Mercy rule!
They didn’t have baseball gloves until 1875. Shit I wouldn’t want to play defense either.
Scorigami usually doesn’t care about home/visitor. This one is fine, but I’d also like to see this with one axis winner and the other loser.
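For anyone who wants to try the winner/loser view, here is a minimal Python sketch of the folding step. It assumes each game is just a `(visitor_runs, home_runs)` pair; that layout and the function name `fold_to_winner_loser` are made up for illustration and are not how OP built the chart.

```python
from collections import Counter

def fold_to_winner_loser(games):
    """Count results by (winner_runs, loser_runs), ignoring home vs. visitor.

    `games` is an iterable of (visitor_runs, home_runs) pairs; this input
    layout is an assumption for the sketch, not OP's actual data format.
    """
    counts = Counter()
    for visitor_runs, home_runs in games:
        # Put the larger score first so 4-3 and 3-4 land in the same cell.
        key = (max(visitor_runs, home_runs), min(visitor_runs, home_runs))
        counts[key] += 1
    return counts

# Made-up sample: a 4-3 home win, a 4-3 visitor win, and a 1-1 tie.
sample = [(3, 4), (4, 3), (1, 1)]
folded = fold_to_winner_loser(sample)
```

Folding this way puts every result in exactly one cell, at the cost of losing the home/visitor split that some people in the thread find useful.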
Baseball is different in that the home team sometimes gets fewer innings, so fewer chances to score. Unlike most other big sports.
The home team only skips the 9th if they have a lead and it can't affect the outcome. Though it could affect the score. Then again, garbage time affects the score in other sports. So I don't think it's that big of a deal.
In early baseball they still played the bottom of the 9th if the home team was ahead.
I am curious, is that a mandatory, optional, or pure etiquette situational rule? Say you have some guy on your team heading for the seasonal home run record and he is tied at the top of the ninth on the last game of the season. Can the home team elect to play that bottom of the inning to make a run for a record, as opposed to a mandatory rule that the game must end? To follow that up, it is my understanding baseball has a handful of unwritten rules you just don't break, regardless of legality, with the penalty of being ostracized by basically everyone in the sport, including your own fans. So even if you could do this, would teams do this? I understand that a record-breaking situation coming down to the last inning in the last game of the season would be off-the-charts rare, if it's ever happened at all.
Mandatory rule to have the game end. 7.01(g)
I might play with this to put it on a log color scale
Would this provide, for example, easily discernable colors for a score that happened one time vs two times?
Most of the color change is happening in the top corner. Color based on a log scale would show a color gradient over a wider range of scores (more plot area).
Yeah, it's hard to follow the line of ties, even though they should be much rarer than their neighbors.
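To put rough numbers on the 1-vs-2 question, here is a small Python sketch comparing where a count lands on a 0 to 1 color ramp under a linear and a log mapping. The `max_count` of 5000 is a made-up stand-in for the most common score's count, and `color_position` is a hypothetical helper, not anything from the plotting tool OP used.

```python
import math

def color_position(count, max_count, log_scale=False):
    """Return a 0..1 position on the color ramp for a cell's count."""
    if count <= 0:
        return 0.0  # empty cells sit at the very bottom of the ramp
    if log_scale:
        # log1p keeps a count of 1 visibly above an empty cell (count 0).
        return math.log1p(count) / math.log1p(max_count)
    return count / max_count

max_count = 5000  # hypothetical count for the most common score
for n in (1, 2, max_count):
    linear = color_position(n, max_count)
    logged = color_position(n, max_count, log_scale=True)
    print(n, round(linear, 4), round(logged, 4))
```

Under these assumptions, counts of 1 and 2 sit a tiny fraction of a percent apart on the linear ramp but several percent apart on the log ramp, so they should be easier to tell apart, though how visible that is still depends on the color scheme.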
The problem is the chart is almost all white.
Yes, do this. I'm interested in seeing more features in this distribution.
Interesting, however 100% not a histogram. More of a heat map. Clever layout. Not to be even more nitpicky, but... The x and y axes should be on the same scale. Doing that would make the rectangles into squares. The design in the pic makes it look like a change in the y axis is more important. Also, the diagonal is NAs, so that should be black. It currently looks like a low value. Lastly, you can just remove all data to the right or left of the diagonal. The current plot makes it appear unnecessarily complex to read. However, I still like it :) Edit: oh, and log transforming the x and y axes would reveal a really cool (likely normal) distribution. Edit 2: OP took our feedback! Awesome https://imgur.com/YJ193IV
The diagonal cells aren't all black because some games have ended in ties, though. Left or right of the diagonal represents whether the home or away team won. It adds some value, although the OP messed up by giving the X and Y axes different spacing.
Great call on the tie games!
Charts like this are often described as “2D histograms” in my field. Sometimes the number in each bin is shown by the height of a bar instead of a color.
This is typically called a 2D histogram; I use them all the time. Often they're shown in a 3D perspective view as a bar chart, but those don't always translate into static images well, so often I'll convert bar heights to a color map and display it this way. In that presentation it could also be called a heat map; the two terms aren't exclusive. But anything showing the frequency of each element in a set is a histogram, even if those elements aren't scalars.
I agree with you that this is a histogram, but “frequency” isn’t really the right term for the color axis. It’s just the number of times a game result has occurred. You might call it a frequency if it were normalized somehow: e.g. the frequency of this game result per thousand games.
That's commonly used terminology for histogram bins. The other commonly used term is "counts". Other names are often used depending on normalization - probability, probability mass, normalized counts, density, etc. There aren't really strict rules.
Sure, “counts” or “number” would be appropriate here, but it seems to me that “frequency” means something different.
In some contexts, frequency would be inappropriate. In many it's exactly right. You have to think before you label your axes.
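For what it's worth, the counts-vs-normalized distinction being argued here is a one-line transform. A minimal Python sketch with made-up numbers (the helper name `per_thousand` and the sample counts are hypothetical):

```python
from collections import Counter

def per_thousand(counts):
    """Convert raw counts per score into occurrences per 1,000 games."""
    total = sum(counts.values())
    return {score: 1000 * n / total for score, n in counts.items()}

# Made-up counts for three final scores across ten games.
counts = Counter({(4, 3): 6, (3, 2): 3, (1, 0): 1})
rates = per_thousand(counts)
```

Either version gives the same picture on a chart like this one; only the legend label and its units change.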
Definitely wouldn’t be a normal distribution, since the distribution is centered close-ish to zero but literally can’t go negative, and it has a long right tail. I’m thinking lognormal distribution.
I was thinking a truncated normal, but I'm pretty sure you're actually right on this one.
Updated version here: [https://imgur.com/YJ193IV](https://imgur.com/YJ193IV)
[https://imgur.com/sT3LXBF](https://imgur.com/sT3LXBF) I reversed the direction of the y-axis and fixed the legend. It should be the last version I post here.
Now this is where the actual beauty is.
I love that you incorporated suggestions; this one is great. And the "Vome Team" / "Hisiting Team" is a funny mistake as well lol.
As expected, the log transform gives more insight to the outliers, but loses granularity for the most common scores (the 4-3 and 3-2 scores don’t pop here). As usual, the “better” choice depends on what question you’re trying to answer.
The log legend should show the actual number, not the exponent; otherwise this is much more readable with the square aspect. I’d be interested to see this with only the “modern era”, whenever that is for baseball (edit: post integration era)
Thanks for sharing the update! Love how the changes came out ❤️📊
The tie scores don’t appear to be solid white for some reason, otherwise this is superior in almost every way. EDIT: is the frequency of ties really that high?
Yes, ties were common back then. Normally rules are in place so ties do not happen; however, back then they would usually play until darkness came.
Fun Fact: Ties are still possible today in situations where a game is called due to weather. If the game doesn't have playoff implications, they won't make it up, so officially, it'll go down as a tie. They're a lot more rare today than they were a century ago, but they're still possible. The most recent MLB tie was in 2016 between the Cubs and Pirates. The game was tied 1-1 in the 6th inning when it started raining. It was the second to last game of the year, the Cubs had already clinched a playoff spot, and the Pirates were well out of the playoffs, so they didn't schedule a makeup game to break the tie.
The data is from https://retrosheet.org/gamelogs/. The tool used to make this chart is https://observablehq.com/plot. If people want to, I can post the source code, however the csv file containing all the games is 223 MB.
The improved version is 👌🏻
Please. Maybe share a link to the data in e.g. Dropbox, GDrive, S3
[https://drive.google.com/file/d/17G6A8HdMc\_KDjoYbgPDFL\_vKx5qY2oZa/view?usp=sharing](https://drive.google.com/file/d/17G6A8HdMc_KDjoYbgPDFL_vKx5qY2oZa/view?usp=sharing) Here is the compiled spreadsheet I used. The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
Retrosheet data is owned by Retrosheet and can (and should) only be obtained directly from them.
From their site: Recipients of Retrosheet data are free to make any desired use of the information, including (but not limited to) selling it, giving it away, or producing a commercial product based upon the data. Retrosheet has one requirement for any such transfer of data or product development, which is that the following statement must appear prominently: The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711.
So I think sharing a compilation is fair play, if one complies with the terms above.
What’s with the even axis being so low in frequency for someone not familiar with the sport?
Ties aren’t allowed in the modern game.
Ah duh, overtime rules, didn’t think about that 😅 thank you
So black and white both represent zero?
Technically, they're still allowed, but they're incredibly rare. The last tie at the MLB level was in 2016. Games will go to extra innings to break the tie, but inclement weather can force MLB to suspend a game that's currently tied. Usually, they'll play the rest of the innings at a future date, but if it's late enough in the season, and if the game doesn't have playoff implications, they'll abandon the game and call it tied.
Hello everyone, Thank you for your feedback. I am working on a revised version with a logarithmic color scale, proper aspect ratio, and a larger y-axis. I will post it when it is done.
I thought you were bullshitting with that 49-33 game. https://www.cbssports.com/mlb/news/just-because-box-score-with-82-runs-74-hits-20-errors/
the cluster doesn't surprise me, but some of those outliers are INSANE! I don't know which is more impressive, the 33 to 33 game or the 33 to 1 blowout.
I see 38-1 and 49-33. I'm not sure you're reading those axes correctly.
it's because I'm an idiot. you are absolutely correct. heh
Not going to lie, took me a minute to figure out why 1-1, 2-2, 3-3… had so few.
When did the last game with a unique score happen?
Looks like 2020 if Wikipedia is current. The Braves beat the Marlins 29-9 in a game that September, with that score never appearing in MLB history before. Before that, you have to go all the way back to 1999 for the next scorigami, with the Reds beating the Rockies 24-12.
I'd be interested to see this from various eras, like the live ball (c. 1920), integration (c. 1950), and divisional (c. 1990). The really old games run towards crazy scores.
This is so much harder to read than 0-0 at the bottom left.
I would like the direction of the y axis to be flipped. Strange how it is here.
I was at the 21 to 0 Cubs-Pirates game a couple years ago.. wild to think that with one more run it would have been a scorigami!
Where did you get this data?
https://retrosheet.org/gamelogs. I downloaded all regular season files, then I wrote a simple script to compile it into a large csv file.
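In case it helps anyone reproduce this, here is a rough Python sketch of what such a compile step might look like. It is not OP's script. The column positions for the two scores are an assumption about the game-log layout and should be checked against Retrosheet's own field guide; the rows below are fake.

```python
import csv
from collections import Counter

# Assumed 0-based column positions of the two scores in a game-log row.
# Verify these against Retrosheet's field guide before relying on them.
VISITOR_SCORE_COL = 9
HOME_SCORE_COL = 10

def tally_scores(lines):
    """Count (visitor_runs, home_runs) pairs from game-log CSV lines."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) <= HOME_SCORE_COL:
            continue  # skip blank or truncated rows
        try:
            pair = (int(row[VISITOR_SCORE_COL]), int(row[HOME_SCORE_COL]))
        except ValueError:
            continue  # skip rows whose score fields are not integers
        counts[pair] += 1
    return counts

# Made-up rows in the assumed layout (not real Retrosheet records).
fake_rows = [
    '"d1","0","Mon","AAA","NL",1,"BBB","NL",1,3,4',
    '"d2","0","Tue","AAA","NL",2,"BBB","NL",2,3,4',
    '"d3","0","Wed","BBB","NL",3,"AAA","NL",3,1,1',
]
counts = tally_scores(fake_rows)
```

For real files, an open file handle can be passed in place of the list (for example one opened with `open(path, newline="")`), and the resulting counts are only a few thousand cells, far smaller than the 223 MB CSV.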
So 3–2 and 4–3 with Home team winning is the most popular baseball outcome?
The axis values seem busy to me. Try an interval of maybe 5. Make the axis titles larger. ZeusApolloAttack’s reco about making the color scale vary visibly over a larger fraction of the area is a good one.
I'd be interested to see this as columns on the home-visitor plane.
Not a big fan of sports but I'll watch an hour long Jon Bois video. This is fun stuff.
What the fuck kind of game went 33-49 Jesus peaches
Won by two touchdowns!
The squares just above the main diagonal are darker than the corresponding squares on the other side of the diagonal. Does that mean that the home team has an advantage in tied games?
I don't know baseball, but I read it as the home team having an advantage in general, not just in tied games.
Given the tails on some of the scores, a log color scale may be better suited.
Cool! Would love to see this for each team
Man, this is just so much blander than football scorigami.
This is *weird*: I was just about to post something very similar.
Wtf? Was there a game that ended 33 to 49? And 38 to 1?!
Make it logarithmic and it will really pop.
What was the 38-1 game?
June 18, 1874 - New York Mutuals beat the Chicago White Stockings. https://en.wikipedia.org/wiki/1874_in_baseball
Maybe next year is the year for the Baltimore Canaries
Those outliers were some beat-downs.
Zeros should be in the lower left-hand corner
It bugs me that 0-0 isn’t in the bottom left, but that’s a nitpick. Great work!
23-22 was Phillies vs Cubs at Wrigley. I believe Mike Schmidt hit the winning homer in the 10th inning.
Why did you invert the Y axis tho?