Kimberly Fessel's Blog

Accuracy, Precision, and Recall — Never Forget Again!

2022-04-03T00:00:00+00:00

To design an effective supervised machine learning model, data scientists must first select appropriate metrics to judge their model’s success. But choosing a useful metric often proves more challenging than anticipated, especially for classification models that have a slew of different metric options.

Accuracy remains the most popular classification metric because it’s easy to compute and easy to understand. Accuracy comes with some serious drawbacks, however, particularly for imbalanced classification problems where one class dominates the accuracy calculation.

In this post, let’s review accuracy but also define two other classification metrics: precision and recall. I’ll share an easy way to remember precision and recall along with an explanation of the precision-recall tradeoff, which can help you build a robust classification model.

Model and Data Setup

To make this study of classification metrics more relatable, consider building a model to classify apples and oranges on a flat surface such as the table shown in the image below.

Most of the oranges appear on the left side of the table, while the apples mostly show up on the right. We could, therefore, create a classification model that divides the table down its middle. Everything on the left side of the table will be considered an orange by the model, while everything on the right side will be considered an apple.

What is accuracy?

Once we’ve built a classification model, how can we determine if it’s doing a good job? Accuracy provides one way to judge a classification model. To calculate accuracy just count up all of the correctly classified observations and divide by the total number of observations. This classification model correctly classified 4 oranges along with 3 apples for a total of 7 correct observations, but there are 10 fruits overall. This model’s accuracy is 7 over 10, or 70%.
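As a minimal sketch of this calculation, the snippet below encodes the ten fruits as two Python lists. The label lists and the `accuracy` helper are illustrative assumptions arranged to match the counts above; they are not code from the original model.

```python
# Hypothetical labels, ordered left to right across the table:
# 6 oranges and 4 apples in total.
actual = ["orange"] * 4 + ["apple"] + ["orange"] * 2 + ["apple"] * 3
# The model calls the left five fruits oranges and the right five apples.
predicted = ["orange"] * 5 + ["apple"] * 5

def accuracy(actual, predicted):
    """Fraction of observations whose predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

print(accuracy(actual, predicted))  # 0.7
```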

While accuracy proves to be one of the most popular classification metrics because of its simplicity, it has a few major flaws. Imagine a situation where we have an imbalanced dataset; that is, what if we have 990 oranges and only 10 apples? One classification model that achieves a very high accuracy predicts that all observations are oranges. The accuracy would be 990 out of 1000, or 99%, but this model completely misses all of the apple observations.
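The imbalanced scenario is easy to reproduce. The sketch below assumes a hypothetical "model" that labels every fruit an orange; the variable names are illustrative only.

```python
# 990 oranges and 10 apples, as in the scenario above.
actual = ["orange"] * 990 + ["apple"] * 10
# A hypothetical model that always predicts "orange".
predicted = ["orange"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
# Count the apples that the model actually labeled as apples.
apples_found = sum(a == p == "apple" for a, p in zip(actual, predicted))

print(accuracy)      # 0.99
print(apples_found)  # 0
```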

Furthermore, accuracy treats all observations equally. Sometimes certain kinds of errors should be penalized more heavily than others; that is, certain types of errors may be more costly or pose more risk than others. Take predicting fraud for example. Many customers would likely prefer that their bank call them to check up on a questionable charge that is actually legitimate (a so-called “false positive” error) than allow a fraudulent purchase to go through (a “false negative”). Precision and recall are two metrics that can help differentiate between error types and can still prove useful for problems with class imbalance.

Precision and Recall

Both precision and recall are defined in terms of just one class, oftentimes the positive—or minority—class. Let’s return to classifying apples and oranges. Here we will calculate precision and recall specifically for the apple class.

Precision measures the quality of model predictions for one particular class, so for the precision calculation, zoom in on just the apple side of the model. You can forget about the orange side for now.

Precision equals the number of correct apple observations divided by all observations on the apple side of the model. In the example depicted below, the model correctly identified 3 apples, but it classified 5 total fruits as apples. The apple precision is 3 out of 5, or 60%. To remember the definition of precision, note that preciSIon focuses on only the apple SIde of the model.

Recall, on the other hand, measures how well the model did for the actual observations of a particular class. Now check how the model did specifically for all the actual apples. For this, you can pretend like all of the oranges don’t exist. This model correctly identified 3 out of 4 actual apples; recall is 3 over 4, or 75%. Remember this simple mnemonic: recALL focuses on ALL the actual apples.
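Both definitions can be sketched in a few lines of Python. The label lists below are hypothetical and arranged to match the example counts, and the `precision` and `recall` helpers are illustrative names rather than functions from any library.

```python
# Hypothetical labels, ordered left to right across the table.
actual = ["orange"] * 4 + ["apple"] + ["orange"] * 2 + ["apple"] * 3
predicted = ["orange"] * 5 + ["apple"] * 5

def precision(actual, predicted, label):
    """Of everything predicted as `label`, the fraction that truly is `label`."""
    on_label_side = [a for a, p in zip(actual, predicted) if p == label]
    return sum(a == label for a in on_label_side) / len(on_label_side)

def recall(actual, predicted, label):
    """Of everything that truly is `label`, the fraction predicted as `label`."""
    all_actual = [p for a, p in zip(actual, predicted) if a == label]
    return sum(p == label for p in all_actual) / len(all_actual)

print(precision(actual, predicted, "apple"))  # 0.6
print(recall(actual, predicted, "apple"))     # 0.75
```

Note that this sketch does not guard against division by zero, which would occur if the model predicted no apples at all or if the data contained no actual apples.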

Precision-Recall Tradeoff

So what are the benefits of measuring precision and recall instead of sticking with accuracy? These metrics certainly allow you to emphasize one specific class since they are defined for one class at a time. That means that even if you have imbalanced classes, you can measure precision and recall for your minority class, and these calculations won’t get dominated by the majority class observations. But it turns out that there’s also a nice tradeoff between precision and recall.

Some classification models, such as logistic regression, not only predict which class each observation belongs to but also predict the probability of being in a particular class. For example, the model may determine that a specific fruit has 80% probability of being an apple and 20% probability of being an orange. Models like these come with a decision threshold that we can adjust to divide the classes.

Let’s say you’d like to improve the precision of your model because it’s very important to avoid falsely claiming that an actual orange is an apple (false positive). You can just move the decision threshold up, and precision gets better. For our apple-orange model, that means shifting the model line to the right. In the example image, the updated model boundary yields perfect precision of 100% since all predicted apples are actually apples. When we do this, however, recall will likely decrease because moving the threshold up leaves out actual apples in addition to the erroneous oranges. Here, recall dropped to 50%.

Okay, what if we want to improve recall? We could make our decision threshold lower by moving our model line to the left. We now capture more actual apples on the apple side of our model, but as we do this, our precision likely decreases since more oranges sneak into the apple side as well. With this update, recall improved to 100% but precision declined to 50%.

Monitoring and selecting an appropriate precision-recall tradeoff allows us to prioritize certain types of errors, either false positives or false negatives, as we adjust the decision threshold of our model.
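To see the tradeoff numerically, the sketch below sweeps three decision thresholds over ten fruits. The apple probabilities are invented for illustration and chosen so that the results line up with the percentages quoted above; a real model would supply its own scores.

```python
# Hypothetical (actual label, predicted probability of apple) pairs.
fruits = [
    ("orange", 0.05), ("orange", 0.10), ("orange", 0.25), ("apple", 0.30),
    ("orange", 0.40), ("orange", 0.55), ("apple", 0.60), ("orange", 0.65),
    ("apple", 0.85), ("apple", 0.95),
]

def apple_precision_recall(fruits, threshold):
    """Call a fruit an apple when its probability meets the threshold."""
    predicted_apples = [label for label, prob in fruits if prob >= threshold]
    true_positives = predicted_apples.count("apple")
    actual_apples = sum(label == "apple" for label, _ in fruits)
    return true_positives / len(predicted_apples), true_positives / actual_apples

for threshold in (0.2, 0.5, 0.8):
    p, r = apple_precision_recall(fruits, threshold)
    print(f"threshold={threshold}: precision={p}, recall={r}")
```

With these made-up scores, the low threshold gives precision 0.5 and recall 1.0, the middle threshold gives 0.6 and 0.75, and the high threshold gives 1.0 and 0.5.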

Conclusion

Precision and recall offer new ways to judge classification model predictions as opposed to the standard accuracy computation. With apple precision and recall, we focus in on the apple class. High precision assures that what our model says is an apple actually is an apple (preciSIon = apple SIde), but recall prioritizes correctly identifying all of the actual apples (recALL = ALL apples).

Precision and recall allow us to distinguish between different types of errors, and there’s also a great tradeoff between precision and recall because we can’t blindly improve one without often sacrificing the other. The balance between precision and recall can also help us build more robust classification models. In fact, practitioners often measure and try to improve something called the F1-score, which is the harmonic mean of precision and recall, when building a classification model. This ensures that both metrics stay healthy and that the dominant class doesn’t overwhelm the metric like it generally does with accuracy.
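For reference, the F1-score can be computed directly from precision and recall. The helper below is a small sketch with an illustrative name; it uses the apple precision (60%) and recall (75%) from the first model in this post.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.6, 0.75), 3))  # 0.667
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot earn a high F1-score by excelling at only one of the two metrics.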

Choosing an appropriate classification metric is a critical early step in the data science design process. For example, if you want to be sure not to miss a fraudulent transaction, you’ll likely prioritize recall for cases of fraud. Though in other situations, accuracy, precision, or F1-score may be more appropriate. Ultimately, your choice of metric should be intimately linked to the goal of your project, and once it’s determined, that metric of choice should drive your model development and selection process.

Python for Data Science: An Interview with Course Report

2020-09-15T00:00:00+00:00

In a recent interview with Course Report, I discussed the basics of Python and how Python is used for data science. Python serves as an all-purpose programming language, so data scientists, engineers, analysts, and web developers alike utilize Python to build end-to-end projects, ready for launch into production. Python also has incredibly simple syntax, which makes it a great first programming language for beginners. We chat about these topics and many more in the video!

You can also check out a write-up of our interview on the Course Report blog.

Delorean for Datetime Manipulation

2020-07-25T00:00:00+00:00

This year’s pandemic necessitated different conference formats for data science professionals. The organizers of PyOhio decided to ask speakers to create 5- or 10-minute pre-recorded talks to be streamed continuously while participants discussed the content in a live chat session. The format was a success! And I am proud to have created this video all about the Python library Delorean.

Delorean makes working with datetimes in Python much less of a burden. Its simple syntax allows users to do datetime arithmetic, handle time zone shifts, convert datetimes into human language like “3 days ago,” and generate equally spaced datetime intervals.

Check out my video for a look at Delorean (along with many, many Back to the Future references) or watch the full PyOhio 2020 conference playlist on YouTube.

Measuring Statistical Dispersion with the Gini Coefficient

2020-06-05T00:00:00+00:00

If you work with data long enough, you are bound to discover that a dataset’s mean rarely, if ever, tells you the full data story. As a simple example, each of the following groups of people has the same average pay of $100:

100 people who make $100 each
50 people who make $150 each and 50 people who make $50
1 person who makes $10,000 and 99 people who make nothing

The primary difference, of course, is the way that money is distributed among the people, also known as the statistical dispersion. Perhaps the most popular measurement of statistical dispersion is standard deviation or variance; however, you can leverage other metrics, such as the Gini coefficient, to obtain a new perspective.

The Gini coefficient, also known as the Gini index or the Gini ratio, was introduced in 1912 by Italian statistician and sociologist Corrado Gini. Analysts have historically used this value to study income or wealth distributions; in fact, even though it was developed over 100 years ago, the United Nations still uses the Gini coefficient to understand monetary inequities in their annual ranking of nations. But the Gini coefficient may be utilized much more broadly! After a more thorough mathematical explanation, let’s apply the Gini coefficient to a few non-standard use cases that do not involve international economies: baby names and healthcare pricing.

Defining Gini

The first step in understanding the Gini coefficient requires a discussion about the Lorenz curve, a graph developed by Max Lorenz for visualizing income or wealth distribution. To trace out the Lorenz curve, begin by taking the incomes of a population and sorting them from smallest to largest. Then build a line plot where the $x$-values represent the percentage of people seen thus far and the $y$-values represent the cumulative proportion of wealth attributed to this percentage of people. For example, if the poorest 30% of the population holds 10% of a population’s wealth, the curve should pass through the scaled $x,y$ coordinates (0.3, 0.1). Note also that if wealth is distributed evenly among all members of a population, the Lorenz curve follows a straight line, $x=y$. See the figure below for an illustration of a hypothetical Lorenz curve along with the line of equality.
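The construction above translates into a short routine. This is a minimal pure-Python sketch, with `lorenz_points` as an illustrative name, that returns the scaled coordinates of the Lorenz curve for a small made-up set of incomes.

```python
def lorenz_points(values):
    """Return the (x, y) points of the Lorenz curve, starting at (0, 0)."""
    ordered = sorted(values)          # smallest to largest
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        # x: share of people seen so far; y: their cumulative share of wealth
        points.append((i / n, running / total))
    return points

print(lorenz_points([40, 10, 30, 20]))
# [(0.0, 0.0), (0.25, 0.1), (0.5, 0.3), (0.75, 0.6), (1.0, 1.0)]
```

Here the poorest half of this tiny population holds 30% of the total, so the curve passes through (0.5, 0.3).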

The Gini coefficient measures how much a population’s Lorenz curve deviates from perfect equality or how much a set of data diverges from equal values. The Gini coefficient typically ranges from zero to one¹, where

zero represents perfect equality (e.g. everyone has an equal amount) and
one represents near perfect inequality (e.g. one person has all the money).

For all situations in between, the Gini coefficient $G$ is defined as \[G = \frac{A}{A + B}\] where $A$ signifies the region enclosed between the line of perfect equality and the Lorenz curve, as indicated in the figure above, while $A + B$ represents the total triangular area.
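One way to evaluate this ratio for a finite dataset is to approximate the area $B$ under the Lorenz curve with trapezoids and use the fact that the full triangle $A + B$ has area one half. The sketch below makes that assumption explicit; `gini_from_area` is an illustrative name, and the trapezoid rule is a stand-in for whatever integration routine you prefer.

```python
def gini_from_area(values):
    """Gini as A / (A + B), with B estimated by the trapezoid rule."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    area_b = 0.0        # area under the Lorenz curve
    previous_y = 0.0
    running = 0
    for value in ordered:
        running += value
        y = running / total
        area_b += (previous_y + y) / 2 / n   # trapezoid of width 1/n
        previous_y = y
    triangle = 0.5      # A + B, the area under the line of equality
    return (triangle - area_b) / triangle

print(gini_from_area([10, 20, 30, 40]))
print(gini_from_area([25, 25, 25, 25]))
```

For these made-up values the first dataset has a Gini coefficient of 0.25, while the perfectly equal dataset comes out at (approximately) zero.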

Each of the three situations discussed in the introduction produces an average of $100 per person. The Gini coefficient, however, varies greatly for each scenario as seen in the figure below.

Gini in Python

To calculate a dataset’s Gini coefficient with Python, you have the option of computing the shaded area $A$ with something like scipy’s quadrature routine. If this style of numerical integration proves slow or too complicated for applications at scale, you can utilize an alternative, equivalent definition of the Gini coefficient.

The Gini coefficient may also be expressed as half of the data’s relative mean absolute difference, a normalized form of the average absolute difference among all pairs of observations in the dataset. \[ G = \frac{\sum\limits_i \sum\limits_j |x_i - x_j|}{2\sum\limits_i\sum\limits_j x_j}\]

The calculation simplifies further if the data consist of only positive values as it becomes unnecessary to evaluate all possible pairs. Sorting the datapoints in ascending order and assigning a positional index $i$ yields \[G = \frac{\sum\limits_i (2i - n - 1)x_i}{n\sum\limits_i x_i}, \] which is even speedier to compute.
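Both formulas are short enough to sketch in plain Python. The functions below are illustrative and are not the vectorized numpy routine referenced in this post; they are applied to the three pay scenarios from the introduction to check that the two definitions agree.

```python
def gini_pairwise(values):
    """Half the relative mean absolute difference (checks every pair)."""
    n = len(values)
    total_difference = sum(abs(a - b) for a in values for b in values)
    return total_difference / (2 * n * sum(values))

def gini_sorted(values):
    """Equivalent sorted form for non-negative data."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))

scenarios = {
    "100 people make $100 each": [100] * 100,
    "50 make $150, 50 make $50": [150] * 50 + [50] * 50,
    "1 makes $10,000, 99 make nothing": [10_000] + [0] * 99,
}
for name, incomes in scenarios.items():
    print(name, gini_pairwise(incomes), gini_sorted(incomes))
```

The two functions return 0.0, 0.25, and 0.99 for the three scenarios. The pairwise version does work proportional to $n^2$, whereas the sorted version only needs a sort and a single pass.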

The best Python implementation of the Gini coefficient that I’ve found comes from Olivia Guest. I will subsequently leverage her vectorized numpy routine to calculate Gini in the case studies that follow.

Case #1: Baby Names

So far we have mostly addressed the Gini coefficient in the context of its original field of economics. This metric generalizes, however, to provide insight whenever statistical dispersion plays a critical role. I will now illustrate two atypical applications to demonstrate how using the Gini coefficient augments the workflow of exploratory data analysis.

The Social Security Administration of the United States (SSA) hosts public records on the names given to US babies for research purposes. Aggregating these data for children born since 1950, I discovered that 18 out of the top 20 most popular names more commonly associate with male children. So where are the females?

Slightly more male babies are actually born each year, and certainly more male babies have been registered with the SSA (53% male vs 47% female); nonetheless, I was still surprised to see such a large proportion of male names in my quick popularity chart. Digging into the data further, I found that even though fewer females appear in the data, there have been consistently more unique female names each year.

Statistical dispersion appears to play a significant role. To put it back in financial terms, some male names like the ones on my top 20 list are just extremely “wealthy.” (The most popular name, “Michael,” accounts for over 3% of all male children born since 1950.) These ultra-popular masculine names likely pass down from generation to generation. Female babies, on the other hand, are distributed more widely across a variety of names, so extra names share in the “wealth” of female children. We can verify this theory by returning to the Gini coefficient.

Consider how female children disperse across each name. Some names in the dataset account for only 5 babies² since 1950, while “Jennifer” represents nearly 1.5 million individuals. Tallying up all females born with each name since 1950 and sorting the names from least to most popular, we find the Gini coefficient to be 0.96, implying a huge disparity in the most popular versus the most unique names.

Male names exhibit a very similar Lorenz curve but with a little more skew, registering a Gini coefficient of 0.97. The difference between male and female coefficients appears insignificant, but consider an alternative viewpoint. Instead of aggregating across time, calculate a yearly Gini coefficient for each gender. Plotting both the female and male Gini coefficients for each year since 1950 demonstrates a clear and persistent pattern where the male coefficient is consistently higher.³ Thus male names experience more statistical dispersion than female monikers. Also of note, the Gini values for both genders have ticked downward since the 1990s, indicating a trending preference toward more diverse naming conventions.
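The yearly calculation amounts to grouping name counts by year and gender and computing Gini within each group. The sketch below uses a handful of made-up rows with placeholder names; the row layout and the numbers are assumptions for illustration, not actual SSA records.

```python
from collections import defaultdict

def gini(values):
    """Sorted-form Gini coefficient for non-negative data."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))

# Hypothetical (year, sex, name, count) rows; not real SSA figures.
rows = [
    (1950, "F", "Name A", 60), (1950, "F", "Name B", 30), (1950, "F", "Name C", 10),
    (1950, "M", "Name D", 80), (1950, "M", "Name E", 15), (1950, "M", "Name F", 5),
    (1951, "F", "Name A", 50), (1951, "F", "Name B", 30), (1951, "F", "Name C", 20),
    (1951, "M", "Name D", 70), (1951, "M", "Name E", 20), (1951, "M", "Name F", 10),
]

# Collect the per-name counts for every (year, sex) group.
counts_by_group = defaultdict(list)
for year, sex, _name, count in rows:
    counts_by_group[(year, sex)].append(count)

yearly_gini = {group: gini(counts) for group, counts in counts_by_group.items()}
for (year, sex), value in sorted(yearly_gini.items()):
    print(year, sex, round(value, 3))
```

In this toy data the male coefficient is higher than the female coefficient in both years, and both values drop from the first year to the second.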

In a final look at this dataset, let’s examine popularity trends for individual names over time. Now utilize Gini by grouping the female data by name and calculating the Gini coefficient as it pertains to yearly frequencies; that is, for any given name, sort each year of the dataset by that name’s least to most popular year in order to compute Gini. Names with lower Gini coefficients demonstrate similar levels of popularity throughout the entire time span, while higher coefficients imply uneven popularity levels. The figure below compares popularity trends for the names “Scarlett” and “Miriam.” Both names represent about 60,000 female babies in the dataset; however, the sharp increase in babies named “Scarlett” generates a large Gini coefficient while “Miriam” sees a low Gini value since the name has consistently been given to roughly 1,000 babies every year since 1950.
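The contrast between a steady name and a surging one can be illustrated with two invented series of yearly counts. The numbers below are hypothetical and only mimic the shapes described above; they are not the actual counts for either name.

```python
def gini(values):
    """Sorted-form Gini coefficient for non-negative data."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))

# Hypothetical yearly counts over a ten-year window.
steady_name = [1000] * 10                 # about the same every year
surging_name = [10] * 8 + [4000, 6000]    # rare for years, then very popular

print(round(gini(steady_name), 3))
print(round(gini(surging_name), 3))
```

The steady series yields a Gini coefficient of zero, while the surging series lands above 0.8 because two years account for nearly all of the babies.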

Case #2: Healthcare Prices

Now shift to this 2017 healthcare pricing dataset hosted by the Centers for Medicare and Medicaid Services, a federal agency of the United States. These data, aggregated as procedural averages for individual hospitals, include the charges and eventual payments for over 500 separate inpatient procedures for Medicare patients. I applied Gini coefficient calculations to determine which, if any, procedures require better billing standardization. The underlying basis for my analysis boils down to this: the higher the Gini coefficient, the greater the disparity in what different hospitals charge for a given procedure. Procedures with large Gini values could then necessitate regulation or more transparent cost details.

The procedure, or diagnosis related group (DRG), with the highest Gini coefficient in this dataset⁴ is labeled as “Alcohol/Drug Abuse or Dependence w Rehabilitation Therapy.” This perhaps elicits little surprise given that rehabilitation therapies vary widely both in terms of treatment length and illness severity; we probably expect a wide range in what assorted hospitals charge. In fact, all diagnoses with the largest Gini coefficients, such as coagulation disorders and psychoses, can vary in severity. Procedural charges that show the most uniformity among the hospitals, on the other hand, mostly describe one-time cardiac events such as valve replacement, percutaneous surgeries, or observation for chest pain.

Gini coefficients among average hospital charges per diagnosis related group (DRG)
Highest Gini	Lowest Gini
Alcohol/Drug Abuse or Dependence w Rehabilitation Therapy	Aortic and Heart Assist Procedures except Pulsation Balloon w MCC
Coagulation Disorders	Angina Pectoris
Alcohol/Drug Abuse or Dependence, Left AMA	Cardiac Valve & Oth Maj Cardiothoracic Proc w/o Card Cath w/o CC/MCC
Psychoses	Heart Transplant or Implant of Heart Assist System w MCC
Other Respiratory System Diagnoses w MCC	Perc Cardiovasc Proc w/o Coronary Artery Stent w/o MCC

So what about billing regulation? Do we need more safeguards in place to be sure hospitals are charging similar amounts for similar procedures? Well, more cost transparency certainly doesn’t hurt, especially for treatments that range in duration or intensity, but let’s go back to the dataset. In addition to the information about the amounts hospitals charge, the data also contain the total payments that the hospitals actually received. Applying the same type of analysis to the payments received yields much lower Gini values. In fact, the Gini coefficient is lower for the average payments received than the hospital charges, for every single procedure. This curious insight signals that the contracts in place for Medicare payments already do quite a lot to moderate and regularize procedural costs.⁵

Conclusion

The Gini coefficient continues to provide insight over 100 years after its inception. As a good general-purpose measure of statistical dispersion, Gini can be used broadly to explore and understand data from nearly any discipline. Currently, the most popular metric for understanding data spread is likely standard deviation; however, there are several key differences between standard deviation and the Gini coefficient. Firstly, standard deviation retains the scale of your data. You report the standard deviation of US incomes in dollars, while you might give the standard deviation of temperatures in degrees Celsius. The Gini coefficient, however, has no measurement unit, a property called scale invariance. Secondly, standard deviation is unbounded in that it can be any non-negative value, but Gini typically ranges between zero and one. Gini’s scale invariance and strict bounds make comparing statistical dispersion between two dissimilar data sources much easier. Lastly, standard deviation and the Gini coefficient judge statistical dispersion through different lenses. Gini reaches its maximum value for a non-negative dataset if it contains one positive value and the rest zeros. Standard deviation reaches its maximum if half the data live at the extreme maximum and the other half register at the extreme minimum.
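Scale invariance is easy to check numerically. The sketch below expresses the same hypothetical incomes in dollars and in cents, then compares the population standard deviation (from Python's `statistics` module) with the Gini coefficient; the income figures are invented for illustration.

```python
from statistics import pstdev

def gini(values):
    """Sorted-form Gini coefficient for non-negative data."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * sum(ordered))

# Hypothetical incomes, first in dollars and then converted to cents.
incomes_dollars = [20_000, 35_000, 50_000, 120_000]
incomes_cents = [100 * x for x in incomes_dollars]

# Standard deviation grows by the conversion factor of 100 ...
print(pstdev(incomes_dollars), pstdev(incomes_cents))
# ... while the Gini coefficient is unchanged.
print(gini(incomes_dollars), gini(incomes_cents))
```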

Certain limitations apply to the Gini coefficient despite its many benefits. Like other summary statistics, Gini condenses information thereby losing the granularity of the original dataset. Gini is also many-to-one, which means various different distributions map to the same coefficient. The Gini coefficient proves to be quite sensitive to outliers such that a singular extreme datapoint (large or small) can increase Gini dramatically. Yet, economists have also criticized the Gini coefficient for being undersensitive to wealth changes in upper and lower echelons. Researchers have gone on to introduce several alternative metrics to study different aspects of income inequality, such as the Palma ratio, which explicitly captures financial fluctuations for the richest 10% and the poorest 40% of a population.

No matter which metric you choose to understand statistical dispersion, building data intuition certainly goes beyond simple estimates of the mean or median. The Gini coefficient, long since popular in the field of economics, provides excellent insight about the spread of data regardless of your chosen subject area. As demonstrated in this post, Gini could be tracked over time, calculated for specific segments of your data, or used to detect processes requiring better price standardization. Its applications are limitless, and it might just be the missing component of your EDA toolkit.

Check out this code on GitHub!

¹ The Gini coefficient is strictly non-negative, $G \geq 0$, as long as the mean of the data is assumed positive. Gini can theoretically be greater than one if some data values are negative, which occurs in the context of wealth if some people contribute negatively in the form of debts owed.
² The Social Security Administration does not include names that are given to fewer than 5 babies per gender per state due to privacy reasons; therefore, five children for one given female name since 1950 signifies the absolute minimum allowed.
³ The Gini values displayed in the yearly figure are less than the aggregate because popular names tend to stay popular year after year thus bolstering naming inequality and increasing the Gini coefficient.
⁴ Some diagnosis related groups (DRGs) occur at as few as one hospital for the entire year. I have filtered the dataset down to procedures that are documented by at least 50 hospitals to avoid high variance issues.
⁵ The payments hospitals receive are strictly less than the amounts they charge. Decreasing a dataset’s mean while holding its standard deviation fixed actually increases the Gini coefficient. Here we observe just the opposite effect so statistical dispersion must be lessened in the payments received.

Web Scraping in Python: Real Python Podcast

2020-06-05T00:00:00+00:00

I recently sat down with Christopher Bailey at the Real Python Podcast to discuss web scraping as well as my PyCon 2020 tutorial: “It’s Officially Legal so Let’s Scrape the Web.” In this podcast we talk about web scraping tools and techniques, HTML basics and data cleaning, as well as a recent change to the legal landscape regarding scraping.

Check out the YouTube video above or listen to the podcast at Real Python.

Let's Scrape the Web: PyCon 2020 Video Tutorial

2020-05-04T00:00:00+00:00

Web scraping empowers you to write computer programs to collect data from websites automatically, and recent legal rulings support your right to do so. This tutorial covers the breadth and depth of web scraping: from HTML basics through pipeline methods to compile entire datasets. My video provides step-by-step instructions on utilizing Python libraries like requests and BeautifulSoup as well as links to supplementary tutorial resources in the form of Google Colab or Jupyter notebooks.

Check out the supplementary materials via Google Colab (Scraping Basics and Scraping Wikipedia) or on GitHub.

Level Up: spaCy NLP for the Win

2020-02-21T00:00:00+00:00

Natural language processing (NLP) is a branch of artificial intelligence in which computers extract information from written or spoken human language. This field has experienced a massive rise in popularity over the years, not only among academic communities but also in industry settings. Because unstructured text makes up so much of the data we collect today (e.g. emails, text messages, and even this blog post), many practitioners regularly use NLP at the workplace and require straightforward tools to reliably parse through substantial amounts of documents. The open-source library spaCy meets these exact demands by processing text quickly and accurately, all within a simplified framework.

Released in 2015, spaCy was initially created to help small businesses better leverage NLP. Its practical design offers users a streamlined approach for accomplishing necessary NLP tasks, and it assumes a more pragmatic stance toward NLP than traditional libraries like NLTK, which were developed with a more research-focused, exploratory intention. spaCy can be quite flexible, however, as it allows more experienced users the option of customizing just about any of its tools. spaCy is considered a Python package, but the “Cy” in spaCy indicates that Cython powers many of the underlying computations. This makes spaCy incredibly fast, even for more complicated processes. I will illustrate a selection of spaCy’s core functionality in this post and will end by implementing these techniques on sample restaurant reviews.

Please continue to the ODSC blog to read my full post covering this introduction to spaCy.

Math for Data Science: An Interview with Course Report

2020-02-17T00:00:00+00:00

I recently sat down with Course Report to discuss the math needed to become a data scientist. Blending coding skills with mathematics lies at the heart of data science, so understanding fundamental math concepts is critical for a successful career within the field. Linear algebra, calculus, probability, and statistics are the four math disciplines that fuel the bulk of data science. In this interview, I discuss the role each topic plays in data science; I also work through an example problem from all four subjects.

Please continue to the Course Report blog for a write-up of the interview.

Down and Up: A Puzzle Illustrated with D3.js

2020-01-05T00:00:00+00:00

On a recent vacation my husband and I happened upon an entertainment shop that was well stocked with board games, dice, playing cards, etc. We quickly found an item that both of us, absolute nerds that we are, deemed an essential purchase: a book by Boris A. Kordemsky called The Moscow Puzzles: 359 Mathematical Recreations. No, we didn’t spend our entire vacation solving all 359, but we did bring the book home with us and have continued working through them–often over a glass of wine in the evenings.

One particular puzzle recently caught my attention for several reasons. I’ll come back to those reasons in a bit, but for now, the problem is called “Down and Up” and it goes like this:

Suppose you have two pencils pressed together and held vertically. One inch of the pencil on the left, measuring from its lower end, is smeared with paint. The right pencil is held steady while you slide the left pencil down 1 inch, continuing to press the two pencils together. You then move the left pencil back up and return it to its former position, all while keeping the two pencils touching. You continue these actions until you have moved the left pencil down and up 5 times each. Assume the paint does not dry or run out during this process. How many inches of each pencil are smeared with paint after your final movement?

Take a minute to solve this problem before proceeding if you’d like–spoilers ahead!

First Thoughts

When I first heard this problem, I initially thought that perhaps the paint is not smeared to the right pencil at all and perhaps only one inch of paint appears on the left pencil throughout the entire process. (Did you also expect this?) But the second time I read through the problem I started to visualize what might actually be happening. The solution became much more clear as soon as I tried to make a mental picture of the process. Since my husband was solving the problem with me, I made him this sketch to share what I was thinking:

I managed to distinctly envision the situation, arrive at a solution, and communicate my thought process just with this simple sketch. For many math puzzles a rough picture provides all you need to find the answer, but if my crude drawing hasn’t fully conveyed the solution to you, no worries. Let’s dive in a bit more methodically with a much nicer illustration.

Problem Setup

From the problem directions, we know that initially only the left pencil is smeared with paint. Recall though that the left pencil presses directly against the right. This means paint immediately transfers to the right pencil as they are squeezed together. So both pencils are smeared with one inch of paint even before any of the five down-up movements occur.

Solving and Illustrating the Full Problem

The problem gets a little more complicated as the left pencil moves down and up, but returning to a visual interpretation once again helps immensely. Also feel free to reread the problem statement at any point to regain your bearings.

Both pencils are currently smeared with one inch of paint. Then the left pencil moves down one inch while both pencils continue pressing together. Can you envision what happens when the left pencil moves down? Yes! A clean portion of the left pencil makes contact with the bottom of the right pencil; therefore, another inch of paint transfers over to the left.

The left pencil now lingers one inch lower than the right. One inch of the right pencil is smeared with paint, but paint covers two inches of the left pencil. The left pencil moves up in the next step of the problem, coming back to its original position. So the two pencils realign, but what happens to the paint? Since the left pencil continually makes contact with the right, paint smears over to the right pencil and coats two inches of both pencils at the end of the first down-and-up cycle.

The four remaining cycles proceed similarly, with paint transferring first to the left pencil and then to the right. Finally after five rounds of movements, both pencils are smeared with a total of six inches of paint: an initial inch plus five more inches, one for each of the down-up cycles.
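The same reasoning can be checked with a small simulation. The sketch below is a simplified model resting on a few assumptions not stated in the puzzle: each pencil is divided into one-inch segments, the pencils are 8 inches long, and paint transfers between touching segments at each resting position.

```python
def simulate(cycles=5, length=8):
    """Return the painted inches on the (left, right) pencils after the moves."""
    left = [False] * length     # index 0 is the bottom inch of each pencil
    right = [False] * length
    left[0] = True              # bottom inch of the left pencil starts painted

    def transfer(offset):
        # With the left pencil `offset` inches lower than the right one,
        # left inch i touches right inch i - offset.
        for i in range(length):
            j = i - offset
            if 0 <= j < length and (left[i] or right[j]):
                left[i] = right[j] = True

    transfer(0)                 # pencils pressed together before any movement
    for _ in range(cycles):
        transfer(1)             # left pencil slides down one inch
        transfer(0)             # left pencil returns to its starting position
    return sum(left), sum(right)

print(simulate())  # (6, 6)
```

Calling `simulate(cycles=1)` returns `(2, 2)`, matching the state described at the end of the first down-and-up cycle.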

This problem ultimately hinges on the ability to translate the problem statement into an explanatory visual. To further contextualize this solution, I created an interactive figure with D3.js. Below, both pencils start with one inch of paint as described in the problem setup. Use the “Move Pencil” button to convince yourself of the answer I provided.

Note: these pencils are six fictitious inches long. After the fifth movement, the pencils reach equilibrium in that paint completely covers them. Hit the “Reset” button at any time to start over.

Backstory and Problem Extensions

Earlier I mentioned this problem caught my eye for several reasons. The first reason is exactly what we have been discussing. I marveled at how tricky the problem sounds initially as opposed to how simple it becomes as soon as you construct an appropriate mental image of the situation.

The second reason this puzzle piqued my interest is its history. As explained in Kordemsky’s book, Leonid Mikhailovich Rybakov, a Soviet mathematician who lived in the early 20th century, created this “Down and Up” problem. I deeply appreciate math problems that endure across many time periods and geographies. Solving such puzzles allows me to feel more connected to the past and to other mathematicians around the globe.

Finally, this problem sparked my curiosity because Rybakov first thought it up when returning home from a successful duck hunt. Kordemsky encourages readers to contemplate how a duck hunt could have inspired the puzzle and then explains in his “Answers” section. From The Moscow Puzzles book:

Looking at his boots, Leonid Mikhailovich noticed that their entire lengths were muddied where they usually rub each other while he walks.
“How puzzling,” he thought, “I didn’t walk in any deep mud, yet my boots are muddied up to the knees.”
Now you understand the origin of the puzzle.

Just as the paint smeared the entire length of both pencils, Rybakov’s boots were covered from tip to top because mud had transferred from one boot to the other as he walked.

I continued to think about how this concept might apply to other situations, and I came up with one amusing but slightly unpleasant example. Consider two lines of contra dancers in which the first dancer in the first line unfortunately feels unwell. If this dancer’s sickness is highly communicable, she will, of course, pass along her malady to her dance partner who is positioned across from her. Sometimes in contra dancing, participants exchange dance partners by shifting the two lines laterally. Regrettably, when this happens the newly infected dancer will pass the disease back across the line, and eventually the entire group of dancers becomes ill. Try out my widget below to see this application in action.
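
The same spread can be sketched in Python to ask a slightly different question: how many partner swaps does it take before everyone is ill? This is a rough model built on my own simplifying assumptions, not a description of real contra dance choreography: the lines shift by one position and then shift back, illness always passes between facing dancers, and eight dancers per line is an arbitrary choice. The function name `rounds_until_all_ill` is hypothetical.

```python
def rounds_until_all_ill(dancers_per_line=8):
    """Count shift-and-return rounds until both lines are entirely ill."""
    n = dancers_per_line
    line_a = {0}  # positions of ill dancers in the first line
    line_b = {0}  # her partner directly across catches it right away
    rounds = 0
    while len(line_a) < n or len(line_b) < n:
        # Lines shift by one: dancer p in line B now faces dancer p + 1 in line A.
        caught_in_a = {p + 1 for p in line_b if p + 1 < n}
        caught_in_b = {p - 1 for p in line_a if p >= 1}
        line_a |= caught_in_a
        line_b |= caught_in_b
        # Lines shift back: dancer p faces dancer p again.
        everyone_ill = line_a | line_b
        line_a, line_b = set(everyone_ill), set(everyone_ill)
        rounds += 1
    return rounds


print(rounds_until_all_ill())  # 7
```

Just as with the pencils, each round carries the illness one position farther along both lines, so under these assumptions a line of n dancers needs n - 1 rounds.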

Conclusion

I hope you have enjoyed this discussion on one of my new favorite math puzzles along with these illustrative D3 visuals. Making a mental image of a math puzzle is not always easy, but it can be invaluable when solving problems like these, especially if you are a visual learner like me. The next time you feel stuck on an interview question, check to see if sketching or imagining the physical setup of the problem helps. For me it often does.

I also hope you have enjoyed learning a little about the backstory behind this puzzle. Some of the world’s best math puzzles were created long ago, so I believe looking to the past when attempting to sharpen our minds benefits us greatly. Furthermore, expanding this kind of problem to new applications, like I did with the contra dancers, helps solidify core concepts and builds intuition for future brainteasers. It also makes math problems more enjoyable because you relate them to your own life. So now it’s your turn – can you think of any other “Down and Up” scenarios?

Check out my D3 code on GitHub!

Pencils and Paint

Contra Dancers

How to Gather Data from YouTube

2019-11-12T00:00:00+00:00

Since its 2005 inception, YouTube has entertained, educated, and inspired more than one billion people. It now ranks as the 2nd most visited website on the planet, and its users upload 300 hours of video content every minute. YouTube clearly dominates as the world’s premier source of cute baby moments, epic sports fails, and hilarious cat videos, but its vast troves of content can also be leveraged to strengthen a wide variety of data science projects. In this post, I share how you can gain access to three types of YouTube data: the videos themselves for use in computer vision tasks, the video transcripts for natural language processing (NLP), and video search results for hybrid machine learning efforts.

Please continue to the Metis blog to read my full post covering data collection from YouTube.