A Senior Engineer’s Clever Approach to Sampling Large Datasets

March 13, 2025

The flood of business data keeps growing—but the ability to analyze it quickly isn’t keeping up. Sorting through millions of transactions or pinpointing a root cause across countless data points drains time and resources. Without the right tools, decision-making slows, patterns go unnoticed, and opportunities slip away.

Senior Computer Engineer, Tommy Stopak

At MoreSteam, one of our senior engineers, Tommy, set out to develop a new feature for EngineRoom that would tackle this exact challenge. The goal? To create a tool that could sample large datasets, making analysis more manageable without losing the original signal. The key challenge was ensuring that the sampling process not only simplified the data but also preserved its statistical integrity.

So what is data sampling? Data sampling is a fundamental statistical method that allows analysts to work with a subset of a large dataset while maintaining insights about the whole. Sampling is especially valuable when:

  • Processing the entire dataset is impractical due to size or computational limits.
  • Speeding up analysis is necessary for quick decision-making.
  • Preserving statistical accuracy is important for drawing valid conclusions.
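
To make the idea concrete, here is a minimal sketch of the simplest form of sampling, a uniform random sample drawn without replacement. It is not the algorithm behind EngineRoom's tool, which this article does not describe. The function name `simple_random_sample`, the 1% fraction, and the made-up transaction amounts are all assumptions chosen for illustration.

```python
import random
import statistics


def simple_random_sample(rows, fraction, seed=0):
    """Return a uniform random subset holding roughly `fraction` of `rows`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)                # fixed seed keeps the result repeatable
    k = max(1, round(len(rows) * fraction))  # number of rows to keep
    return rng.sample(rows, k)               # sampling without replacement


# Hypothetical data: 100,000 made-up transaction amounts.
data_rng = random.Random(42)
transactions = [data_rng.gauss(50.0, 15.0) for _ in range(100_000)]

sample = simple_random_sample(transactions, fraction=0.01)

# Compare the size and the average of the sample against the full dataset.
print(len(transactions), len(sample))
print(round(statistics.mean(transactions), 2), round(statistics.mean(sample), 2))
```

In this sketch the sample is one hundredth the size of the original, yet its mean stays close to the full dataset's mean. A summary statistic matching is only part of the picture, though, since it says nothing about whether the shape of the data survived.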

However, a poorly chosen sample can lead to biased or misleading results, making it crucial to select a truly representative subset. That’s why two critical-to-quality characteristics (CTQCs) guided Tommy’s approach: (1) reducing dataset size for easier analysis and (2) preserving the original resolution to retain meaningful patterns.

To test the new sampling algorithm, Tommy took an interesting approach: he turned to NBA shot data, a dataset containing the x and y coordinates of shots made on the basketball court. When plotted on a scatter plot, the data paints a familiar picture: the three-point arc and the paint (also called the key) around the basket. The “paint” refers to the rectangular area near the hoop, typically painted a different color from the rest of the court, and it's where much of the action happens, such as rebounds, post moves, and key plays.


A basketball court from above, displaying the area known as the "Paint".

Tommy selected this dataset because its distinct visual patterns made it easy to compare the sampled and original versions and confirm that key structures remained intact. As he adjusted the sampling algorithm, plotting the sample next to the original showed him whether the sample still depicted the original accurately.

EngineRoom® Data Analysis Software featuring a visual graph of 3-point data.
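
The visual comparison described above can be complemented by a rough numeric check. The sketch below is an assumption, not Tommy's method: it lays a grid over the plot and measures what fraction of the grid cells occupied by the original points are still occupied in the sample. The helper names `occupied_cells` and `coverage`, the one-unit cell size, and the synthetic arc of points (a stand-in for real shot data) are all hypothetical.

```python
import math
import random


def occupied_cells(points, cell=1.0):
    """Map (x, y) points to the set of grid cells they fall in."""
    return {(math.floor(x / cell), math.floor(y / cell)) for x, y in points}


def coverage(original, sample, cell=1.0):
    """Fraction of the original's occupied cells that the sample also hits."""
    full = occupied_cells(original, cell)
    kept = occupied_cells(sample, cell)
    return len(full & kept) / len(full)


rng = random.Random(7)

# Synthetic stand-in for shot data: noisy points along a half-circle arc.
shots = []
for _ in range(20_000):
    angle = rng.uniform(0.0, math.pi)
    radius = rng.gauss(24.0, 0.5)
    shots.append((radius * math.cos(angle), radius * math.sin(angle)))

large = rng.sample(shots, 2_000)
small = large[:200]  # a slice of a random sample is itself a valid random sample

print(round(coverage(shots, small), 2), round(coverage(shots, large), 2))
```

A higher coverage value suggests the sample still traces the same outline as the original. A score like this is a coarse aid for tuning, and it does not replace looking at the plot.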

It's crucial to visually inspect your data—not just rely on statistics like r², mean, median, or mode. Numbers alone don’t always tell the full story. For example, both of these scatter plots have the same mean and r².

EngineRoom output

Check out "Descriptive Statistics Don’t Tell the Whole Story" to read the full article.
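
The point about identical statistics can be reproduced with a classic pair of datasets from Anscombe's quartet. These are not the plots shown in the EngineRoom output above; they are a well-known published example, used here as an assumption-free stand-in. The helper names `mean` and `r_squared` are illustrative.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def r_squared(xs, ys):
    """Square of the Pearson correlation between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Anscombe's quartet, datasets I and II: same x values, different y values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("roughly linear", y_linear), ("curved", y_curved)):
    print(name, round(mean(y), 2), round(r_squared(x, y), 2))
```

Both datasets report the same mean and the same r² to two decimal places, yet one scatters around a straight line and the other follows a smooth curve. Only a plot reveals the difference.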

The beauty of this approach is that it translates directly to real-world applications. In businesses that deal with massive amounts of data—like credit card companies, for instance—getting a quick, reliable analysis is critical. Companies often analyze 90 days of data to capture weekly trends, but sifting through that much information is resource-intensive. The solution? A tool like EngineRoom’s Samplifier can help companies focus on what’s important, while maintaining the statistical integrity of the dataset and making the analysis process far more manageable.

By visualizing the effect of the sampling process, EngineRoom’s new Samplifier demonstrates how a well-optimized algorithm can solve a major problem for businesses: making sense of large, complex datasets and drawing actionable insights without drowning in data.

See how EngineRoom’s Samplifier helps you transform big data into actionable insights faster—without losing critical details. Try it free for 30 days!

Lindsay Van Dyne

Vice President of Marketing, MoreSteam.com LLC

Lindsay Van Dyne joined the MoreSteam team in 2014. She is responsible for developing and executing the company's marketing strategy. Her marketing experience includes technical aspects of search engine optimization (SEO), digital content marketing strategies, lead generation, website development, event management, and partner relationships. Before switching to the marketing team, Lindsay spent several years as MoreSteam's eLearning product manager. During that time, she led the eLearning team through an entire UI transformation, developing a new user interface for training, expanding the language offerings significantly, and adding features like notes & highlighting and a user dashboard for training stats. Lindsay's drive and spicy personality bring a fresh perspective to the leadership and marketing teams, encouraging others to think creatively.

Lindsay received a B.S. in Chemical Engineering from the University of Notre Dame and a B.S. in Computational Physics and Mathematics from Bethel College.
