Troubleshooting: Large Datasets
Problems with large data sets: While larger samples provide several benefits, among them increased precision, greater statistical power, reduced impact of outliers, and better representation of the population under study, they have limitations as well. Large data sets can be just as prone to bias or errors as smaller samples, and data quality issues are generally harder to spot in a large data dump. Without confidence that the data are representative and free of major errors, the results are unreliable at best and patently wrong at worst.
Having said that, EngineRoom does allow you to analyze a relatively large amount of data at one time. The current limit on EngineRoom data sets is 10 MB. This equates to roughly 10,000 rows of data, but this is not a hard limit; the actual row count depends on the number of columns and the complexity of the data.
As we continue to explore how to provide a better experience in this area, we have collected a few tips for working with data of this size in EngineRoom.
There are a few options:
Copy and Paste
The first method for a data set that is too large to upload is to copy and paste portions of the data source into EngineRoom until you have a complete set.
Try copying one column at a time, or two or three, depending on the number of rows you have.
Alternatively, copy several rows at a time and paste them into EngineRoom. Again, this will depend on the number of columns you have.
Sample Your Data
Random Sampling
The most effective way to deal with large data sets is to sample your data appropriately.
1. In Excel, add a column next to your data (in front or at the end).
2. Fill the column with random numbers by entering "=RAND()" in the first cell and copying it down the sheet by dragging the fill handle or double-clicking it.
3. Copy the column of random numbers and use Paste Special > Values to freeze those numbers (otherwise they recalculate every time the sheet changes).
4. Sort the dataset by the random numbers.
5. From here, you can copy the number of rows from your dataset that makes sense for the type of tool you would like to run.
(This method is adapted from this video beginning at 3:02: https://www.youtube.com/watch?v=LpZqdvaJQAQ)
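If you prefer to prepare the sample in code rather than Excel, the same steps can be sketched in Python with pandas. This is a minimal illustration, not part of EngineRoom itself; the column names and sample size are assumptions for the example, and in practice you would load your own file (e.g., with pd.read_csv) instead of generating demo data.

```python
import numpy as np
import pandas as pd

# Demo data standing in for your large data set (values are illustrative).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "measurement": rng.normal(50, 5, size=10_000),
    "line": rng.choice(["A", "B"], size=10_000),
})

# Mirror the Excel steps: attach a random number to every row,
# sort by it, keep the first n rows, then drop the helper column.
n = 1_000
sample = (df.assign(rand=rng.random(len(df)))
            .sort_values("rand")
            .head(n)
            .drop(columns="rand"))

# pandas can also do this in one call:
# sample = df.sample(n=1_000, random_state=42)
```

The resulting sample can be exported (for example with sample.to_csv) and uploaded to EngineRoom in place of the full data set.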
Stratified Random Sampling
There are additional considerations if you need representative samples from subgroups contained within the data set. This adds a few more steps: take a random sample from each subgroup, with each subgroup's sample size proportional to that subgroup's share of the larger data set. This ensures that every subgroup is represented in the combined sample.
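The proportional-allocation idea above can be sketched in Python with pandas, which lets you sample within each subgroup directly. This is an illustrative sketch, assuming a grouping column named "plant" and a 10% sampling fraction; both are made-up for the example.

```python
import numpy as np
import pandas as pd

# Demo data with three subgroups of different sizes (values are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=10_000),
    "plant": rng.choice(["North", "South", "West"], size=10_000,
                        p=[0.5, 0.3, 0.2]),
})

# Sample the same fraction from every subgroup. Because the fraction is
# constant, each subgroup's sample size is proportional to its size in
# the full data set, which is exactly stratified proportional sampling.
frac = 0.10
stratified = df.groupby("plant").sample(frac=frac, random_state=0)
```

Sampling a fixed fraction per group (rather than a fixed count) is what keeps the subgroup proportions in the sample close to those in the original data.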