Importing Data into Excel
I’ve seen lots of Business Intelligence (BI) solutions (data marts, data warehouses, and the accompanying reports and dashboards) built with all sorts of different tools. But I’ll tell you: NO tool has yet been as successful as Microsoft Excel at providing a do-it-yourself data analysis platform to import data into. Now, I’m not suggesting that Excel (even when used with the upcoming Datamartist tool 🙂 ) will make traditional data marts obsolete. Clearly the billions of dollars being spent on “enterprise BI” are not going to dry up. But there are enough times when you have to wait, or your needs are “too specific”, for a large BI project. Often the existing data marts or data warehouses will be the source of the raw data, but you will still need to prepare that data for Excel import.

In the next few posts I’m going to discuss various aspects of using Excel for data analysis. In this first part, I’ll talk about data size and performance in Excel, which raises an important question: when should you import the data? Should you import the HUGE raw file, or treat it before import to reduce its size?
Data Size Limits in Excel
There are different types of limits:
- The size in rows and columns of the actual spreadsheet.
- Excel’s (and your PC’s) ability to crunch the numbers in a reasonable time (RAM, CPU).
- The size of the files involved, and load and save times.
In Excel 2003, a spreadsheet has rows 1 to 65,536 and columns A to IV, making a grid of 256 x 65,536 cells. In Excel 2007 the spreadsheet is much, much larger, with rows from 1 to 1,048,576 and columns from A to XFD (making a grid of 16,384 x 1,048,576 cells). Now, before you get too excited about how much space you have in 2007, the reality is that limits number 2 and 3 define how you can actually use that space. But more is more, and more is good.
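Before importing, it can pay to check whether your raw file even fits inside those grid limits. Here is a minimal sketch in Python (the file name, tab delimiter, and helper names are my own illustrative assumptions, not anything from Excel itself):

```python
import csv
import tempfile

# Excel's documented grid limits (rows only; columns are rarely the problem).
EXCEL_2003_MAX_ROWS = 65_536      # rows 1..65,536, columns A..IV
EXCEL_2007_MAX_ROWS = 1_048_576   # rows 1..1,048,576, columns A..XFD

def count_rows(path, delimiter="\t"):
    """Count data rows in a delimited text file without loading it into memory."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter=delimiter))

def excel_verdict(rows):
    """Rough verdict on whether a row count is workable in Excel."""
    if rows > EXCEL_2007_MAX_ROWS:
        return "too big even for Excel 2007 - aggregate or filter first"
    if rows > EXCEL_2003_MAX_ROWS:
        return "needs Excel 2007, and expect sluggish performance"
    return "fits comfortably"

# Demo on a small generated file standing in for a real raw extract.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for i in range(1000):
        f.write(f"{i}\tvalue_{i}\n")
    sample = f.name

print(count_rows(sample), "->", excel_verdict(count_rows(sample)))
# prints: 1000 -> fits comfortably
```

The point of counting rows up front is that it takes seconds on even a huge text file, whereas discovering the problem after a failed or truncated import into Excel costs a lot more time.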
So let’s kick the tires on large data sets in Excel 2007. For these very informal tests I’m using a quad-core workstation with 4 GB of RAM, so the results I get represent a best case compared to a typical laptop or desktop PC. First of all, putting a million rows of data into Excel 2007 (even a “narrow table” of only 3-4 columns) slows everything down. Delete a column, and you’ll often see a 5-10 second freeze-up while Excel churns away in the background, roughly the same amount of time needed to save the file. Plus, when I push it I’ve had it lock up on me a few times, requiring some Ctrl-Alt-Del action to kill it. Even a narrow table like this makes the Excel file at minimum 15-20 megabytes. For the particular text file I used, the .txt version was 9 MB and the .xlsx file was double the size at 18 MB. I added a few columns and the file quickly grew to 80 MB.
Also, strangely, doing exactly the same thing multiple times results in very different completion times; when I mention times, it’s the average of 2-3 trials (see graph).
All in all, although Excel 2007 can technically store a million rows, I’d advise against it. There are other reasons it’s a pain: scroll bars and Page Up/Page Down don’t scale well to a million rows. It’s just hard to copy 250,000 rows accurately; it takes forever to get to the end, and then you overshoot by a mile and page up forever to find your place again. (And yes, you can use the Go To command under Home > Editing > Find & Select > Go To, but a model of ease it is not.)
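If the raw file really is a million-plus rows, one practical option is to collapse it to summary rows before Excel ever sees it. A minimal sketch of that idea, assuming a tab-delimited file with a grouping column and a numeric column (the file and column names `raw_sales.txt`, `region`, and `sales` are hypothetical examples, not from any real data set):

```python
import csv
from collections import defaultdict

def aggregate_for_excel(in_path, out_path, key_col, value_col, delimiter="\t"):
    """Collapse a huge raw extract into one summary row per key value,
    so the result lands in Excel well under the row limits."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            totals[row[key_col]] += float(row[value_col])
            counts[row[key_col]] += 1
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow([key_col, "row_count", f"total_{value_col}"])
        for key in sorted(totals):
            writer.writerow([key, counts[key], totals[key]])

# Tiny demo: six raw rows collapse to two summary rows.
with open("raw_sales.txt", "w", newline="") as f:
    f.write("region\tsales\n")
    for region, amount in [("East", 10), ("East", 20), ("West", 5),
                           ("West", 5), ("East", 30), ("West", 15)]:
        f.write(f"{region}\t{amount}\n")

aggregate_for_excel("raw_sales.txt", "sales_summary.txt", "region", "sales")
```

The same pattern scales to millions of input rows because it streams the file one row at a time and only keeps the per-key running totals in memory; the summary file it writes is what you import into Excel.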
I can tell you, however, that using all the other features on more reasonable data sets (up to, say, 100k rows), I LOVE what Excel can do in terms of analysis and reporting. Once you have the data in reasonable result sets, there is, in my opinion, no better place to have it than Excel if you want full control. But how do you get it there? In the next posts we’ll learn how to link to data in Access and build a mini personal data mart with the currently available tools. (And you just know there will be some posts later where I show you how to do the same thing, but using Datamartist.) Update: Datamartist is now available. Download the tool now, and find a whole new way to transform and manage your data, including managing huge data imports into Excel.