Stress test of XSSF proves exponential time (performance concerns)

May 14, 2013 at 10:14 PM
I did a quick stress test of npoi 2.0 beta and I simply wrote a string in the first cell of each row up to 100k. The output was as follows:

Start time: 5/14/2013 5:05:39 PM
[00:00:00.0170017] 0 rows written
[00:00:02.7792779] 10000 rows written
[00:00:11.1951194] 20000 rows written
[00:00:27.7817779] 30000 rows written
[00:00:53.5283523] 40000 rows written
[00:01:30.1910182] 50000 rows written
[00:02:16.2836270] 60000 rows written
[00:03:14.6894670] 70000 rows written
[00:04:21.9641938] 80000 rows written
[00:05:38.7868753] 90000 rows written
[00:07:05.4055363] 100000 rows written

As you can see the curve here will be exponential and this is really terrible performance. After using the performance wizard in VS 2012, I found that 88% of the time was spent in GetLastKey. This means that the majority of the time was spent seeking in the sheet to the appropriate row (my theory is supported by the above data, each incremental set is longer to write). I am hoping to improve this by updating the active cell with each cell set, but I get a not implemented error when I call SetActiveCell. Is there a work around here?

My code:
class Program
{
    static void Main(string[] args)
    {
        IWorkbook workbook = new XSSFWorkbook();
        ICell cell;
        ISheet sheet = workbook.CreateSheet("StressTest");
        int i = 0;
        int rowLimit = 100000;
        DateTime originalTime = DateTime.Now;

        System.Console.WriteLine("Start time: " + originalTime);

        for (i = 0; i <= rowLimit; i++)
        {
            cell = sheet.CreateRow(i).CreateCell(0);
            //sheet.SetActiveCell(i, 0);
            cell.SetCellValue("ZOMG PLEASE SURVIVE THIS STRESS TEST");

            if(i % 10000 == 0)
            {
                System.Console.WriteLine("[" + (DateTime.Now - originalTime) + "]" + " " + i + " rows written");
            }
        }

        FileStream sw = File.Create("test.xlsx");
        workbook.Write(sw);
        sw.Close();

        //prompt user so we do not close the window with our data :)
        System.Console.Read();
    }
}
May 14, 2013 at 10:40 PM
I tried using cell.SetAsActiveCell() and there was no improvement to performance... tiny bit worse actually.

Start time: 5/14/2013 6:36:50 PM
[00:00:00.0190019] 0 rows written
[00:00:02.8392839] 10000 rows written
[00:00:11.4071406] 20000 rows written
[00:00:28.3778375] 30000 rows written
[00:00:55.7105705] 40000 rows written
[00:01:33.2173208] 50000 rows written
[00:02:21.7171703] 60000 rows written
May 15, 2013 at 3:13 PM
Edited May 15, 2013 at 3:30 PM
Also, if you simply change XSSFWorkbook to HSSFWorkbook (Still using NPOI 2.0 beta)... these are the results

Start time: 5/15/2013 11:09:36 AM
[00:00:00.0270027] 0 rows written
[00:00:00.0540054] 10000 rows written
[00:00:00.0920092] 20000 rows written
[00:00:00.1160116] 30000 rows written
[00:00:00.1980198] 40000 rows written
[00:00:00.2220222] 50000 rows written
[00:00:00.2820282] 60000 rows written
(crash due to row limit)

The speed difference is concerning.

Now we can compare to NPOI 1.2.3 which must use HSSFWorkbook... these are the results

Start time: 5/15/2013 11:27:18 AM
[00:00:00.0180018] 0 rows written
[00:00:00.0380038] 10000 rows written
[00:00:00.0550055] 20000 rows written
[00:00:00.0880088] 30000 rows written
[00:00:00.1070107] 40000 rows written
[00:00:00.1900190] 50000 rows written
[00:00:00.2450245] 60000 rows written

This is even faster, though the fact that we are talking about tenths of a second is still good.
Coordinator
May 24, 2013 at 4:06 AM
Hi zosea,

Thank you for your stress test result. The reason that HSSFWorkbook is much faster than XSSFWorkbook is because the internal implement of xls and xlsx. xls is using binary to save data but xlsx uses xml. Binary serialization is always fast enough but xml serialization doesn't. Moreover, we do find that there is big performance issue with XmlSerializer. That's why it looks so slow.
Aug 5, 2013 at 6:59 PM
Why not keep the rows as an enumerated list and then have a render method that does the xml serialization at the end. That should increase performance handsomely.
Coordinator
Nov 17, 2013 at 11:15 PM
Hi zosea,

I suggest you try the latest source code in github. The performance has been improved a lot.
Nov 21, 2013 at 9:39 PM
I still have bad timings using the new OOXML. I did however replace all the SortedDictionary<> in the GetLastKey calls with Dictionary<> and my times improved to this. This probably increases memory load however.

Start time: 11/21/2013 4:34:32 PM
[00:00:00.0850085] 0 rows written
[00:00:00.7930793] 10000 rows written
[00:00:02.5712571] 20000 rows written
[00:00:05.4945494] 30000 rows written
[00:00:09.4819481] 40000 rows written
[00:00:14.7574756] 50000 rows written
[00:00:21.1301128] 60000 rows written
[00:00:28.4608458] 70000 rows written
[00:00:37.0177014] 80000 rows written
[00:00:47.3060936] 90000 rows written
[00:00:58.1362526] 100000 rows written

Granted its slower than HSSF, but not exponentially slower as before.
Coordinator
Nov 24, 2013 at 12:50 PM
Hi All,

I'm glad to say that the performance is greatly improved after changing the way of implementing GetLastKey()

Start time: 2013/11/24 21:47:55
[00:00:00.0439453] 0 rows written
[00:00:00.1054687] 10000 rows written
[00:00:00.1826171] 20000 rows written
[00:00:00.2646484] 30000 rows written
[00:00:00.3271484] 40000 rows written
[00:00:00.4130859] 50000 rows written
[00:00:00.4892578] 60000 rows written
[00:00:00.5498046] 70000 rows written
[00:00:00.6513671] 80000 rows written
[00:00:00.7246093] 90000 rows written
[00:00:00.8037109] 100000 rows written

Here is the result with your program. Don't be surprised :)

This fix will be included in NPOI 2.0 beta 2.
Apr 1, 2015 at 2:06 PM
Hi All,

I have similar issue but with cells not rows. I am adding lot of cells (> 10 000) in a single row. According to performance analysis, I have problem with GetLastKey on XSSFRow.
I am looking into the source code and I think, the same fix (commit 4cdb29409b6a5059c94664a93b442b4912394c89), that was used on XSSFSheet.GetLastKey can be applied also to XSSFRow.GetLastKey.

Is possible to apply such fix?
Thank you, Libor.
Apr 1, 2015 at 3:35 PM
Edited Apr 1, 2015 at 3:36 PM
Update:

I've tried to update the source code and it helped. Data export in my application duration reduced from 7 minutes to several seconds.

XSSFRow.cs:
private SortedList<int, ICell> _cells;
int GetFirstKey(IList<int> keys) {
        return keys[0];
    }
int GetLastKey(IList<int> keys) {
        return keys[keys.Count - 1];
    }
and some type updates, on few places, where it was necessary for compilation (all in XSSFRow.cs).

Libor.