.netsystem.data.datatable

Datatable Performance in C#.Net


  public DataTable FetchData(string sQuery)
  {
    DataTable dtable = new DataTable();           

   using (SqlConnection conn = new SqlConnection(conString))
   {
      conn.Open();

      using (SqlCommand sqlCmd = new SqlCommand(sQuery, conn))
      {
        SqlDataReader sdRead = sqlCmd.ExecuteReader();
        dtable.Load(sdRead);
      }
    }

return dtable;
}


 Datatable dt = FetchData(string sQuery);
 foreach(DataRow row in table.Rows)
  ClassA obj = new ClassA(row);

 // Some manipulations
  //.....

Class A
{

  int id;
  int name;

 A(DataRow dr)
 {

  id = dr["ID"];
  name = dr["Name"];

}
}

I need to retrieve nearly 15,00,000 rows from the database.

I need suggestions for two scenarios.

  1. I call the above method 1-5 times so obviously it creates 1-5 connections. If it turns to 10-20 times wt about performance? (or create one global connection and open 1 connection and process all then close at the end.)

  2. What about datatable? Any alternative. I think I need disconnected architecture here for this much of rows. I need to stuff my own class objects with the retrieved data (Or Iterate datareader and stuff the List<objects> inside the FetchData()).

Any suggestions?


Solution

  • First you could try to pre-process as many things as possible within the database (that is what they are good at.).

    Then you will have to scale down the data you fetch and parallelize operations. Let me explain some of the problems here :

    1. You try to load a huge amount of data through one network pipe
    2. Then You try to store all the huge amount of data in main memory
    3. Then after loading and allocating everything, you do some computation on the data

    One easier step would be to partition your data in smaller sets (200 rows maybe). Then you will run many threads in parallel, (one connection by thread) and every thread will fetch a small amount of rows, put them in memory and compute whatever you want (then release unused memory which will be proportional with number_of_workers x rows_loaded_by_worker).

    Next you finetune the process by playing on the number of rows fetched (100 - 10000) and the number of parallel workers.

    Note that your SQL Query needs to be efficient for retrieving many smaller sets of data (i.e. use EXPLAIN, just to be sure there is no table scan involved, otherwise such an approch would fail).