I am trying to undertake some planning for the future and are not too sure
how to go about it
I have allot of data - and its going to grow very rapidly in the coming
years - the core database table relates to electricity usage 15 min data. I
need to get a better understanding of what the ultimate limitations will be
on how the data is currently been managed. Current table size is 85 million
rows.
The data is saved as two tables as follows
Main Table - 18 million rows
CREATE TABLE [datData] (
[ID] [int] IDENTITY (1, 1) NOT NULL ,
[Field_ID] [int] NOT NULL ,
[Date] [datetime] NOT NULL ,
[Value] [decimal](18, 4) NOT NULL CONSTRAINT [DF_datData_Value] DEFAULT (0)
) ON [PRIMARY]
GO
With following indexes
CREATE UNIQUE CLUSTERED INDEX [IX_datData] ON
[dbo].[datData]([Field_ID], [Date]) WITH IGNORE_DUP_KEY , FILLFACTOR = 90
ON [PRIMARY]
GO
CREATE INDEX [datData0] ON [dbo].[datData]([Field_ID]) WITH FILLFACTOR =
90 ON [PRIMARY]
GO
This table is generally accessed using a simple select statement with a
WHERE clause as follows - this accounts for most of the accesses to this
table - these select statements are generally copies into temporary tables
and manipulated further
Field_ID=123 And Date BETWEEN '1-Jan-2004' AND '31-Jan-2004 23:59:59'
It is also not unusual to request max and min dates for a particular Field_ID.
At the moment everything works fine - the queries placed on the data have
not appeared to slow down - everything still appears to be operating as fast
as it was when there was only 5 million rows in the table.
I would like to get a feel for the upper limit of this table. The table
will continue to grow in both number of Field_IDs and the number of rows for
each Field ID - current there are around 5,000 field IDs - that’s an average
of 17,000 rows per field ID - or roughly 170 days of electricity data per
Field_ID.
In the future the number of field IDs could easily grow to over 100,000 and
some of these fields will contain 10 years of data. This would place the
number of rows at 35,040,000,000 or 35 billion - is this a problem?
At what point should I start to remove data from this table into an archive
- will I ever need to do this? Will there come a point when this data set
size becomes un-workable?
Anyone that can help in managing a data set of this size would be much
appreciated
matthew
Hi, Matthew
>From your description, I understand that you use the term "field" to
reffer to a place where electricity is consumed, not to a column (as
some people are used to do). Also, I understand that this table will
store exactly one row for each "field", for any given date (i.e. for
any "field", at a particular Date, there is only one Value). In this
case, I propose to use a table like this:
CREATE TABLE datData (
Field_ID int NOT NULL
CONSTRAINT [FK_datData_Fields]
FOREIGN KEY REFERENCES Fields (Field_ID),
ReadingDate smalldatetime NOT NULL ,
EnergyValue decimal (18, 4) NOT NULL
CONSTRAINT DF_datData_EnergyValue DEFAULT (0),
CONSTRAINT PK_datData PRIMARY KEY (Field_ID, ReadingDate)
)
I have changed the following:
- I change the names of the "Date" and "Value" column, because they
were too vague (moreover, "Date" is a reserved keyword); you should
change the column names to something more appropriate in your
particular case (I was only guessing with these names);
- I added a primary key, to enforce the uniqueness on
Field_ID+ReadingDate; also, this creates a clustered unique index on
these columns. For the specified queries, I think this index is the
most appropriate;
- I added a foreign key to references a "Fields" table, to enforce
referential integrity (I hope you have a "Fields" table);
- I dropped the ID column, because it seems unnecessary;
- I changed the data type of the ReadingDate column to smalldatetime
instead of datetime (because it is stored on 4 bytes instead of 8
bytes); you should do this, only if you do not need to store dates
after June 6, 2079 (I guess you don't need this) and if you do not need
accuracy under one minute (the smalldatetime datatype has an accuracy
of one minute, whereas the datetime datatype has an accuracy of about 3
milliseconds).
Regarding the number of rows in this table, I do not have any
experience with tables with billions of rows. When the performance gets
worse and if more hardware is available, you should try a distributed
partitioned view. See:
http://msdn.microsoft.com/library/en...es_06_17zr.asp
Razvan
Tuesday, March 27, 2012
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment