The aim of this report is to examine data warehousing, critically evaluating dimensions and facts as the foundation of the warehouse. A data warehouse is a large collection of data gathered from multiple sources, then filtered and refined into a form suitable for the end user. The first section critically evaluates the dimensional model: how it works and what is good and bad about this type of data model.
Secondly, there is a section on database administration, specifically backup, restore, and recovery, and how they are applied within a database.
Thirdly, a critical analysis is carried out of security and related components, including user accounts. This considers whether this type of system is resilient to threats, what the biggest threat to any form of database is, and whether there is more than one. Lastly, is there a possible means to reduce the significance of potential threats?
This report has the following layout.
- 1.1 Executive summary
- 2.1 Critical evaluation of the data warehouse dimensional model
- 3.1 Database administration: backup, restore, and recovery
- 4.1 A critical analysis of security and user accounts
The aim of data warehousing is to let decision makers analyse data quickly, and it is seen as a form of data extraction according to Kimball (1998). He goes on to say that the aim of the data warehouse, especially the multidimensional model, is to generate database structures that can be easily understood by any user, enabling them to create and truly customise their search queries. Kimball (1998) also mentions that another aim is to maximise research efficiency by reducing database complexity.
Trujillo et al. (2001) suggest that most developers agree that data warehouse, multidimensional database (MDB), and online analytical processing (OLAP) applications focus on a version of dimensional modelling called multidimensional modelling, which offers two possible benefits.
Breslin (2004) examined the work of both Kimball and Inmon on data warehousing and found that Inmon was prevalent in the development and understanding of relational databases that use normalisation and Entity Relationship Diagrams (ERDs), while Kimball was a significant contributor to the dimensional model. Dimensional modelling begins with tables, as opposed to entity-attribute data models such as the ERD. The tables within the dimensional model use a combination of facts and dimensions.
Fact tables contain elements that have a numerical value, such as a count or a date/year, while dimension tables contain the relevant attributes of the measurable units in the fact tables. An example of this fact/dimension model is given by Morgan (2019); for reference see Appendix figure 2. There, ten questions were created to serve as the foundation for the dimensional database. The majority of the questions asked how many students passed, which is the fact part and was highlighted red, while the module or course per year was the dimension, or condition, and was highlighted green.
A notable difference between fact/dimension tables and relational tables is that the former often contain repeating data, which breaks relational database normalisation rules, particularly second and third normal form. However, dimensional modelling deliberately breaks normalisation rules in order to achieve a high level of performance in the data warehouse while keeping it end-user friendly.
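The fact/dimension split described above can be sketched with a small in-memory database. This is a minimal illustration, assuming hypothetical table and column names (`dim_course`, `fact_results`) loosely modelled on the student-pass questions in the appendix, not the actual schema from the assignment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes of the measurable units.
cur.execute("""
    CREATE TABLE dim_course (
        course_id   INTEGER PRIMARY KEY,
        course_name TEXT,
        department  TEXT
    )""")

# Fact table: numeric measures, keyed by dimension foreign keys.
# Storing the year directly in every fact row repeats data and breaks
# strict normalisation, trading redundancy for simpler analytical queries.
cur.execute("""
    CREATE TABLE fact_results (
        course_id       INTEGER REFERENCES dim_course(course_id),
        year            INTEGER,
        students_passed INTEGER
    )""")

cur.execute("INSERT INTO dim_course VALUES (1, 'Computing', 'Science')")
cur.executemany("INSERT INTO fact_results VALUES (?, ?, ?)",
                [(1, 2018, 40), (1, 2019, 55)])

# "How many students passed per course per year?" — fact summed,
# dimension used to group and label the result.
cur.execute("""
    SELECT d.course_name, f.year, SUM(f.students_passed)
    FROM fact_results f JOIN dim_course d USING (course_id)
    GROUP BY d.course_name, f.year
    ORDER BY f.year""")
rows = cur.fetchall()
print(rows)  # [('Computing', 2018, 40), ('Computing', 2019, 55)]
```

The query mirrors the highlighted questions: the `SUM` is the fact (red) and the `GROUP BY` columns are the dimensions (green).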
Trujillo et al. (2001) believe that the multidimensional model relates to how data analysts think and make decisions; in theory, this helps users understand data in a more simplified way. However, more needs to be understood about how data analysts actually think and assess data, and pursuing that would likely shift the focus away from data warehousing and towards psychology, a point only briefly mentioned at the start of their paper. Moreover, people think differently, so is it possible to cover every perspective?
The second point is that this approach is meant to help predict what end users want to do, in theory facilitating performance improvements. This is not guaranteed, however, if you do not know what end users want to achieve with particular data; it assumes that whoever is using the dimensional model has clearly defined intentions for what they want to find out, so that the data has context and purpose. For example, a person's name alone means little, since there are many names, but adding attributes such as age and gender makes that name an identifiable person and gives it context.
As edureka (2017) mentions, an advantage of this method is that strategic questions can be addressed by studying historical trends. This usually requires data collected over a long period, typically two years or more, because a shorter span is not enough to analyse thoroughly, given that unexpected events can happen at any time.
Simitsis et al. (2005) define the method for manipulating data within a data warehouse, commonly known as Extraction, Transformation, Loading (ETL). Simitsis et al. (2005, pp. 564-575) describe the aim and purpose of these tools and the process as follows:
- “The identification of appropriate information from the sources;
- The extraction of this information;
- The customization and integration of the incoming information from multiple sources into a common format;
- The cleaning and refinement of the resulting data, while operating within the imposed database and business rules.”
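The four steps above can be sketched as a toy ETL pass. All names and the sample records here are illustrative inventions, not part of any cited tool:

```python
# Toy ETL pass over the four steps: identify/extract relevant fields,
# integrate into a common format, then clean against business rules.

raw_sources = [
    {"name": "alice", "score": "72"},   # source A stores scores as strings
    {"NAME": "BOB", "SCORE": 58},       # source B uses different key casing
    {"name": "carol", "score": None},   # an incomplete record
]

def extract(sources):
    """Steps 1-2: identify and pull the relevant fields from each source."""
    for rec in sources:
        lowered = {k.lower(): v for k, v in rec.items()}
        yield lowered.get("name"), lowered.get("score")

def transform(records):
    """Step 3: integrate into a common format (lower-case name, int score)."""
    for name, score in records:
        if name is None or score is None:
            continue  # step 4: drop records violating the business rules
        yield {"name": str(name).lower(), "score": int(score)}

def load(rows):
    """Load the cleaned rows into the warehouse staging area (a list here)."""
    return list(rows)

warehouse = load(transform(extract(raw_sources)))
print(warehouse)
# [{'name': 'alice', 'score': 72}, {'name': 'bob', 'score': 58}]
```

Note how the incomplete record is filtered out during cleaning, matching the refinement step in the quoted list.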
Figure 1, from Morgan (2019) and edureka (2017), shows the basic concept of the ETL process. It can be broken down into query and result: the query parts are circled in red and the result in blue, showing that the data is initially gathered from numerous sources and operational systems.
The data is refined, and any irregularities or discrepancies are removed. Once refined, it is stored in a way that enables easy access and increases the chances of the end user understanding the results gathered within the data warehouse.
Figure 1 Edureka. (2017). Data Warehouse Tutorial For Beginners | Available: https://www.youtube.com/watch?v=J326LIUrZM8. Last accessed 09/06/2019.
Edureka (2017) also mentions that a data warehouse maintains historical data, predominantly anything a month old or more, whereas transactional systems usually focus on recent data, such as the last day, week, or month. A data warehouse can store years' worth of data. Historical data is needed by key decision makers to detect deviations, trends, and long-term relationships, to predict future activity, and to give them the means to make decisions based on this information.
Sherman (2015) describes notable aspects of dimensional modelling such as facts, dimensions, and attributes. One noticeable thing is that the fact/dimension model contains more repetitive data. In Morgan (2019), Assignment 2, both relational and dimensional designs were used in the creation of a database.
Figure 2 shows an example of facts and dimensions in use. In the question, the part highlighted red was the fact, for example "How many students have passed", while the dimension, highlighted green, was "their course overall per year", so the course and year were identified as dimensions.
Figure 3 (see appendix) shows an example of a relational database normalised to third normal form, identifying the primary and foreign keys of all tables and the attributes within them, as well as the data types used for each table.
Facts, dimensions, and attributes can be organised into schemas; the star and snowflake schemas are the two most commonly used in database design.
The type of schema can also be a determining factor depending on variables such as the type of reporting that the model needs to facilitate and the type of Business Intelligence (BI) tool that will be used.
Building a dimensional model also includes additional pieces such as calendar and time dimensions. As an example (see appendix figure 4, from Technologies (2016)), the star schema keeps descriptive attributes within the fact and dimension tables themselves, while the snowflake design applies more relational database principles, normalising the dimensions and relating them through primary and foreign keys.
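The star/snowflake contrast can be shown with two hypothetical course dimensions (all table and column names here are invented for illustration, not taken from figure 4): the star schema holds department attributes in the one wide dimension table, while the snowflake splits them into a separate table reached by a foreign key.

```python
# Star schema: one wide dimension table; department details repeat per course.
star_dim_course = [
    {"course_id": 1, "course_name": "Computing",
     "dept_name": "Science", "dept_head": "Dr Smith"},
]

# Snowflake schema: the department is normalised out into its own table.
snowflake_dim_course = [
    {"course_id": 1, "course_name": "Computing", "dept_id": 10},
]
snowflake_dim_department = [
    {"dept_id": 10, "dept_name": "Science", "dept_head": "Dr Smith"},
]

def course_with_dept(course_id):
    """A snowflake query needs an extra join to recover the department name."""
    course = next(c for c in snowflake_dim_course
                  if c["course_id"] == course_id)
    dept = next(d for d in snowflake_dim_department
                if d["dept_id"] == course["dept_id"])
    return {**course, "dept_name": dept["dept_name"]}

print(course_with_dept(1)["dept_name"])  # Science
```

The trade-off is the one the section describes: the star repeats data but answers queries without the extra join; the snowflake saves space at the cost of more relational complexity.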
In database management and administration there is a need for the means to back up, restore, and recover data in the event of unexpected circumstances or possible disasters.
Let's Learn (2016) describe a possible solution for protecting and managing data: a storage method known as RAID (Redundant Array of Independent Disks), which uses multiple disk drives acting as one drive and can provide some degree of resiliency depending on the RAID level in question.
RAID 0 offers no protection, in that the data is not mirrored or duplicated: there is only one copy of the data, so if any drive is faulty or damaged in any way, data loss is almost guaranteed, with no chance to restore or recover it.
Let's Learn (2016) also cover other RAID levels. RAID 1 improves on RAID 0 in that the data is duplicated, offering redundancy: in the event that one drive fails, the data can be recovered from the second. This does affect read and write speed. Reads are faster, since data can be read from either drive, while writes are no faster than a single drive, because the data must be written and saved to both drives in parallel, i.e. saved to both at the same time.
PowerCert (2015) also covers RAID 0 and 1, then goes on to RAID 5 and 10. Data in a RAID 5 configuration is not duplicated but striped across multiple drives with parity. PowerCert notes that "Parity is used to rebuild data in the event of disk failure": the data is shared equally over, say, four drives, and if one drive goes down, the remaining three can be used to restore the data (see appendix figure 5). A minimum of three drives is needed, and parity does consume a significant amount of storage capacity. RAID 10 is a combination of RAID 1 and 0: it shares the drive duplication of RAID 1 with the striping speed of RAID 0.
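The parity rebuild that PowerCert describes is, at its core, an XOR over the data blocks: XORing all surviving blocks with the parity block reproduces the lost one. A minimal sketch (the byte values are arbitrary examples):

```python
# RAID 5 parity sketch: the parity block is the XOR of the data blocks,
# so any single lost block can be rebuilt from the survivors.
from functools import reduce

data_blocks = [b"\x01\x02", b"\x0f\x10", b"\xaa\x55"]  # three "data drives"

def xor_blocks(blocks):
    """Byte-wise XOR across a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data_blocks)  # stored on the parity stripe

# Simulate losing drive 1 and rebuilding it from the rest plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])  # True
```

This is also why parity "consumes" capacity: one drive's worth of space holds the XOR rather than user data, and why only a single simultaneous drive failure is survivable at this level.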
From a storage perspective, shared disk storage has both advantages and disadvantages in relation to shared parallel systems, according to Furtado (2009). In systems like a data warehouse, the disk storage is shared: this has a potential advantage over individual systems in avoiding the need to assign and store specific parts of the data in each section, as the data is all held within one disk subsystem that both shares its resources and incorporates a data controller such as RAID.
On the other hand, this can also be a disadvantage with regard to possible limitations in storage scalability: as all the main processing units share the same storage system, significant restrictions can be imposed on anyone looking to increase or decrease the size of the system in question.
For some time, IT professionals have regarded backups as a necessary part of their maintenance operations. PowerCert (2019) describes in detail the types of backup and where they would be used: full, incremental, and differential.
A full backup is a large-scale backup that backs up all data and can restore all of it within one session. However, if a full backup is being carried out for a very large organisation on a multinational or international scale, it would not necessarily be the most efficient option, as recreating a full copy of all data takes considerable time. PowerCert and many other authors recommend doing this once a week, usually at the start of the week.
An incremental backup is a smaller-scale backup that can be carried out more frequently and covers any changes since the last full or incremental backup. Incremental backups can be carried out daily.
A differential backup covers all data that has changed since the last full backup. In the PowerCert (2019) example (see figure 6, appendix), a full backup is done on Monday; Tuesday's backup then contains only Tuesday's changes, Wednesday's contains both Tuesday's and Wednesday's, and by Friday it contains the changes from Tuesday through Friday.
After data has been backed up, in the event of data loss or an unexpected fault it then needs to be restored. Similarly to the previous section, there is a range of restorations, each directly connected to its respective backup type. PowerCert (2019) describes these below.
A full restoration can be done from one tape in one session and is the fastest, as only the full backup is needed to restore all data in the event of a significant fault or error.
Restoring from incremental backups seems faster initially, as each backup might only hold a day's data. However, for a large-scale restoration, incremental restoration takes the longest: the full backup is restored first, and then all the incremental backups must be restored in the correct order by day and time.
Restoring from a differential backup needs only the last full backup and the last differential backup. This type therefore sits between full and incremental, for both backup and restoration: like the incremental, it needs the full backup first, but only the most recent differential is required, not the previous ones, since each differential already contains all changes since the full backup.
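The restore-time difference between the two strategies can be made concrete by listing which backup sets each one needs. This sketch assumes the Monday-full, daily-backup schedule from the PowerCert example; the labels are illustrative:

```python
# Which backup sets are needed to restore after a failure on a given day,
# assuming a full backup on Monday and a daily incremental or differential.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def restore_incremental(fail_day):
    """Full backup first, then every incremental up to the failure, in order."""
    idx = days.index(fail_day)
    return ["full(Mon)"] + [f"incr({d})" for d in days[1:idx + 1]]

def restore_differential(fail_day):
    """Full backup plus only the most recent differential."""
    if fail_day == "Mon":
        return ["full(Mon)"]
    return ["full(Mon)", f"diff({fail_day})"]

print(restore_incremental("Fri"))
# ['full(Mon)', 'incr(Tue)', 'incr(Wed)', 'incr(Thu)', 'incr(Fri)']
print(restore_differential("Fri"))
# ['full(Mon)', 'diff(Fri)']
```

The trade-off is visible directly: incrementals are smaller to take but need a chain of restores in order, while differentials grow through the week but always restore from just two sets.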
Carrying out backups at both small and large scale on a regular basis enables the data within a database or company to be recovered or restored to normal without loss. Gotseva et al. (2011) go on to outline the types of recovery model and what they do: full and simple.
Full recovery model: the database and logs, such as event and transaction logs, are backed up. Data can be reconstructed to any moment in time, in conjunction with a copy of a full backup.
Simple recovery model: the transaction log backup is not created; only a full or differential backup is available. This is used for databases or files that need not be regularly managed or updated and that could acceptably be lost in the event of data loss.
Gotseva et al. (2011) then propose that a recovery plan should combine a weekly full backup with daily differentials, to maximise coverage and minimise the chance of data loss or other unexpected errors. Their recommendation seems reasonable, as it should maximise operation time while making allowances for maintenance tasks and providing a recovery plan for unexpected circumstances.
The security of any system is a contested and often debated topic, whether in IT generally or in a specific context such as online databases.
Singh et al. (2016) present an image representing threats to databases in 2012, in which Structured Query Language (SQL) injection was the most common at 42%, so nearly half of the threats came from this alone; the next category, unknown or other threats, was just under 25%. This shows that SQL injection is a significant issue. However, there was no mention of where the figure was taken from, so there is some question as to how valid it is for the topic of security threats to databases and data warehousing as a whole.
Figure 7 Singh et al “SQL injection: Types, methodology, attack queries and prevention,”New Delhi, 2016, pp. 2872-2876.
Halfond et al. (2006) look at the topic of security within databases and identify SQL injection as one of, if not the, most serious threats to databases and web applications. Any platform that requires any form of data verification is vulnerable to SQL injection.
Halfond et al. (2006) reference the Gartner Group, whose study of over 300 websites and online services showed that most of them could be vulnerable to SQL injection attacks (SQLIAs). SQL injection refers to a class of code-injection attacks in which data provided by the user is included in an SQL query in such a way that part of the user's input is treated as SQL code, allowing an attacker to submit SQL commands directly to the database.
Figure 8: Singh et al., "SQL injection: Types, methodology." (2016) pp. 2872-2876.
SQL injection is a process whereby a user injects query language into a database to gain access to a system, database, or personal information.
The example above (see figure 8) shows that information about users, the database name, and the operating system level is gathered first. Next, the input fields are attacked by adding characters such as ', ; and OR to force an error. The third part is where the actual injection occurs, extracting the information the attacker needs to gain access to the OS. Once in the OS, the attacker has access to the system files and the ability to execute commands, causing insurmountable damage.
The cause of SQL injection vulnerabilities is believed to be relatively simple and well understood: insufficient validation of user input. To address this, developers have proposed a range of coding guidelines that promote defensive coding practices. The question is: if it is so simple to address, why has it not been addressed, and why does it remain one of the biggest threats?
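The insufficient-validation point can be demonstrated in a few lines. This is a generic sketch using Python's built-in sqlite3 module with an invented `users` table, not code from any of the cited papers; the defensive practice shown, parameterised queries, is one of the standard countermeasures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (name TEXT, secret TEXT)")
cur.execute("INSERT INTO users VALUES ('alice', 's3cret')")

malicious = "nobody' OR '1'='1"

# Vulnerable: user input is concatenated straight into the SQL text, so the
# OR '1'='1' fragment becomes part of the query and matches every row.
cur.execute("SELECT * FROM users WHERE name = '" + malicious + "'")
leaked = cur.fetchall()
print(leaked)  # [('alice', 's3cret')] -- every row leaked

# Defensive: a parameterised query treats the input as a plain value,
# never as SQL code, so the injection attempt matches nothing.
cur.execute("SELECT * FROM users WHERE name = ?", (malicious,))
safe = cur.fetchall()
print(safe)  # [] -- no user is literally named "nobody' OR '1'='1"
```

The contrast illustrates why the guidelines exist and also hints at the answer to the question above: the fix is simple per query, but it must be applied consistently across every input point, including legacy code.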
Tutorialspoint (2019) raise another security problem: cookies are known to be a further possible threat, in that fragments of historical data are kept on the system, which can indirectly pass information to third parties and as a result provide another way for unauthorised users to gain access.
From a security perspective, cookies on websites and other online platforms can be seen as another potential threat, yet many websites ask users to accept them before going further. Cookies have both good and bad points: on the plus side, they give users a convenient means to return to previously viewed pages; on the minus side, they can pass information on to unknown parties and provide another opportunity for attacks such as SQL injection.
From a user's point of view, ordinary users do not have the same access to information as an administrator, yet they disclose personal details that are held in these databases, which seems a risky place while this threat remains.
It has been shown that prior insight, or a clear idea of what you want from your database or data warehouse, is essential if decision makers are to identify trends or patterns and make decisions going forward while remaining aware of any threats to achieving the desired outcome; otherwise the data is as good as useless.
With regard to the security of a database or any web application that requires verification of specified data (for example, logging in to an account such as online banking or email), SQL injection is believed to be the biggest threat, and a possible solution lies in implementing more stringent forms of data verification. Reviewing legacy code could also help, although it can be rather time consuming, so a recommendation is to reduce the amount of legacy code where possible to reduce the chances of being hacked through this particular avenue. Some security is already in place to prevent end users editing a system or database externally, but there will always be attempts to gain unauthorised access. Another significant threat is the category of unknown or other threats, as these are of an uncertain nature. A way to minimise or mitigate the possible threats is to be more proactive and, where possible, attack these weaknesses.
- Breslin.M. (2004). Data Warehousing Battle of the Giants: Comparing the Basics of the Kimball and Inmon Models:. Business Intelligence Journal. P6-20.
- Edureka. (2017). Data Warehouse Tutorial For Beginners | Data Warehouse Concepts: Data Warehousing . Available: https://www.youtube.com/watch?v=J326LIUrZM8. Last accessed 09/06/2019.
- Furtado, P. (2009) “A Survey of Parallel and Distributed Data Warehouses,” International Journal of Data Warehousing and Mining, 5(2), pp. 57–77. doi: 10.4018/jdwm.2009040103.
- Gotseva.D, Gancheva.V, Georgiev.I (2011) (PDF) DATABASE BACKUP STRATEGIES AND RECOVERY MODELS. Available from:https://www.researchgate.net/publication/331160799_DATABASE_BACKUP_STRATEGIES_AND_RECOVERY_MODELS accessed Aug 03 2019.
- Halfond W.G.J, Viegas.J and Orso.A. (2006). A Classification of SQL Injection Attacks and Countermeasures. Available at: https://moodle.glos.ac.uk/moodle/pluginfile.php/726001/mod_resource/content/1/halfond.viegas.orso.ISSSE06.pdf. Last accessed 02/06/19
- Ito.A (2001) Database system with backup and recovery mechanisms. US7031986B2
- Kimball, R. (1998). The Data Warehouse Lifecycle Toolkit. New York: Wiley Computer Publishing.
- Let’s Learn. (2016). RAID levels. Available at: https://www.youtube.com/watch?v=KmbEjYVtByQ&t=328s. Last accessed 01/08/2019
- Morgan, S., (2019). Assignment 1 on Moodle – CT6013 Advanced Database Systems, University of Gloucestershire.
- Morgan, S., (2019). Assignment 2 on Moodle – CT6013 Advanced Database Systems, University of Gloucestershire.
- PowerCert Animated Videos. (2015). What is RAID 0, 1, 5, & 10?. Available: https://www.youtube.com/watch?v=U-OCdTeZLac. Last accessed 05/08/2019.
- PowerCert Animated Videos. (2019). Incremental vs Differential vs Full Backup Explained. Available: https://www.youtube.com/watch?v=o-83E6levzM. Last accessed 05/08/2019
- Singh N., Dayal M., Raw R. S. and Kumar S., “SQL injection: Types, methodology, attack queries and prevention,” 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, 2016, pp. 2872-2876.
- Simitsis A., Vassiliadis P. and Sellis T., “Optimizing ETL processes in data warehouses,” 21st International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 2005, pp. 564-575. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1410172&isnumber=30564 last accessed on 07/06/2019
- Sherman.S. (2015). Chapter 9 – Dimensional Modeling. In: Business Intelligence Guidebook. Massachusetts, America: Morgan Kaufmann. Page 513-525
- Tutorial point. (2019). Data Warehousing – Overview. Available: https://www.tutorialspoint.com/dwh/dwh_overview.htm. Last accessed 01/06/19.
- Trujillo J., Palomar M., Gomez J. and Song I.-Y., “Designing data warehouses with OO conceptual models,” in Computer, vol. 34, no. 12, pp. 66-75, Dec. 2001.
- Wong, D M , Lei, C H (2003) “Partitioned access control to a database” United States Oracle Corporation (Redwood Shores, CA) URL: http://www.freepatentsonline.com/6578037.html Last accessed on 08/06/2019
Figure 1 Edureka. (2017). Data Warehouse Tutorial For Beginners | Data Warehouse Concepts: Data Warehousing . Available: https://www.youtube.com/watch?v=J326LIUrZM8. Last accessed 09/06/2019.
Figure 2: Morgan, S., (2019). Assignment 2 on Moodle – CT6013 Advanced Database Systems, University of Gloucestershire.
Red = Fact
Green = Dimension
How many students have passed their course overall per year
How many students per campus per year?
How many students from all courses passed in a given year?
- Departmental Heads
How many students on a course per year?
How many Students in a department per year?
How many modules being ran per campus per year?
- Academic Subject Leaders
What percentage of students passed per module per year?
- Module Leaders
How many students on a particular module per year?
- Accommodation Manager
How many students per accommodation per year
How many students per accommodation per campus per year.
Figure 3. Morgan, S., (2019). Assignment 2 on Moodle – CT6013 Advanced Database Systems, University of Gloucestershire.
Figure 4: Technologies. (2016). Explain Star Schema & Snow Flake Design. Available: https://www.youtube.com/watch?v=KUwOcip7Zzc. Last accessed 06/08/2019.
Star design Snowflake design
Figure 5 PowerCert Animated Videos. (2015). What is RAID 0, 1, 5, & 10?. Available: https://www.youtube.com/watch?v=U-OCdTeZLac. Last accessed 05/08/2019.
Figure 6: Differential Back up from PowerCert Animated Videos. (2019). Incremental vs Differential vs Full Backup Explained. Available: https://www.youtube.com/watch?v=o-83E6levzM. Last accessed 05/08/2019
Figure 7: N. Singh, M. Dayal, R. S. Raw and S. Kumar, “SQL injection: Types, methodology, attack queries and prevention,” 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, 2016, pp. 2872-2876.
Figure 8: N. Singh, M. Dayal, R. S. Raw and S. Kumar, “SQL injection: Types, methodology, attack queries and prevention,” 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, 2016, pp. 2872-2876.