Distributed databases

A distributed database (BDD, from its Spanish initials) is a set of multiple, logically interrelated databases distributed over a computer network.

Summary


  • 1 Characteristics
  • 2 History
  • 3 Start of distributed databases
  • 4 Evolution
  • 5 Components
    • 5.1 Hardware involved
    • 5.2 Distributed Database Management System Software (DDBMS)
    • 5.3 Distributed Transaction Manager (DTM)
    • 5.4 Database Management System (DBMS)
    • 5.5 Node
  • 6 Important considerations
    • 6.1 Distributed Scheduler
    • 6.2 Deadlock detection and concurrency
      • 6.2.1 Locks
      • 6.2.2 Concurrency
      • 6.2.3 Solutions
      • 6.2.4 Two-phase locking (2PL)
      • 6.2.5 Timestamps
  • 7 Distributed Transaction Manager (DTM)
    • 7.1 Defining transactions
    • 7.2 Transaction properties
    • 7.3 Transaction types
    • 7.4 Handler function
    • 7.5 Data distribution
      • 7.5.1 Replicated
      • 7.5.2 Partitioned
      • 7.5.3 Hybrid
      • 7.5.4 Criteria for choosing the distribution
    • 7.6 Security
      • 7.6.1 Concepts
      • 7.6.2 The inference problem
    • 7.7 Types of architectures / implementations
      • 7.7.1 Multi Distributed Database
      • 7.7.2 Federated Database
      • 7.7.3 Implementation Goals
  • 8 Advantages and disadvantages
    • 8.1 Advantages
    • 8.2 Disadvantages

Characteristics

The databases are distributed among different sites interconnected by a communications network, and each site has autonomous processing capability, which means it can perform local as well as distributed operations. A distributed database system (SBDD, likewise from its Spanish initials) is a system in which multiple database sites are linked by a communications system in such a way that a user at any site can access data anywhere in the network exactly as if the data were stored locally.

History

The need to store large volumes of data led to the creation of database systems. In 1970 Edgar Frank Codd published an article titled “A Relational Model of Data for Large Shared Data Banks”. With this article and later publications, he defined the relational database model and the rules for evaluating a relational database management system.

Start of distributed databases

Originally, information was stored centrally, but over time needs grew and this caused certain problems that could not be solved or handled efficiently in a centralized way. These problems drove the creation of distributed storage, that is, the combination of communication networks and databases, which today provides essential capabilities for managing information.

Evolution

Several factors have made databases evolve into distributed databases. The business world has globalized, and at the same time the operations of companies have become increasingly geographically decentralized. The power of personal computers also increased, so the cost of mainframes was no longer justified. In addition, the growing need to share data has expanded the market for distributed databases.

Components

Hardware involved

The hardware used does not differ much from the hardware used in a normal server. At first it was believed that if the components of a database were specialized they would be more efficient and faster, but it was found that decentralizing everything and adopting a “shared-nothing” approach was cheaper and more effective. So the hardware that makes up a distributed database comes down to servers and the network.

Distributed Database Management System Software (DDBMS)

This system consists of the transaction managers and database managers distributed across the sites. A DDBMS involves a set of programs that operate on various computers; these programs may be subsystems of a single vendor's DDBMS, or the DDBMS could consist of a collection of programs from different sources.

Distributed Transaction Manager (DTM)

This is a program that receives processing requests from query or transaction programs and translates them into actions for the database managers. The DTM is responsible for coordinating and controlling these actions, and it can be an off-the-shelf product or developed in-house.

Database Management System (DBMS)

It is a program that processes a certain portion of the distributed database. It is responsible for retrieving and updating user and general data in accordance with the commands received from the DTM.

Node

A node is a computer running a DTM, a DBMS, or both. A transaction node runs a DTM and a database node runs a DBMS.

Important considerations

Distributed Scheduler

The scheduler is in charge of ordering the set of transactions or operations to be performed on a database. Any order in which this set of operations is executed is called a schedule. Part of the scheduler's job is to order these operations so that the resulting schedule is serializable and recoverable.

Two schedules are equivalent (and a schedule is serializable when it is equivalent to some serial schedule) if:

  • Each read operation reads data values that are produced by the same write operation in both schedules (i.e., the reads see the same writes)
  • The final write operation on each data element is the same in both schedules
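To make these two conditions concrete, here is a minimal Python sketch (an illustration added here, not part of any DBMS; the tuple representation of a schedule is an assumption) that checks both of them for a pair of schedules:

```python
def reads_from(schedule):
    """For each read, record which transaction's write produced the value it sees."""
    last_writer = {}   # item -> transaction that last wrote it (None = initial value)
    seen = {}          # (txn, item) -> how many reads of item txn has done so far
    result = set()
    for txn, op, item in schedule:
        if op == "R":
            k = seen.get((txn, item), 0)
            seen[(txn, item)] = k + 1
            result.add((txn, item, k, last_writer.get(item)))
        elif op == "W":
            last_writer[item] = txn
    return result

def final_writes(schedule):
    """Map each item to the transaction performing its final write."""
    last = {}
    for txn, op, item in schedule:
        if op == "W":
            last[item] = txn
    return last

def equivalent(s1, s2):
    return reads_from(s1) == reads_from(s2) and final_writes(s1) == final_writes(s2)

s1 = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "R", "x"), ("T2", "W", "x")]
s2 = [("T2", "R", "x"), ("T2", "W", "x"), ("T1", "R", "x"), ("T1", "W", "x")]
print(equivalent(s1, s1))  # True
print(equivalent(s1, s2))  # False: T2 reads T1's write in s1 but not in s2
```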

Deadlock detection and concurrency

Locks

A deadlock occurs, in general terms, when an action to be performed is left waiting for an event. To manage deadlocks there are different approaches: prevention, detection, and recovery. It is also necessary to consider factors such as the fact that in some systems allowing a deadlock is unacceptable and catastrophic, while in others deadlock detection is too expensive.

In the specific case of distributed databases that use resource locking, requests to test, set, or release locks require messages between the transaction managers and the scheduler. There are two basic schemes for this:

  • Autonomous: each node is responsible for its own resource locks. A transaction on an item with n replicas requires 5n messages:
    • n resource requests
    • n approvals of the request
    • n transaction messages
    • n successful-transaction acknowledgments
    • n resource release requests
  • Primary copy: a primary node is responsible for all resource locks. A transaction on an item with n copies requires 2n + 3 messages:
    • 1 resource request
    • 1 approval of the request
    • n transaction messages
    • n successful-transaction acknowledgments
    • 1 resource release request
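As a quick illustration of these counts (plain arithmetic, nothing more), the following snippet compares the two schemes as the number of replicas grows:

```python
# Message counts quoted above for a transaction touching an item with n replicas.

def autonomous_messages(n):
    # request + approval + transaction + acknowledgment + release, once per replica
    return 5 * n

def primary_copy_messages(n):
    # 1 request + 1 approval + n transaction messages + n acks + 1 release
    return 1 + 1 + n + n + 1

for n in (1, 3, 10):
    print(n, autonomous_messages(n), primary_copy_messages(n))
# n=1: 5 vs 5; n=3: 15 vs 9; n=10: 50 vs 23 -- primary copy wins as replication grows
```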

Two operations are said to conflict, and the conflict must be resolved, if they access the same data item, at least one of them is a write, and they were issued by different transactions.
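This conflict rule is easy to state as code; the sketch below (with an assumed tuple representation for operations) is one way to write it:

```python
# The conflict rule as code. Operations are tuples (transaction, kind, item),
# with kind "R" for read or "W" for write.

def conflicts(op1, op2):
    txn1, kind1, item1 = op1
    txn2, kind2, item2 = op2
    return item1 == item2 and txn1 != txn2 and "W" in (kind1, kind2)

print(conflicts(("T1", "W", "x"), ("T2", "R", "x")))  # True  (write-read conflict)
print(conflicts(("T1", "R", "x"), ("T2", "R", "x")))  # False (reads never conflict)
```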

Concurrency

The most common example of a deadlock is when a resource A is held by a transaction A that in turn requests a resource B held by a transaction B that is requesting resource A. Among the problems specific to distributed databases we can highlight the following (a small simulation of the first appears after the list):

  • Lost update: two concurrent transactions erase each other's effects
  • Inconsistent retrieval: a transaction reads information partially modified by another transaction
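The lost update is easy to reproduce. The following deterministic simulation (an added illustration; the balances are made up) interleaves two transactions so that the second write erases the first:

```python
# A lost update: two transactions read the same balance before either writes.

balance = 100

t1_read = balance          # T1 reads 100
t2_read = balance          # T2 reads 100, before T1 has written
balance = t1_read + 50     # T1 writes 150
balance = t2_read - 30     # T2 writes 70, silently erasing T1's update

print(balance)  # 70 -- T1's +50 is lost; any serial order would give 120
```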

Solutions

Concurrency control and deadlock detection and handling are heavily studied areas in distributed databases; despite this, there is no universally accepted algorithm for solving the problem. This is due to several factors, of which the following three are considered the most decisive:

  • The data may be duplicated in a BDD; the BDD manager is therefore responsible for locating and updating every copy of the duplicated data.
  • If a node fails, or communication with a node fails while an update is being performed, the manager must ensure that the effects are reflected once the node recovers from the failure.
  • Synchronizing transactions across multiple sites or nodes is difficult, since a node cannot obtain immediate information about actions being performed concurrently on other nodes.

For deadlock control no fully general solution has been developed, and the simplest and most widely used approach is to impose a maximum waiting time (timeout) on lock requests.

Because of these difficulties, more than 20 concurrency control algorithms have been proposed in the past, and yet new ones continue to appear. A literature review shows that most of the algorithms are variants of the 2PL (2-phase locking) or the time-stamp algorithm. These two basic algorithms are explained below.

Two-phase locking (2PL)

The 2PL algorithm uses read and write locks to prevent conflicts between operations. For a transaction T it consists of the following steps:

  • Acquire a read lock on an element L (shared lock)
  • Acquire a write lock on an element E (exclusive lock)
  • Read element L
  • Write to element E
  • Release the lock on L
  • Release the lock on E

The basic rules for handling locks are: different transactions cannot hold conflicting locks on the same element simultaneously (read-write or write-write), and once a transaction releases a lock it cannot request another one. That is, a transaction's set of locks can only grow while it has released none (the growing phase), and after releasing one lock it can only release the others (the shrinking phase).
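The growing/shrinking discipline can be sketched in a few lines. The class below is a minimal illustration only; it omits the conflict checks between transactions that a real lock manager also performs:

```python
# Enforcing the two-phase rule: once a transaction releases any lock,
# it may not acquire new ones.

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False   # becomes True at the first release

    def lock(self, item, mode):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock after releasing (2PL)")
        self.held.add((item, mode))

    def unlock(self, item, mode):
        self.shrinking = True    # the growing phase is over
        self.held.discard((item, mode))

t = TwoPhaseTransaction("T1")
t.lock("L", "shared")       # read lock on element L
t.lock("E", "exclusive")    # write lock on element E
# ... read L, write to E ...
t.unlock("L", "shared")
t.unlock("E", "exclusive")
try:
    t.lock("M", "shared")   # violates the two-phase rule
except RuntimeError as e:
    print(e)                # T1: cannot lock after releasing (2PL)
```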

Examples of the 2PL algorithm are:

  • Basic 2PL: the scheme explained above is followed, with the variant that the write lock is requested for all copies of the element.
  • Primary copy 2PL: instead of requesting a lock on every copy of the item to be written, a lock is requested only on a designated primary copy.
  • Voting 2PL: all nodes are asked to vote on whether the lock is granted.
  • Centralized 2PL: the lock manager is centralized and all lock requests are handled by it.

Before implementing a 2PL concurrency control algorithm it is necessary to consider several factors: what is the smallest atomic unit the system allows to be locked, what is the synchronization interval for all copies of an element, where the tables with lock information are kept, and, lastly, how likely a deadlock is given the factors above.

Timestamps

Each transaction is assigned a unique timestamp at its originating node, and this stamp is attached to every read and write request. A conflict between two write operations trying to access the same element is resolved by serializing them with respect to their stamps. Although several timestamp-based concurrency control algorithms exist, very few are used in commercial applications, largely because they require the distributed system to have synchronized clocks, which is rarely implemented.
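As an illustration of the idea, here is a minimal sketch of basic timestamp ordering; the data structures and names are assumptions, and a real system would restart a rejected transaction with a new stamp:

```python
# Each item remembers the largest stamps that have read and written it;
# an operation carrying an older stamp than a conflicting newer one is rejected.

read_ts, write_ts = {}, {}   # item -> timestamp of the latest read / write

def read(item, ts):
    if ts < write_ts.get(item, 0):
        return False                         # a newer write exists: reject
    read_ts[item] = max(read_ts.get(item, 0), ts)
    return True

def write(item, ts):
    if ts < read_ts.get(item, 0) or ts < write_ts.get(item, 0):
        return False                         # conflicts with a newer read or write
    write_ts[item] = ts
    return True

print(write("x", ts=5))   # True
print(write("x", ts=3))   # False: the older write would be serialized after ts=5
```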

Distributed Transaction Manager (DTM)

Defining transactions

A transaction is a sequence of one or more operations grouped as a unit. The start and end of the transaction define the consistency points of the database. If an action in the transaction cannot be executed, then no action within the sequence that makes up the transaction will take effect.

Transaction properties

  • Atomicity: A transaction is an atomic unit of processing; it is either performed in its entirety or not at all.
  • Consistency: If a transaction is executed in a consistent state, the result will be a new consistent state.
  • Isolation: A transaction will not make its modifications visible to other transactions until it is fully executed. In other words, a transaction does not know if other transactions are running on the system.
  • Durability: Once a transaction runs successfully and makes changes to the system, these changes should never be lost due to system failures.

Transaction types

A transaction can be classified in different ways depending basically on three criteria:

  1. Application areas. First of all, transactions may operate on local data in non-distributed applications, while transactions that operate on distributed data are known as distributed transactions. On the other hand, since the results of a transaction that commits are durable, the only way to undo the effects of a committed transaction is through another transaction; transactions of this type are known as compensating transactions. Finally, heterogeneous environments give rise to heterogeneous transactions over the data.
  2. Duration time. Taking into account the time that elapses from the start of a transaction until a commit is made or aborted, transactions can be batch or online. These can also be differentiated as short and long life transactions. Online transactions are characterized by very short response times and by accessing a relatively small portion of the database. On the other hand, batch transactions take relatively long times and access large portions of the database.
  3. Structure. Considering the structure a transaction can have, two aspects are examined: whether a transaction may contain sub-transactions, and the order of the read and write actions within the transaction.

Handler function

The transaction manager is in charge of defining the structure of transactions, maintaining consistency in the database when a transaction is executed or canceled, maintaining reliability protocols, implementing algorithms for concurrency control, and synchronizing the transactions that run simultaneously.

The handler receives transaction processing requests and translates them into actions for the scheduler.

The COMMIT operation signals the successful completion of the transaction: it tells the transaction manager that a logical unit of work has been completed successfully, that the database is (or should be) in a consistent state again, and that all the modifications made by that unit of work can be made permanent.

The ROLLBACK operation, instead, signals the unsuccessful termination of the transaction: it tells the transaction manager that something went wrong, that the database might be in an inconsistent state, and that all the modifications made so far by the logical unit of work must be rolled back or undone.
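A minimal single-node illustration of COMMIT and ROLLBACK, using Python's built-in sqlite3 module as a local stand-in for the distributed case (the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # one logical unit of work: move 50 from account 1 to account 2
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()        # both updates become permanent together
except sqlite3.Error:
    conn.rollback()      # something went wrong: undo the whole unit of work

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 50), (2, 50)]
```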

Data distribution

One of the most important decisions a distributed database designer must make is the placement of the data in the system and the scheme under which to place it. There are four main alternatives: centralized, replicated, fragmented, and hybrid. The centralized form is very similar to the client/server model in that the BDD is centralized in one place while the users are distributed; this model offers only the advantage of distributed processing, since nothing is gained in terms of availability and reliability of the data.

Replicated

In the replicated scheme, each node keeps its own complete copy of the database. It is easy to see that this scheme has a high storage cost, and because every update must be carried out on all copies, it also has a high write cost. All of this is worthwhile, however, in a system where data is written a few times and read many times, and where data availability and reliability are of utmost importance.
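The read/write asymmetry of full replication can be sketched as follows (illustrative only; the names are assumed, and keeping the copies consistent under failures is not modeled):

```python
replicas = [{"x": 0} for _ in range(3)]   # a full copy of the data at 3 nodes

def read(item, node=0):
    return replicas[node][item]           # any single copy suffices

def write(item, value):
    for copy in replicas:                 # every copy must be updated
        copy[item] = value

write("x", 42)
print(read("x", node=2))   # 42 -- reads are cheap, writes cost one update per node
```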

Partitioned

In this model there is only one copy of each element, but the information is distributed across the nodes: one or more disjoint fragments of the database are hosted on each node. Since the fragments are not replicated, storage cost is lowered, but availability and reliability of the data are sacrificed as well. Something to consider when implementing this model is the granularity of the fragmentation. Fragmentation can be done in three ways:

  • Horizontal: fragments are subsets of a table's rows (analogous to a restrict/selection)
  • Vertical: fragments are subsets of attributes with their values (analogous to a project/projection)
  • Mixed: fragments result from both restricting and projecting a table

A significant advantage of this scheme is that (SQL) queries are also fragmented, so their processing is parallel and more efficient; however, performance is sacrificed in special cases, such as queries that use JOIN or PRODUCT, and in general in cases that involve several fragments of the BDD.

For a fragmentation to be correct it must comply with the following rules:

  • Complete: if a relation R is fragmented into R1, R2, …, Rn, every data element of R must appear in some Ri.
  • Rebuildable: it must be possible to define a relational operation that reconstructs the relation from its fragments.
  • Disjoint: if the fragmentation is horizontal and an element e is in Ri, then e cannot be in any Rk (for k other than i). In the case of vertical fragmentation the primary keys must be repeated, so this condition need only hold for the attributes that are not part of the primary key.
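These three rules can be checked mechanically on a toy relation. In the sketch below (rows as Python dicts; all names are illustrative), horizontal fragments are rebuilt by union and vertical fragments by a join on the primary key:

```python
R = [{"id": 1, "city": "Lima", "sales": 10},
     {"id": 2, "city": "Quito", "sales": 20}]

# Horizontal fragmentation: disjoint subsets of rows (a restrict/selection)
R1 = [r for r in R if r["city"] == "Lima"]
R2 = [r for r in R if r["city"] != "Lima"]
assert sorted(R1 + R2, key=lambda r: r["id"]) == R   # complete, rebuildable by union
assert not [r for r in R1 if r in R2]                # disjoint

# Vertical fragmentation: subsets of attributes, repeating the primary key
V1 = [{"id": r["id"], "city": r["city"]} for r in R]
V2 = [{"id": r["id"], "sales": r["sales"]} for r in R]
rebuilt = [{**a, **b} for a in V1 for b in V2 if a["id"] == b["id"]]  # join on id
assert rebuilt == R                                  # rebuildable by join

print("all three fragmentation rules hold")
```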

Hybrid

This scheme is simply the combination of the partitioned and replicated schemes: the relation is partitioned and, at the same time, the fragments are selectively replicated across the BDD system.

Criteria for choosing the distribution

  • Data location: the data should be placed where it is accessed most often. The designer must analyze the applications and determine how to place the data in such a way that local data accesses are optimized.
  • Reliability of the data: By storing several copies of the data in geographically remote places, it is possible to maximize the probability that the data will be recoverable in the event of physical damage occurring anywhere.
  • Data availability: As with reliability, storing multiple copies ensures that data elements are available to users, even if the node they usually access is unavailable or fails.
  • Storage Capacities and Costs: Although storage costs are not as great as transmission costs, nodes can have different storage and processing capacities. This should be carefully analyzed to determine where to put the data. The cost of storage is significantly decreased by minimizing the number of copies of the data.
  • Processing load distribution: One of the reasons a BDD system is chosen is because you want to be able to distribute the processing load to make it more efficient.
  • Communication cost: The designer should also consider the cost of using network communications to obtain data. Communication costs are minimized when each site has its own copy of the data, on the other hand when the data is updated it must be updated on all nodes.
  • Use of the system: it must be taken into account what will be the main type of use of the BDD system. Factors such as the importance of data availability, writing speed, and the ability to recover from physical damage must be taken into account to choose the correct scheme.

Security

For several years now, databases have been widely used in government departments, commercial companies, banks, hospitals, etc. Currently the scheme under which the databases are used is being changed, they are no longer used only internally, but there are many external accesses of different types. These changes that have been introduced in the use of databases have created the need to improve security practices since the environment is no longer as controlled as the old scheme.

Concepts

The most important security issues are authentication, identification, and enforcement of appropriate access controls. A central model here is the multilevel security system, in which many users with different permission levels work on the same database, whose information is classified at different levels. Two approaches to this model have been investigated in distributed databases: distributed data with centralized control, and distributed data with distributed control.

The distributed-data, centralized-control approach divides into two solutions: partitioned and replicated. In the first, there is a set of nodes, each operating at a certain security level, so a user with permission level X accesses the server that handles data of level X. The replicated solution arose because, if someone with high security rights wanted to consult data with a low security level, they would have to send the request to a server with a low security level, and sensitive information could be disclosed along the way. In the replicated scheme, the data is therefore cascaded so that the highest level holds an entire copy of the database, and the lowest holds only the lowest-level information.

The inference problem

The inference problem consists of users executing queries on the database and inferring, from the legitimate answers the database must return, information they are not authorized to see. Data mining tools make this problem even more dangerous, since they make it easier for even a novice to deduce important patterns and information simply by probing with queries.

Types of architectures / implementations

In a distributed database system, there are several factors to consider that define the system architecture:

  1. Distribution: System components are located on the same computer or not.
  2. Heterogeneity: A system is heterogeneous when there are components that run on different operating systems, from different sources, etc.
  3. Autonomy: It can be presented at different levels, which are described below:
  • Design autonomy: Ability of a system component to decide questions related to its own design.
  • Communication autonomy: Ability of a system component to decide how and when to communicate with other DBMS (Database Management System).
  • Autonomy of execution: Ability of a system component to execute local operations as desired.

Multi Distributed Database

When a distributed database is very homogeneous it is said to be a multi-distributed database.

Federated Database

When a distributed database has a lot of local autonomy, it is said to be federated.

Implementation Goals

When implementing a distributed database there are certain common objectives:

  • Location transparency. It allows users to access the data without knowing its location. This level of transparency can be achieved by using distributed transaction managers, which are able to determine the location of the data and issue actions to the appropriate schedulers; this is possible when the distributed transaction managers have access to directories of data locations.
  • Duplication transparency. For replication transparency to be possible, transaction managers must translate transaction processing requests into actions for the data managers. For reads, the transaction manager selects one of the nodes that stores the data and runs the read there; to optimize this, it needs information about the performance of the various nodes relative to the query site, so it can pick the best-performing one. Updating and writing duplicated data is usually more complicated, since the transaction manager must issue a write action to each of the data managers that stores a copy of the data.
  • Concurrency transparency. When multiple transactions are executed at the same time, the results of the transactions should not be affected. Concurrency transparency is achieved if the results of all concurrent transactions are logically consistent with the results that would have been obtained if the transactions had been executed one by one, in any sequential order.
  • Failure transparency. It means that transactions are processed correctly despite failures. In the face of a failure, transactions must be atomic, meaning that either all of their effects are applied or none are (see the two-phase commit sketch at the end of this section). For this type of problem it is important to keep a backup of the database so that it can be restored when convenient. The system must detect when a locality fails and take the necessary measures to recover from the failure; it must stop using the failed locality and, once that locality is recovered or repaired, provide mechanisms to reintegrate it into the system with a minimum of complications.
  • Processing locality. Data should be distributed as close as possible to the applications that use it, to maximize the locality of processing; this principle aims to minimize remote access to the data. A distribution that maximizes processing locality can be designed by adding up the number of local and remote references corresponding to each candidate fragmentation and choosing the best solution.
  • Configuration independence. Configuration independence makes it possible to add or replace hardware without having to change existing software components of the distributed database system.
  • Database partitioning. The database is distributed so that there is no overlap or duplication of the data held at different locations; since there is no duplication, the costs associated with redundant storage and maintenance are avoided. On the other hand, if the same data segment is needed at more than one location, its availability is limited, and reliability may suffer as well: when the computing system of a locality fails, that locality's data becomes unavailable to users anywhere in the system.
  • Data fragmentation. This consists of subdividing the relations and distributing them among the sites of the network. Its objective is to find ways of dividing the instances (tables) of relations into smaller pieces. Fragmentation can be done by individual tuples (horizontal fragmentation), by individual attributes (vertical fragmentation), or by a combination of both (hybrid fragmentation). The main problem with fragmentation lies in finding the appropriate unit of distribution; a whole relation is not a good unit, for several reasons.

Usually the views of a relation are made up of subsets of the relation. In addition, applications access subsets of relations locally. It is therefore necessary to consider subsets of relations as the distribution unit. Decomposing a relation into fragments, each treated as a distribution unit, permits the concurrent processing of transactions; the parallel execution of a query can then be divided into a series of subqueries that operate on the fragments.

When the views defined over a relation are taken as the distribution unit and located at different places in the network, two alternatives arise: the relation is not replicated and is stored at a single site, or there are replicas at all or some of the sites where the application resides. Mishandling these replicas generates a volume of remote accesses that may be unnecessary; in addition, unnecessary replicas can cause problems when executing updates and may be undesirable when storage space is limited.

Fragmentation also has disadvantages. When the fragments cannot be defined as mutually exclusive, or when the data to be recovered lives in two fragments located at different sites, it is necessary to transmit the data from one site to the other and perform a join operation on them, which can be expensive. Semantic control is another: when the attributes involved in a dependency belong to a relation that has been decomposed into fragments located at different sites, checking the dependency can be very expensive, because a large number of sites may need to be searched.
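The failure-transparency objective above requires an atomic commitment protocol across nodes; the standard technique is two-phase commit. The sketch below is a bare-bones illustration of the idea (assumed class and method names; a real protocol also needs logging and timeout handling, omitted here):

```python
# Two-phase commit: the coordinator collects votes and commits only if all
# participants vote yes.

class Node:
    def __init__(self, healthy=True):
        self.healthy = healthy
    def prepare(self):
        return self.healthy   # vote yes only if the local update can be made durable
    def commit(self):
        pass                  # make the local changes permanent
    def abort(self):
        pass                  # undo the local changes

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # phase 1: voting
    decision = "commit" if all(votes) else "abort"
    for p in participants:                             # phase 2: completion
        p.commit() if decision == "commit" else p.abort()
    return decision

print(two_phase_commit([Node(), Node()]))               # commit
print(two_phase_commit([Node(), Node(healthy=False)]))  # abort
```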

Advantages and disadvantages

Advantages

  • Reflects an organizational structure – the fragments of the database are located in the departments to which they are related.
  • Local autonomy – a department can control the data that belongs to it.
  • Availability – a failure of one part of the system will only affect a fragment, rather than the entire database.
  • Performance – the data is generally located near the site with the highest demand; the systems also work in parallel, which makes it possible to balance the load across the servers.
  • Economy – it is cheaper to create a network of many small computers, than to have a single very powerful computer.
  • Modularity – you can modify, add or remove systems from the distributed database without affecting the other systems (modules).

Disadvantages

  • Complexity – it must be ensured that the database is transparent, and several different systems, each of which can present unique difficulties, have to be dealt with. The design of the database has to take its distributed nature into account; for instance, joins that span several systems cannot be written carelessly.
  • Economy – the complexity and infrastructure required means that more manpower will be needed.
  • Security – you must work on the security of the infrastructure as well as each of the systems.
  • Integrity – Integrity becomes difficult to maintain, applying integrity rules across the network can be very expensive in terms of data transmission.
  • Lack of experience – Distributed databases are a relatively new and rare field, so there is not much staff with adequate experience or knowledge.
  • Lack of standards – There are no tools or methodologies to help users convert a centralized DBMS to a distributed DBMS yet.
  • Database design becomes more complex – in addition to the difficulties generally encountered when designing a database, designing a distributed database must consider fragmentation, replication, and location of fragments at specific sites.

 
