The Big Data Processing Department conducts research on Big Data technologies, solving scientific and technical problems of acquiring, transmitting, distributing, storing, and processing Earth remote sensing (ERS) data, and creating intelligent banks of data and metadata.
The department’s team has many years of experience in the field of information technologies, parallel processing, and distributed storage of ERS data, along with fundamental scientific results obtained in the course of six research projects for the Russian Foundation for Basic Research (RFBR).
At present, the team includes three Doctors of Science and two PhDs, with a total staff of seven employees. The department is headed by S. B. Popov, Doctor of Engineering.
The department’s basic equipment includes hardware and software for processing large structured and unstructured data sets:
- a specialised software and hardware system for storage and analytics of structured data, IBM PureData for Analytics (Netezza), with a capacity of 96 TB (including 4-fold data compression);
- IBM System x servers for distributed storage and analytical processing of unstructured data using the IBM InfoSphere BigInsights software, including a management server IBM x3630 M4 (2x Intel Xeon E5-2450 v2, 96 GB RAM, 2x 600 GB HDD) and four data servers IBM x3630 M4 (2x Intel Xeon E5-2450 v2, 96 GB RAM, 8 TB HDD).
Brief description of tasks and the work plan
At present, the formation and transformation of ERS data streams is characterised by a significant increase in the amount of information and a wider variety of stored data, caused by the use of hyperspectral ERS sensors and by the growing use of unstructured information from sources of various kinds. Current methods of monitoring the Earth's surface, which are intended to use hyperspectral data streams generated on a regular basis with a minimal update period, require qualitative changes in the information technologies for production, transmission, distributed storage, and processing of ERS data. The most promising approach to implementing these changes is Big Data technology. The migration to distributed systems, combining high-performance computing systems of heterogeneous architecture with new-generation data storage methods, will solve the problem of storing, transforming, and analysing large-scale multi-component data streams, along with extracting and formalising knowledge from this entire body of information.
The novelty of the scientific results is largely due to the particular type of the processed space data (digital ERS images and related information), which is characterised by geographic referencing, differences between coordinate systems and their representations, the multitemporal, multizone (multispectral), and multiscale nature of the images, and their large sizes, and hence a huge amount of digital data that causes problems during transmission, storage, etc.
Within the project, the Big Data concept of data storage and processing will be used as the major architectural solution for creating information storages of ERS data. This primarily concerns the choice of a distributed approach to both storage and processing.
The distributed architecture makes the volume of stored information and the performance of the computing subsystems of a single ERS data processing complex scalable, allowing the software and hardware structure to adapt as pre-processing technologies for incoming information change and as the tasks of complex thematic analysis expand.
Big Data technologies expand the range of usable data sources and make it possible to store all raw data, even unstructured data, together with all the associated metadata, relationships, and temporal and spatial markers.
Through subsequent selection, structuring, and aggregation of the data, and by forming a range of complementary and/or alternative mathematical models based on it, the information is gradually transformed into knowledge. Regular reception of ERS data with a minimal update period makes a major contribution to the creation of this knowledge. It is knowledge based on ERS data that improves decision-making processes relevant to society, owing to its structure, its activity, and its ability to be expanded and refined through the accumulation of new facts and the establishment of new relations.
To solve these problems, the project proposes to develop efficient methods for representing and processing ERS data, making them applicable to the hardware and software infrastructure of distributed big data storage and analysis.
Taking into account the significant increase in data volume when using hyperspectral imagery, it is proposed to consider a set of hypercube fragments (instead of a single data hypercube) as the conceptual unit of storage and processing.
The distributed nature of spatial data storage requires the development of effective data decomposition patterns that take into account the technological constraints of the software and hardware used and the information structure of the processing algorithms.
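One possible decomposition pattern of this kind, given here as an illustrative sketch rather than the project's final design, is a regular spatial tiling of the hypercube in which every fragment is padded with a halo of overlapping pixels, so that sliding-window operators can run on a fragment without fetching data from its neighbours. The tile and halo sizes are assumed parameters:

```python
def decompose(rows, cols, tile, halo):
    """Yield fragment descriptors (row0, row1, col0, col1) covering a
    rows x cols hyperspectral scene with square tiles of side `tile`,
    each extended by `halo` pixels of overlap (clipped at the scene
    border) to satisfy local-processing constraints."""
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            yield (max(r0 - halo, 0), min(r0 + tile + halo, rows),
                   max(c0 - halo, 0), min(c0 + tile + halo, cols))
```

The halo width would be chosen from the largest processing window used by the complex, trading a small amount of duplicated storage for fully independent fragment processing.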
The project will study different options for mapping the distributed representation of hyperspectral data onto the Hadoop infrastructure; in particular, the fragments produced by a specific decomposition pattern can serve as storage units in a distributed file system, or as objects in non-relational databases using the key-value model or the column-oriented storage model.
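For the key-value and column-oriented options, one common scheme (sketched here as an assumption, not as the project's chosen design) encodes the scene identifier, band number, and tile coordinates into a fixed-width row key, so that lexicographic key order coincides with spatial order and a range scan retrieves a contiguous strip of tiles:

```python
def fragment_key(scene_id, band, tile_row, tile_col):
    """Compose a row key for a key-value or column-oriented store.
    Zero-padded, fixed-width coordinate fields make lexicographic
    ordering of keys match the spatial ordering of tiles."""
    return f"{scene_id}:{band:03d}:{tile_row:05d}:{tile_col:05d}"
```

The value stored under each key would be the serialised pixel block of the fragment, with its metadata (geographic reference, neighbour list) kept either alongside the value or in a separate column family.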
To implement these options, it is planned to generalise the basic approaches currently used to store spatially ordered data within Big Data technology. It is also proposed to raise the intelligence level of data access by means of a new distributed storage of hyperspectral images, represented as a federation of independent interactive services. Every fragment of a hyperspectral image is accessed as a remote object through its external interface. Such an increase in the level of indirection in data access ensures transparency and compatibility of various storage and usage formats when organising multistage processing within a distributed system built from applications by different manufacturers. The "intelligence" of the storage units and their knowledge of their relations to other fragments of the same image enable decentralised background optimisation of data allocation within the distributed system, as well as mutual matching, geometric correction of the fragments, and coordinate referencing.
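The external interface of such a fragment service might look as follows; the method names and the in-memory stand-in are illustrative assumptions, since the project defines only the principle (fragment as a remote object behind an interface), not a concrete API:

```python
from abc import ABC, abstractmethod

class FragmentService(ABC):
    """Hypothetical external interface of one image fragment in the
    federation: callers see only this interface, never the underlying
    storage format of the fragment."""

    @abstractmethod
    def metadata(self):
        """Geographic reference, band list, and neighbour fragment IDs."""

    @abstractmethod
    def read_window(self, row0, row1, col0, col1):
        """Return the requested spatial window of pixel spectra."""

class InMemoryFragment(FragmentService):
    """Trivial local implementation, standing in for a remote service."""

    def __init__(self, meta, pixels):
        self._meta = meta        # dict: geo reference, neighbours, ...
        self._pixels = pixels    # rows x cols list of per-pixel spectra

    def metadata(self):
        return self._meta

    def read_window(self, row0, row1, col0, col1):
        return [row[col0:col1] for row in self._pixels[row0:row1]]
```

Because neighbour identifiers are part of the metadata, a fragment can locate adjacent fragments on its own, which is what makes decentralised matching and geometric correction across fragment borders possible.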
As an additional line of research, it is planned to develop an entirely new method of hyperspectral data representation that does not rely on ordered scanning: a transition from representing hyperspectral imagery as a single three-dimensional array with an implicit scan order to an unordered set of pixels that explicitly store information about their spatial coordinates. This representation fits the MapReduce computing paradigm in the case of distributed pixel-level processing. However, operations that involve a number of neighbouring points of the underlying surface (the analogue of local image processing based on a sliding window) require the development of fundamentally new algorithms.
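The pixel-record representation and its fit with MapReduce can be illustrated with a minimal, purely local sketch; the brightness-classification job below is invented for illustration (a real job would run inside Hadoop over distributed records):

```python
from itertools import groupby

def to_pixel_records(cube):
    """Flatten a rows x cols x bands hypercube into an unordered set of
    records that carry their spatial coordinates explicitly."""
    return [((r, c), spectrum)
            for r, row in enumerate(cube)
            for c, spectrum in enumerate(row)]

def map_phase(records, mapper):
    """Apply a mapper to each record; each call yields (key, value) pairs."""
    return [kv for rec in records for kv in mapper(rec)]

def reduce_phase(pairs, reducer):
    """Group pairs by key and reduce each group, as MapReduce would."""
    pairs.sort(key=lambda kv: kv[0])
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=lambda kv: kv[0])}

def brightness_mapper(record):
    """Illustrative pixel-level mapper: classify a pixel by its mean
    reflectance (the 0.5 threshold is an arbitrary example value)."""
    (_r, _c), spectrum = record
    mean = sum(spectrum) / len(spectrum)
    yield ("bright" if mean > 0.5 else "dark", 1)
```

Because every record already contains its coordinates, pixel-level jobs like this one need no scan order at all; it is precisely the neighbourhood-dependent operations that fall outside this pattern and motivate the new algorithms mentioned above.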
The general plan of work includes the following steps:
- Step of 2014:
Development of efficient patterns for spatial decomposition of ERS data, including hyperspectral data.
Development and research of a mapping method for distributed representation of hyperspectral data using the Hadoop infrastructure and distributed file system storage.
- Step of 2015:
Development of a method for representation of hyperspectral data without use of an ordered scan.
Development and research of mapping methods for distributed representation of hyperspectral data within the Hadoop infrastructure, with non-relational databases using the key-value model and/or the column-oriented storage model.
Development and research of methods for distributed processing of hyperspectral data within Hadoop.
- Step of 2016:
Development of computational methods for distributed processing of hyperspectral data without use of an ordered scan.
Creation of a distributed storage of hyperspectral images in the form of a federation of independent interactive services.
Creation of intelligent banks of data and metadata of space images using Big Data technology.