GPU Accelerated Parallel Heuristic for the 2D Knapsack Problem
Abstract: The 2D knapsack problem is a typical NP-hard problem in combinatorial optimization: easy to describe but hard to solve. Given a set of rectangular pieces and a rectangular container, the two-dimensional knapsack problem (2DKP) consists of orthogonally packing a subset of the pieces within the container such that the sum of the values of the packed pieces is maximized. If the value of a piece is given by its area, the objective is to maximize the covered area of the container. Solving large instances of this problem remains very time-consuming on CPUs. As graphics processing units (GPUs) have evolved to support general-purpose computing, they have become major accelerators in scientific and industrial computing. In this paper, we present a parallel heuristic efficiently accelerated on GPUs; the proposed algorithm provides higher-quality solutions to the 2DKP within a reasonable computational time.
Keywords: GPU, parallel heuristics, combinatorial optimization, algorithms, 2D Knapsack Problem, local search, greedy algorithm
INTRODUCTION
The two-dimensional knapsack problem (2DKP) with rectangular pieces is a well-studied and popular class of combinatorial optimization problems because it has many real-world applications in the paper, glass, wood and metal industries. This paper deals with the 2DKP given a set of small rectangular pieces (R1, R2, …, Rn), where Ri = (wi, hi), and a larger rectangle, known as the container. The task is to find a feasible arrangement of a subset of the pieces in the container that maximizes the total value of the packed pieces. If the value of a piece is given by its area, the aim is to maximize the covered area of the container. An arrangement of pieces, also known as a packing plan, is feasible if each piece is placed orthogonally (i.e. parallel to the container edges), lies completely inside the container, and no two placed pieces overlap.
This problem has several real-world applications and is proven to be NP-hard. 2D packing problems arise, for example, when goods have to be packed on pallets in horizontal layers. The space-saving arrangement of adverts on the pages of newspapers, and the effective positioning of components on chips when designing integrated circuits, also lead to 2D packing problems.
Solving combinatorial optimization problems is complex and time-consuming for CPUs, especially for large-scale problems like the 2DKP. Metaheuristic methods can be used to obtain satisfactory (approximately optimal) solutions in a reasonable time. Local search metaheuristics (LSMs) are among the most widely researched single-solution-based approaches, with many variants such as iterated local search, hill climbing, tabu search and simulated annealing; population-based methods such as genetic algorithms complement them. Parallelism is a way not only to reduce the time complexity but also to improve the quality of the solutions produced.
Today, graphics processing units (GPUs) have evolved from fixed-function rendering devices into programmable, parallel processors. Market demand for real-time, high-definition 3D graphics has driven GPUs to become highly parallel, multithreaded, many-core processors with tremendous computational power and high-bandwidth memory. The GPU architecture is therefore designed so that more transistors are devoted to data processing than to data caching and flow control. GPU-accelerated heuristics can be more computationally efficient than CPU-based LSMs; moreover, optimizing LSM algorithms for GPUs to achieve the best efficiency is itself an important problem.
In this paper, we propose an optimized parallel heuristic on GPUs and test it on a typical 2DKP instance. Our GPU-based approach, which incorporates a genetic algorithm, a local search method and a greedy algorithm, is proposed as an effective parallel heuristic for finding improved solutions to the 2DKP.
978-1-5386-7693-6/18/$31.00 ©2018 IEEE
BACKGROUND
1) Genetic Algorithm: The genetic algorithm [7] is an optimization technique based on natural evolution that incorporates the idea of survival of the fittest into a search algorithm. In nature, the fittest individuals are more likely to survive and to produce better-adapted offspring, which will be fitter than their parents. The same idea is applied by creating a new generation of solutions that are closer to the desired solution. A genetic algorithm consists of these steps: 1) encoding, 2) evaluation, 3) selection, 4) crossover, 5) mutation, 6) decoding. The basic genetic algorithm is as follows:
——————————————————————
Basic Genetic Algorithm:
——————————————————————
Step 1: Initialize population with random solutions.
Step 2: Evaluate each candidate.
Step 3: Repeat:
        a) Select parents.
        b) Recombine pairs of parents.
        c) Mutate the resulting children.
        d) Evaluate children.
        e) Select individuals for the next generation.
        Until the terminating condition is satisfied.
——————————————————————
2) Local Search Heuristic: Local search [8] is a heuristic method for solving computationally hard optimization problems. It can be applied to any problem that can be formulated as finding a solution maximizing a criterion among a number of candidate solutions. Local search algorithms move from solution to solution in the space of candidate solutions by applying local changes, until a solution deemed optimal is found or a time bound has elapsed. Methods within local search include hill climbing, tabu search and simulated annealing.
Local search methods have a tendency to become stuck in suboptimal regions or on plateaus where many solutions are equally fit. Tabu search enhances the performance of local search by relaxing its basic rule. First, at each step worsening moves can be accepted if no improving move is available (as when the search is stuck at a strict local minimum). In addition, prohibitions (hence the term tabu) are introduced to discourage the search from returning to previously visited solutions. Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic for approximate global optimization in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities), and for problems where finding an approximate global optimum in a fixed amount of time matters more than finding a precise local optimum.
3) Greedy Algorithm: A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage, in the hope of finding a global optimum. In mathematical optimization, greedy algorithms optimally solve combinatorial problems that have matroid structure.
In general, greedy algorithms have five components:
- a candidate set, from which a solution is created;
- a selection function, which chooses the best candidate to be added to the solution;
- a feasibility function, which determines whether a candidate can contribute to a solution;
- an objective function, which assigns a value to a solution or partial solution; and
- a solution function, which indicates when a complete solution has been found.
Greedy algorithms produce good solutions for some mathematical problems. We make whatever greedy choice seems best at the moment and then solve the subproblems that arise later. The choice made by a greedy algorithm may depend on choices made so far, but not on future choices or on the solutions to all subproblems. The algorithm iteratively makes one greedy choice after another, reducing each given problem to a smaller one.
GPU ARCHITECTURAL CONSTRAINTS
Until recently, the only viable platform for parallel programming was the conventional CPU, be it single- or multi-core. Usually many CPUs were arranged either tightly as multiprocessors sharing a single memory space, or loosely as multicomputers. A CPU does not have many arithmetic logic units (ALUs), but it does have a large cache and a sophisticated control unit. As a result, the CPU is specialized for managing multiple different tasks in parallel that require lots of data, with data stored in a cache to accelerate access. Graphics processing units (GPUs) are nowadays available in nearly all personal computers, and they offer a potential reduction in computational time. GPUs have evolved into a highly parallel, multithreaded, many-core environment. The modern GPU architecture offers outstanding performance with respect to power consumption, price and space occupied.
A GPU consists of a set of streaming multiprocessors (SMs); each one features multiple processing units capable of executing thousands of operations concurrently. Inside an SM, threads are grouped into warps. A warp executes 32 threads in a SIMD (Single Instruction, Multiple Data) manner; that is, all threads within a warp execute the same operation on multiple data points. There are at least two types of memory: global and shared. Shared memory is usually limited to a few kilobytes per SM, while global memory can store a large amount of data (e.g., 6 GB).
Fig. 1. GPU Architecture
Fig. 2. GPU threads model
A thread on a GPU can be seen as an element of data to be processed. Compared with CPU threads, GPU threads are lightweight: switching context between two threads is not a costly operation. Regarding their spatial organization, threads are organized into so-called thread blocks.
A kernel is executed by multiple thread blocks of equal size. Blocks can be organized into a one-dimensional or two-dimensional grid of thread blocks, and threads inside a block are arranged in a similar way. Each thread is given a unique id that can be used to operate on different data. The advantage of this grouping is that the number of blocks processed simultaneously by the GPU is closely linked to the hardware resources. All threads belonging to the same thread block are assigned as a group to a single multiprocessor, while different thread blocks can be assigned to different multiprocessors. Hence, a major issue is controlling thread parallelism to meet the memory constraints. From a hardware point of view, graphics cards consist of streaming multiprocessors, each with processing units, registers and on-chip memory. Since multiprocessors are organized according to the SPMD model, threads share the same code and have access to different memory areas.
Fig. 3. CPU – GPU communications
The CPU is considered the host, and the GPU is used as a device coprocessor. Each GPU thus has its own memory and processing elements, separate from the host computer. Data must be transferred between the memory space of the host and the memory of the GPU during program execution.
Each processing device on the GPU supports the single program, multiple data (SPMD) model, i.e. multiple autonomous processors simultaneously execute the same program on different data. To achieve this, the concept of a kernel is defined. A kernel is a function callable from the host and executed on the specified device simultaneously by several processors in parallel. Communication between the CPU host and its device takes place through global memory.
RELATED WORK
Andreas Bortfeldt and Tobias Winter [1] proposed a genetic algorithm (GA) that addresses both the guillotine and the non-guillotine case of the 2DKP. Moreover, an orientation constraint may optionally be taken into account, and the given piece set may be constrained or unconstrained. The GA was subjected to an extensive test using well-known benchmark instances; compared with recently published methods, it yields competitive results.
Defu Zhang, Furong Ye, Yain Whar Si and Stephen C. H. Leung [2] proposed a hybrid metaheuristic that combines an improved constructive heuristic with a variable neighbourhood search. Their algorithm has three interesting features. First, a least-waste strategy is used to improve the constructive heuristics. Second, a better sorting sequence is selected to generate an initial solution. Finally, different neighbourhoods are constructed based on the concept of block patterns. Computational results on a diverse set of problem instances show that the algorithm performs better than algorithms reported in the literature for most of the problem sets compared.
In previous work [3], we proposed a hybrid algorithm for the 2DKP that integrates a genetic algorithm with local search heuristics and a greedy algorithm. That approach keeps the evolutionary machinery of genetic algorithms (mutation, crossover and selection of the fittest individuals), but enhances the crossover operator with a local search heuristic. We also use the local search heuristic for its strong climbing ability and its efficiency at finding local optima. The greedy algorithm is responsible for generating the start generation as well as carrying out post-optimization of the best solution found.
DESIGN AND METHODS
In this work we present a GPU-accelerated parallel heuristic for the 2DKP that integrates a genetic algorithm with local search heuristics and a greedy algorithm, as proposed in our previous work. In the new approach, we parallelize the previously proposed heuristic on the GPU by leveraging hundreds of GPU cores and threads, establish efficient CPU-GPU communication, and find better solutions than previous approaches.
——————————————————————
Proposed Parallel Heuristic:
——————————————————————
Step 1: Allocate 200 GPU threads.
Step 2: For each GPU thread:
        a) Run the Main 2DKP Algorithm in parallel to find the
           best packing plan.
        b) Send the thread's packing plan (sum of the values of
           the packed pieces) to the CPU.
Step 3: The CPU compares the packing plans from all 200 GPU
        threads and selects the best one, i.e. the plan that
        maximizes the covered area of the container.
——————————————————————
Below is our previously proposed algorithm for finding an optimal solution to the 2DKP.
——————————————————————
Our Main 2DKP Algorithm:
——————————————————————
Step 1: Generate a series of random initial solutions (packing
        plans, each a sequence of rectangular pieces).
Step 2: Calculate the fitness (height and width) of every
        rectangular piece of the initial solutions.
Step 3: Repeat:
        a) Pick a random solution and apply the local search
           heuristic to select parents from the initial
           population for crossover and mutation.
        b) Apply the greedy heuristic GreedyPacking() to
           generate a greedy solution from the parents.
        c) Calculate the fitness of the newly generated solution
           and add it to the initial population.
        d) Sort the combined population (initial population +
           new greedy solution) by height/width and select the
           best solutions (packing plans) for the next
           population.
        e) Set initial population = next population.
        Until the terminating condition is satisfied (a given
        number of generations).
Step 4: Call the GreedyPostOptimization() algorithm to carry out
        post-optimization of the best solution found so far.
——————————————————————
In our greedy approach, we consider the following conditions and fitness values while building the greedy heuristic.
When h1 >= h2:
    if w = wi and h1 = li, then fitness = 6
    if w = wi and h2 = li, then fitness = 5
    if w = wi and h1 < li, then fitness = 4
    if w > wi and h1 = li, then fitness = 3
    if w = wi and h1 > li, then fitness = 2
    if w > wi and h2 = li, then fitness = 1
    otherwise, fitness = 0
When h1 < h2:
    if w = wi and h2 = li, then fitness = 6
    if w = wi and h1 = li, then fitness = 5
    if w = wi and h2 < li, then fitness = 4
    if w > wi and h2 = li, then fitness = 3
    if w = wi and h2 > li, then fitness = 2
    if w > wi and h1 = li, then fitness = 1
    otherwise, fitness = 0
For the cases with fitness = 3, fitness = 1 or fitness = 0, if packing piece i into a space causes the remaining width (w - wi) to be less than the minimum width of all remaining unpacked pieces, waste will obviously occur. For these cases we introduce the least-waste strategy: select an unpacked piece j whose width is maximal among the unpacked pieces, with wj less than or equal to w and with a perimeter not less than that of piece i. This strategy compensates for the disadvantage of the scoring strategy and may reduce waste.
The proposed heuristic can be outlined as follows:
——————————————————————
Our GreedyPacking() Algorithm:
——————————————————————
Step 1: For all unpacked pieces do:
Step 2:   Select the piece i with the highest fitness according
          to the conditions and fitness values above.
Step 3:   If the fitness of piece i is 3, 1 or 0, then
              select a piece j according to the least-waste
              strategy;
              if such a piece j can be found, then
                  pack piece j instead of piece i
              else pack piece i
          else pack piece i.
Step 4: Return the greedy solution.
——————————————————————
In addition, we propose the following GreedyPostOptimization() algorithm to carry out post-optimization of the best solution found so far. This algorithm stores a list of free rectangles representing the free area left in the bin at a given packing step.
——————————————————————
Our GreedyPostOptimization() Algorithm:
——————————————————————
For each free rectangle:
    Compute the free area left in the bin.
    Select an unpacked rectangle with smaller height/width than
    the free area.
    Pack that piece into the free area.
——————————————————————
EXPERIMENTAL RESULTS
We set up a CUDA programming environment in Visual Studio to build the parallel heuristic on an NVIDIA GPU using C++ to find the optimal solution for the 2DKP. After parallelizing the proposed heuristic on the GPU, we obtain the following improved results:
GPU Threads    Covered Area
     1              19
     2              21
    10              18
    30              17
    80              20
   140              23
   200              22

Maximized covered area (best over all threads): 23
We can see that different GPU threads return different covered areas. Taking only the maximum over all threads, the covered area is improved by approximately 35.29%. We can therefore say that parallelizing the heuristic on the GPU yields a substantially better solution.
CONCLUSIONS
In this work, we have presented a GPU-accelerated parallel heuristic that integrates a greedy approach into a genetic algorithm with local search. The experimental results demonstrate the improvement achieved by the new parallel heuristic over previous work, as detailed in the previous section. Overall, the results show that the covered area can be increased by approximately 35.29%. This improvement is due to the parallelization on the GPU.
REFERENCES
[1] Andreas Bortfeldt and Tobias Winter, "A Genetic Algorithm for the Two-Dimensional Knapsack Problem with Rectangular Pieces," http://www.fernuni-hagen.de. Germany, 2008.
[2] Defu Zhang, Furong Ye, Yain Whar Si and Stephen C. H. Leung, "A hybrid algorithm based on variable neighbourhood for the strip packing problem," Journal of Combinatorial Optimization. China, 2016.
[3] Mohammad Harun Rashid, "A Greedy Local Search Metaheuristic with Genetic Algorithm for the 2D Knapsack Problem with Rectangular Pieces," Pace University. USA, 2017.
[4] Jakob Puchinger and Günther Raidl, "An Evolutionary Algorithm for Column Generation in Integer Programming: An Effective Approach for 2D Bin Packing," https://hal.archives-ouvertes.fr/hal-01299556. UK, 2016.
[5] Jukka Jylänki, "A Thousand Ways to Pack the Bin – A Practical Approach to Two-Dimensional Rectangle Bin Packing," 2010.
[6] E. Hopper and B. C. H. Turton, "A Review of the Application of Meta-Heuristic Algorithms to 2D Strip Packing Problems," Artificial Intelligence Review. UK, 2001.
[7] Andrea Lodi, "Algorithms for Two-Dimensional Bin Packing and Assignment Problems," Università degli Studi di Bologna, 1999.