Fix for #2872. Makes exporttable command volume aware. #3228
Conversation
EdColeman left a comment
Do you intend to add tests?
ddanielr left a comment
There's an issue with the filepath logic.
@AlbertWhitlock Updates look good. There's a comment that you should remove, but otherwise the logic issues seem to be fixed. Are you planning on adding tests for this export behavior?
// get second volume name
String[] baseDir2Array = baseDirArray;
baseDir2Array[2] = baseDir2Array[2] + "2";
The test fails for me since it is trying to use a directory that does not exist: dgarguilo2 is not a user and is not a directory. I suggest looking into using a @TempDir like we do with other tests so a directory is sure to be created and cleaned up for this test.
Details
2023-07-25T16:42:33,408 [init.Initialize] DEBUG: creating instance directories for base: file:/home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1
2023-07-25T16:42:33,427 [init.Initialize] INFO : Directory file:/home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1/version/11 created - call returned true
2023-07-25T16:42:33,430 [init.Initialize] INFO : Directory file:/home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1/instance_id created - call returned true
2023-07-25T16:42:33,457 [init.Initialize] INFO : Created instanceId file file:/home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1/instance_id/7f6b05aa-4141-4781-ab1b-6a7f7667fae4 in hdfs
2023-07-25T16:42:33,457 [init.Initialize] DEBUG: creating instance directories for base: file:/home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2
2023-07-25T16:42:33,462 [init.Initialize] INFO : Directory file:/home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2/version/11 created - call returned false
2023-07-25T16:42:33,463 [init.Initialize] INFO : Directory file:/home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2/instance_id created - call returned false
2023-07-25T16:42:33,465 [init.Initialize] ERROR: Problem creating new directories
java.io.IOException: Mkdirs failed to create file:/home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2/instance_id
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:561) ~[hadoop-client-api-3.3.5.jar:?]
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:549) ~[hadoop-client-api-3.3.5.jar:?]
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1210) ~[hadoop-client-api-3.3.5.jar:?]
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1165) ~[hadoop-client-api-3.3.5.jar:?]
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1499) ~[hadoop-client-api-3.3.5.jar:?]
at org.apache.accumulo.server.fs.VolumeManagerImpl.createNewFile(VolumeManagerImpl.java:177) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.createDirs(Initialize.java:291) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.doInit(Initialize.java:175) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.execute(Initialize.java:543) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.main(Initialize.java:583) ~[classes/:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.accumulo.start.Main.lambda$execMainClass$1(Main.java:118) ~[classes/:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2023-07-25T16:42:33,473 [init.Initialize] ERROR: FATAL: Problem during initialize
java.io.IOException: Problem creating directories on [file:/home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2, file:/home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1]
at org.apache.accumulo.server.init.Initialize.doInit(Initialize.java:176) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.execute(Initialize.java:543) ~[classes/:?]
at org.apache.accumulo.server.init.Initialize.main(Initialize.java:583) ~[classes/:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.apache.accumulo.start.Main.lambda$execMainClass$1(Main.java:118) ~[classes/:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
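For reference, a minimal sketch of the @TempDir approach suggested above, assuming JUnit 5 (the class and test names here are hypothetical):

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.io.File;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class TempDirSketch {
  @TempDir
  File tempDir; // JUnit creates this before each test and deletes it afterwards

  @Test
  void directoryExists() {
    // guaranteed to exist; no manual mkdir (or manual cleanup) needed
    assertTrue(tempDir.isDirectory());
  }
}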
I think the intent of the new test was to create and use a new volume. It is not possible to create a new volume in a test, however, so I think the best that can be done is to use different directories, which can be done with JUnit's @TempDir.
After thinking about this a bit more and looking at the code, I don't think what I suggested will work. In fact, I am not sure of a way to write a test for this code. It seems that, to properly test this code, two volumes need to be present. However, an additional volume cannot be created or expected to already exist on the machine for a test.
I am able to get the new test to pass only after manually creating a new directory for the test to create resources in. For example, with my username dgarguilo, the test expects to be able to create the following directories:
- /home/dgarguilo/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v1
- /home/dgarguilo2/github/accumulo/test/target/mini-tests/org.apache.accumulo.test.ExportTableCommandWithMultipleVolumesIT_testExportCommand/volumes/v2
The second of which I had to create manually. Obviously this can't be expected to happen for a test to pass.
In conclusion, after manually creating the needed directory and having the test pass, it seems like these changes are working but I don't think we can keep the test as is. I am not sure how to move forward with these changes. Should the new IT be removed and just rely on manual verification of these changes or should something else happen?
> However, an additional volume cannot be created or expected to already exist on the machine for a test.
I think this statement is incorrect. You should be able to create more than one volume for the test. A volume is simply a URI. You can certainly create two volumes (with file:///path/to/tempdir/vol1 and file:///path/to/tempdir/vol2) by creating these two temporary directories at the beginning of the test, and configuring those in MiniCluster before it starts up. What is preventing that?
However, you don't even need to do that much. The table doesn't need to be online or any of its files be read, or actually exist. You can just create an offline table, and add metadata entries that look like the table has files in separate volumes. We don't need to actually read the files to do the export and verify that the separate distcp files are created. You can just check that the files were created and contain the expected entries, based on the mock entries you inserted into the metadata of the offline table.
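A minimal sketch of the first approach, assuming the public MiniAccumuloCluster API and JUnit 5 (the class name, the password, and whether file: volume URIs are accepted in exactly this form are assumptions):

import java.io.File;
import java.util.Map;

import org.apache.accumulo.minicluster.MiniAccumuloCluster;
import org.apache.accumulo.minicluster.MiniAccumuloConfig;
import org.junit.jupiter.api.io.TempDir;

class TwoVolumeClusterSketch {
  @TempDir
  static File tempDir; // created by JUnit, cleaned up after the test class

  static MiniAccumuloCluster startTwoVolumeCluster() throws Exception {
    File vol1 = new File(tempDir, "vol1");
    File vol2 = new File(tempDir, "vol2");
    if (!vol1.mkdirs() || !vol2.mkdirs()) {
      throw new IllegalStateException("could not create volume directories");
    }
    MiniAccumuloConfig cfg = new MiniAccumuloConfig(new File(tempDir, "mac"), "rootPassword");
    // configure both local directories as instance volumes before init runs
    cfg.setSiteConfig(Map.of("instance.volumes", vol1.toURI() + "," + vol2.toURI()));
    MiniAccumuloCluster cluster = new MiniAccumuloCluster(cfg);
    cluster.start();
    return cluster;
  }
}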
When you say creating the directory manually, do you simply mean 'mkdir dir' in your shell? If that is all that is needed then we can certainly create the directory in the test case.
I understand that new directories can be created by a test. I think the difference in this case is that the distcp files that are created use the root of the path (or rather the second index) to name those files:
String keyValueString = entry.getKey().toString();
String[] keyValueArray = keyValueString.split("/");
String volumeName = keyValueArray[2];
createDistcpFile(fs, exportDir, exportMetaFilePath, volumeFileMap.get(entry.getKey()), volumeName);
So in the case of creating a temporary directory from the test, the distcp files would all be created with the same name, unless we have a way of creating directories that differ at the path level the code above uses. That is what I was trying to convey in my previous comment, not that it's impossible to create a directory from a test.
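To illustrate (a standalone sketch, not the PR's code; the example paths are made up): split("/") lands on the URI authority for hdfs:// paths, but on an ordinary path element for file:/ paths, which is why the test ends up deriving a sibling of the user's home directory:

// "hdfs://nn1:8020/accumulo/tables/1/t-0001/F0001.rf".split("/")
//   -> ["hdfs:", "", "nn1:8020", "accumulo", ...]; index 2 is the authority
// "file:/home/dgarguilo/accumulo/tables/1/t-0001/F0001.rf".split("/")
//   -> ["file:", "home", "dgarguilo", ...]; index 2 is just a path element
String[] parts = "hdfs://nn1:8020/accumulo/tables/1/t-0001/F0001.rf".split("/");
System.out.println(parts[2]); // prints nn1:8020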
I'm not sure if you're saying that the code in this PR is creating separate distcp files whose names collide (which shouldn't happen... this PR should create uniquely named files if it's creating more than one as the proposed solution), or if you're saying that all the local files look like they're part of the same volume and end up in the same file, because all files in file:/// look like a single filesystem. I think I would need specific file names and resulting file contents, as an example, to understand what you're saying.
However, I've noticed another problem with the current implementation. It seems to loop over the currently configured volumes and write out a separate file for each configured volume, if it sees files that belong to that volume. However... what about files in the metadata that are not part of any currently configured volume? The number, name, and organization of the files should be based on the volumes represented in the absolute paths of the files seen in the metadata table... not just grouped by whether they match one of the currently configured volumes. The current implementation looks like it would just ignore those files.
Without diving in to implement this myself, I'm not exactly sure what the expected behavior is for file:/// paths, but as basic functionality I would expect the following set of mock files to produce 3 separate groups, each with 2 files:
group1:
hdfs://nn1:8020/accumulo/path/to/table1/tablet1/file1.rf
hdfs://nn1:8020/accumulo/path/to/table1/tablet2/file1.rf
group2:
hdfs://nn1:8021/accumulo/path/to/table1/tablet3/file1.rf
hdfs://nn1:8021/accumulo/path/to/table1/tablet4/file1.rf
group3:
hdfs://nn3:8020/accumulo/path/to/table1/tablet5/file1.rf
hdfs://nn3:8020/accumulo/path/to/table1/tablet6/file1.rf
This should be true even if only hdfs://nn1:8020 were currently configured in instance.volumes, and the other two (nn1:8021 and nn3:8020) were previous volumes that are being decommissioned but still hold data. This should also account for volume replacements.
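A hedged sketch of that grouping (not the PR's implementation): derive the volume from each path's URI scheme and authority, so files on volumes no longer listed in instance.volumes still get their own group. For the six example paths above, this yields exactly the three groups listed, each with two files.

import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

class VolumeGroupingSketch {
  // group absolute file paths by volume = URI scheme + authority
  static Map<String,List<String>> groupByVolume(Collection<String> files) {
    Map<String,List<String>> groups = new TreeMap<>();
    for (String file : files) {
      URI uri = URI.create(file);
      // e.g. "hdfs://nn1:8020"; authority is null for file:/ style paths
      String volume = uri.getScheme() + "://" + Objects.toString(uri.getAuthority(), "");
      groups.computeIfAbsent(volume, k -> new ArrayList<>()).add(file);
    }
    return groups;
  }
}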
Christopher, I am only making those volumes as a place for the rfiles to be held for the test. I cannot just place non-existent file entries into the metadata and then run the exportTable command; the WriteExportFiles class checks that the files exist. When I create local directories to hold the rfiles, the test passes: it creates the multiple distcp files and then deletes the directories it used for the test. Your concern that the output should be based on the entries in the metadata table is correct; this test setup is just a way to get some entries into the metadata table and then run the exportTable command.
It's frustrating that the WriteExportFiles class requires the files to actually exist. That definitely limits the ways this can be tested as an integration test. But, we can still have good unit test coverage, if WriteExportFiles can be written with a package-private method that does the grouping of the input files to the designated output file, without the files existing. You might need to rewrite some of the implementation code to be more testable.
But, regardless of the test code, the implementation is still incorrect, in that it loops over instance.volumes, and looks for entries in the metadata table that matches them. What it should do is loop over the entire set of files found in the metadata table, and group them based on filesystem, regardless of what is set in instance.volumes. You can write a unit test for the grouping function that doesn't require a running system or real files to exist. You just pass a list of file names that could be seen in the metadata table, and verify the grouping function has properly grouped them by destination filename.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExportTableMakeMultipleDistcp {
I'm not sure what this class is doing. It looks like a unit test class, but it does not follow the naming convention of a unit test class. It also has a mix of JUnit 4 and JUnit 5 imports. It doesn't seem to be unit testing the implementation, but seems to have a separate implementation that it's using.
My suggestion last week was to have a unit test that tests the appropriate method in WriteExportFiles, and that, if necessary, WriteExportFiles might need to be refactored to make it easier to write a unit test against the file grouping function. This looks kind of like that, except it doesn't call any WriteExportFiles methods to test that class's implementation. It just provides a new implementation here, so it's not clear what is being tested.
For example, if WriteExportFiles had a method like:
public Map<String, Set<String>> groupDataFilesByOutputFile(Set<String> uniqueFiles);
Then, it would take a set of input files and group them by output filename. You could write a unit test with a particular set of uniqueFiles, and verify that the output mapping was correct.
Alternatively, if it had a mapper function, like:
Function<String,String> outputFileMapper;
Then, you could write a unit test that verified that for any given input file name, it properly mapped to the expected output file name (as in, for each of uniqueFiles, assertEquals(expectedOutputFile, outputFileMapper.apply(inputFile));). A grouping method could use a function like this, and you could either write a unit test against the grouping method or the mapping function, or both.
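A sketch of such a unit test, assuming the hypothetical groupDataFilesByOutputFile signature above (how it is invoked, and that groups are keyed per volume, are assumptions):

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.Map;
import java.util.Set;

import org.junit.jupiter.api.Test;

class WriteExportFilesGroupingTest {
  @Test
  void groupsUniqueFilesByVolume() {
    String a = "hdfs://nn1:8020/accumulo/path/to/table1/tablet1/file1.rf";
    String b = "hdfs://nn1:8020/accumulo/path/to/table1/tablet2/file1.rf";
    String c = "hdfs://nn1:8021/accumulo/path/to/table1/tablet3/file1.rf";
    String d = "hdfs://nn3:8020/accumulo/path/to/table1/tablet5/file1.rf";

    // hypothetical call; the real method might be an instance method instead
    Map<String,Set<String>> groups =
        WriteExportFiles.groupDataFilesByOutputFile(Set.of(a, b, c, d));

    assertEquals(3, groups.size()); // one output file per distinct volume
    // the two nn1:8020 files must land in the same output file
    assertTrue(groups.values().stream().anyMatch(g -> g.contains(a) && g.contains(b)));
  }
}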
Will make a new distcp.txt file for each volume found in a table.